Measuring data drift with the unstable population indicator

M.R. Haas*, L. Sibbald

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review


Abstract

Measuring data drift is essential in machine learning applications where model scoring (evaluation) is done on data samples that differ from those used in training. The Kullback-Leibler divergence is a common measure of shifted probability distributions, and discretized versions have been devised to handle binned or categorical data. We present the Unstable Population Indicator, a robust, flexible and numerically stable discretized implementation of Jeffreys divergence, along with an implementation in a Python package that can deal with continuous, discrete, ordinal and nominal data in a variety of popular data types. We demonstrate its numerical and statistical properties in controlled experiments. We advise against using a single common cut-off to distinguish stable from unstable populations; the cut-off should instead depend on the use case.
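
As a rough illustration of the quantity involved (not the package's actual API), a discretized Jeffreys divergence between a training histogram and a scoring histogram can be sketched in Python as below; the epsilon smoothing of empty bins is an illustrative choice for numerical stability, not necessarily how the Unstable Population Indicator handles them.

    import numpy as np

    def jeffreys_divergence(expected_counts, observed_counts, eps=1e-6):
        """Discretized Jeffreys (symmetric KL) divergence between two binned samples.

        Generic sketch only: the eps smoothing of empty bins is an assumption
        made here to keep the logarithm finite, not the UPI implementation.
        """
        p = np.asarray(expected_counts, dtype=float)
        q = np.asarray(observed_counts, dtype=float)
        # Normalise counts to probabilities, guarding empty bins with eps.
        p = (p + eps) / (p + eps).sum()
        q = (q + eps) / (q + eps).sum()
        # Symmetric KL: sum over bins of (p_i - q_i) * ln(p_i / q_i).
        return float(np.sum((p - q) * np.log(p / q)))

    # Example: compare binned training data with binned scoring data.
    train_hist = [50, 30, 15, 5]
    score_hist = [35, 30, 20, 15]
    print(jeffreys_divergence(train_hist, score_hist))
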
Original language: English
Article number: 1
Pages (from-to): 1-12
Number of pages: 12
Journal: Data Science
Volume: 7
Issue number: 1
DOIs
Publication status: Published - 2024

Keywords

  • data drift
  • data shift
  • machine learning
  • KL-divergence
