Abstract
Measuring data drift is essential in machine learning applications where model scoring (evaluation) is done on data samples that differ from those used in training. The Kullback-Leibler divergence is a common measure of shift between probability distributions, for which discretized versions have been devised to deal with binned or categorical data. We present the Unstable Population Indicator, a robust, flexible and numerically stable discretized implementation of the Jeffreys divergence, along with an implementation in a Python package that can deal with continuous, discrete, ordinal and nominal data in a variety of popular data types. We show its numerical and statistical properties in controlled experiments. We advise against a common cut-off to distinguish stable from unstable populations; instead, the cut-off should depend on the use case.
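To make the idea concrete, the following is a minimal sketch of a discretized Jeffreys divergence (the symmetrized Kullback-Leibler divergence) over binned frequencies. The function name, the epsilon floor for empty bins, and the normalization details are illustrative assumptions, not the paper's exact UPI implementation.

```python
import numpy as np

def jeffreys_divergence(expected, actual, eps=1e-4):
    """Discretized Jeffreys divergence between two binned samples.

    `expected` and `actual` are per-bin counts (or frequencies) over
    the same bins. `eps` floors empty bins for numerical stability.
    This is an illustrative sketch, not the paper's exact UPI code.
    """
    p = np.asarray(expected, dtype=float)
    q = np.asarray(actual, dtype=float)
    # Normalize counts to probabilities, flooring empty bins at eps,
    # then renormalize so each vector still sums to one.
    p = np.clip(p / p.sum(), eps, None)
    q = np.clip(q / q.sum(), eps, None)
    p, q = p / p.sum(), q / q.sum()
    # Jeffreys divergence: KL(p||q) + KL(q||p) = sum (p - q) * ln(p/q).
    return float(np.sum((p - q) * np.log(p / q)))
```

Identical bin distributions yield zero, and the measure is symmetric in its two arguments, unlike the plain KL divergence.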
| Original language | English |
|---|---|
| Article number | 1 |
| Pages (from-to) | 1-12 |
| Number of pages | 12 |
| Journal | Data Science |
| Volume | 7 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - 2024 |
Keywords
- data drift
- data shift
- machine learning
- kl-divergence