TY - GEN
T1 - Presumably Correct Undersampling
AU - Nápoles, Gonzalo
AU - Grau, Isel
N1 - Publisher Copyright:
© 2024, Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
N2 - This paper presents a data pre-processing algorithm to tackle class imbalance in classification problems by undersampling the majority class. It relies on a formalism termed Presumably Correct Decision Sets aimed at isolating easy (presumably correct) and difficult (presumably incorrect) instances in a classification problem. The former are instances with neighbors that largely share their class label, while the latter have neighbors that mostly belong to a different decision class. The proposed algorithm replaces the presumably correct instances belonging to the majority decision class with prototypes, and it operates under the assumption that removing these instances does not change the boundaries of the decision space. Note that this strategy opposes other methods that remove pairs of instances from different classes that are each other's closest neighbors. We argue that the training and test data should have similar distribution and complexity and that making the decision classes more separable in the training data would only increase the risks of overfitting. The experiments show that our method improves the generalization capabilities of a baseline classifier, while outperforming other undersampling algorithms reported in the literature.
KW - Class Imbalance
KW - Pattern Classification
KW - Presumably Correct Decision Sets
KW - Undersampling
UR - http://www.scopus.com/inward/record.url?scp=85178575739&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-49018-7_30
DO - 10.1007/978-3-031-49018-7_30
M3 - Conference contribution
AN - SCOPUS:85178575739
SN - 9783031490170
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 420
EP - 433
BT - Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications
PB - Springer Science and Business Media Deutschland GmbH
T2 - 26th Iberoamerican Congress on Pattern Recognition, CIARP 2023
Y2 - 27 November 2023 through 30 November 2023
ER -