Class Imbalance Revisited: a New Experimental Setup to Assess the Performance of Treatment Methods
In review: under consideration for publication in the KAIS journal

Ronaldo C. Prati, Gustavo E. A. P. A. Batista and Diego F. Silva

Abstract In the last decade, class imbalance has attracted a huge amount of attention from researchers and practitioners. Class imbalance is ubiquitous in Machine Learning, Data Mining and Pattern Recognition applications; therefore, these research communities have responded to such interest with literally dozens of methods and techniques. Surprisingly, there are still many fundamental open questions, such as "Are all learning paradigms equally affected by class imbalance?", "What is the expected performance loss for different imbalance degrees?" and "How much of the performance loss can be recovered by the treatment methods?". In this paper, we propose a simple experimental design to assess the performance of class imbalance treatment methods. This experimental setup uses real data sets with artificially modified class distributions to evaluate classifiers over a wide range of class imbalance degrees. We apply this experimental design in a large-scale evaluation with twenty-two data sets and seven learning algorithms from different paradigms. We also propose a statistical procedure, based on confidence intervals, to evaluate the relative performance degradation and recovery. This procedure allows a simple yet insightful visualization of the results and provides the basis for drawing statistical conclusions. Our results indicate that the expected performance loss, as a percentage of the performance obtained with the balanced distribution, is quite modest (below 5%) for distributions with at least 10% of minority examples. However, the loss tends to increase quickly for higher degrees of class imbalance, reaching 20% when only 1% of the examples belong to the minority class. Support Vector Machine is the classification paradigm least affected by class imbalance, being almost insensitive to all but the most imbalanced distributions. Finally, we show that the treatment methods only partially recover the performance losses: on average, about 30% or less of the performance lost due to class imbalance was recovered by these methods.
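The sketch below (in R, the language of our analysis scripts) illustrates the core idea of the experimental setup: giving a real data set an artificially imbalanced class distribution by subsampling the minority class. It is an illustrative sketch only, not code used in the experiments; the data frame d, its class column and the minority label "pos" are placeholder names.

    # Sketch: subsample the minority class of data frame `d` so that it makes
    # up proportion `p` (e.g. 0.10, 0.05, 0.01) of the result, keeping the
    # majority class intact. Assumes a binary `class` column; `d`, `class`
    # and "pos" are hypothetical names for illustration.
    subsample_to_proportion <- function(d, p, minority = "pos") {
      d_maj <- d[d$class != minority, ]
      d_min <- d[d$class == minority, ]
      # minority size n_min such that n_min / (n_min + nrow(d_maj)) = p
      n_min <- round(p / (1 - p) * nrow(d_maj))
      rbind(d_maj, d_min[sample(nrow(d_min), n_min), ])
    }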

Keywords Class imbalance, experimental setup, sampling methods

Contacts ronaldo.prati@ufabc.edu.br; {gbatista, diegofsilva}@icmc.usp.br

Available files

Detailed Experimental Results

We provide detailed numerical results (not shown in the paper for brevity) in spreadsheet format (MS Excel). These results comprise classification performance, performance loss (as a percentage of the performance obtained with the balanced distribution) and performance recovery (also as a percentage of the balanced distribution); a minimal sketch of how these quantities can be computed follows the list below. All results are measured in AUC and are detailed per data set (in the paper we only show mean performance loss and recovery over all data sets). The spreadsheet files are the following:
- Classification performance in AUC and performance loss compared to the balanced distribution (mean and per data set)
- Classification performance in AUC and performance loss compared to the balanced distribution (mean and per data set) after applying Random Over-sampling
- Classification performance in AUC and performance loss compared to the balanced distribution (mean and per data set) after applying SMOTE
- Classification performance in AUC and performance loss compared to the balanced distribution (mean and per data set) after applying Borderline-SMOTE
- Classification performance in AUC and performance loss compared to the balanced distribution (mean and per data set) after applying ADASYN
- Classification performance in AUC and performance loss compared to the balanced distribution (mean and per data set) after applying MetaCost
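The following sketch shows how the loss and recovery figures in the spreadsheets can be reproduced from raw AUC values. The exact formulas are our reading of the description above (both quantities expressed as a percentage of the AUC on the balanced distribution), and the function and argument names are illustrative:

    # Performance loss of the imbalanced distribution, as a percentage of
    # the AUC obtained with the balanced distribution.
    perf_loss <- function(auc_balanced, auc_imbalanced) {
      100 * (auc_balanced - auc_imbalanced) / auc_balanced
    }

    # Performance recovered by a treatment method, also as a percentage of
    # the AUC obtained with the balanced distribution.
    perf_recovery <- function(auc_balanced, auc_imbalanced, auc_treated) {
      100 * (auc_treated - auc_imbalanced) / auc_balanced
    }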

Data

To promote reproducibility, most of the data used in the paper are publicly available. The only exceptions are the "Microcalcifications in Mammography" data set, which was made available to us by Nitesh Chawla, and the "Hoar-frost Detection" data set, which we make available here.

Source Code

The source code (in R) to calculate the confidence intervals and generate the figures can be downloaded here. The zip file includes the source code, a PDF with figures of the confidence intervals (CIs) and another zip file containing the data used by the script.
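For readers who just want the gist without downloading the script, the sketch below shows a standard t-based confidence interval for the mean performance loss (or recovery) across data sets. It is a simplified illustration, not an excerpt from the downloadable script:

    # Sketch: t-based confidence interval for the mean of a vector `x` of
    # per-data-set losses (or recoveries), at confidence level `conf`.
    ci_mean <- function(x, conf = 0.95) {
      n <- length(x)
      m <- mean(x)
      half <- qt(1 - (1 - conf) / 2, df = n - 1) * sd(x) / sqrt(n)
      c(lower = m - half, mean = m, upper = m + half)
    }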

The source code (in C/C++) of the inducers used in the paper can be downloaded here. If you are using our scripts (with the WEKA version containing the Borderline-SMOTE and ADASYN implementations) to run all the experiments, this file should be extracted to /usr/local/.

There is also some auxiliary source code in Perl. Ask any of the authors to obtain it.

Other links

KAIS

The website of the Knowledge and Information Systems (KAIS) journal

Authors' pages

Ronaldo C. Prati
Gustavo E. A. P. A. Batista
Diego F. Silva