Monday, November 21, 2016

Evolutionary Undersampling for Extremely Imbalanced Big Data Classification under Apache Spark


In this work, we propose a big data scheme for extremely imbalanced problems, implemented under Apache Spark, which aims to solve the lack of density problem. First, the whole training dataset is split into chunks, and the positive examples are extracted from it. Then we broadcast the positive set so that every node holds a single in-memory copy of the positive samples. For each chunk of the negative data, we obtain a balanced subset using a sample of the positive set. Next, evolutionary undersampling (EUS) is applied to reduce the size of both classes and maximize classification performance, yielding a reduced set that is used to learn a model. Finally, the different models are combined to predict the classes of the test set.
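The data flow described above can be sketched in plain Python, without Spark, to make the steps concrete. This is only an illustration under several assumptions that are not taken from the paper: the chunks are simple round-robin slices rather than Spark partitions, positives are drawn with replacement to match each negative chunk, random undersampling stands in for EUS, the per-chunk model is a toy nearest-centroid classifier, and the models are combined by majority vote. All function names are hypothetical.

```python
import random
from collections import Counter


def split_chunks(rows, n_chunks):
    # Round-robin split; a stand-in for Spark partitions (assumption).
    return [rows[i::n_chunks] for i in range(n_chunks)]


def balance_chunk(neg_chunk, positives, rng):
    # Assumption: draw positives with replacement until both classes
    # have the same size within this chunk.
    pos_sample = [rng.choice(positives) for _ in neg_chunk]
    return neg_chunk, pos_sample


def reduce_both(neg, pos, size, rng):
    # Placeholder for EUS: plain random undersampling of both classes.
    # The real method searches for the subset with an evolutionary algorithm.
    k = min(size, len(neg), len(pos))
    return rng.sample(neg, k), rng.sample(pos, k)


def fit_centroids(neg, pos):
    # Toy model (assumption): one centroid per class.
    def mean(points):
        return tuple(sum(c) / len(points) for c in zip(*points))
    return {0: mean(neg), 1: mean(pos)}


def predict_one(model, x):
    # Return the label whose centroid is closest to x.
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(model, key=lambda label: dist2(model[label], x))


def train_ensemble(data, n_chunks=4, reduced_size=5, seed=0):
    rng = random.Random(seed)
    # The positive set plays the role of the broadcast variable.
    positives = [x for x, y in data if y == 1]
    negatives = [x for x, y in data if y == 0]
    models = []
    for chunk in split_chunks(negatives, n_chunks):
        neg, pos = balance_chunk(chunk, positives, rng)
        neg, pos = reduce_both(neg, pos, reduced_size, rng)
        models.append(fit_centroids(neg, pos))
    return models


def predict(models, x):
    # Majority vote over the per-chunk models (assumed combination rule).
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

Calling `train_ensemble(data)` on a list of `(features, label)` pairs returns one model per negative chunk, and `predict(models, x)` returns the voted label. The sketch assumes at least one positive example and at least as many negatives as chunks; in a real Spark job the per-chunk work would run inside the partitions, with the positive set shared through a broadcast variable.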

I. Triguero, M. Galar, D. Merino, J. Maillo, H. Bustince, and F. Herrera. Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark. 2016.
