In this work, we propose a
big data scheme for extremely imbalanced problems, implemented on Apache Spark,
which aims to solve the lack-of-density problem. First, the whole training
dataset is split into chunks, and the positive examples are extracted from it.
Then, we broadcast the positive set so that every node holds a single in-memory
copy of the positive samples. For each chunk of the negative data, we aim to
obtain a balanced subset of data using a sample of the positive set. Next,
evolutionary undersampling (EUS) is applied to reduce the size of both classes
and maximize the classification performance, yielding a reduced set that is used
to learn a model. Finally, the resulting models are combined to predict the
classes of the test set.
I. Triguero, M. Galar, D. Merino, J. Maillo, H. Bustince, and F. Herrera. Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark. 2016.
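
To make the data flow of the scheme above concrete, here is a minimal single-machine sketch in plain Python. It is not the authors' implementation and it does not use Spark: list slices stand in for partitions, an ordinary list stands in for the broadcast positive set, random undersampling stands in for EUS, and a nearest-centroid classifier stands in for whatever model is learned per chunk. All function names (`split_chunks`, `build_ensemble`, `predict`, and so on) are hypothetical, as is the assumption that the models are combined by majority vote.

```python
import random
from collections import Counter


def split_chunks(rows, n_chunks):
    """Round-robin split; a stand-in for Spark partitions."""
    return [rows[i::n_chunks] for i in range(n_chunks)]


def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def train_nearest_centroid(samples):
    """Toy base learner (assumption): returns {label: centroid}."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(xs) for label, xs in by_label.items()}


def predict_one(model, x):
    """Return the label whose centroid is closest to x."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist2(model[label]))


def build_ensemble(data, n_chunks, rng):
    """Train one model per negative chunk on a balanced subset.

    data: list of (features, label) with label 1 = positive, 0 = negative.
    """
    # Extract the positives once; in Spark this set would be broadcast.
    positives = [row for row in data if row[1] == 1]
    negatives = [row for row in data if row[1] == 0]
    if not positives:
        raise ValueError("no positive examples to balance against")

    models = []
    for chunk in split_chunks(negatives, n_chunks):
        if not chunk:
            continue
        k = min(len(positives), len(chunk))
        # Placeholder for EUS: plain random undersampling of both classes.
        balanced = rng.sample(chunk, k) + rng.sample(positives, k)
        models.append(train_nearest_centroid(balanced))
    return models


def predict(models, x):
    """Combine the per-chunk models by majority vote (assumption)."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

Calling `build_ensemble(data, 4, random.Random(0))` on a list of `(features, label)` pairs returns one model per non-empty negative chunk, and `predict(models, x)` returns the majority label. In an actual Spark job the per-chunk loop would run inside the partitions, with the positive set shared through a broadcast variable rather than a local list.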