Class Imbalance in Julia

#classimbalance #machinelearning #gsoc #jsoc

In this post, I will introduce the Google Summer of Code project that I have been working on with my mentor @ablaom for the past couple of months. The project produced two packages: Imbalance.jl, which provides methods to correct for class imbalance in Julia, and MLJBalancing.jl, a helper package that makes it easy to use these methods with classification models from MLJ.

Class Imbalance

Class imbalance is a well-known issue in machine learning: the performance of a classification model is hindered because the classes of the target variable are unevenly represented in the available data. For instance, a model trained on a fraud detection dataset with 99% genuine transactions and only 1% fraudulent transactions may perform very poorly at correctly predicting fraudulent transactions.
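To make the fraud example concrete, here is a minimal sketch in plain Python (it does not use Imbalance.jl). The dataset size of 10,000 transactions and the "always predict genuine" model are assumptions made purely for illustration. It shows how a model can look excellent on accuracy while never detecting a single fraudulent transaction.

```python
# Hypothetical dataset matching the 99% / 1% split described above.
n_genuine, n_fraud = 9900, 100
y_true = [0] * n_genuine + [1] * n_fraud   # 0 = genuine, 1 = fraud

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Fraction of all transactions that are labeled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Fraction of fraudulent transactions that are actually caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / n_fraud

print(f"accuracy     = {accuracy:.2%}")
print(f"fraud recall = {fraud_recall:.2%}")
```

This is why metrics such as per-class recall are usually more informative than plain accuracy when the classes are imbalanced.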

Depending on the underlying learning algorithm, hypothesis set and loss function, a machine learning model may be insensitive to class imbalance or sensitive to it only to some degree. In situations where class imbalance does pose a problem, which may often be the case, addressing it through techniques like class weighting or data resampling can lead to significant improvements in the model's performance on unseen data.

The edge that resampling may have over class weighting is that it is algorithm-independent (e.g., it does not assume that an explicit loss function is being minimized) and that, in its simplest form (naive random oversampling), it can be shown (under certain conditions) to be equivalent to class weighting. Moreover, in more ideal cases, besides improving the balance, it may resemble collecting more data or help the model find better separating hypersurfaces for the task.
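The link between naive oversampling and class weighting can be checked numerically. The sketch below is plain Python, not Imbalance.jl, and it assumes one particular condition: the loss is a sum over points, and every minority point is duplicated the same whole number of times (here 3). The toy probabilities and labels are made up for illustration. With random oversampling that draws points with replacement, the two quantities would only agree in expectation.

```python
import math

def log_loss(p, y):
    """Negative log-likelihood of label y under predicted P(y = 1) = p."""
    return -math.log(p if y == 1 else 1.0 - p)

# Toy data: (predicted probability of class 1, true label).
# Class 1 is the minority (2 of 8 points).
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.1, 0), (0.4, 0), (0.2, 0),
        (0.6, 1), (0.3, 1)]

# Class weighting: each minority point counts 3 times in the loss.
weights = {0: 1, 1: 3}
weighted_loss = sum(weights[y] * log_loss(p, y) for p, y in data)

# Naive oversampling: each minority point appears 3 times in the data.
oversampled = [(p, y) for p, y in data for _ in range(weights[y])]
oversampled_loss = sum(log_loss(p, y) for p, y in oversampled)

# After oversampling, both classes have the same number of points.
n_majority = sum(1 for _, y in oversampled if y == 0)
n_minority = sum(1 for _, y in oversampled if y == 1)

print(math.isclose(weighted_loss, oversampled_loss))
print(n_majority, n_minority)
```

Because the two losses coincide, a model that minimizes one also minimizes the other. Unlike weighting, the oversampled dataset can be handed to any learner, including one with no explicit loss function.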

Imbalance.jl

The motivation for this package has been to offer a pool of resampling techniques that can be used to address the class imbalance problem, similar to imbalanced-learn in Python.

The following are the resampling techniques that were implemented during my journey in Google Summer of Code:

Oversampling

Random Oversampling
Random Walk Oversampling (RWO)
Random Oversampling Examples (ROSE)
Synthetic Minority Oversampling Technique (SMOTE)
Borderline SMOTE1
SMOTE-Nominal (SMOTE-N)
SMOTE-Nominal Categorical (SMOTE-NC)
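
As a rough illustration of the idea shared by the SMOTE family, the sketch below generates synthetic minority points by interpolating between a minority point and one of its nearest minority neighbors. It is a simplified plain-Python sketch, not the Imbalance.jl implementation; the function name `smote_sketch`, its parameters and the toy points are all hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x among the other minority points
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        # New point lies on the segment from x towards the chosen neighbor.
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy minority class with two continuous features.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=5)

for point in new_points:
    print(point)
```

Every synthetic point lies on a line segment between two existing minority points, which is what distinguishes this approach from random oversampling, where points are only duplicated.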

Undersampling

Random Undersampling
Cluster Undersampling
Edited Nearest Neighbors Undersampling
Tomek Links Undersampling
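
To give a flavor of how a cleaning-style undersampler works, the sketch below finds Tomek links, i.e., pairs of points from different classes that are each other's nearest neighbor, and then drops the majority-class member of each pair. This is a plain-Python sketch under assumptions, not the Imbalance.jl implementation: the helper names and toy data are hypothetical, and removing only the majority member is one common variant.

```python
import math

def nearest(i, X):
    """Index of the nearest neighbor of X[i] among all other points."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j), i < j, of mutual nearest neighbors with different labels."""
    nn = [nearest(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# Toy data: class 0 is the majority.
X = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0)]
y = [0, 0, 0, 1, 1]

links = tomek_links(X, y)

# Drop the majority-class member of every Tomek link.
majority = 0
drop = {i for pair in links for i in pair if y[i] == majority}
X_under = [x for i, x in enumerate(X) if i not in drop]
y_under = [label for i, label in enumerate(y) if i not in drop]

print(links)
print(y_under)
```

Here the only link is between the third and fourth points, which sit right next to each other but carry different labels; removing the majority one cleans up the class boundary.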

Ensemble

Balanced Bagging Classifier (@MLJBalancing.jl)

Hybrid

via BalancedModel (@MLJBalancing.jl)

Package Features

The features offered by Imbalance.jl and MLJBalancing.jl are as follows:

Available Methods

Methods support all four major types of resampling approaches
Methods generally work in multiclass settings
Methods that deal with nominal data are also available
Preference was given to methods that are more popular in the literature or industry

Interface Support

Methods generally support both matrix and table inputs
The target may or may not be provided separately
All Imbalance.jl methods support a pure functional interface (default), an MLJ model interface and a TableTransforms interface
Using MLJBalancing, an MLJ model can be wrapped together with an arbitrary number of resampler models so that they behave as a single unified model

User Experience

Comprehensive documentation
Each method is accompanied by examples (with output shown) that work after copy-pasting
Each method also comes with an illustrative example, accessible from the documentation, which shows a grid plot and an animation of the method in action
The vast majority of the implemented methods are also applied to real datasets and models to analyze hyperparameters or improve model performance. This is done through a series of 9 tutorials that can be accessed from the documentation
Both illustrative and practical examples can be viewed and possibly run online on Google Colab via a link (and instructions) in the documentation
All Imbalance.jl methods are intuitively explained via Medium stories written by the author

Developer Experience

All internal functions are documented and include comments to justify or simplify written code when needed
Features such as generalizing to table inputs, automatic encoding or multiclass settings are provided by generic functions that are used in all methods; redundancy is in general avoided
A developer guide exists in the documentation for new contributors
Methods are split into smaller functions to aid unit testing

Future Work

Although the package supports many resampling methods, it does not yet cover all of the most popular methods in the literature, such as K-means SMOTE or Condensed Nearest Neighbors. Check the contributor's guide for more details. In general, the body of literature on class imbalance includes a huge number of resampling algorithms, most of which are variations of one another.

Challenges

I was relatively new to Julia when I started working on this project. Being able to undertake this project under the guidance of @ablaom has been an invaluable learning experience and definitely far from typical. His responsiveness on Slack, weekly meetings and code reviews played a major role in the successful conclusion of the project. The project was of course proposed by @ablaom.
