In this post, I introduce the Google Summer of Code project that I have been working on with my mentor @ablaom for the past couple of months. The project consists of Imbalance.jl, a Julia package with methods to correct for class imbalance, and MLJBalancing.jl, a helper package that makes it easy to use these methods with classification models from MLJ.
Class imbalance is a well-known issue in machine learning where the performance of a classification model is hindered due to an imbalance in the distribution of the target variable over the available data. For instance, a model trained on a fraud detection dataset with 99% genuine transactions and only 1% fraud transactions may perform very poorly in terms of correctly predicting fraud transactions.
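To make the fraud detection example concrete, here is a minimal Python sketch (plain standard library, not part of Imbalance.jl). The dataset size and the always-predict-genuine "model" are hypothetical and chosen only to match the 99% / 1% split above. It shows how a model can look accurate while never detecting fraud.

```python
# Hypothetical dataset matching the 99% / 1% split described above.
labels = ["genuine"] * 990 + ["fraud"] * 10

# A degenerate "model" that always predicts the majority class.
predictions = ["genuine"] * len(labels)

# Overall accuracy: fraction of predictions that match the true label.
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Recall on the fraud class: fraction of fraud transactions that were caught.
fraud_total = sum(t == "fraud" for t in labels)
fraud_caught = sum(
    p == "fraud" and t == "fraud" for p, t in zip(predictions, labels)
)
fraud_recall = fraud_caught / fraud_total

print(f"accuracy: {accuracy:.2f}")          # accuracy: 0.99
print(f"fraud recall: {fraud_recall:.2f}")  # fraud recall: 0.00
```

This is why accuracy alone tends to be a misleading metric on imbalanced data, and why per-class metrics such as recall are usually inspected as well.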
Depending on the underlying learning algorithm, hypothesis set and loss function, a machine learning model may be insensitive to class imbalance or sensitive to it only to some degree. In situations where class imbalance does pose a problem, which may often be the case, addressing it through techniques like class weighting or data resampling can lead to significant improvements in the model's performance on unseen data.
The edge that resampling may have over class weighting is that it is algorithm independent (e.g., it does not assume there is an explicit loss function being minimized) and that in its simplest form (naive random oversampling) it can be shown, under certain conditions, to be equivalent to class weighting. Moreover, in more ideal cases, besides improving the balance, it may resemble collecting more data or help the model find better separating hypersurfaces for the task.
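The link between naive random oversampling and class weighting can be illustrated with a small Python sketch. This is a conceptual illustration only, not the Imbalance.jl implementation: the function `random_oversample`, the toy labels and the per-row losses are all hypothetical. The condition assumed here is that the model is held fixed, the loss is a sum over rows, and each class is weighted by the largest class size divided by its own size. Under those assumptions, the summed loss on the oversampled data matches the class-weighted loss on average over the random draws.

```python
import random
from collections import Counter


def random_oversample(y, rng):
    """Return row indices after naive random oversampling.

    Rows of every smaller class are duplicated at random (with replacement)
    until each class has as many rows as the largest class.
    """
    counts = Counter(y)
    target = max(counts.values())
    indices = list(range(len(y)))
    for label, count in counts.items():
        pool = [i for i, t in enumerate(y) if t == label]
        indices += [rng.choice(pool) for _ in range(target - count)]
    return indices


# Toy data: 9 majority rows, 3 minority rows (hypothetical).
y = ["genuine"] * 9 + ["fraud"] * 3
# Hypothetical per-row losses of some fixed model.
losses = [0.1] * 9 + [0.8, 1.2, 1.0]

# Class weighting: weight each row by (largest class size / its class size).
counts = Counter(y)
largest = max(counts.values())
weighted_loss = sum(largest / counts[t] * l for t, l in zip(y, losses))

# Oversampling: sum the unweighted loss over the resampled rows,
# then average over many random draws.
rng = random.Random(0)
trials = 5000
mean_oversampled_loss = sum(
    sum(losses[i] for i in random_oversample(y, rng)) for _ in range(trials)
) / trials

print(f"class-weighted loss:   {weighted_loss:.3f}")
print(f"mean oversampled loss: {mean_oversampled_loss:.3f}")
```

A single oversampled dataset will generally differ somewhat from the weighted loss because the duplicated rows are chosen at random; the two agree in expectation, which is one reading of the "under certain conditions" caveat.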
The motivation for this package has been to offer a pool of resampling techniques that can be used to address the class imbalance problem, similar to imbalanced-learn in Python.
The following are the resampling techniques that were implemented during my journey in Google Summer of Code:
- Random Oversampling
- Random Walk Oversampling (RWO)
- Random Oversampling Examples (ROSE)
- Synthetic Minority Oversampling Technique (SMOTE)
- Borderline SMOTE1
- SMOTE-Nominal (SMOTE-N)
- SMOTE-Nominal Categorical (SMOTE-NC)
- Random Undersampling
- Cluster Undersampling
- EditedNearestNeighbors Undersampling
- Tomek Links Undersampling
- Balanced Bagging Classifier (@MLJBalancing.jl)
- via BalancedModel (@MLJBalancing.jl)
The features offered by Imbalance.jl and MLJBalancing.jl are as follows:
- Methods support all four major types of resampling approaches
- Methods generally work on multiclass settings
- Methods that deal with nominal data are also available
- Preference was given to methods that are more popular in the literature or industry
- Methods generally support both matrix and table inputs
- Target may or may not be provided separately
- All Imbalance.jl methods support a pure functional interface (default), an MLJ model interface and a TableTransforms interface
- Possible to wrap an arbitrary number of resampler models with an MLJ model to behave as a unified model using MLJBalancing.jl
- Comprehensive documentation
- Examples (with shown output) that work after copy-pasting accompany each method
- Each method also comes with an illustrative example which shows a grid plot and an animation of the method in action and can be accessed from the documentation
- Vast majority of implemented methods are also used with real datasets and models to analyze hyperparameters or improve model performance. This is done through a series of 9 tutorials that can be accessed from the documentation
- Both illustrative and practical examples can be viewed and possibly run online on Google Colab via a link (and instructions) in the documentation
- All Imbalance.jl methods are intuitively explained via Medium stories written by the author
- All internal functions are documented and include comments to justify or simplify written code when needed
- Features such as generalizing to table inputs, automatic encoding or multiclass settings are provided by generic functions that are used in all methods; redundancy is in general avoided
- A developer guide exists in the documentation for new contributors
- Methods are broken down into multiple smaller functions to aid unit testing
Although the package supports many resampling methods, it does not yet cover all of the most popular methods in the literature, such as K-means SMOTE or Condensed Nearest Neighbors. Check the contributor's guide for more details. In general, the body of literature on class imbalance includes a huge number of resampling algorithms, most of which are variations of one another.
I was relatively new to Julia when I started working on this project. Being able to undertake it under the guidance of @ablaom, who also proposed the project, has been an invaluable learning experience and definitely far from typical. His responsiveness on Slack, weekly meetings and code reviews played a major role in the successful conclusion of the project.