<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Julia Community 🟣: Essam</title>
    <description>The latest articles on Julia Community 🟣 by Essam (@essamwisam).</description>
    <link>https://forem.julialang.org/essamwisam</link>
    <image>
      <url>https://forem.julialang.org/images/0dqswNoVBoi0zCsn26OPVS1gpFuNcgwFEZoFoMDJ3-E/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L3VzZXIvcHJvZmls/ZV9pbWFnZS8xNDI4/LzgwOGYxM2EyLWU2/YzYtNGQyNi1iNGE4/LTc0ZjFjZTlmNTQ1/Mi5qcGVn</url>
      <title>Julia Community 🟣: Essam</title>
      <link>https://forem.julialang.org/essamwisam</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.julialang.org/feed/essamwisam"/>
    <language>en</language>
    <item>
      <title>Categorical Encoding in Julia</title>
      <dc:creator>Essam</dc:creator>
      <pubDate>Fri, 23 Aug 2024 23:38:46 +0000</pubDate>
      <link>https://forem.julialang.org/essamwisam/categorical-encoding-in-julia-2fcb</link>
      <guid>https://forem.julialang.org/essamwisam/categorical-encoding-in-julia-2fcb</guid>
      <description>&lt;p&gt;In this post, I will introduce the Google Summer of Code project that I have been involved in with my mentor &lt;a class="mentioned-user" href="https://forem.julialang.org/ablaom"&gt;@ablaom&lt;/a&gt; for the past three months. It's a new package &lt;a href="https://github.com/JuliaAI/MLJTransforms.jl" rel="noopener noreferrer"&gt;MLJTransforms.jl&lt;/a&gt; which makes five major contributions:&lt;/p&gt;

&lt;h3&gt;
  
  
  ➊ Introducing Contrast Categorical Methods to MLJ
&lt;/h3&gt;

&lt;p&gt;These include &lt;strong&gt;Dummy Coding&lt;/strong&gt;, &lt;strong&gt;Sum Coding&lt;/strong&gt;, &lt;strong&gt;Backward/Forward Difference Coding&lt;/strong&gt; and &lt;strong&gt;Helmert Coding&lt;/strong&gt;, as well as generic &lt;strong&gt;Contrast/Hypothesis Coding&lt;/strong&gt;. Different encoding techniques can even be applied to different columns of an input table!&lt;/p&gt;

&lt;p&gt;These are provided via the &lt;code&gt;ContrastEncoder&lt;/code&gt; construct implemented in the &lt;code&gt;MLJTransforms.jl&lt;/code&gt; package. The motivation for grouping them is that each can be viewed as a special case of &lt;strong&gt;Hypothesis Coding&lt;/strong&gt;; the methods differ only in the hypothesis each one can be seen as testing.&lt;/p&gt;
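As an illustrative sketch (not the `MLJTransforms.jl` API), the common thread behind these methods is a contrast matrix: a category with `k` levels is mapped to `k - 1` numeric columns, and each named method corresponds to a particular choice of matrix. Dummy coding, for example, uses indicator columns against a base level:

```python
import numpy as np

def dummy_contrast(k):
    """Dummy-coding contrast matrix: level 1 is the base (all-zero row)."""
    m = np.zeros((k, k - 1))
    m[1:, :] = np.eye(k - 1)  # each remaining level gets its own indicator column
    return m

def encode(column, levels, contrast):
    """Replace each category with its row of the contrast matrix."""
    index = {lv: i for i, lv in enumerate(levels)}
    return np.array([contrast[index[v]] for v in column])

levels = ["A", "B", "C"]
X = encode(["A", "C", "B"], levels, dummy_contrast(3))
# "A" -> [0, 0] (base), "B" -> [1, 0], "C" -> [0, 1]
```

Swapping in a Helmert or difference matrix (same shape, different entries) yields the other schemes, which is why a single encoder construct can cover them all.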

&lt;h3&gt;
  
  
  ➋ Introducing Other Well Known Categorical Encoding Methods
&lt;/h3&gt;

&lt;p&gt;These include simpler methods such as &lt;strong&gt;Ordinal&lt;/strong&gt; and &lt;strong&gt;Frequency Encoding&lt;/strong&gt;, as well as more sophisticated ones such as &lt;strong&gt;Target Encoding&lt;/strong&gt;. Target encoding supports both binary and multiclass targets and allows regularization to avoid overfitting. It's one of the most renowned and effective methods for categorical encoding.&lt;/p&gt;
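To illustrate the idea behind target encoding (this is a generic sketch, not the `TargetEncoder` API), each category is replaced by a regularized mean of the target; additive smoothing pulls rare categories toward the global mean so they cannot be memorized:

```python
from collections import defaultdict

def target_encode(categories, targets, m=5.0):
    """Map each category to a smoothed mean of a numeric (e.g. binary) target.

    m controls shrinkage: larger m keeps rare categories near the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}

enc = target_encode(["a", "a", "b"], [1, 1, 0], m=0.0)
# with no smoothing: "a" -> 1.0, "b" -> 0.0
```

With `m > 0` the estimate for the single `"b"` example would move toward the global mean, which is the regularization effect mentioned above.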

&lt;p&gt;These are provided via the &lt;code&gt;OrdinalEncoder&lt;/code&gt;, &lt;code&gt;FrequencyEncoder&lt;/code&gt; and &lt;code&gt;TargetEncoder&lt;/code&gt; constructs implemented in the &lt;code&gt;MLJTransforms.jl&lt;/code&gt; package.&lt;/p&gt;

&lt;p&gt;Prior to this, MLJ only natively supported the &lt;code&gt;OneHotEncoder&lt;/code&gt; for categorical encoding.&lt;/p&gt;

&lt;h3&gt;
  
  
  ➌ Introducing Utility Transformers and Encoders
&lt;/h3&gt;

&lt;p&gt;Many machine learning settings can benefit from treating missing values as an extra category of a categorical variable. For this, I implemented a &lt;strong&gt;Missingness Encoder&lt;/strong&gt; that fills the missing values for the three major data types of categorical variables. It can also be usefully cascaded with other categorical encoders that cannot themselves deal with missingness.&lt;/p&gt;
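The underlying transformation is simple to sketch (illustrative only, not the `MissingnessEncoder` API): missing entries are mapped to a sentinel category, so downstream encoders never see missing values:

```python
def fill_missing(column, label="__missing__"):
    """Treat missing values (None) as their own category."""
    return [label if v is None else v for v in column]

filled = fill_missing(["red", None, "blue"])
# -> ["red", "__missing__", "blue"]
```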

&lt;p&gt;Another issue that often comes up with categorical features is that high cardinality makes it easy for a classification model to overfit; if all three examples with category "A" happen to belong to class "X", the model might as well predict "X" whenever it sees category "A". For this, a &lt;strong&gt;CardinalityReducer&lt;/strong&gt; was implemented to group categories that can be regarded as infrequent.&lt;/p&gt;
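The grouping step can be sketched as follows (illustrative only, not the `CardinalityReducer` API): categories seen fewer than a threshold number of times collapse into a single catch-all label, so a classifier cannot memorize rare levels:

```python
from collections import Counter

def reduce_cardinality(column, min_count=2, other="OTHER"):
    """Collapse categories with fewer than min_count occurrences into one label."""
    counts = Counter(column)
    return [v if counts[v] >= min_count else other for v in column]

reduced = reduce_cardinality(["a", "a", "b", "c"], min_count=2)
# -> ["a", "a", "OTHER", "OTHER"]
```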

&lt;p&gt;These are implemented in the &lt;code&gt;MissingnessEncoder&lt;/code&gt; and &lt;code&gt;CardinalityReducer&lt;/code&gt; constructs in the &lt;code&gt;MLJTransforms.jl&lt;/code&gt; package.&lt;/p&gt;

&lt;h3&gt;
  
  
  ➍ Porting Encoders and Transformers from MLJModels.jl
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;MLJModels.jl&lt;/code&gt; already housed a number of transformers as well as a nicely implemented &lt;code&gt;OneHotEncoder&lt;/code&gt;. These were ported to &lt;code&gt;MLJTransforms.jl&lt;/code&gt; so that all encoder/transformers are in the same package.&lt;/p&gt;

&lt;h3&gt;
  
  
  ➎ Introducing the EntityEmbedder
&lt;/h3&gt;

&lt;p&gt;The scope of the summer project originally required only implementing this; indeed, it proved to be more challenging than the other models I implemented in this project. Entity embedding is a newer deep learning approach to categorical encoding, introduced in 2016 by Cheng Guo and Felix Berkhahn. It employs a set of embedding layers to map each categorical feature into a dense continuous vector, much as such layers are employed in NLP architectures.&lt;/p&gt;
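At its core, an embedding layer is a trainable lookup table; the following numpy sketch shows the lookup only (in practice these are Flux layers whose rows are learned jointly with the rest of the network, and this is not the MLJFlux API):

```python
import numpy as np

rng = np.random.default_rng(0)

levels = ["red", "green", "blue"]  # the 3 levels of one categorical feature
embedding_dim = 2                  # each level is mapped into R^2
E = rng.normal(size=(len(levels), embedding_dim))  # trainable in practice

def embed(value):
    """Look up the dense vector for a categorical value."""
    return E[levels.index(value)]

vec = embed("green")  # during training, gradients would update row 1 of E
```

Because the rows are optimized against the prediction loss, related categories tend to end up with nearby vectors, which is what makes the learnt encoding reusable by other models.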

&lt;p&gt;Consequently, the &lt;code&gt;NeuralNetworkClassifier&lt;/code&gt;, &lt;code&gt;NeuralNetworkRegressor&lt;/code&gt; and the &lt;code&gt;MultitargetNeuralNetworkRegressor&lt;/code&gt; can now be trained and evaluated with heterogeneous data (i.e., data containing categorical features). Moreover, they now offer a &lt;code&gt;transform&lt;/code&gt; method which encodes the categorical features with the learnt embeddings for use by a downstream machine learning model.&lt;/p&gt;

&lt;p&gt;To see what it took to implement this, see the corresponding &lt;a href="https://github.com/FluxML/MLJFlux.jl/pull/267" rel="noopener noreferrer"&gt;PR&lt;/a&gt;, where I also describe the implementation plan I followed.&lt;/p&gt;

&lt;h3&gt;
  
  
  ➕📕 Documentation!
&lt;/h3&gt;

&lt;p&gt;Although my work extended beyond the original scope of the summer project, I considered it worthwhile to contribute further, knowing it could benefit others in the future. With this in mind, I carefully built and organized the &lt;a href="https://juliaai.github.io/MLJTransforms.jl/dev/" rel="noopener noreferrer"&gt;documentation page&lt;/a&gt;, creating a taxonomy for the methods I implemented, adding four tutorials to help users learn about categorical encoding techniques, and streamlining the process for other developers to contribute to the codebase.&lt;/p&gt;

&lt;p&gt;Do you like the work? See the Imbalance project I worked on last year &lt;a href="https://forem.julialang.org/essamwisam/class-imbalance-in-julia-3jek"&gt;here&lt;/a&gt;. Thank you!&lt;/p&gt;

&lt;h3&gt;
  
  
  🎁 Final Bonus
&lt;/h3&gt;

&lt;p&gt;In the community bonding period, I took some time to revamp and expand the documentation for &lt;a href="https://github.com/FluxML/MLJFlux.jl" rel="noopener noreferrer"&gt;MLJFlux.jl&lt;/a&gt;. This included preparing seven workflow examples that present various features of the package, as well as a new tutorial on how to use RNNs for sequence classification via the package. See that and more &lt;a href="https://github.com/FluxML/MLJFlux.jl" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>encoding</category>
      <category>gsoc</category>
      <category>jsoc</category>
    </item>
    <item>
      <title>Class Imbalance in Julia</title>
      <dc:creator>Essam</dc:creator>
      <pubDate>Sun, 08 Oct 2023 20:21:43 +0000</pubDate>
      <link>https://forem.julialang.org/essamwisam/class-imbalance-in-julia-3jek</link>
      <guid>https://forem.julialang.org/essamwisam/class-imbalance-in-julia-3jek</guid>
      <description>&lt;p&gt;In this post, I will introduce the Google Summer of Code project that I have been involved in with my mentor &lt;a class="mentioned-user" href="https://forem.julialang.org/ablaom"&gt;@ablaom&lt;/a&gt; for the past couple of months. The project is a package with methods to correct for class imbalance in Julia: &lt;a href="https://github.com/JuliaAI/Imbalance.jl"&gt;Imbalance.jl&lt;/a&gt; and a helper package &lt;a href="https://github.com/JuliaAI/MLJBalancing.jl"&gt;MLJBalancing.jl&lt;/a&gt; to make it easy to use class imbalance methods with classification models from MLJ.&lt;/p&gt;

&lt;h2&gt;
  
  
  Class Imbalance
&lt;/h2&gt;

&lt;p&gt;Class imbalance is a well-known issue in machine learning where the performance of a classification model is hindered due to an imbalance in the distribution of the target variable over the available data. For instance, a model trained on a fraud detection dataset with 99% genuine transactions and only 1% fraud transactions may perform very poorly in terms of correctly predicting fraud transactions.&lt;/p&gt;

&lt;p&gt;Depending on the underlying learning algorithm, hypothesis set and loss function, a machine learning model may be insensitive to class imbalance, or sensitive to it to some degree. In situations where class imbalance does pose a problem, which is often the case, addressing it through techniques like class weighting or data resampling can lead to significant improvements in the model's performance on unseen data.&lt;/p&gt;

&lt;p&gt;The edge that resampling may have over class weighting is that it is algorithm-independent (e.g., it does not assume an explicit loss function is being minimized) and that, in its simplest form (naive random oversampling), it can be shown (under conditions) to be equivalent to class weighting. Moreover, in more ideal cases, beyond improving the balance, it may resemble collecting more data or help the model find better separating hypersurfaces for the task.&lt;/p&gt;
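Naive random oversampling, mentioned above, is easy to sketch (illustrative only, not the `Imbalance.jl` API): minority examples are duplicated at random until every class matches the majority count, which mirrors weighting each class by its inverse frequency:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate random minority examples until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    Xr, yr = list(X), list(y)
    for cls, n in counts.items():
        pool = [x for x, lbl in zip(X, y) if lbl == cls]
        for _ in range(target - n):  # add copies until cls reaches the majority count
            Xr.append(rng.choice(pool))
            yr.append(cls)
    return Xr, yr

Xr, yr = random_oversample([[0], [1], [2], [3]], ["maj", "maj", "maj", "min"])
# class counts after resampling: maj -> 3, min -> 3
```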

&lt;h2&gt;
  
  
  Imbalance.jl
&lt;/h2&gt;

&lt;p&gt;The motivation behind this package has been to offer a pool of resampling techniques for addressing the class imbalance problem, similar to &lt;a href="https://imbalanced-learn.org/stable/index.html"&gt;imbalanced-learn&lt;/a&gt; in Python.&lt;/p&gt;

&lt;p&gt;The following are the resampling techniques that were implemented during my journey in Google Summer of Code:&lt;/p&gt;

&lt;h4&gt;
  
  
  Oversampling
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Random Oversampling&lt;/li&gt;
&lt;li&gt;Random Walk Oversampling (RWO)&lt;/li&gt;
&lt;li&gt;Random Oversampling Examples (ROSE)&lt;/li&gt;
&lt;li&gt;Synthetic Minority Oversampling Technique (SMOTE)&lt;/li&gt;
&lt;li&gt;Borderline SMOTE1&lt;/li&gt;
&lt;li&gt;SMOTE-Nominal (SMOTE-N)&lt;/li&gt;
&lt;li&gt;SMOTE-Nominal Continuous (SMOTE-NC)&lt;/li&gt;
&lt;/ul&gt;
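The SMOTE family above shares one core step, sketched here in a generic form (not the `Imbalance.jl` API): a synthetic minority point is placed at a random position on the segment between a minority example and one of its minority-class nearest neighbors:

```python
import random

def smote_point(x, neighbor, rng):
    """Interpolate a synthetic point between x and one of its neighbors."""
    t = rng.random()  # interpolation factor in [0, 1)
    return [a + t * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(42)
synthetic = smote_point([0.0, 0.0], [1.0, 2.0], rng)
# lies somewhere on the segment from (0, 0) to (1, 2)
```

The variants differ mainly in which points interpolate (e.g., only borderline ones in Borderline SMOTE1) and in how nominal features are handled (by voting rather than interpolation in SMOTE-N/NC).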

&lt;h4&gt;
  
  
  Undersampling
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Random Undersampling&lt;/li&gt;
&lt;li&gt;Cluster Undersampling&lt;/li&gt;
&lt;li&gt;EditedNearestNeighbors Undersampling&lt;/li&gt;
&lt;li&gt;Tomek Links Undersampling&lt;/li&gt;
&lt;/ul&gt;
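As one example of the undersampling methods above, a Tomek link can be checked as follows (a generic sketch, not the `Imbalance.jl` API): two points of opposite classes form a link when each is the other's nearest neighbor, and removing the majority member cleans the class boundary:

```python
def nearest(i, X):
    """Index of the nearest other point to X[i] (squared Euclidean distance)."""
    d = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return min((j for j in range(len(X)) if j != i), key=lambda j: d(X[i], X[j]))

def is_tomek_link(i, j, X, y):
    """True when X[i] and X[j] are mutual nearest neighbors of opposite classes."""
    return y[i] != y[j] and nearest(i, X) == j and nearest(j, X) == i

points = [[0.0], [0.4], [5.0]]
labels = ["maj", "min", "maj"]
link = is_tomek_link(0, 1, points, labels)  # mutual nearest neighbors, opposite classes
```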

&lt;h4&gt;
  
  
  Ensemble
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Balanced Bagging Classifier (@MLJBalancing.jl)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hybrid
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;via BalancedModel (@MLJBalancing.jl)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Package Features
&lt;/h2&gt;

&lt;p&gt;Features offered by &lt;code&gt;Imbalance.jl&lt;/code&gt; and &lt;code&gt;MLJBalancing&lt;/code&gt; are as follows:&lt;/p&gt;

&lt;h4&gt;
  
  
  Available Methods
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Methods support all four major types of resampling approaches &lt;/li&gt;
&lt;li&gt;Methods generally work on multiclass settings&lt;/li&gt;
&lt;li&gt;Methods that deal with nominal data are also available&lt;/li&gt;
&lt;li&gt;Preference was given to methods that are more popular in the literature or industry&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Interface Support
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Methods generally support both matrix and table inputs. &lt;/li&gt;
&lt;li&gt;Target may or may not be provided separately&lt;/li&gt;
&lt;li&gt;All Imbalance.jl methods support a pure functional interface (default), an MLJ model interface and a TableTransforms interface&lt;/li&gt;
&lt;li&gt;Possible to wrap an arbitrary number of resampler models with an MLJ model to behave as a unified model using &lt;code&gt;MLJBalancing&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  User Experience
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Comprehensive documentation&lt;/li&gt;
&lt;li&gt;Each method is accompanied by copy-paste-ready examples with their output shown&lt;/li&gt;
&lt;li&gt;Each method also comes with an illustrative example, showing a grid plot and an animation of the method in action, which can be accessed from the &lt;a href="https://juliaai.github.io/Imbalance.jl/dev/algorithms/oversampling_algorithms/"&gt;documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The vast majority of implemented methods are also used with real datasets and models to analyze hyperparameters or improve model performance, through a series of 9 tutorials that can be accessed from the &lt;a href="https://juliaai.github.io/Imbalance.jl/dev/examples/"&gt;documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Both illustrative and practical examples can be viewed and possibly run online on Google Colab via a link (and instructions) in the documentation&lt;/li&gt;
&lt;li&gt;All Imbalance.jl methods are intuitively explained via &lt;a href="https://essamwissam.medium.com/"&gt;Medium&lt;/a&gt; stories written by the author&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Developer Experience
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All internal functions are documented and include comments to justify or simplify written code when needed&lt;/li&gt;
&lt;li&gt;Features such as generalizing to table inputs, automatic encoding or multiclass settings are provided by generic functions that are used in all methods; redundancy is in general avoided.&lt;/li&gt;
&lt;li&gt;A developer guide exists in the documentation for new contributors&lt;/li&gt;
&lt;li&gt;Methods are broken into smaller functions to aid unit testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;Although many resampling methods are supported by the package, they still do not cover all of the most popular methods in the literature, such as K-means SMOTE or Condensed Nearest Neighbors. Check the &lt;a href="https://juliaai.github.io/Imbalance.jl/dev/contributing/"&gt;contributor's guide&lt;/a&gt; for more details. In general, the body of literature on class imbalance includes a huge number of resampling algorithms, most of which are variations of one another.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;I was relatively new to Julia when I started working on this project. Being able to undertake this project under the guidance of &lt;a class="mentioned-user" href="https://forem.julialang.org/ablaom"&gt;@ablaom&lt;/a&gt; has been an invaluable learning experience and definitely far from typical. His responsiveness on Slack, weekly meetings and code reviews played a major role towards the successful conclusion of the project. The project was of course proposed by &lt;a class="mentioned-user" href="https://forem.julialang.org/ablaom"&gt;@ablaom&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>classimbalance</category>
      <category>machinelearning</category>
      <category>gsoc</category>
      <category>jsoc</category>
    </item>
  </channel>
</rss>
