
Essam

Categorical Encoding in Julia

In this post, I will introduce the Google Summer of Code project that I have been involved in with my mentor @ablaom for the past three months. The project makes five major contributions:

➊ Introducing Contrast Categorical Methods to MLJ

These include Dummy Coding, Sum Coding, Backward/Forward Difference Coding, and Helmert Coding, as well as generic Contrast/Hypothesis Coding. It's even possible to apply different encoding techniques to different columns of an input table!

These are provided via the ContrastEncoder construct implemented in the MLJTransforms.jl package. The motivation behind grouping them is that they can all be viewed as special cases of Hypothesis Coding; the methods differ only in the hypothesis each one can be viewed as testing.
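
To give a feel for the interface, here is a minimal sketch of contrast coding via the standard MLJ machine workflow. The keyword names `features` and `mode` below, and the toy data, are my assumptions for illustration; consult the ContrastEncoder docstring in MLJTransforms.jl for the exact signature.

```julia
using MLJ, MLJTransforms
using CategoricalArrays

# A toy table with one Multiclass column and one Continuous column
X = (color  = categorical(["red", "green", "blue", "green", "red"]),
     height = [1.85, 1.67, 1.50, 1.67, 1.56])

# Hypothetical keywords: `features` selects the columns to encode and
# `mode` picks the contrast scheme (e.g., dummy, sum, Helmert, ...)
encoder = ContrastEncoder(features = [:color], mode = :helmert)

mach = machine(encoder, X)   # standard MLJ machine workflow
fit!(mach)
Xout = transform(mach, X)    # :color is replaced by Helmert-coded columns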

➋ Introducing Other Well Known Categorical Encoding Methods

These include simpler methods such as Ordinal and Frequency Encoding, as well as more sophisticated ones such as Target Encoding. Target Encoding supports both binary and multiclass targets and allows regularization to avoid overfitting; it's one of the most renowned and effective methods for categorical encoding.

These are provided via the OrdinalEncoder, FrequencyEncoder and TargetEncoder constructs implemented in the MLJTransforms.jl package.
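
As a sketch of how target encoding fits into the same workflow: unlike the unsupervised encoders, it is fit on both the features and the target. The regularization keywords below (`lambda`, `m`) are assumptions on my part; check the TargetEncoder docstring for the actual names.

```julia
using MLJ, MLJTransforms
using CategoricalArrays

X = (size = categorical(["S", "M", "L", "M", "S", "L"]),)
y = categorical(["yes", "no", "yes", "yes", "no", "no"])

# Hypothetical regularization knobs that shrink per-category estimates
# toward the global target distribution to avoid overfitting
encoder = TargetEncoder(lambda = 1.0, m = 0)

mach = machine(encoder, X, y)  # note: fit with the target as well
fit!(mach)
Xout = transform(mach, X)      # :size becomes numeric target statistics
```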

Prior to this, MLJ only natively supported the OneHotEncoder for categorical encoding.

➌ Introducing Utility Transformers and Encoders

Many machine learning settings can benefit from treating missing values as an extra category of a categorical variable. For this, I implemented a Missingness Encoder that fills in the missing values for the three major element types of categorical variables. It's also quite useful to cascade this with other categorical encoders that cannot themselves deal with missingness.

Another issue that often comes up with categorical features is that, when their cardinality is high, a classification model may easily overfit: if 100% of the three examples with category "A" belong to class "X", the model might as well predict "X" whenever it sees category "A". For this, a CardinalityReducer was implemented to group together categories that can be regarded as infrequent.

These are implemented in the MissingnessEncoder and CardinalityReducer constructs in the MLJTransforms.jl package.
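
Since these utility transformers are most useful in front of other encoders, here is a minimal sketch of cascading them in an MLJ pipeline. The `min_frequency` keyword and the toy data are assumptions for illustration; see the docstrings for the real hyperparameters.

```julia
using MLJ, MLJTransforms
using CategoricalArrays

# A categorical column with missing values and a rare level
X = (city = categorical(["Cairo", missing, "Lagos", "Cairo",
                         "Nairobi", missing, "Cairo"]),)

# Missings become their own category, then infrequent categories are
# pooled; `min_frequency` is a hypothetical keyword for the threshold
pipe = MissingnessEncoder() |> CardinalityReducer(min_frequency = 2)

mach = machine(pipe, X)
fit!(mach)
Xout = transform(mach, X)  # ready for encoders that can't handle
                           # missing or very rare levels
```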

➍ Porting Encoders and Transformers from MLJModels.jl

MLJModels.jl already housed a number of transformers, as well as a nicely implemented OneHotEncoder. These were ported to MLJTransforms.jl so that all encoders and transformers live in the same package.

➎ Introducing the EntityEmbedder

The original scope of the summer project required only this component, and indeed it proved to be more challenging than the other models I implemented. Entity embedding is a newer deep learning approach to categorical encoding, introduced in 2016 by Cheng Guo and Felix Berkhahn. It employs a set of embedding layers to map each categorical feature into a dense continuous vector, much as embedding layers are employed in NLP architectures.

Consequently, the NeuralNetworkClassifier, NeuralNetworkRegressor and MultitargetNeuralNetworkRegressor can now be trained and evaluated on heterogeneous data (i.e., data containing categorical features). Moreover, they now offer a transform that encodes the categorical features with the learnt embeddings, for use by a downstream machine learning model.
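
Here is a rough sketch of what that enables, using the usual MLJ workflow. The `embedding_dims` hyperparameter name and the toy data are assumptions on my part; see the MLJFlux.jl documentation and the PR mentioned below for the authoritative interface.

```julia
using MLJ
using CategoricalArrays

NeuralNetworkClassifier = @load NeuralNetworkClassifier pkg=MLJFlux

# Heterogeneous data: one Continuous and one Multiclass feature
X = (duration = rand(100),
     channel  = categorical(rand(["web", "app", "store"], 100)))
y = categorical(rand(["churn", "stay"], 100))

# Hypothetical keyword mapping categorical features to embedding sizes
clf = NeuralNetworkClassifier(embedding_dims = Dict(:channel => 2))

mach = machine(clf, X, y)
fit!(mach)

yhat = predict(mach, X)    # the model consumes categorical features directly
Xenc = transform(mach, X)  # or: export the learnt embeddings as numeric
                           # features for a downstream model
```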

To see what it took to implement this, see the corresponding PR, where I also describe the implementation plan I followed.

➕ There is more to come!

Although my work goes beyond what was required for the summer project, I see great value in adding to it further, because someone will hopefully benefit from it one day. With that, the following items are still on my agenda:

  • Export the EntityEmbedder from MLJTransforms.jl as well
  • Expose the MLJ method docstrings (which include model information and example usage) via a clean documentation webpage, and present some tutorials.
  • Potentially add Polynomial Encoding

Do you like the work? Also see the Imbalance project I worked on last year, here. Thank you!

🎁 Final Bonus

During the community bonding period, I took some time to expose and revamp the documentation for MLJFlux.jl. This included preparing seven workflow examples that present various features of the package, as well as a novel tutorial on using RNNs for sequence classification with the package. See that and more here.
