<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Julia Community 🟣: Essam</title>
    <description>The latest articles on Julia Community 🟣 by Essam (@essamwisam).</description>
    <link>https://forem.julialang.org/essamwisam</link>
    <image>
      <url>https://forem.julialang.org/images/0dqswNoVBoi0zCsn26OPVS1gpFuNcgwFEZoFoMDJ3-E/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L3VzZXIvcHJvZmls/ZV9pbWFnZS8xNDI4/LzgwOGYxM2EyLWU2/YzYtNGQyNi1iNGE4/LTc0ZjFjZTlmNTQ1/Mi5qcGVn</url>
      <title>Julia Community 🟣: Essam</title>
      <link>https://forem.julialang.org/essamwisam</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.julialang.org/feed/essamwisam"/>
    <language>en</language>
    <item>
      <title>Categorical Encoding in Julia</title>
      <dc:creator>Essam</dc:creator>
      <pubDate>Fri, 23 Aug 2024 23:38:46 +0000</pubDate>
      <link>https://forem.julialang.org/essamwisam/categorical-encoding-in-julia-2fcb</link>
      <guid>https://forem.julialang.org/essamwisam/categorical-encoding-in-julia-2fcb</guid>
      <description>&lt;p&gt;In this post, I will introduce the Google Summer of Code project that I have been involved in with my mentor &lt;a class="mentioned-user" href="https://forem.julialang.org/ablaom"&gt;@ablaom&lt;/a&gt; for the past three months. It's a new package &lt;a href="https://github.com/JuliaAI/MLJTransforms.jl" rel="noopener noreferrer"&gt;MLJTransforms.jl&lt;/a&gt; which makes five major contributions:&lt;/p&gt;

&lt;h3&gt;
  
  
  ➊ Introducing Contrast Categorical Methods to MLJ
&lt;/h3&gt;

&lt;p&gt;These include &lt;strong&gt;Dummy Coding&lt;/strong&gt;, &lt;strong&gt;Sum Coding&lt;/strong&gt;, &lt;strong&gt;Backward/Forward Difference Coding&lt;/strong&gt; and &lt;strong&gt;Helmert Coding&lt;/strong&gt;, as well as generic &lt;strong&gt;Contrast/Hypothesis Coding&lt;/strong&gt;. Different encoding techniques can even be applied to different columns of an input table!&lt;/p&gt;

&lt;p&gt;These are provided via the &lt;code&gt;ContrastEncoder&lt;/code&gt; construct implemented in the &lt;code&gt;MLJTransforms.jl&lt;/code&gt; package. The motivation for grouping them is that each can be viewed as a special case of &lt;strong&gt;Hypothesis Coding&lt;/strong&gt;; the methods differ only in the hypothesis each one can be seen as testing.&lt;/p&gt;
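As an illustrative sketch (not the `MLJTransforms.jl` API), the common thread behind these methods is a contrast matrix: a category with `k` levels is mapped to `k - 1` numeric columns, and each named method corresponds to a particular choice of matrix. Dummy coding, for example, uses indicator columns against a base level:

```python
import numpy as np

def dummy_contrast(k):
    """Dummy-coding contrast matrix: level 1 is the base (all-zero row)."""
    m = np.zeros((k, k - 1))
    m[1:, :] = np.eye(k - 1)  # each remaining level gets its own indicator column
    return m

def encode(column, levels, contrast):
    """Replace each category with its row of the contrast matrix."""
    index = {lv: i for i, lv in enumerate(levels)}
    return np.array([contrast[index[v]] for v in column])

levels = ["A", "B", "C"]
X = encode(["A", "C", "B"], levels, dummy_contrast(3))
# "A" -> [0, 0] (base), "B" -> [1, 0], "C" -> [0, 1]
```

Swapping in a Helmert or difference matrix (same shape, different entries) yields the other schemes, which is why a single encoder construct can cover them all.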

&lt;h3&gt;
  
  
  ➋ Introducing Other Well Known Categorical Encoding Methods
&lt;/h3&gt;

&lt;p&gt;These include simpler methods such as &lt;strong&gt;Ordinal&lt;/strong&gt; and &lt;strong&gt;Frequency Encoding&lt;/strong&gt;, as well as more sophisticated ones such as &lt;strong&gt;Target Encoding&lt;/strong&gt;. Target encoding supports both binary and multiclass targets and allows regularization to avoid overfitting. It's one of the most renowned and effective methods for categorical encoding.&lt;/p&gt;
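To illustrate the idea behind target encoding (this is a generic sketch, not the `TargetEncoder` API), each category is replaced by a regularized mean of the target; additive smoothing pulls rare categories toward the global mean so they cannot be memorized:

```python
from collections import defaultdict

def target_encode(categories, targets, m=5.0):
    """Map each category to a smoothed mean of a numeric (e.g. binary) target.

    m controls shrinkage: larger m keeps rare categories near the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}

enc = target_encode(["a", "a", "b"], [1, 1, 0], m=0.0)
# with no smoothing: "a" -> 1.0, "b" -> 0.0
```

With `m > 0` the estimate for the single `"b"` example would move toward the global mean, which is the regularization effect mentioned above.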

&lt;p&gt;These are provided via the &lt;code&gt;OrdinalEncoder&lt;/code&gt;, &lt;code&gt;FrequencyEncoder&lt;/code&gt; and &lt;code&gt;TargetEncoder&lt;/code&gt; constructs implemented in the &lt;code&gt;MLJTransforms.jl&lt;/code&gt; package.&lt;/p&gt;

&lt;p&gt;Prior to this, MLJ only natively supported the &lt;code&gt;OneHotEncoder&lt;/code&gt; for categorical encoding.&lt;/p&gt;

&lt;h3&gt;
  
  
  ➌ Introducing Utility Transformers and Encoders
&lt;/h3&gt;

&lt;p&gt;Many machine learning settings can benefit from treating missing values as an extra category of a categorical variable. For this, I implemented a &lt;strong&gt;Missingness Encoder&lt;/strong&gt; that fills the missing values for the three major data types of categorical variables. It can also be usefully cascaded with other categorical encoders that cannot themselves deal with missingness.&lt;/p&gt;
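The underlying transformation is simple to sketch (illustrative only, not the `MissingnessEncoder` API): missing entries are mapped to a sentinel category, so downstream encoders never see missing values:

```python
def fill_missing(column, label="__missing__"):
    """Treat missing values (None) as their own category."""
    return [label if v is None else v for v in column]

filled = fill_missing(["red", None, "blue"])
# -> ["red", "__missing__", "blue"]
```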

&lt;p&gt;Another issue that often comes up with categorical features is that high cardinality makes it easy for a classification model to overfit; if all three examples with category "A" happen to belong to class "X", the model might as well predict "X" whenever it sees category "A". For this, a &lt;strong&gt;CardinalityReducer&lt;/strong&gt; was implemented to group categories that can be regarded as infrequent.&lt;/p&gt;
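The grouping step can be sketched as follows (illustrative only, not the `CardinalityReducer` API): categories seen fewer than a threshold number of times collapse into a single catch-all label, so a classifier cannot memorize rare levels:

```python
from collections import Counter

def reduce_cardinality(column, min_count=2, other="OTHER"):
    """Collapse categories with fewer than min_count occurrences into one label."""
    counts = Counter(column)
    return [v if counts[v] >= min_count else other for v in column]

reduced = reduce_cardinality(["a", "a", "b", "c"], min_count=2)
# -> ["a", "a", "OTHER", "OTHER"]
```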

&lt;p&gt;These are implemented in the &lt;code&gt;MissingnessEncoder&lt;/code&gt; and &lt;code&gt;CardinalityReducer&lt;/code&gt; constructs in the &lt;code&gt;MLJTransforms.jl&lt;/code&gt; package.&lt;/p&gt;

&lt;h3&gt;
  
  
  ➍ Porting Encoders and Transformers from MLJModels.jl
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;MLJModels.jl&lt;/code&gt; already housed a number of transformers as well as a nicely implemented &lt;code&gt;OneHotEncoder&lt;/code&gt;. These were ported to &lt;code&gt;MLJTransforms.jl&lt;/code&gt; so that all encoder/transformers are in the same package.&lt;/p&gt;

&lt;h3&gt;
  
  
  ➎ Introducing the EntityEmbedder
&lt;/h3&gt;

&lt;p&gt;The scope of the summer project originally required only implementing this; indeed, it proved to be more challenging than the other models I implemented in this project. Entity embedding is a newer deep learning approach to categorical encoding, introduced in 2016 by Cheng Guo and Felix Berkhahn. It employs a set of embedding layers to map each categorical feature into a dense continuous vector, much as such layers are employed in NLP architectures.&lt;/p&gt;
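At its core, an embedding layer is a trainable lookup table; the following numpy sketch shows the lookup only (in practice these are Flux layers whose rows are learned jointly with the rest of the network, and this is not the MLJFlux API):

```python
import numpy as np

rng = np.random.default_rng(0)

levels = ["red", "green", "blue"]  # the 3 levels of one categorical feature
embedding_dim = 2                  # each level is mapped into R^2
E = rng.normal(size=(len(levels), embedding_dim))  # trainable in practice

def embed(value):
    """Look up the dense vector for a categorical value."""
    return E[levels.index(value)]

vec = embed("green")  # during training, gradients would update row 1 of E
```

Because the rows are optimized against the prediction loss, related categories tend to end up with nearby vectors, which is what makes the learnt encoding reusable by other models.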

&lt;p&gt;Consequently, the &lt;code&gt;NeuralNetworkClassifier&lt;/code&gt;, &lt;code&gt;NeuralNetworkRegressor&lt;/code&gt; and the &lt;code&gt;MultitargetNeuralNetworkRegressor&lt;/code&gt; can now be trained and evaluated with heterogeneous data (i.e., data containing categorical features). Moreover, they now offer a &lt;code&gt;transform&lt;/code&gt; method which encodes the categorical features with the learnt embeddings for use by a downstream machine learning model.&lt;/p&gt;

&lt;p&gt;To see what it took to implement this, see the corresponding &lt;a href="https://github.com/FluxML/MLJFlux.jl/pull/267" rel="noopener noreferrer"&gt;PR&lt;/a&gt;, where I also describe the implementation plan I followed.&lt;/p&gt;

&lt;h3&gt;
  
  
  ➕📕 Documentation!
&lt;/h3&gt;

&lt;p&gt;Although my work extended beyond the original scope of the summer project, I considered it worthwhile to contribute further, knowing it could benefit others in the future. With this in mind, I carefully built and organized the &lt;a href="https://juliaai.github.io/MLJTransforms.jl/dev/" rel="noopener noreferrer"&gt;documentation page&lt;/a&gt;, creating a taxonomy for the methods I implemented, adding four tutorials to help users learn about categorical encoding techniques, and streamlining the process for other developers to contribute to the codebase.&lt;/p&gt;

&lt;p&gt;Do you like the work? See the Imbalance project I worked on last year &lt;a href="https://forem.julialang.org/essamwisam/class-imbalance-in-julia-3jek"&gt;here&lt;/a&gt;. Thank you!&lt;/p&gt;

&lt;h3&gt;
  
  
  🎁 Final Bonus
&lt;/h3&gt;

&lt;p&gt;In the community bonding period, I took some time to revamp and expand the documentation for &lt;a href="https://github.com/FluxML/MLJFlux.jl" rel="noopener noreferrer"&gt;MLJFlux.jl&lt;/a&gt;. This included preparing seven workflow examples that present various features of the package, as well as a new tutorial on how to use RNNs for sequence classification via the package. See that and more &lt;a href="https://github.com/FluxML/MLJFlux.jl" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>encoding</category>
      <category>gsoc</category>
      <category>jsoc</category>
    </item>
    <item>
      <title>Class Imbalance in Julia</title>
      <dc:creator>Essam</dc:creator>
      <pubDate>Sun, 08 Oct 2023 20:21:43 +0000</pubDate>
      <link>https://forem.julialang.org/essamwisam/class-imbalance-in-julia-3jek</link>
      <guid>https://forem.julialang.org/essamwisam/class-imbalance-in-julia-3jek</guid>
      <description>&lt;p&gt;In this post, I will introduce the Google Summer of Code project that I have been involved in with my mentor &lt;a class="mentioned-user" href="https://forem.julialang.org/ablaom"&gt;@ablaom&lt;/a&gt; for the past couple of months. The project is a package with methods to correct for class imbalance in Julia: &lt;a href="https://github.com/JuliaAI/Imbalance.jl"&gt;Imbalance.jl&lt;/a&gt; and a helper package &lt;a href="https://github.com/JuliaAI/MLJBalancing.jl"&gt;MLJBalancing.jl&lt;/a&gt; to make it easy to use class imbalance methods with classification models from MLJ.&lt;/p&gt;

&lt;h2&gt;
  
  
  Class Imbalance
&lt;/h2&gt;

&lt;p&gt;Class imbalance is a well-known issue in machine learning where the performance of a classification model is hindered due to an imbalance in the distribution of the target variable over the available data. For instance, a model trained on a fraud detection dataset with 99% genuine transactions and only 1% fraud transactions may perform very poorly in terms of correctly predicting fraud transactions.&lt;/p&gt;

&lt;p&gt;Depending on the underlying learning algorithm, hypothesis set and loss function, a machine learning model may be insensitive to class imbalance, or sensitive to it to some degree. In situations where class imbalance does pose a problem, which is often the case, addressing it through techniques like class weighting or data resampling can lead to significant improvements in the model's performance on unseen data.&lt;/p&gt;

&lt;p&gt;The edge that resampling may have over class weighting is that it is algorithm-independent (e.g., it does not assume an explicit loss function is being minimized) and that, in its simplest form (naive random oversampling), it can be shown (under conditions) to be equivalent to class weighting. Moreover, in more ideal cases, beyond improving the balance, it may resemble collecting more data or help the model find better separating hypersurfaces for the task.&lt;/p&gt;
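Naive random oversampling, mentioned above, is easy to sketch (illustrative only, not the `Imbalance.jl` API): minority examples are duplicated at random until every class matches the majority count, which mirrors weighting each class by its inverse frequency:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate random minority examples until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    Xr, yr = list(X), list(y)
    for cls, n in counts.items():
        pool = [x for x, lbl in zip(X, y) if lbl == cls]
        for _ in range(target - n):  # add copies until cls reaches the majority count
            Xr.append(rng.choice(pool))
            yr.append(cls)
    return Xr, yr

Xr, yr = random_oversample([[0], [1], [2], [3]], ["maj", "maj", "maj", "min"])
# class counts after resampling: maj -> 3, min -> 3
```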

&lt;h2&gt;
  
  
  Imbalance.jl
&lt;/h2&gt;

&lt;p&gt;The motivation behind this package has been to offer a pool of resampling techniques for addressing the class imbalance problem, similar to &lt;a href="https://imbalanced-learn.org/stable/index.html"&gt;imbalanced-learn&lt;/a&gt; in Python.&lt;/p&gt;

&lt;p&gt;The following are the resampling techniques that were implemented during my journey in Google Summer of Code:&lt;/p&gt;

&lt;h4&gt;
  
  
  Oversampling
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Random Oversampling&lt;/li&gt;
&lt;li&gt;Random Walk Oversampling (RWO)&lt;/li&gt;
&lt;li&gt;Random Oversampling Examples (ROSE)&lt;/li&gt;
&lt;li&gt;Synthetic Minority Oversampling Technique (SMOTE)&lt;/li&gt;
&lt;li&gt;Borderline SMOTE1&lt;/li&gt;
&lt;li&gt;SMOTE-Nominal (SMOTE-N)&lt;/li&gt;
&lt;li&gt;SMOTE-Nominal Continuous (SMOTE-NC)&lt;/li&gt;
&lt;/ul&gt;
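The SMOTE family above shares one core step, sketched here in a generic form (not the `Imbalance.jl` API): a synthetic minority point is placed at a random position on the segment between a minority example and one of its minority-class nearest neighbors:

```python
import random

def smote_point(x, neighbor, rng):
    """Interpolate a synthetic point between x and one of its neighbors."""
    t = rng.random()  # interpolation factor in [0, 1)
    return [a + t * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(42)
synthetic = smote_point([0.0, 0.0], [1.0, 2.0], rng)
# lies somewhere on the segment from (0, 0) to (1, 2)
```

The variants differ mainly in which points interpolate (e.g., only borderline ones in Borderline SMOTE1) and in how nominal features are handled (by voting rather than interpolation in SMOTE-N/NC).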

&lt;h4&gt;
  
  
  Undersampling
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Random Undersampling&lt;/li&gt;
&lt;li&gt;Cluster Undersampling&lt;/li&gt;
&lt;li&gt;EditedNearestNeighbors Undersampling&lt;/li&gt;
&lt;li&gt;Tomek Links Undersampling&lt;/li&gt;
&lt;/ul&gt;
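As one example of the undersampling methods above, a Tomek link can be checked as follows (a generic sketch, not the `Imbalance.jl` API): two points of opposite classes form a link when each is the other's nearest neighbor, and removing the majority member cleans the class boundary:

```python
def nearest(i, X):
    """Index of the nearest other point to X[i] (squared Euclidean distance)."""
    d = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return min((j for j in range(len(X)) if j != i), key=lambda j: d(X[i], X[j]))

def is_tomek_link(i, j, X, y):
    """True when X[i] and X[j] are mutual nearest neighbors of opposite classes."""
    return y[i] != y[j] and nearest(i, X) == j and nearest(j, X) == i

points = [[0.0], [0.4], [5.0]]
labels = ["maj", "min", "maj"]
link = is_tomek_link(0, 1, points, labels)  # mutual nearest neighbors, opposite classes
```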

&lt;h4&gt;
  
  
  Ensemble
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Balanced Bagging Classifier (@MLJBalancing.jl)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hybrid
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;via BalancedModel (@MLJBalancing.jl)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Package Features
&lt;/h2&gt;

&lt;p&gt;Features offered by &lt;code&gt;Imbalance.jl&lt;/code&gt; and &lt;code&gt;MLJBalancing&lt;/code&gt; are as follows:&lt;/p&gt;

&lt;h4&gt;
  
  
  Available Methods
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Methods support all four major types of resampling approaches &lt;/li&gt;
&lt;li&gt;Methods generally work on multiclass settings&lt;/li&gt;
&lt;li&gt;Methods that deal with nominal data are also available&lt;/li&gt;
&lt;li&gt;Preference was given to methods that are more popular in the literature or industry&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Interface Support
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Methods generally support both matrix and table inputs. &lt;/li&gt;
&lt;li&gt;Target may or may not be provided separately&lt;/li&gt;
&lt;li&gt;All Imbalance.jl methods support a pure functional interface (default), an MLJ model interface and a TableTransforms interface&lt;/li&gt;
&lt;li&gt;Possible to wrap an arbitrary number of resampler models with an MLJ model to behave as a unified model using &lt;code&gt;MLJBalancing&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  User Experience
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Comprehensive documentation&lt;/li&gt;
&lt;li&gt;Each method is accompanied by copy-paste-ready examples with their output shown&lt;/li&gt;
&lt;li&gt;Each method also comes with an illustrative example, showing a grid plot and an animation of the method in action, which can be accessed from the &lt;a href="https://juliaai.github.io/Imbalance.jl/dev/algorithms/oversampling_algorithms/"&gt;documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The vast majority of implemented methods are also used with real datasets and models to analyze hyperparameters or improve model performance, through a series of 9 tutorials that can be accessed from the &lt;a href="https://juliaai.github.io/Imbalance.jl/dev/examples/"&gt;documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Both illustrative and practical examples can be viewed and possibly run online on Google Colab via a link (and instructions) in the documentation&lt;/li&gt;
&lt;li&gt;All Imbalance.jl methods are intuitively explained via &lt;a href="https://essamwissam.medium.com/"&gt;Medium&lt;/a&gt; stories written by the author&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Developer Experience
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All internal functions are documented and include comments to justify or simplify written code when needed&lt;/li&gt;
&lt;li&gt;Features such as generalizing to table inputs, automatic encoding or multiclass settings are provided by generic functions that are used in all methods; redundancy is in general avoided.&lt;/li&gt;
&lt;li&gt;A developer guide exists in the documentation for new contributors&lt;/li&gt;
&lt;li&gt;Methods are broken into smaller functions to aid unit testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;Although many resampling methods are supported by the package, they still do not cover all of the most popular methods in the literature, such as K-means SMOTE or Condensed Nearest Neighbors. Check the &lt;a href="https://juliaai.github.io/Imbalance.jl/dev/contributing/"&gt;contributor's guide&lt;/a&gt; for more details. In general, the body of literature on class imbalance includes a huge number of resampling algorithms, most of which are variations of one another.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;I was relatively new to Julia when I started working on this project. Being able to undertake this project under the guidance of &lt;a class="mentioned-user" href="https://forem.julialang.org/ablaom"&gt;@ablaom&lt;/a&gt; has been an invaluable learning experience and definitely far from typical. His responsiveness on Slack, weekly meetings and code reviews played a major role towards the successful conclusion of the project. The project was of course proposed by &lt;a class="mentioned-user" href="https://forem.julialang.org/ablaom"&gt;@ablaom&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>classimbalance</category>
      <category>machinelearning</category>
      <category>gsoc</category>
      <category>jsoc</category>
    </item>
  </channel>
</rss>
