<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Julia Community 🟣: Machine Learning Julia (MLJ.jl)</title>
    <description>The latest articles on Julia Community 🟣 by Machine Learning Julia (MLJ.jl) (@mlj).</description>
    <link>https://forem.julialang.org/mlj</link>
    <image>
      <url>https://forem.julialang.org/images/OEYWw3sEizCAn-9iAMsJ-qUeN3OaV8V44KbTA5mOiQM/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L29yZ2FuaXphdGlv/bi9wcm9maWxlX2lt/YWdlLzUvNjkxM2Yx/N2EtYjdlNS00NTll/LWE3N2UtNDQ4ODVj/MTY2YjljLnBuZw</url>
      <title>Julia Community 🟣: Machine Learning Julia (MLJ.jl)</title>
      <link>https://forem.julialang.org/mlj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.julialang.org/feed/mlj"/>
    <language>en</language>
    <item>
      <title>Julia Boards the Titanic - A brief introduction to the MLJ.jl package</title>
      <dc:creator>Anthony Blaom, PhD</dc:creator>
      <pubDate>Wed, 15 Feb 2023 22:24:07 +0000</pubDate>
      <link>https://forem.julialang.org/mlj/julia-boards-the-titanic-1ne8</link>
      <guid>https://forem.julialang.org/mlj/julia-boards-the-titanic-1ne8</guid>
      <description>&lt;p&gt;This is a gentle introduction to Julia's machine learning toolbox &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/" rel="noopener noreferrer"&gt;MLJ&lt;/a&gt; focused on users new to Julia. In it we train a decision tree to predict whether a new passenger would survive a hypothetical replay of the Titanic disaster. The blog is loosely based on &lt;a href="https://github.com/ablaom/HelloJulia.jl/tree/dev/notebooks/03_machine_learning" rel="noopener noreferrer"&gt;these notebooks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; No prior experience with Julia, but you should know how to open a Julia REPL session in some terminal or console. A nodding acquaintance with &lt;a href="https://www.digitalocean.com/community/tutorials/an-introduction-to-machine-learning" rel="noopener noreferrer"&gt;supervised machine learning&lt;/a&gt; would be helpful.&lt;/p&gt;

&lt;p&gt;Experienced data scientists may want to check out the more advanced tutorial, &lt;a href="https://juliaai.github.io/DataScienceTutorials.jl/end-to-end/telco/" rel="noopener noreferrer"&gt;MLJ for Data Scientists in Two Hours&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Trees
&lt;/h2&gt;

&lt;p&gt;Generally, &lt;a href="https://en.wikipedia.org/wiki/Decision_tree" rel="noopener noreferrer"&gt;decision trees&lt;/a&gt; are not the best performing machine learning models. However, they are extremely fast to train, easy to interpret, and have flexible data requirements. They are also the building blocks of more advanced models, such as &lt;a href="https://en.wikipedia.org/wiki/Random_forest" rel="noopener noreferrer"&gt;random forests&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Gradient_boosting" rel="noopener noreferrer"&gt;gradient boosted trees&lt;/a&gt;, which are among the most successful and widely applied classes of machine learning models today. All these models are available in the MLJ toolbox and are trained in the same way as the decision tree.&lt;/p&gt;

&lt;p&gt;Here's a diagram representing what a decision tree, trained on the Titanic dataset, might look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/rAMWj5qLXC85eBy1eV6IAnWIMtAtR3xvoVeIQfg1Wus/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL29w/am9jaW50NGJta3Rl/N3JwbHc5LmpwZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/rAMWj5qLXC85eBy1eV6IAnWIMtAtR3xvoVeIQfg1Wus/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL29w/am9jaW50NGJta3Rl/N3JwbHc5LmpwZw" alt="decision tree" width="465" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, in this model, a male over the age of 9.5 is predicted to die, having a survival probability of 0.17.&lt;/p&gt;

&lt;h2&gt;
  
  
  Package installation
&lt;/h2&gt;

&lt;p&gt;We start by creating a new Julia package environment called &lt;code&gt;titanic&lt;/code&gt;, for tracking versions of the packages we will need. Do this by typing these commands at the &lt;code&gt;julia&amp;gt;&lt;/code&gt; prompt, pressing the return key at the end of each line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Pkg&lt;/span&gt;
&lt;span class="n"&gt;Pkg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"titanic"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add the packages we need to your environment, enter the &lt;code&gt;]&lt;/code&gt; character at the &lt;code&gt;julia&amp;gt;&lt;/code&gt; prompt, to change it to &lt;code&gt;(titanic) pkg&amp;gt;&lt;/code&gt;. Then enter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;MLJ&lt;/span&gt; &lt;span class="n"&gt;DataFrames&lt;/span&gt; &lt;span class="n"&gt;BetaML&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It may take a few minutes for these packages to be installed and "precompiled".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip.&lt;/strong&gt; Next time you want to use exactly the same combination of packages in a new Julia session, you can skip the &lt;code&gt;add&lt;/code&gt; command and instead just enter the two lines above it (&lt;code&gt;using Pkg&lt;/code&gt; and the &lt;code&gt;Pkg.activate&lt;/code&gt; call).&lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;(titanic) pkg&amp;gt;&lt;/code&gt; prompt returns, enter &lt;code&gt;status&lt;/code&gt; to see the package versions that were installed. Here's what each package does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/" rel="noopener noreferrer"&gt;MLJ&lt;/a&gt; (machine learning toolbox): provides a common interface for interacting with models provided by different packages, and for automating common model-generic tasks, such as &lt;a href="https://en.wikipedia.org/wiki/Hyperparameter_optimization" rel="noopener noreferrer"&gt;hyperparameter optimization&lt;/a&gt; demonstrated at the end of this blog.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dataframes.juliadata.org/stable/" rel="noopener noreferrer"&gt;DataFrames&lt;/a&gt;: Allows you to manipulate tabular data that fits into memory. &lt;strong&gt;Tip.&lt;/strong&gt; Checkout these &lt;a href="https://ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/index.html" rel="noopener noreferrer"&gt;cheatsheets&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/sylvaticus/BetaML.jl" rel="noopener noreferrer"&gt;BetaML&lt;/a&gt;: Provides the core decision algorithm we will be building for Titanic prediction.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learn more about Julia package management &lt;a href="https://docs.julialang.org/en/v1/stdlib/Pkg/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For now, return to the &lt;code&gt;julia&amp;gt;&lt;/code&gt; prompt by pressing the "delete" or "backspace" key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Establishing correct data representation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;MLJ&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrames&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="n"&gt;DF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After entering the first line above we are ready to use any function in MLJ's documentation as it appears there. After the second, we can use functions from DataFrames, but must qualify the function names with a prefix &lt;code&gt;DF.&lt;/code&gt;, as we'll see later.&lt;/p&gt;

&lt;p&gt;In MLJ, and some other statistics packages, a &lt;a href="https://juliaai.github.io/ScientificTypes.jl/dev/" rel="noopener noreferrer"&gt;"scientific type"&lt;/a&gt; or &lt;em&gt;scitype&lt;/em&gt; indicates how MLJ will &lt;em&gt;interpret&lt;/em&gt; data (as opposed to how it is represented on your machine). For example, while we have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;typeof&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;Int64&lt;/span&gt;

&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;typeof&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;Bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;scitype&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;but also&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;scitype&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tip.&lt;/strong&gt; To learn more about a Julia command, use the &lt;code&gt;?&lt;/code&gt; character. For example, try typing &lt;code&gt;?scitype&lt;/code&gt; at the &lt;code&gt;julia&amp;gt;&lt;/code&gt; prompt.&lt;/p&gt;

&lt;p&gt;In MLJ, model data requirements are articulated using scitypes, which allows you to focus on what your data represents in the real world, instead of how it is stored on your computer.&lt;/p&gt;
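
&lt;p&gt;For example, assuming MLJ's default scitype conventions, a raw string is interpreted as free text, while &lt;code&gt;coerce&lt;/code&gt; converts a vector to a categorical representation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;scitype("male")   # Textual: strings are interpreted as text, not categories
coerce(["male", "female", "male"], Multiclass)  # categorical vector, scitype AbstractVector{Multiclass{2}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;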

&lt;p&gt;Here are the most common "scalar" scitypes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/jsGnG784ey3efPt5lpB_zq_oo7Y_lSvM3rNwPTkfu6Q/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2Fk/MXpsaXF0emY2ZDY1/eHBjcWNtLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/jsGnG784ey3efPt5lpB_zq_oo7Y_lSvM3rNwPTkfu6Q/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2Fk/MXpsaXF0emY2ZDY1/eHBjcWNtLnBuZw" alt="scalar scitypes" width="598" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll grab our Titanic data set from &lt;a href="https://www.openml.org" rel="noopener noreferrer"&gt;OpenML&lt;/a&gt;, a platform for sharing machine learning datasets and workflows. The second line below converts the downloaded data into a dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OpenML&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42638&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DF&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use DataFrames to get summary statistics for the features in our dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;DF&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row&lt;/th&gt;
&lt;th&gt;variable&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;min&lt;/th&gt;
&lt;th&gt;median&lt;/th&gt;
&lt;th&gt;max&lt;/th&gt;
&lt;th&gt;nmissing&lt;/th&gt;
&lt;th&gt;eltype&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;pclass&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;sex&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;female&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;male&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;age&lt;/td&gt;
&lt;td&gt;29.7589&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;30.0&lt;/td&gt;
&lt;td&gt;80.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;sibsp&lt;/td&gt;
&lt;td&gt;0.523008&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;fare&lt;/td&gt;
&lt;td&gt;32.2042&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;14.4542&lt;/td&gt;
&lt;td&gt;512.329&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;cabin&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;E31&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;C148&lt;/td&gt;
&lt;td&gt;687&lt;/td&gt;
&lt;td&gt;Union{Missing, CategoricalValue{…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;embarked&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;S&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Union{Missing, CategoricalValue{…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;survived&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In particular, we see that &lt;code&gt;cabin&lt;/code&gt; has a lot of missing values, and we'll shortly drop it for simplicity.&lt;/p&gt;

&lt;p&gt;To get a summary of feature scitypes, we use &lt;code&gt;schema&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row&lt;/th&gt;
&lt;th&gt;names&lt;/th&gt;
&lt;th&gt;scitypes&lt;/th&gt;
&lt;th&gt;types&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;pclass&lt;/td&gt;
&lt;td&gt;Multiclass{3}&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;sex&lt;/td&gt;
&lt;td&gt;Multiclass{2}&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;age&lt;/td&gt;
&lt;td&gt;Continuous&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;sibsp&lt;/td&gt;
&lt;td&gt;Continuous&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;fare&lt;/td&gt;
&lt;td&gt;Continuous&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;cabin&lt;/td&gt;
&lt;td&gt;Union{Missing, Multiclass{186}}&lt;/td&gt;
&lt;td&gt;Union{Missing, CategoricalValue{…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;embarked&lt;/td&gt;
&lt;td&gt;Union{Missing, Multiclass{3}}&lt;/td&gt;
&lt;td&gt;Union{Missing, CategoricalValue{…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;survived&lt;/td&gt;
&lt;td&gt;Multiclass{2}&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;However, &lt;code&gt;sibsp&lt;/code&gt; represents the number of siblings/spouses aboard, which is a count rather than a continuous variable. We fix its scitype like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;coerce!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;sibsp&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call &lt;code&gt;schema(df)&lt;/code&gt; again, to check a successful change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Splitting into train and test sets
&lt;/h2&gt;

&lt;p&gt;To objectively evaluate the performance of our final model, we split off 30% of our data into a &lt;em&gt;holdout set&lt;/em&gt;, called &lt;code&gt;df_test&lt;/code&gt;, which will not be used for training:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check the number of observations in each set with &lt;code&gt;DF.nrow(df)&lt;/code&gt; and &lt;code&gt;DF.nrow(df_test)&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Splitting data into input features and target
&lt;/h2&gt;

&lt;p&gt;In supervised learning, the &lt;em&gt;target&lt;/em&gt; is the variable we want to predict, in this case &lt;code&gt;survived&lt;/code&gt;. The other features will be inputs to our predictor. The following code puts the &lt;code&gt;df&lt;/code&gt; column with name &lt;code&gt;survived&lt;/code&gt; into the vector &lt;code&gt;y&lt;/code&gt; (the target) and everything else, except &lt;code&gt;cabin&lt;/code&gt;, which we're dropping, into a new dataframe called &lt;code&gt;X&lt;/code&gt; (the input features).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unpack&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;cabin&lt;/span&gt;&lt;span class="x"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; have the expected form by doing &lt;code&gt;schema(X)&lt;/code&gt; and &lt;code&gt;scitype(y)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We'll want to do the same for the holdout test set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unpack&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;cabin&lt;/span&gt;&lt;span class="x"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Choosing a supervised model
&lt;/h2&gt;

&lt;p&gt;There are not many models that can directly handle missing values and a mixture of scitypes, as we have here. Here's how to list the ones that can:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matching&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
 &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConstantClassifier&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MLJModels&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="x"&gt;)&lt;/span&gt;
 &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BetaML&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="x"&gt;)&lt;/span&gt;
 &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeterministicConstantClassifier&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MLJModels&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="x"&gt;)&lt;/span&gt;
 &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BetaML&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shortcoming can be addressed with data preprocessing &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/model_browser/#Model-Browser" rel="noopener noreferrer"&gt;provided by MLJ&lt;/a&gt; but not covered here, such as one-hot encoding and missing value imputation. We'll settle for the indicated decision tree.&lt;/p&gt;
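
&lt;p&gt;For a taste of what such preprocessing looks like, here is a rough sketch using MLJ's pipeline syntax (not needed for this tutorial; &lt;code&gt;some_classifier&lt;/code&gt; is a placeholder for any classifier requiring complete, continuous input):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# impute missing values, then one-hot encode the categorical features:
pipe = FillImputer() |&amp;gt; ContinuousEncoder() |&amp;gt; some_classifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;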

&lt;p&gt;The code for the decision tree model is not available until we explicitly load it, but we can already inspect its documentation. Do this by entering &lt;code&gt;doc("DecisionTreeClassifier", pkg="BetaML")&lt;/code&gt;. (To browse &lt;em&gt;all&lt;/em&gt; MLJ model documentation use the &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/model_browser/#Model-Browser" rel="noopener noreferrer"&gt;Model Browser&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;An MLJ-specific method for loading the model code (and necessary packages) is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;Tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@load&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt; &lt;span class="n"&gt;pkg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BetaML&lt;/span&gt;
&lt;span class="n"&gt;tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Tree&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line loads the model &lt;em&gt;type&lt;/em&gt;, which we've called &lt;code&gt;Tree&lt;/code&gt;; the second creates an object storing default hyperparameters for a &lt;code&gt;Tree&lt;/code&gt; model. This &lt;code&gt;tree&lt;/code&gt; will be displayed thus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_gain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;max_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;splitting_criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BetaML&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gini&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_GLOBAL_RNG&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can specify different hyperparameters like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Tree&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Training the model
&lt;/h2&gt;

&lt;p&gt;We now bind the data to be used for training and the hyperparameter object &lt;code&gt;tree&lt;/code&gt; we just created in a new object called a &lt;em&gt;machine&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;mach&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We train the model on all bound data by calling &lt;code&gt;fit!&lt;/code&gt; on the machine. The exclamation mark &lt;code&gt;!&lt;/code&gt; in &lt;code&gt;fit!&lt;/code&gt; tells us that &lt;code&gt;fit!&lt;/code&gt; mutates (changes) its argument. In this case the model's learned parameters (the actual decision tree) are stored in the &lt;code&gt;mach&lt;/code&gt; object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before getting predictions for new inputs, let's start by looking at predictions for the inputs we trained on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that these are &lt;em&gt;probabilistic&lt;/em&gt; predictions. For example, we have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
           &lt;span class="n"&gt;UnivariateFinite&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Multiclass&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;}}&lt;/span&gt;
     &lt;span class="n"&gt;┌&lt;/span&gt;                                        &lt;span class="n"&gt;┐&lt;/span&gt;
   &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/span&gt; &lt;span class="mf"&gt;0.914894&lt;/span&gt;
   &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;┤■■■&lt;/span&gt; &lt;span class="mf"&gt;0.0851064&lt;/span&gt;
     &lt;span class="n"&gt;└&lt;/span&gt;                                        &lt;span class="n"&gt;┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extracting a raw probability requires an extra step. For example, to get the survival probability (&lt;code&gt;1&lt;/code&gt; corresponding to survival and &lt;code&gt;0&lt;/code&gt; to death), we do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mf"&gt;0.0851063829787234&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
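&lt;p&gt;Broadcasting &lt;code&gt;pdf&lt;/code&gt; over the prediction vector extracts the survival probability for every passenger at once (a sketch, assuming &lt;code&gt;p&lt;/code&gt; is the vector of probabilistic predictions obtained above):&lt;/p&gt;

```julia
# one survival probability per passenger:
survival_probs = pdf.(p, "1")
```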



&lt;p&gt;We can also get "point" predictions using the &lt;code&gt;mode&lt;/code&gt; function and Julia's broadcasting syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;yhat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;yhat&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="n"&gt;CategoricalArrays&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CategoricalArray&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;UInt32&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
 &lt;span class="s"&gt;"0"&lt;/span&gt;
 &lt;span class="s"&gt;"0"&lt;/span&gt;
 &lt;span class="s"&gt;"1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluating model performance
&lt;/h2&gt;

&lt;p&gt;Let's see how accurate our model is at predicting on the data it trained on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yhat&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mf"&gt;0.921474358974359&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Over 90% accuracy! Better check the accuracy on the test data that the model hasn't seen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;yhat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="x"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yhat&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mf"&gt;0.7790262172284644&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Oh dear. We are most likely &lt;a href="https://en.wikipedia.org/wiki/Overfitting" rel="noopener noreferrer"&gt;overfitting&lt;/a&gt; the model. Still, not a bad first step.&lt;/p&gt;

&lt;p&gt;The evaluation we have just performed is known as &lt;em&gt;holdout&lt;/em&gt; evaluation. MLJ provides tools for automating such evaluations, as well as more sophisticated ones, such as &lt;a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)" rel="noopener noreferrer"&gt;cross-validation&lt;/a&gt;. See &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/getting_started/#Getting-Started" rel="noopener noreferrer"&gt;this simple example&lt;/a&gt; and &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/evaluating_model_performance/" rel="noopener noreferrer"&gt;the detailed documentation&lt;/a&gt; for more information.&lt;/p&gt;
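&lt;p&gt;To give a flavor of those automated tools, here is a sketch of how the evaluation above might be carried out with MLJ's &lt;code&gt;evaluate&lt;/code&gt; function (assuming &lt;code&gt;tree&lt;/code&gt;, &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are as defined earlier in this post; consult the linked documentation for the definitive keyword options):&lt;/p&gt;

```julia
using MLJ

# automate the holdout evaluation performed manually above:
evaluate(tree, X, y,
    resampling=Holdout(fraction_train=0.7),
    measure=accuracy)

# or use 6-fold cross-validation instead:
evaluate(tree, X, y,
    resampling=CV(nfolds=6, shuffle=true),
    measure=accuracy)
```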

&lt;h2&gt;
  
  
  Tuning the model
&lt;/h2&gt;

&lt;p&gt;Changing any hyperparameter of our model will alter its performance. In particular, changing certain hyperparameters may mitigate overfitting.&lt;/p&gt;

&lt;p&gt;In MLJ we can "wrap" the model to make it automatically optimize a given hyperparameter, which it does by internally creating its own holdout set for evaluation (or using some other resampling scheme, such as cross-validation) and systematically searching over a specified range of one or more hyperparameters. Let's do that now for our decision tree.&lt;/p&gt;

&lt;p&gt;First, we define a hyperparameter range over which to search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;According to the document string for the decision tree (which we can retrieve with &lt;code&gt;?Tree&lt;/code&gt;), &lt;code&gt;0&lt;/code&gt; here means "no limit on &lt;code&gt;max_depth&lt;/code&gt;".&lt;/p&gt;

&lt;p&gt;Next, we apply MLJ's &lt;code&gt;TunedModel&lt;/code&gt; &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/tuning_models/"&gt;wrapper&lt;/a&gt; to our tree, specifying the hyperparameter range and the performance measure to serve as the basis for optimization, the resampling strategy, and the search method (a grid search in this case).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;tuned_tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TunedModel&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tuning&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Grid&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;measure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resampling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Holdout&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fraction_train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new model &lt;code&gt;tuned_tree&lt;/code&gt; behaves like the old one, except that the &lt;code&gt;max_depth&lt;/code&gt; hyperparameter effectively becomes a &lt;em&gt;learned&lt;/em&gt; parameter.&lt;/p&gt;

&lt;p&gt;Training this &lt;code&gt;tuned_tree&lt;/code&gt; actually performs two operations, under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Search for the best model using an internally constructed holdout set&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retrain the "best" model on &lt;em&gt;all&lt;/em&gt; available data&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mach2 = machine(tuned_tree, X, y)
fit!(mach2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's how we can see what the optimal model actually is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fitted_params&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach2&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_model&lt;/span&gt;
&lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_gain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;max_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;splitting_criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BetaML&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gini&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_GLOBAL_RNG&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, let's test the self-tuning model on our existing holdout set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;yhat2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yhat2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mf"&gt;0.8164794007490637&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although we cannot assign statistical significance to this outcome without a more detailed analysis, it appears to be an improvement on our original &lt;code&gt;depth=10&lt;/code&gt; model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning more
&lt;/h2&gt;

&lt;p&gt;Suggestions for learning more about Julia and MLJ are &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/learning_mlj/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mlj</category>
      <category>tutorial</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Case Study: Documenting machine learning models in a Julia ML framework</title>
      <dc:creator>Logan Kilpatrick</dc:creator>
      <pubDate>Wed, 30 Nov 2022 17:06:52 +0000</pubDate>
      <link>https://forem.julialang.org/mlj/case-study-documenting-machine-learning-models-in-a-julia-ml-framework-190a</link>
      <guid>https://forem.julialang.org/mlj/case-study-documenting-machine-learning-models-in-a-julia-ml-framework-190a</guid>
      <description>&lt;p&gt;Julia is a relatively new, general purpose programming language. MLJ (Machine Learning in Julia) is a toolbox written in Julia providing a common interface and meta-algorithms for selecting, tuning, evaluating, composing and &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/dev/list_of_supported_models/"&gt;comparing a variety of machine learning models&lt;/a&gt; implemented in Julia and other languages. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Authors:&lt;/em&gt; Anthony Blaom, Logan Kilpatrick and David Josephs&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;




&lt;p&gt;While MLJ provides detailed documentation for its model-generic functionality (e.g., hyperparameter optimization), users previously relied on third-party package providers for model-specific documentation. That documentation was physically scattered, occasionally terse, and not in any standard format. This was viewed as a barrier to adoption, especially by users new to machine learning, which is a large demographic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proposal Abstract
&lt;/h2&gt;

&lt;p&gt;Having decided on a standard for model document strings, this project’s goal was to roll out model document strings for individual models. For a suitably identified technical writer, this was to involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learning to use MLJ for data science projects&lt;/li&gt;
&lt;li&gt;Understanding the document string specification&lt;/li&gt;
&lt;li&gt;Reading and understanding third party model documentation &lt;/li&gt;
&lt;li&gt;Boosting machine learning knowledge where appropriate to inform accurate document strings&lt;/li&gt;
&lt;li&gt;Collaborating through code reviews in the writing of new document strings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Details of the proposal are &lt;a href="https://julialang.org/jsoc/gsod/2022/proposal/"&gt;on the Julia website&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Description
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Creating the proposal
&lt;/h3&gt;

&lt;p&gt;Our Google Season of Docs process always starts with an open solicitation to the community for project ideas. These are generally crowd-sourced and added to the Julia website. From there, the core Julia team evaluates each proposal based on the level of contributor interest, impact on the community, and enthusiasm of the mentor. As we have learned with Google Summer of Code over the last 10 years, the contributor experience is profoundly shaped by the mentor, so we work hard to make sure there is someone with expertise and adequate time to support each project if selected. &lt;/p&gt;

&lt;p&gt;This year, we were lucky enough to have a project that checked all three boxes. MLJ’s usage in the Julia ecosystem has expanded significantly over time, so it seemed like a worthwhile investment to support the project with documentation help, especially around something as critical as model information. &lt;/p&gt;

&lt;p&gt;Once we officially announced that the MLJ project was the one selected, we shared this widely with the community for input. Generally, unless people are close to the proposed project itself, they don’t have much to say. Nonetheless, this process is still critical for transparency in the open source community. &lt;/p&gt;

&lt;h3&gt;
  
  
  Budget
&lt;/h3&gt;

&lt;p&gt;Our budget was estimated based on previous years of supporting technical writers in similar domains and scopes of work. Estimating is always more of an art than science which is why we tend to add a buffer of time/budget to support unexpected hiccups. &lt;/p&gt;

&lt;p&gt;Initially, we intended to have two main mentors but, due to mentor availability, we only ended up with one person (Anthony), who did most of the mentoring work. We ended up spending the full amount allocated for the project, per our expectations (except ordering our wrap-up t-shirts, which is still in progress). &lt;/p&gt;

&lt;h3&gt;
  
  
  Participants
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;List the project participants.&lt;/em&gt; MLJ’s co-creator and lead developer Anthony Blaom managed the project, reviewed contributions, and provided mentorship to the technical writer David Josephs. Several third-party model package developers/authors were also involved in documentation review, including GitHub users @ExpandingMan, @sylvaticus, @davnn, @tlienart, and &lt;a class="mentioned-user" href="https://forem.julialang.org/okonsamuel"&gt;@okonsamuel&lt;/a&gt;. Logan Kilpatrick co-wrote the proposal, helped with recruitment, and took care of project administration.&lt;/p&gt;

&lt;p&gt;When we knew we would be getting funding, we immediately shared the hiring details with the community on Slack and Discourse, and posted a job listing on LinkedIn to cast the widest possible net. Prospective candidates were asked to write a little about their background and describe previous technical writing experience and open-source contributions. This information, together with published examples of their technical writing, was evaluated. Two candidates were invited for one-on-one Zoom interviews, which followed up on the written application and gave candidates an opportunity to demonstrate oral communication skills, which were deemed essential. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Did anyone drop out?&lt;/em&gt; No.&lt;/p&gt;

&lt;p&gt;Since familiarity with Julia was strongly preferred, and some data science proficiency essential, it was challenging to find a large pool of candidates. In the end we selected a candidate who was strong in data science but less experienced with Julia. That said, our writer David had just started working for a company that codes in Julia, and that worked out nicely for us. David was quickly up to speed with the Julia proficiency we needed. Our experience reaffirms the importance of scientific domain knowledge (machine learning) and good communication skills over specific technical skills, such as proficiency with a certain tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Timeline
&lt;/h3&gt;

&lt;p&gt;Our original proposal details a timeline. Our initial ambition included documentation for all models, with the exception of the sk-learn models; time was divided equally among model-providing packages. In hindsight, this was a poor allocation, as some packages provide many more models than others. Gauging progress was further complicated by the fact that some models had vastly more hyper-parameters to document.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://github.com/alan-turing-institute/MLJ.jl/issues/913"&gt;tracking issue&lt;/a&gt; nicely summarizes results of the project and its status going forward beyond Google Season of Docs 2022. Documentation additions were made in the following packages, linked to the relevant pull requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJTSVDInterface.jl/pull/14"&gt;MLJTSVDInterface.jl (truncated singular value decomposition) - Part 1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJTSVDInterface.jl/pull/15"&gt;MLJTSVDInterface.jl - Part 2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJText.jl/pull/22"&gt;MLJText.jl (text analysis) - Part1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJText.jl/pull/23"&gt;MLJText.jl - Part 2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJModels.jl/pull/472"&gt;MLJModels.jl (transformers)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJNaiveBayesInterface.jl"&gt;MLJNaiveBayesInterface.jl&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/FluxML/MLJFlux.jl/pull/207"&gt;MLJFlux.jl (neural networks) - Part 1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/FluxML/MLJFlux.jl/pull/209"&gt;MLJFlux.jl - Part 2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJGLMInterface.jl/pull/26"&gt;MLJGLMInterface.jl (generalized linear models) - Part 1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJGLMInterface.jl/pull/29"&gt;MLJGLMInterface.jl - Part 2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJGLMInterface.jl/pull/31"&gt;MLJGLMInterface.jl - Part 3&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJClusteringInterface.jl/pull/15"&gt;MLJClusteringInterface.jl&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJXGBoostInterface.jl/pull/21"&gt;MLJXGBoostInterface.jl - minus examples&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/IQVIA-ML/LightGBM.jl/pull/130"&gt;LightGBM.jl (gradient boosting machines) - very nearly complete&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/OutlierDetectionJL/OutlierDetectionNeighbors.jl/pull/3"&gt;OutlierDetectionNeighbors.jl - nearly complete&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The technical writer also made the following code additions, which synthesize multi-target supervised learning datasets, to improve some doc-string examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJBase.jl/pull/780"&gt;MLJBase.jl (multi-target data synthesis) - Part 1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJBase.jl/pull/811"&gt;MLJBase.jl - Part 2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Were there any deliverables in the proposal that did not get created?&lt;/em&gt; The following packages did not get new docstrings, but were included in the original proposal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/MLJLinearModels.jl"&gt;MLJLinearModels.jl&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/JuliaAI/NearestNeighborModels.jl"&gt;NearestNeighborModels.jl&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/PyDataBlog/ParallelKMeans.jl"&gt;ParallelKMeans.jl&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/lalvim/PartialLeastSquaresRegressor.jl/pull/30"&gt;PartialLeastSquaresRegressor.jl&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Did this project result in any new or updated processes or procedures in your organization?&lt;/em&gt; No.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What metrics did you choose to measure the success of the project? Were you able to collect those metrics? Did the metrics correlate well or poorly with the behaviors or outcomes you wanted for the project? Did your metrics change since your proposal? Did you add or remove any metrics? How often do you intend to collect metrics going forward?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Initially, progress was measured by the number of third-party packages documented, but, as described above, a better measure was the proportion of individual models documented. As the project is quite close to being finished, I don’t imagine we need to rethink our metrics for this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What went well? What was unexpected? What hurdles or setbacks did you face? Do you consider your project successful? Why or why not? (If it's too early to tell, explain when you expect to be able to judge the success of your project.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This documentation project was always going to have some tedium associated with it, and it was fantastic to have help. Our technical writer was super enthusiastic and eager to learn things beyond the project remit. This enthusiasm helped me (Anthony) a lot to boost my own engagement. All in all, the communication side of things went very well.&lt;/p&gt;

&lt;p&gt;I think having our writer David working at a Julia shop (a startup using Julia) was an unexpected benefit, as it increased the exposure of the MLJ project. We had a few volunteer contributions from a co-worker, for example. Of course, our project and David’s company shared the goal of boosting David’s Julia proficiency quickly. I believe David’s new expertise in MLJ is a definite benefit for his company, which currently builds Julia deep learning models. &lt;/p&gt;

&lt;p&gt;Another benefit of the project was that the process of documentation occasionally highlighted issues or improvements with the software, which were then addressed or tagged for later projects. Moreover, David provided valuable feedback on his own experience with the software, as a new user. &lt;/p&gt;

&lt;p&gt;As manager of the project, I did not anticipate how much time pull-request reviews would take. I’ve learned that reviewing documentation is at least as intensive as code review. In doc review there’s no set of tests to provide extra reassurance; you really need to carefully check every word.&lt;/p&gt;

&lt;p&gt;Fortunately, there were no big setbacks. I would definitely rate the project as a success: We were able to achieve most of our goals, and this is certain to smooth out the on-ramp for new MLJ users. The final analysis will come over time, as we check our engagement levels, and check user feedback. A survey has been prepared and is to be rolled out soon. &lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;In 2-4 paragraphs, summarize your project experience. Highlight what you learned, and what you would choose to do differently in the future. What advice would you give to other projects trying to solve a similar problem with documentation?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this project a Google Season of Docs Technical Writer added document strings to models provided by most of the machine learning packages interfacing with the MLJ machine learning framework. This writing was primarily supervised and reviewed by one other contributor, the framework’s lead author and co-creator. &lt;/p&gt;

&lt;p&gt;The main lesson for the MLJ team has been that creating good docstrings is a lot of work, with the review process as intensive as code review. It is easy to underestimate the resources needed for good documentation. Recruiting for short-term Julia-related development is challenging, given the language’s young age. &lt;/p&gt;

&lt;p&gt;In recruitment, it pays to value domain knowledge and good oral and written communication skills over specific skills, like proficiency in a particular language, assuming you have more than a few months of engagement. Doing so in this case led to a satisfying outcome. (By contrast, we have found a lack of Julia proficiency in GSoC projects more challenging.) &lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://forem.julialang.org/josephsdavid/my-experience-working-as-a-technical-writer-for-mlj-1hk4"&gt;blog post describes&lt;/a&gt; our technical writer’s experience working on the project. &lt;/p&gt;

&lt;h3&gt;
  
  
  Acknowledgements
&lt;/h3&gt;

&lt;p&gt;Anthony Blaom acknowledges the support of a &lt;a href="https://www.mbie.govt.nz/science-and-technology/science-and-innovation/funding-information-and-opportunities/investment-funds/strategic-science-investment-fund/ssif-funded-programmes/university-of-auckland/"&gt;New Zealand Strategic Science Investment&lt;/a&gt; awarded to the University of Auckland, which funded his work on MLJ during the project. &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>docs</category>
      <category>technicalwriting</category>
      <category>launch</category>
    </item>
    <item>
      <title>My experience working as a technical writer for MLJ</title>
      <dc:creator>David Josephs</dc:creator>
      <pubDate>Tue, 22 Nov 2022 05:32:36 +0000</pubDate>
      <link>https://forem.julialang.org/mlj/my-experience-working-as-a-technical-writer-for-mlj-1hk4</link>
      <guid>https://forem.julialang.org/mlj/my-experience-working-as-a-technical-writer-for-mlj-1hk4</guid>
      <description>&lt;p&gt;For the last six or so months, I have had the great honor and pleasure of being a technical writer for MLJ as part of Google's Season of Docs.&lt;/p&gt;

&lt;p&gt;At the start of this year, I made some big changes in my life: getting a new job at a company that aims to do a lot of good in the world, and switching from Python to Julia. Almost immediately, I fell in love with Julia and wanted to get involved in the open source community. Since I lacked confidence in my ability to actually write Julia code, I decided to sign up for Google Season of Docs for MLJ! Now that this is coming to an end, I would like to share my experiences from the last six months, and hopefully encourage other Julia learners to get involved with projects they care about (and write docstrings).&lt;/p&gt;

&lt;h2&gt;
  
  
  Documenting MLJ
&lt;/h2&gt;

&lt;p&gt;At the start of this all, MLJ didn't really have a problem with a &lt;em&gt;lack&lt;/em&gt; of docstrings; it was much more a lack of &lt;em&gt;consistent&lt;/em&gt; and &lt;em&gt;helpful&lt;/em&gt; docstrings. This problem arises because, at the highest level, MLJ essentially provides a convenient, unified frontend to other packages and algorithms (yes, there is much more to the story here, but bear with me!). This means the code is distributed throughout several locations, with different owners and different levels of required maintenance. To resolve this, MLJ rolled out the &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#The-document-string-standard"&gt;MLJ document string standard&lt;/a&gt;. For my "season" of docs, I spent my time bringing the docstrings for all the existing MLJ models up to this standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to write an MLJ docstring
&lt;/h2&gt;

&lt;p&gt;I think the most useful thing I can share from the last six months is the process I used to write MLJ docstrings without taking too long!&lt;/p&gt;

&lt;p&gt;Probably the easiest part of the MLJ docstring is the "header", which basically describes what the model is and what it does. So for example, let's say I have a classification model which uses some sort of separating hyperplane to do binary classification. At a minimum, my header would look something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;
&lt;span class="s"&gt;"""
`SomeSortOfHyperPlaneClassifier`: A classification model that uses some sort of separating hyperplane to do binary classification,
as first described in [link to some paper that describes it]. Maybe we put a few details specific to our implementation here.
"""&lt;/span&gt;
&lt;span class="n"&gt;SomeSortOfHyperPlaneClassifier&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you have your header, generally what I do next is document all the hyperparameters. To do this, typically I open up the source code and search for the name of the model, looking for a struct definition with the same name. Maybe it will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mlj_model&lt;/span&gt; &lt;span class="k"&gt;mutable struct&lt;/span&gt;&lt;span class="nc"&gt; SomeSortOfHyperPlaneClassifier&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;:&lt;/span&gt; &lt;span class="n"&gt;MMI&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Deterministic&lt;/span&gt;
    &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;Bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All the fields of the struct are the model's hyperparameters! Once you have found these, the task is to figure out what they do. This can be accomplished in a few ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Already knowing what they do&lt;/li&gt;
&lt;li&gt;Looking them up in the documentation for the package MLJ is interfacing with&lt;/li&gt;
&lt;li&gt;Reading source code! (hooray Julia for readable source code!)&lt;/li&gt;
&lt;/ol&gt;
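&lt;p&gt;Alongside these, a quick REPL trick is to instantiate the model and inspect its fields directly. A small sketch using a real model (assuming MLJ and MLJDecisionTreeInterface are in your environment; the model choice is just for illustration):&lt;/p&gt;

```julia
using MLJ

# Load a model type from the registry (the interface package
# MLJDecisionTreeInterface must already be installed):
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

model = Tree()             # construct with default hyperparameters
fieldnames(typeof(model))  # the hyperparameter names
model                      # in the REPL, displaying a model shows the defaults
```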

&lt;p&gt;Once you have these documented, it is time for the fun part! You can now open a REPL and load &lt;code&gt;MLJ&lt;/code&gt; along with the MLJ interface package you are working on. Following this contrived example, we could do something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;MLJ&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MLJSomeSortOfHyperPlaneModelsInterface&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we start to work out our example, because the rest of the documentation essentially exists to show what you need to get your model up and running!&lt;/p&gt;

&lt;p&gt;The first step, now that you have a REPL loaded, is to figure out what sort of input types the model accepts and how the data needs to look. We can figure this out in one of two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using MLJ's model metadata, which should live somewhere in the source code, looking something like:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;
 &lt;span class="n"&gt;metadata_model&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
     &lt;span class="n"&gt;SomeSortOfHyperPlaneClassifier&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Continuous&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
     &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kt"&gt;AbstractVector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;:&lt;/span&gt;&lt;span class="n"&gt;Finite&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;}},&lt;/span&gt;
     &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(PKG)&lt;/span&gt;&lt;span class="s"&gt;.SomeSortOfHyperPlaneClassifier"&lt;/span&gt;
 &lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model, for example, takes a table of continuous values as input and is trained on a two-class finite target (binary classification).&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Trial and error, thanks to MLJ's incredibly helpful error messages: if you feed the model an inappropriate &lt;a href="https://juliaai.github.io/ScientificTypes.jl/dev/"&gt;scientific type&lt;/a&gt;, they will tell you exactly what types the model received and what types it expected.&lt;/li&gt;
&lt;/ol&gt;
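&lt;p&gt;Either way, you can verify the scitypes of your own data at the REPL before binding it to a machine. A minimal sketch, assuming MLJ is installed (the column names here are made up):&lt;/p&gt;

```julia
using MLJ

# Any Tables.jl-compatible table works; a named tuple of vectors is simplest:
X = (sepal_length = [5.1, 4.9, 6.2],
     sepal_width  = [3.5, 3.0, 2.9])
y = coerce(["setosa", "setosa", "virginica"], Multiclass)

schema(X)   # per-column scitypes; every column here is Continuous
scitype(y)  # AbstractVector{Multiclass{2}}, which satisfies the Finite{2} requirement
```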

&lt;p&gt;With this information figured out, you can fill out the information in the second section of the MLJ docstring, as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Training data&lt;/span&gt;
In MLJ or MLJBase, bind an instance &lt;span class="sb"&gt;`model`&lt;/span&gt; to data with one of:&lt;span class="sb"&gt;

    mach = machine(model, X, y)
    mach = machine(model, X, y, w)

&lt;/span&gt;Here
&lt;span class="p"&gt;
-&lt;/span&gt; &lt;span class="sb"&gt;`X`&lt;/span&gt;: is any table of input features (eg, a &lt;span class="sb"&gt;`DataFrame`&lt;/span&gt;) whose columns
are of scitype &lt;span class="sb"&gt;`SCIENTIFIC INPUT TYPE HERE`&lt;/span&gt;; check the scitype with &lt;span class="sb"&gt;`schema(X)`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`y`&lt;/span&gt;: is the target, which can be any &lt;span class="sb"&gt;`AbstractVector`&lt;/span&gt; whose element
scitype is &lt;span class="sb"&gt;`SCIENTIFIC OUTPUT TYPE HERE`&lt;/span&gt;; check the scitype with &lt;span class="sb"&gt;`scitype(y)`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`w`&lt;/span&gt;: is a vector of &lt;span class="sb"&gt;`Real`&lt;/span&gt; per-observation weights

Train the machine using &lt;span class="sb"&gt;`fit!(mach, rows=...)`&lt;/span&gt;.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we pick the data for the example. Since the example should be easily understood by beginners, it is advisable to use standard datasets like &lt;code&gt;iris&lt;/code&gt;, &lt;code&gt;mnist&lt;/code&gt;, &lt;code&gt;crabs&lt;/code&gt;, or &lt;code&gt;boston_housing&lt;/code&gt;. For edge cases like multi-target regression, there is &lt;a href="https://github.com/JuliaAI/MLJBase.jl/pull/811"&gt;the &lt;code&gt;make_regression&lt;/code&gt;&lt;/a&gt; function. In some cases, for example if you are documenting a model that is heavily used in a specific domain (e.g. independent component analysis and signals, or naive Bayes and simple text classification), it is good to pick or create a second dataset showing how the model would appropriately be used in that domain.&lt;/p&gt;

&lt;p&gt;Once you have chosen the data, you can go ahead and train your model. Next, check out &lt;code&gt;fitted_params(my_trained_machine)&lt;/code&gt; and &lt;code&gt;report(my_trained_machine)&lt;/code&gt;. These bits are typically easy, and they go into the corresponding sections of your docstring.&lt;/p&gt;

&lt;p&gt;Finally, for its existence to matter, the model also needs to be able to do inference, so you need to find out what sort of predictions it makes. Does it return probability distributions when you call &lt;code&gt;predict&lt;/code&gt;? If so, does it implement &lt;code&gt;predict_mode&lt;/code&gt; or &lt;code&gt;predict_mean&lt;/code&gt; to return point predictions? Is it a decomposition model that projects data into a lower-dimensional space? If so, it probably implements a &lt;code&gt;transform&lt;/code&gt; method! Whatever methods it implements, these get documented in the &lt;code&gt;Operations&lt;/code&gt; section of your docstring.&lt;/p&gt;

&lt;p&gt;Now all you have to do is copy your code from the REPL into the example section of the docstring, add references to related models, and you are done! It is not so bad, and at the end of the day you have a docstring that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has all the information necessary for someone who is relatively new to Julia or machine learning, while being easy to read and digest&lt;/li&gt;
&lt;li&gt;Has an example that people can play around with&lt;/li&gt;
&lt;li&gt;Has a clear description of hyperparameters so it can be tuned&lt;/li&gt;
&lt;/ul&gt;
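&lt;p&gt;The whole train-inspect-predict loop above can be sketched end to end with a real model and a standard dataset (assuming MLJ and MLJDecisionTreeInterface are installed; the particular model is just for illustration):&lt;/p&gt;

```julia
using MLJ

X, y = @load_iris   # a standard, beginner-friendly dataset

Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
mach = machine(Tree(), X, y)
fit!(mach, verbosity=0)

fitted_params(mach)      # learned parameters, for the "Fitted parameters" section
report(mach)             # training by-products, for the "Report" section

yhat = predict(mach, X)  # probabilistic predictions (a vector of distributions)
predict_mode(mach, X)    # point predictions; both belong in "Operations"
```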

&lt;h2&gt;
  
  
  Staying organized
&lt;/h2&gt;

&lt;p&gt;There are a few things the above section didn't cover that may not be relevant to absolutely everyone's workflow, but were definitely helpful to me. The biggest one is staying organized! Some interfaces implement a &lt;strong&gt;LOT&lt;/strong&gt; of models &lt;a href="https://github.com/JuliaAI/MLJMultivariateStatsInterface.jl/pull/39/files"&gt;(looking at you, MLJMultivariateStatsInterface)&lt;/a&gt;. &lt;em&gt;Writing&lt;/em&gt; all the docstrings in one file that scrolls on for miles is extremely difficult: if you accidentally scroll, it may take you some time to figure out which model you are looking at, and you may put some bits in the wrong places. Also, while you certainly can get spellcheck and syntax highlighting in Julia docstrings with &lt;a href="https://tree-sitter.github.io/tree-sitter/"&gt;treesitter&lt;/a&gt; (&lt;a href="https://github.com/josephsdavid/neovim2/blob/ff32d93e7f5b31a07c18d3a16d122f7000654f12/queries/julia/injections.scm%23L1"&gt;using language injections!&lt;/a&gt;), it certainly isn't the easiest way and does not guarantee you have well-formatted docstrings. Instead, it is best to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a checklist of all the docstrings you want to write&lt;/li&gt;
&lt;li&gt;Write them all in separate markdown files&lt;/li&gt;
&lt;li&gt;Paste them into the source code as docstrings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Otherwise, if you are even slightly like me, you will likely get lost in a wall of docstrings with a lot of very similar-looking words (thanks to standardization!).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Writing these docstrings is not too hard, and it is a great way both to learn more about the thing you are documenting and to get comfortable reading and writing Julia code! If you are relatively new to Julia, I cannot recommend enough looking at packages you either use or want to contribute to and checking out their docstrings. Odds are, because maintaining code is hard and maintaining code and docstrings is harder, the owners of the code would be happy to have an extra brain thinking about their documentation, and will be nice to you when you make a PR (I know it is a little scary when you first start making changes to someone else's code). If you are a package maintainer and don't have a ton of time to keep your docstrings up to date, or they aren't as thorough as you want them to be, open an issue on GitHub so people know you are open to receiving documentation help!&lt;/p&gt;

&lt;h2&gt;
  
  
  Running list of PRs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/IQVIA-ML/LightGBM.jl/pull/130"&gt;lightgbm (wip)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/OutlierDetectionJL/OutlierDetectionNeighbors.jl/pull/3"&gt;outlier detection neighbors (wip)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJTSVDInterface.jl/pull/14"&gt;MLJTSVDInterface&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJTSVDInterface.jl/pull/15"&gt;MLJTSVDInterface part 2&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJText.jl/pull/22"&gt;MLJText&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJText.jl/pull/23"&gt;MLJText part 2&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJModels.jl/pull/472"&gt;MLJModels&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJNaiveBayesInterface.jl"&gt;MLJNaiveBayesInterface&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/FluxML/MLJFlux.jl/pull/207"&gt;MLJFlux&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/FluxML/MLJFlux.jl/pull/209"&gt;MLJFlux part 2&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJGLMInterface.jl/pull/26"&gt;MLJGLMInterface&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJGLMInterface.jl/pull/29"&gt;MLJGLMInterface part 2&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJGLMInterface.jl/pull/31"&gt;MLJGLMInterface part 3&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJClusteringInterface.jl/pull/15"&gt;MLJClusteringInterface&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJBase.jl/pull/780"&gt;MLJBase multi-target &lt;code&gt;make_regression&lt;/code&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJBase.jl/pull/811"&gt;multi-target make_regression part 2&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/JuliaAI/MLJMultivariateStatsInterface.jl/pull/39"&gt;The big MLJMultivariateStatsInterface pr&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mlj</category>
      <category>ml</category>
      <category>jsoc</category>
      <category>technicalwriter</category>
    </item>
  </channel>
</rss>
