David Josephs for Machine Learning Julia (MLJ.jl)

Posted on Nov 22, 2022 • Edited on Nov 30, 2022

My experience working as a technical writer for MLJ

#mlj #ml #jsoc #technicalwriter

For the last six or so months, I have had the great honor and pleasure of being a technical writer for MLJ as part of Google's Season of Docs.

At the start of this year, I made some big changes in my life, getting a new job at a company that aims to do a lot of good in the world, and switching from Python to Julia. Almost immediately, I fell in love with Julia and wanted to get involved in the open source community. Since I lacked confidence in my ability to actually write Julia code, I decided to sign up for Google Season of Docs for MLJ! Now that this is coming to an end, I would like to share my experiences from the last 6 months, and hopefully encourage other Julia learners to get involved with projects they care about (and write docstrings).

Documenting MLJ

At the start of all this, MLJ's problem was not really a lack of docstrings; it was much more a lack of consistent and helpful docstrings. This problem arises because at the highest level, MLJ essentially provides a convenient, unified frontend to other packages and algorithms (yes, there is much more to the story here, but bear with me!). This means the code is distributed across several locations, with different owners and different levels of required maintenance. To resolve this, MLJ rolled out the MLJ document string standard. For my "season" of docs, I spent my time bringing the docstrings for all the existing MLJ models up to this standard.

How to write an MLJ docstring

I think the most useful thing I can share from the last 6 months is the process I used to write MLJ docstrings without it taking too long!

Probably the easiest part of the MLJ docstring is the "header", which basically describes what the model is and what it does. So for example, let's say I have a classification model which uses some sort of separating hyperplane to do binary classification. At a minimum, my header would look something like:


"""
`SomeSortOfHyperPlaneClassifier`: A classification model that uses some sort of separating hyperplane to do binary classification,
as first described in [link to some paper that describes it]. Maybe we put a few details specific to our implementation here.
"""
SomeSortOfHyperPlaneClassifier

After you have your header, generally what I do next is document all the hyperparameters. To do this, typically I open up the source code and search for the name of the model, looking for a struct definition with the same name. Maybe it will look something like this:

@mlj_model mutable struct SomeSortOfHyperPlaneClassifier <: MMI.Deterministic
    scale::Bool = true
end

All the fields of the struct are the model's hyperparameters! Once you have found these, the task is to figure out what they do. This can be accomplished in a few ways:

Already knowing what they do
Looking them up in the documentation for the package MLJ is interfacing with
Reading source code! (hooray Julia for readable source code!)
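As a rough sketch of the "fields are hyperparameters" idea, plain Julia reflection can list what needs documenting. The struct below is a hypothetical stand-in built with Base.@kwdef so that it runs without any packages (a real MLJ model would be declared with @mlj_model, as above), and the tolerance field is invented purely for illustration:

```julia
# Hypothetical stand-in for a model struct. A real MLJ model would use
# `@mlj_model` and subtype an MLJModelInterface model type instead.
Base.@kwdef mutable struct SomeSortOfHyperPlaneClassifier
    scale::Bool = true
    tolerance::Float64 = 1e-4  # invented field, for illustration only
end

# An instance with default values, so the defaults can be read off
model = SomeSortOfHyperPlaneClassifier()

# The field names are the hyperparameters that need documenting
hyperparameter_names = fieldnames(SomeSortOfHyperPlaneClassifier)

# Print one bullet per hyperparameter: name, declared type, default value
for name in hyperparameter_names
    field_type = fieldtype(SomeSortOfHyperPlaneClassifier, name)
    default = getfield(model, name)
    println("- `", name, "::", field_type, " = ", default, "`: TODO describe")
end
```

The printed bullets are only a starting checklist; what each hyperparameter actually does still has to come from the package documentation or the source.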

Once you have these documented, it is time for the fun part! You can now open up a REPL and load MLJ and the MLJ interface package you are working on. Following this contrived example, we could do something like:

using MLJ
import MLJSomeSortOfHyperPlaneModelsInterface

Now we start to work out our example, because the rest of the documentation essentially exists to show what you need to get your model up and running!

The first step, now that you have a REPL loaded, is to figure out what sort of input types the model accepts, and how the data needs to look. We can figure this out in one of two ways:

Using MLJ's model metadata, which should live somewhere in the source code, looking something like:


 metadata_model(
     SomeSortOfHyperPlaneClassifier,
     input=Table(Continuous),
     target=AbstractVector{<:Finite{2}},
     weights=false,
     path="$(PKG).SomeSortOfHyperPlaneClassifier"
 )

For example, this model takes a table of continuous features as input and is trained on a target vector of two-class finite values (binary classification).

Trial and error, thanks to MLJ's incredibly helpful error messages, which, if you feed an inappropriate scientific type, will tell you exactly what types the model received and what types it expected.
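Before leaning on trial and error, it can help to compare the scientific types of your data against the declared ones directly in the REPL. The following is a minimal sketch, assuming MLJ is installed and using the built-in iris data as a stand-in; the declared types are copied from the contrived metadata above, not from any real model:

```julia
using MLJ

# Built-in iris data: X is a table of features, y a categorical target
X, y = @load_iris

# Inspect the scientific types of the data
schema(X)    # per-column scitypes of the table
scitype(y)   # scitype of the target vector

# Compare against the declared types from the (contrived) metadata
input_ok  = scitype(X) <: Table(Continuous)
target_ok = scitype(y) <: AbstractVector{<:Finite}

# The contrived model wants exactly two classes, and iris has three,
# so this stricter check fails
binary_ok = scitype(y) <: AbstractVector{<:Finite{2}}

println((input_ok = input_ok, target_ok = target_ok, binary_ok = binary_ok))
```

Here the binary check comes out false because iris has three classes, which is the kind of mismatch MLJ's error messages would also report.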

With this information figured out, you can fill out the information in the second section of the MLJ docstring, as follows:

# Training data
In MLJ or MLJBase, bind an instance `model` to data with one of:

    mach = machine(model, X, y)
    mach = machine(model, X, y, w)

Here

- `X`: is any table of input features (eg, a `DataFrame`) whose columns
are of scitype `SCIENTIFIC INPUT TYPE HERE`; check the scitype with `schema(X)`
- `y`: is the target, which can be any `AbstractVector` whose element
scitype is `SCIENTIFIC OUTPUT TYPE HERE`; check the scitype with `scitype(y)`
- `w`: is a vector of `Real` per-observation weights

Train the machine using `fit!(mach, rows=...)`.

Next, we pick the data for the example. Since the example should be easily understood by beginners, it is advisable to use standard datasets like iris, MNIST, crabs, or boston_housing. For edge cases like multitarget regression, there is the make_regression function. In some cases, for example if you are documenting a model that is heavily used in a specific domain (e.g. independent component analysis and signals, or naive Bayes and simple text classification), it is good to pick or create a second dataset that shows how the model would be appropriately used in that domain.

Once you have chosen the data, you can go ahead and train your model! Next, check out fitted_params(my_trained_model) and report(my_trained_model), where my_trained_model is the trained machine. These bits are typically easy, and they go into the corresponding sections of your docstring.

Finally, for its existence to matter, the model also needs to be able to do inference. So, you need to find out what sort of predictions your model makes. Does it return probability distributions when you call predict? If so, does it implement predict_mode or predict_mean to return point predictions? Is it a decomposition model that projects data into a lower dimensional space? If so, it probably implements a transform method! Whatever methods it implements, these get documented in the Operations section of your docstring.

Now, all you have to do is copy your code from the REPL to the example section of the docstring, add references to related models, and you are done! It is not so bad, and at the end of the day you have a docstring that:

Has all the information necessary for someone who is relatively new to Julia or machine learning, while being easy to read and digest
Has an example that people can play around with
Has a clear description of hyperparameters so it can be tuned
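The train, inspect, and predict steps described above might look roughly like this in a REPL session. This is only a sketch, assuming MLJ is installed; it uses MLJ's built-in ConstantClassifier as a stand-in because it needs no extra interface package, so substitute the model you are actually documenting:

```julia
using MLJ

# A standard beginner-friendly dataset
X, y = @load_iris

# Stand-in model: a probabilistic classifier that ships with MLJ
model = ConstantClassifier()

# Bind the model to data and train
mach = machine(model, X, y)
fit!(mach, verbosity=0)

# These feed the fitted-parameters and report sections of the docstring
fitted_params(mach)
report(mach)

# A probabilistic model returns one distribution per row from `predict`...
probabilistic_predictions = predict(mach, X)

# ...and point predictions from `predict_mode`
point_predictions = predict_mode(mach, X)

println(length(probabilistic_predictions), " predictions made")
```

Whatever runs cleanly here is more or less what gets pasted into the example section, trimmed down to the parts a beginner needs.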

Staying organized

There are a few things that the above section didn't cover. They may not be relevant to absolutely everyone's workflow, but they were definitely helpful to me. The biggest one is staying organized! Some interfaces implement a LOT of models (looking at you, MLJMultivariateStatsInterface). Writing all the docstrings out in one file for several miles is extremely difficult: if you accidentally scroll, it may take you some time to figure out which model you are looking at, and you may put some bits in the wrong places. Also, while you certainly can get spellcheck and syntax highlighting in Julia docstrings with tree-sitter (using language injections!), it certainly isn't the easiest way and does not guarantee well-formatted docstrings. Instead, it is best to:

Make a checklist of all the docstrings you want to write
Write them all in separate markdown files
Paste them in as docstrings to the source code

Otherwise, if you are even slightly like me, you will likely get lost in a wall of docstrings with a lot of very similar looking words (thanks to standardization!)
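One way to bootstrap the checklist and the separate markdown files is a few lines of plain Julia. This is a hypothetical helper script, not part of any MLJ tooling: the model names are just examples, the scratch directory is a placeholder, and the section headings follow the sections mentioned in this post, so check them against the MLJ document string standard before relying on them:

```julia
# Example model names; replace with the models your interface package defines
models = ["PCA", "KernelPCA", "ICA"]

# Scratch directory for the demo; use a real folder in practice
docs_dir = mktempdir()

# Write a markdown checklist with one unchecked box per model
checklist_path = joinpath(docs_dir, "CHECKLIST.md")
open(checklist_path, "w") do io
    for name in models
        println(io, "- [ ] ", name)
    end
end

# Section skeleton (assumed headings, to be verified against the standard)
sections = ["Training data", "Hyperparameters", "Operations",
            "Fitted parameters", "Report", "Examples"]

# Create one markdown file per model, pre-filled with the empty sections
for name in models
    open(joinpath(docs_dir, name * ".md"), "w") do io
        println(io, "`", name, "`: one-line description goes here.")
        for section in sections
            println(io, "\n# ", section)
        end
    end
end

println("Wrote ", length(readdir(docs_dir)), " files to ", docs_dir)
```

Each file can then be filled in and spellchecked on its own before being pasted into the source as a docstring.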

Conclusion

Writing these docstrings is not too hard, and it is a great way both to learn more about the thing you are documenting and to get comfortable reading and writing Julia code! If you are relatively new to Julia, I cannot recommend enough looking at packages you either use or want to contribute to and checking out their docstrings. Odds are, because maintaining code is hard and maintaining code and docstrings is harder, the owners of the code would be happy to have an extra brain thinking about their documentation, and will be nice to you when you make a PR (I know it is a little scary when you first start making changes to someone else's code). If you are a package maintainer and maybe don't have a ton of time to keep your docstrings up to date, or they aren't as thorough as you want them to be, open an issue on GitHub so people know you are open to receiving documentation help!

Running list of PRs

lightgbm (wip)
outlier detection neighbors (wip)
MLJTSVDInterface
MLJTSVDInterface part 2
MLJText
MLJText part 2
MLJModels
MLJNaiveBayesInterface
MLJFlux
MLJFlux part 2
MLJGLMInterface
MLJGLMInterface part 2
MLJGLMInterface part 3
MLJClusteringInterface
MLJBase multi-target make_regression
multi-target make_regression part 2
The big MLJMultivariateStatsInterface PR

Top comments (3)

Fortune Walla • Nov 22 '22

Bravo! Seems like quite a complex process that you are a part of as I have seen that MLJ has about 160 models.

How much ML domain knowledge is required to become a technical writer for MLJ?

David Josephs Machine Learning Julia (MLJ.jl) • Nov 22 '22

To be honest, it is not as hard as it sounds. Having domain knowledge certainly makes it much easier, and I cannot claim to have deep knowledge of every machine learning model MLJ touches. But after reading A) the documentation of the package that the model comes from and B) the likely numerous blog posts, papers, and articles about the model being documented (plus incredibly friendly and helpful code review from maintainers, and the occasional dive into source code), you end up gaining a lot of understanding along the way! It just takes more time if you are not already pretty familiar with the models :)

Fortune Walla • Nov 23 '22

Thanks for the reply! Seems like a good way to start off without feeling intimidated by the details.