<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Julia Community 🟣: Patrick Altmeyer</title>
    <description>The latest articles on Julia Community 🟣 by Patrick Altmeyer (@patalt).</description>
    <link>https://forem.julialang.org/patalt</link>
    <image>
      <url>https://forem.julialang.org/images/eas5H_56FnCwUu9Ia0wPEtoYrLL-4ybWdS9L5Vo5-bU/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L3VzZXIvcHJvZmls/ZV9pbWFnZS81NjMv/MjQ0MjZmNzItZDBl/NS00YzA3LTg0ZTkt/MzczYTdmNDVkNDE5/LmpwZWc</url>
      <title>Julia Community 🟣: Patrick Altmeyer</title>
      <link>https://forem.julialang.org/patalt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.julialang.org/feed/patalt"/>
    <language>en</language>
    <item>
      <title>Conformal Prediction Intervals for any Regression Model</title>
      <dc:creator>Patrick Altmeyer</dc:creator>
      <pubDate>Wed, 14 Dec 2022 00:00:14 +0000</pubDate>
      <link>https://forem.julialang.org/patalt/prediction-intervals-for-any-regression-model-16f5</link>
      <guid>https://forem.julialang.org/patalt/prediction-intervals-for-any-regression-model-16f5</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;&lt;img src="https://forem.julialang.org/images/KiSZjVEps85mQfq4eFzJvJu0qs1hv2eOWvW3A6nFG40/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3cx/c3RwOHR1bDRneDg5/aTkyazd2LnBuZw" alt="" width="880" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Conformal Prediction in Julia — Part 3
&lt;/h4&gt;

&lt;p&gt;This is the third (and for now final) part of a series of posts that introduce Conformal Prediction in Julia using &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;ConformalPrediction.jl&lt;/a&gt;. The first &lt;a href="https://forem.julialang.org/patalt/conformal-prediction-in-julia-h9n"&gt;post&lt;/a&gt; introduced Conformal Prediction for supervised classification tasks: we learned that conformal classifiers produce set-valued predictions that are guaranteed to include the true label of a new sample with a certain probability. In the second &lt;a href="https://medium.com/towards-data-science/how-to-conformalize-a-deep-image-classifier-14ead4e1a5a0"&gt;post&lt;/a&gt; we applied these ideas to a more hands-on example: we saw how easy it is to use &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;ConformalPrediction.jl&lt;/a&gt; to conformalize a Deep Learning image classifier.&lt;/p&gt;

&lt;p&gt;In this post, we will look at regression problems instead, that is, supervised learning tasks involving a continuous outcome variable. Regression tasks are as ubiquitous as classification tasks. For example, we might be interested in using a machine learning model to predict house prices, the inflation rate of the Euro Area, or the parameter count of the next large language model. In fact, many readers may be more familiar with regression models than with classification models, in which case it may also be easier for you to understand Conformal Prediction (CP) in this context.&lt;/p&gt;

&lt;h3&gt;
  
  
  📖 Background
&lt;/h3&gt;

&lt;p&gt;Before we start, let’s briefly recap what CP is all about. Don’t worry, we’re not about to deep-dive into methodology. But just to give you a high-level description upfront:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Conformal prediction (a.k.a. conformal inference) is a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Angelopoulos and Bates (2021) (&lt;a href="https://arxiv.org/pdf/2107.07511.pdf"&gt;arXiv&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Intuitively, CP works under the premise of turning heuristic notions of uncertainty into rigorous uncertainty estimates through repeated sampling or the use of dedicated calibration data.&lt;/p&gt;

&lt;p&gt;In what follows we will explore what CP can do by going through a standard machine learning workflow using &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/dev/"&gt;MLJ.jl&lt;/a&gt; and &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;ConformalPrediction.jl&lt;/a&gt;. There will be less focus on how exactly CP works, but references will point you to additional resources.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡💡💡 &lt;strong&gt;Interactive Version&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post is also available as a fully interactive &lt;a href="https://github.com/fonsp/Pluto.jl"&gt;Pluto.jl&lt;/a&gt; 🎈 notebook: click &lt;a href="https://binder.plutojl.org/v0.19.12/open?url=https%253A%252F%252Fraw.githubusercontent.com%252Fpat-alt%252FConformalPrediction.jl%252Fmain%252Fdocs%252Fpluto%252Fintro.jl"&gt;here&lt;/a&gt;. In my own experience, this may take some time to load, certainly long enough to get yourself a hot beverage ☕ and first read on here. But I promise you that the wait is worth it!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  📈 Data
&lt;/h3&gt;

&lt;p&gt;Most machine learning workflows start with data. For illustrative purposes we will work with synthetic data. The helper function below can be used to generate some regression data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; get_data&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xmax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;noise&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fun&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;Function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fun&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
    &lt;span class="c"&gt;# Inputs:&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Distributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Uniform&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;xmax&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xmax&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MLJBase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;

    &lt;span class="c"&gt;# Outputs:&lt;/span&gt;
    &lt;span class="n"&gt;ε&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;randn&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="n"&gt;noise&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fun&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ε&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Figure 1 illustrates our observations (dots) along with the ground-truth mapping from inputs to outputs (line). We have defined that mapping f: 𝒳 ↦ 𝒴 as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/gIBPh1LVphEyN5u5mvC19aaDUBf8pYiM37WVNsF-myk/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NjAwLzEqWS1STUQ2/V205NXdQb1lCckxH/WERVUS5wbmc" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/gIBPh1LVphEyN5u5mvC19aaDUBf8pYiM37WVNsF-myk/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NjAwLzEqWS1STUQ2/V205NXdQb1lCckxH/WERVUS5wbmc" alt="" width="600" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Some synthetic regression data. Observations are shown as dots. The ground-truth mapping from inputs to outputs is shown as a dashed line. Image by author.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  🏋️ Model Training using &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/dev/"&gt;MLJ&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.paltmeyer.com/blog/posts/conformal-regression/(https://github.com/pat-alt/ConformalPrediction.jl)"&gt;ConformalPrediction.jl&lt;/a&gt; is interfaced to &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/dev/"&gt;MLJ.jl&lt;/a&gt; (Blaom et al. 2020): a comprehensive Machine Learning Framework for Julia. MLJ.jl provides a large and growing suite of popular machine learning models that can be used for supervised and unsupervised tasks. Conformal Prediction is a model-agnostic approach to uncertainty quantification, so it can be applied to any common supervised machine learning model.&lt;/p&gt;

&lt;p&gt;The interface to MLJ.jl therefore seems natural: any (supervised) MLJ.jl model can now be conformalized using ConformalPrediction.jl. By leveraging existing MLJ.jl functionality for common tasks like training, prediction, and model evaluation, this package is lightweight and scalable. Now let's see how all of that works ...&lt;/p&gt;

&lt;p&gt;To start with, let’s split our data into a training and test set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eachindex&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s define a model for our regression task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;Model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@load&lt;/span&gt; &lt;span class="n"&gt;KNNRegressor&lt;/span&gt; &lt;span class="n"&gt;pkg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NearestNeighborModels&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡💡💡 &lt;strong&gt;Have it your way!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think this dataset is too simple? Wondering why on earth I’m not using XGBoost for this task? In the interactive &lt;a href="https://binder.plutojl.org/v0.19.12/open?url=https%253A%252F%252Fraw.githubusercontent.com%252Fpat-alt%252FConformalPrediction.jl%252Fmain%252Fdocs%252Fpluto%252Fintro.jl"&gt;version&lt;/a&gt; of this post you have full control over the data and the model. Try it out!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Using standard MLJ.jl workflows, let us now first train the unconformalized model. We first bind our model and data together in a machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;mach_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we fit the machine to the training data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;MLJBase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach_raw&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbosity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Figure 2 below shows the resulting point predictions for the test data set:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/yZ8XZzamShbQhWNv6z8HeSo-eeSFtcuY-KFKI9ktVA0/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NTk5LzEqZHNOV0J1/WlVleU9Lc3VTUndv/ZE5vUS5wbmc" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/yZ8XZzamShbQhWNv6z8HeSo-eeSFtcuY-KFKI9ktVA0/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NTk5LzEqZHNOV0J1/WlVleU9Lc3VTUndv/ZE5vUS5wbmc" alt="" width="599" height="399"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: Point predictions for our machine learning model. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;How is our model doing? It’s never quite right, of course, since predictions are estimates and therefore uncertain. Let’s see how we can use Conformal Prediction to express that uncertainty.&lt;/p&gt;
&lt;h3&gt;
  
  
  🔥 Conformalizing the Model
&lt;/h3&gt;

&lt;p&gt;We can turn our model into a conformalized model in just one line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;conf_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conformal_model&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, &lt;code&gt;conformal_model&lt;/code&gt; creates an Inductive Conformal Regressor (more on this below) when called on a &lt;code&gt;&amp;lt;:Deterministic&lt;/code&gt; model. This behaviour can be changed using the optional &lt;code&gt;method&lt;/code&gt; keyword argument.&lt;/p&gt;
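
&lt;p&gt;For instance, the following calls (keyword names as per the package documentation at the time of writing; double-check against the current API) request a stricter coverage rate or a different conformal method:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;conf_model = conformal_model(model; coverage=0.95)           # stricter coverage rate
conf_model = conformal_model(model; method=:jackknife_plus)  # a different CP method
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;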

&lt;p&gt;To train our conformal model we can once again rely on standard MLJ.jl workflows. We first bind the conformal model and data together in a machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;mach&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf_model&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we fit the machine to the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;MLJBase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbosity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let us look at the predictions for our test data again. The chart below shows the results for our conformalized model. Predictions from conformal regressors are range-valued: for each new sample the model returns an interval (yₗ, yᵤ) ⊆ 𝒴 that covers the true outcome of the test sample with a user-specified probability (1-α), where α is the expected error rate. This is known as the &lt;strong&gt;marginal coverage guarantee&lt;/strong&gt; and it is proven to hold under the assumption that training and test data are exchangeable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/sN90d498BX42TIzo3YlIMxXcc7BWi5Cs7KMGuYCYpwI/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NTk5LzEqaXIxdUlH/MXM5aFFDLU9CTHpM/eG5YZy5wbmc" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/sN90d498BX42TIzo3YlIMxXcc7BWi5Cs7KMGuYCYpwI/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NTk5LzEqaXIxdUlH/MXM5aFFDLU9CTHpM/eG5YZy5wbmc" alt="" width="599" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3: Prediction intervals for our conformalized machine learning model. Image by author.&lt;/em&gt;&lt;/p&gt;
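
&lt;p&gt;For reference, the intervals shown above come from a single call to the standard MLJ.jl &lt;code&gt;predict&lt;/code&gt; operation (a sketch; &lt;code&gt;test&lt;/code&gt; holds the indices of the held-out rows):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;ŷ = MLJBase.predict(mach, rows=test)  # interval-valued predictions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;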

&lt;p&gt;Intuitively, a higher coverage rate leads to larger prediction intervals: since a larger interval covers a larger subspace of 𝒴, it is more likely to cover the true value.&lt;/p&gt;

&lt;p&gt;I don’t expect you to believe me that the marginal coverage property really holds. In fact, I couldn’t believe it myself when I first learned about it. If you like mathematical proofs, you can find one in this &lt;a href="https://arxiv.org/pdf/2107.07511.pdf"&gt;tutorial&lt;/a&gt;, for example. If you like convincing yourself through empirical observations, read on below …&lt;/p&gt;
&lt;h3&gt;
  
  
  🧐 Evaluation
&lt;/h3&gt;

&lt;p&gt;To verify the marginal coverage property empirically we can look at the empirical coverage rate of our conformal predictor (see Section 3 of the &lt;a href="https://arxiv.org/pdf/2107.07511.pdf"&gt;tutorial&lt;/a&gt; for details). To this end our package provides a custom performance measure &lt;code&gt;emp_coverage&lt;/code&gt; that is compatible with MLJ.jl model evaluation workflows. In particular, we will call &lt;code&gt;evaluate!&lt;/code&gt; on our conformal model using &lt;code&gt;emp_coverage&lt;/code&gt; as our performance metric. The resulting empirical coverage rate should then be close to the desired level of coverage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;model_evaluation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="n"&gt;evaluate!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MLJBase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;measure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;emp_coverage&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbosity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Empirical coverage: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(round(model_evaluation.measurement[1], digits=3))"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Coverage per fold: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(round.(model_evaluation.per_fold[1], digits=3))"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;Empirical&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.902&lt;/span&gt; 
&lt;span class="n"&gt;Coverage&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="n"&gt;fold&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.904&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.874&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.874&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.898&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.922&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;✅ ✅ ✅ Great! We got an empirical coverage rate that is slightly higher than desired 😁 … but why isn’t it exactly the same?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In most cases it will be slightly higher than desired, since (1-α) is a lower bound. But note that it can also be slightly lower than desired. That is because the coverage property is “marginal” in the sense that the probability is averaged over the randomness in the data. For most purposes a large enough calibration set size (n&amp;gt;1000) mitigates that randomness enough. Depending on your choices above, the calibration set may be quite small (set to 500), which can lead to &lt;strong&gt;coverage slack&lt;/strong&gt; (see Section 3 in the &lt;a href="https://arxiv.org/pdf/2107.07511.pdf"&gt;tutorial&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  So what’s happening under the hood?
&lt;/h3&gt;

&lt;p&gt;Inductive Conformal Prediction (also referred to as Split Conformal Prediction) broadly speaking works as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Partition the training data into a proper training set and a separate calibration set.&lt;/li&gt;
&lt;li&gt;Train the machine learning model on the proper training set.&lt;/li&gt;
&lt;li&gt;Using some heuristic notion of uncertainty (e.g., absolute error in the regression case), compute nonconformity scores using the calibration data and the fitted model.&lt;/li&gt;
&lt;li&gt;For the given coverage ratio compute the corresponding quantile &lt;em&gt;q&lt;/em&gt; of the empirical distribution of nonconformity scores.&lt;/li&gt;
&lt;li&gt;For the given quantile and test sample &lt;em&gt;X&lt;/em&gt;, form the corresponding conformal prediction set like so: &lt;em&gt;C&lt;/em&gt;(&lt;em&gt;X&lt;/em&gt;) &lt;em&gt;=&lt;/em&gt; {&lt;em&gt;y:&lt;/em&gt; s(&lt;em&gt;X,y&lt;/em&gt;) &lt;em&gt;≤ q&lt;/em&gt;}&lt;/li&gt;
&lt;/ol&gt;
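
&lt;p&gt;In plain Julia, steps 3 to 5 can be sketched as follows. This is a simplified illustration using absolute-error nonconformity scores, not the package's actual internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using Statistics: quantile

function split_conformal(ŷ_calib, y_calib, ŷ_test; coverage=0.9)
    n = length(y_calib)
    # Step 3: nonconformity scores are absolute residuals on the calibration set
    scores = abs.(y_calib .- ŷ_calib)
    # Step 4: finite-sample-adjusted quantile of the empirical score distribution
    q̂ = quantile(scores, min(1.0, ceil((n + 1) * coverage) / n))
    # Step 5: symmetric prediction interval around each point prediction
    return [(ŷ - q̂, ŷ + q̂) for ŷ in ŷ_test]
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;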

&lt;h3&gt;
  
  
  🔃 Recap
&lt;/h3&gt;

&lt;p&gt;This has been a super quick tour of &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;ConformalPrediction.jl&lt;/a&gt;. We have seen how the package naturally integrates with &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/dev/"&gt;MLJ.jl&lt;/a&gt;, allowing users to generate rigorous predictive uncertainty estimates for any supervised machine learning model.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Are we done?&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Quite cool, right? Using a single API call we are able to generate rigorous prediction intervals for all kinds of different regression models. Have we just solved predictive uncertainty quantification once and for all? Do we even need to bother with anything else? Conformal Prediction is a very useful tool, but like so many other things, it is not the final answer to all our problems. In fact, let’s see if we can take CP to its limits.&lt;/p&gt;

&lt;p&gt;The data-generating helper function from above takes an optional argument &lt;code&gt;xmax&lt;/code&gt;. By increasing that value, we effectively expand the domain of our input. Let's do that and see how our conformal model does on this new out-of-domain data.&lt;/p&gt;
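
&lt;p&gt;In code, this might look as follows (values are illustrative; the exact setup is available in the interactive notebook):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;Xood, yood = get_data(N=1000, xmax=10.0)  # far beyond the training domain of 3.0
ŷ_ood = MLJBase.predict(mach, Xood)       # conformal intervals for out-of-domain data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;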

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/Pq214uYIgNEo2DAVyU3m-dvICH2WS9T9wwV-c0sUwes/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NjAwLzEqd3JHajZr/bHdIUkVoR3pTSjZp/MDZZUS5wbmc" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/Pq214uYIgNEo2DAVyU3m-dvICH2WS9T9wwV-c0sUwes/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NjAwLzEqd3JHajZr/bHdIUkVoR3pTSjZp/MDZZUS5wbmc" alt="" width="600" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 4: Prediction intervals for our conformalized machine learning model applied to out-of-domain data. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Whooooops 🤕 … looks like we’re in trouble: in Figure 4 the prediction intervals do not cover out-of-domain test samples well. What happened here?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By expanding the domain of our inputs, we have violated the exchangeability assumption. When that assumption is violated, the marginal coverage property does not hold. But do not despair! There are ways to deal with this.&lt;/p&gt;

&lt;h3&gt;
  
  
  📚 Read on
&lt;/h3&gt;

&lt;p&gt;If you are curious to find out more, be sure to read on in the &lt;a href="https://www.paltmeyer.com/ConformalPrediction.jl/stable/"&gt;docs&lt;/a&gt;. There are also a number of useful resources to learn more about Conformal Prediction, a few of which I have listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification&lt;/em&gt; by Angelopoulos and Bates (&lt;a href="https://arxiv.org/pdf/2107.07511.pdf"&gt;2022&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Awesome Conformal Prediction&lt;/em&gt; repository by Manokhin (&lt;a href="https://github.com/valeman/awesome-conformal-prediction"&gt;2022&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAPIE&lt;/strong&gt;: a comprehensive Python &lt;a href="https://mapie.readthedocs.io/en/latest/index.html"&gt;library&lt;/a&gt; for conformal prediction.&lt;/li&gt;
&lt;li&gt;My previous two blog posts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;p&gt;Angelopoulos, Anastasios N., and Stephen Bates. 2021. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” &lt;a href="https://arxiv.org/abs/2107.07511"&gt;https://arxiv.org/abs/2107.07511&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Blaom, Anthony D., Franz Kiraly, Thibaut Lienart, Yiannis Simillides, Diego Arenas, and Sebastian J. Vollmer. 2020. “MLJ: A Julia Package for Composable Machine Learning.” &lt;em&gt;Journal of Open Source Software&lt;/em&gt; 5 (55): 2704. &lt;a href="https://doi.org/10.21105/joss.02704"&gt;https://doi.org/10.21105/joss.02704&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.paltmeyer.com/blog/posts/conformal-regression/"&gt;&lt;em&gt;https://www.paltmeyer.com&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on December 12, 2022.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>machinelearning</category>
      <category>conformalprediction</category>
      <category>julia</category>
      <category>regression</category>
    </item>
    <item>
      <title>How to Conformalize a Deep Image Classifier</title>
      <dc:creator>Patrick Altmeyer</dc:creator>
      <pubDate>Mon, 05 Dec 2022 00:00:27 +0000</pubDate>
      <link>https://forem.julialang.org/patalt/how-to-conformalize-a-deep-image-classifier-50p2</link>
      <guid>https://forem.julialang.org/patalt/how-to-conformalize-a-deep-image-classifier-50p2</guid>
      <description>&lt;h4&gt;
  
  
  Conformal Prediction in Julia — Part 2
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/hrEgl4ybyi1ReR3Ub85CzByBRGWChXQPj8RlKzTy3Lk/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/OTAwLzEqcGtGMkxk/dmIwTUdhbkpMMzhy/UEdZZy5naWY" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/hrEgl4ybyi1ReR3Ub85CzByBRGWChXQPj8RlKzTy3Lk/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/OTAwLzEqcGtGMkxk/dmIwTUdhbkpMMzhy/UEdZZy5naWY" alt="" width="880" height="293"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Conformal Predictions sets with varying degrees of uncertainty. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Deep Learning is popular and — for some tasks like image classification — remarkably powerful. But it is also well-known that Deep Neural Networks (DNN) can be unstable (Goodfellow, Shlens, and Szegedy 2014) and poorly calibrated. Conformal Prediction can be used to mitigate these pitfalls.&lt;/p&gt;

&lt;p&gt;In the first &lt;a href="https://forem.julialang.org/patalt/conformal-prediction-in-julia-h9n"&gt;part&lt;/a&gt; of this series of posts on Conformal Prediction, we looked at the basic underlying methodology and how CP can be implemented in Julia using &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;ConformalPrediction.jl&lt;/a&gt;. This second part of the series is a more goal-oriented how-to guide: it demonstrates how you can conformalize a deep learning image classifier built in Flux.jl in just a few lines of code.&lt;/p&gt;
&lt;h3&gt;
  
  
  🎯 The Task at Hand
&lt;/h3&gt;

&lt;p&gt;The task at hand is to predict the labels of handwritten images of digits using the famous MNIST dataset (LeCun 1998). Importing this popular machine learning dataset in Julia is made remarkably easy through MLDatasets.jl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;MLDatasets&lt;/span&gt;
&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;Xraw&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yraw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MNIST&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="x"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;Xraw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Xraw&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;yraw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yraw&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🚧 Building the Network
&lt;/h3&gt;

&lt;p&gt;To model the mapping from image inputs to labels, we will rely on a simple Multi-Layer Perceptron (MLP). A great Julia library for Deep Learning is Flux.jl. But wait ... doesn't ConformalPrediction.jl work with models trained in MLJ.jl? That's right, but fortunately there exists a Flux.jl interface to MLJ.jl, namely MLJFlux.jl. The interface is still in its early stages, but already very powerful and easily accessible for anyone (like myself) who is used to building Neural Networks in Flux.jl.&lt;/p&gt;

&lt;p&gt;In Flux.jl, you could build an MLP for this task as follows,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt;

&lt;span class="n"&gt;mlp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chain&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flatten&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prod&lt;/span&gt;&lt;span class="x"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="x"&gt;)),&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relu&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where (28,28) is just the input dimension (28x28 pixel images). Since we have ten digits, our output dimension is 10. For a full tutorial on how to build an MNIST image classifier relying solely on Flux.jl, check out this &lt;a href="https://fluxml.ai/Flux.jl/stable/tutorials/2021-01-26-mlp/"&gt;tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can do the exact same thing in MLJFlux.jl as follows,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;MLJFlux&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MLJFlux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nd"&gt;@builder&lt;/span&gt; &lt;span class="n"&gt;Chain&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flatten&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prod&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_in&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relu&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_out&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where we rely on the @builder macro to make the transition from Flux.jl to MLJ.jl as seamless as possible. Finally, MLJFlux.jl already comes with a number of helper functions to define plain-vanilla networks. In this case, we will use the ImageClassifier with our custom builder and cross-entropy loss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;ImageClassifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@load&lt;/span&gt; &lt;span class="n"&gt;ImageClassifier&lt;/span&gt;
&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ImageClassifier&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crossentropy&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generated instance clf is a model (in the MLJ.jl sense), so from this point on we can rely on standard MLJ.jl workflows. For example, we can bind our model to data to create a machine and then evaluate it on a holdout set as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;mach&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;evaluate!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resampling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Holdout&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fraction_train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predict_mode&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;measure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The accuracy of our very simple model is not amazing, but good enough for the purpose of this tutorial. For each image, our MLP returns a softmax output for each possible digit: 0,1,2,3,…,9. Since each individual softmax output is valued between zero and one, yₖ ∈ (0,1), it is commonly interpreted as a probability: yₖ ≔ p(y=k|X). Edge cases — that is, values close to either zero or one — indicate high predictive certainty. But this is only a heuristic notion of predictive uncertainty (Angelopoulos and Bates 2021). Next, we will turn this heuristic notion of uncertainty into a rigorous one using Conformal Prediction.&lt;/p&gt;
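To make the heuristic concrete, here is a minimal sketch (not from the original post; the logit values are made up) of how softmax maps raw network outputs into (0,1) so that they sum to one:

```julia
# Softmax maps a vector of raw logits to values in (0,1) that sum to one.
# These values are commonly read as class probabilities, but that reading
# is only a heuristic notion of uncertainty.
softmax(z) = exp.(z) ./ sum(exp.(z))

logits = [2.0, 0.5, -1.0]   # hypothetical raw MLP outputs for three classes
p = softmax(logits)

println(p)                  # each entry lies in (0,1); the entries sum to one
```

Values of p close to zero or one are the "edge cases" referred to above; anything in between signals that the network is, heuristically, less sure.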

&lt;h3&gt;
  
  
  🔥 Conformalizing the Network
&lt;/h3&gt;

&lt;p&gt;Since clf is a model, it is also compatible with our package: ConformalPrediction.jl. To conformalize our MLP, we therefore only need to call conformal_model(clf). Since the generated instance conf_model is also just a model, we can still rely on standard MLJ.jl workflows. Below we first bind it to data and then fit it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;ConformalPrediction&lt;/span&gt;
&lt;span class="n"&gt;conf_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conformal_model&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;simple_inductive&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="o"&gt;=.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mach&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf_model&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aaaand … we’re done! Let’s look at the results in the next section.&lt;/p&gt;

&lt;h3&gt;
  
  
  📊 Results
&lt;/h3&gt;

&lt;p&gt;Figure 2 below presents the results. Figure 2 (a) displays highly certain predictions, now defined in the rigorous sense of Conformal Prediction: in each case, the conformal set (just beneath the image) includes only one label.&lt;/p&gt;

&lt;p&gt;Figure 2 (b) and Figure 2 (c) display increasingly uncertain predictions of set size two and three, respectively. They demonstrate that CP is well equipped to deal with samples characterized by high aleatoric uncertainty: digits four (4), seven (7) and nine (9) share certain similarities. So do digits five (5) and six (6) as well as three (3) and eight (8). These may be hard to distinguish from each other even after seeing many examples (and even for a human). It is therefore unsurprising to see that these digits often end up together in conformal sets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/xp9tJ6gf5Dod7Rmelyoer1jjyWrnWVlAer_YxX8iOiw/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/Nzk4LzEqNGE0NGtC/Z3YzVVdKX1BlUXVY/dGo5Zy5wbmc" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/xp9tJ6gf5Dod7Rmelyoer1jjyWrnWVlAer_YxX8iOiw/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/Nzk4LzEqNGE0NGtC/Z3YzVVdKX1BlUXVY/dGo5Zy5wbmc" alt="" width="798" height="267"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2 (a): Randomly selected prediction sets of size |C|=1. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/OPtanH0gwLwDmBrdSKVe53j84nciXtvmuj0tagd6eYs/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/Nzk2LzEqbGE2ODhu/ZjI2TVptbjRlcHFC/YkRaZy5wbmc" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/OPtanH0gwLwDmBrdSKVe53j84nciXtvmuj0tagd6eYs/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/Nzk2LzEqbGE2ODhu/ZjI2TVptbjRlcHFC/YkRaZy5wbmc" alt="" width="796" height="265"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2 (b): Randomly selected prediction sets of size |C|=2. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/29AU7sepS3GfR-uAAWxSZK80YE-ZVvAF4oKWhpR6GVc/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/Nzk3LzEqQkQ4ZXpn/ejRVODJlWm96MWE2/VlJPUS5wbmc" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/29AU7sepS3GfR-uAAWxSZK80YE-ZVvAF4oKWhpR6GVc/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/Nzk3LzEqQkQ4ZXpn/ejRVODJlWm96MWE2/VlJPUS5wbmc" alt="" width="797" height="266"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2 (c): Randomly selected prediction sets of size |C|=3. Image by author.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  🧐 Evaluation
&lt;/h3&gt;

&lt;p&gt;To evaluate the performance of conformal models, specific performance measures can be used to assess whether the model is correctly specified and well-calibrated (Angelopoulos and Bates 2021). We will look at this in more detail in a future post. For now, just be aware that these measures are already available in ConformalPrediction.jl, and we will briefly showcase them here.&lt;/p&gt;

&lt;p&gt;As for many other things, ConformalPrediction.jl taps into the existing functionality of MLJ.jl for model evaluation. In particular, we will see below how we can use the generic evaluate! method on our machine. To assess the correctness of our conformal predictor, we can compute the empirical coverage rate using the custom performance measure emp_coverage. With respect to model calibration, we will look at the model's conditional coverage. For adaptive, well-calibrated conformal models, conditional coverage is high. A general go-to measure for assessing conditional coverage is size-stratified coverage. The custom measure for this purpose is called size_stratified_coverage, aliased as ssc.&lt;/p&gt;

&lt;p&gt;The code below implements the model evaluation using cross-validation. The Simple Inductive Classifier that we used above is not adaptive, and hence the attained conditional coverage is low compared to the overall empirical coverage, which is close to 0.95 and thus in line with the desired coverage rate specified above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;_eval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluate!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resampling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CV&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;measure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;emp_coverage&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ssc&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Empirical coverage: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(round(_eval.measurement[1], digits=3))"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SSC: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(round(_eval.measurement[2], digits=3))"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Empirical&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.957&lt;/span&gt;
&lt;span class="n"&gt;SSC&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.556&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can attain higher adaptivity (SSC) when using adaptive prediction sets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;conf_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conformal_model&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;adaptive_inductive&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="o"&gt;=.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mach&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf_model&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_eval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluate!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resampling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CV&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;measure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;emp_coverage&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ssc&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Empirical coverage: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(round(_eval.measurement[1], digits=3))"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;println&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SSC: &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;(round(_eval.measurement[2], digits=3))"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Empirical&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;
&lt;span class="n"&gt;SSC&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.942&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also have a look at the resulting set size for both approaches using a custom Plots.jl recipe (Figure 3). In line with the above, the spread is wider for the adaptive approach, which reflects that “the procedure is effectively distinguishing between easy and hard inputs” (Angelopoulos and Bates 2021).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;plt_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_mod&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
    &lt;span class="n"&gt;push!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plt_list&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fitresult&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_mod&lt;/span&gt;&lt;span class="x"&gt;)))&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plt_list&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;&lt;span class="n"&gt;bg_colour&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;transparent&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/Ycw-9jRhjJKW5mbxSWsh6c0MlKiMLUDDilo4Bxu-uSg/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NjkzLzEqYlV3N0hL/YXFXUkFyYkdoazNO/ZU1fZy5wbmc" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/Ycw-9jRhjJKW5mbxSWsh6c0MlKiMLUDDilo4Bxu-uSg/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NjkzLzEqYlV3N0hL/YXFXUkFyYkdoazNO/ZU1fZy5wbmc" alt="" width="693" height="262"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3: Distribution of set sizes for both approaches. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🔁 Recap
&lt;/h3&gt;

&lt;p&gt;In this short guide we have seen how easy it is to conformalize a deep learning image classifier in Julia using ConformalPrediction.jl. Almost any deep neural network trained in Flux.jl is compatible with MLJ.jl and can therefore be conformalized in just a few lines of code. This makes it remarkably easy to move uncertainty heuristics to rigorous predictive uncertainty estimates. We have also seen a sneak peek at performance evaluation of conformal predictors. Stay tuned for more!&lt;/p&gt;

&lt;h3&gt;
  
  
  🎓 References
&lt;/h3&gt;

&lt;p&gt;Angelopoulos, Anastasios N., and Stephen Bates. 2021. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” &lt;a href="https://arxiv.org/abs/2107.07511"&gt;https://arxiv.org/abs/2107.07511&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Goodfellow, Ian J, Jonathon Shlens, and Christian Szegedy. 2014. “Explaining and Harnessing Adversarial Examples.” &lt;a href="https://arxiv.org/abs/1412.6572"&gt;https://arxiv.org/abs/1412.6572&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;LeCun, Yann. 1998. “The MNIST Database of Handwritten Digits.”&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.paltmeyer.com/blog/posts/conformal-image-classifier/"&gt;&lt;em&gt;https://www.paltmeyer.com&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on December 5, 2022.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>machinelearning</category>
      <category>julia</category>
      <category>conformalprediction</category>
      <category>artificialintelligen</category>
    </item>
    <item>
      <title>A year of using Quarto with Julia</title>
      <dc:creator>Patrick Altmeyer</dc:creator>
      <pubDate>Mon, 21 Nov 2022 00:00:09 +0000</pubDate>
      <link>https://forem.julialang.org/patalt/a-year-of-using-quarto-with-julia-3jbi</link>
      <guid>https://forem.julialang.org/patalt/a-year-of-using-quarto-with-julia-3jbi</guid>
      <description>&lt;p&gt;&lt;a href="https://forem.julialang.org/images/R_diVg4vgJS0t1-p9mAb_5wSLxSDXKhRSvw9a9BVgGA/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2Rz/anFqOTVqajhqMTAy/cW5reTFmLmdpZg" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/R_diVg4vgJS0t1-p9mAb_5wSLxSDXKhRSvw9a9BVgGA/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2Rz/anFqOTVqajhqMTAy/cW5reTFmLmdpZg" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Earlier this year in July, I gave a short Experience Talk at &lt;a href="https://pretalx.com/juliacon-2022/speaker/8DGYCX/"&gt;JuliaCon&lt;/a&gt;. In a related blog &lt;a href="https://www.paltmeyer.com/blog/posts/julia-and-quarto-a-match-made-in-heaven/index.html"&gt;post&lt;/a&gt; I explained how the introduction of Quarto made my transition from R to Julia painless: I would be able to start learning Julia without having to give up on all the benefits associated with R Markdown.&lt;/p&gt;

&lt;p&gt;In November 2022, I am presenting on this topic again at the &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:6995656928957718528?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A6995656928957718528%29"&gt;2nd JuliaLang Eindhoven meetup&lt;/a&gt;. In addition to the &lt;a href="https://www.paltmeyer.com/content/talks/posts/2022-julia-eindhoven/index.html"&gt;slides&lt;/a&gt;, I thought I’d share a small companion blog post that highlights some useful tips and tricks for anyone interested in using Quarto with Julia.&lt;/p&gt;

&lt;h3&gt;
  
  
  General things
&lt;/h3&gt;

&lt;p&gt;We will start in this section with a few general recommendations.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup
&lt;/h4&gt;

&lt;p&gt;I continue to recommend using VSCode for any work with Quarto and Julia. The Quarto &lt;a href="https://quarto.org/docs/computations/julia.html#vs-code"&gt;docs&lt;/a&gt; explain how to get started by installing the necessary Quarto and IJulia extensions. Since most Julia users will regularly want to update their Julia version, I would additionally recommend adding IJulia.jl to your ~/.julia/config/startup.jl file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# Setup OhMyREPL, Revise and Term&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pkg&lt;/span&gt;
&lt;span class="n"&gt;let&lt;/span&gt;
    &lt;span class="n"&gt;pkgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Revise"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"OhMyREPL"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Term"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"IJulia"&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pkg&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pkgs&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;Base&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_package&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pkg&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nb"&gt;nothing&lt;/span&gt;
            &lt;span class="n"&gt;Pkg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pkg&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, you only need to remember that …&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;… if you install a new Julia binary […], you must update the IJulia installation […] by running&lt;/em&gt; &lt;em&gt;Pkg.build("IJulia")&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source:&lt;/em&gt; &lt;a href="https://julialang.github.io/IJulia.jl/stable/manual/installation/#Updating-Julia-and-IJulia"&gt;&lt;em&gt;IJulia docs&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I guess this step can also be automated in ~/.julia/config/startup.jl, but I haven't tried that yet.&lt;/p&gt;
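For the curious, one way such automation might look (an untested sketch; the marker-file path and the rebuild logic are my own assumptions, not an official recipe) is to record the Julia version at start-up and rebuild IJulia whenever it changes:

```julia
# Hypothetical addition to ~/.julia/config/startup.jl: rebuild IJulia
# whenever the Julia version has changed since the last session.
import Pkg

let
    # Assumed location for a small marker file recording the last version.
    marker = joinpath(first(DEPOT_PATH), "config", "last_ijulia_build")
    current = string(VERSION)
    previous = isfile(marker) ? read(marker, String) : ""
    if previous != current
        Pkg.build("IJulia")   # the manual step quoted from the IJulia docs above
        write(marker, current)
    end
end
```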

&lt;h4&gt;
  
  
  Using .ipynb vs .qmd
&lt;/h4&gt;

&lt;p&gt;I also continue to recommend working with Quarto notebooks as opposed to Jupyter notebooks (files ending in .qmd and .ipynb, respectively). This is partially just based on preference (from R Markdown I'm used to working with .Rmd files), but there is also a good reason to consider using .qmd, even if you're used to working with Jupyter: the code chunks in your Quarto notebook automatically link to the Julia REPL in VSCode. In other words, you can run code chunks in your notebook and then access any variable that you may have created in the REPL. I find this quite useful, because it allows me to quickly test code. Perhaps there's a good way to do this with Jupyter notebooks as well, but when I last used them I would always have to insert new code cells to test stuff.&lt;/p&gt;

&lt;p&gt;Either way, switching between Jupyter and Quarto notebooks is straightforward: quarto convert notebook.qmd will convert any Quarto notebook into a Jupyter notebook and vice versa. One potential benefit of Jupyter notebooks is their connection to Google Colab: it is possible to store Jupyter notebooks on GitHub and make them available on Colab, allowing users to quickly interact with your code without the need to clone anything. If this is important to you, you can still work with .qmd documents and simply specify keep-ipynb: true in the YAML header.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dynamic Content
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The world and the data that describes it is not static 📈. Why should scientific outputs be?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the things I have always really loved about R Markdown was the ability to use inline code: the Knitr engine allows you to call and render any object x that you have created in preceding R chunks like this: r x. This is very powerful, because it enables us to bridge the gap between computations and output. In other words, it allows us to easily produce reproducible and dynamic content.&lt;/p&gt;

&lt;p&gt;Until recently I had not been aware that this is also possible for Julia. Consider the following example. The code below depends on remote data that is continuously updated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;MarketData&lt;/span&gt;
&lt;span class="n"&gt;snp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yahoo&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"^GSPC"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Dates&lt;/span&gt;
&lt;span class="n"&gt;last_trade_day&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snp&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;p_close&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snp&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="x"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;last_trade_day_formatted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dates&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_trade_day&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"U d, yyyy"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It loads the most recent publicly available data on equity prices from Yahoo finance. In an ideal world, we’d like any updates to these inputs to be reflected in our output. That way you can just re-render the Quarto notebook to get an updated report. To render Julia code inline, we use Markdown.jl like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Markdown&lt;/span&gt;
&lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"""
When the S&amp;amp;P 500 last traded, on &lt;/span&gt;&lt;span class="si"&gt;$(last_trade_day_formatted)&lt;/span&gt;&lt;span class="s"&gt;, it closed at &lt;/span&gt;&lt;span class="si"&gt;$(p_close)&lt;/span&gt;&lt;span class="s"&gt;. 
"""&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the S&amp;amp;P 500 last traded, on November 18, 2022, it closed at 3965.340088.&lt;/p&gt;

&lt;p&gt;In practice, one would of course set #| echo: false in this case. Whatever content you publish, this approach will keep it up-to-date. This practice of simply re-rendering the source notebook also ensures that any other output remains up-to-date (e.g. &lt;a href="https://www.paltmeyer.com/blog/posts/tips-and-tricks-for-using-quarto-with-julia/#fig-snp"&gt;Figure 1&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/WNreIiom4DOf7t9gnAyR2730EfmF_Hmz4bFEOS4iLp0/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NjczLzEqLTBPTHRf/Q3VyRUdKaFhmWkI0/bVdPQS5wbmc" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/WNreIiom4DOf7t9gnAyR2730EfmF_Hmz4bFEOS4iLp0/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NjczLzEqLTBPTHRf/Q3VyRUdKaFhmWkI0/bVdPQS5wbmc" alt="Figure 1: Price history of the S&amp;amp;P 500. Image by author." width="673" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Code Execution
&lt;/h4&gt;

&lt;p&gt;Related to the previous point, I typically define the following execution options in my _quarto.yml or _metadata.yml. The freeze: auto option ensures that documents are only re-rendered if the source changes. In cases where code should always be re-executed, you would want to set freeze: false instead. I set output: false because I typically have a lot of code chunks that don't generate any output of immediate interest to readers.&lt;/p&gt;
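&lt;p&gt;For reference, a sketch of what these two execution options look like in the _quarto.yml:&lt;/p&gt;

```yaml
execute:
  freeze: auto   # re-render a document only when its source changes
  output: false  # hide chunk output unless explicitly enabled per chunk
```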

&lt;h4&gt;
  
  
  Reproducibility
&lt;/h4&gt;

&lt;p&gt;To ensure that your content can be reproduced easily, it may additionally be helpful to explicitly specify the Julia version you used (jupyter: julia-1.8) and to set up a global or local Julia environment. Inserting the following at the beginning of your Quarto notebook&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Pkg&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;Pkg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;path&amp;gt;"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ensures that the desired environment that lives in &amp;lt;path&amp;gt; is actually activated and used.&lt;/p&gt;
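&lt;p&gt;Putting both reproducibility measures together, the notebook's YAML front matter might then look something like this (the title is just a placeholder):&lt;/p&gt;

```yaml
---
title: "My reproducible notebook"  # placeholder title
jupyter: julia-1.8                 # pin the Julia version used for execution
---
```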

&lt;h3&gt;
  
  
  Package Documentation
&lt;/h3&gt;

&lt;p&gt;I have also continued to use Quarto in combination with Documenter.jl to document my Julia packages. This essentially boils down to writing up documentation using interactive .qmd notebooks and then rendering those to .md files as inputs for Documenter.jl. There are a few good reasons for this approach, especially if you're used to working with Quarto anyway:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Re-rendering any docs with eval: true provides an additional layer of quality assurance: if any of the code chunks throws an error, you know that your documentation is outdated (perhaps due to an API change). It also offers a straightforward way to test package functions that produce non-testable (e.g. stochastic) output. In such cases, the use of jldoctest is not always straightforward (see &lt;a href="https://github.com/JuliaDocs/Documenter.jl/issues/452"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;You get some stuff for free, e.g. citation management. Unfortunately, as far as I’m aware there is still no support for cross-referencing.&lt;/li&gt;
&lt;li&gt;You can use Quarto execution options like execute-dir: project and resources: www/ to globally specify the working directory and a directory for external resources like images.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are also a few peculiarities to be aware of. To avoid any issues with Documenter.jl, I've found it useful to ensure that the rendered .md files do not contain any raw HTML and to preserve text wrapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="na"&gt;commonmark&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;variant&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-raw_html&lt;/span&gt;
    &lt;span class="na"&gt;wrap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;preserve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When working with .qmd files you also need to use a slightly different syntax for &lt;a href="https://documenter.juliadocs.org/stable/showcase/#Admonitions"&gt;admonitions&lt;/a&gt;. The following syntax inside the .qmd&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| !!! note \"An optional title\"
|     Here is something that you should pay attention to.   
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will generate the desired output inside the rendered .md:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!!! note "An optional title"
    Here is something that you should pay attention to.   
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any of my package repos — &lt;a href="https://github.com/pat-alt/CounterfactualExplanations.jl"&gt;CounterfactualExplanations.jl&lt;/a&gt;, &lt;a href="https://github.com/pat-alt/LaplaceRedux.jl"&gt;LaplaceRedux.jl&lt;/a&gt;, &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;ConformalPrediction.jl&lt;/a&gt; — should provide additional colour on this topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quarto for Academic Journal Articles
&lt;/h3&gt;

&lt;p&gt;Quarto supports LaTeX templates/classes, which has helped me with paper submissions in the past (e.g. my pending JuliaCon Proceedings submissions). I’ve found that &lt;a href="https://pkgs.rstudio.com/rticles/articles/rticles.html"&gt;rticles&lt;/a&gt; still has an edge here, but the &lt;a href="https://quarto.org/docs/extensions/listing-journals.html"&gt;list&lt;/a&gt; of out-of-the-box templates for journal articles is growing. Should I find some time in the future, I will try to add a template for JuliaCon Proceedings. The beauty of this is that it should enable publishers not only to use traditional forms of publication (PDF), but also to include more dynamic formats with ease (think &lt;a href="https://distill.pub/"&gt;distill&lt;/a&gt;, but more than that).&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrapping up
&lt;/h3&gt;

&lt;p&gt;This short post has provided a bit of an update on using Quarto with Julia. From my own experience so far, things have been getting easier and better (thanks to the amazing work of the Quarto dev team). I’m excited to see things improve even further and still think that Quarto is a revolutionary new tool for scientific publishing. Let’s hope publishers eventually recognise this value 👀.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.paltmeyer.com/blog/posts/tips-and-tricks-for-using-quarto-with-julia/"&gt;&lt;em&gt;https://www.paltmeyer.com&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on November 21, 2022.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>quarto</category>
      <category>julia</category>
    </item>
    <item>
      <title>Conformal Prediction in Julia</title>
      <dc:creator>Patrick Altmeyer</dc:creator>
      <pubDate>Tue, 25 Oct 2022 00:00:39 +0000</pubDate>
      <link>https://forem.julialang.org/patalt/conformal-prediction-in-julia-h9n</link>
      <guid>https://forem.julialang.org/patalt/conformal-prediction-in-julia-h9n</guid>
      <description>&lt;h3&gt;
  
  
  Conformal Prediction in Julia
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Part 1 — Introduction
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/z-hxDgKjet59HJoatrtgho84TygTCFLYAVv4661G2Dc/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAyNC8xKnNLMWRm/N1I5Rlp4MkhIUzVi/SVlEaGcuZ2lm" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/z-hxDgKjet59HJoatrtgho84TygTCFLYAVv4661G2Dc/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAyNC8xKnNLMWRm/N1I5Rlp4MkhIUzVi/SVlEaGcuZ2lm" alt="" width="880" height="292"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Prediction sets for two different samples and changing coverage rates. As coverage grows, so does the size of the prediction sets. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A first crucial step towards building trustworthy AI systems is to be transparent about &lt;strong&gt;predictive uncertainty&lt;/strong&gt;. Model parameters are random variables and their values are estimated from noisy data. That inherent stochasticity feeds through to model predictions and should be addressed, at the very least in order to avoid overconfidence in models.&lt;/p&gt;

&lt;p&gt;Beyond that obvious concern, it turns out that quantifying model uncertainty actually opens up a myriad of possibilities to improve up- and down-stream modelling tasks like active learning and robustness. In Bayesian Active Learning, for example, uncertainty estimates are used to guide the search for new input samples, which can make ground-truthing tasks more efficient (Houlsby et al. 2011). With respect to model performance in downstream tasks, uncertainty quantification can be used to improve model calibration and robustness (Lakshminarayanan, Pritzel, and Blundell 2016).&lt;/p&gt;

&lt;p&gt;In previous posts we have looked at how uncertainty can be quantified in the &lt;strong&gt;Bayesian&lt;/strong&gt; context (see &lt;a href="https://forem.julialang.org/patalt/bayesian-logistic-regression-2i31-temp-slug-4724412"&gt;here&lt;/a&gt; and &lt;a href="https://forem.julialang.org/patalt/go-deep-but-also-go-bayesian-2pma-temp-slug-9208263"&gt;here&lt;/a&gt;). Since in Bayesian modelling we are generally concerned with estimating posterior distributions, we get uncertainty estimates almost as a byproduct. This is great for all intents and purposes, but it hinges on assumptions about prior distributions. Personally, I have no quarrel with the idea of making prior distributional assumptions. On the contrary, I think the Bayesian framework formalises the idea of integrating prior information in models and therefore provides a powerful toolkit for conducting science. Still, in some cases this requirement may be seen as too restrictive or we may simply lack prior information.&lt;/p&gt;

&lt;p&gt;Enter: &lt;strong&gt;Conformal Prediction&lt;/strong&gt; (CP) — a scalable &lt;strong&gt;frequentist approach&lt;/strong&gt; to uncertainty quantification and coverage control. In this post we will go through the basic concepts underlying CP. A number of hands-on usage examples in Julia should hopefully help to convey some intuition and ideally attract people interested in contributing to a new and exciting open-source development.&lt;/p&gt;
&lt;h3&gt;
  
  
  📖 Background
&lt;/h3&gt;

&lt;p&gt;Conformal Prediction promises to be an easy-to-understand, distribution-free and model-agnostic way to generate statistically rigorous uncertainty estimates. That’s quite a mouthful, so let’s break it down: firstly, as I will hopefully manage to illustrate in this post, the underlying concepts truly are fairly straightforward to understand; secondly, CP indeed relies on only minimal distributional assumptions; thirdly, common procedures to generate conformal predictions really do apply almost universally to all supervised models, therefore making the framework very intriguing to the ML community; and, finally, CP does in fact come with a frequentist coverage guarantee that ensures that conformal prediction sets contain the true value with a user-chosen probability. For a formal proof of this &lt;em&gt;marginal coverage&lt;/em&gt; property and a detailed introduction to the topic, I recommend the tutorial by Angelopoulos and Bates (2021).&lt;/p&gt;
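&lt;p&gt;Written compactly, the marginal coverage property mentioned above states that for a user-chosen error rate α and exchangeable data, the conformal prediction set C satisfies&lt;/p&gt;

```latex
P\left(Y_{\text{test}} \in C(X_{\text{test}})\right) \geq 1 - \alpha
```

&lt;p&gt;where the probability is marginal, i.e. taken over draws of both the calibration and the test data.&lt;/p&gt;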

&lt;blockquote&gt;
&lt;p&gt;In what follows we will loosely treat the tutorial by Angelopoulos and Bates (2021) and the general framework it sets as a reference. You are not expected to have read the paper, but I also won’t reiterate any details here.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;CP can be used to generate prediction intervals for regression models and prediction sets for classification models (more on this later). There is also some recent work on conformal predictive distributions and probabilistic predictions. Interestingly, it can even be used to complement Bayesian methods. Angelopoulos and Bates (2021), for example, point out that prior information should be incorporated into prediction sets and demonstrate how Bayesian predictive distributions can be conformalised in order to comply with the frequentist notion of coverage. Relatedly, Hoff (2021) proposes a Bayes-optimal prediction procedure. And finally, Stanton, Maddox, and Wilson (2022) very recently proposed a way to introduce conformal prediction in Bayesian Optimisation. I find this type of work that combines different schools of thought very promising, but I’m drifting off a little … So, without further ado, let us look at some code.&lt;/p&gt;
&lt;h3&gt;
  
  
  📦 Conformal Prediction in Julia
&lt;/h3&gt;

&lt;p&gt;In this section of this first short post on CP we will look at how conformal prediction can be implemented in Julia. In particular, we will look at an approach that is compatible with any of the many supervised machine learning models available in &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/dev/"&gt;MLJ&lt;/a&gt;: a beautiful, comprehensive machine learning framework funded by the &lt;a href="https://www.turing.ac.uk/"&gt;Alan Turing Institute&lt;/a&gt; and the &lt;a href="https://www.mbie.govt.nz/science-and-technology/science-and-innovation/funding-information-and-opportunities/investment-funds/strategic-science-investment-fund/ssif-funded-programmes/university-of-auckland/"&gt;New Zealand Strategic Science Investment Fund&lt;/a&gt;. We will go through some basic usage examples employing a new Julia package that I have been working on: &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;ConformalPrediction.jl&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ConformalPrediction.jl is a package for uncertainty quantification through conformal prediction for machine learning models trained in &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/dev/"&gt;MLJ&lt;/a&gt;. At the time of writing it is still in its early stages of development, but already implements a range of different approaches to CP. Contributions are very much welcome:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.paltmeyer.com/ConformalPrediction.jl/stable/"&gt;Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.paltmeyer.com/ConformalPrediction.jl/stable/#Contribute"&gt;Contributor’s Guide&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Split Conformal Classification
&lt;/h3&gt;

&lt;p&gt;We consider a simple binary classification problem. Let (Xᵢ, Yᵢ), i=1,…,n denote our feature-label pairs and let μ: 𝒳 ↦ 𝒴 denote the mapping from features to labels. For illustration purposes we will use the moons dataset 🌙. Using &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/v0.18/"&gt;MLJ.jl&lt;/a&gt; we first generate the data and split it into a training and test set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;MLJ&lt;/span&gt; 
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;  

&lt;span class="c"&gt;# Data:&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_moons&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;noise&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eachindex&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we will use a specific case of CP called &lt;em&gt;split conformal prediction&lt;/em&gt; which can then be summarised as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Partition the training data into a proper training set and a separate calibration set: 𝒟ₙ=𝒟[train] ∪ 𝒟[cali].&lt;/li&gt;
&lt;li&gt;Train the machine learning model on the proper training set: μ(Xᵢ, Yᵢ) for i ∈ 𝒟[train].&lt;/li&gt;
&lt;li&gt;Compute nonconformity scores, 𝒮, using the calibration data 𝒟[cali] and the fitted model μ(Xᵢ, Yᵢ) for i ∈ 𝒟[train].&lt;/li&gt;
&lt;li&gt;For a user-specified desired coverage ratio (1-α) compute the corresponding quantile, q̂, of the empirical distribution of nonconformity scores, 𝒮.&lt;/li&gt;
&lt;li&gt;For the given quantile and test sample X[test], form the corresponding conformal prediction set: C(X[test])={y: s(X[test], y) ≤ q̂}&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the default procedure used for classification and regression in &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;ConformalPrediction.jl&lt;/a&gt;.&lt;/p&gt;
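&lt;p&gt;To make steps 3–5 above concrete, here is a minimal, self-contained sketch of split conformal classification. This is not the package's actual implementation; p̂ is a hypothetical function standing in for the fitted model from step 2, mapping a sample to a vector of predicted class probabilities:&lt;/p&gt;

```julia
using Statistics

# Minimal sketch of split conformal classification (steps 3–5 above).
# `p̂` is a hypothetical fitted classifier returning a probability vector
# over the possible labels for a given sample.
function conformal_set(p̂, X_calib, y_calib, x_test; α=0.1)
    # Step 3: nonconformity scores on the calibration set
    S = [1 - p̂(x)[y] for (x, y) in zip(X_calib, y_calib)]
    # Step 4: empirical quantile with finite-sample correction
    n = length(S)
    level = ceil((n + 1) * (1 - α)) / n
    q̂ = level > 1 ? Inf : quantile(S, level)
    # Step 5: keep every label whose score does not exceed the quantile
    return [y for y in eachindex(p̂(x_test)) if 1 - p̂(x_test)[y] ≤ q̂]
end
```

&lt;p&gt;Note the finite-sample correction in step 4: when ceil((n+1)(1-α))/n exceeds one, the calibration set is too small for the requested coverage, and the sketch falls back to including every label.&lt;/p&gt;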

&lt;p&gt;You may want to take a look at the source code for the classification case &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl/blob/67712e870dc3a438bf0846d376fa48480612f042/src/ConformalModels/inductive_classification.jl#L1"&gt;here&lt;/a&gt;. As a &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl/blob/67712e870dc3a438bf0846d376fa48480612f042/src/ConformalModels/inductive_classification.jl#L3"&gt;first&lt;/a&gt; important step, we begin by defining a concrete type SimpleInductiveClassifier that wraps a supervised model from &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/v0.18/"&gt;MLJ.jl&lt;/a&gt; and reserves additional fields for a few hyperparameters. As a &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl/blob/67712e870dc3a438bf0846d376fa48480612f042/src/ConformalModels/inductive_classification.jl#L26"&gt;second&lt;/a&gt; step, we define the training procedure, which includes the data-splitting and calibration step. Finally, as a &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl/blob/67712e870dc3a438bf0846d376fa48480612f042/src/ConformalModels/inductive_classification.jl#L56"&gt;third&lt;/a&gt; step we implement the procedure in the equation above (step 5) to compute the conformal prediction set.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The permalinks above take you to the version of the package that was up-to-date at the time of writing. Since the package is in its early stages of development, the code base and API can be expected to change.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now let’s take this to our 🌙 data. To illustrate the package functionality we will demonstrate the envisioned workflow. We first define our atomic machine learning model following standard &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/v0.18/"&gt;MLJ.jl&lt;/a&gt; conventions. Using &lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;ConformalPrediction.jl&lt;/a&gt; we then wrap our atomic model in a conformal model using the standard API call conformal_model(model::Supervised; kwargs...). To train and predict from our conformal model we can then rely on the conventional &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/v0.18/"&gt;MLJ.jl&lt;/a&gt; procedure again. In particular, we wrap our conformal model in data (turning it into a machine) and then fit it on the training set. Finally, we use our machine to predict the label for a new test sample Xtest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# Model:&lt;/span&gt;
&lt;span class="n"&gt;KNNClassifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@load&lt;/span&gt; &lt;span class="n"&gt;KNNClassifier&lt;/span&gt; &lt;span class="n"&gt;pkg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NearestNeighborModels&lt;/span&gt; 
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KNNClassifier&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;   

&lt;span class="c"&gt;# Training:&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;ConformalPrediction&lt;/span&gt; 
&lt;span class="n"&gt;conf_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conformal_model&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="o"&gt;=.&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;mach&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf_model&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;  

&lt;span class="c"&gt;# Conformal Prediction:&lt;/span&gt;
&lt;span class="n"&gt;Xtest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;selectrows&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt; 
&lt;span class="n"&gt;ytest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt; 
&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Xtest&lt;/span&gt;&lt;span class="x"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;UnivariateFinite&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Multiclass&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;}}&lt;/span&gt;     
     &lt;span class="n"&gt;┌&lt;/span&gt; &lt;span class="n"&gt;┐&lt;/span&gt; 
   &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/span&gt; &lt;span class="mf"&gt;0.94&lt;/span&gt;   
     &lt;span class="n"&gt;└&lt;/span&gt; &lt;span class="n"&gt;┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final predictions are set-valued. While the softmax output remains unchanged for the SimpleInductiveClassifier, the size of the prediction set depends on the chosen coverage rate, (1-α).&lt;/p&gt;

&lt;p&gt;When specifying a coverage rate very close to one, the prediction set will typically include many (in some cases all) of the possible labels. Below, for example, both classes are included in the prediction set when setting the coverage rate equal to (1-α)=1.0. This is intuitive, since high coverage quite literally requires that the true label is covered by the prediction set with high probability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;conf_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conformal_model&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;mach&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf_model&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;  

&lt;span class="c"&gt;# Conformal Prediction:&lt;/span&gt;
&lt;span class="n"&gt;Xtest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt; 
&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Xtest&lt;/span&gt;&lt;span class="x"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;UnivariateFinite&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Multiclass&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;}}&lt;/span&gt;    
     &lt;span class="n"&gt;┌&lt;/span&gt; &lt;span class="n"&gt;┐&lt;/span&gt; 
   &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;   
   &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;   
     &lt;span class="n"&gt;└&lt;/span&gt; &lt;span class="n"&gt;┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conversely, for low coverage rates, prediction sets can also be empty. For a choice of (1-α)=0.1, for example, the prediction set for our test sample is empty. This is a bit difficult to think about intuitively and I have not yet come across a satisfactory, intuitive interpretation (should you have one, please share!). When the prediction set is empty, the predict call currently returns missing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;conf_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conformal_model&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;mach&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf_model&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;  

&lt;span class="c"&gt;# Conformal Prediction: &lt;/span&gt;
&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Xtest&lt;/span&gt;&lt;span class="x"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;missing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
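&lt;p&gt;To see the effect of the coverage rate on this behaviour at a glance, we can sweep over a range of coverage rates and inspect the resulting prediction set for the same test sample. The following is just a sketch that reuses the model, X, y, train and Xtest variables from above; the exact output will depend on the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;for coverage in [0.5, 0.75, 0.9, 0.95, 0.99]
    conf_model = conformal_model(model; coverage=coverage)
    mach = machine(conf_model, X, y)
    fit!(mach, rows=train)
    # empty prediction sets are returned as `missing`
    ŷ = predict(mach, Xtest)[1]
    println("(1-α) = $coverage: ", ismissing(ŷ) ? "∅ (empty set)" : ŷ)
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;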



&lt;p&gt;Figure 2 should provide some more intuition as to what exactly is happening here. It illustrates the effect of the chosen coverage rate on the predicted softmax output and the set size in the two-dimensional feature space. Contours are overlaid with the moon data points (including test data). The two samples highlighted in red, X₁ and X₂, have been manually added for illustration purposes. Let’s look at these one by one.&lt;/p&gt;

&lt;p&gt;Firstly, note that X₁ (red cross) falls into a region of the domain that is characterized by high predictive uncertainty. It sits right at the bottom-right corner of our class-zero moon 🌜 (orange), a region that is almost entirely enveloped by our class-one moon 🌛 (green). For low coverage rates the prediction set for X₁ is empty: on the left-hand side this is indicated by the missing contour for the softmax probability; on the right-hand side we can observe that the corresponding set size is indeed zero. For high coverage rates the prediction set includes both y=0 and y=1, indicative of the fact that the conformal classifier is uncertain about the true label.&lt;/p&gt;

&lt;p&gt;With respect to X₂, we observe that while also sitting on the fringe of our class-zero moon, this sample populates a region that is not fully enveloped by data points from the opposite class. In this region, the underlying atomic classifier can be expected to be more certain about its predictions, but still not highly confident. How is this reflected by our corresponding conformal prediction sets?&lt;/p&gt;

&lt;p&gt;Well, for low coverage rates (roughly &amp;lt;0.9) the conformal prediction set does not include y=0: the set size is zero (right panel). Only for higher coverage rates do we have C(X₂)={0}: the coverage rate is high enough to include y=0, but the corresponding softmax probability is still fairly low. For example, for (1-α)=0.9 we have p̂(y=0|X₂)=0.72.&lt;/p&gt;

&lt;p&gt;These two examples illustrate an interesting point: for regions characterized by high predictive uncertainty, conformal prediction sets are typically empty (for low coverage) or large (for high coverage). While set-valued predictions may take some getting used to, this notion is overall intuitive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/YThwJswcaYy6VpSNOtw5wRAzLr2s7prcqB7c2vTtGBQ/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NzY4LzEqWG1ocWlI/YnJpQUFkR0Z1TXJ2/WUtmdy5naWY" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/YThwJswcaYy6VpSNOtw5wRAzLr2s7prcqB7c2vTtGBQ/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/NzY4LzEqWG1ocWlI/YnJpQUFkR0Z1TXJ2/WUtmdy5naWY" alt="" width="768" height="288"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: The effect of the coverage rate on the conformal prediction set. Softmax probabilities are shown on the left. The size of the prediction set is shown on the right. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🏁 Conclusion
&lt;/h3&gt;

&lt;p&gt;This has really been a whistle-stop tour of Conformal Prediction: an active area of research that probably deserves much more attention. Hopefully, though, this post has helped to provide some color and, if anything, made you more curious about the topic. Let’s recap the most important points from above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Conformal Prediction is an interesting frequentist approach to uncertainty quantification that can even be combined with Bayesian methods.&lt;/li&gt;
&lt;li&gt;It is scalable and model-agnostic and therefore well suited to machine learning applications.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pat-alt/ConformalPrediction.jl"&gt;ConformalPrediction.jl&lt;/a&gt; implements CP in pure Julia and can be used with any supervised model available from &lt;a href="https://alan-turing-institute.github.io/MLJ.jl/v0.18/"&gt;MLJ.jl&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Implementing CP directly on top of an existing, powerful machine learning toolkit demonstrates the potential usefulness of this framework to the ML community.&lt;/li&gt;
&lt;li&gt;Standard conformal classifiers produce set-valued predictions: for ambiguous samples these sets are typically large (for high coverage) or empty (for low coverage).&lt;/li&gt;
&lt;/ol&gt;
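&lt;p&gt;To underline point 3, here is the full classification pipeline in one place. This is a sketch assuming MLJ.jl and DecisionTree.jl are installed; any other MLJ-compatible supervised classifier can be swapped in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using MLJ, ConformalPrediction

# Data and atomic model:
X, y = make_moons(500; noise=0.15)
train, test = partition(eachindex(y), 0.8, shuffle=true)
Tree = @load DecisionTreeClassifier pkg=DecisionTree
model = Tree()

# Conformalize, fit and predict set-valued labels:
conf_model = conformal_model(model; coverage=0.95)
mach = machine(conf_model, X, y)
fit!(mach, rows=train)
predict(mach, selectrows(X, test))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;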

&lt;p&gt;Below I will leave you with some further resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  📚 Further Resources
&lt;/h3&gt;

&lt;p&gt;Chances are that you have already come across the Awesome Conformal Prediction &lt;a href="https://github.com/valeman/awesome-conformal-prediction"&gt;repo&lt;/a&gt;: Manokhin (n.d.) provides a comprehensive, up-to-date overview of resources related to conformal prediction. Among the listed articles you will also find Angelopoulos and Bates (2021), which inspired much of this post. The repo also points to open-source implementations in other popular programming languages including Python and R.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;p&gt;Angelopoulos, Anastasios N., and Stephen Bates. 2021. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” &lt;a href="https://arxiv.org/abs/2107.07511"&gt;https://arxiv.org/abs/2107.07511&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Hoff, Peter. 2021. “Bayes-Optimal Prediction with Frequentist Coverage Control.” &lt;a href="https://arxiv.org/abs/2105.14045"&gt;https://arxiv.org/abs/2105.14045&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Houlsby, Neil, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. 2011. “Bayesian Active Learning for Classification and Preference Learning.” &lt;a href="https://arxiv.org/abs/1112.5745"&gt;https://arxiv.org/abs/1112.5745&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2016. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” &lt;a href="https://arxiv.org/abs/1612.01474"&gt;https://arxiv.org/abs/1612.01474&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Manokhin, Valery. n.d. “Awesome Conformal Prediction.”&lt;/p&gt;

&lt;p&gt;Stanton, Samuel, Wesley Maddox, and Andrew Gordon Wilson. 2022. “Bayesian Optimization with Conformal Coverage Guarantees.” &lt;a href="https://arxiv.org/abs/2210.12496"&gt;https://arxiv.org/abs/2210.12496&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For attribution, please cite this work as:&lt;/p&gt;

&lt;p&gt;Patrick Altmeyer. 2022. “Conformal Prediction in Julia 🟣🔴🟢.” October 25, 2022.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.paltmeyer.com/blog/posts/conformal-prediction/"&gt;&lt;em&gt;https://www.paltmeyer.com&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on October 25, 2022.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>conformalprediction</category>
      <category>julia</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A new tool for explainable AI</title>
      <dc:creator>Patrick Altmeyer</dc:creator>
      <pubDate>Wed, 20 Apr 2022 00:00:00 +0000</pubDate>
      <link>https://forem.julialang.org/patalt/a-new-tool-for-explainable-ai-2289</link>
      <guid>https://forem.julialang.org/patalt/a-new-tool-for-explainable-ai-2289</guid>
      <description>&lt;p&gt;&lt;a href="https://forem.julialang.org/images/FM5mg-ghtnUNWV1jG134uxnkUwVxplHvR4KR4f9I_-g/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MzAwLzAqLXozQnpQ/eEFYcU1hWGtwcS5n/aWY" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/FM5mg-ghtnUNWV1jG134uxnkUwVxplHvR4KR4f9I_-g/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MzAwLzAqLXozQnpQ/eEFYcU1hWGtwcS5n/aWY" alt="Turning a 9 (nine) into a 4 (four). Image by author." width="300" height="300"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Turning a 9 (nine) into a 4 (four). Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Counterfactual explanations, which I introduced in one of my previous &lt;a href="https://forem.julialang.org/patalt/individual-recourse-for-black-box-models-1cak-temp-slug-6649736"&gt;posts&lt;/a&gt;, offer a simple and intuitive way to explain black-box models without opening them. Still, as of today there exists only one open-source library that provides a unifying approach to generate and benchmark counterfactual explanations for models built and trained in Python (Pawelczyk et al. 2021). This is great, but of limited use to users of other programming languages 🥲.&lt;/p&gt;

&lt;p&gt;Enter &lt;a href="https://www.paltmeyer.com/CounterfactualExplanations.jl/stable/"&gt;CounterfactualExplanations.jl&lt;/a&gt;: a Julia package that can be used to explain machine learning algorithms developed and trained in Julia, Python and R. Counterfactual explanations fall into the broader category of explainable artificial intelligence (XAI).&lt;/p&gt;

&lt;p&gt;Explainable AI typically involves models that are not inherently interpretable but require additional tools to be explainable to humans. Examples of the latter include ensembles, support vector machines and deep neural networks. This is not to be confused with interpretable AI, which involves models that are inherently interpretable and transparent such as generalized additive models (GAMs), decision trees and rule-based models.&lt;/p&gt;

&lt;p&gt;Some would argue that we best avoid explaining black-box models altogether (Rudin 2019) and instead focus solely on interpretable AI. While I agree that initial efforts should always be geared towards interpretable models, stopping there would mean missed opportunities and is probably not realistic anyway in times of &lt;a href="https://openai.com/blog/dall-e/"&gt;DALL-E&lt;/a&gt; and Co.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Even though […] interpretability is of great importance and should be pursued, explanations can, in principle, be offered without opening the “black box.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wachter, Mittelstadt, and Russell (&lt;a href="https://www.paltmeyer.com/blog/posts/a-new-tool-for-explainable-ai/#ref-wachter2017counterfactual"&gt;2017&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This post introduces the main functionality of the new Julia package. Following a motivating example using a model trained in Julia, we will see how easily the package can be adapted to work with models trained in Python and R. Since the motivation for this post is also to hopefully attract contributors, the final section outlines some of the exciting developments we have planned.&lt;/p&gt;
&lt;h3&gt;
  
  
  Counterfactuals for image data 🖼
&lt;/h3&gt;

&lt;p&gt;To introduce counterfactual explanations I used a simple binary classification problem in my previous &lt;a href="https://forem.julialang.org/patalt/individual-recourse-for-black-box-models-1cak-temp-slug-6649736"&gt;post&lt;/a&gt;. It involved a linear classifier and a linearly separable, synthetic data set with just two features. This time we are going to step it up a notch: we will generate counterfactual explanations for MNIST data. The MNIST dataset contains 60,000 training samples of handwritten digits in the form of 28x28 pixel grey-scale images (LeCun 1998). Each image is associated with a label indicating the digit (0–9) that the image represents.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.paltmeyer.com/CounterfactualExplanations.jl/stable/"&gt;CounterfactualExplanations.jl&lt;/a&gt; package ships with two black-box models that were trained to predict labels for this data: firstly, a simple multi-layer perceptron (MLP) and, secondly, a corresponding deep ensemble. Originally proposed by Lakshminarayanan, Pritzel, and Blundell (2016), deep ensembles are really just ensembles of deep neural networks. They are still among the most popular approaches to Bayesian deep learning. For more information on Bayesian deep learning see my previous post: [&lt;a href="https://forem.julialang.org/patalt/go-deep-but-also-go-bayesian-2pma-temp-slug-9208263"&gt;TDS&lt;/a&gt;], [&lt;a href="https://www.paltmeyer.com/blog/posts/effortsless-bayesian-dl/"&gt;blog&lt;/a&gt;].&lt;/p&gt;
&lt;h3&gt;
  
  
  Black-box models
&lt;/h3&gt;

&lt;p&gt;While the package can currently handle a few simple classification models natively, it is designed to be easily extensible by users and contributors. Extending the package to deal with custom models typically involves only two simple steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Subtyping&lt;/strong&gt; : the custom model needs to be declared as a subtype of the package-internal type AbstractFittedModel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple dispatch&lt;/strong&gt; : the package-internal functions logits and probs need to be extended through custom methods for the new model type.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The code that implements these two steps can be found in the corresponding &lt;a href="https://www.paltmeyer.com/blog/posts/a-new-tool-for-explainable-ai/#black-box-models"&gt;post&lt;/a&gt; on my own blog.&lt;/p&gt;
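&lt;p&gt;In abstract form, the two steps look as follows. This is a sketch for a hypothetical Flux model: the type name MyFluxModel is illustrative, while AbstractFittedModel, logits and probs are the package internals mentioned above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using Flux
import CounterfactualExplanations.Models: AbstractFittedModel, logits, probs

# Step 1) declare the custom model as a subtype:
struct MyFluxModel &lt;: AbstractFittedModel
    nn::Any
end

# Step 2) extend the package-internal functions for the new type:
logits(M::MyFluxModel, X::AbstractArray) = M.nn(X)
probs(M::MyFluxModel, X::AbstractArray) = σ.(logits(M, X))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;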
&lt;h3&gt;
  
  
  Counterfactual generators
&lt;/h3&gt;

&lt;p&gt;Next, we need to specify the counterfactual generators we want to use. The package currently ships with two default generators that both need gradient access: firstly, the generic generator introduced by Wachter, Mittelstadt, and Russell (2017) and, secondly, a greedy generator introduced by Schut et al. (2021).&lt;/p&gt;

&lt;p&gt;The greedy generator is designed to be used with models that incorporate uncertainty in their predictions such as the deep ensemble introduced above. It works for probabilistic (Bayesian) models, because they only produce high-confidence predictions in regions of the feature domain that are populated by training samples. As long as the model is expressive enough and well-specified, counterfactuals in these regions will always be realistic and unambiguous since by construction they should look very similar to training samples. Other popular approaches to counterfactual explanations like REVISE (Joshi et al. 2019) and CLUE (Antorán et al. 2020) also build on this simple idea.&lt;/p&gt;

&lt;p&gt;The following two lines of code instantiate the two generators for the problem at hand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;generic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GenericGenerator&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;logitcrossentropy&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;greedy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GreedyGenerator&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;logitcrossentropy&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explanations
&lt;/h3&gt;

&lt;p&gt;Once the model and counterfactual generator are specified, running a counterfactual search is very easy using the package. For a given factual (x), target class (target) and data set (counterfactual_data), simply running&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;generate_counterfactual&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;counterfactual_data&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generic&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will generate the results, in this case using the generic generator (generic) for the MLP (M). Since we have specified two different black-box models and two different counterfactual generators, we have four combinations of a model and a generator in total. For each of these combinations I have used the generate_counterfactual function to produce the results in Figure 1.&lt;/p&gt;
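&lt;p&gt;The four combinations can be produced in a simple double loop. In this sketch, mlp and ensemble stand in for the two pre-trained black-box models shipped with the package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;counterfactuals = Dict()
for (model_name, M) in [("MLP", mlp), ("Deep ensemble", ensemble)]
    for (gen_name, gen) in [("Generic", generic), ("Greedy", greedy)]
        counterfactuals[(model_name, gen_name)] =
            generate_counterfactual(x, target, counterfactual_data, M, gen)
    end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;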

&lt;p&gt;In every case the desired label switch is in fact achieved, but arguably from a human perspective only the counterfactuals for the deep ensemble look like a four. The generic generator produces mild perturbations in regions that seem irrelevant from a human perspective, but nonetheless yields a counterfactual that can pass as a four. The greedy approach clearly targets pixels at the top of the handwritten nine and yields the best result overall. For the non-Bayesian MLP, both the generic and the greedy approach generate counterfactuals that look much like adversarial examples: they perturb pixels in seemingly random regions on the image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/U5xFydtwhvszSfTNSvWEK9ZzDeeDCGrIhPko_bCXlvM/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKnBqcG94/OHpldGxNTXRqSTEu/cG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/U5xFydtwhvszSfTNSvWEK9ZzDeeDCGrIhPko_bCXlvM/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKnBqcG94/OHpldGxNTXRqSTEu/cG5n" alt="Counterfactual explanations for MNIST: turning a nine (9) into a four (4). Image by author." width="880" height="220"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Counterfactual explanations for MNIST: turning a nine (9) into a four (4). Image by author.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Language interoperability 👥
&lt;/h3&gt;

&lt;p&gt;The Julia language offers unique support for programming language interoperability. For example, calling R or Python is made remarkably easy through RCall.jl and PyCall.jl, respectively. This functionality can be leveraged to use CounterfactualExplanations.jl to generate explanations for models that were developed in other programming languages. At this time there is no native support for foreign programming languages, but the following example involving a torch neural network trained in R demonstrates how versatile the package is. The corresponding example involving PyTorch is analogous and therefore omitted, but available &lt;a href="https://www.paltmeyer.com/CounterfactualExplanations.jl/dev/tutorials/interop/"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Explaining a model trained in R
&lt;/h3&gt;

&lt;p&gt;We will consider a simple MLP trained for a binary classification task. As before, we first need to adapt this custom model for use with our package. The code below implements the two necessary steps: sub-typing and method extension. Logits are returned by the torch model and copied from the R environment into the Julia scope. Probabilities are then computed inside the Julia scope by passing the logits through the sigmoid function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;CounterfactualExplanations&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CounterfactualExplanations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Models&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CounterfactualExplanations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Models&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="c"&gt;# import functions in order to extend&lt;/span&gt;

&lt;span class="c"&gt;# Step 1)&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="nc"&gt; TorchNetwork&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;:&lt;/span&gt; &lt;span class="n"&gt;Models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AbstractFittedModel&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;Any&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c"&gt;# Step 2)&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; logits&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TorchNetwork&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;AbstractArray&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;nn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;
  &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rcopy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="s"&gt;"as_array(&lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;nn(torch_tensor(t(&lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;X))))"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;isa&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;AbstractArray&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; probs&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TorchNetwork&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;AbstractArray&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;σ&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TorchNetwork&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="s"&gt;"model"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compared to models trained in Julia, we need to do a little more work at this point. Since our counterfactual generators need gradient access, we essentially need to allow our package to communicate with the R torch library. While this may sound daunting, it turns out to be quite manageable: all we have to do is respecify the function that computes the gradient with respect to the counterfactual loss function so that it can deal with the TorchNetwork type we defined above. The code below implements this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CounterfactualExplanations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Generators&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;∂ℓ&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;LinearAlgebra&lt;/span&gt;

&lt;span class="c"&gt;# Countefactual loss:&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; ∂ℓ&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;AbstractGradientBasedGenerator&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;counterfactual_state&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CounterfactualState&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
  &lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counterfactual_state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;M&lt;/span&gt;
  &lt;span class="n"&gt;nn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;
  &lt;span class="n"&gt;x′&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counterfactual_state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x′&lt;/span&gt;
  &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counterfactual_state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_encoded&lt;/span&gt;
  &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="s"&gt;"""
  x &amp;lt;- torch_tensor(&lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;x′, requires_grad=TRUE)
  output &amp;lt;- &lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;nn(x)
  loss_fun &amp;lt;- nnf_binary_cross_entropy_with_logits
  obj_loss &amp;lt;- loss_fun(output,&lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;t)
  obj_loss&lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;backward()
  """&lt;/span&gt;
  &lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rcopy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="s"&gt;"as_array(x&lt;/span&gt;&lt;span class="si"&gt;$&lt;/span&gt;&lt;span class="s"&gt;grad)"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;grad&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is all the adjustment needed to use CounterfactualExplanations.jl for our custom R model. Figure 2 shows a counterfactual path for a randomly chosen sample with respect to the MLP trained in R.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/7FyBjeK6pH8Hre2iGrbS2F54J3XUUk62P6Z64hQiHIc/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/ODAwLzAqWlhCWFVE/OUxxUXRFSkE5ZC5n/aWY" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/7FyBjeK6pH8Hre2iGrbS2F54J3XUUk62P6Z64hQiHIc/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/ODAwLzAqWlhCWFVE/OUxxUXRFSkE5ZC5n/aWY" alt="Counterfactual path using the generic counterfactual generator for a model trained in R. Image by author." width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: Counterfactual path using the generic counterfactual generator for a model trained in R. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  We need you! 🫵
&lt;/h3&gt;

&lt;p&gt;The ambition for CounterfactualExplanations.jl is to provide a go-to place for counterfactual explanations to the Julia community and beyond. This is a grand ambition, especially for a package that has so far been built by a single developer who has little prior experience with Julia. We would therefore very much like to invite community contributions. If you have an interest in trustworthy AI, the open-source community and Julia, please do get involved! This package is still in its early stages of development, so any kind of contribution is welcome: advice on the core package architecture, pull requests, issues, discussions and even just comments below would be much appreciated.&lt;/p&gt;

&lt;p&gt;To give you a flavour of what type of future developments we envision, here is a non-exhaustive list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Native support for additional counterfactual generators and predictive models including those built and trained in Python or R.&lt;/li&gt;
&lt;li&gt;Additional datasets for testing, evaluation and benchmarking.&lt;/li&gt;
&lt;li&gt;Improved preprocessing including native support for categorical features.&lt;/li&gt;
&lt;li&gt;Support for regression models.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, if you like this project but don’t have much time, then simply sharing this article or starring the &lt;a href="https://github.com/pat-alt/CounterfactualExplanations.jl"&gt;repo&lt;/a&gt; on GitHub would also go a long way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading 📚
&lt;/h3&gt;

&lt;p&gt;If you’re interested in learning more about this development, feel free to check out the following resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Package docs: &lt;a href="https://pat-alt.github.io/CounterfactualExplanations.jl/stable"&gt;[stable]&lt;/a&gt;, &lt;a href="https://pat-alt.github.io/CounterfactualExplanations.jl/dev"&gt;[dev]&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.paltmeyer.com/CounterfactualExplanations.jl/stable/contributing/"&gt;Contributor’s guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pat-alt/CounterfactualExplanations.jl"&gt;GitHub repo&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Thanks 💐
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://twitter.com/miouantoinette?lang=en"&gt;Lisa Schut&lt;/a&gt; and &lt;a href="https://oatml.cs.ox.ac.uk/members/oscar_key/"&gt;Oscar Key&lt;/a&gt; — corresponding authors of Schut (2021) — have been tremendously helpful in providing feedback on this post and answering a number of questions I had about their paper. Thank you!&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;p&gt;Antorán, Javier, Umang Bhatt, Tameem Adel, Adrian Weller, and José Miguel Hernández-Lobato. 2020. “Getting a Clue: A Method for Explaining Uncertainty Estimates.” &lt;em&gt;arXiv Preprint arXiv:2006.06848&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Joshi, Shalmali, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. 2019. “Towards Realistic Individual Recourse and Actionable Explanations in Black-Box Decision Making Systems.” &lt;em&gt;arXiv Preprint arXiv:1907.09615&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2016. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” &lt;em&gt;arXiv Preprint arXiv:1612.01474&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;LeCun, Yann. 1998. “The MNIST Database of Handwritten Digits.” &lt;a href="http://yann.lecun.com/exdb/mnist/"&gt;http://yann.lecun.com/exdb/mnist/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Pawelczyk, Martin, Sascha Bielawski, Johannes van den Heuvel, Tobias Richter, and Gjergji Kasneci. 2021. “Carla: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms.” &lt;em&gt;arXiv Preprint arXiv:2108.00783&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Rudin, Cynthia. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” &lt;em&gt;Nature Machine Intelligence&lt;/em&gt; 1 (5): 206–15.&lt;/p&gt;

&lt;p&gt;Schut, Lisa, Oscar Key, Rory Mc Grath, Luca Costabello, Bogdan Sacaleanu, Yarin Gal, et al. 2021. “Generating Interpretable Counterfactual Explanations by Implicit Minimisation of Epistemic and Aleatoric Uncertainties.” In &lt;em&gt;International Conference on Artificial Intelligence and Statistics&lt;/em&gt;, 1756–64. PMLR.&lt;/p&gt;

&lt;p&gt;Wachter, Sandra, Brent Mittelstadt, and Chris Russell. 2017. “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR.” &lt;em&gt;Harv. JL &amp;amp; Tech.&lt;/em&gt; 31: 841.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.paltmeyer.com/blog/posts/a-new-tool-for-explainable-ai/"&gt;&lt;em&gt;https://www.paltmeyer.com&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on April 20, 2022.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>probabilisiticai</category>
      <category>explainableai</category>
      <category>julialang</category>
    </item>
    <item>
      <title>Go deep, but also … go Bayesian!</title>
      <dc:creator>Patrick Altmeyer</dc:creator>
      <pubDate>Fri, 18 Feb 2022 00:00:00 +0000</pubDate>
      <link>https://forem.julialang.org/patalt/go-deep-but-also-go-bayesian-1471</link>
      <guid>https://forem.julialang.org/patalt/go-deep-but-also-go-bayesian-1471</guid>
      <description>&lt;h4&gt;
  
  
  Truly effortless Bayesian Deep Learning in Julia
&lt;/h4&gt;

&lt;p&gt;Deep learning has dominated AI research in recent years — but how much promise does it really hold? That is very much an ongoing and increasingly polarising debate that you can follow live on &lt;a href="https://twitter.com/ilyasut/status/1491554478243258368"&gt;Twitter&lt;/a&gt;. On one side you have optimists like Ilya Sutskever, chief scientist of OpenAI, who believes that large deep neural networks may already be slightly conscious — that’s “may” and “slightly” and only if you just go deep enough? On the other side you have prominent skeptics like Judea Pearl, who has long argued that deep learning still boils down to curve fitting — purely associational and not even remotely intelligent (Pearl and Mackenzie 2018).&lt;/p&gt;

&lt;h3&gt;
  
  
  The case for Bayesian Deep Learning
&lt;/h3&gt;

&lt;p&gt;Whatever side of this entertaining Twitter dispute you find yourself on, the reality is that deep-learning systems have already been deployed at large scale, both in academia and industry. More pressing debates therefore revolve around the trustworthiness of these existing systems: how robust are they, and in what way exactly do they arrive at decisions that affect each and every one of us? Robustifying deep neural networks generally involves some form of adversarial training, which is costly, can hurt generalization (Raghunathan et al. 2019) and ultimately does not guarantee stability (Bastounis, Hansen, and Vlačić 2021). With respect to interpretability, surrogate explainers like LIME and SHAP are among the most popular tools, but they too have been shown to lack robustness (Slack et al. 2020).&lt;/p&gt;

&lt;p&gt;Why exactly are deep neural networks unstable and opaque? The first thing to note is that the number of free parameters is typically huge (if you ask Mr Sutskever, it probably cannot be huge enough!). That alone makes it very hard to monitor and interpret the inner workings of deep-learning algorithms. Perhaps more importantly though, the number of parameters &lt;em&gt;relative&lt;/em&gt; to the size of the data is generally huge:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;[…] deep neural networks are typically very underspecified by the available data, and […] parameters [therefore] correspond to a diverse variety of compelling explanations for the data. (Wilson 2020)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words, training a single deep neural network may (and usually does) lead to one random parameter specification that fits the underlying data very well. But in all likelihood there are many other specifications that also fit the data very well. This is both a strength and vulnerability of deep learning: it is a strength because it typically allows us to find one such “compelling explanation” for the data with ease through stochastic optimization; it is a vulnerability because one has to wonder:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How compelling is an explanation really if it competes with many other equally compelling, but potentially very different explanations?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A scenario like this very much calls for treating predictions from deep learning models probabilistically (Wilson 2020). Formally, we are interested in estimating the posterior predictive distribution as the following Bayesian model average (BMA):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/JQWRNE8YslewTk-ecSw40ntCrM_znFru2_sN8YS8V7w/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKnNjSTNS/UVdTa2p4ZHpjM1cu/cG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/JQWRNE8YslewTk-ecSw40ntCrM_znFru2_sN8YS8V7w/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKnNjSTNS/UVdTa2p4ZHpjM1cu/cG5n" alt="" width="880" height="142"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The integral implies that we essentially need many predictions from many different specifications of the model. Unfortunately, this means more work for us, or rather our computers. Fortunately though, researchers have proposed many ingenious ways to approximate the equation above in recent years: Gal and Ghahramani (2016) propose using dropout at test time, while Lakshminarayanan et al. (2016) show that averaging over an ensemble of just five models seems to do the trick. Still, despite their simplicity and usefulness, these approaches involve additional computational costs compared to training just a single network. As we shall see now though, another promising approach has recently entered the limelight: &lt;strong&gt;Laplace approximation&lt;/strong&gt; (LA).&lt;/p&gt;
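&lt;p&gt;To build intuition for what these approximations buy us, here is a minimal plain-Julia sketch (illustrative only, not package code): given samples from some approximate posterior over the weights (a standard normal stands in for it here), the BMA reduces to a simple average of plugin predictions.&lt;/p&gt;

```julia
# Illustrative sketch (not the BayesLaplace.jl API): approximate the
# Bayesian model average by averaging plugin predictions over sampled
# parameter specifications. A standard normal stands in for the
# approximate posterior over weights.
using Random, Statistics

sigmoid(z) = 1 / (1 + exp(-z))

Random.seed!(2022)
x = [1.0, -0.5]                          # a single toy input
S = 1_000                                # number of posterior samples
weight_samples = [randn(2) for _ in 1:S]

# One plugin prediction per sampled specification of the model:
preds = [sigmoid(sum(w .* x)) for w in weight_samples]

# The BMA integral collapses to a simple average over the samples:
p_bma = mean(preds)
```

&lt;p&gt;Deep ensembles and test-time dropout fit the same template: each simply supplies a different set of parameter samples to average over.&lt;/p&gt;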

&lt;p&gt;If you have read my &lt;a href="https://forem.julialang.org/patalt/bayesian-logistic-regression-2i31-temp-slug-4724412"&gt;previous post&lt;/a&gt; on Bayesian Logistic Regression, then the term Laplace should already sound familiar to you. As a matter of fact, we will see that all concepts covered in that previous post can be naturally extended to deep learning. While some of these concepts will be revisited below, I strongly recommend you check out the previous post before reading on here. Without further ado let us now see how LA can be used for truly effortless deep learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Laplace Approximation
&lt;/h3&gt;

&lt;p&gt;While LA was first proposed in the 18th century, it has so far not attracted serious attention from the deep learning community, largely because it involves a possibly large Hessian computation. Daxberger et al. (2021) are on a mission to change the perception that LA has no use in DL: in their &lt;a href="https://arxiv.org/pdf/2106.14806.pdf"&gt;NeurIPS 2021 paper&lt;/a&gt; they demonstrate empirically that LA can be used to produce Bayesian model averages that are at least on par with existing approaches in terms of uncertainty quantification and out-of-distribution detection, while being significantly cheaper to compute. They show that recent advancements in automatic differentiation can be leveraged to produce fast and accurate approximations of the Hessian, and they even provide a fully-fledged &lt;a href="https://aleximmer.github.io/Laplace/"&gt;Python library&lt;/a&gt; that can be used with any pretrained Torch model. For this post, I have built a much less comprehensive, pure-play equivalent of their package in Julia — &lt;a href="https://www.paltmeyer.com/BayesLaplace.jl/dev/"&gt;BayesLaplace.jl&lt;/a&gt; can be used with deep learning models built in &lt;a href="https://fluxml.ai/"&gt;Flux.jl&lt;/a&gt;, which is Julia’s main DL library. As in the previous post on Bayesian logistic regression, I will rely on Julia code snippets instead of equations to convey the underlying maths. If you’re curious about the maths, the &lt;a href="https://arxiv.org/pdf/2106.14806.pdf"&gt;NeurIPS 2021 paper&lt;/a&gt; provides all the detail you need. You will also find a slightly more detailed version of this article on my &lt;a href="https://www.paltmeyer.com/blog/posts/effortsless-bayesian-dl/"&gt;blog&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Bayesian Logistic Regression …
&lt;/h3&gt;

&lt;p&gt;Let’s recap: in the case of logistic regression we had assumed a zero-mean Gaussian prior for the weights that are used to compute logits, which in turn are fed to a sigmoid function to produce probabilities. We saw that under this assumption solving the logistic regression problem corresponds to minimizing the following differentiable loss function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/o2DzMg4779BQCzpUqoTGQ9MfEL8CD5oDXc2JYouBVVc/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKjhtNUpy/UTZWcGdJYk85encu/cG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/o2DzMg4779BQCzpUqoTGQ9MfEL8CD5oDXc2JYouBVVc/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKjhtNUpy/UTZWcGdJYk85encu/cG5n" alt="" width="880" height="79"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As our first step towards Bayesian deep learning, we observe the following: the loss function above corresponds to the objective faced by a single-layer artificial neural network with sigmoid activation and weight decay. In other words, regularized logistic regression is equivalent to a very simple neural network architecture, and hence it is not surprising that the underlying concepts can, in theory, be applied in much the same way.&lt;/p&gt;
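&lt;p&gt;To make the equivalence concrete, here is a small plain-Julia check with illustrative values: for a Gaussian prior with precision λ, the regularized logistic regression loss and the single-layer-network objective with weight decay agree exactly. The stable cross-entropy formula below mirrors what Flux’s logitbinarycrossentropy computes.&lt;/p&gt;

```julia
# Minimal sketch: regularized logistic regression and a single-layer
# network with weight decay share the same objective (illustrative
# values, assuming a Gaussian prior with precision λ).
sigmoid(z) = 1 / (1 + exp(-z))

w = [0.3, -1.2]    # toy weights
x = [0.5, 2.0]     # one observation
y = 1.0            # its label
λ = 0.5

# Logistic-regression view: negative log-likelihood plus prior term.
p = sigmoid(sum(w .* x))
nll = -(y * log(p) + (1 - y) * log(1 - p))
logistic_loss = nll + λ / 2 * sum(abs2, w)

# Network view: numerically stable binary cross-entropy on the logit
# of a Dense(2, 1) layer, plus weight decay.
z = sum(w .* x)
bce = max(z, 0) - z * y + log(1 + exp(-abs(z)))
nn_loss = bce + λ / 2 * sum(abs2, w)

isapprox(logistic_loss, nn_loss)  # the two views coincide
```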

&lt;p&gt;So let’s quickly recap the next core concept: LA relies on the fact that the second-order Taylor expansion of our loss function evaluated at the &lt;strong&gt;maximum a posteriori&lt;/strong&gt; (MAP) estimate amounts to a multivariate Gaussian distribution. In particular, that Gaussian is centered around the MAP estimate with covariance equal to the inverse Hessian evaluated at the mode (Murphy 2022).&lt;/p&gt;
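&lt;p&gt;For a one-parameter toy problem the whole recipe fits in a few lines. The sketch below uses made-up helper names and is not the BayesLaplace.jl internals: find the MAP estimate by gradient descent, evaluate the Hessian of the loss at the mode, and invert it to obtain the posterior variance.&lt;/p&gt;

```julia
# Toy sketch of the Laplace step itself (illustrative helpers, not the
# package internals): fit a 1D regularized logistic regression by
# gradient descent, then form the Gaussian posterior N(w_map, H⁻¹),
# where H is the Hessian of the loss at the mode.
sigmoid(z) = 1 / (1 + exp(-z))

function map_estimate(x, y, λ; lr=0.1, iters=5_000)
    w = 0.0
    for _ in 1:iters
        # Gradient of the negative log-likelihood plus the prior term:
        g = sum((sigmoid(w * xi) - yi) * xi for (xi, yi) in zip(x, y)) + λ * w
        w -= lr * g
    end
    return w
end

x = [1.0, 2.0, -1.0, -2.0]   # 1D inputs
y = [1.0, 1.0, 0.0, 0.0]     # binary labels
λ = 0.1                      # prior precision

w_map = map_estimate(x, y, λ)

# Hessian at the mode: ∑ σ(w·xᵢ)(1 − σ(w·xᵢ))xᵢ² + λ; its inverse is
# the variance of the approximate Gaussian posterior over w.
H = sum(sigmoid(w_map * xi) * (1 - sigmoid(w_map * xi)) * xi^2 for xi in x) + λ
posterior_var = 1 / H
```

&lt;p&gt;In a deep network the Hessian is a large matrix rather than a scalar, which is precisely where the fast approximations discussed above come in.&lt;/p&gt;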

&lt;p&gt;That is basically all there is to the story: if we have a good estimate of the Hessian we have an analytical expression for an (approximate) posterior over parameters. So let’s go ahead and implement this approach in Julia using &lt;a href="https://www.paltmeyer.com/BayesLaplace.jl/dev/"&gt;BayesLaplace.jl&lt;/a&gt;. The code below generates some toy data, builds and trains a single-layer neural network and finally fits a post-hoc Laplace approximation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# Import libraries.&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Plots&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PlotThemes&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Statistics&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BayesLaplace&lt;/span&gt;
&lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;wong&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Toy data:&lt;/span&gt;
&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;toy_data_linear&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hcat&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt; &lt;span class="c"&gt;# bring into tabular format&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zip&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Neural network:&lt;/span&gt;
&lt;span class="n"&gt;nn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chain&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;λ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="n"&gt;sqnorm&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abs2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;weight_regularization&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;λ&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;λ&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;λ&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sqnorm&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Losses&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logitbinarycrossentropy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;weight_regularization&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;

&lt;span class="c"&gt;# Training:&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Optimise&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;update!&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ADAM&lt;/span&gt;
&lt;span class="n"&gt;opt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ADAM&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
    &lt;span class="n"&gt;gs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="n"&gt;update!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;opt&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c"&gt;# Laplace approximation:&lt;/span&gt;
&lt;span class="n"&gt;la&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;laplace&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;λ&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;λ&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;la&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;p_plugin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plot_contour&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;la&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Plugin"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;plugin&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;p_laplace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plot_contour&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;la&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Laplace"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# Plot the posterior distribution with a contour plot.&lt;/span&gt;
&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_plugin&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_laplace&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resulting plot below visualizes the posterior predictive distribution in the 2D feature space. For comparison I have added the corresponding plugin estimate as well. Note how, for the Laplace approximation, the predicted probabilities fan out, indicating that confidence decreases in regions where data is scarce.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/OJrcCsOXOdOGPmbkY0T9HAcNx-7yDO-QX-J8D8a3f0U/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKlhpXzUt/UmJOaU5CV2NJM1cu/cG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/OJrcCsOXOdOGPmbkY0T9HAcNx-7yDO-QX-J8D8a3f0U/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKlhpXzUt/UmJOaU5CV2NJM1cu/cG5n" alt="" width="880" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Posterior predictive distribution of logistic regression in the 2D feature space using the plugin estimator (left) and Laplace approximation (right). Image by author.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  … to Bayesian Neural Networks
&lt;/h3&gt;

&lt;p&gt;Now let’s step it up a notch: we will repeat the exercise from above, but this time for data that is not linearly separable, using a simple MLP instead of the single-layer neural network we used before. The code below is almost the same as above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# Import libraries.&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Plots&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PlotThemes&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Statistics&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BayesLaplace&lt;/span&gt;
&lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;wong&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Toy data:&lt;/span&gt;
&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;toy_data_linear&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hcat&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt; &lt;span class="c"&gt;# bring into tabular format&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;zip&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Build MLP:&lt;/span&gt;
&lt;span class="n"&gt;n_hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;nn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chain&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;σ&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;λ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;
&lt;span class="n"&gt;sqnorm&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abs2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;weight_regularization&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;λ&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;λ&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;λ&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sqnorm&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Losses&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logitbinarycrossentropy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;weight_regularization&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;

&lt;span class="c"&gt;# Training:&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Optimise&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;update!&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ADAM&lt;/span&gt;
&lt;span class="n"&gt;opt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ADAM&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
    &lt;span class="n"&gt;gs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gradient&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="n"&gt;update!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;opt&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c"&gt;# Laplace approximation:&lt;/span&gt;
&lt;span class="n"&gt;la&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;laplace&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;λ&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;λ&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;la&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;p_plugin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plot_contour&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;la&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Plugin"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;plugin&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;p_laplace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plot_contour&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;la&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Laplace"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# Plot the posterior distribution with a contour plot.&lt;/span&gt;
&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_plugin&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_laplace&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Figure 2 demonstrates that once again the Laplace approximation yields a posterior predictive distribution that is more conservative than the over-confident plugin estimate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/vY_XqjmcQFRj4E2r8vob1xYMz_xs4uuvDdS-xiT_kps/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKmlMQmot/NERlTGsxUGlCTC0u/cG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/vY_XqjmcQFRj4E2r8vob1xYMz_xs4uuvDdS-xiT_kps/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKmlMQmot/NERlTGsxUGlCTC0u/cG5n" alt="" width="880" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: Posterior predictive distribution of MLP in the 2D feature space using plugin estimator (left) and Laplace approximation (right). Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To see why this is a desirable outcome, consider the zoomed-out version of Figure 2 below: the plugin estimator classifies with full confidence in regions entirely devoid of data. Arguably, the Laplace approximation produces a much more reasonable picture, even though it too could likely be improved by fine-tuning our prior and the neural network architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/COeE9fjSPBnOeVeA9FwE4cNhE4OqzMok6OCcufQmv3o/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKkk3d3gw/ZjJzT2xUa0JEVC0u/cG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/COeE9fjSPBnOeVeA9FwE4cNhE4OqzMok6OCcufQmv3o/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAwMC8wKkk3d3gw/ZjJzT2xUa0JEVC0u/cG5n" alt="" width="880" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3: Posterior predictive distribution of MLP in the 2D feature space using plugin estimator (left) and Laplace approximation (right). Zoomed out. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrapping up
&lt;/h3&gt;

&lt;p&gt;Recent state-of-the-art research on neural information processing suggests that Bayesian deep learning can be effortless: Laplace approximation for deep neural networks appears to work very well, and it does so at minimal computational cost (Daxberger et al. 2021). This is great news, because the case for turning Bayesian is strong: society increasingly relies on complex automated decision-making systems that need to be trustworthy. More and more of these systems involve deep learning, which is not inherently trustworthy. We have seen that there typically exist multiple viable parameterizations of a deep neural network, each with its own distinct and compelling explanation for the data at hand. When faced with many viable options, don’t put all of your eggs in one basket. In other words, go Bayesian!&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;To get started with Bayesian deep learning I have found many useful and free resources online, some of which are listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://turing.ml/dev/tutorials/03-bayesian-neural-network/"&gt;Turing.jl tutorial&lt;/a&gt; on Bayesian deep learning in Julia.&lt;/li&gt;
&lt;li&gt;Various RStudio AI blog posts including &lt;a href="https://blogs.rstudio.com/ai/posts/2018-11-12-uncertainty_estimates_dropout/"&gt;this one&lt;/a&gt; and &lt;a href="https://blogs.rstudio.com/ai/posts/2019-06-05-uncertainty-estimates-tfprobability/"&gt;this one&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/tensorflow/regression-with-probabilistic-layers-in-tensorflow-probability-e46ff5d37baf"&gt;TensorFlow blog post&lt;/a&gt; on regression with probabilistic layers.&lt;/li&gt;
&lt;li&gt;Kevin Murphy’s &lt;a href="https://probml.github.io/pml-book/book1.html"&gt;draft textbook&lt;/a&gt;, now also available in print.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;p&gt;Bastounis, Alexander, Anders C Hansen, and Verner Vlačić. 2021. “The Mathematics of Adversarial Attacks in AI-Why Deep Learning Is Unstable Despite the Existence of Stable Neural Networks.” &lt;em&gt;arXiv Preprint arXiv:2109.06098&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Daxberger, Erik, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. 2021. “Laplace Redux-Effortless Bayesian Deep Learning.” &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt; 34.&lt;/p&gt;

&lt;p&gt;Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 1050–59. PMLR.&lt;/p&gt;

&lt;p&gt;Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2016. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” &lt;em&gt;arXiv Preprint arXiv:1612.01474&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Murphy, Kevin P. 2022. &lt;em&gt;Probabilistic Machine Learning: An Introduction&lt;/em&gt;. MIT Press.&lt;/p&gt;

&lt;p&gt;Pearl, Judea, and Dana Mackenzie. 2018. &lt;em&gt;The Book of Why: The New Science of Cause and Effect&lt;/em&gt;. Basic books.&lt;/p&gt;

&lt;p&gt;Raghunathan, Aditi, Sang Michael Xie, Fanny Yang, John C Duchi, and Percy Liang. 2019. “Adversarial Training Can Hurt Generalization.” &lt;em&gt;arXiv Preprint arXiv:1906.06032&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Slack, Dylan, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. “Fooling Lime and Shap: Adversarial Attacks on Post Hoc Explanation Methods.” In &lt;em&gt;Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society&lt;/em&gt;, 180–86.&lt;/p&gt;

&lt;p&gt;Wilson, Andrew Gordon. 2020. “The Case for Bayesian Deep Learning.” &lt;em&gt;arXiv Preprint arXiv:2001.10995&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.paltmeyer.com/blog/posts/effortsless-bayesian-dl/"&gt;&lt;em&gt;https://www.paltmeyer.com&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on February 18, 2022.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>deeplearning</category>
      <category>neuralnetworks</category>
      <category>julia</category>
    </item>
    <item>
      <title>Bayesian Logistic Regression</title>
      <dc:creator>Patrick Altmeyer</dc:creator>
      <pubDate>Mon, 15 Nov 2021 00:00:00 +0000</pubDate>
      <link>https://forem.julialang.org/patalt/bayesian-logistic-regression-3l65</link>
      <guid>https://forem.julialang.org/patalt/bayesian-logistic-regression-3l65</guid>
<description>&lt;p&gt;If you’ve ever searched for evaluation metrics to assess model accuracy, chances are that you found many different options to choose from. Accuracy is in some sense the holy grail of prediction, so it’s not at all surprising that the machine learning community spends a lot of time thinking about it. In a world where more and more high-stakes decisions are being automated, model accuracy is in fact a very valid concern.&lt;/p&gt;

&lt;p&gt;But does this recipe for model evaluation seem like a sound and complete approach to automated decision-making? Haven’t we forgotten anything? Some would argue that we need to pay more attention to &lt;strong&gt;model uncertainty&lt;/strong&gt;. No matter how many times you have cross-validated your model, the loss metric it is optimized against, as well as its parameters and predictions, remain inherently random variables. Focusing merely on prediction accuracy and ignoring uncertainty altogether can instill a false sense of confidence in automated decision-making systems. Any &lt;strong&gt;trustworthy&lt;/strong&gt; approach to learning from data should therefore at the very least be transparent about its own uncertainty.&lt;/p&gt;

&lt;p&gt;How can we estimate uncertainty around model parameters and predictions? &lt;strong&gt;Frequentist&lt;/strong&gt; methods for uncertainty quantification generally involve either closed-form solutions based on asymptotic assumptions or bootstrapping (see for example &lt;a href="https://web.stanford.edu/class/archive/stats/stats200/stats200.1172/Lecture26.pdf"&gt;here&lt;/a&gt; for the case of logistic regression). In Bayesian statistics and machine learning we are instead concerned with modelling the &lt;strong&gt;posterior distribution&lt;/strong&gt; over model parameters. This approach to uncertainty quantification is known as &lt;strong&gt;Bayesian Inference&lt;/strong&gt; because we treat model parameters in a Bayesian way: we make assumptions about their distribution based on &lt;strong&gt;prior&lt;/strong&gt; knowledge or beliefs and update these beliefs in light of new evidence. The frequentist approach avoids the need to be explicit about prior beliefs, which in the past has sometimes been considered &lt;em&gt;un&lt;/em&gt;scientific. Still, frequentist methods come with their own assumptions and pitfalls (see for example Murphy (2012) for a discussion). Without diving further into this argument, let us now see how &lt;strong&gt;Bayesian Logistic Regression&lt;/strong&gt; can be implemented from the bottom up.&lt;/p&gt;

&lt;h3&gt;
  
  
  The ground truth
&lt;/h3&gt;

&lt;p&gt;In this post we will work with a synthetic toy data set composed of binary labels and corresponding feature vectors. Working with synthetic data has the benefit that we have control over the &lt;strong&gt;ground truth&lt;/strong&gt; that generates our data. In particular, we will assume that the binary labels are indeed generated by a logistic regression model. Features are generated from a Gaussian mixture model.&lt;/p&gt;
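A minimal sketch of such a data-generating process might look as follows. The mixture means, spread, and true coefficients below are illustrative assumptions, not the exact values used for the figure:

```julia
using Random
Random.seed!(2021)

σ(z) = 1 / (1 + exp(-z))                  # logistic function

# Features from a two-component Gaussian mixture (illustrative parameters):
N = 100
component = rand([-1.0, 1.0], N)          # mixture assignment per sample
X = component .+ 0.5 .* randn(N, 2)       # N × 2 feature matrix

# Binary labels generated by a logistic regression ground truth:
w_true = [2.0, -1.0]
y = Int.(σ.(X * w_true) .> rand(N))
```

Because the labels are drawn from the logistic model itself, we know exactly which coefficients any estimator should recover.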

&lt;p&gt;To add a little bit of life to our example we will assume that the binary labels classify samples into cats and dogs, based on their height and tail length. The figure below shows the synthetic data in the two-dimensional feature domain. Following an introduction to Bayesian Logistic Regression in the next section we will use this data to estimate our model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/R_9A3J0nExQW2cgN0cXOGc6HF0k0RTIR-2TnO4ORt3w/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/OTYwLzAqN2ZuVElp/WDVJcHVaVkVteS5w/bmc" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/R_9A3J0nExQW2cgN0cXOGc6HF0k0RTIR-2TnO4ORt3w/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/OTYwLzAqN2ZuVElp/WDVJcHVaVkVteS5w/bmc" alt="Ground truth labels. Image by author." width="880" height="880"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Ground truth labels. Image by author.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The maths
&lt;/h3&gt;

&lt;p&gt;Adding maths to articles on Medium remains a bit of a hassle, so here we will rely entirely on intuition and avoid formulas altogether. One of the perks of the Julia language is that it allows the use of Unicode characters, so the code we will see below looks almost like maths anyway. If you want the full mathematical treatment and a complete understanding of all the details, feel free to check out the extended version of this article on my &lt;a href="https://www.paltmeyer.com/post/bayesian-logistic-regression/"&gt;website&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Problem setup
&lt;/h4&gt;

&lt;p&gt;The starting point for Bayesian Logistic Regression is &lt;strong&gt;Bayes’ Theorem,&lt;/strong&gt; which formally states that the posterior distribution of parameters is proportional to the product of two quantities: the likelihood of observing the data given the parameters and the prior density of parameters. Applied to our context this can intuitively be understood as follows: our posterior beliefs around the logistic regression coefficients are formed by both our prior beliefs and the evidence we observe (i.e. the data).&lt;/p&gt;

&lt;p&gt;Under the assumption that individual label-feature pairs are &lt;strong&gt;independently&lt;/strong&gt; and &lt;strong&gt;identically&lt;/strong&gt; distributed, their joint likelihood is simply the product over their individual densities (Bernoulli). The prior beliefs around parameters are at our discretion. In practice they may be derived from previous experiments. Here we will use a zero-mean spherical Gaussian prior for reasons explained further below.&lt;/p&gt;

&lt;p&gt;Unlike with linear regression, there is no closed-form analytical solution for estimating or maximizing the posterior, but fortunately accurate approximations do exist (Murphy 2022). One of the simplest approaches, &lt;strong&gt;Laplace Approximation&lt;/strong&gt;, is straightforward to implement and computationally very efficient. It relies on the observation that under a Gaussian prior the posterior of logistic regression is itself approximately Gaussian: in particular, this Gaussian distribution is centered around the &lt;strong&gt;maximum a posteriori&lt;/strong&gt; (MAP) estimate with a covariance matrix equal to the inverse Hessian evaluated at the mode. Below we will see how the MAP estimate and the corresponding Hessian can be computed.&lt;/p&gt;
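In code, the approximation itself amounts to only a couple of lines. The sketch below assumes a hypothetical MAP estimate `w_map` and Hessian `H` at the mode (placeholder values; in the post both come out of Newton's method):

```julia
using LinearAlgebra, Random
Random.seed!(42)

# Placeholder MAP estimate and Hessian at the mode (illustrative values;
# in practice these are produced by the optimization routine):
w_map = [0.5, -1.2]
H = [2.0 0.3; 0.3 1.5]            # positive definite by construction

# Laplace approximation: posterior ≈ Gaussian with mean w_map
# and covariance equal to the inverse Hessian at the mode.
Σ = inv(Symmetric(H))

# Posterior samples via the Cholesky factor of the covariance:
L = cholesky(Symmetric(Σ)).L
samples = [w_map .+ L * randn(2) for _ in 1:10_000]
```

Sampling is not strictly needed once we have a Gaussian in closed form, but it is a convenient sanity check on the fitted approximation.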
&lt;h4&gt;
  
  
  Solving the problem
&lt;/h4&gt;

&lt;p&gt;In practice we do not maximize the posterior directly. Instead we minimize the negative log posterior, which is equivalent and easier to implement. The Julia code below shows the implementation of this loss function and its derivatives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DISCLAIMER&lt;/strong&gt; ❗️&lt;em&gt;I should mention that (at the time of writing in 2021) this was the first time I had programmed in Julia, so for any Julia pros out there: please bear with me! I am more than happy to hear your suggestions in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the loss is the sum of two terms: firstly, a sum over (log) Bernoulli densities, corresponding to the likelihood of the data, and secondly, a (log) Gaussian density, corresponding to our prior beliefs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# Loss:&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; 𝓁&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;w_0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;H_0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;D&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;μ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sigmoid&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Δw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;w_0&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;∑&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;μ&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;μ&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Δw&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;H_0&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Δw&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c"&gt;# Gradient:&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; ∇𝓁&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;w_0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;H_0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;μ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sigmoid&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Δw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;w_0&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;∑&lt;/span&gt;&lt;span class="x"&gt;((&lt;/span&gt;&lt;span class="n"&gt;μ&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;H_0&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Δw&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="c"&gt;# Hessian:&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; ∇∇𝓁&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;w_0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;H_0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;μ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sigmoid&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;∑&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;μ&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;μ&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;H&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;H_0&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since minimizing this loss function is a convex optimization problem, there are many efficient algorithms to choose from. With the Hessian also at hand, it seems natural to use a second-order method, because incorporating information about the curvature of the loss function generally leads to faster convergence. Here we will implement &lt;strong&gt;Newton’s method&lt;/strong&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="c"&gt;# Newton's Method&lt;/span&gt;
&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; arminjo&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="err"&gt;𝓁&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;θ_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ρ&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;𝓁&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;θ_t&lt;/span&gt; &lt;span class="o"&gt;.+&lt;/span&gt; &lt;span class="n"&gt;ρ&lt;/span&gt; &lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="n"&gt;d_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="err"&gt;𝓁&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;θ_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;.+&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="n"&gt;ρ&lt;/span&gt; &lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="n"&gt;d_t&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;g_t&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; newton&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="err"&gt;𝓁&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;θ&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;∇&lt;/span&gt;&lt;span class="err"&gt;𝓁&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;∇∇&lt;/span&gt;&lt;span class="err"&gt;𝓁&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;τ&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-5&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;# Intialize:&lt;/span&gt;
    &lt;span class="n"&gt;converged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="c"&gt;# termination state&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c"&gt;# iteration count&lt;/span&gt;
    &lt;span class="n"&gt;θ_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;θ&lt;/span&gt; &lt;span class="c"&gt;# initial parameters&lt;/span&gt;
    &lt;span class="c"&gt;# Descent:&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;converged&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt; 
        &lt;span class="kd"&gt;global&lt;/span&gt; &lt;span class="n"&gt;g_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;∇&lt;/span&gt;&lt;span class="err"&gt;𝓁&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;θ_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# gradient&lt;/span&gt;
        &lt;span class="kd"&gt;global&lt;/span&gt; &lt;span class="n"&gt;H_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;∇∇&lt;/span&gt;&lt;span class="err"&gt;𝓁&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;θ_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# hessian&lt;/span&gt;
        &lt;span class="n"&gt;converged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g_t&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;.&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;τ&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;isposdef&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;H_t&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# check first-order condition&lt;/span&gt;
        &lt;span class="c"&gt;# If not converged, descend:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;converged&lt;/span&gt;
            &lt;span class="n"&gt;d_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;inv&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;H_t&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;g_t&lt;/span&gt; &lt;span class="c"&gt;# descent direction&lt;/span&gt;
            &lt;span class="c"&gt;# Line search:&lt;/span&gt;
            &lt;span class="n"&gt;ρ_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="c"&gt;# initialize at 1.0&lt;/span&gt;
            &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;arminjo&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="err"&gt;𝓁&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;θ_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ρ_t&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; 
                &lt;span class="n"&gt;ρ_t&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
            &lt;span class="k"&gt;end&lt;/span&gt;
            &lt;span class="n"&gt;θ_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;θ_t&lt;/span&gt; &lt;span class="o"&gt;.+&lt;/span&gt; &lt;span class="n"&gt;ρ_t&lt;/span&gt; &lt;span class="o"&gt;.*&lt;/span&gt; &lt;span class="n"&gt;d_t&lt;/span&gt; &lt;span class="c"&gt;# update parameters&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="c"&gt;# Output:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;θ_t&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;H_t&lt;/span&gt; 
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suppose now that we have trained the Bayesian Logistic Regression model as our binary classifier on our training data and a new unlabelled sample arrives. As with any binary classifier, we can predict the missing label by simply plugging the new sample into our fitted model. If the model achieved good accuracy during training, we may expect good out-of-sample performance. But since we are still dealing with the expected value of a random variable, we would generally like to have an idea of how noisy this prediction is.&lt;/p&gt;

&lt;p&gt;Formally, we are interested in the &lt;strong&gt;posterior predictive&lt;/strong&gt; distribution, which without any further assumptions is a mathematically intractable integral. It can be estimated numerically through Monte Carlo, that is, by repeatedly sampling parameters from the posterior distribution, or by using what is called a &lt;strong&gt;probit approximation&lt;/strong&gt;. The latter uses the finding that the sigmoid function can be well approximated by a rescaled standard Gaussian cdf (see the figure below). Approximating the sigmoid function in this way allows us to derive an analytical solution for the posterior predictive. This approach was used to generate the results in the following section.&lt;/p&gt;
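&lt;p&gt;The analytical shortcut can be sketched in a few lines of Julia. The snippet below is illustrative rather than the exact implementation behind this post: following the standard derivation (see Bishop 2006), &lt;code&gt;μ&lt;/code&gt; and &lt;code&gt;σ²&lt;/code&gt; denote the posterior mean and variance of the linear predictor for the new sample.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Logistic sigmoid:
sigmoid(a) = 1 / (1 + exp(-a))

# Scaling factor from the probit approximation: the sigmoid of a Gaussian
# random variable with mean μ and variance σ² integrates to approximately
# sigmoid(κ(σ²) * μ), where
κ(σ²) = 1 / sqrt(1 + π * σ² / 8)

# Analytical approximation of the posterior predictive p(y=1|x):
predictive(μ, σ²) = sigmoid(κ(σ²) * μ)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note how higher posterior variance &lt;code&gt;σ²&lt;/code&gt; shrinks the effective activation towards zero and hence pulls the predicted probability towards 0.5, which is exactly the behaviour we see in the figures below.&lt;/p&gt;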

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/iY4ci0fCeDFjOmaQ2vv1b8_tYjdZ49BzXnNGjbrkWTU/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAyNC8wKjNOSFNo/dnBjdFlvVnQzRmUu/cG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/iY4ci0fCeDFjOmaQ2vv1b8_tYjdZ49BzXnNGjbrkWTU/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAyNC8wKjNOSFNo/dnBjdFlvVnQzRmUu/cG5n" alt="Demonstration of the probit approximation. Image by author." width="880" height="628"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Demonstration of the probit approximation. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The estimates
&lt;/h3&gt;

&lt;p&gt;The first figure below shows the resulting posterior distribution for the coefficients on height and tail length at varying degrees of prior uncertainty. The red dot indicates the unconstrained maximum likelihood estimate (MLE). Note that as the prior uncertainty tends towards zero, the posterior approaches the prior. This is intuitive: we have imposed that there is no uncertainty around our prior beliefs, so no amount of new evidence can move us in any direction. Conversely, for very large levels of prior uncertainty the posterior distribution is centered around the unconstrained MLE: prior knowledge is very uncertain and hence the posterior is dominated by the likelihood of the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/j81Emq4VlxHt_z216dBaWYYkRqYvbaY25PcYrVjRYGQ/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAyNC8wKlM5ZlNx/SVNCVG1vZzU4TDUu/cG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/j81Emq4VlxHt_z216dBaWYYkRqYvbaY25PcYrVjRYGQ/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAyNC8wKlM5ZlNx/SVNCVG1vZzU4TDUu/cG5n" alt="Posterior distribution at varying degrees of prior uncertainty σ. Image by author." width="880" height="880"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Posterior distribution at varying degrees of prior uncertainty σ. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What about the posterior predictive? The story is similar: since for very low levels of prior uncertainty the posterior is completely dominated by the zero-mean prior, all samples are assigned a predicted probability of 0.5 (top left panel in the figure below). As we gradually increase the uncertainty around our prior, the posterior predictive depends more and more on the data: uncertainty around predicted labels is high only in regions that are not populated by samples. Not surprisingly, this effect is strongest for the MLE, where we see some evidence of overfitting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/waWBNtEfFWofhj8r3dOMNAjbbyo1vEhwR597JU3kQHQ/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAyNC8wKnFydHMx/cWQzNS1WYTZTelMu/cG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/waWBNtEfFWofhj8r3dOMNAjbbyo1vEhwR597JU3kQHQ/w:880/mb:500000/ar:1/aHR0cHM6Ly9jZG4t/aW1hZ2VzLTEubWVk/aXVtLmNvbS9tYXgv/MTAyNC8wKnFydHMx/cWQzNS1WYTZTelMu/cG5n" alt="Predictive posterior distribution at varying degrees of prior uncertainty σ. Image by author." width="880" height="880"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Predictive posterior distribution at varying degrees of prior uncertainty σ. Image by author.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrapping up
&lt;/h3&gt;

&lt;p&gt;In this post we have seen how Bayesian Logistic Regression can be implemented from scratch in the Julia language. The estimated posterior distribution over model parameters can be used to quantify uncertainty around coefficients and model predictions. I have argued that it is important to be transparent about model uncertainty to avoid being overly confident in estimates.&lt;/p&gt;

&lt;p&gt;There are many more benefits associated with Bayesian (probabilistic) machine learning. Understanding where in the input domain our model exhibits high uncertainty can, for example, be instrumental in labelling data: see Gal, Islam, and Ghahramani (2017) and follow-up works for an interesting application to &lt;strong&gt;active learning&lt;/strong&gt; for image data. Similarly, recent work uses estimates of the posterior predictive in the context of &lt;strong&gt;algorithmic recourse&lt;/strong&gt; (Schut et al. 2021). For a brief introduction to algorithmic recourse see my &lt;a href="https://forem.julialang.org/patalt/individual-recourse-for-black-box-models-1cak-temp-slug-6649736"&gt;previous post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a great reference for further reading about probabilistic machine learning I can highly recommend Murphy (2022). An electronic version of the book is currently freely available as a draft. Finally, if you are curious to see the full source code in detail and want to try the code yourself, you can check out this &lt;a href="https://colab.research.google.com/github/pat-alt/pat-alt.github.io/blob/main/content/post/2021-11-15-bayesian-logistic-regression/julia_implementation.ipynb"&gt;interactive notebook&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;p&gt;Bishop, Christopher M. 2006. &lt;em&gt;Pattern Recognition and Machine Learning&lt;/em&gt;. Springer.&lt;/p&gt;

&lt;p&gt;Gal, Yarin, Riashat Islam, and Zoubin Ghahramani. 2017. “Deep Bayesian Active Learning with Image Data.” In &lt;em&gt;International Conference on Machine Learning&lt;/em&gt;, 1183–92. PMLR.&lt;/p&gt;

&lt;p&gt;Murphy, Kevin P. 2012. &lt;em&gt;Machine Learning: A Probabilistic Perspective&lt;/em&gt;. MIT Press.&lt;/p&gt;

&lt;p&gt;———. 2022. &lt;em&gt;Probabilistic Machine Learning: An Introduction&lt;/em&gt;. MIT Press.&lt;/p&gt;

&lt;p&gt;Schut, Lisa, Oscar Key, Rory Mc Grath, Luca Costabello, Bogdan Sacaleanu, Yarin Gal, et al. 2021. “Generating Interpretable Counterfactual Explanations by Implicit Minimisation of Epistemic and Aleatoric Uncertainties.” In &lt;em&gt;International Conference on Artificial Intelligence and Statistics&lt;/em&gt;, 1756–64. PMLR.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Full article published at&lt;/em&gt; &lt;a href="https://www.paltmeyer.com/post/bayesian-logistic-regression/"&gt;&lt;em&gt;https://www.paltmeyer.com&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on November 15, 2021.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>logisticregression</category>
      <category>julialang</category>
      <category>bayes</category>
    </item>
  </channel>
</rss>
