Roland Schätzle

Statistical Plotting with Julia: Gadfly.jl

This article appeared in Towards Data Science on Aug 25th, 2022

How to create statistical plots using the Gadfly.jl package

This is the first of several articles in which I compare different Julia graphics packages for creating statistical plots. I start the series with the Gadfly package.

In the introduction to the series (The Grammar of Graphics or how to do ggplot-style plotting in Julia), I explained the Grammar of Graphics (GoG), which is the conceptual basis for these graphics packages. That article also introduces the data used for the plotting examples.

Gadfly

Gadfly is a very complete implementation of the Grammar of Graphics. Its original author is Daniel C. Jones, but the package currently has more than 100 contributors listed on GitHub. The first versions appeared in 2014. By now it is a very mature package with only a few new releases per year.

It’s written completely in Julia and plays well with the rest of the Julia ecosystem. For example, it is tightly integrated with DataFrames.jl, and via the IJulia package it can be used directly within Jupyter notebooks.

For publication-quality graphics it can render SVG out of the box, and with Cairo.jl and Fontconfig.jl it can also produce formats like PNG, PDF, PS and PGF.
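
As a minimal sketch of the SVG route (the file location, the plot size and the data points are illustrative assumptions, not taken from the article), a plot can be written to disk with draw:

```julia
using Gadfly

# Toy data, made up for illustration only
p = plot(x = [1, 2, 3, 4], y = [1, 4, 9, 16], Geom.point)

# SVG output needs no extra packages; the size is an arbitrary choice
svg_path = joinpath(mktempdir(), "example.svg")
draw(SVG(svg_path, 15cm, 10cm), p)

# For PNG, PDF or PS, Cairo.jl and Fontconfig.jl have to be loaded first, e.g.
#   import Cairo, Fontconfig
#   draw(PNG("example.png", 15cm, 10cm), p)
```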

The plots produced by Gadfly offer some interactivity like panning, zooming and toggling.

Example Plots

For the comparison I will use a few diagram types (or geometries as they are called by the GoG) which are commonly used in data science, namely:

  • bar plots
  • scatter plots
  • histograms
  • box plots
  • violin plots

Gadfly of course offers many more types, as you can see in this gallery. But in order to obtain a 1:1 comparison between all packages, I stuck with the types listed above.

The data for the examples is assumed to be available in the DataFrames countries, subregions_cum and regions_cum presented in the introductory article of the series.

Most plots are first presented in a basic version that uses the defaults of the graphics package and are then refined with customized attributes (labels, background color etc.).

Bar Plots

Population by Region

We start with a simple bar chart that shows population size (in 2019) by region. This is done with the following plot command, which maps data to aesthetics and uses a bar geometry, as we learned in the introductory article about the Grammar of Graphics:

plot(regions_cum, 
        x = :Region, y = :Pop2019, color = :Region, 
        Geom.bar)

… resulting in the following bar chart:

region by population - 1

In a second version we don’t rely on defaults, but set axis labels, title and background color manually. Apart from that, we don’t want the numbers on the y-axis in scientific format, and there should be some space between the bars (to conform to the definition of a bar chart). This is achieved with Guide elements for the labels, a Scale for the number format on the y-axis and a Theme for general attributes like background color or bar spacing.
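
A minimal sketch of such a specification could look as follows. The toy DataFrame only stands in for regions_cum, and the label texts, the color and the spacing value are assumptions chosen for illustration:

```julia
using Gadfly, DataFrames

# Toy stand-in for regions_cum (names and values are placeholders)
regions_cum = DataFrame(
    Region  = ["Region A", "Region B", "Region C"],
    Pop2019 = [4.6e9, 1.3e9, 7.5e8])

p = plot(regions_cum,
    x = :Region, y = :Pop2019, color = :Region,
    Geom.bar,
    Guide.xlabel("Region"),                  # axis labels and title
    Guide.ylabel("Population"),
    Guide.title("Population by Region, 2019"),
    Scale.y_continuous(format = :plain),     # no scientific notation on the y-axis
    Theme(background_color = "ivory",        # general attributes
          bar_spacing = 5mm))
```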

… creating the following beautified bar chart:

population by region - 2

Population by Subregion

The next bar chart depicts population by subregion using the following plot-command:

plot(subregions_cum, 
        x = :Subregion, y = :Pop2019, color = :Region, 
        Geom.bar)

… resulting in the following bar chart:

population by subregion - 1

We can see that there is room for improvement: as there are quite a few subregions and their names are relatively long, a horizontal bar diagram might be more readable. Apart from this, we again adapt labels, title, background color etc. We switch to a horizontal layout using the orientation parameter of the bar geometry.
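
A sketch of a horizontal variant, again with a toy stand-in for subregions_cum and with label texts and theme values that are assumptions. Note that the x and y mappings are swapped for the horizontal orientation:

```julia
using Gadfly, DataFrames

# Toy stand-in for subregions_cum (names and values are placeholders)
subregions_cum = DataFrame(
    Region    = ["Region A", "Region A", "Region B", "Region B"],
    Subregion = ["Subregion A1", "Subregion A2", "Subregion B1", "Subregion B2"],
    Pop2019   = [1.9e9, 6.5e8, 4.2e8, 1.1e8])

p = plot(subregions_cum,
    y = :Subregion, x = :Pop2019, color = :Region,   # categories now on the y-axis
    Geom.bar(orientation = :horizontal),
    Guide.xlabel("Population"),
    Guide.ylabel("Subregion"),
    Guide.title("Population by Subregion, 2019"),
    Scale.x_continuous(format = :plain),
    Theme(background_color = "ivory", bar_spacing = 1mm))
```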

… resulting indeed in a more readable bar chart:

population by subregion - 2

It gets even more readable if we sort subregions_cum by population size (Pop2019) before rendering the diagram, using the following command:

subregions_cum_sorted = sort(subregions_cum, :Pop2019)

If we apply the plot command from above to the sorted data subregions_cum_sorted we finally get:

population by subregion - 3

Scatter Plots

In the next step we have a look at the population at the country level in relation to the growth rate. A scatter plot is a good way to visualize this relationship. We get one using a point geometry as follows:

plot(countries,
        x = :Pop2019, y = :PopChangePct, color = :Region,
        Geom.point)

… resulting in this scatter plot:

Population in relation to growth rate - 1

As we also mapped the region to the color aesthetic, we get a more differentiated picture that includes region information as well.

But the distribution of the data is quite skewed: most countries have a population below 200 million. So a logarithmic scale on the x-axis might give better insight into the data. And again, we add some labels, a background color etc.
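
A minimal sketch with a logarithmic x-axis, assuming a toy stand-in for countries; the label texts, the theme values and the exact form of the labels function are illustrative assumptions:

```julia
using Gadfly, DataFrames

# Toy stand-in for countries (names and values are placeholders)
countries = DataFrame(
    Region       = ["Region A", "Region A", "Region B", "Region B"],
    Pop2019      = [1.4e9, 5.0e6, 3.3e8, 8.0e5],
    PopChangePct = [0.4, 1.2, 0.6, 2.1])

p = plot(countries,
    x = :Pop2019, y = :PopChangePct, color = :Region,
    Geom.point,
    # log10 scale; the labels function turns each log value back into a readable number
    Scale.x_log10(labels = x -> string(round(10^x, digits = 2))),
    Guide.xlabel("Population (log scale)"),
    Guide.ylabel("Growth rate [%]"),
    Guide.title("Population in relation to growth rate"),
    Theme(background_color = "ivory"))
```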

… giving us the following improved scatter plot:

Population in relation to growth rate - 2

The labels parameter of the log scale needs a bit of explanation: without it, the x-axis would show the logarithms (to base 10), which many people find hard to read. Instead we want plain population numbers (e.g. 100.0 instead of 2). So we pass a function to labels that computes the ‘correct’ labels: the log value x is converted to 10^x to get a ‘readable’ number, then rounded to two digits and finally converted to a string (the expected type for a label).
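
The conversion can be tried in isolation. The helper name log10_label is hypothetical and only serves this illustration; it takes the log value as a floating-point number:

```julia
# Hypothetical helper: log10 value -> 10^x -> rounded to two digits -> string
log10_label(x) = string(round(10^x, digits = 2))

log10_label(2.0)    # "100.0"
log10_label(0.5)    # "3.16"
log10_label(-1.0)   # "0.1"
```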

Histograms

Bar plots and histograms have the same geometry (in the sense of the “Grammar of Graphics”). But in order to get categorical data on the x-axis, the data used for a histogram has to be mapped to (artificial) categories in a process called ‘binning’. In the GoG this is done using a so-called bin statistic.

Gadfly doesn’t follow (or at least doesn’t show) the theory here. Instead it introduces a separate geometry for histograms (which might be more practical for everyday use).

So we get a histogram that shows the distribution of GDP per capita among the different countries with the following plot-command using a histogram geometry:

plot(countries, x = :GDPperCapita, Geom.histogram)

… resulting in this histogram:

distribution of GDP per capita - 1

The number of bins can be controlled with the bincount parameter of the histogram geometry. And again we can add labels etc.
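
A sketch of a histogram with an explicit bin count, using made-up GDP values as a stand-in for countries; the bin count, the label texts and the theme are assumptions:

```julia
using Gadfly, DataFrames

# Toy stand-in for countries (values are placeholders)
countries = DataFrame(GDPperCapita = [800, 1500, 2300, 4100, 5200, 7600,
                                      12000, 18500, 31000, 44000, 52000, 81000])

p = plot(countries,
    x = :GDPperCapita,
    Geom.histogram(bincount = 5),            # number of bins set explicitly
    Guide.xlabel("GDP per capita"),
    Guide.ylabel("Number of countries"),
    Guide.title("Distribution of GDP per capita"),
    Theme(background_color = "ivory"))
```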

… leading to the following improved histogram:

distribution of GDP per capita - 2

Box Plots and Violin Plots

To gain insight into the distribution of numerical data, box plots or violin plots are typically used. Each of these diagram types has its specific virtues. So let’s visualize the distribution of the GDP per capita for each region using these plots.

Box Plot

Let’s go straight to the ‘beautified’ version, which uses a boxplot geometry.

… giving us the following box plot:
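
A minimal sketch of such a box plot; the toy data stands in for countries, and the label texts and the theme are assumptions:

```julia
using Gadfly, DataFrames

# Toy stand-in for countries: five made-up GDP values per region
countries = DataFrame(
    Region       = repeat(["Region A", "Region B"], inner = 5),
    GDPperCapita = [900, 2400, 5100, 9800, 21000,
                    15000, 28000, 41000, 56000, 93000])

p = plot(countries,
    x = :Region, y = :GDPperCapita,
    Geom.boxplot,
    Guide.xlabel("Region"),
    Guide.ylabel("GDP per capita"),
    Guide.title("Distribution of GDP per capita by region"),
    Scale.y_continuous(format = :plain),
    Theme(background_color = "ivory"))
```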

distribution of GDP per capita by region - 1

Violin Plot

The code for a violin plot of this data looks quite similar. The only difference is the use of a violin geometry instead of a boxplot.

… leading to the following violin plot:
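
To make that single difference visible, the sketch below wraps the shared parts of the specification in a function and passes only the geometry. The function name gdp_by_region, the toy data and the label texts are assumptions for illustration:

```julia
using Gadfly, DataFrames

# Toy stand-in for countries: five made-up GDP values per region
countries = DataFrame(
    Region       = repeat(["Region A", "Region B"], inner = 5),
    GDPperCapita = [900, 2400, 5100, 9800, 21000,
                    15000, 28000, 41000, 56000, 93000])

# Hypothetical helper: everything except the geometry stays the same
gdp_by_region(geometry) = plot(countries,
    x = :Region, y = :GDPperCapita,
    geometry,
    Guide.xlabel("Region"),
    Guide.ylabel("GDP per capita"),
    Guide.title("Distribution of GDP per capita by region"),
    Theme(background_color = "ivory"))

box_plot    = gdp_by_region(Geom.boxplot)
violin_plot = gdp_by_region(Geom.violin)
```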

distribution of GDP per capita by region - 2

Here we note that the defaults for the scaling of the y-axis don’t work as well as with the box plot. Apart from that, the really interesting part of the distribution lies in the range from 0 to 100,000. Therefore we want to restrict the plot to that range on the y-axis, doing a sort of zoom-in.

Zooming in

This can easily be achieved by adding the following line to the list of plot-parameters:

Coord.cartesian(ymin = 0, ymax = 100000),

… leading to the following violin diagram:

distribution of GDP per capita by region - 3

The same restriction to the y-axis can be applied to the box plot:
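
A sketch of the box plot with this restriction, once more with made-up stand-in data for countries and assumed label texts; the Coord.cartesian line is the one quoted above:

```julia
using Gadfly, DataFrames

# Toy stand-in for countries; one value lies outside the zoomed range
countries = DataFrame(
    Region       = repeat(["Region A", "Region B"], inner = 5),
    GDPperCapita = [900, 2400, 5100, 9800, 21000,
                    15000, 28000, 41000, 56000, 180000])

p = plot(countries,
    x = :Region, y = :GDPperCapita,
    Geom.boxplot,
    Coord.cartesian(ymin = 0, ymax = 100000),   # restrict the visible y-range
    Guide.xlabel("Region"),
    Guide.ylabel("GDP per capita"),
    Theme(background_color = "ivory"))
```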

distribution of GDP per capita by region - 4

Conclusions

As we can see, Gadfly follows the concepts of the Grammar of Graphics quite closely most of the time. That’s one of the reasons why the plot specifications are so consistent (the same things are always specified in the same way, independent of context) and thus easy to learn and to memorize.

Limits are only reached in edge cases. For example, if you specify a scatter plot with a mapping to the x-axis but none to the y-axis, the GoG says you should get points distributed along a line (the x-axis). That doesn’t work with Gadfly. There is also no polar coordinate system implemented (though this could be added in the future).

But if your visualization needs are centered around the (large) list of geometries implemented in Gadfly and you don’t need rather exotic customizations of these diagrams, you will be quite happy with Gadfly.

If you want to try out the examples yourself, you can get a Pluto notebook, which is a sort of executable variant of this article, from my GitHub repository.