<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Julia Community 🟣: Rik Huijzer</title>
    <description>The latest articles on Julia Community 🟣 by Rik Huijzer (@rikhuijzer).</description>
    <link>https://forem.julialang.org/rikhuijzer</link>
    <image>
      <url>https://forem.julialang.org/images/0WDeQp0ZpssCsibScULiOaplBUUzD_c9WAf2N9ybP0M/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L3VzZXIvcHJvZmls/ZV9pbWFnZS8zNTIv/NjZkZDg5MTctNDI0/YS00M2ZlLThjNzUt/ODVjNzM3MThmZjdh/LmpwZWc</url>
      <title>Julia Community 🟣: Rik Huijzer</title>
      <link>https://forem.julialang.org/rikhuijzer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.julialang.org/feed/rikhuijzer"/>
    <language>en</language>
    <item>
      <title>Frequentist and Bayesian coin flipping in Julia</title>
      <dc:creator>Rik Huijzer</dc:creator>
      <pubDate>Tue, 23 Aug 2022 09:57:00 +0000</pubDate>
      <link>https://forem.julialang.org/rikhuijzer/frequentist-and-bayesian-coin-flipping-26gn</link>
      <guid>https://forem.julialang.org/rikhuijzer/frequentist-and-bayesian-coin-flipping-26gn</guid>
      <description>&lt;p&gt;To me, it is still unclear what exactly is the difference between Frequentist and Bayesian statistics. Most explanations involve terms such as "likelihood", "uncertainty" and "prior probabilities". Here, I'm going to show the difference between both statistical paradigms by using a coin flipping example. In the examples, the effect of showing more data to both paradigms will be visualised.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating data
&lt;/h2&gt;

&lt;p&gt;Lets start by generating some data from a fair coin flip, that is, the probability of heads is 0.5.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CairoMakie&lt;/span&gt;

    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;AlgebraOfGraphics&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Lines&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Scatter&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;draw&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;visual&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Distributions&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;HypothesisTests&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OneSampleTTest&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confint&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;StableRNGs&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StableRNG&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;p_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;is_heads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;let&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StableRNG&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Bernoulli&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_true&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To give some intuition about the sample, the first six elements of &lt;code&gt;is_heads&lt;/code&gt; are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;is_heads&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;6-element Vector{Bool}:
 1
 1
 1
 0
 0
 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Calculate probability estimates
&lt;/h2&gt;

&lt;p&gt;The Frequentist estimate for a one sample t-test after seeing nnn samples can be calculated with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; frequentist_estimate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OneSampleTTest&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_heads&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;middle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t_result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xbar&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;confint&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_result&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;middle&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the Bayesian estimate, we can use the closed-form solution (&lt;a href="https://turing.ml/dev/tutorials/00-introduction/"&gt;https://turing.ml/dev/tutorials/00-introduction/&lt;/a&gt;). A closed-form solution is not available for many real-world problems, but quite useful for this example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;closed_form_prior&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Beta&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; update_belief&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;heads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_heads&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;tails&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;heads&lt;/span&gt;
    &lt;span class="n"&gt;updated_belief&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Beta&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;closed_form_prior&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;α&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;heads&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;closed_form_prior&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;β&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tails&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;updated_belief&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;beliefs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;closed_form_prior&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;update_belief&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;)];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; bayesian_estimate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;distribution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beliefs&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.025&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;middle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distribution&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.975&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;middle&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can write a function to plot the estimates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; plot_estimates&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimate_function&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;draws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="n"&gt;estimates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;estimate_function&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;middles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;middle&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;estimates&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;lowers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;estimates&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;uppers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;upper&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;estimates&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;estimates&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;middles&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;visual&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Scatter&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_middle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;83&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;layers&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_middle&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;visual&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lines&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;visual&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;dash&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;zip&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lowers&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uppers&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;df_bounds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;draws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;layers&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_bounds&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;visual&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lines&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;yticks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;((&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;nothing&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;P&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"Probability of heads"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;draws&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"Observed number of draws"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;draw&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And plot the Frequentist and Bayesian estimates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;plot_estimates&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frequentist_estimate&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Frequentist estimates"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/jsZbELbv7Bxpz3n4k8nIoJoC0ZFJc5hQ3KVSAfj2ytE/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3B5/N3BsbjN5cGZ0eGx3/ejJkeWxmLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/jsZbELbv7Bxpz3n4k8nIoJoC0ZFJc5hQ3KVSAfj2ytE/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3B5/N3BsbjN5cGZ0eGx3/ejJkeWxmLnBuZw" alt="Frequentist estimate" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;plot_estimates&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bayesian_estimate&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Bayesian estimates"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/d8PV3OwXeAs2SbEujGaVRQLh_03om47MahLXVQ2kBjU/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzYz/b2lwaXpwODhuYzZm/MGZkZmJvLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/d8PV3OwXeAs2SbEujGaVRQLh_03om47MahLXVQ2kBjU/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzYz/b2lwaXpwODhuYzZm/MGZkZmJvLnBuZw" alt="Bayesian estimate" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Based on these plots, we can conclude two things. Firstly, the Bayesian approach provides better estimates for small sample sizes. The Bayesian approach successfully uses the fact that a probability should be between 0 and 1, which was given to the model via the &lt;code&gt;Beta(1, 1)&lt;/code&gt; prior. For increasingly larger sample sizes, the difference between both statistical paradigms vanish in this situation. Secondly, collecting more and more samples until the result is significant is dangerous. This approach is called &lt;em&gt;optional stopping&lt;/em&gt;. Around 10 samples, the frequentist' test would conclude that the data must come from a distribution with a mean higher than 0.5, whereas we know that this is false. Cumming (&lt;a href="https://www.routledge.com/Understanding-The-New-Statistics-Effect-Sizes-Confidence-Intervals-and/Cumming/p/book/9780415879682"&gt;2011&lt;/a&gt;) calls this the "dance of the &lt;em&gt;p&lt;/em&gt;-values".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EDIT:&lt;/strong&gt; Christopher Rowley pointed out that it would be more fair to run a frequentist &lt;code&gt;BinomialTest&lt;/code&gt; since that will output a confidence interval in [0, 1].&lt;/p&gt;

&lt;p&gt;Built with Julia 1.8.0 and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AlgebraOfGraphics 0.6.11
CairoMakie 0.8.13
Distributions 0.25.70
HypothesisTests 0.10.10
StableRNGs 1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run this blog post locally, open &lt;a href="https://www.huijzer.xyz/posts/notebooks/frequentist-bayesian-coin-flipping.jl"&gt;this notebook&lt;/a&gt; with Pluto.jl.&lt;/p&gt;

</description>
      <category>bayesian</category>
      <category>statistics</category>
    </item>
    <item>
      <title>Random forest, Shapley values and multicollinearity</title>
      <dc:creator>Rik Huijzer</dc:creator>
      <pubDate>Tue, 07 Jun 2022 20:19:00 +0000</pubDate>
      <link>https://forem.julialang.org/rikhuijzer/random-forest-shapley-values-and-multicollinearity-4c17</link>
      <guid>https://forem.julialang.org/rikhuijzer/random-forest-shapley-values-and-multicollinearity-4c17</guid>
      <description>&lt;p&gt;Linear statistical models are great for many use-cases since they are easy to use and easy to interpret. Specifically, linear models can use features (also known as independent variables, predictors or covariates) to predict an outcome (also known as dependent variables).&lt;/p&gt;

&lt;p&gt;In a linear model, a higher coefficient for a feature, the more a feature played a role in making a prediction. However, when variables in a regression model are correlated, these conclusions don't hold anymore.&lt;/p&gt;

&lt;p&gt;One way to solve this is to use clustering techniques such as principal component analysis (PCA) (Dormann et al., &lt;a href="https://doi.org/10.1111/j.1600-0587.2012.07348.x"&gt;2012&lt;/a&gt;). With PCA, latent clusters are automatically determined. Unfortunately, these latent clusters now became, what I would like to call, magic blobs. Proponents of these techniques could say: "But we know that there is an underlying variable which causes our effect, why can't this variable be the same as the cluster that we found?" Well, because these blobs are found in the data and not in the real-world.&lt;/p&gt;

&lt;p&gt;To link these blobs (officially, clusters) back to the real-world, one can try to find the features closes to the blobs in one way or another, but this will always introduce bias. Another approach is to drop features which are highly correlated and expected to be less important.&lt;/p&gt;

&lt;p&gt;Some say that random forests combined with Shapley values can deal with collinearity reasonably well. This is because the random forest can find complex relations in the data and because Shapley values are based on mathematically proven ideas. Others say that the Shapley values will pick one of the correlated features and ignore the others. In this post, I aim to simulating collinear data and see how good the conclusions of the model are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simulating data
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;CairoMakie&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;DataFrames&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Not&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;select&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Distributions&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Normal&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;LightGBM&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MLJInterface&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LGBMRegressor&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;MLJ&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predict&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;seed!&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Shapley&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MonteCarlo&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shapley&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;StableRNGs&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StableRNG&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Statistics&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cor&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;y_noise&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coefficient&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;coefficient&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Normal&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="x"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;150.0&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;let&lt;/span&gt;
    &lt;span class="n"&gt;seed!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt;
    &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_noise&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_noise&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_noise&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_noise&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_noise&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="x"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/cJvymG78DGRBoXGiGBZ9EOgDV9GV94biJqZRJhm9BnE/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL203/cXp5amk4N3lueWVl/a3pwOTF1LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/cJvymG78DGRBoXGiGBZ9EOgDV9GV94biJqZRJhm9BnE/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL203/cXp5amk4N3lueWVl/a3pwOTF1LnBuZw" alt="Data" width="880" height="684"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fitting a model
&lt;/h2&gt;

&lt;p&gt;For the LGBM regressor, I've set some hyperparameters to lower values because the default parameters are optimized for large datasets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; regressor&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;
        &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;num_leaves&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;min_data_per_group&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
    &lt;span class="x"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;LGBMRegressor&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt; &lt;span class="x"&gt;;&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; fit_model&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Not&lt;/span&gt;&lt;span class="x"&gt;([&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="x"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regressor&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get an idea of what the model is doing, we can plot the predictions on top of the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/36hjy-3zssH4VHoaigbCRnNvDwIYKFJ_K_EwBjrr5Uk/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3F2/MjNhNDBpN20ycTE3/ejVnYnNzLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/36hjy-3zssH4VHoaigbCRnNvDwIYKFJ_K_EwBjrr5Uk/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3F2/MjNhNDBpN20ycTE3/ejVnYnNzLnBuZw" alt="Predictions on top of the data" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My main worry after seeing this is how we can avoid overfitting random forests on our data. Intuitively, it makes sense that the model overfits due to it's high flexibility combined with few samples. Of course, not all real-world phenomenons are linear, but it seems that the model is spending too much effort on fitting noise. Luckily, approaches such as &lt;a href="https://huijzer.xyz/posts/nested-cv"&gt;(nested) cross-validation&lt;/a&gt; can estimate how well a model will generalize.&lt;/p&gt;

&lt;p&gt;Anyway, I'm digressing. Back to the original problem of whether we can infer anything from the non-linear model about feature importance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shapley values
&lt;/h2&gt;

&lt;p&gt;Shapley values are based on a theory by Shapley (&lt;a href="https://doi.org/10.1515/9781400881970-018"&gt;1953&lt;/a&gt;). The goal of these values is to estimate how much each feature has contributed to the prediction. One way to do this is by changing the input for a feature while keeping all other feature inputs constant and seeing how much the output changes. This repeated sampling is called the Monte Carlo method. We can aggregate the Monte Carlo results to estimate how much a feature contributes to the outcome.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; shapley_values&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fit_model&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MonteCarlo&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;shapley&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;mc&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="x"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a simple check, let's first see what happens if we only pass the dataset with X, T, W and Y. Because T has no relation to the outcome Y and W has a strong relation to the outcome Y, we expect that T gets a low score and W gets a high score.&lt;/p&gt;

&lt;p&gt;In the plot below, I've shown all the Shapley values from the Monte Carlo simulation on the left and aggregated them on the right. As expected, the plots on the right clearly show that W has a greater contribution.&lt;/p&gt;

&lt;p&gt;As a sidenote, usually people only show the plot on the right (mean of absolute values) when talking about Shapley values, but I think it is good to have the one on the left (Shapley values) too to give the full picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/MflTBzaKqRCTT8Oxc_vrl2hH4ztsQ0aM95BTVWlpqbU/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2Nz/OHV1Znltczd5dW9m/ejM0aTAzLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/MflTBzaKqRCTT8Oxc_vrl2hH4ztsQ0aM95BTVWlpqbU/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2Nz/OHV1Znltczd5dW9m/ejM0aTAzLnBuZw" alt="Test Shapley values" width="880" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, it's time for the final test: do Shapley values give apropriate credit to correlated features? We check this by passing all the features instead of only X, T, W and Y.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/WYKMaYFIgYAWWbQZfBG8TQlzcKEpxaAjsAPYmZbPAp4/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzRs/OWVjYWhuenBsbGZ5/czQ1YnNmLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/WYKMaYFIgYAWWbQZfBG8TQlzcKEpxaAjsAPYmZbPAp4/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzRs/OWVjYWhuenBsbGZ5/czQ1YnNmLnBuZw" alt="More Shapley values" width="880" height="782"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Based on this outcome, I would say that Shapley values do a pretty good job in combination with the LightGBM model. The values T, U, V, W are ordered by their correlation with the outcome (and each other). As expected, features lower in the plot get a higher mean of absolute values, that is, feature importance. However, basing scientific conclusions on these outcomes may be deceptive because it's likely that there is a high variance on the reported feature importances. The main risk is here is that the model didn't properly fit resulting in incorrect outcomes. Another risk is that the overfitted model generalizes poorly. Both risks also hold for linear models, so I guess it's just the best we have.&lt;/p&gt;

&lt;p&gt;Built with Julia 1.8.0 and&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CairoMakie 0.8.13
DataFrames 1.3.4
Distributions 0.25.67
LightGBM 0.5.2
MLJ 0.18.4
Shapley 0.1.1
StableRNGs 1.0.0 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run this blog post locally, open &lt;a href="https://huijzer.xyz/posts/notebooks/shapley.jl"&gt;this notebook&lt;/a&gt; with Pluto.jl.&lt;/p&gt;

</description>
      <category>launch</category>
      <category>stats</category>
      <category>ml</category>
    </item>
  </channel>
</rss>
