<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Julia Community 🟣: Anthony Blaom, PhD</title>
    <description>The latest articles on Julia Community 🟣 by Anthony Blaom, PhD (@ablaom).</description>
    <link>https://forem.julialang.org/ablaom</link>
    <image>
      <url>https://forem.julialang.org/images/T9scDOUeBG5iHr04EcKOMjoHZWkP3a40OMa34q1QhFE/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L3VzZXIvcHJvZmls/ZV9pbWFnZS85OTUv/YjM1MjEzOWMtNjI2/YS00NDc1LWE3YTIt/OTdlMDMwOTRhOWQ2/LmpwZWc</url>
      <title>Julia Community 🟣: Anthony Blaom, PhD</title>
      <link>https://forem.julialang.org/ablaom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.julialang.org/feed/ablaom"/>
    <language>en</language>
    <item>
      <title>Julia Boards the Titanic - A brief introduction to the MLJ.jl package</title>
      <dc:creator>Anthony Blaom, PhD</dc:creator>
      <pubDate>Wed, 15 Feb 2023 22:24:07 +0000</pubDate>
      <link>https://forem.julialang.org/mlj/julia-boards-the-titanic-1ne8</link>
      <guid>https://forem.julialang.org/mlj/julia-boards-the-titanic-1ne8</guid>
      <description>&lt;p&gt;This is a gentle introduction to Julia's machine learning toolbox &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/" rel="noopener noreferrer"&gt;MLJ&lt;/a&gt; focused on users new to Julia. In it we train a decision tree to predict whether a new passenger would survive a hypothetical replay of the Titanic disaster. The blog is loosely based on &lt;a href="https://github.com/ablaom/HelloJulia.jl/tree/dev/notebooks/03_machine_learning" rel="noopener noreferrer"&gt;these notebooks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; No prior experience with Julia is needed, but you should know how to open a Julia REPL session in a terminal or console. A nodding acquaintance with &lt;a href="https://www.digitalocean.com/community/tutorials/an-introduction-to-machine-learning" rel="noopener noreferrer"&gt;supervised machine learning&lt;/a&gt; would be helpful.&lt;/p&gt;

&lt;p&gt;Experienced data scientists may want to check out the more advanced tutorial, &lt;a href="https://juliaai.github.io/DataScienceTutorials.jl/end-to-end/telco/" rel="noopener noreferrer"&gt;MLJ for Data Scientists in Two Hours&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Trees
&lt;/h2&gt;

&lt;p&gt;Generally, &lt;a href="https://en.wikipedia.org/wiki/Decision_tree" rel="noopener noreferrer"&gt;decision trees&lt;/a&gt; are not the best performing machine learning models. However, they are extremely fast to train, easy to interpret, and have flexible data requirements. They are also the building blocks of more advanced models, such as &lt;a href="https://en.wikipedia.org/wiki/Random_forest" rel="noopener noreferrer"&gt;random forests&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Gradient_boosting" rel="noopener noreferrer"&gt;gradient boosted trees&lt;/a&gt;, which are among the most successful and widely applied classes of machine learning models today. All these models are available in the MLJ toolbox and are trained in the same way as the decision tree.&lt;/p&gt;

&lt;p&gt;Here's a diagram representing what a decision tree, trained on the Titanic dataset, might look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/rAMWj5qLXC85eBy1eV6IAnWIMtAtR3xvoVeIQfg1Wus/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL29w/am9jaW50NGJta3Rl/N3JwbHc5LmpwZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/rAMWj5qLXC85eBy1eV6IAnWIMtAtR3xvoVeIQfg1Wus/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL29w/am9jaW50NGJta3Rl/N3JwbHc5LmpwZw" alt="decision tree" width="465" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, in this model, a male over the age of 9.5 is predicted to die, having a survival probability of 0.17.&lt;/p&gt;

&lt;h2&gt;
  
  
  Package installation
&lt;/h2&gt;

&lt;p&gt;We start by creating a new Julia package environment called &lt;code&gt;titanic&lt;/code&gt;, for tracking versions of the packages we will need. Do this by typing these commands at the &lt;code&gt;julia&amp;gt;&lt;/code&gt; prompt, pressing Enter at the end of each line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Pkg&lt;/span&gt;
&lt;span class="n"&gt;Pkg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activate&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"titanic"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add the packages we need to this environment, enter the &lt;code&gt;]&lt;/code&gt; character at the &lt;code&gt;julia&amp;gt;&lt;/code&gt; prompt, which changes it to &lt;code&gt;(titanic) pkg&amp;gt;&lt;/code&gt;. Then enter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;MLJ&lt;/span&gt; &lt;span class="n"&gt;DataFrames&lt;/span&gt; &lt;span class="n"&gt;BetaML&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It may take a few minutes for these packages to be installed and "precompiled".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip.&lt;/strong&gt; Next time you want to use exactly the same combination of packages in a new Julia session, you can skip the &lt;code&gt;add&lt;/code&gt; command and instead just re-enter the two &lt;code&gt;Pkg&lt;/code&gt; lines shown above.&lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;(titanic) pkg&amp;gt;&lt;/code&gt; prompt returns, enter &lt;code&gt;status&lt;/code&gt; to see the package versions that were installed. Here's what each package does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/" rel="noopener noreferrer"&gt;MLJ&lt;/a&gt; (machine learning toolbox): provides a common interface for interacting with models provided by different packages, and for automating common model-generic tasks, such as &lt;a href="https://en.wikipedia.org/wiki/Hyperparameter_optimization" rel="noopener noreferrer"&gt;hyperparameter optimization&lt;/a&gt; demonstrated at the end of this blog.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dataframes.juliadata.org/stable/" rel="noopener noreferrer"&gt;DataFrames&lt;/a&gt;: Allows you to manipulate tabular data that fits into memory. &lt;strong&gt;Tip.&lt;/strong&gt; Checkout these &lt;a href="https://ahsmart.com/pub/data-wrangling-with-data-frames-jl-cheat-sheet/index.html" rel="noopener noreferrer"&gt;cheatsheets&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/sylvaticus/BetaML.jl" rel="noopener noreferrer"&gt;BetaML&lt;/a&gt;: Provides the core decision algorithm we will be building for Titanic prediction.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learn more about Julia package management &lt;a href="https://docs.julialang.org/en/v1/stdlib/Pkg/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For now, return to the &lt;code&gt;julia&amp;gt;&lt;/code&gt; prompt by pressing the "delete" or "backspace" key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Establishing correct data representation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;MLJ&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrames&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="n"&gt;DF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After entering the first line above, we can use any function appearing in MLJ's documentation just as it appears there. After the second, we can also use functions from DataFrames, but must qualify their names with the prefix &lt;code&gt;DF.&lt;/code&gt;, as we'll see later.&lt;/p&gt;

&lt;p&gt;In MLJ, and some other statistics packages, a &lt;a href="https://juliaai.github.io/ScientificTypes.jl/dev/" rel="noopener noreferrer"&gt;"scientific type"&lt;/a&gt; or &lt;em&gt;scitype&lt;/em&gt; indicates how MLJ will &lt;em&gt;interpret&lt;/em&gt; data (as opposed to how it is represented on your machine). For example, while we have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;typeof&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;Int64&lt;/span&gt;

&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;typeof&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;Bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;scitype&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;but also&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;scitype&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tip.&lt;/strong&gt; To learn more about a Julia command, use the &lt;code&gt;?&lt;/code&gt; character. For example, try typing &lt;code&gt;?scitype&lt;/code&gt; at the &lt;code&gt;julia&amp;gt;&lt;/code&gt; prompt.&lt;/p&gt;

&lt;p&gt;In MLJ, model data requirements are articulated using scitypes, which allows you to focus on what your data represents in the real world, instead of how it is stored on your computer.&lt;/p&gt;

&lt;p&gt;Here are the most common "scalar" scitypes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/jsGnG784ey3efPt5lpB_zq_oo7Y_lSvM3rNwPTkfu6Q/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2Fk/MXpsaXF0emY2ZDY1/eHBjcWNtLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/jsGnG784ey3efPt5lpB_zq_oo7Y_lSvM3rNwPTkfu6Q/rt:fit/w:800/g:sm/q:0/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL2Fk/MXpsaXF0emY2ZDY1/eHBjcWNtLnBuZw" alt="scalar scitypes" width="598" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll grab our Titanic data set from &lt;a href="https://www.openml.org" rel="noopener noreferrer"&gt;OpenML&lt;/a&gt;, a platform for sharing machine learning datasets and workflows. The second line below converts the downloaded data into a dataframe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OpenML&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42638&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DF&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use DataFrames to get summary statistics for the features in our dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;DF&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row&lt;/th&gt;
&lt;th&gt;variable&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;min&lt;/th&gt;
&lt;th&gt;median&lt;/th&gt;
&lt;th&gt;max&lt;/th&gt;
&lt;th&gt;nmissing&lt;/th&gt;
&lt;th&gt;eltype&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;pclass&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;sex&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;female&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;male&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;age&lt;/td&gt;
&lt;td&gt;29.7589&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;30.0&lt;/td&gt;
&lt;td&gt;80.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;sibsp&lt;/td&gt;
&lt;td&gt;0.523008&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;fare&lt;/td&gt;
&lt;td&gt;32.2042&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;14.4542&lt;/td&gt;
&lt;td&gt;512.329&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;cabin&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;E31&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;C148&lt;/td&gt;
&lt;td&gt;687&lt;/td&gt;
&lt;td&gt;Union{Missing, CategoricalValue{…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;embarked&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;S&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Union{Missing, CategoricalValue{…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;survived&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In particular, we see that &lt;code&gt;cabin&lt;/code&gt; has a lot of missing values, and we'll shortly drop it for simplicity.&lt;/p&gt;

&lt;p&gt;To get a summary of feature scitypes, we use &lt;code&gt;schema&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Row&lt;/th&gt;
&lt;th&gt;names&lt;/th&gt;
&lt;th&gt;scitypes&lt;/th&gt;
&lt;th&gt;types&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;pclass&lt;/td&gt;
&lt;td&gt;Multiclass{3}&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;sex&lt;/td&gt;
&lt;td&gt;Multiclass{2}&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;age&lt;/td&gt;
&lt;td&gt;Continuous&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;sibsp&lt;/td&gt;
&lt;td&gt;Continuous&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;fare&lt;/td&gt;
&lt;td&gt;Continuous&lt;/td&gt;
&lt;td&gt;Float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;cabin&lt;/td&gt;
&lt;td&gt;Union{Missing, Multiclass{186}}&lt;/td&gt;
&lt;td&gt;Union{Missing, CategoricalValue{…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;embarked&lt;/td&gt;
&lt;td&gt;Union{Missing, Multiclass{3}}&lt;/td&gt;
&lt;td&gt;Union{Missing, CategoricalValue{…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;survived&lt;/td&gt;
&lt;td&gt;Multiclass{2}&lt;/td&gt;
&lt;td&gt;CategoricalValue{String, UInt32}&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now &lt;code&gt;sibsp&lt;/code&gt; represents the number of siblings/spouses, which is not a continuous variable. So we fix that like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;coerce!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;sibsp&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call &lt;code&gt;schema(df)&lt;/code&gt; again to check that the change was successful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Splitting into train and test sets
&lt;/h2&gt;

&lt;p&gt;To objectively evaluate the performance of our final model, we split off 30% of our data into a &lt;em&gt;holdout set&lt;/em&gt;, called &lt;code&gt;df_test&lt;/code&gt;, which will not be used for training:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check the number of observations in each set with &lt;code&gt;DF.nrow(df)&lt;/code&gt; and &lt;code&gt;DF.nrow(df_test)&lt;/code&gt;.&lt;/p&gt;
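
&lt;p&gt;For example, assuming the full 891-row Titanic table and the 70/30 split above, you should see counts close to these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;julia&amp;gt; DF.nrow(df), DF.nrow(df_test)
(624, 267)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
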

&lt;h2&gt;
  
  
  Splitting data into input features and target
&lt;/h2&gt;

&lt;p&gt;In supervised learning, the &lt;em&gt;target&lt;/em&gt; is the variable we want to predict, in this case &lt;code&gt;survived&lt;/code&gt;. The other features will be inputs to our predictor. The following code puts the &lt;code&gt;df&lt;/code&gt; column with name &lt;code&gt;survived&lt;/code&gt; into the vector &lt;code&gt;y&lt;/code&gt; (the target) and everything else, except &lt;code&gt;cabin&lt;/code&gt;, which we're dropping, into a new dataframe called &lt;code&gt;X&lt;/code&gt; (the input features).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unpack&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;cabin&lt;/span&gt;&lt;span class="x"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; have the expected form by doing &lt;code&gt;schema(X)&lt;/code&gt; and &lt;code&gt;scitype(y)&lt;/code&gt;.&lt;/p&gt;
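
&lt;p&gt;For instance, something like the following is expected (the exact printed form may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;julia&amp;gt; scitype(y)
AbstractVector{Multiclass{2}}

julia&amp;gt; schema(X).names
(:pclass, :sex, :age, :sibsp, :fare, :embarked)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
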

&lt;p&gt;We'll want to do the same for the holdout test set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unpack&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;survived&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;cabin&lt;/span&gt;&lt;span class="x"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Choosing a supervised model
&lt;/h2&gt;

&lt;p&gt;There are not many models that can directly handle missing values and a mixture of scitypes, as we have here. Here's how to list the ones that do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matching&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
 &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConstantClassifier&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MLJModels&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="x"&gt;)&lt;/span&gt;
 &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BetaML&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="x"&gt;)&lt;/span&gt;
 &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeterministicConstantClassifier&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MLJModels&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="x"&gt;)&lt;/span&gt;
 &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;package_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BetaML&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shortcoming can be addressed with data preprocessing &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/model_browser/#Model-Browser" rel="noopener noreferrer"&gt;provided by MLJ&lt;/a&gt; but not covered here, such as one-hot encoding and missing value imputation. We'll settle for the indicated decision tree.&lt;/p&gt;
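
&lt;p&gt;To give the flavour only, here is a minimal sketch of such preprocessing, composing MLJ's built-in &lt;code&gt;FillImputer&lt;/code&gt; and &lt;code&gt;ContinuousEncoder&lt;/code&gt; transformers with a classifier in a pipeline. Here &lt;code&gt;tree&lt;/code&gt; stands for the decision tree model instance constructed below; other imputation strategies or encoders may suit better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch only: impute missing values, encode categorical features as
# continuous (one-hot style), then pass the result to a classifier.
pipe = Pipeline(FillImputer(), ContinuousEncoder(), tree)
mach_pipe = machine(pipe, X, y)
fit!(mach_pipe)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
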

&lt;p&gt;The code for the decision tree model is not available until we explicitly load it, but we can already inspect its documentation. Do this by entering &lt;code&gt;doc("DecisionTreeClassifier", pkg="BetaML")&lt;/code&gt;. (To browse &lt;em&gt;all&lt;/em&gt; MLJ model documentation use the &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/model_browser/#Model-Browser" rel="noopener noreferrer"&gt;Model Browser&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;An MLJ-specific method for loading the model code (and necessary packages) is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;Tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@load&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt; &lt;span class="n"&gt;pkg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BetaML&lt;/span&gt;
&lt;span class="n"&gt;tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Tree&lt;/span&gt;&lt;span class="x"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line loads the model &lt;em&gt;type&lt;/em&gt;, which we've called &lt;code&gt;Tree&lt;/code&gt;; the second creates an object storing default hyperparameters for a &lt;code&gt;Tree&lt;/code&gt; model. This &lt;code&gt;tree&lt;/code&gt; will be displayed thus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_gain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;max_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;splitting_criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BetaML&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gini&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_GLOBAL_RNG&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can specify different hyperparameters like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Tree&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Training the model
&lt;/h2&gt;

&lt;p&gt;We now bind the data to be used for training and the hyperparameter object &lt;code&gt;tree&lt;/code&gt; we just created in a new object called a &lt;em&gt;machine&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;mach&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We train the model on all bound data by calling &lt;code&gt;fit!&lt;/code&gt; on the machine. The exclamation mark &lt;code&gt;!&lt;/code&gt; in &lt;code&gt;fit!&lt;/code&gt; tells us that &lt;code&gt;fit!&lt;/code&gt; mutates (changes) its argument. In this case the model's learned parameters (the actual decision tree) are stored in the &lt;code&gt;mach&lt;/code&gt; object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;fit!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before getting predictions for new inputs, let's start by looking at predictions for the inputs we trained on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that these are &lt;em&gt;probabilistic&lt;/em&gt; predictions. For example, we have&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
           &lt;span class="n"&gt;UnivariateFinite&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Multiclass&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;}}&lt;/span&gt;
     &lt;span class="n"&gt;┌&lt;/span&gt;                                        &lt;span class="n"&gt;┐&lt;/span&gt;
   &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■&lt;/span&gt; &lt;span class="mf"&gt;0.914894&lt;/span&gt;
   &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;┤■■■&lt;/span&gt; &lt;span class="mf"&gt;0.0851064&lt;/span&gt;
     &lt;span class="n"&gt;└&lt;/span&gt;                                        &lt;span class="n"&gt;┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extracting a raw probability requires an extra step. For example, to get the survival probability (&lt;code&gt;1&lt;/code&gt; corresponding to survival and &lt;code&gt;0&lt;/code&gt; to death), we do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mf"&gt;0.0851063829787234&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also get "point" predictions using the &lt;code&gt;mode&lt;/code&gt; function and Julia's broadcasting syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;yhat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;yhat&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="n"&gt;CategoricalArrays&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CategoricalArray&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;UInt32&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
 &lt;span class="s"&gt;"0"&lt;/span&gt;
 &lt;span class="s"&gt;"0"&lt;/span&gt;
 &lt;span class="s"&gt;"1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluating model performance
&lt;/h2&gt;

&lt;p&gt;Let's see how accurate our model is at predicting on the data it trained on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yhat&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mf"&gt;0.921474358974359&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Over 90% accuracy! Better check the accuracy on the test data that the model hasn't seen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;yhat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="x"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yhat&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mf"&gt;0.7790262172284644&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Oh dear. We are most likely &lt;a href="https://en.wikipedia.org/wiki/Overfitting" rel="noopener noreferrer"&gt;overfitting&lt;/a&gt; the model. Still, not a bad first step.&lt;/p&gt;

&lt;p&gt;The evaluation we have just performed is known as &lt;em&gt;holdout&lt;/em&gt; evaluation. MLJ provides tools for automating such evaluations, as well as more sophisticated ones, such as &lt;a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)" rel="noopener noreferrer"&gt;cross-validation&lt;/a&gt;. See &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/getting_started/#Getting-Started" rel="noopener noreferrer"&gt;this simple example&lt;/a&gt; and &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/evaluating_model_performance/" rel="noopener noreferrer"&gt;the detailed documentation&lt;/a&gt; for more information.&lt;/p&gt;
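
&lt;p&gt;For example, here is a minimal sketch of an automated evaluation of our tree using 6-fold cross-validation on the training data (the number of folds is an illustrative choice, not taken from the original analysis):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Cross-validated estimate of accuracy; `evaluate` creates and trains
# machines internally. `predict_mode` converts probabilistic predictions
# to point predictions before the accuracy measure is applied.
evaluate(tree, X, y,
         resampling=CV(nfolds=6),
         measure=accuracy,
         operation=predict_mode)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
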

&lt;h2&gt;
  
  
  Tuning the model
&lt;/h2&gt;

&lt;p&gt;Changing any hyperparameter of our model will alter its performance. In particular, changing certain parameters may mitigate overfitting.&lt;/p&gt;

&lt;p&gt;In MLJ we can "wrap" the model to make it automatically optimize a given hyperparameter, which it does by internally creating its own holdout set for evaluation (or using some other resampling scheme, such as cross-validation) and systematically searching over a specified range of one or more hyperparameters. Let's do that now for our decision tree.&lt;/p&gt;

&lt;p&gt;First, we define a hyperparameter range over which to search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;According to the document string for the decision tree (which we can now retrieve with &lt;code&gt;?Tree&lt;/code&gt;), &lt;code&gt;0&lt;/code&gt; here means "no limit on &lt;code&gt;max_depth&lt;/code&gt;".&lt;/p&gt;

&lt;p&gt;Next, we apply MLJ's &lt;code&gt;TunedModel&lt;/code&gt; &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/tuning_models/" rel="noopener noreferrer"&gt;wrapper&lt;/a&gt; to our tree, specifying the range and the performance measure to use as a basis for optimization, as well as the resampling strategy and the search method (a grid search in this case).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;tuned_tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TunedModel&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tuning&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Grid&lt;/span&gt;&lt;span class="x"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;measure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resampling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Holdout&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fraction_train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt;
&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new model &lt;code&gt;tuned_tree&lt;/code&gt; behaves like the old, except that the &lt;code&gt;max_depth&lt;/code&gt; hyperparameter effectively becomes a &lt;em&gt;learned&lt;/em&gt; parameter instead.&lt;/p&gt;

&lt;p&gt;Training this &lt;code&gt;tuned_tree&lt;/code&gt; actually performs two operations under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Search for the best model using an internally constructed holdout set&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retrain the "best" model on &lt;em&gt;all&lt;/em&gt; available data&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mach2 = machine(tuned_tree, X, y)
fit!(mach2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's how we can see what the optimal model actually is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fitted_params&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach2&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_model&lt;/span&gt;
&lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_gain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;max_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;splitting_criterion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BetaML&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gini&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_GLOBAL_RNG&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, let's test the self-tuning model on our existing holdout set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;yhat2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mach2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yhat2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mf"&gt;0.8164794007490637&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although we cannot assign this outcome statistical significance without a more detailed analysis, it appears to be an improvement on our original &lt;code&gt;max_depth=10&lt;/code&gt; model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning more
&lt;/h2&gt;

&lt;p&gt;Suggestions for learning more about Julia and MLJ are &lt;a href="https://JuliaAI.github.io/MLJ.jl/stable/learning_mlj/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mlj</category>
      <category>tutorial</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
