<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Julia Community 🟣: Alex Tantos</title>
    <description>The latest articles on Julia Community 🟣 by Alex Tantos (@atantos).</description>
    <link>https://forem.julialang.org/atantos</link>
    <image>
      <url>https://forem.julialang.org/images/wMWzxRuo64Ogut3lt3V_UdgEM9aeZmSBtijGJfmMLn4/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L3VzZXIvcHJvZmls/ZV9pbWFnZS83MTgv/M2Y2ZGU2ZjktMTQ1/My00MTcxLWI1ZjQt/ZDdmMmExYmE3YWY0/LmpwZWc</url>
      <title>Julia Community 🟣: Alex Tantos</title>
      <link>https://forem.julialang.org/atantos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.julialang.org/feed/atantos"/>
    <language>en</language>
    <item>
      <title>Handling Strings and GadFly-Plotting while Learning about Zipf's Law</title>
      <dc:creator>Alex Tantos</dc:creator>
      <pubDate>Thu, 10 Nov 2022 11:21:14 +0000</pubDate>
      <link>https://forem.julialang.org/atantos/handling-strings-and-gadfly-plotting-while-learning-zipfs-law-o85</link>
      <guid>https://forem.julialang.org/atantos/handling-strings-and-gadfly-plotting-while-learning-zipfs-law-o85</guid>
      <description>&lt;p&gt;Last week I had to teach Zipf's Law in my Computational Linguistics class using &lt;code&gt;Julia&lt;/code&gt;. This post includes the Julia code I used to demonstrate the law.&lt;/p&gt;

&lt;p&gt;I will not spend time here explaining why this empirically motivated law, which characterizes natural languages, holds. I will confine myself to saying that, before the American linguist George Kingsley Zipf spread the word about it, some brilliant people had already noticed a very interesting pattern: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A few words in a text appear very often, while all the rest appear only a few times, resulting in a distribution that is reminiscent of a power law distribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal of this post, however, is simply to demonstrate the validity of the law using the &lt;a href="https://www.kaggle.com/datasets/8713039e45dd7f1586ecde8057392f518aca089fe30a03d7b0982bd6518616c0?resource=download&amp;amp;select=Musical_instruments_reviews.csv"&gt;Amazon Musical Instruments Reviews&lt;/a&gt; dataset, available on Kaggle. &lt;/p&gt;

&lt;p&gt;Along the way I will be using Julia for handling strings; more specifically, I intend to show how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read in text as a &lt;code&gt;String&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;delete punctuation marks and tokenize a string into word tokens &lt;/li&gt;
&lt;li&gt;create a Dictionary with word frequencies&lt;/li&gt;
&lt;li&gt;sort the Dictionary based on word frequencies&lt;/li&gt;
&lt;li&gt;plot the frequency values as a bar chart to see the &lt;em&gt;power law&lt;/em&gt;-like distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's take these steps one-by-one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading in text as a &lt;code&gt;String&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;I will be using the &lt;code&gt;CSV&lt;/code&gt; and &lt;code&gt;DataFrames&lt;/code&gt; packages to read in the &lt;code&gt;Musical_instruments_reviews.csv&lt;/code&gt; file as a &lt;code&gt;DataFrame&lt;/code&gt; that contains the &lt;strong&gt;Amazon Musical Instruments Reviews&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;DatFrames&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;

&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;instruments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/Users/atantos/Documents/julia/DataFrames/Instrument_reviews_dataframe/Musical_instruments_reviews.csv"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you navigate through the &lt;code&gt;instruments&lt;/code&gt; &lt;code&gt;DataFrame&lt;/code&gt;, you will see that the &lt;code&gt;reviewText&lt;/code&gt; column contains the review texts. Each element of the &lt;code&gt;String&lt;/code&gt; column vector &lt;code&gt;reviewText&lt;/code&gt; is the text of one review. The &lt;code&gt;join()&lt;/code&gt; function joins these texts into a single big &lt;code&gt;String&lt;/code&gt; that we can further manipulate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reviewtext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instruments&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reviewText&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
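
&lt;p&gt;As a minimal sketch of the joining step, the snippet below uses a small hypothetical vector in place of &lt;code&gt;instruments.reviewText&lt;/code&gt;. It also assumes that the column may contain &lt;code&gt;missing&lt;/code&gt; entries, which is not confirmed for the dataset above; &lt;code&gt;skipmissing()&lt;/code&gt; drops them so that the literal word "missing" does not end up in the joined text.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;
```julia
# Hypothetical stand-in for the reviewText column (assumption: it may hold missing values).
reviews = ["Not much to write", missing, "Highly recommended product"]

# skipmissing() lazily skips the missing entries; join() glues the rest with a space.
reviewtext = join(skipmissing(reviews), " ")

println(reviewtext)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;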



&lt;h2&gt;
  
  
  Data Cleaning and Tokenization
&lt;/h2&gt;

&lt;p&gt;The second step is to do some cleaning on the textual data by deleting the punctuation marks and then tokenizing the cleaner output. The following line does both tasks in one step. &lt;code&gt;replace()&lt;/code&gt; takes the joined review texts and replaces four punctuation marks (semicolon, comma, full stop and exclamation mark), matched by the regular expression pattern &lt;code&gt;r"(;|,|\.|!)"&lt;/code&gt;, with the empty string. In other words, it deletes them. &lt;br&gt;
The cleaned text is then tokenized with &lt;code&gt;split()&lt;/code&gt;, using the space character &lt;code&gt;" "&lt;/code&gt; as the splitting criterion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reviewtext_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reviewtext&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="s"&gt;"(;|,|\.|!)"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;940593&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="kt"&gt;Vector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;SubString&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;}}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
 &lt;span class="s"&gt;"Not"&lt;/span&gt;
 &lt;span class="s"&gt;"much"&lt;/span&gt;
 &lt;span class="s"&gt;"to"&lt;/span&gt;
 &lt;span class="s"&gt;"write"&lt;/span&gt;
 &lt;span class="n"&gt;⋮&lt;/span&gt;
 &lt;span class="s"&gt;"recommended"&lt;/span&gt;
 &lt;span class="s"&gt;"product45/5"&lt;/span&gt;
 &lt;span class="s"&gt;"stars"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
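
&lt;p&gt;The output above contains fused tokens such as "product45/5", which can appear when a deleted punctuation mark was the only thing separating two words. One possible variation, sketched below on a hypothetical sample string, is to replace the punctuation marks with a space and to call &lt;code&gt;split()&lt;/code&gt; without a delimiter, so that it splits on any run of whitespace and drops empty tokens. This is an alternative to the line above, not the code used for the results in this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;
```julia
# Hypothetical sample with a missing space after a full stop.
sample = "Great product.Totally worth it! Not much, to write."

# Replace the same four punctuation marks with a space instead of deleting them.
cleaned = replace(sample, r"[;,.!]" => " ")

# split() without a delimiter splits on whitespace and omits empty strings.
tokens = split(cleaned)

println(tokens)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;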



&lt;h2&gt;
  
  
  Creating a Word-Frequency Dictionary
&lt;/h2&gt;

&lt;p&gt;The most well-known counting method in &lt;code&gt;Julia&lt;/code&gt; is based on the &lt;code&gt;StatsBase.countmap()&lt;/code&gt; function, which outputs a Dictionary with words and their frequencies. The first step is to bring &lt;code&gt;StatsBase&lt;/code&gt;'s functionality into the current namespace; after that we may use its exported function &lt;code&gt;countmap()&lt;/code&gt; without the package qualification, meaning that we don't need the &lt;code&gt;package_name.method()&lt;/code&gt; notation, as in &lt;code&gt;StatsBase.countmap()&lt;/code&gt;. &lt;code&gt;countmap()&lt;/code&gt; takes a vector of values of any type and returns a dictionary whose keys are the distinct vector elements (the words in our case) and whose values are the number of times each of them occurs in the initial vector &lt;code&gt;reviewtext_tokens&lt;/code&gt;. Below, &lt;code&gt;word_dict&lt;/code&gt; is such a dictionary: the keys on the left of the right-pointing arrow are the elements of its argument, &lt;code&gt;reviewtext_tokens&lt;/code&gt;, and the values on the right are their occurrence frequencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;StatsBase&lt;/span&gt;

&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;word_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;countmap&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reviewtext_tokens&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="kt"&gt;Dict&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;SubString&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;},&lt;/span&gt; &lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="mi"&gt;43812&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;"B0002E2EOE)"&lt;/span&gt;           &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="s"&gt;"itPS"&lt;/span&gt;                  &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="s"&gt;"tunerYes"&lt;/span&gt;              &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="s"&gt;"optionLEVY'S"&lt;/span&gt;          &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="s"&gt;"whiz"&lt;/span&gt;                  &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="s"&gt;"simultaneouslyTotally"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="s"&gt;"gathered"&lt;/span&gt;              &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="n"&gt;⋮&lt;/span&gt;                       &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;⋮&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
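
&lt;p&gt;To make the counting step concrete, here is a minimal sketch of what a function like &lt;code&gt;countmap()&lt;/code&gt; computes, written with Base Julia only. The helper name &lt;code&gt;count_tokens&lt;/code&gt; and the sample tokens are hypothetical, and this is not a description of how &lt;code&gt;StatsBase&lt;/code&gt; implements it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;
```julia
# Hypothetical helper: count how many times each token occurs.
function count_tokens(tokens)
    counts = Dict{String, Int}()
    for token in tokens
        # get() returns 0 for a token that has not been seen yet.
        counts[token] = get(counts, token, 0) + 1
    end
    return counts
end

sample_tokens = ["the", "tuner", "the", "strap", "the", "tuner"]
word_counts = count_tokens(sample_tokens)

println(word_counts["the"])
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;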



&lt;p&gt;Recall that according to Zipf's Law, &lt;em&gt;a few words are very common in a text or corpus of texts and the rest occur very rarely&lt;/em&gt;. To be able to visualize this asymmetry in the word frequency distribution, we first need to sort the words by frequency in decreasing order. Sorting the entries of a dictionary by their values is easy in &lt;code&gt;Julia&lt;/code&gt;. Since a dictionary has no guaranteed order, &lt;code&gt;collect()&lt;/code&gt; first turns it into a vector of key-value pairs. The &lt;code&gt;sort()&lt;/code&gt; function then takes an anonymous function, &lt;code&gt;x-&amp;gt;x[2]&lt;/code&gt;, through the keyword argument &lt;code&gt;by&lt;/code&gt;, so that the pairs are compared by their values. Notice that for sorting in decreasing order you need to set the &lt;code&gt;rev&lt;/code&gt; argument to &lt;code&gt;true&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sorted_word_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word_dict&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rev&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;43812&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="kt"&gt;Vector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;Pair&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;SubString&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;},&lt;/span&gt; &lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
          &lt;span class="s"&gt;"the"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;39206&lt;/span&gt;
            &lt;span class="s"&gt;"a"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;27175&lt;/span&gt;
          &lt;span class="s"&gt;"and"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;26223&lt;/span&gt;
            &lt;span class="s"&gt;"I"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;25333&lt;/span&gt;
                &lt;span class="n"&gt;⋮&lt;/span&gt;
      &lt;span class="s"&gt;"things)"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="s"&gt;"Tone-master"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
 &lt;span class="s"&gt;"onesMaterial"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
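
&lt;p&gt;The same sorting step can be tried on a small hypothetical dictionary. The sketch below uses &lt;code&gt;last&lt;/code&gt; as the &lt;code&gt;by&lt;/code&gt; function, which returns the second member of a &lt;code&gt;Pair&lt;/code&gt; and is therefore equivalent to &lt;code&gt;x-&amp;gt;x[2]&lt;/code&gt; here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;
```julia
# Hypothetical word counts; the real ones come from countmap() above.
toy_counts = Dict("tuner" => 2, "the" => 5, "a" => 3)

# collect() turns the Dict into a Vector of Pairs, which sort() can order.
ranked = sort(collect(toy_counts), by = last, rev = true)

println(ranked)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;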



&lt;p&gt;The sorted result comes with a small complication that one needs to be aware of. Despite its name, &lt;code&gt;sorted_word_dict&lt;/code&gt; is not a dictionary but a &lt;code&gt;Vector&lt;/code&gt; of &lt;code&gt;Pair&lt;/code&gt;s, as the output above shows. Calling &lt;code&gt;keys()&lt;/code&gt; on it returns the indices of the sorted pairs, i.e. their ranks, and calling &lt;code&gt;values()&lt;/code&gt; returns the key-value pairs of the initial dictionary. This means that in order to access the frequencies of the initial unsorted dictionary, you need to go through the elements of the sorted vector and ask for the second member of each pair with &lt;code&gt;getindex(value, 2)&lt;/code&gt;. Let's see in practice what I mean by that. Here is the array of keys of &lt;code&gt;sorted_word_dict&lt;/code&gt;, which contains all the ranking indices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted_word_dict&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
&lt;span class="mi"&gt;43812&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="kt"&gt;Vector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
     &lt;span class="mi"&gt;1&lt;/span&gt;
     &lt;span class="mi"&gt;2&lt;/span&gt;
     &lt;span class="mi"&gt;3&lt;/span&gt;
     &lt;span class="mi"&gt;4&lt;/span&gt;
     &lt;span class="n"&gt;⋮&lt;/span&gt;
 &lt;span class="mi"&gt;43810&lt;/span&gt;
 &lt;span class="mi"&gt;43811&lt;/span&gt;
 &lt;span class="mi"&gt;43812&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The values of &lt;code&gt;sorted_word_dict&lt;/code&gt;, on the other hand, are the key-value pairs of the initial dictionary &lt;code&gt;word_dict&lt;/code&gt;, as you can see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted_word_dict&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
&lt;span class="mi"&gt;43812&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="kt"&gt;Vector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;Pair&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;SubString&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;},&lt;/span&gt; &lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
          &lt;span class="s"&gt;"the"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;39206&lt;/span&gt;
            &lt;span class="s"&gt;"a"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;27175&lt;/span&gt;
          &lt;span class="s"&gt;"and"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;26223&lt;/span&gt;
            &lt;span class="s"&gt;"I"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;25333&lt;/span&gt;
                &lt;span class="n"&gt;⋮&lt;/span&gt;
      &lt;span class="s"&gt;"things)"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="s"&gt;"Tone-master"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
 &lt;span class="s"&gt;"onesMaterial"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keeping the internal structure of &lt;code&gt;sorted_word_dict&lt;/code&gt; in mind, below we ask for the value of each key-value pair that lives within &lt;code&gt;sorted_word_dict&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;freqs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;getindex&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted_word_dict&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
&lt;span class="mi"&gt;43812&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="kt"&gt;Vector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
 &lt;span class="mi"&gt;39206&lt;/span&gt;
 &lt;span class="mi"&gt;27175&lt;/span&gt;
 &lt;span class="mi"&gt;26223&lt;/span&gt;
 &lt;span class="mi"&gt;25333&lt;/span&gt;
     &lt;span class="n"&gt;⋮&lt;/span&gt;
     &lt;span class="mi"&gt;1&lt;/span&gt;
     &lt;span class="mi"&gt;1&lt;/span&gt;
     &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To access the keys, i.e. the words, you need the first member of each pair of the initial unsorted dictionary stored within &lt;code&gt;sorted_word_dict&lt;/code&gt;, which you get with &lt;code&gt;getindex(value, 1)&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;getindex&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted_word_dict&lt;/span&gt;&lt;span class="x"&gt;)]&lt;/span&gt;
&lt;span class="mi"&gt;43812&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="kt"&gt;Vector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;SubString&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;}}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
 &lt;span class="s"&gt;"the"&lt;/span&gt;
 &lt;span class="s"&gt;"a"&lt;/span&gt;
 &lt;span class="s"&gt;"and"&lt;/span&gt;
 &lt;span class="s"&gt;"I"&lt;/span&gt;
 &lt;span class="n"&gt;⋮&lt;/span&gt;
 &lt;span class="s"&gt;"things)"&lt;/span&gt;
 &lt;span class="s"&gt;"Tone-master"&lt;/span&gt;
 &lt;span class="s"&gt;"onesMaterial"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
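
&lt;p&gt;As an alternative to the two comprehensions above, the sketch below extracts the same kind of vectors by broadcasting &lt;code&gt;first&lt;/code&gt; and &lt;code&gt;last&lt;/code&gt; over a vector of pairs. The &lt;code&gt;ranked&lt;/code&gt; vector is a hypothetical stand-in for &lt;code&gt;sorted_word_dict&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;
```julia
# Hypothetical stand-in for sorted_word_dict: a Vector of word => frequency Pairs.
ranked = ["the" => 5, "a" => 3, "tuner" => 2]

# The dot broadcasts each function over every Pair in the vector.
toy_words = first.(ranked)   # first member of each pair: the words
toy_freqs = last.(ranked)    # second member of each pair: the frequencies

println(toy_words)
println(toy_freqs)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;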



&lt;h2&gt;
  
  
  Barplotting with Gadfly
&lt;/h2&gt;

&lt;p&gt;One of the most well-known &lt;code&gt;Julia&lt;/code&gt; plotting packages is &lt;code&gt;Gadfly&lt;/code&gt;. Being a fan of &lt;code&gt;R&lt;/code&gt; and its powerful &lt;code&gt;ggplot&lt;/code&gt; package, I found navigating &lt;code&gt;Gadfly&lt;/code&gt; a breeze.¹ Here is how you could draw a bar plot of the sorted frequencies of the 40 most frequent words, keeping the words as labels on the x-axis and using the &lt;code&gt;dodge&lt;/code&gt; position.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Gadfly&lt;/span&gt;

&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Gadfly&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;freqs&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Geom&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="o"&gt;=:&lt;/span&gt;&lt;span class="n"&gt;dodge&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://forem.julialang.org/images/J2eThUZvPWx9HdfFY3RPEfWIz2kv2Gt9folYdWgrzvo/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzJ5/emF3MW0wcGJjaXVs/eHY0ZHg1LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/J2eThUZvPWx9HdfFY3RPEfWIz2kv2Gt9folYdWgrzvo/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzLzJ5/emF3MW0wcGJjaXVs/eHY0ZHg1LnBuZw" alt="Image description" width="880" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the shape of the distribution is consistent with Zipf's Law and thus supports it empirically.&lt;/p&gt;
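
&lt;p&gt;A bar chart only gives a visual impression. One common way to check a power-law-like shape numerically is to fit a straight line to the logarithm of the frequencies against the logarithm of the ranks; a slope near -1 is what the idealized form of Zipf's Law predicts. The sketch below is only an illustration under stated assumptions: it uses idealized hypothetical frequencies rather than the review data, and the helper name &lt;code&gt;loglog_slope&lt;/code&gt; is made up for this example. With the data of this post, one could pass &lt;code&gt;freqs&lt;/code&gt; instead.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;
```julia
using Statistics  # standard library, provides mean()

# Hypothetical helper: least-squares slope of log(frequency) against log(rank).
function loglog_slope(frequencies)
    x = log.(1:length(frequencies))   # log of the ranks 1, 2, 3, ...
    y = log.(frequencies)
    dx = x .- mean(x)
    dy = y .- mean(y)
    return sum(dx .* dy) / sum(dx .^ 2)
end

# Idealized frequencies proportional to 1/rank (not the review data).
ideal_freqs = 1000 ./ (1:50)

println(loglog_slope(ideal_freqs))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;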

&lt;p&gt;[1]: Roland Schaetzle wrote an excellent post on &lt;a href="https://towardsdatascience.com/statistical-plotting-with-julia-gadfly-jl-39582f91d7cc"&gt;TDS&lt;/a&gt; that is highly recommended by the &lt;code&gt;Julia&lt;/code&gt; community for those who migrate from &lt;code&gt;R&lt;/code&gt; to &lt;code&gt;Julia&lt;/code&gt; and want to have a similar plotting experience to ggplot.&lt;/p&gt;

</description>
      <category>strings</category>
      <category>zipf</category>
      <category>gadfly</category>
    </item>
    <item>
      <title>Working with nested JSON strings/files in Julia</title>
      <dc:creator>Alex Tantos</dc:creator>
      <pubDate>Wed, 05 Oct 2022 18:35:54 +0000</pubDate>
      <link>https://forem.julialang.org/atantos/working-with-nested-json-stringsfiles-in-julia-42a7</link>
      <guid>https://forem.julialang.org/atantos/working-with-nested-json-stringsfiles-in-julia-42a7</guid>
      <description>&lt;p&gt;&lt;a href="https://forem.julialang.org/images/QcNcLb8scKPTClS_8HVr2fGga6Ov5v_kwWNQs4txrpA/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3Vm/aGNlMWRxNDZqazJh/dmU5Z3BhLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/QcNcLb8scKPTClS_8HVr2fGga6Ov5v_kwWNQs4txrpA/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3Vm/aGNlMWRxNDZqazJh/dmU5Z3BhLnBuZw" alt="Image description" width="880" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why and when is &lt;code&gt;JSON&lt;/code&gt; used?
&lt;/h2&gt;

&lt;p&gt;Nowadays, a great deal of software in all kinds of scientific and business fields outputs data that is expected to be further analysed or manipulated. Data exchange between different platforms, software and languages is prevalent in a data analyst's daily routine, but it also creates all sorts of issues that can be subsumed under the umbrella of the so-called interoperability problem. This is exactly the problem of finding a common data exchange format between different platforms, software and languages.&lt;br&gt;
&lt;em&gt;JavaScript Object Notation&lt;/em&gt; (&lt;code&gt;JSON&lt;/code&gt;) is becoming more and more popular as the data-exchange format that addresses this problem. At least in NLP, a data-intensive field, &lt;code&gt;JSON&lt;/code&gt; strings/files are ubiquitous. &lt;code&gt;JSON&lt;/code&gt; is a lightweight, human-readable, text-based serialization format that is easy to manipulate, i.e. &lt;code&gt;JSON&lt;/code&gt; strings can easily be parsed and generated.&lt;/p&gt;
&lt;h2&gt;
  
  
  The JSON String and the Goal
&lt;/h2&gt;

&lt;p&gt;The great power of this hierarchical way of representing data is that it allows arbitrarily many layers of nested information. Let's take a real-life scenario of extracting specific values out of deeply nested attributes in a &lt;code&gt;JSON&lt;/code&gt; string. For a large-scale annotation project, our team has been working on &lt;code&gt;JSON&lt;/code&gt; strings/files output by the &lt;a href="https://www.tagtog.com"&gt;Tagtog&lt;/a&gt; platform¹. Here is a short &lt;code&gt;JSON&lt;/code&gt; string that I will be using for this post:&lt;br&gt;
&lt;br&gt;
 &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;jsonstr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"""
{
  "&lt;/span&gt;&lt;span class="err"&gt;annotatable&lt;/span&gt;&lt;span class="s2"&gt;": {
    "&lt;/span&gt;&lt;span class="err"&gt;parts&lt;/span&gt;&lt;span class="s2"&gt;": [
      "&lt;/span&gt;&lt;span class="err"&gt;s&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;v&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="s2"&gt;"
     ]
  },
  "&lt;/span&gt;&lt;span class="err"&gt;anncomplete&lt;/span&gt;&lt;span class="s2"&gt;": true,
  "&lt;/span&gt;&lt;span class="err"&gt;sources&lt;/span&gt;&lt;span class="s2"&gt;": [],
  "&lt;/span&gt;&lt;span class="err"&gt;metas&lt;/span&gt;&lt;span class="s2"&gt;": {},
  "&lt;/span&gt;&lt;span class="err"&gt;relations&lt;/span&gt;&lt;span class="s2"&gt;": [],
  "&lt;/span&gt;&lt;span class="err"&gt;entities&lt;/span&gt;&lt;span class="s2"&gt;": [
    {
      "&lt;/span&gt;&lt;span class="err"&gt;classId&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;e_&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="s2"&gt;",
      "&lt;/span&gt;&lt;span class="err"&gt;part&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;s&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;v&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="s2"&gt;",
      "&lt;/span&gt;&lt;span class="err"&gt;offsets&lt;/span&gt;&lt;span class="s2"&gt;": [
        {
          "&lt;/span&gt;&lt;span class="err"&gt;start&lt;/span&gt;&lt;span class="s2"&gt;": **263**,
          "&lt;/span&gt;&lt;span class="err"&gt;text&lt;/span&gt;&lt;span class="s2"&gt;": **"&lt;/span&gt;&lt;span class="err"&gt;θελω&lt;/span&gt;&lt;span class="s2"&gt;"**
        }
      ],
      "&lt;/span&gt;&lt;span class="err"&gt;coordinates&lt;/span&gt;&lt;span class="s2"&gt;": [],
      "&lt;/span&gt;&lt;span class="err"&gt;confidence&lt;/span&gt;&lt;span class="s2"&gt;": {
        "&lt;/span&gt;&lt;span class="err"&gt;state&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;pre-added&lt;/span&gt;&lt;span class="s2"&gt;",
        "&lt;/span&gt;&lt;span class="err"&gt;who&lt;/span&gt;&lt;span class="s2"&gt;": [
          "&lt;/span&gt;&lt;span class="err"&gt;user:alextantos&lt;/span&gt;&lt;span class="s2"&gt;",
        ],
        "&lt;/span&gt;&lt;span class="err"&gt;prob&lt;/span&gt;&lt;span class="s2"&gt;": 1
      },
      "&lt;/span&gt;&lt;span class="err"&gt;fields&lt;/span&gt;&lt;span class="s2"&gt;": {
        **"&lt;/span&gt;&lt;span class="err"&gt;f_&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="s2"&gt;"**: {
          "&lt;/span&gt;&lt;span class="err"&gt;value&lt;/span&gt;&lt;span class="s2"&gt;": **"&lt;/span&gt;&lt;span class="err"&gt;desire&lt;/span&gt;&lt;span class="s2"&gt;"**,
          "&lt;/span&gt;&lt;span class="err"&gt;confidence&lt;/span&gt;&lt;span class="s2"&gt;": {
            "&lt;/span&gt;&lt;span class="err"&gt;state&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;pre-added&lt;/span&gt;&lt;span class="s2"&gt;",
            "&lt;/span&gt;&lt;span class="err"&gt;who&lt;/span&gt;&lt;span class="s2"&gt;": [
              "&lt;/span&gt;&lt;span class="err"&gt;user:alextantos&lt;/span&gt;&lt;span class="s2"&gt;"
            ],
            "&lt;/span&gt;&lt;span class="err"&gt;prob&lt;/span&gt;&lt;span class="s2"&gt;": 1
          }
        }
      },
      "&lt;/span&gt;&lt;span class="err"&gt;normalizations&lt;/span&gt;&lt;span class="s2"&gt;": {}
    },
    {
      "&lt;/span&gt;&lt;span class="err"&gt;classId&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;e_&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="s2"&gt;",
      "&lt;/span&gt;&lt;span class="err"&gt;part&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;s&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;v&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="s2"&gt;",
      "&lt;/span&gt;&lt;span class="err"&gt;offsets&lt;/span&gt;&lt;span class="s2"&gt;": [
        {
          "&lt;/span&gt;&lt;span class="err"&gt;start&lt;/span&gt;&lt;span class="s2"&gt;": **271**,
          "&lt;/span&gt;&lt;span class="err"&gt;text&lt;/span&gt;&lt;span class="s2"&gt;": **"&lt;/span&gt;&lt;span class="err"&gt;σου&lt;/span&gt;&lt;span class="s2"&gt;"**
        }
      ],
      "&lt;/span&gt;&lt;span class="err"&gt;coordinates&lt;/span&gt;&lt;span class="s2"&gt;": [],
      "&lt;/span&gt;&lt;span class="err"&gt;confidence&lt;/span&gt;&lt;span class="s2"&gt;": {
        "&lt;/span&gt;&lt;span class="err"&gt;state&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;pre-added&lt;/span&gt;&lt;span class="s2"&gt;",
        "&lt;/span&gt;&lt;span class="err"&gt;who&lt;/span&gt;&lt;span class="s2"&gt;": [
          "&lt;/span&gt;&lt;span class="err"&gt;user:alextantos&lt;/span&gt;&lt;span class="s2"&gt;"
        ],
        "&lt;/span&gt;&lt;span class="err"&gt;prob&lt;/span&gt;&lt;span class="s2"&gt;": 1
      },
      "&lt;/span&gt;&lt;span class="err"&gt;fields&lt;/span&gt;&lt;span class="s2"&gt;": {
        **"&lt;/span&gt;&lt;span class="err"&gt;f_&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="s2"&gt;"**: {
          "&lt;/span&gt;&lt;span class="err"&gt;value&lt;/span&gt;&lt;span class="s2"&gt;": **"&lt;/span&gt;&lt;span class="err"&gt;second_person_weak&lt;/span&gt;&lt;span class="s2"&gt;"**,
          "&lt;/span&gt;&lt;span class="err"&gt;confidence&lt;/span&gt;&lt;span class="s2"&gt;": {
            "&lt;/span&gt;&lt;span class="err"&gt;state&lt;/span&gt;&lt;span class="s2"&gt;": "&lt;/span&gt;&lt;span class="err"&gt;pre-added&lt;/span&gt;&lt;span class="s2"&gt;",
            "&lt;/span&gt;&lt;span class="err"&gt;who&lt;/span&gt;&lt;span class="s2"&gt;": [
              "&lt;/span&gt;&lt;span class="err"&gt;user:alextantos&lt;/span&gt;&lt;span class="s2"&gt;"
            ],
            "&lt;/span&gt;&lt;span class="err"&gt;prob&lt;/span&gt;&lt;span class="s2"&gt;": 1
          }
        }
      },
      "&lt;/span&gt;&lt;span class="err"&gt;normalizations&lt;/span&gt;&lt;span class="s2"&gt;": {}
    }
  ]
}
"""&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The goal is to extract the asterisk-surrounded information in the code chunk above and end up with the following tabular data:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Converting the &lt;code&gt;JSON&lt;/code&gt; String to an All-Inclusive &lt;code&gt;DataFrame&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Before unwrapping and extracting the relevant information out of the &lt;code&gt;JSON&lt;/code&gt; string, let's first convert it to an all-inclusive &lt;code&gt;DataFrame&lt;/code&gt; that contains all the layers of information. Aside from &lt;code&gt;DataFrames&lt;/code&gt; and &lt;code&gt;Chain&lt;/code&gt;, the relevant packages I will be using for &lt;code&gt;JSON&lt;/code&gt; string manipulation are &lt;code&gt;JSON3&lt;/code&gt; and &lt;code&gt;JSONTables&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  A few words about the relevant packages
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;JSON3&lt;/strong&gt;&lt;br&gt;
This package provides two main functions: &lt;code&gt;JSON3.read()&lt;/code&gt; and &lt;code&gt;JSON3.write()&lt;/code&gt;. &lt;code&gt;JSON3.read()&lt;/code&gt; converts a JSON string into a &lt;code&gt;JSON3.Object&lt;/code&gt; or a &lt;code&gt;JSON3.Array&lt;/code&gt;. The major advantage of these two types is that they both allow dot or bracket indexing into the parsed &lt;code&gt;JSON&lt;/code&gt; content. Moreover, they may be further converted to &lt;code&gt;Dict&lt;/code&gt; or even &lt;code&gt;Vector&lt;/code&gt; objects. &lt;/p&gt;
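
&lt;p&gt;To make the indexing claim concrete, here is a minimal sketch on a tiny, made-up &lt;code&gt;JSON&lt;/code&gt; string (it is not the Tagtog output itself, just a hypothetical fragment shaped like it). It only assumes that &lt;code&gt;JSON3&lt;/code&gt; is installed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using JSON3

# Hypothetical fragment, loosely modeled on one entity of the Tagtog output
obj = JSON3.read("""{"part": "s1v1", "offsets": [{"start": 263, "text": "θελω"}]}""")

part = obj.part                  # dot indexing on a JSON3.Object
word = obj[:offsets][1][:text]   # bracket indexing: object key, array position, object key
d = copy(obj)                    # a mutable copy as a regular Dict
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both access styles read from the same parsed object, so which one to use is mostly a matter of taste.&lt;/p&gt;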

&lt;p&gt;&lt;strong&gt;JSONTables&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;README.md&lt;/code&gt; file of the &lt;code&gt;JSONTables&lt;/code&gt; repo says it all. So, this package &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;provides a JSON integration with the Tables.jl interface, that is, it provides the jsontable function as a way to treat a JSON object of arrays, or a JSON array of objects, as a Tables.jl-compatible source. This allows, among other things, loading JSON "tabular" data into a DataFrame, or a JuliaDB.jl table, or written out directly as a csv file.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  JSON string =&amp;gt; DataFrame 
&lt;/h3&gt;

&lt;p&gt;There are three steps we need to follow so that a JSON string is converted into a DataFrame.&lt;br&gt;
&lt;strong&gt;First step&lt;/strong&gt;: Reading in &lt;code&gt;jsonstr&lt;/code&gt; with &lt;code&gt;JSON3.read()&lt;/code&gt; (recall that &lt;code&gt;jsonstr&lt;/code&gt; is created in the first section)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;Chain&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataFrames&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JSON3&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JSONTables&lt;/span&gt;
&lt;span class="n"&gt;json3str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;JSON3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonstr&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jsonlines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Something I did not mention above is that the &lt;code&gt;JSON&lt;/code&gt; strings output by the Tagtog platform adopt the &lt;em&gt;JSON Lines&lt;/em&gt; text file format, a popular, slightly modified version of &lt;code&gt;JSON&lt;/code&gt; in which values are delimited by the line separator, &lt;code&gt;'\n'&lt;/code&gt;. Notice that the &lt;code&gt;jsonlines&lt;/code&gt; argument above is set to &lt;code&gt;true&lt;/code&gt; precisely so that the &lt;em&gt;JSON Lines&lt;/em&gt; text file format is handled correctly. &lt;/p&gt;
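
&lt;p&gt;As a small illustration of what &lt;code&gt;jsonlines=true&lt;/code&gt; does, the sketch below reads two made-up records, one &lt;code&gt;JSON&lt;/code&gt; value per line. The records are hypothetical and much simpler than a real Tagtog export.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using JSON3

# Two hypothetical records joined by the line separator '\n'
lines = join(["""{"text": "θελω", "start": 263}""",
              """{"text": "σου", "start": 271}"""], '\n')

# With jsonlines=true every line is parsed as its own JSON value
records = JSON3.read(lines; jsonlines=true)
n = length(records)
second_word = records[2].text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;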

&lt;p&gt;&lt;strong&gt;Second step&lt;/strong&gt;: Converting a JSON3 string into a Tables.jl-compatible object.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;json3table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jsontable&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json3str&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Third step&lt;/strong&gt;: Converting json3table into a DataFrame object.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;json3df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json3table&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let's have a look at the result:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;As expected, the output is a mess. The reason is that the &lt;code&gt;JSON&lt;/code&gt; string we started with, &lt;code&gt;jsonstr&lt;/code&gt;, has been unwrapped on its first level only. As a result, the five columns of &lt;code&gt;json3df&lt;/code&gt; map to the outermost shell of the initial complex &lt;code&gt;JSON&lt;/code&gt; string. Moreover, the first four of them are not interesting to us, since they are present only for metadata record-keeping purposes.&lt;/p&gt;
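
&lt;p&gt;For contrast, here is a sketch of the same three steps applied to a flat &lt;code&gt;JSON&lt;/code&gt; array of objects, where there is nothing left to unwrap. The input string is a hypothetical toy example, not part of the Tagtog output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using DataFrames, JSON3, JSONTables

# Hypothetical flat input: an array of objects that all share the same keys
flat = """[{"text": "θελω", "start": 263}, {"text": "σου", "start": 271}]"""

# read, then wrap as a Tables.jl source, then build the DataFrame
df = DataFrame(jsontable(JSON3.read(flat)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here every key becomes a column and every object becomes a row, which is the tidy shape we are aiming for with the nested data as well.&lt;/p&gt;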

&lt;h3&gt;
  
  
  Focusing on the Relevant Attribute
&lt;/h3&gt;

&lt;p&gt;Now, if we observe the initial &lt;code&gt;JSON&lt;/code&gt; string, to be able to extract the eight pieces of information that we are interested in, we should laser-focus on the &lt;code&gt;:entities&lt;/code&gt; attribute column that includes that complicated &lt;code&gt;JSON3.Array&lt;/code&gt; object. One could potentially do that with &lt;code&gt;json3df[1,:entities]&lt;/code&gt;, but, as I just mentioned, this returns a &lt;code&gt;JSON3.Array&lt;/code&gt; object that is not compliant with the &lt;code&gt;Tables.jl&lt;/code&gt; interface and, thus, cannot be converted to a &lt;code&gt;DataFrame&lt;/code&gt; directly. But this is an easy step to take; we simply use &lt;code&gt;jsontable()&lt;/code&gt; and &lt;code&gt;DataFrame()&lt;/code&gt; as done below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;json3dfclear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsontable&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json3df&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="x"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Going even deeper into json3dfclear 
&lt;/h2&gt;

&lt;p&gt;The last part of this journey leads to the initial goal, namely to extract the eight asterisk-surrounded pieces of information within the &lt;code&gt;jsonstr&lt;/code&gt; object of the first section, above, and put them in separate columns of a &lt;code&gt;DataFrame&lt;/code&gt;. So, here is the code:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;json3dfclear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@chain&lt;/span&gt; &lt;span class="n"&gt;json3dfclear&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt;
    &lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ByRow&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vcat&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)))&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;field_ids&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ByRow&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vcat&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)))&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;field_values&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ByRow&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vcat&lt;/span&gt;&lt;span 
class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)))&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;offset_values&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;field_values&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ByRow&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vcat&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)))&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;field_name&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;rest&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;offset_values&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ByRow&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vcat&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)))&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;field_ids&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;field_name&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The first &lt;code&gt;select&lt;/code&gt; call takes the &lt;code&gt;:fields&lt;/code&gt; column, reshapes the keys and the values of the &lt;code&gt;JSON3&lt;/code&gt; objects that it contains, and names the newly created columns &lt;code&gt;:field_ids&lt;/code&gt; and &lt;code&gt;:field_values&lt;/code&gt;, respectively. This extracts the third and fourth pieces of information, i.e. the field id and the field value. The &lt;code&gt;:offsets&lt;/code&gt; column is selected in the same call so that the values of the offsets are picked up. The next step is a series of transformations on the &lt;code&gt;:field_values&lt;/code&gt; and &lt;code&gt;:offset_values&lt;/code&gt; columns, which also hold &lt;code&gt;JSON3&lt;/code&gt; objects, to bundle all the relevant information that we want to have. So, again by reshaping the data and using the &lt;code&gt;vcat()&lt;/code&gt; function, the values are easily extracted. For an in-depth understanding of the objects that are created and included in the DataFrame &lt;code&gt;json3dfclear&lt;/code&gt;, pay attention to its structure and contents.&lt;/p&gt;
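
&lt;p&gt;The key/value unwrapping can be tried in isolation. The sketch below uses plain &lt;code&gt;Dict&lt;/code&gt; objects as stand-ins for the &lt;code&gt;JSON3&lt;/code&gt; objects of a single row (an assumption made only to keep the example free of package dependencies), and it uses &lt;code&gt;only()&lt;/code&gt;, a &lt;code&gt;Base&lt;/code&gt; function that is not part of the pipeline above, to make explicit that each container holds exactly one entry.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# Plain Dicts standing in for the :fields and :offsets cells of one row
fields  = Dict(:f_26 =&amp;gt; Dict(:value =&amp;gt; "desire"))
offsets = [Dict(:start =&amp;gt; 263, :text =&amp;gt; "θελω")]

field_id    = only(keys(fields))            # the single key of the outer object
field_value = only(values(fields))[:value]  # the value stored under that key
offset      = only(offsets)[:start]
text        = only(offsets)[:text]

# With a single entry, reduce(vcat, ...) simply hands back that entry
same_id = reduce(vcat, keys(fields)) == field_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;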

&lt;p&gt;Lastly, we select only the relevant columns that result in the following table:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;[1]: Tagtog is a great web-based annotation platform with a bunch of nice features for collaborative annotation that I have been using for my corpus linguistics classes. I strongly recommend you visit their website.&lt;/p&gt;

</description>
      <category>json</category>
      <category>json3</category>
      <category>jsontables</category>
      <category>dataframes</category>
    </item>
    <item>
      <title>Finding (Semantically) Similar Vectors with Julia is Easy: The First Step</title>
      <dc:creator>Alex Tantos</dc:creator>
      <pubDate>Thu, 08 Sep 2022 12:54:23 +0000</pubDate>
      <link>https://forem.julialang.org/atantos/finding-semantically-similar-vectors-with-julia-is-easy-the-first-step-32ch</link>
      <guid>https://forem.julialang.org/atantos/finding-semantically-similar-vectors-with-julia-is-easy-the-first-step-32ch</guid>
      <description>&lt;h1&gt;
  
  
  Why Should You Care for Semantic Similarity?
&lt;/h1&gt;

&lt;p&gt;Recall the last time you read a Medium post or an article in a newspaper. Even before you started reading that text you had specific expectations as to the kind of vocabulary used in it and the type of terminology that you would meet. That set of expectations has been built up by the way you learned to classify the world around you, the relations between people and society, the ideologies that they carry etc. All these expectations are embodied in the actual language of the text, and your ability to use them to choose what to read and what to ignore is closely related to tracing semantically similar words, phrases, paragraphs and texts. This valuable human skill of tracing semantically similar things is highly desirable in a number of scientific fields and practical applications. However, nowadays, more often than not, human intuition is not the right tool to approach big datasets or to reveal hidden aspects of even moderate datasets. And this is one of the many occasions in life where maths saves us by offering us &lt;strong&gt;objectively defined ways to measure semantic similarity before deciding when two pieces of data are semantically similar&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There is a wide range of application areas for semantic similarity, as already mentioned. In NLP as well as in any other fields where string/byte sequencing is central for data analysis and modeling, such as in computational biology where GC content of DNA sequences is recorded and analyzed or in image processing whereby tracing byte sequence patterns is important, one very important first step is &lt;strong&gt;to measure semantic similarity between features of the collected/simulated data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Focusing on NLP, &lt;strong&gt;the term &lt;em&gt;feature&lt;/em&gt; refers to words, phrases, sentences, paragraphs or even whole text documents&lt;/strong&gt;. The idea is that if a system is able to compare two words, two phrases, two paragraphs and so on with each other and successfully compute their semantic similarity/dissimilarity, a door of possibilities opens for improving the performance in many NLP tasks such as information retrieval, information extraction, machine translation, text summarization, topic modeling, sentiment analysis, question answering, paraphrasing etc.&lt;/p&gt;

&lt;h1&gt;
  
  
  Sparse or Dense Vector Representations?
&lt;/h1&gt;

&lt;p&gt;Measuring semantic similarity presupposes that the data are represented suitably. However, many types of data, including textual data, are unstructured and first need to be preprocessed and transformed to a format or representation that can be further exploited for calculating similarities. As in many similar cases, linear algebra is the right tool for us. Translating words, phrases, paragraphs or texts into numeric vectors that represent meaningful textual units brought tremendous changes in NLP at the beginning of the 2000s. As a matter of fact, the first tradition of vector space models supported the idea that textual units (i.e. words, phrases, paragraphs and texts) can be meaningfully represented via sparse vectors. &lt;strong&gt;The linguistic meaning of these units is condensed or squeezed or embedded in an n-dimensional vector space that we can use to observe and extract meaningful relations among these units&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A new revolution came not long afterwards. The by-now famous paper on &lt;code&gt;Word2Vec&lt;/code&gt; appeared in 2013 and led to a burst of dense vector representations.¹ Although the dense vector tradition outperforms the sparse vector one in almost all tasks, there are still some advantages in using sparse vectors. For one, if the available trained models were not based on the language variety data that you are interested in, then you would probably need to train your own dense vector model; and training a dense vector model, especially a large one, requires a large amount of data, which makes it very costly in terms of time, computing resources and even environmental impact (see Hugging Face’s course on &lt;a href="https://huggingface.co/course/chapter1/4?fw=pt"&gt;Transformers&lt;/a&gt; for more details on the environmental impact of training new dense vector representation models).&lt;/p&gt;

&lt;p&gt;Summing up, there are two different vector representation traditions related to semantic similarity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;the first tradition is based on sparse vector representations and prevailed from the beginning of the 2000s until around 2013, when&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;the by-now famous paper on Word2Vec appeared that introduced dense vector representations and established the more recent tradition on word embeddings.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As just mentioned, training new dense vector representations or using the existing ones may be advantageous in some but not in all cases. Moreover, creating sparse vector representations is a healthy habit for a) inspecting the frequencies and/or weights of textual units, b) obtaining some first good insights into the writing style and the text genre and c) extracting useful language use patterns.&lt;/p&gt;

&lt;p&gt;There are numerous high-quality tutorials, papers and YouTube videos that explain in detail what sparse vector representations are and it is not my intention to replace them. In this post, I will create from scratch a sparse vector representation for the words of a short text passage before I compute the semantic similarity between word pairs in a following post. I will also extract the profile of a word that occurs in the same text passage. To compute semantic similarity based on sparse vector representations, one needs to pay attention to the following three basic steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;building the word-word co-occurrence matrix&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;measuring association with context&lt;/li&gt;
&lt;li&gt;measuring vector similarity&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  The Co-occurrence Matrix
&lt;/h1&gt;

&lt;p&gt;Sparse vector representations are based on various types of co-occurrence formats. The most common ones are word-document and word-word vectors.²&lt;/p&gt;

&lt;h2&gt;
  
  
  The Word-Document Matrix
&lt;/h2&gt;

&lt;p&gt;Each cell in a word-document vector includes the (raw or weighted) frequency of a specific word in a single text of a collection of texts. There are two sparse word vectors in the table below: the first row of the table numerically represents the word love and the second the word programming. Each cell number is the frequency of the word in the respective text.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Stacking word vectors like this reminds you of something, right? You guessed it: gathering all unique words of a corpus (i.e., a collection of texts) and placing their vectors on top of each other results in a matrix. This matrix is called the word-document (or term-document) matrix. The rows of that matrix correspond to words and the columns to the text documents of the corpus.&lt;/p&gt;

&lt;p&gt;Notice that, as expected in real corpora, there are several cells in the above word-document matrix that have a zero value; for example the word love did not occur in the texts &lt;code&gt;text4&lt;/code&gt;, &lt;code&gt;text5&lt;/code&gt; and &lt;code&gt;text6&lt;/code&gt;. Imagine now a billion-word corpus that consists of hundreds of thousands of texts. Counting the occurrences of unique words in the texts inevitably results in a co-occurrence matrix with many 0 values, since (except for the so-called stopwords that carry grammatical meaning and appear in all texts) it is very often the case that a word does not occur in a given text of such a corpus. That is why the row (word) vectors of these co-occurrence matrices are considered sparse vector representations. They only sparsely have a value other than 0 in their cells.&lt;/p&gt;
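
&lt;p&gt;A word-document matrix of this kind can be sketched in a few lines of plain &lt;code&gt;Julia&lt;/code&gt;. The three "documents" below are made up for illustration, and tokenization is nothing more than splitting on whitespace, which is a simplifying assumption.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# Hypothetical toy corpus of three short documents
docs = ["i love julia", "i love programming", "programming in julia is fun"]
tokenized = split.(docs)                        # naive whitespace tokenization
vocab = sort(unique(reduce(vcat, tokenized)))   # all unique words of the corpus

# Rows are words, columns are documents; each cell holds a raw frequency
td = [count(==(w), toks) for w in vocab, toks in tokenized]

love_row = td[findfirst(==("love"), vocab), :]  # the sparse vector of "love"
n_zeros  = count(iszero, td)                    # how many cells are empty
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Even in this tiny corpus, 10 of the 21 cells are zero; with realistic vocabularies and document counts the proportion of zeros grows much larger.&lt;/p&gt;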

&lt;h2&gt;
  
  
  The Word-Word Co-occurrence Matrix
&lt;/h2&gt;

&lt;p&gt;The only difference between a word-document and a word-word matrix is that in the latter both the rows and the columns are labeled by words. This means that the (raw or weighted) frequencies recorded in the cells represent how often a word is found within a certain distance of another word. The distance is a &lt;em&gt;parameter&lt;/em&gt; that you are expected to specify in advance.&lt;/p&gt;

&lt;p&gt;So, for a distance parameter set to 3, a word-word co-occurrence matrix might look like the table below. A cell of the word-word matrix displays how often the two words labeling the corresponding row and column of that cell occur within a window of 3 words of each other.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
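
&lt;p&gt;The counting idea behind such a matrix can be sketched without any package. The helper &lt;code&gt;cooccurrence&lt;/code&gt; below is a hypothetical function written for this illustration, and it assumes that "within a window of 3" means that the two word positions are at most 3 apart.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# Raw, symmetric co-occurrence counts for token pairs at most `window` positions apart
function cooccurrence(tokens, window)
    vocab = sort(unique(tokens))
    index = Dict(w =&amp;gt; i for (i, w) in enumerate(vocab))
    counts = zeros(Int, length(vocab), length(vocab))
    for i in eachindex(tokens)
        # Only look to the right and fill both cells, so every pair is visited once
        for j in (i + 1):min(i + window, lastindex(tokens))
            a, b = index[tokens[i]], index[tokens[j]]
            counts[a, b] += 1
            counts[b, a] += 1
        end
    end
    return vocab, counts
end

tokens = split("i love julia and i love programming")
vocab, counts = cooccurrence(tokens, 3)
love  = findfirst(==("love"), vocab)
julia = findfirst(==("julia"), vocab)
love_julia = counts[love, julia]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this toy sentence, love and julia co-occur twice: once at a distance of 1 and once at a distance of 3.&lt;/p&gt;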


&lt;h2&gt;
  
  
  Let’s Get Down to Work with Coding a Co-occurrence Matrix
&lt;/h2&gt;

&lt;p&gt;Let’s first load &lt;code&gt;TextAnalysis.jl&lt;/code&gt;, the most well-known &lt;code&gt;Julia&lt;/code&gt; package for text processing, that will be offering us valuable functions until the end of this post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;TextAnalysis&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Downloads&lt;/span&gt;
&lt;span class="n"&gt;str1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Downloads&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://raw.githubusercontent.com/JuliaLang/julia/master/doc/src/manual/strings.md"&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In the above code chunk, the variable name &lt;code&gt;str1&lt;/code&gt; is assigned the raw markdown text of the &lt;em&gt;Strings&lt;/em&gt; chapter in the &lt;code&gt;Julia&lt;/code&gt; documentation. Note that the text has been downloaded from GitHub and read in as an object of type &lt;code&gt;String&lt;/code&gt; in &lt;code&gt;Julia&lt;/code&gt;. &lt;code&gt;TextAnalysis.jl&lt;/code&gt; does not directly handle strings of type &lt;code&gt;String&lt;/code&gt; and first needs them converted to one of its own data types used for optimizing string processing and manipulation: &lt;code&gt;FileDocument&lt;/code&gt;, &lt;code&gt;StringDocument&lt;/code&gt; and &lt;code&gt;NGramDocument&lt;/code&gt;. The relevant type for our &lt;code&gt;str1&lt;/code&gt; object is &lt;code&gt;StringDocument&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, here is how we can create the word-word co-occurrence matrix for the words in &lt;code&gt;str1&lt;/code&gt; that can be found within a distance window of 3 words.³&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;coo_str1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CooMatrix&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StringDocument&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;CooMatrix&lt;/code&gt; type constructor accepts an object of either &lt;code&gt;FileDocument&lt;/code&gt; or &lt;code&gt;StringDocument&lt;/code&gt; type, while it does not accept objects of type &lt;code&gt;NGramDocument&lt;/code&gt;, and returns an object of type &lt;code&gt;CooMatrix&lt;/code&gt;. As with any other object in &lt;code&gt;Julia&lt;/code&gt;, to inspect the returned object &lt;code&gt;coo_str1&lt;/code&gt;, you can call &lt;code&gt;fieldnames()&lt;/code&gt; on the data type that &lt;code&gt;coo_str1&lt;/code&gt; belongs to.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fieldnames&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typeof&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coo_str1&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;terms&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;coom&lt;/code&gt; field stores the actual co-occurrence matrix with the normalized frequencies of all words in the &lt;code&gt;Strings&lt;/code&gt; chapter of the &lt;code&gt;Julia&lt;/code&gt; online documentation. As expected, even for this relatively short text, the co-occurrence matrix is pretty large.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coo_str1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1451&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1451&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let’s take a sneak peek into the contents of its first two rows:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;coo_str1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Notice that the cell values are not integers: by default, the raw frequencies are normalized by the distance between the positions of the co-occurring words. Another important thing to keep in mind is the high number of 0 values in the matrix, which indicates that many word pairs do not co-occur within a window of 3 words.&lt;/p&gt;
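
&lt;p&gt;To make the idea of distance-based weighting concrete, here is a minimal sketch on a four-word toy text. The helper &lt;code&gt;toy_coom&lt;/code&gt; and its keyword arguments are hypothetical and written only for illustration; it is not the code used by &lt;code&gt;CooMatrix&lt;/code&gt;, and the exact counting convention of the package may differ.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# Hypothetical helper for illustration only; not part of TextAnalysis.jl.
function toy_coom(tokens; window=2, normalize=true)
    terms = unique(tokens)                      # vocabulary in order of first appearance
    index = Dict(zip(terms, eachindex(terms)))  # word to row/column number
    coom = zeros(length(terms), length(terms))
    for i in eachindex(tokens)
        # visit every neighbour at most `window` positions away
        for j in max(1, i - window):min(length(tokens), i + window)
            if i != j
                # closer neighbours weigh more when normalizing by distance
                weight = normalize ? 1 / abs(i - j) : 1.0
                coom[index[tokens[i]], index[tokens[j]]] += weight
            end
        end
    end
    return terms, coom
end

terms, coom = toy_coom(["the", "cat", "sat", "down"])
coom[1, 3]  # 0.5: "the" and "sat" are two positions apart
coom[1, 4]  # 0.0: "the" and "down" are outside the window
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this sketch, calling &lt;code&gt;toy_coom&lt;/code&gt; with &lt;code&gt;normalize=false&lt;/code&gt; would count every neighbour inside the window as 1.0 regardless of its distance.&lt;/p&gt;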

&lt;p&gt;If you would like to extract the raw, non-normalized co-occurrence frequencies, set the keyword argument &lt;code&gt;normalize&lt;/code&gt;, which defaults to &lt;code&gt;true&lt;/code&gt;, to &lt;code&gt;false&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;coo_str1_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CooMatrix&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StringDocument&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str1&lt;/span&gt;&lt;span class="x"&gt;),&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here are the first two rows of the word-word co-occurrence matrix that is based on raw frequencies:&lt;/p&gt;

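&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;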





&lt;p&gt;So far, so good. However, you are probably wondering right now: “how on earth could I browse through such a matrix that lacks any row and column labels?” Let’s address this question in the next section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Labeling Rows &amp;amp; Columns with Words
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;column_indices&lt;/code&gt; field of &lt;code&gt;coo_str1&lt;/code&gt; is an &lt;code&gt;OrderedDict&lt;/code&gt;, a type that resembles a hash map, and it maps each word to its row/column index. For instance, the word &lt;em&gt;regular&lt;/em&gt; maps to index 1021 in &lt;code&gt;coo_str1.coom&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;coo_str1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;
&lt;span class="n"&gt;OrderedDict&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="mi"&gt;1451&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;"1"&lt;/span&gt;                                          &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;419&lt;/span&gt;
  &lt;span class="s"&gt;"regular"&lt;/span&gt;                                    &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1021&lt;/span&gt;
  &lt;span class="s"&gt;"Vector"&lt;/span&gt;                                     &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;665&lt;/span&gt;
  &lt;span class="s"&gt;"abracadabra"&lt;/span&gt;                                &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;976&lt;/span&gt;
  &lt;span class="s"&gt;"comparisons"&lt;/span&gt;                                &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;408&lt;/span&gt;
  &lt;span class="s"&gt;"whose"&lt;/span&gt;                                      &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;873&lt;/span&gt;
  &lt;span class="s"&gt;"’"&lt;/span&gt;                                          &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1051&lt;/span&gt;
  &lt;span class="s"&gt;"Many"&lt;/span&gt;                                       &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;451&lt;/span&gt;
  &lt;span class="s"&gt;"continuation."&lt;/span&gt;                              &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;734&lt;/span&gt;
  &lt;span class="s"&gt;"gives"&lt;/span&gt;                                      &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1001&lt;/span&gt;
  &lt;span class="s"&gt;"to/from"&lt;/span&gt;                                    &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;195&lt;/span&gt;
  &lt;span class="s"&gt;"unquoted"&lt;/span&gt;                                   &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;892&lt;/span&gt;
  &lt;span class="s"&gt;"plain"&lt;/span&gt;                                      &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;127&lt;/span&gt;
  &lt;span class="s"&gt;"https://www.pcre.org/current/doc/html/pcre"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1065&lt;/span&gt;
  &lt;span class="s"&gt;"matched"&lt;/span&gt;                                    &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1091&lt;/span&gt;
  &lt;span class="s"&gt;"Any"&lt;/span&gt;                                        &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1267&lt;/span&gt;
  &lt;span class="n"&gt;⋮&lt;/span&gt;                                            &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;⋮&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



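&lt;p&gt;Thus, the 1021&lt;sup&gt;st&lt;/sup&gt; row of &lt;code&gt;coo_str1.coom&lt;/code&gt; holds the 1451 co-occurrence frequencies of &lt;em&gt;regular&lt;/em&gt;. Here is the corresponding row of the raw matrix &lt;code&gt;coo_str1_raw.coom&lt;/code&gt;, looked up by word:&lt;br&gt;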
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"regular"&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;span class="mi"&gt;1451&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="n"&gt;SparseArrays&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;SparseVector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;Float64&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="mi"&gt;53&lt;/span&gt; &lt;span class="n"&gt;stored&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;   &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;4.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;6.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;  &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;4.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;  &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;  &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;6.0&lt;/span&gt;
          &lt;span class="n"&gt;⋮&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1076&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1087&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1231&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1232&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1236&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1254&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the co-occurrence matrix is symmetric, each column of &lt;code&gt;coo_str1_raw.coom&lt;/code&gt; is identical to the corresponding row, as you can see below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt;&lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"regular"&lt;/span&gt;&lt;span class="x"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;1451&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="n"&gt;SparseArrays&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;SparseVector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;Float64&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="mi"&gt;53&lt;/span&gt; &lt;span class="n"&gt;stored&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;   &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;4.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;6.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;  &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;4.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;  &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;  &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;6.0&lt;/span&gt;
          &lt;span class="n"&gt;⋮&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1076&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1087&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1231&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1232&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1236&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1254&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mf"&gt;2.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The indices in &lt;code&gt;coo_str1_raw.column_indices&lt;/code&gt;, i.e. the values of this &lt;code&gt;OrderedDict&lt;/code&gt;, are identical to the positions of the words in the &lt;code&gt;coo_str1_raw.terms&lt;/code&gt; vector of strings and correspond to the row/column numbers of the &lt;code&gt;coo_str1_raw.coom&lt;/code&gt; co-occurrence matrix (recall that &lt;code&gt;coo_str1_raw.coom&lt;/code&gt; is symmetric). &lt;code&gt;coo_str1_raw.terms&lt;/code&gt; holds the unique terms, i.e. words, of &lt;code&gt;str1&lt;/code&gt;. This means that &lt;em&gt;regular&lt;/em&gt; is in the 1021st position of the &lt;code&gt;coo_str1_raw.terms&lt;/code&gt; vector. Let’s take advantage of this to look up the words that we want. To get the co-occurrence frequency of the pair of words &lt;em&gt;unquoted&lt;/em&gt; and &lt;em&gt;appearing&lt;/em&gt;, we simply use basic indexing. The return value 0.0 tells us that the two words did not co-occur within a window of 3 words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"unquoted"&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"appearing"&lt;/span&gt;&lt;span class="x"&gt;]]&lt;/span&gt;
&lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since this seems to be a useful piece of code for navigating through the data, why don’t we wrap it in a function?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; browsecoompairs&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coo&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CooMatrix&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;term1&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;term2&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;coo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;coo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;term1&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;coo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;term2&lt;/span&gt;&lt;span class="x"&gt;]]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="n"&gt;browsecoompairs&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generic&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; with&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;browsecoompairs&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"unquoted"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"appearing"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
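
&lt;p&gt;Indexing the dictionary directly throws a &lt;code&gt;KeyError&lt;/code&gt; when a word is not in the vocabulary. The following sketch shows one possible way to guard against that. The name &lt;code&gt;safecoompairs&lt;/code&gt; is hypothetical, and the toy object is a made-up stand-in that only mimics the &lt;code&gt;coom&lt;/code&gt; and &lt;code&gt;column_indices&lt;/code&gt; fields, so that the example runs without loading any package.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# Hypothetical variant of browsecoompairs; `coo` only needs
# the fields `coom` and `column_indices`.
function safecoompairs(coo, term1::AbstractString, term2::AbstractString)
    i = get(coo.column_indices, term1, nothing)
    j = get(coo.column_indices, term2, nothing)
    if i === nothing || j === nothing
        return nothing  # at least one word is not in the vocabulary
    end
    return coo.coom[i, j]
end

# Made-up stand-in with the same two field names as a CooMatrix object.
toy = (coom = [0.0 2.0; 2.0 0.0], column_indices = Dict("a" =&amp;gt; 1, "b" =&amp;gt; 2))
safecoompairs(toy, "a", "b")      # 2.0
safecoompairs(toy, "a", "zebra")  # nothing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;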



&lt;h3&gt;
  
  
  Getting the Word Profiles
&lt;/h3&gt;

&lt;p&gt;Another interesting insight that we can get out of the co-occurrence matrix is the profile of a word. We can think of it as the set of words with which a word actually co-occurs, i.e. the words for which the corresponding cell of the co-occurrence matrix is not 0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"String"&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="kt"&gt;Vector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
 &lt;span class="mi"&gt;66&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nice! 66 words co-occur at least once with the word &lt;em&gt;String&lt;/em&gt; in a window of 3 words. It is no surprise that so many distinct words co-occur with &lt;em&gt;String&lt;/em&gt; in such a short text, given that &lt;code&gt;str1&lt;/code&gt; is a text string loaded from the &lt;em&gt;Strings&lt;/em&gt; chapter of the &lt;code&gt;Julia&lt;/code&gt; documentation. Let’s see which words these are. The first step is to get the boolean vector that flags which of the words in &lt;code&gt;str1&lt;/code&gt; co-occur with &lt;em&gt;String&lt;/em&gt; and store it in a variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;string_cooc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"String"&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;.&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="mi"&gt;1451&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="n"&gt;SparseArrays&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;SparseVector&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;Bool&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt; &lt;span class="n"&gt;with&lt;/span&gt; &lt;span class="mi"&gt;66&lt;/span&gt; &lt;span class="n"&gt;stored&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;   &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;   &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
          &lt;span class="n"&gt;⋮&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;754&lt;/span&gt; &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;901&lt;/span&gt; &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;902&lt;/span&gt; &lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1012&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1240&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1439&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see above, &lt;code&gt;string_cooc&lt;/code&gt; is a &lt;code&gt;SparseVector&lt;/code&gt;, a special type of vector that stores only its non-zero entries (here, all 1s, i.e. &lt;code&gt;true&lt;/code&gt;) together with their positional indices. If we dig a bit more into the &lt;code&gt;string_cooc&lt;/code&gt; object, we will find out that it has a field called &lt;code&gt;nzind&lt;/code&gt; that holds a vector of these indices.⁴&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fieldnames&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typeof&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string_cooc&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;nzind&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;nzval&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string_cooc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nzind&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;124&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;132&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;137&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;160&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;172&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;176&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;177&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;178&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;179&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;181&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span 
class="mi"&gt;188&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;228&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;232&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;252&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;254&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;261&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;262&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;264&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;291&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;351&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;360&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;423&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;424&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;452&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;466&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;467&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;468&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;501&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;504&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;521&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;532&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;534&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;546&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;547&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span 
class="mi"&gt;549&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;571&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;572&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;580&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;676&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;751&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;752&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;753&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;754&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;901&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;902&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1012&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1240&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1439&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
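
&lt;p&gt;Reading the &lt;code&gt;nzind&lt;/code&gt; field works, but it depends on the field layout of &lt;code&gt;SparseVector&lt;/code&gt;. As a small sketch on made-up toy data, the exported &lt;code&gt;findnz&lt;/code&gt; function of the &lt;code&gt;SparseArrays&lt;/code&gt; standard library returns the same indices:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using SparseArrays

# Toy stand-in for one row of a co-occurrence matrix (values are made up).
row = sparsevec([2, 5], [4.0, 6.0], 6)
mask = row .&amp;gt; 0            # SparseVector{Bool} that stores only the true entries
inds, vals = findnz(mask)  # stored indices and their values
inds == mask.nzind         # true

terms = ["a", "b", "c", "d", "e", "f"]  # made-up vocabulary
terms[inds]                # ["b", "e"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;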



&lt;p&gt;Recall that these positional indices match the positions in &lt;code&gt;coo_str1_raw.terms&lt;/code&gt;, which contains the actual words. So, things are pretty easy now. Let’s extract the list of the 66 words with simple indexing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;terms&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;string_cooc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nzind&lt;/span&gt;&lt;span class="x"&gt;])&lt;/span&gt;
&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"#"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"["&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"]"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"("&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"@"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;")"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"are"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"the"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"a"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"`"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"and"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"in"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"as"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"is"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Julia"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"In"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"strings"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"When"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"for"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"literals"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"String"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"."&lt;/span&gt;&lt;span 
class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"8"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"32"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"which"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"indices"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"index"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"indexing"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"into"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"encoded"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"julia&amp;gt;"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"necessarily"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"four"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Basics"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"delimited"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"objects"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"given"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"dimension."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"like"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"access"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"14"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-codeunit"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"at"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"character."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"create"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"SubString"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span 
class="s"&gt;"{"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"}"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"SubStrings"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"support"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Unicode."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"per"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UInt"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UTF"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"16"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"types."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Additional"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Triple-Quoted"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Literals"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Non-Standard"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ordinary"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Raw"&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are the 66 words that &lt;code&gt;String&lt;/code&gt; co-occurs with within a window of 3 words. Since we may need to repeat these steps and would rather not retype them each time, we might as well wrap them into a function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; cooccurrences&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coo&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CooMatrix&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;baseword&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
           &lt;span class="n"&gt;basewordcooc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;coo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;coo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_indices&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;baseword&lt;/span&gt;&lt;span class="x"&gt;],&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;.&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
           &lt;span class="n"&gt;coo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;terms&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;basewordcooc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nzind&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
       &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="n"&gt;cooccurrences&lt;/span&gt; &lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generic&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="nf"&gt; with&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cooccurrences&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"String"&lt;/span&gt;&lt;span class="x"&gt;))&lt;/span&gt;
&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"#"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"["&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"]"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"("&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"@"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;")"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"are"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"the"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"a"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"`"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"and"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"in"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"as"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"is"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Julia"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"In"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"strings"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"When"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;":"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"for"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"literals"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"String"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"."&lt;/span&gt;&lt;span 
class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"8"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"32"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"which"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"indices"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"index"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"indexing"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"into"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"encoded"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"julia&amp;gt;"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"necessarily"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"four"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Basics"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"delimited"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"objects"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"given"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"dimension."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"like"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"access"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"14"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-codeunit"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"at"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"character."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"create"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"SubString"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span 
class="s"&gt;"{"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"}"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"SubStrings"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"support"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Unicode."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"per"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UInt"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"UTF"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"16"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"types."&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Additional"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Triple-Quoted"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Literals"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Non-Standard"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ordinary"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Raw"&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can further investigate the co-occurrence values for one or more of these words using the &lt;code&gt;browsecoompairs()&lt;/code&gt; function explained above. We’ve come a long way since we loaded &lt;code&gt;str1&lt;/code&gt; into memory! Before leaving you, for now, I would like to take one more look at &lt;code&gt;coo_str1.coom&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Digging a bit more into the Co-occurrence Matrix
&lt;/h2&gt;

&lt;p&gt;As we saw above, the &lt;code&gt;coo_str1.coom&lt;/code&gt; object is a 1451×1451 matrix, which means that it contains 2105401 cells. For such a small text, it is almost shocking to realize that the word-word co-occurrence matrix is so large.&lt;/p&gt;

&lt;p&gt;Let’s find out how many of the cells have a value larger than 0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coo_str1_raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;26728&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that only 26728 out of the 2105401 word pairs have a co-occurrence frequency of more than 0; in other words, only ~1.3% of the matrix cells hold a value other than 0, while the remaining ~98.7% are equal to 0. Among its imported packages, &lt;code&gt;TextAnalysis.jl&lt;/code&gt; includes &lt;code&gt;SparseArrays&lt;/code&gt;, which handles such sparse matrices very efficiently. In fact, &lt;code&gt;coo_str1.coom&lt;/code&gt; is of type &lt;code&gt;SparseMatrixCSC&lt;/code&gt;. I suggest you go ahead and have a look at &lt;code&gt;SparseArrays.jl&lt;/code&gt; to find out more details on the storage hacks and clever ways of handling sparse matrices such as &lt;code&gt;coo_str1_raw.coom&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;julia&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;typeof&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coo_str1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coom&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SparseArrays&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;SparseMatrixCSC&lt;/span&gt;&lt;span class="x"&gt;{&lt;/span&gt;&lt;span class="kt"&gt;Float64&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Int64&lt;/span&gt;&lt;span class="x"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
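
&lt;p&gt;The sparsity figures above can be recomputed with a few lines. The following is only a minimal sketch: the first part redoes the arithmetic with the two numbers reported in this post, while &lt;code&gt;toy&lt;/code&gt; is a made-up matrix that shows the same measurement with generic &lt;code&gt;SparseArrays&lt;/code&gt; calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using SparseArrays

# Arithmetic with the numbers reported above
total_cells   = 1451^2      # 2105401
nonzero_cells = 26728
nonzero_share = 100 * nonzero_cells / total_cells

println(round(nonzero_share; digits = 2))        # 1.27
println(round(100 - nonzero_share; digits = 2))  # 98.73

# The same measurement on a hypothetical 4x4 sparse matrix
toy = sparse([1, 2, 2, 4], [2, 1, 3, 4], [1.0, 1.0, 0.5, 2.0], 4, 4)

println(count(!iszero, toy) / length(toy))  # 0.25, share of non-zero cells
println(nnz(toy))                           # 4, number of stored entries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;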



&lt;p&gt;That’s it for now! Although the focus of this post is on NLP, I hope it is relatively easy to draw analogies between words, lexemes and texts and the units of analysis in other fields, and to follow up on the ideas of this post.&lt;/p&gt;




&lt;p&gt;[1]: Here is the original paper on Word2Vec "Efficient Estimation of Word Representations in Vector Space": &lt;a href="https://arxiv.org/pdf/1301.3781.pdf"&gt;https://arxiv.org/pdf/1301.3781.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2]: Since recognizing (or, in terms of computational processing, tokenizing) and analyzing features beyond the word level is highly complex and more relevant to theoretical linguists, I will stay with the relatively easily identifiable words, which can be thought of as autonomous graphemic units that are separated by spaces most of the time. So, for the non-linguists, words are defined as sequences of characters that are separated by spaces within a larger string.&lt;/p&gt;

&lt;p&gt;[3]: Recall that for the word &lt;code&gt;word1&lt;/code&gt; the window of 3 words is defined as follows: &lt;code&gt;pos1 pos2 pos3&lt;/code&gt; &lt;strong&gt;word1&lt;/strong&gt; &lt;code&gt;pos4 pos5 pos6&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;[4]: &lt;code&gt;show()&lt;/code&gt; does not change the result of the operation inside the parentheses. It simply makes sure that all the output values are displayed on the console.&lt;/p&gt;

</description>
      <category>launch</category>
      <category>nlp</category>
      <category>similarity</category>
      <category>cooccurrence</category>
    </item>
    <item>
      <title>Creating a Contingency Table in Julia</title>
      <dc:creator>Alex Tantos</dc:creator>
      <pubDate>Sat, 27 Aug 2022 17:44:00 +0000</pubDate>
      <link>https://forem.julialang.org/atantos/creating-a-contingency-table-41bk</link>
      <guid>https://forem.julialang.org/atantos/creating-a-contingency-table-41bk</guid>
      <description>&lt;h2&gt;
  
  
  Why are contingency tables useful?
&lt;/h2&gt;

&lt;p&gt;More often than not, our data sets include categorical variables that encode qualitative features of our data and take a limited number of possible discrete values. A very common scenario is to check whether there is an association between pairs of such variables. For instance, we may want to check whether the type of car is associated with the number of gears or the number of cylinders in the well-known &lt;code&gt;mtcars&lt;/code&gt; dataset included in the &lt;code&gt;RDatasets&lt;/code&gt; package.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Numerous measures have been devised specifically for assessing whether pairs of categorical variables are associated, and they are used in scientific fields such as linguistics, biology and physics. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is a contingency table?
&lt;/h2&gt;

&lt;p&gt;However, to be able to calculate such association measures between pairs of categorical variables, it is essential to prepare a contingency table.&lt;/p&gt;

&lt;p&gt;A contingency table is a two-dimensional table whereby each of the rows represents one level of the first categorical variable and each of the columns represents one level of the second categorical variable. &lt;/p&gt;
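
&lt;p&gt;As a minimal sketch of this definition, such a table can be built by hand with plain Julia before turning to any package. The vectors &lt;code&gt;cyl&lt;/code&gt; and &lt;code&gt;gear&lt;/code&gt; below are made-up toy data, not the &lt;code&gt;mtcars&lt;/code&gt; values, and all the names are chosen only for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# Hypothetical observations of two categorical variables
cyl  = ["4", "6", "4", "8", "8", "4"]
gear = ["4", "4", "5", "3", "3", "4"]

# Count how often each (cyl, gear) pair occurs
pair_counts = Dict{Tuple{String,String},Int}()
for pair in zip(cyl, gear)
    pair_counts[pair] = get(pair_counts, pair, 0) + 1
end

# Rows are the levels of cyl, columns are the levels of gear
row_levels = sort(unique(cyl))    # ["4", "6", "8"]
col_levels = sort(unique(gear))   # ["3", "4", "5"]
table = [get(pair_counts, (r, c), 0) for r in row_levels, c in col_levels]

display(table)  # 3x3 matrix: [0 2 1; 0 1 0; 2 0 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;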

&lt;p&gt;To be able to work efficiently with this type of data, you first need to assign these columns a type that reflects their statistical nature and retains the information about the levels of the categorical variable and any ordering that they may have. In other words, you need to find a data type that corresponds to factors in &lt;code&gt;R&lt;/code&gt;. A common, though not the only, way to represent categorical data in &lt;code&gt;Julia&lt;/code&gt; is through the &lt;code&gt;CategoricalArray&lt;/code&gt; data type.&lt;sup id="fnref2"&gt;2&lt;/sup&gt; &lt;/p&gt;

&lt;p&gt;Let's create the contingency table of the number of gears and the number of cylinders. Both of these variables are interpreted as &lt;code&gt;Integer&lt;/code&gt;s, but their values are discrete and, clearly, categorical in nature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loading the mtcars dataset
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;DataFrames&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Chain&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RDatasets&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FreqTables&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CategoricalArrays&lt;/span&gt;

&lt;span class="n"&gt;cars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"datasets"&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"mtcars"&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Extracting and converting the two categorical variables to &lt;code&gt;CategoricalArray&lt;/code&gt;s
&lt;/h2&gt;

&lt;p&gt;Although the two variables are read/parsed as &lt;code&gt;Integer&lt;/code&gt;s, they should first be converted to &lt;code&gt;CategoricalValue&lt;/code&gt;s. The following code first converts the &lt;code&gt;Integer&lt;/code&gt; values to &lt;code&gt;String&lt;/code&gt; values by broadcasting the &lt;code&gt;string()&lt;/code&gt; function and then to &lt;code&gt;CategoricalValue&lt;/code&gt;s with the &lt;code&gt;CategoricalArray()&lt;/code&gt; constructor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;cars&lt;/span&gt;&lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="x"&gt;,[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Gear&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Cyl&lt;/span&gt;&lt;span class="x"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@chain&lt;/span&gt; &lt;span class="n"&gt;cars&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt; 
         &lt;span class="n"&gt;combine&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Gear&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Cyl&lt;/span&gt;&lt;span class="x"&gt;]&lt;/span&gt; &lt;span class="o"&gt;.=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CategoricalArray&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="x"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;renamecols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
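
&lt;p&gt;The same conversion can be tried on a plain vector, outside of any data frame. This is only a small sketch that assumes the &lt;code&gt;CategoricalArrays&lt;/code&gt; package is installed; the vector &lt;code&gt;gears&lt;/code&gt; is made up and does not hold the &lt;code&gt;mtcars&lt;/code&gt; values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;using CategoricalArrays

# Hypothetical integer-coded variable
gears = [4, 4, 3, 5, 3]

# Broadcast string() over the integers, then wrap the result in a CategoricalArray
gears_cat = CategoricalArray(string.(gears))

# The levels are the distinct values the variable can take
println(levels(gears_cat))
println(length(levels(gears_cat)))  # 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;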



&lt;h2&gt;
  
  
  Creating the contingency table
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;cyl_gear_freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freqtable&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cars&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Cyl&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Gear&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding the row totals
&lt;/h2&gt;

&lt;p&gt;To add the column with the row totals, you need to apply a transformation that a) wraps all the columns with &lt;code&gt;AsTable()&lt;/code&gt;, so that they are passed on as a named tuple of column vectors, b) applies the &lt;code&gt;sum&lt;/code&gt; function to it and c) names the resulting column &lt;code&gt;Total&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Notice that &lt;code&gt;cyl_gear_freq&lt;/code&gt; is of type &lt;code&gt;NamedMatrix&lt;/code&gt;. To transform it into a &lt;code&gt;DataFrame&lt;/code&gt;, we need the array of its values, which is stored in its &lt;code&gt;array&lt;/code&gt; field. Moreover, since it is a two-dimensional object, it has two axes of names, and we need the second one to name the columns of the new &lt;code&gt;DataFrame&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;cyl_gear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;@chain&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cyl_gear_freq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Symbol&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cyl_gear_freq&lt;/span&gt;&lt;span class="x"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="x"&gt;]))&lt;/span&gt; &lt;span class="k"&gt;begin&lt;/span&gt;
  &lt;span class="n"&gt;transform!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AsTable&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;All&lt;/span&gt;&lt;span class="x"&gt;())&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Total&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding the column totals
&lt;/h2&gt;

&lt;p&gt;Finally, to add the column totals, you can use &lt;code&gt;push!()&lt;/code&gt;, which appends a new row to &lt;code&gt;cyl_gear&lt;/code&gt;; here the row is the array of column totals created by a comprehension.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;&lt;span class="n"&gt;push!&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cyl_gear&lt;/span&gt;&lt;span class="x"&gt;,&lt;/span&gt; &lt;span class="x"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="x"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eachcol&lt;/span&gt;&lt;span class="x"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cyl_gear&lt;/span&gt;&lt;span class="x"&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
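
&lt;p&gt;For comparison, the two kinds of margins can also be sketched on a plain matrix with base Julia only. The counts in &lt;code&gt;table&lt;/code&gt; are made up for illustration and are not the &lt;code&gt;mtcars&lt;/code&gt; frequencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight julia"&gt;&lt;code&gt;# Hypothetical 3x3 table of counts
table = [0 2 1; 0 1 0; 2 0 0]

row_totals = vec(sum(table; dims = 2))   # [3, 1, 2]
col_totals = vec(sum(table; dims = 1))   # [2, 3, 1]

# Append the row totals as a last column and the column totals as a last row
with_margins = [table row_totals; col_totals' sum(table)]

display(with_margins)  # 4x4 matrix: [0 2 1 3; 0 1 0 1; 2 0 0 2; 2 3 1 6]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;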



&lt;h2&gt;
  
  
  The final result
&lt;/h2&gt;

&lt;p&gt;Here is what the contingency table with its margins looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.julialang.org/images/f0aOSuTjjuYPj81r1nLiLIgjz1Ah567YU6E3yORtnIY/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3J5/dnhxeWEweW5lb21i/N2J4OWZuLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://forem.julialang.org/images/f0aOSuTjjuYPj81r1nLiLIgjz1Ah567YU6E3yORtnIY/w:880/mb:500000/ar:1/aHR0cHM6Ly9mb3Jl/bS5qdWxpYWxhbmcu/b3JnL3JlbW90ZWlt/YWdlcy91cGxvYWRz/L2FydGljbGVzL3J5/dnhxeWEweW5lb21i/N2J4OWZuLnBuZw" alt="Image description" width="734" height="298"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Recall that the &lt;code&gt;RDatasets&lt;/code&gt; package provides the pool of commonly used datasets that are available when &lt;code&gt;R&lt;/code&gt; is loaded. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;This data type is also recommended by the &lt;code&gt;DataFrames.jl&lt;/code&gt; package documentation as well as by the recently published book &lt;a href="https://www.manning.com/books/julia-for-data-analysis"&gt;Julia for Data Analysis&lt;/a&gt;, written by Bogumił Kamiński, for expressing categorical variables in &lt;code&gt;Julia&lt;/code&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>categorical</category>
      <category>association</category>
      <category>frequency</category>
      <category>string</category>
    </item>
  </channel>
</rss>
