Julia Community 🟣: Alex Tantos

Handling Strings and GadFly-Plotting while Learning about Zipf's Law

Alex Tantos — Thu, 10 Nov 2022 11:21:14 +0000

Last week I had to teach Zipf's Law in my Computational Linguistics class using Julia. This post includes the Julia code I used to demonstrate the law.

I will not spend time here explaining why this empirically motivated law, which characterizes natural languages, holds. I will confine myself to saying that, before the American linguist George Kingsley Zipf spread the word about its existence, some brilliant people had noticed a very interesting pattern:

There are a few words in a text that appear very frequently, while all the rest appear only a few times, resulting in a distribution that is reminiscent of a power-law distribution.

The goal of this post, however, is simply to demonstrate the law using the Amazon Musical Instruments Reviews dataset, available on Kaggle.

Along the way, I will be using Julia for handling strings; more specifically, I intend to show how to:

read in text as a String
delete punctuation marks and tokenize a string into word tokens
create a Dictionary with word frequencies
sort the Dictionary based on word frequencies
plot the frequency values as a bar plot to see the power law-like distribution

Let's take these steps one-by-one.

Reading in text as a `String`

I will be using the CSV and DataFrames packages to read in the Musical_instruments_reviews.csv file as a DataFrame that contains the Amazon Musical Instruments Reviews.

julia> using DataFrames, CSV

julia> instruments = CSV.read("/Users/atantos/Documents/julia/DataFrames/Instrument_reviews_dataframe/Musical_instruments_reviews.csv", DataFrame);

If you navigate through the instruments DataFrame, you will see that the reviewText column contains the review texts. Each member of the String reviewText column vector is a text. The join() function joins these texts into a single big String that we can further manipulate.

julia> reviewtext = join(instruments.reviewText, " ");
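
One thing worth checking at this step is whether the column contains missing values, since join() prints every element it receives. The following is a minimal sketch on a made-up three-element vector (not the real reviewText column, which may well contain no missing values) showing how skipmissing() could be used to drop such entries before joining.

```julia
# Hypothetical stand-in for instruments.reviewText.
reviews = ["Great strings.", missing, "Works well!"]

# join() prints every element, so a missing value ends up as the word "missing".
naive = join(reviews, " ")

# skipmissing() lazily drops the missing entries before joining.
cleaned = join(skipmissing(reviews), " ")

println(naive)
println(cleaned)
```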

Data Cleaning and Tokenization

The second step is to clean the textual data by deleting punctuation marks and then to tokenize the cleaner output. The following line does both tasks in one step. replace() takes in the review texts and replaces the four punctuation marks matched by the regular expression pattern r"(;|,|\.|!)" with the empty string. In other words, it deletes them.
The cleaned text is then tokenized with split(), using the space character " " as the splitting criterion.

julia> reviewtext_tokens = split(replace(reviewtext,  r"(;|,|\.|!)" => ""), " ")
940593-element Vector{SubString{String}}:
 "Not"
 "much"
 "to"
 "write"
 ⋮
 "recommended"
 "product45/5"
 "stars"

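Splitting on a single space has two side effects that are easy to miss: consecutive spaces produce empty tokens, and other whitespace such as line breaks stays inside tokens. The sketch below, on a made-up sample string, compares it with calling split() without a delimiter. The character class r"[;,.!]" is an equivalent way of writing the pattern used above.

```julia
# Made-up sample with a double space and a line break.
sample = "Not much to write,  really.\nHighly recommended!"

# Character class equivalent to the pattern r"(;|,|\.|!)" used in the post.
stripped = replace(sample, r"[;,.!]" => "")

# Splitting on a single space keeps an empty token and leaves "\n" inside a token.
by_space = split(stripped, " ")

# split() without a delimiter splits on any run of whitespace and drops empty tokens.
by_whitespace = split(stripped)

println(by_space)
println(by_whitespace)
```
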
Creating a Word-Frequency Dictionary

The best-known counting method in Julia is StatsBase's countmap() function, which outputs a Dictionary of words and their frequencies. The first step is to bring StatsBase's functionality into the current namespace with using; after that we may call its exported function countmap() without the package qualification, i.e. without the package_name.method() notation, as in StatsBase.countmap(). countmap() takes a vector of values of any type and returns a dictionary whose keys are the unique elements of the vector (the words in our case) and whose values are the occurrence frequencies of these elements. Below, word_dict is the dictionary built from reviewtext_tokens: the words are on the left of the right-pointing arrow and their occurrence frequencies on the right.

julia> using StatsBase

julia> word_dict = countmap(reviewtext_tokens)
Dict{SubString{String}, Int64} with 43812 entries:
  "B0002E2EOE)"           => 1
  "itPS"                  => 1
  "tunerYes"              => 1
  "optionLEVY'S"          => 1
  "whiz"                  => 2
  "simultaneouslyTotally" => 1
  "gathered"              => 2
  ⋮                       => ⋮
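
To make clear what countmap() computes, here is a hand-rolled counterpart using only Base. It is a sketch for illustration, not StatsBase's implementation, and the function name count_tokens is made up for this example.

```julia
# Count how many times each token occurs in a vector of tokens.
function count_tokens(tokens)
    counts = Dict{eltype(tokens), Int}()
    for token in tokens
        # get() returns the current count, or 0 the first time a token is seen.
        counts[token] = get(counts, token, 0) + 1
    end
    return counts
end

toy_tokens = split("the cat saw the dog and the bird")
toy_counts = count_tokens(toy_tokens)

println(toy_counts["the"])
```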

Recall that, according to Zipf's Law, a few words are very common in a text or corpus of texts and the rest occur very rarely. To visualize this asymmetry in the word frequency distribution, we first need to sort the words by frequency in decreasing order. Sorting dictionary entries by value is easy in Julia: collect() turns the dictionary into a vector of key-value pairs, and sort() accepts an anonymous function, x->x[2], through the keyword argument by, so that the sorting is based on the second member of each pair, i.e. the frequency. Notice that for sorting in decreasing order you need to set the rev argument to true.

julia> sorted_word_dict = sort(collect(word_dict), by=x->x[2], rev=true)
43812-element Vector{Pair{SubString{String}, Int64}}:
          "the" => 39206
            "a" => 27175
          "and" => 26223
            "I" => 25333
                ⋮
      "things)" => 1
  "Tone-master" => 1
 "onesMaterial" => 1

The sorted result comes with a small complication that one needs to be aware of. Despite its name, sorted_word_dict is not a dictionary: sort(collect(word_dict), ...) returns a Vector of Pairs. Calling keys() on it therefore returns the indices of the vector, i.e. the ranks of the sorted pairs, while values() returns the key-value pairs of the initial dictionary. This means that, in order to access the frequencies of the initial unsorted dictionary, you need to take each pair and ask for its second member with getindex(value, 2). Let's see in practice what I mean by that. Here is the array of keys of sorted_word_dict, which contains all the ranking indices.

julia> [key for key in keys(sorted_word_dict)]
43812-element Vector{Int64}:
     1
     2
     3
     4
     ⋮
 43810
 43811
 43812

The values of sorted_word_dict, on the other hand, are the key-value pairs of the initial dictionary word_dict, as you can see:

julia> [value for value in values(sorted_word_dict)]
43812-element Vector{Pair{SubString{String}, Int64}}:
          "the" => 39206
            "a" => 27175
          "and" => 26223
            "I" => 25333
                ⋮
      "things)" => 1
  "Tone-master" => 1
 "onesMaterial" => 1

Keeping the internal structure of sorted_word_dict in mind, below we ask for the second member of each key-value pair that lives within sorted_word_dict, i.e. the frequency.

julia> freqs=[getindex(value, 2) for value in values(sorted_word_dict)]
43812-element Vector{Int64}:
 39206
 27175
 26223
 25333
     ⋮
     1
     1
     1

To access the words, you need the first member of each pair of the initial unsorted dictionary stored within sorted_word_dict, obtained with getindex(value, 1).

julia> words = [getindex(value, 1) for value in values(sorted_word_dict)]
43812-element Vector{SubString{String}}:
 "the"
 "a"
 "and"
 "I"
 ⋮
 "things)"
 "Tone-master"
 "onesMaterial"

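The same extraction can be written more compactly by broadcasting first and last over the pairs. The sketch below uses a made-up three-word dictionary in place of word_dict.

```julia
# Toy dictionary standing in for word_dict.
toy_dict = Dict("the" => 5, "guitar" => 2, "a" => 3)

# collect() turns the dictionary into a Vector of Pairs, which sort() can order.
ranked = sort(collect(toy_dict), by = last, rev = true)

# Broadcasting first/last over the pairs yields the words and the frequencies.
toy_words = first.(ranked)
toy_freqs = last.(ranked)

println(toy_words)
println(toy_freqs)
```
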
Barplotting with Gadfly

One of the best-known Julia plotting packages is Gadfly. Being a fan of R and its powerful ggplot2 package, I found navigating Gadfly a breeze.¹ Here is how you could plot the sorted frequencies of the 40 most frequent words as bars, keeping the words as labels on the x-axis and using the dodge position.

julia> using Gadfly

julia> Gadfly.plot(x=words[1:40], y=freqs[1:40], Geom.bar(position=:dodge))

As you can see, the shape of the distribution is consistent with Zipf's Law: a handful of words account for a large share of the occurrences, followed by a long tail of rare words.
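
A bar plot gives a visual impression only. A rough numerical check is to fit a straight line to log-frequency against log-rank, since an idealised Zipf distribution would give a slope close to -1. The sketch below is not part of the original analysis: the function name loglog_slope is made up, and it is run on synthetic frequencies rather than on freqs, so it says nothing about the slope of the review data.

```julia
# Least-squares slope of log(frequency) against log(rank).
function loglog_slope(freqs)
    x = log.(1:length(freqs))   # log rank
    y = log.(freqs)             # log frequency
    mx = sum(x) / length(x)
    my = sum(y) / length(y)
    return sum((x .- mx) .* (y .- my)) / sum((x .- mx) .^ 2)
end

# Synthetic frequencies that follow f(r) = 1000 / r exactly.
ideal = 1000 ./ (1:200)
slope = loglog_slope(ideal)

println(slope)
```

Calling loglog_slope(freqs) on the sorted frequencies computed earlier would give the corresponding estimate for the reviews.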

[1]: Roland Schaetzle wrote an excellent post on TDS that is highly recommended by the Julia community for those who migrate from R to Julia and want to have a similar plotting experience to ggplot.

Working with nested JSON strings/files in Julia

Alex Tantos — Wed, 05 Oct 2022 18:35:54 +0000

Why and when is `JSON` used?

Nowadays, plenty of software in all kinds of scientific and business fields produces data that is expected to be further analysed or manipulated. Data exchange between different platforms, software and languages is part of a data analyst's daily routine, but it also creates all sorts of issues that can be subsumed under the umbrella of the so-called interoperability problem: the problem of finding a common data exchange format between different platforms, software and languages.
JavaScript Object Notation (JSON) is becoming more and more popular as the data-exchange format that addresses this problem. At least in NLP, a data-intensive field, JSON strings and files are ubiquitous. JSON is a lightweight, human-readable, text-based serialization format that is easy to manipulate, i.e. JSON strings can easily be parsed and generated.

The JSON String and the Goal

The great power of this hierarchical way of representing data is that it allows arbitrarily many layers of nested information. Let's take a real-life scenario of extracting specific values out of deeply nested attributes in a JSON string. For a large-scale annotation project, our team has been working on JSON strings and files output by the Tagtog platform¹. Here is a short JSON string that I will be using for this post:

jsonstr = """
{
  "annotatable": {
    "parts": [
      "s1v1"
     ]
  },
  "anncomplete": true,
  "sources": [],
  "metas": {},
  "relations": [],
  "entities": [
    {
      "classId": "e_2",
      "part": "s1v1",
      "offsets": [
        {
          "start": **263**,
          "text": **"θελω"**
        }
      ],
      "coordinates": [],
      "confidence": {
        "state": "pre-added",
        "who": [
          "user:alextantos"
        ],
        "prob": 1
      },
      "fields": {
        **"f_26"**: {
          "value": **"desire"**,
          "confidence": {
            "state": "pre-added",
            "who": [
              "user:alextantos"
            ],
            "prob": 1
          }
        }
      },
      "normalizations": {}
    },
    {
      "classId": "e_2",
      "part": "s1v1",
      "offsets": [
        {
          "start": **271**,
          "text": **"σου"**
        }
      ],
      "coordinates": [],
      "confidence": {
        "state": "pre-added",
        "who": [
          "user:alextantos"
        ],
        "prob": 1
      },
      "fields": {
        **"f_30"**: {
          "value": **"second_person_weak"**,
          "confidence": {
            "state": "pre-added",
            "who": [
              "user:alextantos"
            ],
            "prob": 1
          }
        }
      },
      "normalizations": {}
    }
  ]
}
"""

The goal is to extract the asterisk-surrounded information in the code chunk above (the asterisks only mark the values of interest and are not part of the actual JSON string) and end up with the following tabular data:

Converting the `JSON` String to an All-Inclusive `DataFrame`

Before unwrapping and extracting the relevant information out of the JSON string, let's first convert it to an all-inclusive DataFrame that contains all the layers of information. Aside from DataFrames and Chain, the relevant packages I will be using for JSON string manipulation are JSON3 and JSONTables.

A few words about the relevant packages

JSON3
This package provides two main functions: JSON3.read() and JSON3.write(). JSON3.read() converts a JSON string into a JSON3.Object or a JSON3.Array. The major advantage of these two types is that they both allow dot or bracket indexing into the parsed JSON. Moreover, they may be further converted to Dict or even Vector objects.

JSONTables
The README.md file of the JSONTables repo says it all. So, this package

provides a JSON integration with the Tables.jl interface, that is, it provides the jsontable function as a way to treat a JSON object of arrays, or a JSON array of objects, as a Tables.jl-compatible source. This allows, among other things, loading JSON "tabular" data into a DataFrame, or a JuliaDB.jl table, or written out directly as a csv file.

JSON string => DataFrame

There are three steps we need to follow so that a JSON string is converted into a DataFrame.
First step: Reading in jsonstr with JSON3.read() (recall that jsonstr was created in the first section)

using Chain, DataFrames, JSON3, JSONTables
json3str = JSON3.read(jsonstr, jsonlines=true)

Something I did not mention above is that the JSON strings output by the Tagtog platform adopt the JSON Lines text file format, a popular, slightly modified version of JSON in which JSON values are separated by the line separator '\n'. Notice that the jsonlines argument above is set to true precisely to handle the JSON Lines text file format correctly.

Second step: Converting the JSON3 object into a Tables.jl-compatible object.
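
To see what the jsonlines keyword does in isolation, here is a small sketch with two made-up records, one JSON value per line. It assumes JSON3 is installed and relies on the documented behaviour that the parsed lines are collected into a JSON3.Array. The record contents are invented for the example and are not Tagtog output.

```julia
using JSON3

# Two made-up records joined by the line separator, as in a JSON Lines file.
lines = join(["""{"classId": "e_2", "start": 263}""",
              """{"classId": "e_2", "start": 271}"""], "\n")

# With jsonlines=true, each line is parsed as a separate JSON value.
records = JSON3.read(lines; jsonlines = true)

# Each element supports dot access to its attributes.
starts = [rec.start for rec in records]

println(starts)
```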

json3table = jsontable(json3str)

Third step: Converting json3table into a DataFrame object.

json3df = DataFrame(json3table)

Let's have a look at the result:

As expected, the output is a mess. The reason is that the JSON string we started with has been unwrapped at its first level only. As a result, the five columns of json3df map to the outermost shell of the initial complex JSON string jsonstr. Moreover, the first four of them are not interesting to us, since they are present only for metadata record-keeping purposes.

Focusing on the Relevant Attribute

Now, if we look at the initial JSON string, to extract the eight pieces of information that we are interested in, we should laser-focus on the :entities attribute column, which holds that complicated JSON3.Array object. One could access it with json3df[1,:entities], but this returns a JSON3.Array object that does not comply with the Tables.jl interface by itself and, thus, cannot be converted to a DataFrame directly. This is easy to fix, though; we simply use jsontable() and DataFrame() as done below:

json3dfclear = DataFrame(jsontable(json3df[1,:entities]))

Going even deeper into json3dfclear

The last part of this journey leads to the initial goal, namely to extract the eight asterisk-surrounded pieces of information within the jsonstr object of the first section and put them in separate columns of a DataFrame. So, here is the code:

json3dfclear = @chain json3dfclear begin
    select(
        :fields => ByRow(x -> reduce(vcat, keys(x))) => :field_ids,
        :fields => ByRow(x -> reduce(vcat, values(x))) => :field_values,
        :offsets => ByRow(x -> reduce(vcat, values(x))) => :offset_values
    )
    transform(:field_values => ByRow(x -> reduce(vcat, values(x))) => [:field_name, :rest])
    transform(:offset_values => ByRow(x -> reduce(vcat, values(x))) => [:offset, :text])
    select( [:field_ids, :text, :field_name, :offset])
end

The first select() picks the :fields column, reshapes the keys and the values of the JSON3 objects it contains, and names the newly created columns :field_ids and :field_values, respectively; this extracts the third and fourth pieces of information, i.e. the field id and the field value. It also picks the :offsets column, so that the values of the offsets are collected in :offset_values. The two transform() calls then unpack the :field_values and :offset_values columns, which are themselves JSON3 objects, into the columns :field_name and :rest, and :offset and :text, respectively. So, again by reshaping the data with reduce() and vcat(), the values are easily extracted. For an in-depth understanding of the objects that are created and included in the DataFrame json3dfclear, pay attention to its structure and contents.
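
For readers who find the chained transformations hard to follow, the same four values can also be collected with an explicit comprehension over the parsed JSON. This is an alternative sketch, not the method used in this post: it assumes JSON3 is installed, works on a trimmed-down, asterisk-free copy of jsonstr (the name mini is made up), and produces a vector of NamedTuples, which DataFrame() accepts as a Tables.jl source.

```julia
using JSON3

# Trimmed-down version of jsonstr with only the attributes we need.
mini = """
{"entities": [
  {"offsets": [{"start": 263, "text": "θελω"}],
   "fields": {"f_26": {"value": "desire"}}},
  {"offsets": [{"start": 271, "text": "σου"}],
   "fields": {"f_30": {"value": "second_person_weak"}}}
]}
"""

doc = JSON3.read(mini)

# One NamedTuple per (entity, field) pair, with the same column names as the final table.
rows = [(field_ids = String(id), text = ent.offsets[1].text,
         field_name = field.value, offset = ent.offsets[1].start)
        for ent in doc.entities for (id, field) in ent.fields]

foreach(println, rows)
```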

Lastly, we select only the relevant columns that result in the following table:

[1]: Tagtog is a great web-based annotation platform with a bunch of nice features for collaborative annotation that I have been using for my corpus linguistics classes, and I strongly recommend you visit their website.

Finding (Semantically) Similar Vectors with Julia is Easy: The First Step

Alex Tantos — Thu, 08 Sep 2022 12:54:23 +0000

Why Should You Care for Semantic Similarity?

Recall the last time you read a Medium post or a newspaper article. Even before you started reading that text, you had specific expectations as to the kind of vocabulary used in it and the type of terminology that you would meet. That set of expectations has been built up by the way you learned to classify the world around you, the relations between people and society, the ideologies that they carry etc. All these expectations are embodied in the actual language of the text, and your ability to use them so that you can choose what to read and what to ignore is closely related to tracing semantically similar words, phrases, paragraphs and texts. This valuable human skill of tracing semantically similar things is highly desirable in a number of scientific fields and practical applications. However, nowadays, more often than not, human intuition is not the right tool to approach big datasets or to reveal hidden aspects of even moderate datasets. And this is one of the many occasions in life where maths saves us by offering objectively defined ways to measure semantic similarity before deciding whether two pieces of data are semantically similar.

There is a wide range of application areas for semantic similarity, as already mentioned. In NLP, as well as in any other field where string/byte sequencing is central to data analysis and modeling, such as computational biology, where the GC content of DNA sequences is recorded and analyzed, or image processing, where tracing byte sequence patterns is important, one very important first step is to measure semantic similarity between features of the collected or simulated data.

Focusing on NLP, the term feature refers to words, phrases, sentences, paragraphs or even whole text documents. The idea is that if a system is able to compare two words, two phrases, two paragraphs and so on with each other and successfully compute their semantic similarity/dissimilarity, a door of possibilities opens for improving the performance in many NLP tasks such as information retrieval, information extraction, machine translation, text summarization, topic modeling, sentiment analysis, question answering, paraphrasing etc.

Sparse or Dense Vector Representations?

Measuring semantic similarity presupposes that the data are represented suitably. However, many types of data, including textual data, are unstructured and first need to be preprocessed and transformed to a format or representation that can be further exploited for calculating similarities. As in many similar cases, linear algebra is the right tool for us. Translating words, phrases, paragraphs or texts into numeric vectors that represent meaningful textual units brought tremendous changes in NLP at the beginning of the 2000s. As a matter of fact, the first tradition of vector space models supported the idea that textual units (i.e. words, phrases, paragraphs and texts) can be meaningfully represented via sparse vectors. The linguistic meaning of these units is condensed or squeezed or embedded in an n-dimensional vector space that we can use to observe and extract meaningful relations among these units.

A new revolution came not long afterwards. The by-now famous paper on Word2Vec appeared in 2013 and led to a burst of dense vector representations.¹ Although the dense vector tradition outperforms the sparse vector one in almost all tasks, there are still some advantages in using sparse vectors. For one, if the available trained models were not based on the language variety that you are interested in, then you would probably need to train your own dense vector model; and training a dense vector model, especially a large one, requires a large amount of data, making it very costly in terms of time, computing resources and even environmental impact (see Hugging Face’s course on Transformers for more details on the environmental impact of training new dense vector representation models).

Summing up, there are two different vector representation traditions related to semantic similarity:

the first tradition is based on sparse vector representations and prevailed from the beginning of the 2000s until around 2013, when
the by-now famous paper on Word2Vec appeared, which introduced dense vector representations and established the more recent tradition of word embeddings.

As just mentioned, training new dense vector representations or using the existing ones may be advantageous in some but not in all cases. Moreover, creating sparse vector representations is a healthy habit for a) inspecting the frequencies and/or weights of textual units, b) obtaining some good first insights into the writing style and the text genre and c) extracting useful language use patterns.

There are numerous high-quality tutorials, papers and YouTube videos that explain in detail what sparse vector representations are, and it is not my intention to replace them. In this post, I will create from scratch a sparse vector representation for the words of a short text passage, before I compute the semantic similarity between word pairs in a following post. I will also extract the profile of a word that occurs in the same text passage. To compute semantic similarity based on sparse vector representations, one needs to pay attention to the following three basic steps:

building the word-word co-occurrence matrix
measuring association with context
measuring vector similarity

The Co-occurrence Matrix

Sparse vector representations are based on various types of co-occurrence formats. The most common ones are word-document and word-word vectors.²

The Word-Document Matrix

Each cell in a word-document vector holds the (raw or weighted) frequency of a specific word in a single text of a collection of texts. There are two sparse word vectors in the table below: the first row of the table numerically represents the word love and the second the word programming. Each cell number is the frequency of the word in the respective text.

Stacking word vectors reminds you of something, right? You guessed it: gathering all the unique words of a corpus (i.e., a collection of texts) and placing their vectors on top of each other results in a matrix. This matrix is called the word-document (or term-document) matrix. The rows of that matrix correspond to words and the columns to the text documents of the corpus.

Notice that, as expected in real corpora, there are several cells in the above word-document matrix that have a zero value; for example, the word love did not occur in the texts text4, text5 and text6. Imagine now a billion-word corpus that consists of hundreds of thousands of texts. Counting the occurrences of unique words in the texts inevitably results in a co-occurrence matrix with many 0 values, since (except for the so-called stopwords that carry grammatical meaning and appear in all texts) it is very often the case that a word does not occur in a given text of such a corpus. That is why the row (word) vectors of these co-occurrence matrices are considered sparse vector representations. They only sparsely have a value other than 0 in their cells.
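
As a concrete illustration of the construction just described, here is a minimal sketch that builds a word-document matrix of raw frequencies using only Base. The three documents are made up and far smaller than a real corpus.

```julia
# Three tiny made-up documents.
docs = ["i love julia", "i love programming in julia", "programming is fun"]

tokenized = split.(docs)
vocab = sort(unique(reduce(vcat, tokenized)))   # one row per unique word

# Rows are words, columns are documents; each cell holds a raw frequency.
td = zeros(Int, length(vocab), length(docs))
for (j, tokens) in enumerate(tokenized), token in tokens
    i = findfirst(==(token), vocab)
    td[i, j] += 1
end

# The row vector of the word "love" across the three documents.
love_vector = td[findfirst(==("love"), vocab), :]

println(vocab)
println(love_vector)
```

Even in this toy case, several cells are 0 because most words do not occur in every document.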

The Word-Word Co-occurrence Matrix

The only difference between a word-document and a word-word matrix is that in the latter both the rows and the columns are labeled by words. This means that the (raw or weighted) frequencies recorded in the cells represent how often a word is found within a certain distance of another word. The distance is a parameter that you are expected to have specified in advance.

So, for a distance parameter set to 3, a word-word co-occurrence matrix might look like the table below. A cell of the word-word matrix displays how often the two words labeling the corresponding row and column of that cell co-occur within a window of 3 words.
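
Before turning to a package, it may help to see the counting logic written out by hand. The sketch below counts raw co-occurrences within a window using only Base. The function name cooccurrence_counts is made up, and the counting convention (each pair of positions counted once, mirrored to keep the matrix symmetric) is an assumption of this sketch, so its numbers need not match what a library produces.

```julia
# Count how often two words occur within `window` positions of each other.
function cooccurrence_counts(tokens; window = 3)
    vocab = sort(unique(tokens))
    index = Dict(word => i for (i, word) in enumerate(vocab))
    counts = zeros(Int, length(vocab), length(vocab))
    for i in eachindex(tokens), j in (i + 1):min(i + window, lastindex(tokens))
        a, b = index[tokens[i]], index[tokens[j]]
        counts[a, b] += 1
        counts[b, a] += 1   # mirror the count to keep the matrix symmetric
    end
    return vocab, index, counts
end

tokens = split("julia strings are immutable and julia strings are indexed")
vocab, index, counts = cooccurrence_counts(tokens; window = 3)

println(counts[index["julia"], index["strings"]])
```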

Let’s Get Down to Work with Coding a Co-occurrence Matrix

Let’s first load TextAnalysis.jl, the most well-known Julia package for text processing, that will be offering us valuable functions until the end of this post.

using TextAnalysis, Downloads
str1 = read(Downloads.download("https://raw.githubusercontent.com/JuliaLang/julia/master/doc/src/manual/strings.md"), String);

In the above code chunk, the variable str1 is assigned the raw Markdown text of the Strings chapter of the Julia documentation. Note that the text has been downloaded from GitHub and read in as an object of type String. TextAnalysis.jl does not directly handle objects of type String and first needs them converted to one of its own data types, used for optimizing string processing and manipulation: FileDocument, StringDocument and NGramDocument. The relevant type for our str1 object is StringDocument.

Now, here is how we can create the word-word co-occurrence matrix for the words in str1 that are found within a distance window of 3 words.³

julia> coo_str1 = CooMatrix(StringDocument(str1), window=3);

The CooMatrix type constructor accepts an object of either FileDocument or StringDocument type, but not of type NGramDocument, and returns an object of type CooMatrix. As with any other object in Julia, to inspect the fields of the returned object coo_str1, you can call fieldnames() on the data type that coo_str1 belongs to.

julia> fieldnames(typeof(coo_str1))
(:coom, :terms, :column_indices)

The coom field stores the actual co-occurrence matrix with the normalized frequencies of all words in the Strings chapter of the Julia online documentation. As expected, even for this relatively short text, the co-occurrence matrix is pretty large.

julia> size(coo_str1.coom)
(1451, 1451)

Let’s take a sneak peek into the contents of its first two rows:

coo_str1.coom[1:2,:]

Notice that the cell values are not integers, since by default the raw frequencies are normalized by the distance between the positions of the co-occurring words. Another important thing to keep in mind here is the high number of 0 values in the table, which signifies that there are lots of word pairs that do not co-occur in a window of 3 words.

If you would like to extract the raw, non-normalized co-occurrence frequencies, you need to adjust the value of the keyword argument normalize, which is set to true by default.

coo_str1_raw = CooMatrix(StringDocument(str1), window=3, normalize=false)

Here are the first two rows of the word-word co-occurrence matrix that is based on raw frequency:

coo_str1_raw.coom[1:2,:]

So far, so good. However, you are probably wondering right now: “how on earth could I browse through such a matrix that lacks any row and column labels?” Let’s answer this question in the next section.

Labeling Rows & Columns with Words

The column_indices field of coo_str1 is an object of OrderedDict type, a type that resembles a hash map data structure and maps each word to a number. For instance, the word regular maps to the index 1021 in coo_str1.coom.

julia> coo_str1.column_indices
OrderedDict{String, Int64} with 1451 entries:
  "1"                                          => 419
  "regular"                                    => 1021
  "Vector"                                     => 665
  "abracadabra"                                => 976
  "comparisons"                                => 408
  "whose"                                      => 873
  "’"                                          => 1051
  "Many"                                       => 451
  "continuation."                              => 734
  "gives"                                      => 1001
  "to/from"                                    => 195
  "unquoted"                                   => 892
  "plain"                                      => 127
  "https://www.pcre.org/current/doc/html/pcre" => 1065
  "matched"                                    => 1091
  "Any"                                        => 1267
  ⋮                                            => ⋮

Then, the 1021st row of the co-occurrence matrix (here the raw one, coo_str1_raw.coom) holds the 1451 co-occurrence frequencies of regular.

julia> coo_str1_raw.coom[coo_str1_raw.column_indices["regular"],:]
1451-element SparseArrays.SparseVector{Float64, Int64} with 53 stored entries:
  [5   ]  =  4.0
  [10  ]  =  6.0
  [13  ]  =  4.0
  [17  ]  =  2.0
  [18  ]  =  6.0
          ⋮
  [1076]  =  2.0
  [1087]  =  2.0
  [1231]  =  2.0
  [1232]  =  2.0
  [1236]  =  2.0
  [1254]  =  2.0

Since the co-occurrence matrix is symmetric, each column of coo_str1_raw.coom is identical to the row with the same index, as you can see below.

julia> coo_str1_raw.coom[:,coo_str1_raw.column_indices["regular"]]
1451-element SparseArrays.SparseVector{Float64, Int64} with 53 stored entries:
  [5   ]  =  4.0
  [10  ]  =  6.0
  [13  ]  =  4.0
  [17  ]  =  2.0
  [18  ]  =  6.0
          ⋮
  [1076]  =  2.0
  [1087]  =  2.0
  [1231]  =  2.0
  [1232]  =  2.0
  [1236]  =  2.0
  [1254]  =  2.0
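
Rather than comparing a row and a column by eye, the symmetry can also be checked programmatically. The sketch below uses a small made-up sparse matrix in place of coo_str1_raw.coom and only standard-library functions.

```julia
using SparseArrays, LinearAlgebra

# Small made-up symmetric sparse matrix standing in for coo_str1_raw.coom.
toy = sparse([1, 2, 2, 3], [2, 1, 3, 2], [4.0, 4.0, 2.0, 2.0], 3, 3)

# issymmetric() checks the whole matrix at once.
symmetric = issymmetric(toy)

# For a symmetric matrix, row i and column i hold the same values.
same_profile = toy[2, :] == toy[:, 2]

println(symmetric)
println(same_profile)
```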

The indices in coo_str1_raw.column_indices, i.e. the values of this OrderedDict, are identical to the positions of the words in the coo_str1_raw.terms vector of strings and correspond to the row/column numbers of the coo_str1_raw.coom co-occurrence matrix (recall that coo_str1_raw.coom is symmetric). coo_str1_raw.terms holds the unique terms, i.e. words, of str1. This means that regular is in the 1021st position of the coo_str1_raw.terms vector. Let’s take advantage of this and use it for extracting the words that we want. To get the co-occurrence frequency of the pair of words unquoted and appearing, we simply use basic indexing. The return value 0.0 tells us that the two words did not co-occur in a window of 3 words.

julia> coo_str1_raw.coom[coo_str1_raw.column_indices["unquoted"], coo_str1_raw.column_indices["appearing"]]
0.0

Since this seems to be a useful piece of code for navigating through the data, why don’t we wrap it into a function?

julia> function browsecoompairs(coo::CooMatrix, term1::String, term2::String)
    coo.coom[coo.column_indices[term1], coo.column_indices[term2]]
end
browsecoompairs (generic function with 1 method)
julia> browsecoompairs(coo_str1_raw, "unquoted", "appearing")
0.0
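
One caveat: indexing column_indices with a word that does not occur in the text throws a KeyError. A more forgiving lookup could return nothing in that case. The sketch below is one way to write it. The function name pairfreq is made up, and a plain Dict and Matrix with made-up values stand in for the column_indices and coom fields of a CooMatrix.

```julia
# Look up a co-occurrence value, returning `nothing` when either term is unknown.
function pairfreq(coom, indices, term1, term2)
    i = get(indices, term1, nothing)
    j = get(indices, term2, nothing)
    (i === nothing || j === nothing) && return nothing
    return coom[i, j]
end

# Made-up index and matrix for two terms.
toy_indices = Dict("regular" => 1, "expressions" => 2)
toy_coom = [0.0 2.0; 2.0 0.0]

known = pairfreq(toy_coom, toy_indices, "regular", "expressions")
unknown = pairfreq(toy_coom, toy_indices, "regular", "banana")

println(known)
println(unknown)
```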

Getting the Word Profiles

Another interesting insight that we can get out of the co-occurrence matrix is the profile of a word. We can think of it as the set of words with which a word actually co-occurred, i.e. the words whose cell in that word's row of the co-occurrence matrix is not 0.

julia> sum(x->x>0, coo_str1_raw.coom[coo_str1_raw.column_indices["String"],:], dims=1)
1-element Vector{Int64}:
 66

Nice! 66 words co-occur at least once with the word String in a window of 3 words. It is not a surprise that there are so many distinct words that co-occur with String in such a short text, though, given that str1 is a text string loaded from the Strings chapter of the Julia documentation. Let’s see which these co-occurring words are. The first step is to get the boolean vector that flags which of the words in str1 co-occur with String and store it in a variable.

julia> string_cooc = coo_str1_raw.coom[coo_str1_raw.column_indices["String"],:] .> 0
1451-element SparseArrays.SparseVector{Bool, Int64} with 66 stored entries:
  [1   ]  =  1
  [2   ]  =  1
  [4   ]  =  1
  [5   ]  =  1
  [6   ]  =  1
          ⋮
  [754 ]  =  1
  [901 ]  =  1
  [902 ]  =  1
  [1012]  =  1
  [1240]  =  1
  [1439]  =  1

As you can see above, string_cooc is a SparseVector, a special type of vector, full of 1s accompanied by a positional index. If we dig a bit more into the string_cooc object, we will find out that it has a field called nzind that returns a vector of these indices.⁴

julia> fieldnames(typeof(string_cooc))
(:n, :nzind, :nzval)
julia> show(string_cooc.nzind)
[1, 2, 4, 5, 6, 9, 10, 17, 18, 26, 37, 44, 55, 65, 75, 92, 124, 132, 137, 160, 172, 176, 177, 178, 179, 181, 188, 228, 232, 252, 254, 261, 262, 264, 291, 351, 360, 423, 424, 452, 466, 467, 468, 501, 502, 503, 504, 521, 532, 534, 546, 547, 549, 571, 572, 580, 676, 751, 752, 753, 754, 901, 902, 1012, 1240, 1439]

Recall that these positional indices map to the entries of coo_str1_raw.terms, which contains the actual words. So, things are pretty easy now. Let’s extract the list of the 66 words with simple indexing.

julia> show(coo_str1_raw.terms[string_cooc.nzind])
["#", "[", "]", "(", "@", ")", "are", ",", "the", "a", "`", "and", "0", "in", "as", "is", "Julia", "In", "strings", "When", ":", "type", "for", "literals", "String", ".", "8", "32", "which", "indices", "index", "indexing", "into", "encoded", "julia>", "necessarily", "four", "Basics", "delimited", "objects", "given", "dimension.", "like", "access", "14", "-codeunit", "at", "character.", "create", "SubString", "{", "}", "SubStrings", "support", "Unicode.", "per", "UInt", "UTF", "16", "types.", "Additional", "Triple-Quoted", "Literals", "Non-Standard", "ordinary", "Raw"]

These are the 66 words that String co-occurs with within a window of 3 words. Since we would rather not repeat these steps by hand each time, we might as well wrap them into a function.

julia> function cooccurrences(coo::CooMatrix, baseword::String)
           basewordcooc = coo.coom[coo.column_indices[baseword],:] .> 0
           coo.terms[basewordcooc.nzind]
       end
cooccurrences (generic function with 1 method)
julia> show(cooccurrences(coo_str1_raw, "String"))
["#", "[", "]", "(", "@", ")", "are", ",", "the", "a", "`", "and", "0", "in", "as", "is", "Julia", "In", "strings", "When", ":", "type", "for", "literals", "String", ".", "8", "32", "which", "indices", "index", "indexing", "into", "encoded", "julia>", "necessarily", "four", "Basics", "delimited", "objects", "given", "dimension.", "like", "access", "14", "-codeunit", "at", "character.", "create", "SubString", "{", "}", "SubStrings", "support", "Unicode.", "per", "UInt", "UTF", "16", "types.", "Additional", "Triple-Quoted", "Literals", "Non-Standard", "ordinary", "Raw"]
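To see the logic of cooccurrences() in isolation, here is a minimal sketch on a toy matrix. `ToyCoo` and `toy_profile` are hypothetical names introduced only for this illustration; `ToyCoo` merely mimics the three fields used above (coom, terms, column_indices) and is not the TextAnalysis.jl type. The counts are made up.

```julia
using SparseArrays

# Hypothetical stand-in for the fields used above (not the TextAnalysis type).
struct ToyCoo
    coom::SparseMatrixCSC{Float64, Int}
    terms::Vector{String}
    column_indices::Dict{String, Int}
end

terms = ["Julia", "strings", "are", "fast"]

# Symmetric made-up counts: "Julia" co-occurs with "strings" and "fast",
# and "strings" also co-occurs with "are".
rows = [1, 2, 1, 4, 2, 3]
cols = [2, 1, 4, 1, 3, 2]
vals = [2.0, 2.0, 1.0, 1.0, 1.0, 1.0]
toy = ToyCoo(sparse(rows, cols, vals, 4, 4), terms,
             Dict(t => i for (i, t) in enumerate(terms)))

# Same steps as cooccurrences(), but accepting any AbstractString and
# reading the indices through findnz instead of the nzind field.
function toy_profile(coo::ToyCoo, baseword::AbstractString)
    row = coo.coom[coo.column_indices[baseword], :]
    idx, _ = findnz(row .> 0)
    return coo.terms[idx]
end

println(toy_profile(toy, "Julia"))
```

Annotating the argument as AbstractString rather than String is a design choice: it lets the function also accept the SubString values that split() returns.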

Now you can further investigate the co-occurrence values for one or more of these words using the browsecoompairs() function, explained above. So, we’ve come a long way since we loaded str1 into memory! Before leaving you, for now, I would like to take one more look at coo_str1.coom.

Digging a bit more into the Co-occurrence Matrix

As we saw above, the coo_str1.coom object is a 1451×1451 matrix, which means that it contains 2105401 cells. For such a small text, it is almost shocking to realize that the word-word co-occurrence matrix is so large.

Let’s find out how many of the cells have a value larger than 0:

julia> sum(x->x>0, coo_str1_raw.coom)
26728

This means that only 26728 out of the 2105401 word pairs have a co-occurrence value greater than 0; in other words, only ~1.3% of the matrix is non-zero and ~98.7% of its cells are equal to 0. TextAnalysis.jl relies on the SparseArrays package, which handles such sparse matrices very efficiently. In fact, coo_str1.coom is of type SparseMatrixCSC. I suggest you have a look at SparseArrays.jl to find out more about the storage tricks and clever ways of handling sparse matrices such as coo_str1_raw.coom.

julia> typeof(coo_str1.coom)
SparseArrays.SparseMatrixCSC{Float64, Int64}
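The percentages above can be checked with a couple of lines, and the same measurement can be repeated on any sparse matrix. The sketch below assumes only the SparseArrays standard library; the toy matrix `A` is made up and merely stands in for a co-occurrence matrix.

```julia
using SparseArrays

# Arithmetic behind the figures quoted above.
n_cells = 1451^2              # number of cells in a 1451×1451 matrix
density = 26728 / n_cells     # share of cells with a value greater than 0
println(round(100 * density, digits = 2))

# The same measurement on a made-up 1000×1000 matrix with three stored entries.
A = sparse([1, 2, 3], [1, 3, 2], [1.0, 2.0, 1.0], 1000, 1000)
println(nnz(A) / length(A))   # stored entries divided by total cells

# The sparse representation needs far less memory than the dense one.
println(Base.summarysize(A) < Base.summarysize(Matrix(A)))
```

Note that nnz() counts stored entries, which can include explicitly stored zeros, so sum(x->x>0, A) is the more literal count when that distinction matters.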

That’s it for now! Although the focus of this post is on NLP, I hope it is relatively easy to draw analogies between words, lexemes and texts and the units of analysis in other fields, and to follow up on the ideas of this post.

[1]: Here is the original paper on Word2Vec "Efficient Estimation of Word Representations in Vector Space": https://arxiv.org/pdf/1301.3781.pdf

[2]: Since the complexity of recognizing (or else tokenizing in terms of computational processing) and analyzing features that are beyond the word level is high and is more relevant to theoretical linguists, I will stay with the relatively easily identifiable words that can be thought of as autonomous graphemic units that are separated most of the time by spaces. So, for the non-linguists, words are defined as sequences of characters that are separated by spaces within a larger string.

[3]: Recall that for the word word1 the window of 3 words is defined as follows: pos1 pos2 pos3 word1 pos4 pos5 pos6

[4]: show() does not change the result of the expression inside the parentheses. It simply makes all the output values appear on the console instead of a truncated summary.

Creating a Contingency Table in Julia

Alex Tantos — Sat, 27 Aug 2022 17:44:00 +0000

Why are contingency tables useful?

More often than not, our data sets include categorical variables that encode qualitative features of our data and take a limited number of possible discrete values. A very common scenario is to check whether there is an association between pairs of such variables. For instance, we may want to check whether the type of car is associated with the number of gears or the number of cylinders in the well-known mtcars dataset included in the RDatasets package.¹

There are numerous measures designed specifically for assessing whether pairs of categorical variables are associated, and they are used in scientific fields such as linguistics, biology and physics.

What is a contingency table?

However, to be able to calculate such association measures between pairs of categorical variables, it is essential to prepare a contingency table.

A contingency table is a two-dimensional table in which each of the rows represents one level of the first categorical variable and each of the columns represents one level of the second categorical variable. Each cell holds the number of observations that show the corresponding combination of levels.
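To make the definition concrete, here is a minimal sketch that builds such a table by hand in plain Julia, without any packages. The two vectors hold made-up values (they are not the mtcars data), and all variable names are chosen only for this illustration.

```julia
# Made-up observations of two categorical variables (not the mtcars data).
cyl  = ["4", "6", "4", "8", "8", "6", "4"]
gear = ["4", "4", "5", "3", "3", "3", "4"]

# The sorted unique values serve as the row and column levels.
row_levels = sort(unique(cyl))
col_levels = sort(unique(gear))

# Map each level to its row or column position.
row_pos = Dict(l => i for (i, l) in enumerate(row_levels))
col_pos = Dict(l => j for (j, l) in enumerate(col_levels))

# Count how often each combination of levels occurs.
table = zeros(Int, length(row_levels), length(col_levels))
for (c, g) in zip(cyl, gear)
    table[row_pos[c], col_pos[g]] += 1
end

display(table)
```

Every observation lands in exactly one cell, so the cells always add up to the number of observations.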

To work efficiently with this type of data, you first need to assign these columns a type that reflects their statistical nature and retains the information about the levels of the categorical variable and any ordering that they may have. In other words, you need a data type that corresponds to factors in R. A common, though not the only, way to represent categorical data in Julia is through the CategoricalArray data type.²

Let's create the contingency table of the number of gears and the number of cylinders. Both of these variables are interpreted as Integers, but their values are discrete and, clearly, categorical in nature.

Loading the mtcars dataset

using DataFrames, Chain, RDatasets, FreqTables, CategoricalArrays

cars = dataset("datasets", "mtcars")

Extracting and converting the two categorical variables to `CategoricalArray`s

Although the two variables are read/parsed as Integers, they should first be converted to CategoricalValues. The following code first converts the Integer values to Strings with the string() function and then to a CategoricalArray with the CategoricalArray() constructor.

cars[!, [:Gear, :Cyl]] = @chain cars begin
    combine(_, [:Gear, :Cyl] .=> (x -> CategoricalArray(string.(x))), renamecols=false)
end

Creating the contingency table

cyl_gear_freq = freqtable(cars, :Cyl, :Gear)

Adding the row totals

To add the column with the row totals, you need to apply a transformation that a) passes the selected columns as a named tuple with AsTable(), b) applies the sum function to it and c) names the resulting column Total.

Notice that cyl_gear_freq is of type NamedMatrix. To transform it into a DataFrame, we need the array of its values, which is stored in its array field. Moreover, since it is a two-dimensional object, it has two axes of names, and we need the second one to assign column names to the new DataFrame.

cyl_gear = @chain DataFrame(cyl_gear_freq.array, Symbol.(names(cyl_gear_freq)[2])) begin
  transform!(AsTable(All()) => sum => :Total)
end

Adding the column totals

Finally, to add the column totals, you can use push!(), which appends a new row, built by a comprehension over the columns, to cyl_gear.

push!(cyl_gear, [sum(col) for col in eachcol(cyl_gear)])
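The order of the two steps matters: because the Total column already exists when the row of column totals is appended, the last cell ends up holding the grand total. The sketch below replays the same idea on a plain matrix in base Julia; the counts are made up and the variable names are chosen only for this illustration.

```julia
# A made-up 3×3 table of counts.
counts = [0 2 1;
          1 1 0;
          2 0 0]

# Row totals: sum across the columns (dims = 2) and append them as an extra column.
with_rows = hcat(counts, sum(counts, dims = 2))

# Column totals: sum down the rows (dims = 1) and append them as an extra row.
# The Total column is summed too, so the last cell is the grand total.
with_margins = vcat(with_rows, sum(with_rows, dims = 1))

display(with_margins)
```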

The final result

Here is what the contingency table with its margins looks like:

[1]: Recall that the RDatasets package provides the pool of commonly used datasets that are available when R is loaded.

[2]: This data type is also recommended by the DataFrames.jl package documentation as well as by the recently published book Julia for Data Analysis, written by Bogumił Kamiński, for expressing categorical variables in Julia.