Finding (Semantically) Similar Vectors with Julia is Easy: The First Step

Alex Tantos

Why Should You Care about Semantic Similarity?

Recall the last time you read a Medium post or a newspaper article. Even before you started reading, you already had specific expectations about the kind of vocabulary used in it and the type of terminology you would meet. That set of expectations has been built up by the way you learned to classify the world around you, the relations between people and society, the ideologies they carry, and so on. All these expectations are embodied in the actual language of the text, and your ability to use them to choose what to read and what to ignore is closely related to tracing semantically similar words, phrases, paragraphs and texts. This valuable human skill of tracing semantically similar things is highly desirable in a number of scientific fields and practical applications. However, nowadays, more often than not, human intuition is not the right tool for approaching big datasets or for revealing hidden aspects of even moderate ones. This is one of the many occasions in life where maths saves us, by offering objectively defined ways to measure semantic similarity before deciding whether two pieces of data are semantically similar.

There is a wide range of application areas for semantic similarity, as already mentioned. In NLP, as well as in any other field where string/byte sequencing is central to data analysis and modeling (such as computational biology, where the GC content of DNA sequences is recorded and analyzed, or image processing, where tracing byte-sequence patterns is important), one very important first step is to measure semantic similarity between features of the collected or simulated data.

Focusing on NLP, the term feature refers to words, phrases, sentences, paragraphs or even whole text documents. The idea is that if a system is able to compare two words, two phrases, two paragraphs and so on, and successfully compute their semantic similarity or dissimilarity, a door of possibilities opens for improving performance on many NLP tasks such as information retrieval, information extraction, machine translation, text summarization, topic modeling, sentiment analysis, question answering, paraphrasing, etc.

Sparse or Dense Vector Representations?

Measuring semantic similarity presupposes that the data are represented suitably. However, many types of data, including textual data, are unstructured and first need to be preprocessed and transformed into a format or representation that can then be exploited for calculating similarities. As in many similar cases, linear algebra is the right tool for us. Translating words, phrases, paragraphs or texts into numeric vectors that represent meaningful textual units brought tremendous changes to NLP at the beginning of the 2000s. As a matter of fact, the first tradition of vector space models supported the idea that textual units (i.e. words, phrases, paragraphs and texts) can be meaningfully represented via sparse vectors. The linguistic meaning of these units is condensed, squeezed or embedded into an n-dimensional vector space that we can use to observe and extract meaningful relations among them.

A new revolution came not long afterwards. The by-now famous paper on Word2Vec appeared in 2013 and led to a burst of dense vector representations.¹ Although the dense vector tradition outperforms the sparse one in almost all tasks, there are still some advantages to using sparse vectors. For one, if the available trained models were not based on the language variety you are interested in, you would probably need to train your own dense vector model; and training a dense vector model, especially a large one, requires a large amount of data, making it very costly in terms of time, computing resources and even environmental impact (see Hugging Face’s course on Transformers for more details on the environmental impact of training new dense vector representation models).

Summing up, there are two different vector representation traditions related to semantic similarity:

  • the first tradition is based on sparse vector representations and prevailed from the beginning of the 2000s until around 2013, when

  • the by-now famous paper on Word2Vec appeared, introducing dense vector representations and establishing the more recent tradition of word embeddings.

As just mentioned, training new dense vector representations or using existing ones may be advantageous in some but not all cases. Moreover, creating sparse vector representations is a healthy habit for a) inspecting the frequencies and/or weights of textual units, b) obtaining some first good insights into the writing style and the text genre and c) extracting useful language use patterns.

There are numerous high-quality tutorials, papers and YouTube videos that explain in detail what sparse vector representations are, and it is not my intention to replace them. In this post, I will create from scratch a sparse vector representation for the words of a short text passage, before computing the semantic similarity between word pairs in a following post. I will also extract the profile of a word that occurs in the same text passage. To compute semantic similarity based on sparse vector representations, one needs to pay attention to the following three basic steps:

  1. building the word-word co-occurrence matrix
  2. measuring association with context
  3. measuring vector similarity

The Co-occurrence Matrix

Sparse vector representations are based on various types of co-occurrence formats. The most common ones are word-document and word-word vectors.²

The Word-Document Matrix

Each cell in a word-document vector contains the (raw or weighted) frequency of a specific word in a single text of a collection of texts. There are two sparse word vectors in the table below: the first row of the table represents the word love numerically and the second the word programming. Each cell number is the frequency of the word in the respective text.
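For illustration, a toy word-document matrix of this kind could look as follows (the counts here are invented):

              text1  text2  text3  text4  text5  text6
love             10      4      7      0      0      0
programming       2      8      0      5      3      9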

Placing each word vector like this reminds you of something, right? Hmm... you guessed it: gathering all unique words of a corpus (i.e., a collection of texts) and placing their vectors on top of each other results in a matrix. This exact matrix is also called the word-document (or term-document) matrix. The rows of that matrix correspond to words and the columns to the text documents of the corpus.

Notice that, as expected in real corpora, several cells in the above word-document matrix have a zero value; for example, the word love did not occur in the texts text4, text5 and text6. Imagine now a billion-word corpus that consists of hundreds of thousands of texts. Counting the occurrences of unique words in the texts inevitably results in a co-occurrence matrix with many 0 values, since (except for the so-called stopwords, which carry grammatical meaning and appear in all texts) it is very often the case that a word does not occur in a given text of such a corpus. That is why the row (word) vectors of these co-occurrence matrices are considered sparse vector representations: only sparsely do their cells hold a value other than 0.
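If you want to build such a term-document matrix yourself in Julia, TextAnalysis.jl offers the DocumentTermMatrix type. Here is a minimal sketch on an invented three-document corpus; note that DocumentTermMatrix stores documents as rows and terms as columns, i.e. the transpose of the word-document layout described above.

using TextAnalysis

crps = Corpus([StringDocument("I love Julia programming"),
               StringDocument("I love Julia"),
               StringDocument("programming in Julia")])
update_lexicon!(crps)          # collect the vocabulary of the corpus first

m = DocumentTermMatrix(crps)   # documents as rows, terms as columns
dtm(m, :dense)                 # the raw frequency counts as a dense matrix
m.terms                        # the words labeling the columns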

The Word-Word Co-occurrence Matrix

The only difference between a word-document and a word-word matrix is that in the latter both the rows and the columns are labeled by words. This means that the (raw or weighted) frequencies recorded in the cells represent how often a word occurs within a certain distance of another word. The distance is a parameter, let’s say, that you are expected to have prespecified.

So, for a distance parameter set to 3, each cell of the word-word co-occurrence matrix displays the co-occurrence frequency with which the two words labeling the corresponding row and column of that cell occur within a window of 3 words of each other.

Let’s Get Down to Work with Coding a Co-occurrence Matrix

Let’s first load TextAnalysis.jl, the most well-known Julia package for text processing, which will be offering us valuable functions until the end of this post.

using TextAnalysis, Downloads
str1 = read(Downloads.download("https://raw.githubusercontent.com/JuliaLang/julia/master/doc/src/manual/strings.md"), String);

In the above code chunk, the variable name str1 is assigned the raw markdown text of the Strings chapter in the Julia documentation. Note that the text has been downloaded from GitHub and read in as an object of type String in Julia. TextAnalysis.jl does not directly handle strings of type String and first needs them converted to one of its own data types used for optimizing string processing and manipulation: FileDocument, StringDocument and NGramDocument. The relevant type for our str1 object is StringDocument.
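As a quick sanity check (a small sketch; the exact tokens depend on the downloaded file), you can wrap str1 in a StringDocument and inspect the word tokens that TextAnalysis.jl extracts from it:

sd1 = StringDocument(str1)   # wrap the plain String into a TextAnalysis document type
typeof(sd1)                  # StringDocument{String}
tokens(sd1)[1:5]             # the first few word tokens of the Strings chapter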

Now, here is how we can create the word-word co-occurrence matrix for the words in str1 that can be found within a distance window of 3 words.³

julia> coo_str1 = CooMatrix(StringDocument(str1), window=3);

The CooMatrix type constructor accepts an object of either FileDocument or StringDocument type (it does not accept objects of type NGramDocument) and returns an object of type CooMatrix. As with any other object in Julia, to inspect the returned object coo_str1, you can call fieldnames() on the data type that coo_str1 belongs to.

julia> fieldnames(typeof(coo_str1))
(:coom, :terms, :column_indices)

The coom field stores the actual co-occurrence matrix with the normalized frequencies of all words in the Strings chapter of the Julia online documentation. As expected, even for this relatively short text, the co-occurrence matrix is pretty large.

julia> size(coo_str1.coom)
(1451, 1451)

Let’s take a sneak peek into the contents of its first two rows:

coo_str1.coom[1:2,:]

Notice that the cell values are not integers, since by default the raw frequencies are normalized by the distance between the positions of the co-occurring words. Another important thing to note here is the high number of 0 values in the matrix, which signifies that there are lots of word pairs that do not co-occur within a window of 3 words.
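By the way, if you only want to see the non-zero entries of a row, the findnz() function from the SparseArrays standard library returns the stored indices and values of a sparse vector; a small sketch:

using SparseArrays

idx, vals = findnz(coo_str1.coom[1, :])   # positions and normalized weights of row 1
first(zip(idx, vals), 5)                  # peek at the first few non-zero cells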

If you would like to extract the non-normalized, raw co-occurrence frequencies, you need to adjust the value of the keyword argument normalize, which by default is set to true.

coo_str1_raw = CooMatrix(StringDocument(str1), window=3, normalize=false)

Here are the first two rows of the word-word co-occurrence matrix that is based on raw frequencies:

coo_str1_raw.coom[1:2,:]

So far, so good. However, I am almost certain that you are probably wondering right now…“how, on earth, could I browse through such a matrix that lacks any row and column labels?” Let’s try to alleviate your concerns and respond to this question in the next section.

Labeling Rows & Columns with Words

The column_indices field of coo_str1 is an object of type OrderedDict, a type that resembles a hash map data structure, and it maps each word to a number. For instance, the word regular maps to the index 1021 in coo_str1.coom.

julia> coo_str1.column_indices
OrderedDict{String, Int64} with 1451 entries:
  "1"                                          => 419
  "regular"                                    => 1021
  "Vector"                                     => 665
  "abracadabra"                                => 976
  "comparisons"                                => 408
  "whose"                                      => 873
  "’"                                          => 1051
  "Many"                                       => 451
  "continuation."                              => 734
  "gives"                                      => 1001
  "to/from"                                    => 195
  "unquoted"                                   => 892
  "plain"                                      => 127
  "https://www.pcre.org/current/doc/html/pcre" => 1065
  "matched"                                    => 1091
  "Any"                                        => 1267
  ⋮                                            => ⋮
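Instead of scanning the printed dictionary, you can of course look the index up directly:

julia> coo_str1.column_indices["regular"]
1021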

Then, the 1021st row of coo_str1.coom has the 1451 co-occurrence frequencies of regular.

julia> coo_str1_raw.coom[coo_str1_raw.column_indices["regular"],:]
1451-element SparseArrays.SparseVector{Float64, Int64} with 53 stored entries:
  [5   ]  =  4.0
  [10  ]  =  6.0
  [13  ]  =  4.0
  [17  ]  =  2.0
  [18  ]  =  6.0
          
  [1076]  =  2.0
  [1087]  =  2.0
  [1231]  =  2.0
  [1232]  =  2.0
  [1236]  =  2.0
  [1254]  =  2.0

Since the co-occurrence matrix is symmetric, the columns of coo_str1.coom are identical to its rows, as you can see below.

julia> coo_str1_raw.coom[:,coo_str1_raw.column_indices["regular"]]
1451-element SparseArrays.SparseVector{Float64, Int64} with 53 stored entries:
  [5   ]  =  4.0
  [10  ]  =  6.0
  [13  ]  =  4.0
  [17  ]  =  2.0
  [18  ]  =  6.0
          
  [1076]  =  2.0
  [1087]  =  2.0
  [1231]  =  2.0
  [1232]  =  2.0
  [1236]  =  2.0
  [1254]  =  2.0
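If you prefer to check the symmetry programmatically rather than by eyeballing the two outputs, issymmetric() from the LinearAlgebra standard library also works on sparse matrices; a quick sketch:

using LinearAlgebra
issymmetric(coo_str1_raw.coom)   # true: the matrix is symmetric by construction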

The indices in coo_str1_raw.column_indices, i.e. the values of this OrderedDict, are identical to the positions of the words in the coo_str1_raw.terms vector of strings and correspond to the row/column numbers of the coo_str1_raw.coom co-occurrence matrix (recall that coo_str1_raw.coom is symmetric). coo_str1_raw.terms holds the unique terms, i.e. words, of str1. This means that regular is in the 1021st position of the coo_str1_raw.terms vector. Let’s take advantage of this and use it for extracting the words that we want. Then, to get the co-occurrence frequency of the pair of words unquoted and appearing, we simply use basic indexing. The return value 0.0 tells us that the two words did not co-occur within a window of 3 words.

julia> coo_str1_raw.coom[coo_str1_raw.column_indices["unquoted"], coo_str1_raw.column_indices["appearing"]]
0.0

Since this seems to be a useful piece of code for navigating through the data, why don’t we wrap it into a function?

julia> function browsecoompairs(coo::CooMatrix, term1::String, term2::String)
    coo.coom[coo.column_indices[term1], coo.column_indices[term2]]
end
browsecoompairs (generic function with 1 method)
julia> browsecoompairs(coo_str1_raw, "unquoted", "appearing")
0.0

Getting the Word Profiles

Another interesting insight that we can get out of the co-occurrence matrix is the profile of a word. We can look at it as the set of words with which a given word actually co-occurs, i.e. the words for which the crossing cell of the co-occurrence matrix is not 0.

julia> sum(x->x>0, coo_str1_raw.coom[coo_str1_raw.column_indices["String"],:], dims=1)
1-element Vector{Int64}:
 66

Nice! 66 distinct words co-occur at least once with the word String within a window of 3 words. It is not a surprise that there are so many distinct words co-occurring with String in such a short text, given that str1 is a text string loaded from the Strings chapter of the Julia documentation. Let’s see which these co-occurring words are. The first step is to get the boolean vector that indicates which of the words in str1 co-occur with String and store it in a variable.

julia> string_cooc = coo_str1_raw.coom[coo_str1_raw.column_indices["String"],:] .> 0
1451-element SparseArrays.SparseVector{Bool, Int64} with 66 stored entries:
  [1   ]  =  1
  [2   ]  =  1
  [4   ]  =  1
  [5   ]  =  1
  [6   ]  =  1
          
  [754 ]  =  1
  [901 ]  =  1
  [902 ]  =  1
  [1012]  =  1
  [1240]  =  1
  [1439]  =  1

As you can see above, string_cooc is a SparseVector, a special type of vector, full of 1s accompanied by a positional index. If we dig a bit more into the string_cooc object, we will find out that it has a field called nzind that holds a vector of these indices.⁴

julia> fieldnames(typeof(string_cooc))
(:n, :nzind, :nzval)
julia> show(string_cooc.nzind)
[1, 2, 4, 5, 6, 9, 10, 17, 18, 26, 37, 44, 55, 65, 75, 92, 124, 132, 137, 160, 172, 176, 177, 178, 179, 181, 188, 228, 232, 252, 254, 261, 262, 264, 291, 351, 360, 423, 424, 452, 466, 467, 468, 501, 502, 503, 504, 521, 532, 534, 546, 547, 549, 571, 572, 580, 676, 751, 752, 753, 754, 901, 902, 1012, 1240, 1439]

Recall that these positional indices map to the positions in coo_str1_raw.terms, which contains the actual words. So things are pretty easy now. Let’s extract the list of the 66 words with simple indexing.

julia> show(coo_str1_raw.terms[string_cooc.nzind])
["#", "[", "]", "(", "@", ")", "are", ",", "the", "a", "`", "and", "0", "in", "as", "is", "Julia", "In", "strings", "When", ":", "type", "for", "literals", "String", ".", "8", "32", "which", "indices", "index", "indexing", "into", "encoded", "julia>", "necessarily", "four", "Basics", "delimited", "objects", "given", "dimension.", "like", "access", "14", "-codeunit", "at", "character.", "create", "SubString", "{", "}", "SubStrings", "support", "Unicode.", "per", "UInt", "UTF", "16", "types.", "Additional", "Triple-Quoted", "Literals", "Non-Standard", "ordinary", "Raw"]

These are the 66 words that String co-occurs with within a window of 3 words. Since these are useful but repetitive steps that we would rather not retype every time, we might as well wrap them into a function.

julia> function cooccurrences(coo::CooMatrix, baseword::String)
           basewordcooc = coo.coom[coo.column_indices[baseword],:] .> 0
           coo.terms[basewordcooc.nzind]
       end
cooccurrences (generic function with 1 method)
julia> show(cooccurrences(coo_str1_raw, "String"))
["#", "[", "]", "(", "@", ")", "are", ",", "the", "a", "`", "and", "0", "in", "as", "is", "Julia", "In", "strings", "When", ":", "type", "for", "literals", "String", ".", "8", "32", "which", "indices", "index", "indexing", "into", "encoded", "julia>", "necessarily", "four", "Basics", "delimited", "objects", "given", "dimension.", "like", "access", "14", "-codeunit", "at", "character.", "create", "SubString", "{", "}", "SubStrings", "support", "Unicode.", "per", "UInt", "UTF", "16", "types.", "Additional", "Triple-Quoted", "Literals", "Non-Standard", "ordinary", "Raw"]
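To probe a few of these pairs right away, we can combine the two helper functions; a small sketch (the three words below are just picks from the profile above, and the exact values depend on the downloaded text):

for w in ("Julia", "indices", "UTF")
    println("String / ", w, " => ", browsecoompairs(coo_str1_raw, "String", w))
end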

Now, you can investigate further the co-occurrence values for one or more of these words using the browsecoompairs() function, explained above. So, we’ve come a long way since we loaded str1 into memory! Before leaving you, for now, I would like to take one more look at coo_str1.coom.

Digging a bit more into the Co-occurrence Matrix

As we saw above, the coo_str1.coom object is a 1451×1451 matrix, which means that it contains 2105401 cells. For such a small text, it is almost shocking to realize that the word-word co-occurrence matrix is so large.

Let’s find out how many of the cells have a value larger than 0:

julia> sum(x->x>0, coo_str1_raw.coom)
26728

This means that only 26728 out of the 2105401 word pairs have a co-occurrence frequency greater than 0; in other words, only ~1.3% of the matrix has a value other than 0, while ~98.7% of the matrix cells are equal to 0. TextAnalysis.jl imports the SparseArrays package, which handles such sparse matrices very efficiently. In fact, coo_str1.coom is of type SparseMatrixCSC. I suggest you go ahead and have a look at SparseArrays.jl to find out more about the storage tricks and clever ways of handling sparse matrices such as coo_str1_raw.coom.

julia> typeof(coo_str1.coom)
SparseArrays.SparseMatrixCSC{Float64, Int64}
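Because coom is a SparseMatrixCSC, the usual SparseArrays utilities apply to it directly. For instance, nnz() reports the number of stored entries, which gives a quick way to compute the density of the matrix (a sketch; nnz() may also count explicitly stored zeros, so it can differ slightly from the sum above):

using SparseArrays

nnz(coo_str1_raw.coom)                               # number of stored entries
nnz(coo_str1_raw.coom) / length(coo_str1_raw.coom)   # density: roughly 0.013 here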

That’s it for now! Although the focus of this post is on NLP, I hope it is relatively easy to draw analogies between words, lexemes and texts and the units of analysis in other fields, and to follow up on the ideas of this post.


[1]: Here is the original paper on Word2Vec "Efficient Estimation of Word Representations in Vector Space": https://arxiv.org/pdf/1301.3781.pdf

[2]: Since the complexity of recognizing (or, in terms of computational processing, tokenizing) and analyzing features beyond the word level is high, and is more relevant to theoretical linguists, I will stay with the relatively easily identifiable words, which can be thought of as autonomous graphemic units that are separated most of the time by spaces. So, for the non-linguists, words are defined as sets of characters that are separated by spaces within a larger string.

[3]: Recall that for the word word1 a window of 3 words is defined as follows: pos1 pos2 pos3 word1 pos4 pos5 pos6

[4]: show() does not add anything to the operation inside the parentheses. It simply forces all the output values to be displayed on the console.
