Julia Community 🟣

Julia-PBN
Julia-PBN

Posted on

Dealing with strings in Julia, patterns and anti-patterns

Dealing with strings in Julia, patterns and anti-patterns.

Table of contents

Introduction and goals

Hello, I'm awnmp, hobbyist programmer, and I like Maths, 17yo as of now, which explains my lack of experience in lots of programming fields (and I've started programming the 8th May 2021, so I'm by no mean an expert, but I pick up concepts easily). English isn't my native language, so I apologize in advance for gramamtical errors and spelling mistakes.

But I started programming (with PERL) for string processing reasons, thus, it's one of the fields in which I specialised with. Feel free to correct me!

After seeing a lot of bad practice in the Julia community regarding Strings, I've decided to create a blog about it.

It's intended to be a big blog, only read the part that concern your use case (the compression part is quite lengthy, but completely irrelevent to most)

What is a string?

A string can be regarded as a list of characters.
To store those characters and the string in a way the computer understand, there's multiple methods.
First of all, storing the string pointer, so a 32 or 64 bits pointer depending on your system.
This pointer will point into a part in memory, you'll need to group some bits, until the end of the string. If it's Unicode or ASCII for UTF-8 :
UTF-8 is a grouping of 8 bits per 8 bits.
UTF-16 is a grouping of 16 bits per 16 bits.
UTF-32, 32 per 32, etc.
Genererally a power of 2, as it's convenient. It's the byte encoding, we'll call each group a codeunit. (it's a bit of an oversimplification, but it works well enough for a mental model)

Now that you've got group of bits (called codeunit, use codeunits to get those), you still have to know what they refer too/what character they represent. It's the character encoding:

  • ASCII, used with UTF-8, each character are 7 bits, and stored as the following table, it's used for string with no "special characters", generally for English.
  • UNICODE, superset of ASCII, so any valid ASCII string is UNICODE compatible. Created to be able to use multiple languages in the same sentence, thus, can store any representable characters. To do so, there are modifier characters, and characters in general have variable length, from 8 bits to 32 bits, it's usable with UTF-8/16/32 (so, 1 to 4 codeunits per characters for UTF-8, 1 to 2 codeunits per characters for UTF-16, and 1 codeunit = 1 character for UTF-32).

There exist other encoding, but they are rare, be aware thought that if your language is mostly using non-latin characters, there may be an optimised encoding for your language.

Most of the time, a string ends with '\0', it's one way to define the end point of it, the other is to store the length with the pointer (that makes a fat pointer).
They are immutable in most language for speed, synchronization, concurrencey, caching reasons. If you have a string in your source code in a AOT-compiled language, it'll generally land in the read-only section of your executable. The table is huge, but some sites exists to find the corresponding characters.

How to compress it?

We just saw that in general, you need 8 to 32 bits per characters to store a string, that's a lot if you want to share files, or lots of books. The above encoding are "general-purpose", thus their inefficiency. Though, if you don't need to save space, don't compress them without reason.

It's an immense field, and most algorithms are quite complicated, so I'l just gloss the general idea without going into details. For anyone interested, Data Compression: The Complete Reference from David Salamon is a great book.

Huffman encoding

The first encoding achieving entropy, aka. being as efficient as possible (memory-wise), with the constraint to have a one-to-one mapping between symbols (or characters) and bits to represent them.
It looks on the frequency of each characters on your text, and build a tree of the characters representation.
Frequent characters get assign shorter representation. You have to store the output and the tree to be able to decode it.

Lempel-Ziv family

They are dictionary based compressing methods, with a sliding-window.
The idea is that you slice the string with a window, with a specified length, and it'll remplace group of characters by something else if it's a duplicate of somewhere else in the string.

LZ77 or LZ1

Is the first LZ compression system, it'll look what was before the sliding-window (current characters you're reading) for replacement.
The remplacement is a (length, distance/offset, c) triple, where length is the length of the replacement, and distance/offset is how many characters to the left should it scroll to seek for the replacement and c is the character after the replacement.
If no matzh is found, a single character, x is thus represented as (0, 0, x).

Example: "Blah blah blah blah."
=> [(0, 0, 'B'), (0, 0, 'l'), (0, 0, 'a'), (0, 0, 'h'), (0, 0, ' '), (0, 0, 'b'), (5, 18, '.')
(the syntax makes it seems less optimize than it is)

LZ78 or LZ2

Instead of having a (length, distance, c) triple, we use a (index, c) pair, in a vector, v, representing the message, where the 0th index is nothing or "". And (index, c) pair is equal to v[index] * c (with Julia meaning of *).

example: "AABABBABBAABA"
=> [(0, 'A'), (1, 'B'), (2, 'B'), (3, 'A'), (2, 'A')]

LZW

Lempel-Ziv-Welch is a LZ78 improvement.
You start with an initial dictionary (which is better represented as a vector of String), with each of your characters in it.

Example:
"ababcbababaaaaaaa" would have the initial "dictionary" ["a", "b", "c"], inital code []

You'll traverse your string, put the index to your code of the longest match from your current character which is in the dictionary, your next character will be the character after your match, and add to your "dictionary" the longest string ending just before your current character that isn't in your dictionary yet, concatenate to your current character.

so, in the "ababcbababaaaaaaa" example:

  • current: a, code = [1], dict = ["a", "b", "c"]
  • current: b, code = [1, 2], dict = ["a", "b", "c", "ab"]
  • current: a, match = "ab", code = [1, 2, 4], dict = ["a", "b", "c", "ab", "ba"]
  • current: c, code = [1, 2, 4, 3], dict = ["a", "b", "c", "ab", "ba", "abc"]
  • current: b, match = "ba", code = [1, 2, 4, 3, 5], dict = ["a", "b", "c", "ab", "ba", "abc", "cb"]
  • current: b, match = "bab" (not here yet, but will be added now), code = [1, 2, 4, 3, 5, 9] dict = ["a", "b", "c", "ab", "ba", "abc", "cb", "bab"]
  • current: a, code = [1, 2, 4, 3, 5, 9, 1], dict = ["a", "b", "c", "ab", "ba", "abc", "cb", "bab", "baba"]
  • current: a, match = "aa" (not here yet, but will be added now), code = [1, 2, 4, 3, 5, 8, 1, 10]
  • current: a, match = "aaa" (not here yet, but will be added now), code = [1, 2, 4, 3, 5, 8 1, 10, 11]
  • current: a, code = [1, 2, 4, 3, 5, 8, 1, 10, 11, 1], end of string

You remove 1 to all of your element of your code, and now you have a vector of integer to which you can store with $\lceil \log_2 \max_{v\in \text{code}} v\rceil$ bits per number.
Some applications will store the LZW code differently than having a fix length for each number, but the compression idea is the same.

others

There are other LZ system, but those are the main one, some other include : Lempel-Ziv-Stac is a combinason of LZ77 and Hufmann (so does deflate, a quite popular algorithm) and Lempel-Ziv-Storer-Szymanski, an improvement on LZ77, notably by using a prefix bit to indicate new characters (so, removing the need to store a whole triple for single character), and chains organisation is changed from a tree to transform the algorithm into $O(\log(n))$ in regard to window size (compared to LZ77 $O(n)$).

PAQ family

It's some extremely complex algorithms, with really good compression rate, but quite slow. They are way too complex for me to explain in simple terms (and I'm not even sure on how it works...), so here how it works.
I was surprised by how little was it known despite being quite interesting, so I decided to put it here too.

What are good-paterns and anti-patterns in Julia for strings?

Now it's finally about what I wanted to talk in the first place, and the most important thing to take from this.

Indexing patterns

Let just give an example on why it's important, "Ξ±Ξ²"[1+1] => error.
And if you think those are some rare characters, "Γ©lite"[2] (elite in French or Spanish) => error.

It's because when you index String in Julia, you're indexing the codeunit number, not the character number (it's because indexing codeunit is $O(1)$ and indexing by characters is $O(n)$, so, for speed reasons)

Saddly, Julia syntax starts to become a bit cubbersome if you write truly generic programs, but it breaks when it isn't, so it's worth it.


Syntax: pattern to avoid => how to do it

why


i+1 => nextind(str, i)

i+n => nextind(str, i, n)

i-1 => prevind(str, i)

i-n => prevind(str, i ,n)

This pattern is to get the next or previous index of a string. As said before, even String isn't a linear indexing container. By using i Β± n, you'll run into error, or even indexing the wrong character without any run-time error !


i = 1
while i <= length(str)
Enter fullscreen mode Exit fullscreen mode

=>

i = firstindex(str)
while i <= lastindex(str)
Enter fullscreen mode Exit fullscreen mode

The indexing may not start with 1, and the length of the string may not be the same as the last indexing (I've run into some annoying and hard to catch bugs due to that)


for i in firstindex(str):lastindex(str) => for i in eachindex(str)

Indexing may not be linear


for i in nextind(str, firstindex(str), 10):prevind(str, lastindex(str), 3) => for i in (Iterators.)filter(i -> nextind(str, firstindex(str), 10) <= i <= prevind(str, lastindex(str), 3), eachindex(str))

This does assume that indexes of the string are increasing (which I have never found any which wasn't), but other alternatives have some allocations:

for i in eachindex(SubString(str, nextind(str, firstindex(str), 10), prevind(str, lastindex(str), 3)) ) .+ nextind(str, firstindex(str), 10)

I did say that it causes code to be messy...

I recommand making a function

indexes(str, from, to) = from == to ? from : [from; indexes(str, nextind(str, from), to)]
Enter fullscreen mode Exit fullscreen mode

(where from and to are valid indexes of the string, otherwise, it won't stop, some better ways to make it exists, like creating an iterator, but that's lengthier)


str[a:b] => SubString(str, a, b)

It's so you don't create new strings uselessly.


str[a:2:b] => str[indexes(str, a, b)[1:2:end]] # indexes is defined above

SubString is only meant to work with an offset, not with separated characters.
And still not linear indexing so a:2:b may return wrong indexing.


String types


function func(s::String) => function func(s::AbstractString) or function func(s)

function func(c::Char) => function func(c::AbstractChar) or function func(c)

There exist more than String and Char ! Why stopping others from using your code?


dict = Dict{Int, Char}()
for (i, c) in pairs(str)
    dict[i] = c
end
Enter fullscreen mode Exit fullscreen mode

=>

dict = Dict{Int, eltype(str)}()
for (i, c) in pairs(str)
    dict[i] = c
end
Enter fullscreen mode Exit fullscreen mode

Some strings may use a different character than those of type Char, eltype is great to get the type of those.


I would highly recommend using Strs.jl to deal with strings in general, and passing your string into str |> Str (the library creator told me it isn't a good practice, as it's type-instable, be aware of that, use str |> UTF32Str if you want it to be code stable (but not necessarely a good representation of it)), it makes indexing linear !!! That means that if you're sure your code only take types from Strs.jl, you can disregard the above tips, which makes coding much easier.

For those who dislike allocation, there's InlinedStrings.jl, I find it less useful, but may be in case you want to store some specific tokens in a vector, and iterate through it often.

Due to the way they're stored, Symbols are faster to compare than strings in general, though, they can't use most string operations, so depending on your use case, it may be better to use them.


string comparison


To be sure a file haven't been changed, and easily checkable by humans, hashing the string is the most common way. It can lead to two different string being be considered the same, but it's really hard to make it on purpose, so it's generally quite safe, use some recent hashing algorithms though.


If you care only at what the string looks like,

a == b, where a and b are Unicode encoded may result false for same-ish values.
Use Unicode.normalize(a) == Unicode.normalize(b)

Example:
"é" == "é" # "\u00e9" == "\u0065\u0301" => false, but with nomralize, it returs true.


If you want more information on how close two string are instead of just equality, use the Levenshtein distance, it basically returns how many single character transformation are needed to go the first string to the second.


If you care about word meaning closeness, you can try word2vec, it transformed English words (it isn't for other languages) into points in a high-dimensional vector, so you can directly check their distance, though, high dimension means that there will be the curse of dimensionality.

I have learned this recently, and never tried it (as I'm not knowledgeable in NLP), so take this as an interesting side-note.


Often, we want to compare infomation in a structured string, as in a .json.
In that case, first, check if there aren't a library for your file format... And if not, then you're in for lexering and parsing your string, good luck.

If it were for another language, I'd suggest using a parsing generator, but I can't find one for Julia... String manipulation doesn't seem to be the main focus in Julia... So you'll have to make it by hand, it's quite hard, it's highly depending on the structure, thus, if you aren't sure if you can implement one, ask for someone else and let it as it is for now.

First, the lexer, a lexer is an algorithm to transform a string into a vector of tokens.

For example, let's say we are parsing a Common Lisp code:

(defun factorial (n)
    (if (<= n 1)
        1
        (* n (factorial (- n 1)))))
Enter fullscreen mode Exit fullscreen mode

You have to know how the syntax of Lisp work, in EBNF, it's:

list -> '(' expression * ')'
expression -> atom | list
atom -> number | name | string | operator
Enter fullscreen mode Exit fullscreen mode

A really simple language syntax-wise. After the lexerizing, you may have [TOK_LP, TOK_defun, :factorial, TOK_LP, :n, TOK_RP, TOK_RP, TOK_IF, TOK_LP, :<=, :n, :1, TOK_RP, :1, TOK_LP, :*, :n, TOK_LP, :factorial, TOK_LP, :-, :n, :1, TOK_RP, TOK_RP, TOK_RP, TOK_RP, TOK_RP]
where TOK_LP = Symbol("("), TOK_RP = Symbol(")"), TOK_IF = :if, TOK_defun = :defun

Basically, that allows to remove the white-space dependance for equality.

If you also want useless parethesis not to infer the equality, then you'll need to create an AST, by parsing your tokens.

You'll get something like Expr(:defun, :factorial, :n, Expr(:if, Expr(:<=, :n, :1), 1, Expr(:*, :n, Expr(:factorial, Expr(:-, :n, :1))))) (note that you can't eval this expression in Julia).
Of course, you'll still have some similar meaning string be considered to be different, you may try doing Ξ±-conversion (preferably in a consistent manner, such as De Bruijn index), but you'll still have argument moving around on commutative functions, distribution not being made, etc. So, you'd probably need some amazing computer algebra system and ATP, but even then, you'll have some edge case, so analyse what you truly need and check accordingly (and bringing ATP to a "check if equal" seems really overkill).


string miscellanous


s =  ""
for i in some_iterable
    s *= f(i)
end
Enter fullscreen mode Exit fullscreen mode

=>

io = IOBuffer()
for i in some_iterable
    write(io, f(i))
end
s = String(take!(io))
Enter fullscreen mode Exit fullscreen mode

When you're concating on a string, you're creating a whole new one. It's quite expansive.


length(str) == 0 => isempty(str)

Checking the length of a String is $O(n)$, as the number of characters isn't determined by the number of codeunit, so it's way slower than using an isempty


I wouldn't recommand collecting your string to make the indexing easier, as it creates an array, thus more GC calls, you can't use most string functions, such as split or regexes functions, and creating the string from your array of characters is a bit expansive, but if you don't care about most string manipulation functions, nor performance, I think it's an acceptable mean for not dealing with weird indexings, at least, better than using the anti-patterns above with normal strings.


miscellanous

About compression, the best (most compressed) way to compress a string is often undecidable, it's linked with the Kolmogorov Complexity.

Reading file as String may be expensive, if you intend to write a really fast parser, convert it from Vector{UInt8}, but be aware of the encoding problems which may appear.

Often it's faster to use REGEXes to find matches in a string with a specific structure. This is on its own a huge concept, it's more about learning to use it more than fixing errors, so it's not included in this post.

Anyway, thanks a lot for reading! I know that I didn't write it well... I started to be in writing debt o-o, and I would probabably procrastinate it for months if I wanted to rewrite it, for some errors which are harmful right now!

If there are some need in the communaty to rewrite some of the compression algorithms discussed above, I can implement them in Julia, except PAQ, that family really is too complicated for me.

If you catch some error, tell me, and I'll try to fix them as fast as possible, and if you want to add some patterns, it's obviously welcome!

Top comments (1)

Collapse
 
logankilpatrick profile image
Logan Kilpatrick

Very thorough explanation! Thanks for this.