Julia Community 🟣

Logan Kilpatrick

Posted on • Originally published at logankilpatrick.Medium

The only way you should be splitting a String in Julia - Julia Base.split()

Splitting a string is one of the most common operations you can perform in any programming language. In this article, we will go over the different ways of splitting strings in Julia, including extensive examples of the different approaches.


The basic split syntax 🖖

We will start by declaring a basic string:

julia> my_string = "Hello.World"
"Hello.World"

Next, we will call the split() function and pass in the string as well as the delimiter (what you want the split to occur on).

julia> split_strings = split(my_string, ".")
2-element Vector{SubString{String}}:
"Hello"
"World"

In this case, since the string had only 1 period, we end up with two separate strings. Note that the new strings do not have the delimiter included in them, so the overall length of the combined strings decreased by 1 in this case. Let's look at a more robust example of splitting a string:

julia> my_sentence = "Hello world. This is an extended example with more sentences to show what happens. I am going to add only a few more. Okay, last one. Wait, what is this?"
I just want to highlight what it looks like when there are more items to be split. Let's look at this in action:
julia> split_sentences = split(my_sentence, ".")
5-element Vector{SubString{String}}:
"Hello world"
" This is an extended example with more sentences to show what happens"
" I am going to add only a few more"
" Okay, last one"
" Wait, what is this?"

Notice that the periods are gone again, but every piece after the first now starts with a space. We would have to call the strip function on each piece to remove it.
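As a rough sketch of that strip step, we can broadcast strip over the result of split. The variable name clean_sentences is just one picked for this illustration:

```julia
my_sentence = "Hello world. This is an extended example with more sentences to show what happens. I am going to add only a few more. Okay, last one. Wait, what is this?"

# The dot in `strip.` broadcasts strip over every piece, trimming
# whitespace from both ends of each one.
clean_sentences = strip.(split(my_sentence, "."))

for sentence in clean_sentences
    println(repr(sentence))   # repr keeps the quotes so stray spaces would be visible
end
```

The pieces are still substrings of the original string, so no text is copied along the way.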

For now, this is the most basic form of using the split function.

One more quick thing to keep in mind before we explore more advanced use cases: if we don't specify a delimiter, the split function defaults to splitting on whitespace (spaces, tabs, and newlines). If you forget to add one, you will get a wildly different result:

julia> split_sentences = split(my_sentence)
30-element Vector{SubString{String}}:
"Hello"
"world."
"This"
"is"

"what"
"is"
"this?"
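To make the default behavior a bit more concrete, here is a small sketch comparing split with no delimiter against split with an explicit " ". The string messy is made up for this illustration:

```julia
# Made-up input with repeated spaces, a tab, and a newline.
messy = "Hello   world.\tThis\nis  an example"

# With no delimiter, split breaks on any run of whitespace
# and drops the empty pieces.
words = split(messy)

# With an explicit " ", only single spaces count as delimiters,
# and the empty strings between consecutive spaces are kept.
spaces = split(messy, " ")

println(length(words))    # number of whitespace-separated words
println(length(spaces))   # includes the empty pieces
```

So if the input might contain tabs or doubled spaces, leaving the delimiter off is usually the more forgiving choice.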

Advanced split examples 🧗

If you want to move beyond the basics of using split and try out some advanced examples, this section is for you. We will start by playing with the optional limit keyword argument, which lets us set the maximum number of items in the result.

julia> split_sentences = split(my_sentence, limit=10)
10-element Vector{SubString{String}}:
"Hello"
"world."
...
"with"
"more"
"sentences to show what happens." ⋯ 40 bytes ⋯ ", last one. Wait, what is this?"

Since we set the limit to 10, splitting stops after the ninth item, and the tenth item holds the rest of the string unsplit. That is why the final item in this vector has multiple sentences in it.
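One place where limit tends to be handy is splitting a key from a value when the value itself may contain the delimiter. This is only a sketch, and the input line is invented for the example:

```julia
# Invented "key=value" line where the value also contains "=".
line = "greeting=Hello=World"

# limit=2 means at most two pieces: split once, keep the rest intact.
key, value = split(line, "="; limit=2)

println(key)
println(value)
```

Without the limit we would get three pieces here, and the destructuring into key and value would silently drop the last one.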

Next, we will look at the keepempty optional argument. It allows us to specify if we want to keep empty items in the resulting vector. It's probably easiest to see this in practice. We will re-define our string to include more text:

julia> my_sentence = "Hello world. This is an extended example with more sentences to show what happens. I am going to add only a few more. Okay, last one. Wait, what is this? . . . . . . . . ........"
"Hello world. This is an extended example with more sentences to show what happens. I am going to add only a few more. Okay, last one. Wait, what is this? . . . . . . . . ........"

Now, let's see this in action with both options:

julia> split_sentences = split(my_sentence, ".", keepempty=false)
13-element Vector{SubString{String}}:
"Hello world"
" This is an extended example with more sentences to show what happens"
" I am going to add only a few more"
" Okay, last one"
" Wait, what is this? "
...
" "
" "

Here we can see that since keepempty is false, we have no resulting items where the value is an empty string (""). Items that contain only a space (" ") are not empty, so they stay. If we switch the value to true, we get the following:

julia> split_sentences = split(my_sentence, ".", keepempty=true)
21-element Vector{SubString{String}}:
"Hello world"
" This is an extended example with more sentences to show what happens"
" I am going to add only a few more"
" Okay, last one"
" Wait, what is this? "
" "

""
""

Again, since the original example has lots of periods in a row ( ........ ), setting keepempty to true gave us a bunch of empty strings. When a delimiter is passed, true is also the default.
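Here is a smaller sketch of the same idea that is easier to count by hand. The string text and the extra filter step are additions for illustration only:

```julia
# Small made-up string with a space-only piece and trailing periods.
text = "one. . two.."

with_empty    = split(text, "."; keepempty=true)    # keeps the "" pieces
without_empty = split(text, "."; keepempty=false)   # drops "" but keeps " "

# keepempty only removes truly empty strings. To also drop pieces that
# are nothing but whitespace, we can filter after splitting.
non_blank = filter(s -> !isempty(strip(s)), without_empty)

println(length(with_empty))
println(length(without_empty))
println(length(non_blank))
```

The filter step is one possible way to clean up whitespace-only pieces; keepempty on its own will not do it.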

That's all we can do with the basic split function. Let's explore some of the other ways to handle string splitting in Julia!


Base.rsplit() - Starting from the end 🧵

Similar to the split function, there is also an rsplit function that does the same thing as split, but it starts from the end (interestingly enough, though, the order of the resulting data is not reversed). Let's look at a simple example in practice:

julia> my_string = "Hello.World.This.Is.A.Test"
"Hello.World.This.Is.A.Test"

Now, let's compare how this is different from just the regular split function:

julia> a = split(my_string, ".")
6-element Vector{SubString{String}}:
"Hello"
"World"
"This"
"Is"
"A"
"Test"

julia> b = rsplit(my_string, ".")
6-element Vector{SubString{String}}:
"Hello"
"World"
"This"
"Is"
"A"
"Test"

julia> a == b
true

But hold on a second, if rsplit is "Similar to split, but starting from the end of the string", why does it not invert the order?

Well, this is a great question and something I asked on Stack Overflow in order to try and get context for this.

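The short version is that rsplit looks for delimiters starting from the right, but it still returns the pieces in their original left-to-right order. The main case where this makes a difference is when you provide the limit argument. In that case, the results will be different: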

julia> a = split(my_string, "."; limit=2)
2-element Vector{SubString{String}}:
"Hello"
"World.This.Is.A.Test"

julia> b = rsplit(my_string, "."; limit=2)
2-element Vector{SubString{String}}:
"Hello.World.This.Is.A"
"Test"

julia> a == b
false

All of the other argument options hold true for rsplit so there's no need to re-hash those details.
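A common reason to reach for rsplit with a limit is peeling off the last part of a string, such as a file extension. The sketch below uses a made-up filename:

```julia
# Made-up filename with more than one period in it.
filename = "notes.backup.txt"

# rsplit with limit=2 splits once, at the LAST period.
stem, ext = rsplit(filename, "."; limit=2)

# split with limit=2 splits once, at the FIRST period.
head, rest = split(filename, "."; limit=2)

println(stem, " | ", ext)
println(head, " | ", rest)
```

Both calls return two pieces in left-to-right order; the only difference is which period they split on.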


Using eachsplit - Introduced in Julia 1️⃣.8️⃣

In Julia 1.8+, there is a new eachsplit function that allows you to split items just like we did before, but in this case, it returns an iterator instead of a Vector. This can be helpful when you want to process the pieces one at a time without building the whole Vector up front. Let's see this in action:

julia> a = "Ma.rch"
"Ma.rch"
julia> split(a, ".")
2-element Vector{SubString{String}}:
"Ma"
"rch"
julia> eachsplit(a, ".")
Base.SplitIterator{String, String}("Ma.rch", ".", 0, true)

Now, if we want to replicate the behavior of the split function, we need to do the following:

julia> collect(eachsplit(a, "."))
2-element Vector{SubString}:
"Ma"
"rch"

The collect function runs through the iterator and gathers the items it produces into a Vector. Personally, I think the docs could use some improvement here so I have an open issue seeking to address this.
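If we don't need the full Vector, we can skip collect and use the iterator directly. This is a small sketch that assumes Julia 1.8 or newer; the names text, first_piece, and n_long are just for illustration:

```julia
text = "Hello.World.This.Is.A.Test"

# Loop over the pieces one at a time without building a Vector first.
for piece in eachsplit(text, ".")
    println(piece)
end

# Grab only the first piece from the iterator.
first_piece = first(eachsplit(text, "."))

# Count the pieces longer than two characters, again without a Vector.
n_long = count(p -> length(p) > 2, eachsplit(text, "."))

println(first_piece)
println(n_long)
```

Functions like first and count accept any iterable, which is why they work on the result of eachsplit as-is.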

But what even is an iterator to begin with? Well, according to the docs:

Sequential iteration is implemented by the iterate function. Instead of mutating objects as they are iterated over, Julia iterators may keep track of the iteration state externally from the object. The return value from iterate is always either a tuple of a value and a state, or nothing if no elements remain.

If you want to read more about this, check out the full docs section on Iterators.
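To see the (value, state) idea from that quote in practice, here is a sketch that walks an eachsplit iterator by calling iterate by hand. The helper manual_collect is a hypothetical name for this example, not something from Base, and it assumes Julia 1.8 or newer:

```julia
# Hypothetical helper: does by hand roughly what a for loop does for us.
function manual_collect(itr)
    pieces = String[]
    next = iterate(itr)              # (value, state), or nothing if empty
    while next !== nothing
        piece, state = next
        push!(pieces, piece)         # the SubString is converted to a String here
        next = iterate(itr, state)   # ask for the next item, passing the state back
    end
    return pieces
end

result = manual_collect(eachsplit("Ma.rch", "."))
println(result)
```

In everyday code a for loop or collect is the simpler choice; this version is only meant to show what happens underneath.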


Wrapping things up 🎁

In this post, we walked through the basics of using the split function along with rsplit and eachsplit. These functions provide the basic groundwork for string splitting in Julia.

There is lots more out there in the String-averse (ha, get it) to explore. I suggest checking out https://github.com/JuliaString/Strs.jl as a starting place for what is possible!

Top comments (1)

Christopher Rowley

FYI the formatting of the code blocks is wrong - though the original post on Medium is correct.