Why and when is JSON used?
Nowadays, tons of software in all types of scientific and business fields produce data that is expected to be further analysed and manipulated. Data exchange between different platforms, software packages, and languages is prevalent in a data analyst's daily routine, but it also creates all sorts of issues that can be subsumed under the umbrella of the so-called interoperability problem: the problem of finding a common data-exchange format between different platforms, software packages, and languages.
Nowadays, JavaScript Object Notation (JSON) is becoming more and more popular as the data-exchange format that addresses this problem. At least in NLP, a data-intensive field, JSON strings and files are ubiquitous. JSON is a lightweight, human-readable, text-based serialization format that is easy to manipulate, i.e. JSON strings can easily be parsed and generated.
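To make "parsed and generated" concrete, here is a minimal sketch of both directions in Julia, using the JSON3 package introduced later in this post (the toy data is made up purely for illustration):

```julia
using JSON3

# Parse a JSON string into a JSON3.Object (toy data, purely illustrative)
obj = JSON3.read("""{"name": "corpus", "tokens": 120}""")
obj.name       # dot indexing
obj["tokens"]  # bracket indexing

# Generate a JSON string from a native Julia object
JSON3.write(Dict("name" => "corpus", "tokens" => 120))
```

Round-tripping a value through `JSON3.write()` and `JSON3.read()` gives back an equivalent object, which is exactly what a data-exchange format needs.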
The JSON String and the Goal
The great power of this hierarchical way of representing data is that it allows arbitrarily many layers of nested information. Let's take a real-life scenario of extracting specific values out of deeply nested attributes in a JSON string. For a large-scale annotation project, our team has been working with JSON strings/files output by the Tagtog platform¹. Here is a short JSON string that I will be using for this post:
jsonstr = """
{
"annotatable": {
"parts": [
"s1v1"
]
},
"anncomplete": true,
"sources": [],
"metas": {},
"relations": [],
"entities": [
{
"classId": "e_2",
"part": "s1v1",
"offsets": [
{
"start": **263**,
"text": **"ĪøĪµĪ»Ļ"**
}
],
"coordinates": [],
"confidence": {
"state": "pre-added",
"who": [
"user:alextantos",
],
"prob": 1
},
"fields": {
**"f_26"**: {
"value": **"desire"**,
"confidence": {
"state": "pre-added",
"who": [
"user:alextantos"
],
"prob": 1
}
}
},
"normalizations": {}
},
{
"classId": "e_2",
"part": "s1v1",
"offsets": [
{
"start": **271**,
"text": **"ĻĪæĻ
"**
}
],
"coordinates": [],
"confidence": {
"state": "pre-added",
"who": [
"user:alextantos"
],
"prob": 1
},
"fields": {
**"f_30"**: {
"value": **"second_person_weak"**,
"confidence": {
"state": "pre-added",
"who": [
"user:alextantos"
],
"prob": 1
}
}
},
"normalizations": {}
}
]
}
"""
The goal is to extract the asterisk-surrounded information in the code chunk above and end up with the following tabular data:
Converting the JSON String to an All-Inclusive DataFrame
Before unwrapping and extracting the relevant information out of the JSON string, let's first convert it to an all-inclusive DataFrame that contains all the layers of information. Aside from DataFrames and Chain, the relevant packages I will be using for JSON string manipulation are JSON3 and JSONTables.
A few words about the relevant packages
JSON3
This package provides two main functions: JSON3.read() and JSON3.write(). With JSON3.read(), a JSON string is converted into a JSON3.Object or a JSON3.Array. The major advantage of JSON3.Object and JSON3.Array objects is that they both allow for dot or bracket indexing into the parsed JSON data. Moreover, they may be further converted to Dict or even Vector objects.
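A minimal sketch of that indexing and those conversions (the example values are made up):

```julia
using JSON3

arr = JSON3.read("""[{"id": 1}, {"id": 2}]""")  # a JSON3.Array
arr[2].id                                        # mixed bracket and dot indexing

obj = JSON3.read("""{"a": [10, 20]}""")          # a JSON3.Object
obj.a[2]                                         # dot indexing, then bracket indexing

# Materialize into base Julia containers
d = copy(obj)        # a Dict (keys become Symbols)
v = collect(obj.a)   # a Vector
```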
JSONTables
The README.md file of the JSONTables repo says it all. This package provides a JSON integration with the Tables.jl interface; that is, it provides the jsontable function as a way to treat a JSON object of arrays, or a JSON array of objects, as a Tables.jl-compatible source. This allows, among other things, loading JSON "tabular" data into a DataFrame, or a JuliaDB.jl table, or writing it out directly as a csv file.
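For instance, a JSON array of objects can go straight into a DataFrame, and back out again with arraytable() (a small sketch with made-up data):

```julia
using JSONTables, DataFrames

# A JSON array of objects behaves as a Tables.jl row table
json = """[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]"""
df = DataFrame(jsontable(json))   # 2x2 DataFrame with columns :a and :b

# And back: arraytable() serializes any Tables.jl source to a JSON array of objects
arraytable(df)
```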
JSON string => DataFrame
There are three steps we need to follow to convert a JSON string into a DataFrame.
First step: Reading in jsonstr with JSON3.read() (recall that jsonstr was created in the first section)
using Chain, DataFrames, JSON3, JSONTables
json3str = JSON3.read(jsonstr, jsonlines=true)
Something I did not mention above is that the JSON strings output by the Tagtog platform adopt the JSON Lines text format, a well-known, slightly modified version of JSON in which records are separated by the newline character, '\n'. Notice that the jsonlines argument above is set to true exactly for handling the JSON Lines text format correctly.
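To illustrate the format itself with made-up records: in JSON Lines, each line is a complete JSON value, and jsonlines=true tells JSON3.read() to collect them into a JSON3.Array:

```julia
using JSON3

# Two newline-separated records (illustrative data)
jsonl = join(["""{"id": 1, "text": "first"}""",
              """{"id": 2, "text": "second"}"""], '\n')

records = JSON3.read(jsonl, jsonlines=true)  # JSON3.Array with one element per line
records[2].text
```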
Second step: Converting json3str into a Tables.jl-compatible object.
json3table = jsontable(json3str)
Third step: Converting json3table into a DataFrame object.
json3df = DataFrame(json3table)
Let's have a look at the result:
As expected, the output is a mess. The reason is that the JSON string we started with has been unwrapped only at its first level. As a result, the five columns of json3df map to the outermost shell of the initial complex JSON string jsonstr. Moreover, the first four of them are not interesting to us, since they are present only for metadata record-keeping purposes.
Focusing on the Relevant Attribute
Now, if we look at the initial JSON string, to be able to extract the eight pieces of information we are interested in, we should laser-focus on the :entities attribute column that contains that complicated JSON3.Array object. One could potentially access it with json3df[1, :entities], but, as I just mentioned, this returns a JSON3.Array object that is not compliant with the Tables.jl interface and, thus, cannot be directly converted to a DataFrame. But this is an easy step to take; we simply use jsontable() and DataFrame() as done below:
json3dfclear = DataFrame(jsontable(json3df[1,:entities]))
Going even deeper into json3dfclear
The last part of this journey leads to the initial goal, namely to extract the eight asterisk-surrounded pieces of information within the jsonstr object of the first section above and put them in separate columns of a DataFrame. So, here is the code:
json3dfclear = @chain json3dfclear begin
    select(
        :fields  => ByRow(x -> reduce(vcat, keys(x)))   => :field_ids,
        :fields  => ByRow(x -> reduce(vcat, values(x))) => :field_values,
        :offsets => ByRow(x -> reduce(vcat, values(x))) => :offset_values)
    transform(:field_values  => ByRow(x -> reduce(vcat, values(x))) => [:field_name, :rest])
    transform(:offset_values => ByRow(x -> reduce(vcat, values(x))) => [:offset, :text])
    select([:field_ids, :text, :field_name, :offset])
end
First, the :fields column is selected, and the keys and values of the nested object each row contains are reshaped into the newly created :field_ids and :field_values columns, respectively; this extracts the third and fourth pieces of information, i.e. the field id and the field value. The :offsets column is also selected so that the offset values are picked up. The last step is a series of transformations on the :field_values and :offset_values columns, which are also nested JSON3 objects, bundling together all the relevant information we want. Again, by reshaping the data and using the vcat() function, the values are easily extracted. For an in-depth comprehension of the objects that are created and included in the DataFrame json3dfclear, pay attention to its structure and contents.
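To see the reshaping pattern in isolation, here is a toy analogue of the :fields column built from plain Dicts (the field ids and values are made up to mirror the example above):

```julia
using DataFrames

# Each row holds a one-entry mapping, just like the :fields objects above
df = DataFrame(fields = [Dict("f_26" => "desire"),
                         Dict("f_30" => "second_person_weak")])

out = select(df,
    :fields => ByRow(x -> reduce(vcat, keys(x)))   => :field_ids,
    :fields => ByRow(x -> reduce(vcat, values(x))) => :field_values)
```

Because each mapping has a single entry, reduce(vcat, keys(x)) simply unwraps the lone key into a plain value, and likewise for values(x).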
Lastly, we select only the relevant columns, which results in the following table:
[1]: Tagtog is a great web-based annotation platform with a bunch of nice features for collaborative annotation. I have been using it for my corpus linguistics classes, and I strongly recommend you visit their website.