Why and when is JSON used?
Nowadays, tons of software in all types of scientific and business fields produce data that is expected to be further analysed and manipulated. Data exchange between different platforms, software packages, and languages is prevalent in a data analyst's daily routine, but it also creates all sorts of issues that can be subsumed under the umbrella of the so-called interoperability problem: the problem of finding a common data-exchange format between different platforms, software packages, and languages.
Nowadays, JavaScript Object Notation (JSON) is becoming more and more popular as the data-exchange format that addresses this problem. At least in NLP, a data-intensive field, JSON strings and files are ubiquitous. JSON is a lightweight, human-readable, text-based serialization format that is easy to manipulate, i.e. JSON strings can easily be parsed and generated.
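To make "parsed and generated" concrete, here is a minimal sketch of both directions in Julia, using the JSON3 package introduced later in this post (the toy data is made up purely for illustration):

```julia
using JSON3

# Parse a JSON string into a JSON3.Object (toy data, purely illustrative)
obj = JSON3.read("""{"name": "corpus", "tokens": 120}""")
obj.name       # dot indexing
obj["tokens"]  # bracket indexing

# Generate a JSON string from a native Julia object
JSON3.write(Dict("name" => "corpus", "tokens" => 120))
```

Round-tripping a value through `JSON3.write()` and `JSON3.read()` gives back an equivalent object, which is exactly what a data-exchange format needs.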
The JSON String and the Goal
The great power of this hierarchical way of representing data is that it allows arbitrarily many layers of nested information. Let's take a real-life scenario of extracting specific values out of deeply nested attributes in a JSON string. For a large-scale annotation project, our team has been working with JSON strings/files output by the Tagtog platform¹. Here is a short JSON string that I will be using for this post:
jsonstr = """
{
"annotatable": {
"parts": [
"s1v1"
]
},
"anncomplete": true,
"sources": [],
"metas": {},
"relations": [],
"entities": [
{
"classId": "e_2",
"part": "s1v1",
"offsets": [
{
"start": **263**,
"text": **"ĪøĪµĪ»Ļ"**
}
],
"coordinates": [],
"confidence": {
"state": "pre-added",
"who": [
"user:alextantos",
],
"prob": 1
},
"fields": {
**"f_26"**: {
"value": **"desire"**,
"confidence": {
"state": "pre-added",
"who": [
"user:alextantos"
],
"prob": 1
}
}
},
"normalizations": {}
},
{
"classId": "e_2",
"part": "s1v1",
"offsets": [
{
"start": **271**,
"text": **"ĻĪæĻ
"**
}
],
"coordinates": [],
"confidence": {
"state": "pre-added",
"who": [
"user:alextantos"
],
"prob": 1
},
"fields": {
**"f_30"**: {
"value": **"second_person_weak"**,
"confidence": {
"state": "pre-added",
"who": [
"user:alextantos"
],
"prob": 1
}
}
},
"normalizations": {}
}
]
}
"""
The goal is to extract the asterisk-surrounded information in the code chunk above and end up with the following tabular data:
Converting the JSON String to an All-Inclusive DataFrame
Before unwrapping and extracting the relevant information out of the JSON string, let's first convert it to an all-inclusive DataFrame that contains all the layers of information. Aside from DataFrames and Chain, the relevant packages I will be using for JSON string manipulation are JSON3 and JSONTables.
A few words about the relevant packages
JSON3
This package provides two main functions: JSON3.read() and JSON3.write(). With JSON3.read(), a JSON string is converted into a JSON3.Object or a JSON3.Array. The major advantage of JSON3.Object and JSON3.Array objects is that they both allow for dot or bracket indexing into the parsed JSON data. Moreover, they may be further converted to Dict or even Vector objects.
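A minimal sketch of that indexing and those conversions (the example values are made up):

```julia
using JSON3

arr = JSON3.read("""[{"id": 1}, {"id": 2}]""")  # a JSON3.Array
arr[2].id                                        # mixed bracket and dot indexing

obj = JSON3.read("""{"a": [10, 20]}""")          # a JSON3.Object
obj.a[2]                                         # dot indexing, then bracket indexing

# Materialize into base Julia containers
d = copy(obj)        # a Dict (keys become Symbols)
v = collect(obj.a)   # a Vector
```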
JSONTables
The README.md file of the JSONTables repo says it all. This package provides a JSON integration with the Tables.jl interface; that is, it provides the jsontable function as a way to treat a JSON object of arrays, or a JSON array of objects, as a Tables.jl-compatible source. This allows, among other things, loading JSON "tabular" data into a DataFrame, or a JuliaDB.jl table, or writing it out directly as a csv file.
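For instance, a JSON array of objects can go straight into a DataFrame, and back out again with arraytable() (a small sketch with made-up data):

```julia
using JSONTables, DataFrames

# A JSON array of objects behaves as a Tables.jl row table
json = """[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]"""
df = DataFrame(jsontable(json))   # 2x2 DataFrame with columns :a and :b

# And back: arraytable() serializes any Tables.jl source to a JSON array of objects
arraytable(df)
```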
JSON string => DataFrame
There are three steps we need to follow to convert a JSON string into a DataFrame.
First step: Reading in jsonstr with JSON3.read() (recall that jsonstr was created in the first section)
using Chain, DataFrames, JSON3, JSONTables
json3str = JSON3.read(jsonstr, jsonlines=true)
Something I did not mention above is that the JSON strings output by the Tagtog platform adopt the JSON Lines text format, a well-known, slightly modified version of JSON in which records are separated by the newline character, '\n'. Notice that the jsonlines argument above is set to true exactly for handling the JSON Lines text format correctly.
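To illustrate the format itself with made-up records: in JSON Lines, each line is a complete JSON value, and jsonlines=true tells JSON3.read() to collect them into a JSON3.Array:

```julia
using JSON3

# Two newline-separated records (illustrative data)
jsonl = join(["""{"id": 1, "text": "first"}""",
              """{"id": 2, "text": "second"}"""], '\n')

records = JSON3.read(jsonl, jsonlines=true)  # JSON3.Array with one element per line
records[2].text
```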
Second step: Converting json3str into a Tables.jl-compatible object.
json3table = jsontable(json3str)
Third step: Converting json3table into a DataFrame object.
json3df = DataFrame(json3table)
Let's have a look at the result:
As expected, the output is a mess. The reason is that the JSON string we started with has been unwrapped only at its first level. As a result, the five columns of json3df map to the outermost shell of the initial complex JSON string jsonstr. Moreover, the first four of them are not interesting to us, since they are present only for metadata record-keeping purposes.
Focusing on the Relevant Attribute
Now, if we look at the initial JSON string, to be able to extract the eight pieces of information we are interested in, we should laser-focus on the :entities attribute column that contains that complicated JSON3.Array object. One could potentially access it with json3df[1, :entities], but, as I just mentioned, this returns a JSON3.Array object that is not compliant with the Tables.jl interface and, thus, cannot be directly converted to a DataFrame. But this is an easy step to take; we simply use jsontable() and DataFrame() as done below:
json3dfclear = DataFrame(jsontable(json3df[1,:entities]))
Going even deeper into json3dfclear
The last part of this journey leads to the initial goal, namely to extract the eight asterisk-surrounded pieces of information within the jsonstr object of the first section above and put them in separate columns of a DataFrame. So, here is the code:
json3dfclear = @chain json3dfclear begin
    select(
        :fields  => ByRow(x -> reduce(vcat, keys(x)))   => :field_ids,
        :fields  => ByRow(x -> reduce(vcat, values(x))) => :field_values,
        :offsets => ByRow(x -> reduce(vcat, values(x))) => :offset_values)
    transform(:field_values  => ByRow(x -> reduce(vcat, values(x))) => [:field_name, :rest])
    transform(:offset_values => ByRow(x -> reduce(vcat, values(x))) => [:offset, :text])
    select([:field_ids, :text, :field_name, :offset])
end
First, the :fields column is selected, and the keys and values of the nested object each row contains are reshaped into the newly created :field_ids and :field_values columns, respectively; this extracts the third and fourth pieces of information, i.e. the field id and the field value. The :offsets column is also selected so that the offset values are picked up. The last step is a series of transformations on the :field_values and :offset_values columns, which are also nested JSON3 objects, bundling together all the relevant information we want. Again, by reshaping the data and using the vcat() function, the values are easily extracted. For an in-depth comprehension of the objects that are created and included in the DataFrame json3dfclear, pay attention to its structure and contents.
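To see the reshaping pattern in isolation, here is a toy analogue of the :fields column built from plain Dicts (the field ids and values are made up to mirror the example above):

```julia
using DataFrames

# Each row holds a one-entry mapping, just like the :fields objects above
df = DataFrame(fields = [Dict("f_26" => "desire"),
                         Dict("f_30" => "second_person_weak")])

out = select(df,
    :fields => ByRow(x -> reduce(vcat, keys(x)))   => :field_ids,
    :fields => ByRow(x -> reduce(vcat, values(x))) => :field_values)
```

Because each mapping has a single entry, reduce(vcat, keys(x)) simply unwraps the lone key into a plain value, and likewise for values(x).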
Lastly, we select only the relevant columns, which results in the following table:
[1]: Tagtog is a great web-based annotation platform with a bunch of nice features for collaborative annotation. I have been using it for my corpus linguistics classes, and I strongly recommend you visit their website.