Jorge Cimentada
Most APIs will return data in JavaScript Object Notation (JSON)
Format designed to share data over the internet
JSON is text-based: can be opened with any code editor
It supports the usual numeric/string values + arrays
Important: it allows you to build complex hierarchical structures
Why is JSON important? Why not use a CSV, for example? Or XML? I’ll skip the boring stuff and leave you with three:
JSONs are easy to read directly from the source file. In this session we’ll focus on real-world examples of the problems you’ll face while working with JSONs.
{
"president": "Parlosi",
"vicepresident": "Kantos",
"opposition": "Pitaso"
}
Starts with { and ends with }
It’s based around key:value pairs
Each value can contain numbers or strings (for now)
That’s it. That’s a JSON file.
library(jsonlite)
# Store the JSON as a string and parse it into an R object
json_str <- '{
"president": "Parlosi",
"vicepresident": "Kantos",
"opposition": "Pitaso"
}'
fromJSON(json_str)
$president
[1] "Parlosi"
$vicepresident
[1] "Kantos"
$opposition
[1] "Pitaso"
The key must be a string
Parsed as a named list
Think of JSONs as named arrays
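A small sketch of these points, assuming the jsonlite package is installed (fromJSON comes from it). The object names parsed and bad are illustrative only.

```r
library(jsonlite)

json_str <- '{"president": "Parlosi", "vicepresident": "Kantos", "opposition": "Pitaso"}'

# Valid JSON: every key is a quoted string, so it parses into a named list
parsed <- fromJSON(json_str)
names(parsed)
parsed$president

# An unquoted key is not valid JSON, so the parser raises an error;
# tryCatch() captures the error object instead of stopping the script
bad <- tryCatch(fromJSON('{president: "Parlosi"}'), error = function(e) e)
inherits(bad, "error")
```

Subsetting with $ works on the result exactly as it would on any other named list.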
{
"president": [
{
"last_name": "Parlosi",
"party": "Free thinkers",
"age": 35
}
],
"vicepresident": [
{
"last_name": "Kantos",
"party": "Free thinkers",
"age": 52
}
],
"opposition": [
{
"last_name": "Pitaso",
"party": "Everyone United",
"age": 45
}
]
}
Let’s break it down:
"president": [
{
"last_name": "Parlosi",
"party": "Free thinkers",
"age": 35
}
]
Same JSON rules
Now we have an array
Structure: key followed by an array with three key:value pairs
Think of arrays in JSON as rows in a data frame.
Three keys (names of each slot) where each contains a one-row data frame inside.
{
"key":[
## First row
{
"col1":1,
"col2":2
},
## Second row
{
"col1":3,
"col2":4
}
]
}
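To check the "arrays are rows" idea, here is a minimal sketch that parses the same structure, assuming the jsonlite package. The row annotations are dropped because JSON itself has no comment syntax; the object name rows is illustrative.

```r
library(jsonlite)

# Same structure as above: one key holding an array of two objects
json_str <- '{"key": [{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}]}'

rows <- fromJSON(json_str)

# The array of two objects becomes a data frame with two rows,
# and the shared keys become the columns
class(rows$key)
nrow(rows$key)
rows$key$col1
```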
json_str <- '
{
"president": [
{
"last_name": "Parlosi",
"party": "Free thinkers",
"age": 35
}
],
"vicepresident": [
{
"last_name": "Kantos",
"party": "Free thinkers",
"age": 52
}
],
"opposition": [
{
"last_name": "Pitaso",
"party": "Everyone United",
"age": 45
}
]
}
'
fromJSON(json_str, simplifyDataFrame = TRUE)
How is it parsed?
A named list of data frames:
$president
  last_name         party age
1   Parlosi Free thinkers  35

$vicepresident
  last_name         party age
1    Kantos Free thinkers  52

$opposition
  last_name           party age
1    Pitaso Everyone United  45
JSONs are key:value pairs, and the value itself can be an array with other key:value pairs.
With that explained, how will this be parsed into R?
{
"president": [
{
"last_name": "Parlosi",
"party": "Free thinkers",
"age": 35
},
{
"last_name": "Stevensson",
"party": "Free thinkers"
}
],
"vicepresident": [
null
],
"opposition": {
"last_name": "Pitaso",
"party": "Everyone United",
"age": 45
}
}
Write it down before the next slide.
$president
   last_name         party age
1    Parlosi Free thinkers  35
2 Stevensson Free thinkers  NA

$vicepresident
[1] NA

$opposition
$opposition$last_name
[1] "Pitaso"

$opposition$party
[1] "Everyone United"

$opposition$age
[1] 45
A data frame for the first slot. That’s correct because the JSON contained an array with two sets of key:value pairs. That’s translatable to a data frame with two rows, even though one of the two sets did not have a field for age.
An NA value for the second slot, since null is the way missing values are represented in JSON. Notice that this null is in a JSON array, so it is parsed as a vector holding a single NA; you can think of it as a data frame with an NA value.
A named list for the third slot. That’s right because there is no array structure ([...]), so the key:value pairs are interpreted as a named list.
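The three cases above can be checked directly. This is a sketch assuming the jsonlite package, with the JSON from the exercise written more compactly.

```r
library(jsonlite)

json_str <- '{
  "president": [
    {"last_name": "Parlosi", "party": "Free thinkers", "age": 35},
    {"last_name": "Stevensson", "party": "Free thinkers"}
  ],
  "vicepresident": [null],
  "opposition": {"last_name": "Pitaso", "party": "Everyone United", "age": 45}
}'

res <- fromJSON(json_str)

# Array of objects -> data frame; the missing "age" field becomes NA
is.data.frame(res$president)
is.na(res$president$age)

# Array holding only null -> a single NA
res$vicepresident

# Object with no surrounding array -> named list, not a data frame
is.data.frame(res$opposition)
names(res$opposition)
```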
Why is this important? You can fix stuff:
$president
   last_name         party age
1    Parlosi Free thinkers  35
2 Stevensson Free thinkers  NA

$vicepresident
[1] NA

$opposition
  last_name           party age
1    Pitaso Everyone United  45
The most important tool for working with JSONs is subsetting
The dirtiest part of JSON is complex nesting
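One way to make the "fix" from R is to subset the opposition slot and overwrite it with a data frame. This is only a sketch, assuming the jsonlite package; the slides do not show how the corrected output was produced, so as.data.frame() here is one possible route rather than the method used.

```r
library(jsonlite)

json_str <- '{
  "president": [
    {"last_name": "Parlosi", "party": "Free thinkers", "age": 35},
    {"last_name": "Stevensson", "party": "Free thinkers"}
  ],
  "vicepresident": [null],
  "opposition": {"last_name": "Pitaso", "party": "Everyone United", "age": 45}
}'

res <- fromJSON(json_str)

# Subset the slot that was parsed as a named list and overwrite it
# with a one-row data frame built from the same fields
res$opposition <- as.data.frame(res$opposition, stringsAsFactors = FALSE)

res$opposition
```

After this step the president and opposition slots are both data frames with the same columns.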
enframe + unnest strategy
Most important part of this section. Taking the previous example, how can we turn it into:
## # A tibble: 4 × 4
##   name          last_name  party             age
##   <chr>         <chr>      <chr>           <int>
## 1 president     Parlosi    Free thinkers      35
## 2 president     Stevensson Free thinkers      NA
## 3 vicepresident <NA>       <NA>               NA
## 4 opposition    Pitaso     Everyone United    45
This is the ideal summary of the result: everything in a single data frame with complete and incomplete information.
Each row is labelled with its ‘category’ (president, vicepresident, opposition).
General strategy of combining two functions: enframe and unnest.
enframe takes a named list and does two things:
Extracts the names of each slot in the list and stores them in a column of a data frame.
Takes everything inside each slot and stores it in a list-column.
List-column: a column of class list that can contain different things (if you remember, all values in a regular column must be of the same kind, either numeric, character or something else, but there can’t be two types in the same column).
In our example,
The first row of this list-column is a data frame that has two rows
The second row is a vector with a single NA value (effectively an empty data frame)
The third is now a data frame since we altered the JSON manually (remember?).
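To make the idea of a list-column concrete, here is a hand-built sketch, assuming the tibble package. The object lc and its reduced set of columns are hypothetical and only mimic the shape described above.

```r
library(tibble)

# Hypothetical hand-built list-column: each element has its own class
lc <- tibble(
  name = c("president", "vicepresident", "opposition"),
  value = list(
    data.frame(last_name = c("Parlosi", "Stevensson"), age = c(35L, NA)),
    NA,
    data.frame(last_name = "Pitaso", age = 45L)
  )
)

# The column itself is a list...
class(lc$value)

# ...and each row holds a different kind of object
sapply(lc$value, function(x) class(x)[1])
```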
Let’s take it for a spin:
# A tibble: 3 × 2
  name          value
  <chr>         <list>
1 president     <df [2 × 3]>
2 vicepresident <lgl [1]>
3 opposition    <df [1 × 3]>
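The call behind output like this is not shown above. A minimal sketch, assuming the jsonlite and tibble packages and the manually altered JSON (opposition wrapped in an array), could look like the following; res_tbl is an illustrative name.

```r
library(jsonlite)
library(tibble)

# Altered JSON: "opposition" is now an array, like "president"
json_str <- '{
  "president": [
    {"last_name": "Parlosi", "party": "Free thinkers", "age": 35},
    {"last_name": "Stevensson", "party": "Free thinkers"}
  ],
  "vicepresident": [null],
  "opposition": [
    {"last_name": "Pitaso", "party": "Everyone United", "age": 45}
  ]
}'

res <- fromJSON(json_str)

# enframe() puts the slot names in `name` and the contents in the
# list-column `value`
res_tbl <- enframe(res)
res_tbl
```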
How do we transform it?
unnest takes list-columns and ‘unpacks’ them into the common class of the list.
If the objects are of different, incompatible classes (data.frame, vectors, etc.), then unnest will fail.
If all objects within the list are of the same or compatible classes, it will combine all of them into a proper column or ‘unpack’ its values.
All of this sounds convoluted, right? Let’s see some applied examples.
# A tibble: 4 × 4
  name          last_name  party             age
  <chr>         <chr>      <chr>           <int>
1 president     Parlosi    Free thinkers      35
2 president     Stevensson Free thinkers      NA
3 vicepresident <NA>       <NA>               NA
4 opposition    Pitaso     Everyone United    45
All objects in the list-column were compatible, so unnest combines them all.
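A sketch of the full enframe + unnest step that yields a result of this shape, assuming the jsonlite, tibble and tidyr packages and the altered JSON in which opposition is an array. The name flat is illustrative.

```r
library(jsonlite)
library(tibble)
library(tidyr)

json_str <- '{
  "president": [
    {"last_name": "Parlosi", "party": "Free thinkers", "age": 35},
    {"last_name": "Stevensson", "party": "Free thinkers"}
  ],
  "vicepresident": [null],
  "opposition": [
    {"last_name": "Pitaso", "party": "Everyone United", "age": 45}
  ]
}'

res <- fromJSON(json_str)

# Step 1: named list -> tibble with a list-column called `value`
# Step 2: unpack the list-column into regular columns
flat <- unnest(enframe(res), cols = value)
flat
```

The lone NA in the vicepresident slot is stretched into a row of missing values, which is why that category still appears in the result.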
A failed example:
Can you tell me why?
Error in `list_unchop()`:
! Can't combine `x[[1]]` <data.frame> and `x[[3]]` <list>.
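The failure can be reproduced with the original JSON, where opposition is an object rather than an array. This is a sketch assuming the jsonlite, tibble and tidyr packages; tryCatch() is used here only so the script keeps running, and attempt and slot_classes are illustrative names.

```r
library(jsonlite)
library(tibble)
library(tidyr)

# Original JSON: "opposition" has no surrounding array
json_str <- '{
  "president": [
    {"last_name": "Parlosi", "party": "Free thinkers", "age": 35},
    {"last_name": "Stevensson", "party": "Free thinkers"}
  ],
  "vicepresident": [null],
  "opposition": {"last_name": "Pitaso", "party": "Everyone United", "age": 45}
}'

res <- fromJSON(json_str)

# Capture the error instead of stopping
attempt <- tryCatch(
  unnest(enframe(res), cols = value),
  error = function(e) e
)
inherits(attempt, "error")

# The classes of the slots show the mismatch: a data frame cannot be
# combined with a plain named list
slot_classes <- sapply(res, function(x) class(x)[1])
slot_classes
```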
The main problem you’ll encounter with JSONs is nesting: the JSON you’re trying to parse has many nested arrays, and some of these arrays are not compatible for unnesting. You’ll have to dive into these nested arrays and fix whatever data you want to extract.
Suppose that as part of a research project, you’ve recently been granted access to the API of a company. You’re interested in studying the relationship between the geo-location of clients, their shopping patterns and their social class.
{
"client1": [
{
"name": "Kurt Rosenwinkel",
"device": [
{
"type": "iphone",
"location": [
{
"lat": 213.12,
"lon": 43.213
}
]
}
]
}
],
"client2": [
{
"name": "YEITUEI",
"device": [
{
"type": "android",
"location": [
{
"lat": 211.12,
"lon": 53.213
}
]
}
]
}
]
}
Before trying to understand a JSON in depth, it’s not a terrible idea to try to read it:
json_str <- '
{
"client1": [
{
"name": "Kurt Rosenwinkel",
"device": [
{
"type": "iphone",
"location": [
{
"lat": 213.12,
"lon": 43.213
}
]
}
]
}
],
"client2": [
{
"name": "YEITUEI",
"device": [
{
"type": "android",
"location": [
{
"lat": 211.12,
"lon": 53.213
}
]
}
]
}
]
}
'
res <- fromJSON(json_str)
$client1
name device
1 Kurt Rosenwinkel iphone, 213.12, 43.213
$client2
name device
1 YEITUEI android, 211.12, 53.213
Two data frames parsed, one for each client. Were they parsed correctly?
Say we wanted to append “Device:” to the device of each client:
[1] "Device: list(type = \"iphone\", location = list(list(lat = 213.12, lon = 43.213)))"
Not parsed correctly?
device is a list with a data frame inside…
location is also a list. Let’s look at what’s inside:
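The calls for this step are not shown in the slides, so the following is a sketch of how the nested pieces can be inspected, assuming the jsonlite package. The object name device is illustrative.

```r
library(jsonlite)

json_str <- '{
  "client1": [{"name": "Kurt Rosenwinkel",
               "device": [{"type": "iphone",
                           "location": [{"lat": 213.12, "lon": 43.213}]}]}],
  "client2": [{"name": "YEITUEI",
               "device": [{"type": "android",
                           "location": [{"lat": 211.12, "lon": 53.213}]}]}]
}'

res <- fromJSON(json_str)

# Pasting onto the list-column turns the whole nested structure into text
paste0("Device: ", res$client1$device)

# Subsetting one level at a time reaches the actual values
device <- res$client1$device[[1]]   # data frame with columns type, location
device$type
paste0("Device: ", device$type)

# location is itself a list holding a data frame with lat and lon
device$location[[1]]
```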
Whenever fromJSON encounters an array within an array, it converts each one to a data frame.
The problem is that these are data frames nested inside data frames, and so on.
It’s difficult to assess where to stop, and this is a good example where unnest is very handy.
Let’s apply this to res in general:
# A tibble: 2 × 2
  client  value
  <chr>   <list>
1 client1 <df [1 × 2]>
2 client2 <df [1 × 2]>
# A tibble: 2 × 4
  client  name             type    location
  <chr>   <chr>            <chr>   <list>
1 client1 Kurt Rosenwinkel iphone  <df [1 × 2]>
2 client2 YEITUEI          android <df [1 × 2]>
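A sketch of calls that would produce tibbles of this shape, assuming the jsonlite, tibble and tidyr packages. The step1 to step4 names are illustrative, and the final unnest of location is an extra step not shown in the output above.

```r
library(jsonlite)
library(tibble)
library(tidyr)

json_str <- '{
  "client1": [{"name": "Kurt Rosenwinkel",
               "device": [{"type": "iphone",
                           "location": [{"lat": 213.12, "lon": 43.213}]}]}],
  "client2": [{"name": "YEITUEI",
               "device": [{"type": "android",
                           "location": [{"lat": 211.12, "lon": 53.213}]}]}]
}'

res <- fromJSON(json_str)

# One row per client; everything else sits in the list-column `value`
step1 <- enframe(res, name = "client")

# Unpack the client data frames: columns client, name, device
step2 <- unnest(step1, cols = value)

# Unpack the device data frames: columns client, name, type, location
step3 <- unnest(step2, cols = device)

# Extra step (assumption): unpack location as well to get lat and lon
step4 <- unnest(step3, cols = location)
step4
```

Each unnest peels off one level of nesting, so the number of calls matches the depth of the arrays you want to flatten.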
Arrays are interpreted as ‘rows’ in JSONs
Subsetting is very important for cleaning JSONs
Nested JSONs are your biggest problem
The enframe + unnest strategy can help you with nested JSONs
Problems: objects in a list-column are of different classes
Chapter 14 – read + exercises
Everyone should have a project + a group here.
Work on the project should begin now. Submissions can be made from now on. The final deadline for the project is in two weeks.