Jorge Cimentada
Most APIs will return data in JavaScript Object Notation (JSON)
Format designed to share data over the internet
JSON is text-based: can be opened with any code editor
It supports the usual numeric/string values + arrays
Important: it allows you to build complex hierarchical structures
Why is JSON important? Why not use a CSV, for example? or XML? I’ll skip the boring stuff and leave you with three:
JSONs are easy to read directly from the source file.
In this session we’ll focus on real-world examples of the problems you’ll face while working with JSONs
{
"president": "Parlosi",
"vicepresident": "Kantos",
"opposition": "Pitaso"
}
Starts with { and ends with }
It’s based around key:value pairs
Each value can be a number or a string (for now)
That’s it. That’s a JSON file.
library(jsonlite)

json_str <- '{
"president": "Parlosi",
"vicepresident": "Kantos",
"opposition": "Pitaso"
}'
fromJSON(json_str)
$president
[1] "Parlosi"
$vicepresident
[1] "Kantos"
$opposition
[1] "Pitaso"
The key must be a string
Parsed as a named list
Think of JSONs as named arrays
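Since the result is a plain named list, the usual list subsetting tools apply. A minimal sketch, reusing the same toy JSON (the object name res is only an illustration):

```r
library(jsonlite)

json_str <- '{"president": "Parlosi", "vicepresident": "Kantos", "opposition": "Pitaso"}'
res <- fromJSON(json_str)

names(res)           # the JSON keys become the names of the list
res$president        # extract one slot by name
res[["opposition"]]  # equivalent double-bracket form
```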
{
"president": [
{
"last_name": "Parlosi",
"party": "Free thinkers",
"age": 35
}
],
"vicepresident": [
{
"last_name": "Kantos",
"party": "Free thinkers",
"age": 52
}
],
"opposition": [
{
"last_name": "Pitaso",
"party": "Everyone United",
"age": 45
}
]
}
Let’s break it down:
"president": [
{
"last_name": "Parlosi",
"party": "Free thinkers",
"age": 35
}
]
Same JSON rules
Now we have an array
Structure: a key followed by an array holding one object with three key:value pairs
Think of each object in a JSON array as a row in a data frame.
Three keys (the names of each slot), each containing a one-row data frame.
{
"key":[
## First row
{
"col1":1,
"col2":2
},
## Second row
{
"col1":3,
"col2":4
}
]
}
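The schema above is not valid JSON as written (JSON has no comments), so here is a small runnable version of the same idea. It is only a sketch with made-up keys (key, col1, col2) showing that two objects in an array come back as a two-row data frame:

```r
library(jsonlite)

# Same structure as the schema above, without the comments
json_str <- '{"key": [{"col1": 1, "col2": 2}, {"col1": 3, "col2": 4}]}'
res <- fromJSON(json_str)

# Each object in the array becomes one row; each key becomes a column
res$key
nrow(res$key)
```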
json_str <- '
{
"president": [
{
"last_name": "Parlosi",
"party": "Free thinkers",
"age": 35
}
],
"vicepresident": [
{
"last_name": "Kantos",
"party": "Free thinkers",
"age": 52
}
],
"opposition": [
{
"last_name": "Pitaso",
"party": "Everyone United",
"age": 45
}
]
}
'
fromJSON(json_str, simplifyDataFrame = TRUE)
How is it parsed?
A named list of data frames:
$president
last_name party age
1 Parlosi Free thinkers 35
$vicepresident
last_name party age
1 Kantos Free thinkers 52
$opposition
last_name party age
1 Pitaso Everyone United 45
JSONs are key:value pairs, and the value itself can be an array with other key:value pairs.
With that explained, how will this be parsed into R?
{
"president": [
{
"last_name": "Parlosi",
"party": "Free thinkers",
"age": 35
},
{
"last_name": "Stevensson",
"party": "Free thinkers"
}
],
"vicepresident": [
null
],
"opposition": {
"last_name": "Pitaso",
"party": "Everyone United",
"age": 45
}
}
Write it down before the next slide.
$president
last_name party age
1 Parlosi Free thinkers 35
2 Stevensson Free thinkers NA
$vicepresident
[1] NA
$opposition
$opposition$last_name
[1] "Pitaso"
$opposition$party
[1] "Everyone United"
$opposition$age
[1] 45
A data frame for the first slot. That’s correct because the JSON contained an array with two sets of key:value pairs. That’s translatable to a data frame with two rows even though one of the two sets did not have a field for age.
An NA value, since null is how JSON represents missing values. Notice that this null sits inside a JSON array, so it is parsed as a one-element vector holding NA rather than as a data frame.
A named list. That’s right because there is no array structure ([...]) so key:value pairs are interpreted as a named list.
Why is this important? You can fix stuff:
$president
last_name party age
1 Parlosi Free thinkers 35
2 Stevensson Free thinkers NA
$vicepresident
[1] NA
$opposition
last_name party age
1 Pitaso Everyone United 45
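One way to get from the parsed result to the fixed version is plain subsetting and assignment. This is a sketch of one possible fix, not necessarily the exact code behind the output above; it assumes the parsed object is called res:

```r
library(jsonlite)

json_str <- '{
  "president": [
    {"last_name": "Parlosi", "party": "Free thinkers", "age": 35},
    {"last_name": "Stevensson", "party": "Free thinkers"}
  ],
  "vicepresident": [null],
  "opposition": {"last_name": "Pitaso", "party": "Everyone United", "age": 45}
}'
res <- fromJSON(json_str, simplifyDataFrame = TRUE)

# opposition was parsed as a named list; convert it to a one-row data frame
res$opposition <- data.frame(res$opposition)
res
```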
The most important tool for working with JSONs is subsetting
The dirtiest part of JSON is complex nesting
enframe + unnest strategy
Most important part of this section. Taking the previous example, how can we turn it into:
## # A tibble: 4 × 4
## name last_name party age
## <chr> <chr> <chr> <int>
## 1 president Parlosi Free thinkers 35
## 2 president Stevensson Free thinkers NA
## 3 vicepresident <NA> <NA> NA
## 4 opposition Pitaso Everyone United 45
This is the ideal summary of the result: everything in a single data frame with complete and incomplete information.
Each row is labelled with its ‘category’ (president, vicepresident, opposition) in the name column.
General strategy of combining two functions: enframe and unnest.
enframe takes a named list and does two things:
Extracts the names of each slot in the list and stores it in a column in a data frame.
Takes everything inside each slot and stores it in a list-column.
List-column: a column of class list whose elements can be different things. (Remember that an ordinary data frame column must hold a single type, such as numeric or character; it can’t mix two types.)
In our example,
First row of this list-column is a data frame that has two rows
The second row is a logical vector holding a single NA
The third is now a data frame since we altered the JSON manually (remember?).
Let’s take it for a spin:
# A tibble: 3 × 2
name value
<chr> <list>
1 president <df [2 × 3]>
2 vicepresident <lgl [1]>
3 opposition <df [1 × 3]>
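A tibble like the one above can be produced along these lines. This is a sketch that assumes the manually fixed list from before is called res; res_tbl is just an illustrative name:

```r
library(jsonlite)
library(tibble)

json_str <- '{
  "president": [
    {"last_name": "Parlosi", "party": "Free thinkers", "age": 35},
    {"last_name": "Stevensson", "party": "Free thinkers"}
  ],
  "vicepresident": [null],
  "opposition": {"last_name": "Pitaso", "party": "Everyone United", "age": 45}
}'
res <- fromJSON(json_str, simplifyDataFrame = TRUE)
res$opposition <- data.frame(res$opposition)  # manual fix from earlier

# enframe: list names go to `name`, list contents go to the list-column `value`
res_tbl <- enframe(res)
res_tbl
```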
How do we transform it?
unnest takes list-columns and ‘unpacks’ them into the common class of the list.
If the objects are of incompatible classes (data.frame, vectors, etc.), then unnest will fail.
If all objects within the list are of the same class, it will combine all of them into a proper column or ‘unpack’ its values.
All of this sounds convoluted right? Let’s see some applied examples.
# A tibble: 4 × 4
name last_name party age
<chr> <chr> <chr> <int>
1 president Parlosi Free thinkers 35
2 president Stevensson Free thinkers NA
3 vicepresident <NA> <NA> NA
4 opposition Pitaso Everyone United 45
All objects of the data frame were compatible so it combines them all.
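Put together, the two steps might look like this. It is a sketch under the same assumptions as before (the list is called res and opposition has already been converted to a data frame); flat is an illustrative name:

```r
library(jsonlite)
library(tibble)
library(tidyr)

json_str <- '{
  "president": [
    {"last_name": "Parlosi", "party": "Free thinkers", "age": 35},
    {"last_name": "Stevensson", "party": "Free thinkers"}
  ],
  "vicepresident": [null],
  "opposition": {"last_name": "Pitaso", "party": "Everyone United", "age": 45}
}'
res <- fromJSON(json_str, simplifyDataFrame = TRUE)
res$opposition <- data.frame(res$opposition)  # manual fix from earlier

# enframe builds the list-column, unnest unpacks it into regular columns
flat <- unnest(enframe(res), cols = value)
flat
```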
A failed example:
Can you tell me why?
Error in `list_unchop()`:
! Can't combine `x[[1]]` <data.frame> and `x[[3]]` <list>.
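The failure can be reproduced by skipping the manual fix, so that opposition is still a named list sitting next to data frames. A sketch, wrapping the call in tryCatch so the script keeps running; the exact wording of the message may differ between package versions:

```r
library(jsonlite)
library(tibble)
library(tidyr)

json_str <- '{
  "president": [
    {"last_name": "Parlosi", "party": "Free thinkers", "age": 35},
    {"last_name": "Stevensson", "party": "Free thinkers"}
  ],
  "vicepresident": [null],
  "opposition": {"last_name": "Pitaso", "party": "Everyone United", "age": 45}
}'
res <- fromJSON(json_str, simplifyDataFrame = TRUE)  # no manual fix this time

# Capture the error message instead of stopping
msg <- tryCatch(
  unnest(enframe(res), cols = value),
  error = function(e) conditionMessage(e)
)
msg
```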
The main problem you’ll encounter with JSONs is deep nesting: the JSON you are parsing has many nested arrays, and some of them are not compatible for unnesting. You’ll have to dig into these nested arrays and fix whatever data you want to extract.
Suppose that as part of a research project, you’ve recently been granted access to the API of a company. You’re interested in studying the relationship between the geo-location of clients, their shopping patterns and their social class.
{
"client1": [
{
"name": "Kurt Rosenwinkel",
"device": [
{
"type": "iphone",
"location": [
{
"lat": 213.12,
"lon": 43.213
}
]
}
]
}
],
"client2": [
{
"name": "YEITUEI",
"device": [
{
"type": "android",
"location": [
{
"lat": 211.12,
"lon": 53.213
}
]
}
]
}
]
}
Before trying to understand a JSON in depth, it’s not a terrible idea to try to read it:
json_str <- '
{
"client1": [
{
"name": "Kurt Rosenwinkel",
"device": [
{
"type": "iphone",
"location": [
{
"lat": 213.12,
"lon": 43.213
}
]
}
]
}
],
"client2": [
{
"name": "YEITUEI",
"device": [
{
"type": "android",
"location": [
{
"lat": 211.12,
"lon": 53.213
}
]
}
]
}
]
}
'
res <- fromJSON(json_str)
res
$client1
name device
1 Kurt Rosenwinkel iphone, 213.12, 43.213
$client2
name device
1 YEITUEI android, 211.12, 53.213
Two data frames parsed, one for each client. Were they parsed correctly?
Say we wanted to append “Device:” to the device of each client:
[1] "Device: list(type = \"iphone\", location = list(list(lat = 213.12, lon = 43.213)))"
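Output like this is what you would get from something along the following lines. This is a guess at the call, using paste0 on the device column; labelled is an illustrative name:

```r
library(jsonlite)

json_str <- '{
  "client1": [{"name": "Kurt Rosenwinkel",
               "device": [{"type": "iphone",
                           "location": [{"lat": 213.12, "lon": 43.213}]}]}],
  "client2": [{"name": "YEITUEI",
               "device": [{"type": "android",
                           "location": [{"lat": 211.12, "lon": 53.213}]}]}]
}'
res <- fromJSON(json_str)

# device is a list-column, so paste0 deparses the whole nested object
labelled <- paste0("Device: ", res$client1$device)
labelled
class(res$client1$device)
```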
Not parsed correctly?
device is a list with a data frame inside…
location is also a list. Let’s look at what’s inside:
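A quick way to look inside is to subset one level at a time. A sketch, assuming the parsed object is called res; device_df and location_df are illustrative names:

```r
library(jsonlite)

json_str <- '{
  "client1": [{"name": "Kurt Rosenwinkel",
               "device": [{"type": "iphone",
                           "location": [{"lat": 213.12, "lon": 43.213}]}]}],
  "client2": [{"name": "YEITUEI",
               "device": [{"type": "android",
                           "location": [{"lat": 211.12, "lon": 53.213}]}]}]
}'
res <- fromJSON(json_str)

# First level: the data frame stored inside the device list-column
device_df <- res$client1$device[[1]]
str(device_df)

# Second level: the data frame stored inside the location list-column
location_df <- device_df$location[[1]]
location_df
```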
Whenever fromJSON encounters an array within an array, it converts each one to a data frame.
The problem is that these are recurrent data frames inside data frames and so on.
It’s difficult to assess where to stop and this is a good example where unnest is very handy.
Let’s apply this to res in general:
# A tibble: 2 × 2
client value
<chr> <list>
1 client1 <df [1 × 2]>
2 client2 <df [1 × 2]>
# A tibble: 2 × 4
client name type location
<chr> <chr> <chr> <list>
1 client1 Kurt Rosenwinkel iphone <df [1 × 2]>
2 client2 YEITUEI android <df [1 × 2]>
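The two tibbles above can be reproduced roughly as follows: one enframe, then one unnest per level of nesting. This is a sketch; the intermediate names (clients, step1, step2, step3) are illustrative, and the last unnest goes one level further than the output shown:

```r
library(jsonlite)
library(tibble)
library(tidyr)

json_str <- '{
  "client1": [{"name": "Kurt Rosenwinkel",
               "device": [{"type": "iphone",
                           "location": [{"lat": 213.12, "lon": 43.213}]}]}],
  "client2": [{"name": "YEITUEI",
               "device": [{"type": "android",
                           "location": [{"lat": 211.12, "lon": 53.213}]}]}]
}'
res <- fromJSON(json_str)

# Name the first column `client` so it does not clash with the inner `name` column
clients <- enframe(res, name = "client")
clients

step1 <- unnest(clients, cols = value)   # client, name, device
step2 <- unnest(step1, cols = device)    # client, name, type, location
step2

# One more level, if the coordinates are needed as regular columns
step3 <- unnest(step2, cols = location)  # client, name, type, lat, lon
step3
```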
Arrays are interpreted as ‘rows’ in JSONs
Subsetting is very important for cleaning JSONs
Nested JSONs are your biggest problem
enframe + unnest strategy can help you with nested JSONs
Problems: objects in a list-column are of different classes
Chapter 14 – read + exercises
Everyone should have a project + a group here.
Work on project should begin now. Submissions can be done from now on. Final deadline for the project is in two weeks.