I’ve just started reading the wonderful book Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. I’ve just begun the first chapter and I wanted to document some of the things I found interesting. In this post I’ll walkthrough the example on how to transform a text file with GPS locations into a well formatted rectangular dataset. For a detailed explanation see their book, which I highly recommend buying.
Note: When it makes senses/it’s possible, I always try to find an equivalent tidyverse solution to everything they do in the book.
This is the data.
library(tidyverse)
ex_file <- read_lines("http://rdatasciencecases.org/Data/offline.final.trace.txt")
ex_file[1:4]
## [1] "# timestamp=2006-02-11 08:31:58"
## [2] "# usec=250"
## [3] "# minReadings=110"
## [4] "t=1139643118358;id=00:02:2D:21:0F:33;pos=0.0,0.0,0.0;degree=0.0;00:14:bf:b1:97:8a=-38,2437000000,3;00:14:bf:b1:97:90=-56,2427000000,3;00:0f:a3:39:e1:c0=-53,2462000000,3;00:14:bf:b1:97:8d=-65,2442000000,3;00:14:bf:b1:97:81=-65,2422000000,3;00:14:bf:3b:c7:c6=-66,2432000000,3;00:0f:a3:39:dd:cd=-75,2412000000,3;00:0f:a3:39:e0:4b=-78,2462000000,3;00:0f:a3:39:e2:10=-87,2437000000,3;02:64:fb:68:52:e6=-88,2447000000,1;02:00:42:55:31:00=-84,2457000000,1"
Some lines are comments and the 4th line is the actual data. Basically, everything that is something=
is the name of the column and columns are separated by a ;
. Now, within each column there can also be several values like in the column pos
where numbers are separated by a comma.
First, let’s separate everything now that we know all of the delimiters.
tokens <- str_split(ex_file[4], pattern = "[;=,]")[[1]]
From the documentation we know that the first 4 columns are constant in every line. The remaining columns can vary by each line, which is why they decide to transform the data into stacked/long format. So each unique id
will be repeate the number of times that there’s MAC columns (the columns that vary).
tmp <- matrix(tokens[-(1:10)], ncol = 4, byrow = TRUE)
# We got the MAC in a long format, now we have to get unique id
# of each of the macs (along with time and other vars) to be repeated
# the number of rows that tmp has
# There we go
tmp_two <- matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, byrow = TRUE)
mat <- cbind(tmp_two, tmp)
mat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [2,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [3,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [4,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [5,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [6,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [7,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [8,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [9,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [10,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [11,] "1139643118358" "00:02:2D:21:0F:33" "0.0" "0.0" "0.0" "0.0"
## [,7] [,8] [,9] [,10]
## [1,] "00:14:bf:b1:97:8a" "-38" "2437000000" "3"
## [2,] "00:14:bf:b1:97:90" "-56" "2427000000" "3"
## [3,] "00:0f:a3:39:e1:c0" "-53" "2462000000" "3"
## [4,] "00:14:bf:b1:97:8d" "-65" "2442000000" "3"
## [5,] "00:14:bf:b1:97:81" "-65" "2422000000" "3"
## [6,] "00:14:bf:3b:c7:c6" "-66" "2432000000" "3"
## [7,] "00:0f:a3:39:dd:cd" "-75" "2412000000" "3"
## [8,] "00:0f:a3:39:e0:4b" "-78" "2462000000" "3"
## [9,] "00:0f:a3:39:e2:10" "-87" "2437000000" "3"
## [10,] "02:64:fb:68:52:e6" "-88" "2447000000" "1"
## [11,] "02:00:42:55:31:00" "-84" "2457000000" "1"
There we go. We have a stacked matrix with all the variables we need. Let’s wrap the line maker into a function:
processLine <- function(x) {
tokens <- str_split(x, pattern = "[;=,]")[[1]]
# We got the MAC in a long format, now we have to get unique id
# of each of the macs (along with time and other vars) to be repeated
# the number of rows that tmp has
tmp <- matrix(tokens[-(1:10)], ncol = 4, byrow = TRUE)
# There we go
tmp_two <- matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, byrow = TRUE)
mat <- cbind(tmp_two, tmp)
mat
}
Let’s apply it to a few sample rows:
tmp <- map(ex_file[4:20], processLine)
offline <- as.data.frame(do.call("rbind", tmp))
head(offline)
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:14:bf:b1:97:8a -38
## 2 1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:14:bf:b1:97:90 -56
## 3 1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:0f:a3:39:e1:c0 -53
## 4 1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:14:bf:b1:97:8d -65
## 5 1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:14:bf:b1:97:81 -65
## 6 1139643118358 00:02:2D:21:0F:33 0.0 0.0 0.0 0.0 00:14:bf:3b:c7:c6 -66
## V9 V10
## 1 2437000000 3
## 2 2427000000 3
## 3 2462000000 3
## 4 2442000000 3
## 5 2422000000 3
## 6 2432000000 3
Good! Now we can apply it to all lines, excluding of course the ones which are commented out!
tmp <- map(ex_file[!str_sub(ex_file, 1, 1) == "#"], processLine)
## Warning in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
## data length exceeds size of matrix
## Warning in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
## data length exceeds size of matrix
## Warning in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
## data length exceeds size of matrix
## Warning in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
## data length exceeds size of matrix
## Warning in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
## data length exceeds size of matrix
## Warning in matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, :
## data length exceeds size of matrix
Aha.. so there’s a few warnings? What’s happening? If we ran the previous with options(error, warn = 2)
we would see that it looks like there are some anomalous cases where there’s no MAC information. We either fill out those values with NA’s or we simply exclude them. Because working with the MAC’s is of utmost importance for the analysis, we drop it to save memory. We redefine our function so that if there’s only the 10 starting values it returns a NULL.
processLine <- function(x) {
tokens <- str_split(x, pattern = "[;=,]")[[1]]
# We exclude rows where there's no MAC information
if (length(tokens) == 10) return(NULL)
# We got the MAC in a long format, now we have to get unique id
# of each of the macs (along with time and other vars) to be repeated
# the number of rows that tmp has
tmp <- matrix(tokens[-(1:10)], ncol = 4, byrow = TRUE)
# There we go
tmp_two <- matrix(tokens[c(2, 4, 6:8, 10)], nrow = nrow(tmp), ncol = 6, byrow = TRUE)
mat <- cbind(tmp_two, tmp)
mat
}
And apply it now..
tmp <- map(ex_file[!str_sub(ex_file, 1, 1) == "#"], processLine)
offline <- as_tibble(do.call("rbind", tmp))
Good, let’s set warnings back: options(error = recover, warn = 1)
To finish off let’s set some names.
names(offline) <- c("time", "scanMac", "posX", "posY", "posZ",
"orientation", "mac", "signal", "channel", "type")
offline
## # A tibble: 1,181,628 x 10
## time scanMac posX posY posZ orientation mac signal channel type
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 11396~ 00:02:~ 0.0 0.0 0.0 0.0 00:1~ -38 243700~ 3
## 2 11396~ 00:02:~ 0.0 0.0 0.0 0.0 00:1~ -56 242700~ 3
## 3 11396~ 00:02:~ 0.0 0.0 0.0 0.0 00:0~ -53 246200~ 3
## 4 11396~ 00:02:~ 0.0 0.0 0.0 0.0 00:1~ -65 244200~ 3
## 5 11396~ 00:02:~ 0.0 0.0 0.0 0.0 00:1~ -65 242200~ 3
## 6 11396~ 00:02:~ 0.0 0.0 0.0 0.0 00:1~ -66 243200~ 3
## 7 11396~ 00:02:~ 0.0 0.0 0.0 0.0 00:0~ -75 241200~ 3
## 8 11396~ 00:02:~ 0.0 0.0 0.0 0.0 00:0~ -78 246200~ 3
## 9 11396~ 00:02:~ 0.0 0.0 0.0 0.0 00:0~ -87 243700~ 3
## 10 11396~ 00:02:~ 0.0 0.0 0.0 0.0 02:6~ -88 244700~ 1
## # ... with 1,181,618 more rows
— BONUS —
Just wanted to try to get the data in a wide format where each MAC indicator is a column rather than stacked.
# Define the MAC colums as wide. Because each MAC columns
# has three associated values, I stack them up so there should
# be three rows pero every MAC column
right_col <- tokens[-(1:10)]
right_names <- seq(1, length(right_col), by = 4)
mac_tibble <-
matrix(right_col[-right_names], nrow = 3, ncol = length(right_names),
dimnames = list(NULL, right_col[right_names])) %>%
as_tibble() %>%
add_column(mac_indicators = c("signal", "chanel", "type"),
.before = 1)
# Define the first four columns
left_col <- tokens[1:10]
left_names <- seq(1, length(left_col), by = 2)
left_tibble <-
matrix(left_col[-left_names], nrow = 3, ncol = length(left_names), byrow = TRUE,
dimnames = list(NULL, left_col[left_names])) %>%
as_tibble()
# Bind both dfs
mat <- bind_cols(left_tibble, mac_tibble)
mat
## # A tibble: 3 x 17
## t id pos `0.0` degree mac_indicators `00:14:bf:b1:97:~
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 11396431~ 00:02:2D:~ 0.0 0.0 0.0 signal -38
## 2 11396431~ 00:02:2D:~ 0.0 0.0 0.0 chanel 2437000000
## 3 11396431~ 00:02:2D:~ 0.0 0.0 0.0 type 3
## # ... with 10 more variables: `00:14:bf:b1:97:90` <chr>,
## # `00:0f:a3:39:e1:c0` <chr>, `00:14:bf:b1:97:8d` <chr>,
## # `00:14:bf:b1:97:81` <chr>, `00:14:bf:3b:c7:c6` <chr>,
## # `00:0f:a3:39:dd:cd` <chr>, `00:0f:a3:39:e0:4b` <chr>,
## # `00:0f:a3:39:e2:10` <chr>, `02:64:fb:68:52:e6` <chr>,
## # `02:00:42:55:31:00` <chr>
Let’s wrap it into a function excluding those which dont have MAC values.
processLine <- function(x) {
tokens <- str_split(x, pattern = "[;=,]")[[1]]
if (length(tokens) == 10) return(NULL) # exclude non-MAC lines
right_col <- tokens[-(1:10)]
right_names <- seq(1, length(right_col), by = 4)
mac_tibble <-
matrix(right_col[-right_names], nrow = 3, ncol = length(right_names),
dimnames = list(NULL, right_col[right_names]))
# Define the first four columns
left_col <- tokens[1:10]
left_names <- seq(1, length(left_col), by = 2)
left_tibble <-
matrix(left_col[-left_names], nrow = 3, ncol = length(left_names), byrow = TRUE,
dimnames = list(NULL, left_col[left_names]))
# Bind both dfs
mat <- cbind(left_tibble, mac_tibble)
mat
}
Let’s apply it to each line:
tmp <- map(ex_file[!str_sub(ex_file, 1, 1) == "#"], processLine)
# Interestingly, applying as_tibble instead of as.data.frame is
# very slow. So I opt for data frame and then convert the binded
# df to a tibble
final_data <-
bind_rows(map(tmp, as.data.frame, stringsAsFactors = FALSE)) %>%
as_tibble() %>%
add_column(mac_indicators = rep(c("signal", "chanel", "type"), length(unique(.$t))),
.after = "degree")
final_data
## # A tibble: 438,222 x 40
## t id pos `0.0` degree mac_indicators `00:14:bf:b1:97:~
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 11396431~ 00:02:2D~ 0.0 0.0 0.0 signal -38
## 2 11396431~ 00:02:2D~ 0.0 0.0 0.0 chanel 2437000000
## 3 11396431~ 00:02:2D~ 0.0 0.0 0.0 type 3
## 4 11396431~ 00:02:2D~ 0.0 0.0 0.0 signal -38
## 5 11396431~ 00:02:2D~ 0.0 0.0 0.0 chanel 2437000000
## 6 11396431~ 00:02:2D~ 0.0 0.0 0.0 type 3
## 7 11396431~ 00:02:2D~ 0.0 0.0 0.0 signal -38
## 8 11396431~ 00:02:2D~ 0.0 0.0 0.0 chanel 2437000000
## 9 11396431~ 00:02:2D~ 0.0 0.0 0.0 type 3
## 10 11396431~ 00:02:2D~ 0.0 0.0 0.0 signal -38
## # ... with 438,212 more rows, and 33 more variables:
## # `00:14:bf:b1:97:90` <chr>, `00:0f:a3:39:e1:c0` <chr>,
## # `00:14:bf:b1:97:8d` <chr>, `00:14:bf:b1:97:81` <chr>,
## # `00:14:bf:3b:c7:c6` <chr>, `00:0f:a3:39:dd:cd` <chr>,
## # `00:0f:a3:39:e0:4b` <chr>, `00:0f:a3:39:e2:10` <chr>,
## # `02:64:fb:68:52:e6` <chr>, `02:00:42:55:31:00` <chr>,
## # `00:04:0e:5c:23:fc` <chr>, `00:30:bd:f8:7f:c5` <chr>, `1.0` <chr>,
## # `2.0` <chr>, `3.0` <chr>, `4.0` <chr>, `5.0` <chr>, `6.0` <chr>,
## # `7.0` <chr>, `8.0` <chr>, `9.0` <chr>, `10.0` <chr>, `11.0` <chr>,
## # `12.0` <chr>, `13.0` <chr>, `00:e0:63:82:8b:a9` <chr>,
## # `02:37:fd:3b:54:b5` <chr>, `02:2e:58:22:f1:ac` <chr>,
## # `02:42:1c:4e:b5:c0` <chr>, `02:0a:3d:06:94:88` <chr>,
## # `02:5c:e0:50:49:de` <chr>, `02:4f:99:43:30:cd` <chr>,
## # `02:b7:00:bb:a9:35` <chr>
There we go! It’s a bit refreshing to work on datasets that are not pre-cleaned for you.