Jorge Cimentada and Basilio Moreno
6th of July 2019
Alright, so far we have seen vectors, matrices and data frames.
x <- sample(1:10)
x
[1] 8 3 1 4 6 7 5 9 10 2
We have 10 random numbers.
Their positions are:
 1  2  3  4  5  6  7  8  9 10
 8  3  1  4  6  7  5  9 10  2
If x is:
[1] 8 3 1 4 6 7 5 9 10 2
what is the result of:
x[c(1, 3, 8)] #Watch out for square brackets.
x[c(-1, -5)]
x[seq(1, 8, 2)]
x[NA]
x[]
Write it down without running it!
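Once you've written down your answers, here is a minimal sketch of the same indexing rules on a different vector, so it doesn't give the exercise away. The vector v and the result names are made up for illustration.

```r
# A hypothetical vector, not the x from the exercise
v <- c(10, 20, 30, 40, 50)

picked     <- v[c(1, 3)]       # positive indices keep those positions
dropped    <- v[c(-1, -5)]     # negative indices drop those positions
stepped    <- v[seq(1, 5, 2)]  # any numeric vector can be used as an index
with_na    <- v[NA]            # a logical NA is recycled to the full length
everything <- v[]              # an empty index returns the whole vector

picked
dropped
stepped
with_na
everything
```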
Do these subsetting rules apply the same for all types of vectors?
char <- letters[1:10]
lgl <- c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
gender <- factor(sample(c("female", "male"), 10, replace = TRUE))
What about these ones?
char[c(1, 1, 1)]
lgl[c(TRUE, 5, 1)]
gender[c(1:3, TRUE)]
Super test:
super_vector <- c(char, gender, lgl)
super_vector[c(1, 11, 27)]
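The tricky part of the super test is not the subsetting but what c() does to mixed types. A small sketch, using made-up stand-ins for char, gender and lgl, shows the coercion that is probably going on:

```r
# Hypothetical small inputs standing in for char, gender and lgl
chr_part <- c("a", "b")
fct_part <- factor(c("male", "female"))  # levels: female = 1, male = 2
lgl_part <- c(TRUE, FALSE)

# c() must return a single type, so everything becomes character here.
# Because the first argument is a plain character vector, the factor
# contributes its integer codes rather than its labels.
combined <- c(chr_part, fct_part, lgl_part)
combined
class(combined)

# Converting the factor first keeps the labels instead of the codes
with_labels <- c(chr_part, as.character(fct_part), lgl_part)
with_labels
```

After that, subsetting combined works like subsetting any other character vector.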
Subsetting rules are the same for all types of vectors.
Exceptions are:
Let's go through each one…
If you remember correctly, a matrix is a vector with rows and columns.
x_matrix <- matrix(1:10, 5, 2) # 5 rows and 2 columns
x_matrix
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
Building on the previous examples, what would be the result of this?
x_matrix[c(1, 4, 6)]
To confuse you even more, what do you think would be the result of this?
x_matrix[2:3, ]
A matrix can be thought of as two things:
[1] 1 2 3 4 5 6 7 8 9 10
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
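To make the two views concrete, here is a short sketch on a matrix with the same shape as x_matrix (the name m is just for illustration). One index treats the matrix as its underlying vector; two indices treat it as rows and columns.

```r
m <- matrix(1:10, nrow = 5, ncol = 2)  # same shape as x_matrix

as.vector(m)  # the underlying vector, filled column by column
dim(m)        # the dimensions laid on top of it: 5 rows, 2 columns

by_position <- m[c(1, 4, 6)]  # one index: positions in the underlying vector
by_rows     <- m[2:3, ]       # two indices: rows 2 and 3, all columns

by_position
by_rows

# Row 1 of column 2 is the 6th element of the underlying vector
m[1, 2] == m[6]
```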
Now that you know, what are the results of:
x_matrix[1:5, 2]
x_matrix[, 2]
x_matrix[1, 1]
x_matrix[1:10, 2]
x_matrix[, 1:2]
Now, data frames are very similar to matrices.
our_df <- data.frame(letters = letters[1:10], age = sample(25:50, 10),
                     lgl = sample(c(TRUE, FALSE), 10, replace = TRUE))
our_df
   letters age   lgl
1        a  25 FALSE
2        b  27 FALSE
3        c  34 FALSE
4        d  39 FALSE
5        e  40  TRUE
6        f  45 FALSE
7        g  35 FALSE
8        h  43  TRUE
9        i  28 FALSE
10       j  48  TRUE
Data frames are subsetted the same way as matrices!
# First 3 rows for all columns
our_df[1:3, ]
# Only the first and 8th row for first two columns
our_df[c(1, 8), 1:2]
# The 5th row three times for the third column
our_df[c(5, 5, 5), 3]
What? Why is the last one a vector?
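The short answer: when the result has a single column, [ , ] simplifies it to a vector by default. A minimal sketch on a made-up data frame (demo_df is hypothetical) shows the default and how drop = FALSE keeps the data frame:

```r
# A hypothetical data frame; row 5 of lgl is TRUE
demo_df <- data.frame(letters = letters[1:10],
                      lgl = rep(c(TRUE, FALSE), 5))

# A single column is simplified to a plain vector by default
one_col <- demo_df[c(5, 5, 5), 2]
one_col
is.data.frame(one_col)

# drop = FALSE keeps the data frame structure
kept_df <- demo_df[c(5, 5, 5), 2, drop = FALSE]
kept_df
is.data.frame(kept_df)
```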
So far we saw how to subset the same way we subset matrices.
# We lose the data frame dimensions using this method.
our_df[["age"]]
# We get a data frame with this one.
our_df["age"]
# We don't get a data frame here.
our_df$age
Following the 'list' subsetting rules for data frames:
The result should be:
[1] 34 39 28
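One way to get a result like this, assuming the ages printed earlier: pull the column out list-style, then index the resulting vector by position. The data frame below is rebuilt by hand from that printed output, and the positions 3, 4 and 9 are an assumption based on where those ages sit.

```r
# Rebuilt by hand from the printed our_df (the original used sample())
demo_df <- data.frame(
  letters = letters[1:10],
  age = c(25, 27, 34, 39, 40, 45, 35, 43, 28, 48)
)

# List-style subsetting returns the column as a vector,
# which can then be indexed like any other vector
picked_ages <- demo_df[["age"]][c(3, 4, 9)]
picked_ages

# The $ form is equivalent here
demo_df$age[c(3, 4, 9)]
```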
Well, now that we're at it… How does it work for lists?
our_list <- list(data = our_df, x_matrix, gnd = gender)
Explanation
our_list
our_list[1]
our_list[[1]]
our_list[[1]][[1]]
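The difference between [ and [[ is easier to see on a tiny list. This sketch uses a made-up demo_list with the same layout (a named data frame, an unnamed matrix, a named factor):

```r
# A hypothetical list with the same layout as the one above
demo_list <- list(data = data.frame(id = 1:3),
                  matrix(1:4, 2, 2),
                  gnd = factor(c("female", "male")))

names(demo_list)       # the unnamed element gets an empty name

class(demo_list[1])    # single brackets return a smaller list
class(demo_list[[1]])  # double brackets return the element itself

first_col <- demo_list[[1]][[1]]  # first column of the data frame
first_col

demo_list$gnd          # $ works like [[ ]] for named elements
```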
What does this return?
our_df[["our_variable"]]
our_df["our_variable"]
our_df$our_variable
We're subsetting a variable that doesn't exist
What is missing to create this variable?
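The three forms don't fail in the same way. Here is a small sketch on a made-up data frame; tryCatch() is only there so the example keeps running after the error.

```r
demo_df <- data.frame(age = c(25, 27, 34))  # hypothetical data

# [[ ]] and $ quietly return NULL for a column that doesn't exist
via_double <- demo_df[["our_variable"]]
via_dollar <- demo_df$our_variable
is.null(via_double)
is.null(via_dollar)

# Single brackets raise an error instead; capture its message
via_single <- tryCatch(demo_df["our_variable"],
                       error = function(e) conditionMessage(e))
via_single
```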
Three ways of creating a variable:
our_df[["our_variable"]] <- 1:10
our_df["our_variable"] <- 11:20
our_df$our_variable <- seq(1, 20, 2)
There's one other way of doing it… Think hard about [] and the , to divide rows and columns.
our_df[, "our_variable"] <- "this repeats until end"
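Why does a single string fill the whole column? A value of length one is recycled to the number of rows. A rough sketch on a made-up data frame, including a length that does not fit:

```r
demo_df <- data.frame(id = 1:4)  # hypothetical data with 4 rows

# A length-1 value is recycled to fill all 4 rows
demo_df[, "label"] <- "same text"
demo_df$label

# A length that does not divide the number of rows is rejected
outcome <- tryCatch({
  demo_df$bad <- 1:3
  "assigned"
}, error = function(e) "error")
outcome
```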
Add two variables to the our_df data frame using any of the options above.
The first should be TRUE for when age is above or equal to 35.
The second should be the sum of our_df$age and our_df$lgl.
Call them whatever you want.
our_df$lgl_two <- our_df$age >= 35
our_df$add <- our_df$age + our_df$lgl
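Adding a number and a logical works because TRUE counts as 1 and FALSE as 0 in arithmetic. A tiny sketch with made-up values:

```r
age  <- c(25, 40, 43)          # hypothetical ages
flag <- c(FALSE, TRUE, TRUE)   # hypothetical logical column

is_35_plus <- age >= 35  # a comparison returns a logical vector
is_35_plus

added <- age + flag      # TRUE is treated as 1, FALSE as 0
added

n_35_plus <- sum(is_35_plus)  # so sum() counts the TRUEs
n_35_plus
```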
When we subset, we almost never do it the way we've been doing so far.
You have all the tools to achieve this. Can you tell me how to do it?
Ok, we only want people with ages below 40 years old.
age < 40
Everything set!
Not quite: age is not a variable out there in our environment!
our_df$age < 40
[1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
Rows c(1, 2, 3, 4, 7, 9) comply with the logical statement.
our_df[c(1, 2, 3, 4, 7, 9), ]
  letters age   lgl           our_variable lgl_two add
1       a  25 FALSE this repeats until end   FALSE  25
2       b  27 FALSE this repeats until end   FALSE  27
3       c  34 FALSE this repeats until end   FALSE  34
4       d  39 FALSE this repeats until end    TRUE  39
7       g  35 FALSE this repeats until end    TRUE  35
9       i  28 FALSE this repeats until end   FALSE  28
our_df[our_df$age < 40, ]
  letters age   lgl           our_variable lgl_two add
1       a  25 FALSE this repeats until end   FALSE  25
2       b  27 FALSE this repeats until end   FALSE  27
3       c  34 FALSE this repeats until end   FALSE  34
4       d  39 FALSE this repeats until end    TRUE  39
7       g  35 FALSE this repeats until end    TRUE  35
9       i  28 FALSE this repeats until end   FALSE  28
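If you ever need the row positions behind a logical statement, which() gives them to you. The sketch below rebuilds the age column by hand from the output printed earlier (demo_df is a stand-in for our_df) and also shows how two conditions could be combined with &.

```r
# Rebuilt by hand from the printed our_df
demo_df <- data.frame(
  letters = letters[1:10],
  age = c(25, 27, 34, 39, 40, 45, 35, 43, 28, 48)
)

keep <- demo_df$age < 40
rows <- which(keep)  # positions where the statement is TRUE
rows

by_logical  <- demo_df[keep, ]  # subsetting with the logical vector
by_position <- demo_df[rows, ]  # subsetting with the positions

# Conditions can be combined with & (and) or | (or)
thirties <- demo_df[demo_df$age >= 30 & demo_df$age < 40, ]
thirties
```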
We can subset pretty much anything with logical vectors.
gender[gender == "female"]
lgl[lgl == TRUE]
Always think about the details!
gender == "female" # is a logical statement
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE
We could've written:
gender[c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)]
[1] female female female female
Levels: female male
But that's too long.
Let's move on to functions.
What are functions?
All at the same time!
For example, take the sd function (standard deviation).
class(x)
[1] "integer"
class(sd)
[1] "function"
x
[1] 8 3 1 4 6 7 5 9 10 2
sd
function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm))
<bytecode: 0x563a24b721c0>
<environment: namespace:stats>
sd(x) returns the standard deviation of a variable.
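To see that a function is just an object you can create yourself, here is a rough sketch of a hand-written standard deviation. The name my_sd is made up, and this is not how stats::sd is implemented (the printout above shows it relies on var()); it only applies the textbook formula.

```r
# A hypothetical hand-written version of the sample standard deviation
my_sd <- function(values, na.rm = FALSE) {
  if (na.rm) values <- values[!is.na(values)]  # optionally drop missing values
  n <- length(values)
  sqrt(sum((values - mean(values))^2) / (n - 1))
}

v <- c(8, 3, 1, 4, 6, 7, 5, 9, 10, 2)  # same numbers as the x printed above

class(my_sd)  # our own object of class "function"
my_sd(v)
sd(v)
all.equal(my_sd(v), sd(v))
```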
When you have questions about a function, type ?function_name.
x <- rnorm(100)
y <- x + rnorm(100, mean = 1, sd = 1)
Find out what ?rnorm does.
Use ?cor to calculate the correlation between x and y.
Set the method argument to be "spearman".
cor(x, y, method = "spearman")
[1] 0.7328173
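What does the method argument change? A quick sketch: Spearman's correlation is the default (Pearson) correlation computed on ranks. The seed below is an arbitrary choice to make the run repeatable, so the numbers will not match the output above.

```r
set.seed(2019)  # arbitrary seed, only for repeatability

x <- rnorm(100)
y <- x + rnorm(100, mean = 1, sd = 1)

pearson  <- cor(x, y)                       # the default method
spearman <- cor(x, y, method = "spearman")  # rank-based correlation

# Spearman is Pearson applied to the ranks of x and y
by_hand <- cor(rank(x), rank(y))

pearson
spearman
all.equal(spearman, by_hand)
```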