`vignettes/pisa_example.Rmd`

`pisa_example.Rmd`

`perccalc`

is very flexible and can be used for any ordered categorical variable which has a theoretical order, such as education or occupation. In this case study we will calculate the percentile difference in mathematics test scores based on the education of the parents for several countries using the PISA 2006 and PISA 2012 datasets. Let’s load our packages:

`percalc`

automatically loads `pisa_2012`

and `pisa_2006`

which are two datasets with all the information that we need. These two datasets have data for Estonia, Germany and Spain and contain the five test scores in Mathematics and the father’s education in the international ISCED classification. The first thing we have to do is make sure the categories are ordered and calculate the average test score for each student.

```
order_edu <- c("None",
"ISCED 1",
"ISCED 2",
"ISCED 3A, ISCED 4",
"ISCED 3B, C",
"ISCED 5A, 6",
"ISCED 5B")
# Make ordered categories of our categorical variables and calculate avgerage
# math test scores for each year
pisa_2012 <-
pisa_2012 %>%
mutate(father_edu = factor(father_edu, levels = order_edu, ordered = TRUE))
pisa_2006 <-
pisa_2006 %>%
mutate(father_edu = factor(father_edu, levels = order_edu, ordered = TRUE))
# Merge them together
pisa <- rbind(pisa_2006, pisa_2012)
```

Once the categories are ordered, `perc_diff`

can calculate the percentile difference between the 90th and 10th percentile for the complete sample. For example:

```
perc_diff(data_model = pisa,
categorical_var = father_edu,
continuous_var = avg_math,
percentiles = c(90, 10))
#> difference se
#> 45.07312 14.18308
```

This means that the difference in Mathematics test scores between the 90th and 10th percentile of father education is 45 points with a standard error of 14 points. We can extrapolate this example for each country separately using `dplyr::group_by`

. Since the result of `perc_diff`

is a vector, to be able to work seamlessly with `dplyr::group_by`

we will use `perc_diff_df`

, which returns the same results but as a data frame:

With `df_wrap`

we can just pass that to `tidyr::nest`

and `dplyr::mutate`

and calculate it separetely by year and country:

```
cnt_diff <-
pisa %>%
nest(data = c(-year, -CNT)) %>%
mutate(edu_diff = lapply(data, function(x) perc_diff_df(x, father_edu, avg_math, percentiles = c(90, 10)))) %>%
select(-data) %>%
unnest(edu_diff)
cnt_diff
#> # A tibble: 6 x 4
#> year CNT difference se
#> <dbl> <chr> <dbl> <dbl>
#> 1 2006 Estonia 18.3 5.55
#> 2 2006 Germany 61.2 28.8
#> 3 2006 Spain 52.6 15.1
#> 4 2012 Germany 45.0 27.4
#> 5 2012 Spain 40.2 18.2
#> 6 2012 Estonia 17.0 9.88
```

We can see some big differences between, for example, Estonia and Spain. But even more interesting is plotting this:

```
cnt_diff %>%
ggplot(aes(year, difference, group = CNT, color = CNT)) +
geom_point() +
geom_line() +
theme_minimal() +
scale_y_continuous(name = "Achievement gap in Math between the 90th and \n 10th percentile of father's education") +
scale_x_continuous(name = "Year")
```

It looks like Estonia has a much smaller achievement gap relative to Spain and Germany but also note that both Germany and Spain have been decreasing their inequality. We can also try different achievement gaps to to explore the distribution:

```
# Calculate the gap for the 90/50 gap
cnt_half <-
pisa %>%
nest(data = c(-year, -CNT)) %>%
mutate(edu_diff = lapply(data,
function(x) perc_diff_df(x, father_edu, avg_math, percentiles = c(90, 50)))) %>%
select(-data) %>%
unnest(edu_diff)
# Calculate the gap for the 50/10 gap
cnt_bottom <-
pisa %>%
nest(data = c(-year, -CNT)) %>%
mutate(edu_diff = lapply(data,
function(x) perc_diff_df(x, father_edu, avg_math, percentiles = c(50, 10)))) %>%
select(-data) %>%
unnest(edu_diff)
cnt_diff$type <- "90/10"
cnt_half$type <- "90/50"
cnt_bottom$type <- "50/10"
final_cnt <- rbind(cnt_diff, cnt_half, cnt_bottom)
final_cnt$type <- factor(final_cnt$type, levels = c("90/10", "90/50", "50/10"))
final_cnt
#> # A tibble: 18 x 5
#> year CNT difference se type
#> <dbl> <chr> <dbl> <dbl> <fct>
#> 1 2006 Estonia 18.3 5.55 90/10
#> 2 2006 Germany 61.2 28.8 90/10
#> 3 2006 Spain 52.6 15.1 90/10
#> 4 2012 Germany 45.0 27.4 90/10
#> 5 2012 Spain 40.2 18.2 90/10
#> 6 2012 Estonia 17.0 9.88 90/10
#> 7 2006 Estonia -12.8 5.51 90/50
#> 8 2006 Germany 1.98 27.6 90/50
#> 9 2006 Spain 10.2 14.2 90/50
#> 10 2012 Germany -17.5 23.5 90/50
#> 11 2012 Spain -3.38 17.1 90/50
#> 12 2012 Estonia -17.0 9.34 90/50
#> 13 2006 Estonia 31.2 5.55 50/10
#> 14 2006 Germany 59.2 26.6 50/10
#> 15 2006 Spain 42.4 13.9 50/10
#> 16 2012 Germany 62.4 26.1 50/10
#> 17 2012 Spain 43.6 17.1 50/10
#> 18 2012 Estonia 34.0 9.91 50/10
```

Having this dataframe we can visualize all the three gaps in a very intuitive fashion:

```
final_cnt %>%
ggplot(aes(year, difference, group = CNT, color = CNT)) +
geom_point() +
geom_line() +
theme_minimal() +
scale_y_continuous(name = "Achievement gap in Math between the 90th and \n 10th percentile of father's education") +
scale_x_continuous(name = "Year") +
facet_wrap(~ type)
```

It seems that the `90/50`

and `50/10`

differences are not symmetrical.

`percalc`

also has a `perc_dist`

function which calculates the distribution of the percentiles, so we can compare more finegrained percentiles rather than differences:

```
perc_dist(pisa, father_edu, avg_math)
#> # A tibble: 100 x 3
#> percentile estimate std.error
#> <int> <dbl> <dbl>
#> 1 1 0.853 1.64
#> 2 2 1.74 3.20
#> 3 3 2.65 4.69
#> 4 4 3.60 6.10
#> 5 5 4.57 7.45
#> 6 6 5.57 8.72
#> 7 7 6.59 9.93
#> 8 8 7.64 11.1
#> 9 9 8.71 12.2
#> 10 10 9.80 13.2
#> # … with 90 more rows
```

Here we get the complete percentile distribution with the test score in mathematics for each percentile. This can be easily scaled to all country/year combinations with our previous code:

```
cnt_dist <-
pisa %>%
nest(data = c(-year, -CNT)) %>%
mutate(edu_diff = lapply(data, function(x) perc_dist(x, father_edu, avg_math))) %>%
select(-data) %>%
unnest(edu_diff)
cnt_dist
#> # A tibble: 600 x 5
#> year CNT percentile estimate std.error
#> <dbl> <chr> <int> <dbl> <dbl>
#> 1 2006 Estonia 1 0.647 0.388
#> 2 2006 Estonia 2 1.31 0.762
#> 3 2006 Estonia 3 2.00 1.12
#> 4 2006 Estonia 4 2.71 1.47
#> 5 2006 Estonia 5 3.43 1.81
#> 6 2006 Estonia 6 4.17 2.13
#> 7 2006 Estonia 7 4.93 2.44
#> 8 2006 Estonia 8 5.70 2.74
#> 9 2006 Estonia 9 6.48 3.02
#> 10 2006 Estonia 10 7.28 3.30
#> # … with 590 more rows
```

Let’s limit the distribution only to the 10th, 20th, 30th… 100th percentile an compare for country/years:

```
cnt_dist %>%
mutate(year = as.character(year)) %>%
filter(percentile %in% seq(0, 100, by = 10)) %>%
ggplot(aes(percentile, estimate, color = year, group = percentile)) +
geom_point() +
geom_line(color = "black") +
scale_y_continuous(name = "Math test score") +
scale_x_continuous(name = "Percentiles from father's education") +
scale_color_discrete(name = "Year") +
facet_wrap(~ CNT) +
theme_minimal()
```

Here the red dots indicate the year 2006 and the bluish dots year 2012, the black line between them indicates the change over time. Here we can see that although Germany and Spain are decreasing (as we saw in the plot before), the composition of the change is very different: Spain’s decrease is big all around the distribution whereas Germany’s concentrate on the top percentiles.

This type of analysis serves well to disentangle and decompose distributions of ordered categorical. This vignette looked to show the power of this techniqe and how it can be used in practically any setting with an ordered categorical variable and a continuous variable.

If you use this in a publication, remember to cite the software as:

```
#>
#> To cite perccalc in publications use:
#>
#> Cimentada, (2019). perccalc: An R package for estimating percentiles
#> from categorical variables. Journal of Open Source Software, 4(44),
#> 1796, https://doi.org/10.21105/joss.01796
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> doi = {10.21105/joss.01796},
#> url = {https://doi.org/10.21105/joss.01796},
#> year = {2019},
#> publisher = {The Open Journal},
#> volume = {4},
#> number = {44},
#> pages = {1796},
#> author = {Jorge Cimentada},
#> title = {perccalc: An R package for estimating percentiles from categorical variables},
#> journal = {Journal of Open Source Software},
#> }
```