Fitting the wrong model

PUBLISHED ON NOV 10, 2016

These exercises are from the book Data Analysis Using Regression and Multilevel/Hierarchical Models. I’ve really gotten into working through them, and posting the solutions gives me an excuse to keep going. This time I went back to chapter 8, which deals with simulation. I picked the first exercise (page 165, exercise 8.6.1), which says:

Fitting the wrong model: suppose you have 100 data points that arose from the following model: y = 3 + 0.1x1 + 0.5x2 + error, with errors having a t distribution with mean 0, scale 5, and 4 degrees of freedom. We shall explore the implications of fitting a standard linear regression to these data.

Part (a) of the exercise reads as follows:

  1. Simulate data from this model. For simplicity, suppose the values of x1 are simply the integers from 1 to 100, and that the values of x2 are random and equally likely to be 0 or 1. Fit a linear regression (with normal errors) to these data and see if the 68% confidence intervals for the regression coefficients (for each, the estimates ±1 standard error) cover the true values.

This is simple enough: simulate from a linear model, but make the error term t distributed with the characteristics given above. Here’s the code:

suppressWarnings(suppressMessages({
  library(arm)    # display()
  library(broom)  # tidy()
  library(hett)   # tlm()
}))

set.seed(2131)
x1 <- 1:100
x2 <- rbinom(100, 1, 0.5)
error1 <- rt(100, df=4)*sqrt(5 * (4-2)/4) + 0 # t distributed errors
                                              # with df 4, mean 0 and var 5

y = 3 + 0.1*x1 + 0.5*x2 + error1

display(lm(y ~ x1 + x2))
## lm(formula = y ~ x1 + x2)
##             coef.est coef.se
## (Intercept) 3.30     0.43   
## x1          0.10     0.01   
## x2          0.33     0.40   
## ---
## n = 100, k = 3
## residual sd = 1.96, R-Squared = 0.67
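
A quick note on that scaling factor before reading the results: a t variable with 4 degrees of freedom has variance 4/(4-2) = 2, so multiplying by sqrt(5 * (4-2)/4) = sqrt(5/2) yields errors with variance 5, which is how the code reads the exercise’s “scale 5”. A one-line check of mine (not in the original code; a large draw should have sample variance close to 5):

# Sample variance of a large scaled t draw; should come out near 5
var(rt(1e6, df = 4) * sqrt(5 * (4 - 2) / 4))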

It looks like the true slope of x1 (0.1) is contained in the 68% CI.

c(upper = 0.10 + (1 * 0.01), lower = 0.10 + (-1 * 0.01))
## upper lower 
##  0.11  0.09

For x2 the true value is also covered, but the uncertainty is high, making the CI very wide.

c(upper = 0.33 + (1 * 0.40), lower = 0.33 + (-1 * 0.40))
## upper lower 
##  0.73 -0.07
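
As an aside, the estimate ±1 SE intervals for all three coefficients can be pulled programmatically from the fitted model; a small convenience of mine, not in the original post:

# Build the 68% intervals from the coefficient matrix of summary(),
# whose first two columns hold the estimates and standard errors
fit <- lm(y ~ x1 + x2)
est <- coef(summary(fit))
cbind(lower = est[, 1] - est[, 2], upper = est[, 1] + est[, 2])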

  2. Put the above step in a loop and repeat 1000 times. Calculate the confidence coverage for the 68% intervals for each of the three coefficients in the model.

coefs <- matrix(NA, nrow = 3, ncol = 1000)
se <- matrix(NA, nrow = 3, ncol = 1000)

# Note: the estimates below will differ if you run only this chunk, even with
# a seed set earlier, because each iteration draws fresh random numbers and
# the results depend on the RNG state entering the loop.

for (i in 1:ncol(coefs)) {
  x1 <- 1:100
  x2 <- rbinom(100, 1, 0.5)
  error1 <- rt(100, df=4)*sqrt(5 * (4-2)/4) + 0 # t distributed errors
                                                # with df 4 and mean 0
  y = 3 + 0.1*x1 + 0.5*x2 + error1
  
  mod1 <- summary(lm(y ~ x1 + x2))
  est <- tidy(mod1)           # data frame with term, estimate, std.error, ...
  coefs[, i] <- est$estimate  # point estimates for the three coefficients
  se[, i] <- est$std.error    # their standard errors
}

repl_coef <- rowMeans(coefs)
repl_se <- rowMeans(se)

cbind(repl_coef + (-1 * repl_se), repl_coef + (1 * repl_se))
##            [,1]      [,2]
## [1,] 2.48215326 3.4808804
## [2,] 0.09255549 0.1079026
## [3,] 0.05919238 0.9495994

Going back to the first block of code, which contains the true parameters: the 68% interval for the intercept does contain 3, the interval for x1 does contain 0.10, and both are quite precise. The interval for x2 also contains 0.5, but the uncertainty is huge. What does this mean? The heavy-tailed errors produce some very large residuals, and because lm assumes normal errors, it can only absorb them by inflating the residual standard deviation, and with it the standard errors. The slope of x1 is barely affected because x1 ranges from 1 to 100, so even a noisy fit pins it down; the binary x2 has far less spread, and its coefficient soaks up most of the uncertainty. A regression that models the errors as t distributed should be more precise.
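
Strictly speaking, part (b) asks for the confidence coverage itself, i.e. the share of the 1000 intervals that contain each true value. With the coefs and se matrices still in the workspace, a minimal sketch of that calculation (my addition, not from the original post):

# Fraction of simulations in which estimate +/- 1 SE brackets the truth;
# the length-3 vector recycles down the columns, matching the matrix rows
truth <- c(3, 0.1, 0.5)
covered <- (coefs - se <= truth) & (truth <= coefs + se)
rowMeans(covered)

Well-calibrated 68% intervals would give coverage close to 0.68 for each coefficient.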

The last part of the exercise asks you to do exactly that: repeat the previous loop, but instead of lm use tlm from the hett package, which fits a linear regression with t-distributed errors, and compare the CIs and coefficients.
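
Before running the full loop, it may help to see what a single tlm fit returns; a minimal sketch of mine, using the same accessors as the loop below:

# One simulated dataset fit with t errors via hett::tlm; the location
# model's estimates are in column 1 and their standard errors in column 2
set.seed(2131)
x1 <- 1:100
x2 <- rbinom(100, 1, 0.5)
y <- 3 + 0.1 * x1 + 0.5 * x2 + rt(100, df = 4) * sqrt(5 * (4 - 2) / 4)
summary(tlm(y ~ x1 + x2))$loc.summary$coefficients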

  3. Repeat this simulation, but instead fit the model using t errors (see Exercise 6.6). The only change from the previous loop is the fitting function, tlm instead of lm; the error term was already t distributed.

coefs <- matrix(NA, nrow = 3, ncol = 1000)
se <- matrix(NA, nrow = 3, ncol = 1000)

for (i in 1:ncol(coefs)) {
  x1 <- 1:100
  x2 <- rbinom(100, 1, 0.5)
  error1 <- rt(100, df=4)*sqrt(5 * (4-2)/4) + 0 # t distributed errors
  y = 3 + 0.1*x1 + 0.5*x2 + error1
  
  mod1 <- summary(tlm(y ~ x1 + x2))
  est <- mod1$loc.summary$coefficients  # location model: estimates, SEs, ...
  coefs[, i] <- est[, 1]                # point estimates
  se[, i] <- est[, 2]                   # standard errors
}

repl_coef <- rowMeans(coefs)
repl_se <- rowMeans(se)

cbind(repl_coef + (-1 * repl_se), repl_coef + (1 * repl_se))
##            [,1]      [,2]
## [1,] 2.61212284 3.4265738
## [2,] 0.09335461 0.1058854
## [3,] 0.14971691 0.8769205

Accounting for the t-distributed errors (whose tails are much heavier than a normal’s), the intervals for the intercept and x1 are quite similar, just slightly narrower, while the interval for x2 is clearly narrower. It is still fairly wide, which reflects the genuine variance of the error term; but when that variance comes with heavier tails than a normal distribution can capture, modeling it explicitly can reduce the uncertainty in the estimates. Note also that if you rerun both simulations and compare repl_coef, the coefficients are practically the same: the choice of error model barely moves the point estimates; what it changes is the uncertainty we attach to them.
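
As a quick illustration of how much heavier those tails are (my check, not part of the exercise):

# Probability of falling more than 3 scale units below the mean:
# roughly 0.020 for a t with 4 df versus roughly 0.0013 for a normal
c(t4 = pt(-3, df = 4), normal = pnorm(-3))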

TAGS: GELMAN, R