We all hear about Maximum Likelihood Estimation (MLE) and we often see hints of it in our model output. As usual, doing things manually gives a better grasp of how our models actually work. Here’s a very short example implementing MLE, based on the explanation in Gelman and Hill (2007), pages 404-405.
The likelihood measures how compatible our outcome variable Y is with our predictor X. We compute this measure of compatibility with the probability density function of the normal distribution. In R, `dnorm` returns this likelihood. The plot on this website gives a very clear intuition of what `dnorm` returns: it is literally the height of the density curve, or in other words, the likelihood. We of course want the highest likelihood, as it indicates greater compatibility.
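As a quick illustration (not from Gelman and Hill, just a sketch with arbitrary values), the height of the standard normal density is largest at its mean and shrinks as we move into the tails:
dnorm(0, mean = 0, sd = 1) # height at the mean: the peak of the curve
## [1] 0.3989423
dnorm(3, mean = 0, sd = 1) # height three standard deviations away: far less likely
## [1] 0.004431848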
For example, assuming `parameters` is a vector containing the intercept `a`, the coefficient `b` and an error term `sigma`, we can compute the log-likelihood for any random values of these three parameters:
loglikelihood <- function(parameters, predictor, outcome) {
  # intercept
  a <- parameters[1]
  # beta coefficient
  b <- parameters[2]
  # error term (residual standard deviation)
  sigma <- parameters[3]
  # Calculate the log-likelihood of each `outcome` value given `a + b * predictor`
  ll.vec <- dnorm(outcome, a + b * predictor, sigma, log = TRUE)
  # Sum the log-likelihoods over all observations in the data
  sum(ll.vec)
}
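A side note not in the original explanation: the function sums log densities (`log = TRUE`) rather than multiplying raw densities, because the product of many small densities underflows to zero in floating point, while the sum of their logs stays finite. A minimal sketch with deliberately poor, arbitrary parameter values (intercept 0, slope 0, sigma 3):
# Raw densities of each mpg value under this poor model are individually tiny,
# and their product underflows to exactly 0
prod(dnorm(mtcars$mpg, mean = 0, sd = 3))
# The sum of the log densities is a large negative number, but still finite
sum(dnorm(mtcars$mpg, mean = 0, sd = 3, log = TRUE))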
# Generate three random values for the intercept, beta and error term
inits <- runif(3)
# Calculate the log-likelihood given these three random values
loglikelihood(
  inits,
  predictor = mtcars$disp,
  outcome = mtcars$mpg
)
## [1] -11687.41
That’s the log-likelihood given the random values for the intercept, the coefficient and sigma. How do we find the combination of values that maximizes this log-likelihood? We can run a numerical optimization search, trying out different values for these three unknowns until we find the combination that returns the maximum:
mle <-
  optim(
    inits,                        # The three random starting values for intercept, beta and sigma
    loglikelihood,                # The log-likelihood function
    lower = c(-Inf, -Inf, 1.e-5), # Lower bounds: intercept and beta unbounded below, sigma kept at or above 1.e-5
    method = "L-BFGS-B",
    control = list(fnscale = -1), # This signals to search for the maximum rather than the minimum
    predictor = mtcars$disp,
    outcome = mtcars$mpg
  )
mle$par[1:2]
## [1] 29.59985346 -0.04121511
Let’s compare that to the result of `lm`:
coef(lm(mpg ~ disp, data = mtcars))
## (Intercept) disp
## 29.59985476 -0.04121512
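As an optional sanity check (not part of the original post), we can also compare the maximized log-likelihood that `optim` stores in `mle$value` with the log-likelihood R reports for the fitted model, and inspect the third parameter, which plays the role of sigma:
fit <- lm(mpg ~ disp, data = mtcars)
# Maximized log-likelihood found by optim()
mle$value
# Log-likelihood of the fitted linear model; this should be very close,
# since logLik() for lm objects is also evaluated at the maximum
# likelihood estimate of sigma
logLik(fit)
# The third optimized parameter is the estimated residual standard deviation
mle$par[3]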
In layman’s terms, MLE checks how compatible the observed data are with a given combination of coefficients. It repeats that check many times until it finds the combination of coefficients that maximizes the likelihood of the data.
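That "repeats many times" is quite literal: `optim` records how many times it evaluated the log-likelihood function in the `counts` element of its return value, and flags a successful search with a `convergence` code of 0:
# Number of times optim() evaluated the log-likelihood (and its gradient)
mle$counts
# 0 means the search converged successfully
mle$convergence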