Chapter 7 Linear Mixed Models

Example 7.1 (Dependent Samples on the Mean) Consider inference on a population’s mean. Supposdly, more observations imply more infotmation on the mean. This, however, is not the case if samples are completely dependant. More observations do not add any new information. From this example one may think that dependence is a bad thing. This is a false intuitiont: negative correlations imply oscilations about the mean, so they are actually more informative on the mean than independent observations.
Example 7.2 (Repeated Measures) Consider a prospective study, i.e., data that originates from selecting a set of subjects and making measurements on them over time. Also assume that some subjects received some treatment, and other did not. When we want to infer on the population from which these subjects have been sampled, we need to recall that some series of observations came from the same subject. If we were to ignore the subject of origin, and treat each observation as an independent sample point, we will think we have more information in our data than we actually do. For a rough intuition, think of a case where observatiosn within subject are perfectly dependent.

The sources of variability, i.e. noise, are known in the statistical literature as “random effects”. Specifying these sources determines the correlation structure in our measurements. In the simplest linear models of Chapter 5, we thought of the variability as a measurement error, independent of anything else. This, however, is rarely the case when time or space are involved.

The variability in our data is rarely the object of interest. It is merely the source of uncertainty in our measurements. The effects we want to infer on are assumingly non-random, thus known as “fixed-effects”. A model which has several sources of variability, i.e. random-effects, and several deterministic effects to study, i.e. fixed-effects, is known as a “mixed effects” model. Here are some examples of such models.

Example 7.3 (Fixed and Random Machine Effect) Consider the problem of testing for a change in the distribution of diamteters of manufactured bottle caps. We want to study the (fixed) effect of time: before versus after. Bottle caps are produced by several machines. Clearly there is variablity in the diameters within-machine and between-machines. Given many measurements on many bottle caps from many machines, we could standardize measurements by removing each machine’s average. This implies the within-machine variability is the only source of variability we care about, because the substration of the machine effect, removed information on the between-machine variability.
Alternatively, we could treat the between-machine variability as another source of noise/uncertainty when inferring on the temporal fixed effect.
Example 7.4 (Fixed and Random Subject Effect) Consider an experimenal design where each subject is given 2 types of diets, and his health condition is recorded. We could standardize over subjects by removing the subject-wise average, before comparing diets. This is what a paired t-test does. This also implies the within-subject variability is the only source of variability we care about. Alternatively, for inference on the population of “all subjects” we need to adress the between-subject variability, and not only the within-subject variability.

The unifying theme of the above examples, is that the variability in our data has several sources. Which are the sources of variability that need to concern us? This is a delicate matter which depends on your goals. As a rule of thumb, we will suggest the following view: If information of an effect will be available at the time of prediction, treat it as a fixed effect. If it is not, treat it as a random-effect.

Mixed models are so fundamental, that they have earned many names:

  • Mixed Effects: Because we may have both fixed effects we want to estimate and remove, and random effects which contribute to the variability to infer against.

  • Variance Components: Because as the examples show, variance has more than a single source (like in the Linear Models of Chapter 5).

  • Hirarchial Models: Because as Example 7.4 demonstrates, we can think of the sampling as hierarchical– first sample a subject, and then sample its response.

  • Multilevel Analysis: For the same reasons it is also known as Hierarchical Models.

  • Repeated Measures: Because we make several measurements from each unit, like in Example 7.4.

  • Longitudinal Data: Because we follow units over time, like in Example 7.4.

  • Panel Data: Is the term typically used in econometric for such longitudinal data.

  • MANOVA: Many of the problems that may be solved with a multivariate analysis of variance (MANOVA), may be solved with a mixed model for reasons we detail in 8.

  • Structured Prediction: In the machine learning literature, predicting outcomes with structure, such as correlated vectors, is known as Structured Learning. Because mixed-models merely specify correlations, using a mixed-model for making predictions may be thought of as an instance of structured prediction.

Whether we are aiming to infer on a generative model’s parameters, or to make predictions, there is no “right” nor “wrong” approach. Instead, there is always some implied measure of error, and an algorithm may be good, or bad, with respect to this measure (think of false and true positives, for instance). This is why we care about dependencies in the data: ignoring the dependence structure will probably yield inefficient algorithms. Put differently, if we ignore the statistical dependence in the data we will probably me making more errors than possible/optimal.

We now emphasize:

  1. Like in previous chapters, by “model” we refer to the assumed generative distribution, i.e., the sampling distribution.

  2. Mixed effect models are a way to infer against the right level of variability. Using a naive linear model (which assumes a single source of variability) instead of a mixed effects model, probably means your inference is overly anti-conservative. Put differently, the uncertainty in your estimates is higher than the linear model from Chapter 5 may suggest.

  3. In a mixed effect model we will specify the dependence structure via the hierarchy in the sampling scheme (e.g. caps within machine, students within class, etc.). Not all dependency models can be specified in this way. Dependency structures that are not hierarchical include temporal dependencies (AR, ARIMA, ARCH and GARCH), spatial, Markov Chains, and more. To specify dependency structures that are no hierarchical, see Chapter 8 in (the excellent) Weiss (2005).

  4. If you are using the model merely for predictions, and not for inference on the fixed effects or variance components, then stating the generative distribution may be be useful, but not necessarily. See the Supervised Learning Chapter 9 for more on prediction problems. Also recall that machine learning from non-independent observations (such as mixed models) is a delicate matter that is rarely treated in the literature.

7.1 Problem Setup

\[\begin{align} y|x,u = x'\beta + z'u + \varepsilon \tag{7.1} \end{align}\]

where \(x\) are the factors with fixed effects, \(\beta\), which we may want to study. The factors \(z\), with effects \(u\), are the random effects which contribute to variability. In our repeated measures example (7.2) the treatment is a fixed effect, and the subject is a random effect. In our bottle-caps example (7.3) the time (before vs. after) is a fixed effect, and the machines may be either a fixed or a random effect (depending on the purpose of inference). In our diet example (7.4) the diet is the fixed effect and the family is a random effect.

Notice that we state \(y|x,z\) merely as a convenient way to do inference on \(y|x\), instead of directly specifying \(Var[y|x]\). This is exactly the power of mixed-effect models: we specify the covariance not via the matrix \(Var[y,z]\), but rather via the sampling hierarchy.

Given a sample of \(n\) observations \((y_i,x_i,z_i)\) from model (7.1), we will want to estimate \((\beta,u)\). Under some assumption on the distribution of \(\varepsilon\) and \(z\), we can use maximum likelihood (ML). In the context of mixed-models, however, ML is typically replaced with restricted maximum likelihood (ReML), because it returns unbiased estimates of \(Var[y|x]\) and ML does not.

7.1.1 Non-Linear Mixed Models

The idea of random-effects can also be implemented for non-linear mean models. Formally, this means that \(y|x,z=f(x,z,\varepsilon)\) for some non-linear \(f\). This is known as non-linead-mixed-models, which will not be discussed in this text.

7.1.2 Generalized Linear Mixed Models (GLMM)

You can marry the ideas of random effects, with non-linear link functions, and non-Gaussian distribution of the response. These are known as Generalized Linear Mixed Models. Wikidot has a nice comparison of several software suits for GLMMs.

7.2 Mixed Models with R

We will fit mixed models with the lmer function from the lme4 package, written by the mixed-models Guru Douglas Bates. We start with a small simulation demonstrating the importance of acknowledging your sources of variability. Our demonstration consists of fitting a linear model that assumes independence, when data is clearly dependent.

# Simulation parameters
n.groups <- 4 # number of groups
n.repeats <- 2 # sample per group
groups <- rep(1:n.groups, each=n.repeats) %>% as.factor
n <- length(groups)
z0 <- rnorm(n.groups,0,10) # generate group effects
(z <- z0[as.numeric(groups)]) # generate and inspect random group effects
## [1]  6.960992  6.960992  2.380404  2.380404 -8.095683 -8.095683 -3.569163
## [8] -3.569163
epsilon <- rnorm(n,0,1) # generate measurement error

# Generate data
beta0 <- 2 # set global mean
y <- beta0 + z + epsilon # generate synthetic sample

We can now fit the linear and mixed models.

lm.5 <- lm(y~z)  # fit a linear model assuming independence
lme.5 <- lmer(y~1|groups) # fit a mixed-model that deals with the group dependence

The summary of the linear model

summary.lm.5 <- summary(lm.5)
## Call:
## lm(formula = y ~ z)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2717 -0.4342  0.2110  0.4527  1.1885 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.24983    0.32663   6.888 0.000462 ***
## z            0.89810    0.05677  15.819 4.05e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.9191 on 6 degrees of freedom
## Multiple R-squared:  0.9766, Adjusted R-squared:  0.9727 
## F-statistic: 250.2 on 1 and 6 DF,  p-value: 4.048e-06

The summary of the mixed-model

summary.lme.5 <- summary(lme.5)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ 1 | groups
## REML criterion at convergence: 34.3
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.04069 -0.57522 -0.03426  0.66816  0.91364 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  groups   (Intercept) 35.0431  5.9197  
##  Residual              0.8879  0.9423  
## Number of obs: 8, groups:  groups, 4
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)    1.728      2.979    0.58

Look at the standard error of the global mean, i.e., the intercept: for lm it is 0.326633, and for lme it is 2.9785508. Why this difference? Because lm treats the group effect15 as a fixed while the mixed model treats the group effect as a source of noise/uncertainty. Clearly, inference using lm underestimates our uncertainty in the estimated population mean (\(\beta_0\)).

Now let’s adopt the paired t-test view, which removes the group mean, so that it implicitly ignores the between-group variability. Which is the model compatible with this view?

diffs <- tapply(y, groups, diff) 
diffs # Q:what is this estimating? A: epsilon+epsilon.
##          1          2          3          4 
## -1.5646333  1.0593082  1.7856037  0.5873133
sd(diffs) # 
## [1] 1.441244

So we see that a paired t-test infers only against the within-group variability. Q:Is this a good think? A: depends…

7.2.1 A Single Random Effect

We will use the Dyestuff data from the lme4 package, which encodes the yield, in grams, of a coloring solution (dyestuff), produced in 6 batches using 5 different preparations.

data(Dyestuff, package='lme4')
##   Batch Yield
## 1     A  1545
## 2     A  1440
## 3     A  1440
## 4     A  1520
## 5     A  1580
## 6     B  1540

And visually


If we want to do inference on the (global) mean yield, we need to account for the two sources of variability: the within-batch variability, and the between-batch variability We thus fit a mixed model, with an intercept and random batch effect.

lme.1<- lmer( Yield ~ 1  | Batch  , Dyestuff )
## Linear mixed model fit by REML ['lmerMod']
## Formula: Yield ~ 1 | Batch
##    Data: Dyestuff
## REML criterion at convergence: 319.7
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.4117 -0.7634  0.1418  0.7792  1.8296 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Batch    (Intercept) 1764     42.00   
##  Residual             2451     49.51   
## Number of obs: 30, groups:  Batch, 6
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1527.50      19.38    78.8

Things to note:

  • The syntax Yield ~ 1 | Batch tells R to fit a model with a global intercept (1) and a random Batch effect (|Batch). More on that later.
  • As usual, summary is content aware and has a different behavior for lme class objects.
  • The output distinguishes between random effects (\(u\)), a source of variability, and fixed effect (\(\beta\)), which we want to study. The mean of the random effect is not reported because it is unassumingly 0.
  • Were we not interested in the variance components, and only in the coefficients or predictions, an (almost) equivalent lm formulation is lm(Yield ~ Batch).

Some utility functions let us query the lme object. The function coef will work, but will return a cumbersome output. Better use fixef to extract the fixed effects, and ranef to extract the random effects. The model matrix (of the fixed effects alone), can be extracted with model.matrix, and predictions made with predict. Note, however, that predictions with mixed-effect models are better treated as prediction problems as in the Supervised Learning Chapter 9, but are a very delicate matter.


7.2.2 Multiple Random Effects

Let’s make things more interesting by allowing more than one random effect. One-way ANOVA can be thought of as the fixed-effects counterpart of the single random effect.

In the Penicillin data, we measured the diameter of spread of an organism, along the plate used (a to x), and penicillin type (A to F). We will now try to infer on the diameter of typical organism, and compute its variability over plates and Penicillin types.

##   diameter plate sample
## 1       27     a      A
## 2       23     a      B
## 3       26     a      C
## 4       23     a      D
## 5       23     a      E
## 6       21     a      F

One sample per combination:

table(sample, plate) # how many observations per plate & type?
##       plate
## sample a b c d e f g h i j k l m n o p q r s t u v w x
##      A 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##      B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##      C 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##      D 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##      E 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##      F 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

And visually:

Let’s fit a mixed-effects model with a random plate effect, and a random sample effect:

lme.2 <- lmer ( diameter ~  1  + (1|plate )+(1|sample) , Penicillin )
fixef(lme.2) # Fixed effects
## (Intercept) 
##    22.97222
ranef(lme.2) # Random effects
## $plate
##   (Intercept)
## a  0.80454704
## b  0.80454704
## c  0.18167191
## d  0.33739069
## e  0.02595313
## f -0.44120322
## g -1.37551591
## h  0.80454704
## i -0.75264078
## j -0.75264078
## k  0.96026582
## l  0.49310948
## m  1.42742217
## n  0.49310948
## o  0.96026582
## p  0.02595313
## q -0.28548443
## r -0.28548443
## s -1.37551591
## t  0.96026582
## u -0.90835956
## v -0.28548443
## w -0.59692200
## x -1.21979713
## $sample
##   (Intercept)
## A  2.18705797
## B -1.01047615
## C  1.93789946
## D -0.09689497
## E -0.01384214
## F -3.00374417

Things to note:

  • The syntax 1+ (1| plate ) + (1| sample ) fits a global intercept (mean), a random plate effect, and a random sample effect.
  • Were we not interested in the variance components, an (almost) equivalent lm formulation is lm(diameter ~ plate + sample).
  • The output of ranef is somewhat controversial. Think about it: Why would we want to plot the estimates of a random variable?

Since we have two random effects, we may compute the variability of the global mean (the only fixed effect) as we did before. Perhaps more interestingly, we can compute the variability in the response, for a particular plate or sample type.

random.effect.lme2 <- ranef(lme.2, condVar = TRUE) 
qrr2 <- lattice::dotplot(random.effect.lme2, strip = FALSE)

Variability in response for each plate, over various sample types:


Variability in response for each sample type, over the various plates:


Things to note:

  • The condVar argument of the ranef function tells R to compute the variability in response conditional on each random effect at a time.
  • The dotplot function, from the lattice package, is only there for the fancy plotting.

We used the penicillin exampe to demonstrate the incoporation of two random-effects. We could have, however, compared between penicillin types. For this matter, penicillin types are fixed effects to infer on, and not part of the uncertainty in the mean diameter. The appropriate model is the following:

lme.2.2 <- lmer( diameter ~  1  + sample + (1|plate) , Penicillin )

I may now ask myself: does the sample, i.e. penicillin, have any effect? This is what the ANOVA table typically gives us. The next table can be thought of as a “repeated measures ANOVA”:

## Analysis of Variance Table
##        Df Sum Sq Mean Sq F value
## sample  5 449.22  89.844  297.09

Uhh! No p-values. Why is this? Because Doug Bates, the author of lme4 makes a strong argument against current methods of computing p-values in mixed models. If you insist on an p-value, you may recur to other packages that provide that, at your own caution:

## Analysis of Deviance Table (Type II Wald chisquare tests)
## Response: diameter
##         Chisq Df Pr(>Chisq)    
## sample 1485.4  5  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

… and yes; the penicillin type has a significant effect on the diamter.

7.2.3 A Full Mixed-Model

In the sleepstudy data, we recorded the reaction times to a series of tests (Reaction), after various subject (Subject) underwent various amounts of sleep deprivation (Day).

We now want to estimate the (fixed) effect of the days of sleep deprivation on response time, while allowing each subject to have his/hers own effect. Put differently, we want to estimate a random slope for the effect of day. The fixed Days effect can be thought of as the average slope over subjects.

lme.3 <- lmer ( Reaction ~ Days + ( Days | Subject ) , data= sleepstudy )

Things to note:

  • ~Days specifies the fixed effect.
  • We used the Days|Subect syntax to tell R we want to fit the model ~Days within each subject.
  • Were we fitting the model for purposes of prediction only, an (almost) equivalent lm formulation is lm(Reaction~Days*Subject).

The fixed day effect is:

## (Intercept)        Days 
##   251.40510    10.46729

The variability in the average response (intercept) and day effect is

## $Subject
##     (Intercept)        Days
## 308   2.2585654   9.1989719
## 309 -40.3985770  -8.6197032
## 310 -38.9602459  -5.4488799
## 330  23.6904985  -4.8143313
## 331  22.2602027  -3.0698946
## 332   9.0395259  -0.2721707
## 333  16.8404312  -0.2236244
## 334  -7.2325792   1.0745761
## 335  -0.3336959 -10.7521591
## 337  34.8903509   8.6282839
## 349 -25.2101104   1.1734143
## 350 -13.0699567   6.6142050
## 351   4.5778352  -3.0152572
## 352  20.8635925   3.5360133
## 369   3.2754530   0.8722166
## 370 -25.6128694   4.8224646
## 371   0.8070397  -0.9881551
## 372  12.3145394   1.2840297

Did we really need the whole lme machinery to fit a within-subject linear regression and then average over subjects? The answer is yes. The assumptions on the distribution of random effect, namely, that they are normally distributed, allows us to pool information from one subject to another. In the words of John Tukey: “we borrow strength over subjects”. Is this a good thing? If the normality assumption is true, it certainly is. If, on the other hand, you have a lot of samples per subject, and you don’t need to “borrow strength” from one subject to another, you can simply fit within-subject linear models without the mixed-models machinery.

To demonstrate the “strength borrowing”, here is a comparison of the lme, versus the effects of fitting a linear model to each subject separately.

Here is a comparison of the random-day effect from lme versus a subject-wise linear model. They are not the same.


7.3 Serial Correlations

TODO: nlme package.

7.4 Relation to MANOVA

Multivariate analysis of variance (MANOVA) deals with the estimation of effect on vector valued outcomes. Put differently: in ANOVA the response, \(y\), is univariate. In MANOVA, the outcome is multivariate. MANOVA is useful when there are correlations among the entries of \(y\). Otherwise- one may simply solve many ANOVA problems, instead of a single MANOVA.

Now assume that the outcome of a MANOVA is measurements of an individual at several time periods. The measurements are clearly correlated, so that MANOVA may be useful. But one may also treat the subject as a random effect, with a univariate response. We thus see that this seemingly MANOVA problem can be solved with the mixed models framework.

What MANOVA problems cannot be solved with mixed models? There may be cases where the covariance of the multivariate outcome, \(y\), is very complicated. If the covariance in \(y\) may not be stated using a combination of random and fixed effects, then the covariance has to be stated explicitly in the MANOVA framework. It is also possible to consider mixed-models with multivariate outcomes, i.e., a mixed MANOVA, or hirarchial MANOVA. The R functions we present herein permit this. MANOVA in R

[TODO: nlme:::lme()]

7.5 The Variance-Components View


7.6 Bibliographic Notes

Most of the examples in this chapter are from the documentation of the lme4 package (Bates et al. 2015). For a general and very applied treatment, see Pinero and Bates (2000). As usual, a hands on view can be found in Venables and Ripley (2013), and also in an excellent blog post by Kristoffer Magnusson For a more theoretical view see Weiss (2005) or Searle, Casella, and McCulloch (2009). Sometimes it is unclear if an effect is random or fixed; on the difference between the two types of inference see Rosset and Tibshirani (2018) and references therein. For more on predictions in linear mixed models see Robinson (1991), Rabinowicz and Rosset (2018), and references therein.

7.7 Practice Yourself

  1. Computing the variance of the sample mean given dependent correlations. How does it depend on the covariance between observations? When is the sample most informative on the population mean?

  2. Return to the Penicillin data set. Instead of fitting an LME model, fit an LM model with lm. I.e., treat all random effects as fixed.
    1. Compare the effect estimates.
    2. Compare the standard errors.
    3. Compare the predictions of the two models.
  3. [Very Advanced!] Return to the Penicillin data and use the gls function to fit a generalized linear model, equivalent to the LME model in our text.
  4. Read about the “oats” dataset using ? MASS::oats.Inspect the dependency of the yield (Y) in the Varieties (V) and the Nitrogen treatment (N).
    1. Fit a linear model, does the effect of the treatment significant? The interaction between the Varieties and Nitrogen is significant?
    2. An expert told you that could be a variance between the different blocks (B) which can bias the analysis. fit a LMM for the data.
    3. Do you think the blocks should be taken into account as “random effect” or “fixed effect”?


Weiss, Robert E. 2005. Modeling Longitudinal Data. Springer Science & Business Media.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. doi:10.18637/jss.v067.i01.

Pinero, Jose, and Douglas Bates. 2000. “Mixed-Effects Models in S and S-Plus (Statistics and Computing).” Springer, New York.

Venables, William N, and Brian D Ripley. 2013. Modern Applied Statistics with S-Plus. Springer Science & Business Media.

Searle, Shayle R, George Casella, and Charles E McCulloch. 2009. Variance Components. Vol. 391. John Wiley & Sons.

Rosset, Saharon, and Ryan J Tibshirani. 2018. “From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation.” Journal of the American Statistical Association, no. just-accepted. Taylor & Francis.

Robinson, George K. 1991. “That Blup Is a Good Thing: The Estimation of Random Effects.” Statistical Science. JSTOR, 15–32.

Rabinowicz, Assaf, and Saharon Rosset. 2018. “Assessing Prediction Error at Interpolation and Extrapolation Points.” arXiv Preprint arXiv:1802.00996.

  1. A.k.a. the cluster effect.