4 Heteroskedasticity and Grouped Data (Random Effects)
In the last chapter, we have discussed how to set up a basic lm, including the selection of predictors according to the purpose of the modelling (prediction, causal analysis). Remember in particular that for a causal analysis, which I consider the standard case in the scineces, first, think about the problem and your question an decide on a base structure. Ideally, you do this by:
- Writing down your scientific questions (e.g. Ozone ~ Wind)
- Then add confounders / mediators if needed.
- Remember to make a difference between variables controlled for confounding, and other confounders (which are typically not controlled for confounding). We may have to use some model selection, but in fact with a good analysis plan this is rarely necessary for a causal analysis.
After having arrived at such a base structure, we will have to check if the model is appropriate for the analysis. Yesterday, we already discussed about residual checks and we discussed that the 4 standard residual plots check for 4 different problems.
- Residuals vs Fitted = Functional relationship.
- Normal Q-Q = Normality of residuals.
- Scale - Location = Variance homogeneity.
- Residuals vs Leverage = Should we worry about certain outliers?
Here an example for a linear regression of Ozone against Wind:
= lm(Ozone ~ Temp , data = airquality)
fit plot(Ozone ~ Temp, data = airquality)
abline(fit)
par(mfrow = c(2, 2))
plot(fit)
The usual strategy now is to
- First get the functional relationship right, so that the model correctly describe the mean
- Then adjust the model assumptions regardigng distribution, variance and outliers.
We will go through these steps now, and on the way also learn how to deal with heteroskedasticity, outliers, weird distributions and grouped data (random or mixed models).
4.1 Adjusting the Functional Form
In the residual ~ fitted plot above, we can clearly see a pattern, which means that our model has a systematic misfit. Note that in a multiple regression, you should also check res ~ predictor for all predictors, because patterns of misfit often show up more clearly when plotted against the single predictors.
What should we do if we see a pattern? Here a few strategies that you might want to consider:
4.1.1 Changing the regression formular
The easiest strategy is to add complexity to the polynomial, e.g. quadratic terms, interactions etc.
library(effects)
= lm(Ozone ~ Wind * Temp + I(Wind^2) + I(Temp^2), data = airquality)
fit
plot(allEffects(fit, partial.residuals = T), selection = 1)
plot(allEffects(fit, partial.residuals = T), selection = 2)
plot(allEffects(fit, partial.residuals = T), selection = 3)
and see if the residuals are getting better. To avoid doing this totally randomly, it may be useful to plot residuals against individual predictors by hand!
4.1.2 Generalized additive models (GAMs)
Another options are GAMs = Generalized Additive Models. The idea is to fit a smooth function to data, to automatically find the “right” functional form. The smoothness of the function is automatically optimized.
library(mgcv)
= gam(Ozone ~ s(Wind) + s(Temp) + s(Solar.R) , data = airquality)
fit summary(fit)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Ozone ~ s(Wind) + s(Temp) + s(Solar.R)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.099 1.663 25.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Wind) 2.910 3.657 13.695 < 2e-16 ***
## s(Temp) 3.833 4.753 11.613 < 2e-16 ***
## s(Solar.R) 2.760 3.447 3.967 0.00858 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.723 Deviance explained = 74.7%
## GCV = 338.9 Scale est. = 306.83 n = 111
# allEffects doesn't work here.
plot(fit, pages = 0, residuals = T, pch = 20, lwd = 1.8, cex = 0.7,
col = c("black", rep("red", length(fit$residuals))))
AIC(fit)
## [1] 962.596
Comparison to normal lm()
:
= lm(Ozone ~ Wind + Temp + Solar.R , data = airquality)
fit AIC(fit)
## [1] 998.7171
Spline interaction is called a tensor spline:
= gam(Ozone ~ te(Wind, Temp) + s(Solar.R) , data = airquality)
fit summary(fit)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Ozone ~ te(Wind, Temp) + s(Solar.R)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.099 1.403 30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## te(Wind,Temp) 13.176 14.87 22.151 <2e-16 ***
## s(Solar.R) 2.822 3.52 3.177 0.0237 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.803 Deviance explained = 83.1%
## GCV = 258.06 Scale est. = 218.54 n = 111
plot(fit, pages = 0, residuals = T, pch = 20, lwd = 1.9, cex = 0.4)
AIC(fit)
## [1] 930.5047
GAMs are particularly useful for confounders. If you have confounders, you usually don’t care that the fitted relationship is a bit hard to interpret, you just want the confounder effect to be removed. So, if you want to fit the causal relationship between Ozone ~ Wind, account for the other variables, a good strategy might be:
= gam(Ozone ~ Wind + s(Temp) + s(Solar.R) , data = airquality)
fit summary(fit)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Ozone ~ Wind + s(Temp) + s(Solar.R)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.2181 6.3166 11.433 < 2e-16 ***
## Wind -3.0302 0.6082 -4.982 2.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Temp) 3.358 4.184 14.972 <2e-16 ***
## s(Solar.R) 2.843 3.551 3.721 0.0115 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.664 Deviance explained = 68.6%
## GCV = 402.11 Scale est. = 372.4 n = 111
In this way, you still get a nicely interpretable linear effect for Wind, but you don’t have to worry about the functional form of the other predictors.
4.2 Modelling Variance Terms
After we have fixed the functional form, we want to look at the distribution of the residuals. We said yesterday that you can try to get them more normal by applying an appropriate transformation, e.g. the logarithm or square root. Without transformation, we often find that data shows heteroskedasticity, i.e. the residual variance changes with some predictor or the mean estimate (see also Scale - Location plot). Maybe your experimental data looks like this:
set.seed(125)
= data.frame(treatment = factor(rep(c("A", "B", "C"), each = 15)))
data $observation = c(7, 2 ,4)[as.numeric(data$treatment)] +
datarnorm( length(data$treatment), sd = as.numeric(data$treatment)^2 )
boxplot(observation ~ treatment, data = data)
Especially p-values and confidence intervals of lm()
and ANOVA can react quite strongly to such differences in residual variation. So, running a standard lm()
/ ANOVA on this data is not a good idea - in this case, we see that all regression effects are not significant, as is the ANOVA, suggesting that there is no difference between groups.
= lm(observation ~ treatment, data = data)
fit summary(fit)
##
## Call:
## lm(formula = observation ~ treatment, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2897 -1.0514 0.3531 2.4465 19.8602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.043 1.731 4.069 0.000204 ***
## treatmentB -3.925 2.448 -1.603 0.116338
## treatmentC -1.302 2.448 -0.532 0.597601
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.704 on 42 degrees of freedom
## Multiple R-squared: 0.05973, Adjusted R-squared: 0.01495
## F-statistic: 1.334 on 2 and 42 DF, p-value: 0.2744
summary(aov(fit))
## Df Sum Sq Mean Sq F value Pr(>F)
## treatment 2 119.9 59.95 1.334 0.274
## Residuals 42 1887.6 44.94
So, what can we do?
4.2.1 Transformation
One option is to search for a transformation of the response that improves the problem - If heteroskedasticity correlates with the mean value, one can typically decrease it by some sqrt or log transformation, but often difficult, because this may also conflict with keeping the distribution normal.
4.2.2 Model the variance
The second, more general option, is to model the variance - Modelling the variance to fit a model where the variance is not fixed. The basic option in R is nlme::gls
. GLS = Generalized Least Squares. In this function, you can specify a dependency of the residual variance on a predictor or the response. See options via ?varFunc
. In our case, we will use the varIdent option, which allows to specify a different variance per treatment.
library(nlme)
= gls(observation ~ treatment, data = data, weights = varIdent(form = ~ 1 | treatment))
fit summary(fit)
## Generalized least squares fit by REML
## Model: observation ~ treatment
## Data: data
## AIC BIC logLik
## 243.9258 254.3519 -115.9629
##
## Variance function:
## Structure: Different standard deviations per stratum
## Formula: ~1 | treatment
## Parameter estimates:
## A B C
## 1.000000 4.714512 11.821868
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 7.042667 0.2348387 29.989388 0.0000
## treatmentB -3.925011 1.1317816 -3.467994 0.0012
## treatmentC -1.302030 2.7861462 -0.467323 0.6427
##
## Correlation:
## (Intr) trtmnB
## treatmentB -0.207
## treatmentC -0.084 0.017
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.4587934 -0.6241702 0.1687727 0.6524558 1.9480169
##
## Residual standard error: 0.9095262
## Degrees of freedom: 45 total; 42 residual
If you check the ANOVA, also the ANOVA is significant!
anova(fit)
## Denom. DF: 42
## numDF F-value p-value
## (Intercept) 1 899.3761 <.0001
## treatment 2 6.0962 0.0047
The second option for modeling variances is to use the glmmTMB
.{R} package, which we will use quite frequently this week. Here, you can specify an extra regression formula for the dispersion (= residual variance). If we fit this:
library(glmmTMB)
= glmmTMB(observation ~ treatment, data = data, dispformula = ~ treatment) fit
We get 2 regression tables as outputs - one for the effects, and one for the dispersion (= residual variance). We see, as expected, that the dispersion is higher in groups B and C compared to A. An advantage over gls is that we get confidence intervals and p-values for these differences on top!
summary(fit)
## Family: gaussian ( identity )
## Formula: observation ~ treatment
## Dispersion: ~treatment
## Data: data
##
## AIC BIC logLik deviance df.resid
## 248.7 259.5 -118.3 236.7 39
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.0427 0.2269 31.042 < 2e-16 ***
## treatmentB -3.9250 1.0934 -3.590 0.000331 ***
## treatmentC -1.3020 2.6917 -0.484 0.628582
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Dispersion model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2587 0.3651 -0.708 0.479
## treatmentB 3.1013 0.5164 6.006 1.91e-09 ***
## treatmentC 4.9399 0.5164 9.566 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4.2.3 Exercise variance modelling
Take this plot of Ozone ~ Solar.R using the airquality data. Clearly there is heteroskedasticity in the relationship:
plot(Ozone ~ Solar.R, data = airquality)
We can also see this when we fit the regression model:
= lm(Ozone ~ Solar.R, data = airquality)
m1 par(mfrow = c(2, 2))
plot(m1)
Task
We could of course consider other predictors, but let’s say we want to fit this model specifically
- Try to get the variance stable with a transformation.
- Use the
gls
function (packagenlme
) with the untransformed response to make the variance dependent on Solar.R. Hint: Read invarClasses
.{R} and decide how to model this. - Use
glmmTMB
.{R} to model heteroskedasticity.
Solution
4.3 Non-normality and Outliers
What can we do if, after accounting for the functional relationship, response transformation and variance modelling, residual diagnostic 2 shows non-normality, in particular strong outliers? Here simulated example data with strong outliers / deviations from normality:
set.seed(123)
= 100
n = runif(n, -1, 1)
concentration = 2 * concentration + rnorm(n, sd = 0.5) +
growth rbinom(n, 1, 0.05) * rnorm(n, mean = 6*concentration, sd = 6)
plot(growth ~ concentration)
Fitting the model, we see that the distribution is to wide:
= lm(growth ~ concentration)
fit par(mfrow = c(2, 2))
plot(fit)
What can we do to deal with such distributional problems and outliers?
- Removing - Bad option, hard to defend, reviewers don’t like this - if at all, better show robustness with and without outlier, but result is sometimes not robust.
- Change the distribution - Fit a model with a different distribution, i.e. GLM or other. -> We will do this on Wednesday.
- Robust regressions.
- Quantile regression - A special type of regression that does not assume a particular residual distribution.
Change distribution
If we want to change the distribution, we have to go to a GLM, see Wednesday.
Robust regression
Robust methods generally refer to methods that are robust to violation of assumptions, e.g. outliers. More specifically, standard robust regressions typically downweight datap oints that have a too high influence on the fit. See https://cran.r-project.org/web/views/Robust.html for a list of robust packages in R.
# This is the classic method.
library(MASS)
= rlm(growth ~ concentration)
fit summary(fit)
##
## Call: rlm(formula = growth ~ concentration)
## Residuals:
## Min 1Q Median 3Q Max
## -7.1986 -0.3724 0.0377 0.3391 7.0902
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -0.0978 0.0594 -1.6453
## concentration 2.0724 0.1048 19.7721
##
## Residual standard error: 0.534 on 98 degrees of freedom
# No p-values and not sure if we can trust the confidence intervals.
# Would need to boostrap by hand!
# This is another option that gives us p-values directly.
library(robustbase)
## Warning: package 'robustbase' was built under R version 4.1.1
##
## Attaching package: 'robustbase'
## The following object is masked from 'package:survival':
##
## heart
= lmrob(growth ~ concentration)
fit summary(fit)
##
## Call:
## lmrob(formula = growth ~ concentration)
## \--> method = "MM"
## Residuals:
## Min 1Q Median 3Q Max
## -7.2877 -0.4311 -0.0654 0.2788 7.0384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04448 0.05160 -0.862 0.391
## concentration 2.00588 0.08731 22.974 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Robust residual standard error: 0.5549
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8415
## Convergence in 7 IRWLS iterations
##
## Robustness weights:
## 9 observations c(27,40,47,52,56,76,80,91,100)
## are outliers with |weight| = 0 ( < 0.001);
## 5 weights are ~= 1. The remaining 86 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6673 0.9015 0.9703 0.9318 0.9914 0.9989
## Algorithmic parameters:
## tuning.chi bb tuning.psi refine.tol
## 1.548e+00 5.000e-01 4.685e+00 1.000e-07
## rel.tol scale.tol solve.tol eps.outlier
## 1.000e-07 1.000e-10 1.000e-07 1.000e-03
## eps.x warn.limit.reject warn.limit.meanrw
## 1.819e-12 5.000e-01 5.000e-01
## nResample max.it best.r.s k.fast.s k.max
## 500 50 2 1 200
## maxit.scale trace.lev mts compute.rd fast.s.large.n
## 200 0 1000 0 2000
## psi subsampling cov
## "bisquare" "nonsingular" ".vcov.avar1"
## compute.outlier.stats
## "SM"
## seed : int(0)
Quantile regression
Quantile regressions don’t fit a line with an error spreading around it, but try to fit a quantile (e.g. the 0.5 quantile, the median) regardless of the distribution. Thus, they work even if the usual assumptions don’t hold.
library(qgam)
= data.frame(growth = growth, concentration = concentration)
dat
= qgam(growth ~ concentration, data = dat, qu = 0.5) fit
## Estimating learning rate. Each dot corresponds to a loss evaluation.
## qu = 0.5................done
summary(fit)
##
## Family: elf
## Link function: identity
##
## Formula:
## growth ~ concentration
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.08167 0.05823 -1.403 0.161
## concentration 2.04781 0.09500 21.556 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.427 Deviance explained = 48.8%
## -REML = 157.82 Scale est. = 1 n = 100
Summary
Actions on real outliers:
- Robust regression.
- Remove
Actions on different distributions:
- Transform.
- Change distribution or quantile regression.
4.4 Random and Mixed Effects - Motivation
Random effects are a very common addition to regression models that can be used for any type of grouping (categorical) variable. Lets look at the Month in airquality:
$fMonth = as.factor(airquality$Month) airquality
Let’s say further that we are only interested in calculating the mean of Ozone:
= lm(Ozone ~ 1, data = airquality) fit
Problem: If we fit residuals, we see that they are correlated in Month, so we somehow have to account for Month:
plot(residuals(fit) ~ airquality$fMonth[as.numeric(row.names(model.frame(fit)))])
A fixed effect model for fMonth would be
= lm(Ozone ~ fMonth, data = airquality)
fit summary(fit)
##
## Call:
## lm(formula = Ozone ~ fMonth, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.115 -16.823 -7.282 13.125 108.038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.615 5.759 4.101 7.87e-05 ***
## fMonth6 5.829 11.356 0.513 0.609
## fMonth7 35.500 8.144 4.359 2.93e-05 ***
## fMonth8 36.346 8.144 4.463 1.95e-05 ***
## fMonth9 7.833 7.931 0.988 0.325
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.36 on 111 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.2352, Adjusted R-squared: 0.2077
## F-statistic: 8.536 on 4 and 111 DF, p-value: 4.827e-06
However, using a fixed effect costs a lot of degrees of freedom, and maybe we are not really interested in Month, we just want to correct the correlation in the residuals.
Solution: Mixed / random effect models. In a mixed model, we assume (differently to a fixed effect model) that the effect of Month is coming from a normal distribution. In a way, you could say that there are two types of errors:
The random effect, which is a normal “error” per group (in this case Month).
And the residual error, which comes on top of the random effect.
Because of this hierarchical structure, these models are also called “multi-level models” or “hierarchical models”. Nomenclature:
- No random effect = Fixed effect model.
- Only random effects + intercept = Random effect model.
- Random effects + fixed effects = Mixed model.
Because grouping naturally occurs in any type of experimental data (batches, blocks, etc.), mixed effect models are the de-facto default for most experimental data! Mind, that grouping even occurs for example, when 2 different persons gather information.
4.4.1 Fitting Random Effects Models
To speak about random effects, we will use an example data set containing exam scores of 4,059 students from 65 schools in Inner London. This data set is located in the R package mlmRev
.{R}.
- Response: “normexam” (Normalized exam score).
- Predictor 1: “standLRT” (Standardised LR test score; Reading test taken when they were 11 years old).
- Predictor 2: “sex” of the student (F / M).
If we analyze this with a simple lm, we get the following response:
library(mlmRev)
library(effects)
= lm(normexam ~ standLRT + sex , data = Exam)
mod0 plot(allEffects(mod0))
Random intercept model
A random intercept model assumes that each school gets their own intercept. It’s pretty much identical to the fixed effect model lm(normexam ~ standLRT + sex + school)
, except that instead of giving each school a separate independent intercept, we assume that the school effects come from a common normal distribution.
= lmer(normexam ~ standLRT + sex + (1 | school), data = Exam)
mod1 summary(mod1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: normexam ~ standLRT + sex + (1 | school)
## Data: Exam
##
## REML criterion at convergence: 9346.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.7120 -0.6314 0.0166 0.6855 3.2735
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 0.08986 0.2998
## Residual 0.56252 0.7500
## Number of obs: 4059, groups: school, 65
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.07639 0.04202 1.818
## standLRT 0.55947 0.01245 44.930
## sexM -0.17136 0.03279 -5.226
##
## Correlation of Fixed Effects:
## (Intr) stnLRT
## standLRT -0.013
## sexM -0.337 0.061
If we look at the outputs, we see that the effects of school are not explicitly mentioned, i.e. we fit an average over the schools. This is also because we treat the random effects as error rather than as estimates.
However, the school mean effects are estimated, and we can make them visible, e.g. via:
with(Exam, {
= ranef(mod1)$school[,1]
randcoef = fixef(mod1)
fixedcoef plot(standLRT, normexam)
for(i in 1:65){
abline(a = fixedcoef[1] + randcoef[i], b = fixedcoef[2], col = i)
} })
Random slope model
A random slope model assumes that each school also gets their own slope for a given parameter (per default we will always estimate slope and intercept, but you could overwrite this, not recommended!). Let’s do this for standLRT (you could of course do both as well).
= lmer(normexam ~ standLRT + sex + (standLRT | school), data = Exam)
mod2 summary(mod2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: normexam ~ standLRT + sex + (standLRT | school)
## Data: Exam
##
## REML criterion at convergence: 9303.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8339 -0.6373 0.0245 0.6819 3.4500
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## school (Intercept) 0.08795 0.2966
## standLRT 0.01514 0.1230 0.53
## Residual 0.55019 0.7417
## Number of obs: 4059, groups: school, 65
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.06389 0.04167 1.533
## standLRT 0.55275 0.02016 27.420
## sexM -0.17576 0.03228 -5.445
##
## Correlation of Fixed Effects:
## (Intr) stnLRT
## standLRT 0.356
## sexM -0.334 0.036
Fitting a random slope on standLRT is pretty much identical to fit the fixed effect model
lm(normexam ~ standLRT*school + sex)
, except that school is a random effect, and therefore parameter estimates for the interaction sex:school are not independent. The results is similar to the random intercept model, except that we have an additional variance term.
Here a visualization of the results
with(Exam, {
= ranef(mod2)$school[,1]
randcoefI = ranef(mod2)$school[,2]
randcoefS = fixef(mod2)
fixedcoef plot(standLRT, normexam)
for(i in 1:65){
abline(a = fixedcoef[1] + randcoefI[i] , b = fixedcoef[2] + randcoefS[i], col = i)
} })
Syntax cheat sheet:
- Random intercept: `(1 | group).
- ONLY random slope for a given fixed effect:
(0 + fixedEffect | group)
. - Random slope + intercept + correlation (default):
(fixedEffect | group)
. - Random slope + intercept without correlation:
(fixedEffect || group)
, identical to
(1 | group) + (0 + fixedEffect | group)
. - Nested random effects:
(1 | group / subgroup)
. If groups are labeled A, B, C, … and subgroups 1, 2, 3, …, this will create labels A1, A2 ,B1, B2, so that you effectively group in subgroups. Useful for the many experimental people that do not label subgroups uniquely, but otherwise no statistical difference to a normal random effect. - Crossed random effects: You can add random effects independently, as in
(1 | group1) + (1 | group2)
.
4.4.2 Task: a mixed model for plantHeight
Task
Take our old plantHeight dataset that we worked with already a lot. Consider that the relationship height ~ temp may be different for each family. Some families may have larger plants in general (random intercept), but it may also be that the temperature relationship changes per family. Thus, you could include family as random intercept + slope in this relationship. Specify the model and run it!
Solution
<- lm(loght ~ temp, data = plantHeight)
oldModel summary(oldModel)
# Stragegy: if you have a grouping factor, add a random intercept
# check additionally for random slope
<- lmer(loght ~ temp + (1|Family), data = plantHeight)
randomInterceptModel summary(randomInterceptModel)
# if I count sd = 1 paramter more
# if I count the REs, 69 parameters more
# solution: depends on the sd estimate -> wide sd, loose nearly 69df, narrow sd = loose nothing except the sd estimate -> flexibility of an random effect model is adoptiv (not fixed)
<- lm(loght ~ temp + Family, data = plantHeight)
fixedInterceptModel summary(fixedInterceptModel)
# loose 69 degrees of freedom -> this model has 69 parameters more to fit than the model loght ~ temp, loose a lot of power
<- lmer(loght ~ temp + (temp | Family), data = plantHeight)
randomSlopeModel summary(randomSlopeModel)
# scaling helps the optimizer
$sTemp = scale(plantHeight$temp)
plantHeight
<- lmer(loght ~ sTemp + (sTemp | Family), data = plantHeight)
randomSlopeModel summary(randomSlopeModel)
# fixed effect model has no significance any more
<- lm(loght ~ temp * Family, data = plantHeight)
fixedSlopeInterceptModel summary(fixedSlopeInterceptModel)
summary(aov(fixedSlopeInterceptModel))
4.4.3 Problems With Mixed Models
Specifying mixed models is quite simple, however, there is a large list of (partly quite complicate) issues in their practical application. Here, a (incomplete) list:
Interpretation of random effects
What do the random effects mean? The classical interpretation of the rando intercept is that they are a group-structured error, i.e. they are like residuals, but for an entire group.
However, this view breaks down a bit for a random slope model. Another way to view this is that the RE terms absorb variation in parameters. This view works for random intercept and slope models. So, our main model fits the grand mean, and the RE models variation in the parameters, which is assumed to be normally distributed.
A third view of RE models is that they are regularized fixed effect models. What does that mean? Note again that, conceptual, every random effect model corresponds to a fixed effect model. For a random intercept model, this would be
lmer(y ~ x + (1|group)) => lm(y ~ x + group)
and for the random slope, it would be
lmer(y ~ x + (x|group)) => lm(y ~ x * group)
So, you could alternatively always fit the fixed effect model. What’s the difference? First of all, for the fixed effect models, you would get many more estimates and p-values in the standard summary functions. But that’s not really a big difference. More importantly, however, estimates in the random effect model will be closer together, and if you have groups with very little data, they will still be fittable. So, in a random effect model, parameter estimates for groups with few data points are informed by the estimates of the other groups, because they are connected via the normal distribution.
This is today often used by people that want to fit effects across groups with varying availability of data. For example, if we have data for common and rare species, and we are interested in the density dependence of the species, we could fit
~ density + (density | species) mortality
In such a model, we have the mean density effect across all species, and rare species with few data will be constrained by this effect, while common species with a lot of data can overrule the normal distribution imposed by the random slope and get their own estimate. In this picture, the random effect imposes an adaptive shrinkage, similar to a LASSO or ridge shrinkage estimator, with the shrinkage strength controlled by the standard deviation of the random effect.
Degrees of freedom for a random effect
The second problem is: How many parameters does a random effect model have? To know how many parameters the model hss is crucial for calculating p-values, AIC and all that. We can estimate roughly how many parameters we should have by looking at the fixed effect version of the models:
= lm(normexam ~ standLRT + sex , data = Exam)
mod1 $rank # 3 parameters. mod1
## [1] 3
= lmer(normexam ~ standLRT + sex + (1 | school), data = Exam)
mod2 # No idea how many parameters.
= lm(normexam ~ standLRT + sex + school, data = Exam)
mod3 $rank # 67 parameters. mod3
## [1] 67
What we can say is that the mixed model is more complicated than mod1, but less than mod2 (as it has the additional constraint), so the complexity must be somewhere in-between. But now much?
In fact, the complexity is controlled by the estimated variance of the random effect. For a high variance, the model is nearly as complex as mod3, for a low variance, it is only as complex as mod1. Because of these issues, lmer
by default does not return p-values. However, you can calculate p-values based on approximate degrees of freedom via the lmerTest
package, which also corrects ANOVA for random effects, but not AIC.
library(lmerTest)
= lmer(normexam ~ standLRT + sex + (1 | school), data = Exam, REML = F)
m2 summary(m2)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
## method [lmerModLmerTest]
## Formula: normexam ~ standLRT + sex + (1 | school)
## Data: Exam
##
## AIC BIC logLik deviance df.resid
## 9340.0 9371.6 -4665.0 9330.0 4054
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.7117 -0.6326 0.0168 0.6865 3.2744
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 0.08807 0.2968
## Residual 0.56226 0.7498
## Number of obs: 4059, groups: school, 65
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.07646 0.04168 76.66670 1.834 0.0705 .
## standLRT 0.55954 0.01245 4052.83930 44.950 < 2e-16 ***
## sexM -0.17138 0.03276 3032.37966 -5.231 1.8e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) stnLRT
## standLRT -0.013
## sexM -0.339 0.061
Predictions
When we predict, we can predict conditional on the fitted REs, or we can use the grand mean. In some packages, the predict function can be adjusted. In lme4, this is via the option re.form, which is available in the predict and simulate function.
Model selection with mixed models
For model selection, the degrees of freedom problem means that normal model selection techniques such as AIC or naive LRTs don’t work on the random effect structure, because they don’t count the correct degrees of freedom.
However, assuming that by changing the fixed effect structure, the flexibility of the REs doesn’t change a lot (you would see this by looking at the RE sd), we can use standard model selection on the fixed effect structure. All I have said about model selection on standard models applies also here: good for predictions, rarely a good idea if your goal is causal inference.
Regarding the random effect structure - my personal recommendation for most cases is the following:
- add random intercept on all abvious grouping variables
- check residuals per group (e.g. with the plot function below), add random slope if needed
<- lmer(y ~ x + (1|group))
m1
plot(m1,
resid(., scaled=TRUE) ~ fitted(.) | group,
abline = 0)
If you absolutely want to do model selection on the RE structure
- lmerTest::ranova performs an ANOVA with estimated df, adding entire RE groups
- if you want to do details model selections on the RE structure, you should implement a simulated LRT based on a parametric bootstrap. See day 5, on the parametric bootstrap.
Variance partitioning / ANOVA
Also variance partitioning in mixed models is a bit tricky, as (see type I/II/III ANOVA discussion) fixed and random components of the model are in some way “correlated”. Moreover, a key question is (see also interpreatio above): Do you want to count the random effect variance as “explained”, or “residual”. The most common approach is the hierarchical partitioning proposed by by Nakagawa & Schielzeth 2013, Nakagawa et al. (2017), which is implemented in the MuMIn
.{R} package. With this, we can run
library(MuMIn)
r.squaredGLMM(m2)
## R2m R2c
## [1,] 0.3303846 0.4210712
Interpretation
- R2m: Marginal \({R}^{2}\) value associated with fixed effects.
- R2c: Conditional \({R}^{2}\) value associated with fixed effects plus the random effects.
4.5 Case studies
4.5.1 Case Study 1: College Student Performance Over Time
Background and data structure
The GPA (college grade point average) data is a longitudinal data set (also named panel data, German: “Längsschnittstudie”. A study repeated at several different moments in time, compared to a cross-sectional study (German: “Querschnittstudie”) which has several participants at one time). In this data set, 200 college students and their GPA have been followed 6 consecutive semesters. Look at the GPA data set, which can be found in the EcoData
package:
library(EcoData)
str(gpa)
In this data set, there are GPA measures on 6 consecutive occasions, with a job status variable (how many hours worked) for the same 6 occasions. There are two student-level explanatory variables: The sex (1 = male, 2 = female) and the high school gpa. There is also a dichotomous student-level outcome variable, which indicates whether a student has been admitted to the university of their choice. Since not every student applies to a university, this variable has many missing values. Each student and each year of observation have an id.
Task
Analyze if GPA improves over time (occasion)! Here a few hints to look at:
- Consider which fixed effect structure you want to fit. For example, you might be interested if males and femals differ in their temporal trend
- Student is the grouping variable -> RE. Which RE structure do you want to fit? A residual plot may help
- For your benefit, have a look at the difference in the regression table (confidence intervals, coefficients and p-values) of mixed and corresponding fixed effects model. You can also look at the estimates of the mixed effects model (hint:
?ranef
). - After having specified the mixed model, have a look at residuals. You can model dispersion problems in mixed models with glmmTMB, same syntax for REs as lme4
Solution
library(lme4)
library(glmmTMB)
library(EcoData)
# initial model with a random intercept and fixed effect structure based on
# causal assumptions
$sOccasion = scale(gpa$occasion)
gpa$nJob = as.numeric(gpa$job)
gpa
<- lmer(gpa ~ sOccasion*sex + nJob + (1|student), data = gpa)
fit summary(fit)
# plot seems to show a lot of differences still, so add random slope
plot(fit,
resid(., scaled=TRUE) ~ fitted(.) | student,
abline = 0)
# slope + intercept model
<- lmer(gpa ~ sOccasion*sex + nJob + (sOccasion|student), data = gpa)
fit
# checking residuals - looks like heteroskedasticity
plot(fit)
<- lmer(gpa ~ sOccasion*sex + nJob + (sOccasion|student), data = gpa)
fit
# I'm using here glmmTMB, alternatively could add weights nlme::lme, which also allows specificying mixed models with all variance modelling options that we discussed for gls, but random effect specification is different than here
<- glmmTMB(gpa ~ sOccasion*sex + nJob + (sOccasion|student), data = gpa, dispformula = ~ sOccasion)
fit summary(fit)
# unfortunately, the dispersion in this model cannot be reliably checked, because the functions for this are not (yet) implemented in glmmTMB
plot(residuals(fit, type = "pearson") ~ predict(fit)) # not implemented
library(DHARMa)
simulateResiduals(fit, plot = T)
# still, the variable dispersion model is highly supported by the data and clearly preferable over a fixed dispersion model
4.5.2 Case Study 2 - Honeybee Data
Task
We use a dataset on bee colonies infected by the American Foulbrood (AFB) disease.
library(EcoData)
str(bees)
Perform the data analysis, according to the hypothesis discussed in the course.
Solution
# adding BeesN as a possible confounder
library(lme4)
<- lmer(log(Spobee + 1) ~ Infection + BeesN + (1|Hive), data = bees)
fit summary(fit)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log(Spobee + 1) ~ Infection + BeesN + (1 | Hive)
## Data: bees
##
## REML criterion at convergence: 260.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27939 -0.45413 0.09209 0.52159 1.64023
##
## Random effects:
## Groups Name Variance Std.Dev.
## Hive (Intercept) 4.8463 2.2014
## Residual 0.6033 0.7767
## Number of obs: 72, groups: Hive, 24
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 5.583e+00 1.907e+00 2.100e+01 2.927 0.00805 **
## Infection 2.663e+00 5.528e-01 2.100e+01 4.818 9.22e-05 ***
## BeesN -2.068e-05 2.558e-05 2.100e+01 -0.808 0.42804
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Infctn
## Infection -0.476
## BeesN -0.966 0.398
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
# residual plot shows that hives are either infected or not, thus
# doesn't make sense to add a random slope
plot(fit,
resid(., scaled=TRUE) ~ fitted(.) | Hive,
abline = 0)