10 Dispersion models
Modelling dispersion means that we describe how the expected variance changes as a function of the mean or of predictor variables. The residual problem that we address with this is known as heteroskedasticity.
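Schematically (a sketch of the idea for the linear model, not a formula from the text), instead of assuming a constant residual variance we let the variance depend on covariates, e.g.

$$y_i \sim N(\mu_i, \sigma_i^2), \qquad \mu_i = x_i^\top \beta, \qquad \log \sigma_i = z_i^\top \gamma,$$

where the dispersion covariates $z_i$ may be the same as or different from the mean covariates $x_i$.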
10.1 Heteroskedasticity in the linear model
The easiest way to understand heteroskedasticity is in the linear model. Let's look at an extreme example that we create by simulating some data:

set.seed(125)
data = data.frame(treatment = factor(rep(c("A", "B", "C"), each = 15)))
data$observation = c(7, 2, 4)[as.numeric(data$treatment)] +
  rnorm(length(data$treatment), sd = as.numeric(data$treatment)^2)
boxplot(observation ~ treatment, data = data)
Especially p-values and confidence intervals of lm() and ANOVA can react quite strongly to such differences in residual variation. So, running a standard lm() / ANOVA on this data is not a good idea - in this case, we see that none of the regression effects is significant, nor is the ANOVA, suggesting that there is no difference between groups.
fit = lm(observation ~ treatment, data = data)
summary(fit)
Call:
lm(formula = observation ~ treatment, data = data)
Residuals:
Min 1Q Median 3Q Max
-17.2897 -1.0514 0.3531 2.4465 19.8602
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.043 1.731 4.069 0.000204 ***
treatmentB -3.925 2.448 -1.603 0.116338
treatmentC -1.302 2.448 -0.532 0.597601
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.704 on 42 degrees of freedom
Multiple R-squared: 0.05973, Adjusted R-squared: 0.01495
F-statistic: 1.334 on 2 and 42 DF, p-value: 0.2744
summary(aov(fit))
Df Sum Sq Mean Sq F value Pr(>F)
treatment 2 119.9 59.95 1.334 0.274
Residuals 42 1887.6 44.94
A subtlety about heteroskedasticity in the normal model is that it mainly affects p-values for the variable that we are testing when the residual variance is correlated with that variable. The two simulations below illustrate this:
# error sd is variable, but uncorrelated with predictor
getP <- function(effect = 0, n = 1000){
  sd = rgamma(n, shape = 2)
  #hist(sd)
  x = runif(n)
  y = effect * x + rnorm(n, sd = sd)
  fit <- lm(y ~ x)
  table <- summary(fit)
  return(table$coefficients[2,4])
}

out = replicate(1000, getP())
hist(out, breaks = 20)
# error sd is variable and correlated with predictor (identical in this case)
getP2 <- function(effect = 0, n = 1000){
  sd = rgamma(n, shape = 2)
  y = effect * sd + rnorm(n, sd = sd)
  fit <- lm(y ~ sd)
  table <- summary(fit)
  return(table$coefficients[2,4])
}

out2 = replicate(1000, getP2())
hist(out2, breaks = 20)
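A quick way to quantify this difference (a small addition to the simulation above, not part of the original code) is the realized type I error rate at the nominal 5% level:

# Proportion of false positives under the null hypothesis (nominal level 0.05)
mean(out < 0.05)   # close to 0.05 when the error sd is unrelated to the predictor
mean(out2 < 0.05)  # typically inflated when the error sd is correlated with the predictor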
So, what can we do?
10.1.1 Transformation
One option is to search for a transformation of the response that mitigates the problem. If the heteroskedasticity correlates with the mean value, one can typically reduce it with a sqrt or log transformation, but this is often difficult, because the transformation may also conflict with keeping the distribution normal.
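As a minimal sketch (using separate simulated, strictly positive data rather than the example above, which contains negative values), a log transformation can remove heteroskedasticity that is driven by the mean:

# Hypothetical illustration: multiplicative error makes the variance grow with the mean
set.seed(42)
x = runif(100, 1, 10)
y = exp(0.3 * x + rnorm(100, sd = 0.3))
par(mfrow = c(1, 2))
plot(lm(y ~ x), which = 1)        # funnel-shaped residuals on the raw scale
plot(lm(log(y) ~ x), which = 1)   # roughly constant spread after the log transformation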
10.1.2 Modelling the variance / dispersion
The second, more general option is to model the variance, i.e. to fit a model in which the residual variance is not fixed but may itself depend on predictors. We will discuss two packages in R that allow modelling the dispersion.
The first (traditional) option is to use nlme::gls. GLS = Generalized Least Squares. In the gls function, you can specify a dependency of the residual variance on a predictor or on the response via the weights argument. There are different types of dependencies that you can specify, see ?varFunc. In our case, we will use the varIdent function, which allows us to specify a different variance per treatment.
library(nlme)
fit = gls(observation ~ treatment, data = data,
          weights = varIdent(form = ~ 1 | treatment))
summary(fit)
Generalized least squares fit by REML
Model: observation ~ treatment
Data: data
AIC BIC logLik
243.9258 254.3519 -115.9629
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | treatment
Parameter estimates:
A B C
1.000000 4.714512 11.821868
Coefficients:
Value Std.Error t-value p-value
(Intercept) 7.042667 0.2348387 29.989388 0.0000
treatmentB -3.925011 1.1317816 -3.467994 0.0012
treatmentC -1.302030 2.7861462 -0.467323 0.6427
Correlation:
(Intr) trtmnB
treatmentB -0.207
treatmentC -0.084 0.017
Standardized residuals:
Min Q1 Med Q3 Max
-2.4587934 -0.6241702 0.1687727 0.6524558 1.9480170
Residual standard error: 0.9095262
Degrees of freedom: 45 total; 42 residual
If we check the ANOVA, we see that, unlike before, we get a significant effect of the treatment:
anova(fit)
Denom. DF: 42
numDF F-value p-value
(Intercept) 1 899.3761 <.0001
treatment 2 6.0962 0.0047
The second option for modelling the variance is to use the glmmTMB package. Here, you can specify an extra regression formula for the dispersion (= residual variance). If we fit this:
library(glmmTMB)
fit = glmmTMB(observation ~ treatment, data = data,
              dispformula = ~ treatment)
summary(fit)
Family: gaussian ( identity )
Formula: observation ~ treatment
Dispersion: ~treatment
Data: data
AIC BIC logLik deviance df.resid
248.7 259.5 -118.3 236.7 39
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.0427 0.2269 31.042 < 2e-16 ***
treatmentB -3.9250 1.0934 -3.590 0.000331 ***
treatmentC -1.3020 2.6917 -0.484 0.628581
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dispersion model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1293 0.1826 -0.708 0.479
treatmentB 1.5506 0.2582 6.006 1.91e-09 ***
treatmentC 2.4700 0.2582 9.566 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We get two regression tables as output - one for the effects (the conditional model) and one for the dispersion (= residual variance). We see, as expected, that the dispersion is higher in groups B and C compared to A. An advantage over gls is that we get confidence intervals and p-values for these differences on top!
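As a minimal sketch (not shown in the original output), such confidence intervals can be obtained directly from the fitted object, e.g. via confint(), which returns Wald intervals by default:

# Wald confidence intervals for mean and dispersion coefficients
confint(fit)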
10.2 Heteroskedasticity in GLMMs
GLM(M)s can be heteroskedastic as well, i.e. their dispersion may depend on some predictors. In glmmTMB, you can make the dispersion of the negative binomial dependent on a formula via the dispformula argument, in the same way as in nlme for the linear model.
Variance problems would show up when plotting residuals against predicted values and against the predictors. On the previous page, we saw some variance problems in the Salamander model. We could add a variable dispersion model via
m3 = glmmTMB(count ~ spp + mined + (1|site), family = nbinom1,
             dispformula = ~ spp + mined, data = Salamanders)
summary(m3)
Family: nbinom1 ( log )
Formula: count ~ spp + mined + (1 | site)
Dispersion: ~spp + mined
Data: Salamanders
AIC BIC logLik deviance df.resid
1654.4 1730.3 -810.2 1620.4 627
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
site (Intercept) 0.2283 0.4778
Number of obs: 644, groups: site, 23
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5288 0.2799 -5.462 4.70e-08 ***
sppPR -1.3304 0.3480 -3.822 0.000132 ***
sppDM 0.2695 0.2004 1.345 0.178561
sppEC-A -0.7525 0.2772 -2.714 0.006641 **
sppEC-L 0.6228 0.2109 2.952 0.003155 **
sppDES-L 0.7113 0.1976 3.600 0.000318 ***
sppDF 0.1470 0.2171 0.677 0.498259
minedno 2.1348 0.2825 7.557 4.14e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dispersion model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2834 0.6414 -0.442 0.6586
sppPR 0.3160 0.7501 0.421 0.6735
sppDM 0.1979 0.5712 0.346 0.7289
sppEC-A 0.3592 0.6477 0.554 0.5792
sppEC-L 1.0830 0.5215 2.077 0.0378 *
sppDES-L 0.7951 0.5370 1.481 0.1387
sppDF 0.3769 0.6109 0.617 0.5373
minedno 0.5583 0.4187 1.334 0.1823
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(DHARMa)
res = simulateResiduals(m3, plot = T)
par(mfrow = c(1, 2))
plotResiduals(res, Salamanders$spp)
plotResiduals(res, Salamanders$mined)
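As a side check (a sketch that goes beyond the original text), one can ask whether the extra dispersion model is warranted at all by comparing it against the constant-dispersion fit, e.g. via a likelihood-ratio test or AIC:

# Hypothetical comparison: the same model without the dispersion formula
m2 = glmmTMB(count ~ spp + mined + (1|site), family = nbinom1, data = Salamanders)
anova(m2, m3)  # likelihood-ratio test of variable vs. constant dispersion
AIC(m2, m3)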
10.3 Excursion: outliers, robust and quantile regression
What can we do if, after accounting for the functional relationship, response transformation and variance modelling, the residual diagnostics still show non-normality, in particular strong outliers? Here is some simulated example data with strong outliers / deviations from normality:
set.seed(123)
n = 100
concentration = runif(n, -1, 1)
growth = 2 * concentration + rnorm(n, sd = 0.5) +
  rbinom(n, 1, 0.05) * rnorm(n, mean = 6*concentration, sd = 6)
plot(growth ~ concentration)
Fitting the model, we see that the residual distribution is too wide:
fit = lm(growth ~ concentration)
par(mfrow = c(2, 2))
plot(fit)
What can we do to deal with such distributional problems and outliers?
- Removing outliers - a bad option that is hard to defend, and reviewers don't like it. If at all, better show the robustness of the result with and without the outliers (see the sketch after this list), although the result is sometimes not robust.
- Change the distribution - fit a model with a different distribution, e.g. a GLM or another distributional model.
- Robust regression.
- Quantile regression - a special type of regression that does not assume a particular residual distribution.
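For the first option, a minimal sketch of such a robustness check (hypothetical code, not from the text; the names fit_all, keep and fit_trim are made up) could look like this:

# Hypothetical robustness check: refit without the most extreme standardized residuals
fit_all  = lm(growth ~ concentration)
keep     = abs(rstandard(fit_all)) < 3   # keep observations with |standardized residual| < 3
fit_trim = lm(growth[keep] ~ concentration[keep])
cbind(all = coef(fit_all), trimmed = coef(fit_trim))  # compare estimates with / without outliers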
10.3.1 Robust regression
Robust methods generally refer to methods that are robust to violations of assumptions, e.g. outliers. More specifically, standard robust regressions typically downweight data points that have too high an influence on the fit. See https://cran.r-project.org/web/views/Robust.html for a list of robust packages in R.
# This is the classic method.
library(MASS)
fit = rlm(growth ~ concentration)
summary(fit)
Call: rlm(formula = growth ~ concentration)
Residuals:
Min 1Q Median 3Q Max
-7.1986 -0.3724 0.0377 0.3391 7.0902
Coefficients:
Value Std. Error t value
(Intercept) -0.0978 0.0594 -1.6453
concentration 2.0724 0.1048 19.7721
Residual standard error: 0.534 on 98 degrees of freedom
# No p-values and not sure if we can trust the confidence intervals.
# Would need to bootstrap by hand!
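# A possible sketch of such a bootstrap (hypothetical, not from the text):
bootSlope = replicate(1000, {
  idx = sample(length(growth), replace = TRUE)
  coef(rlm(growth[idx] ~ concentration[idx]))[2]
})
quantile(bootSlope, c(0.025, 0.975))  # approximate 95% bootstrap CI for the slope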
# This is another option that gives us p-values directly.
library(robustbase)
fit = lmrob(growth ~ concentration)
summary(fit)
Call:
lmrob(formula = growth ~ concentration)
\--> method = "MM"
Residuals:
Min 1Q Median 3Q Max
-7.2877 -0.4311 -0.0654 0.2788 7.0384
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04448 0.05160 -0.862 0.391
concentration 2.00588 0.08731 22.974 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Robust residual standard error: 0.5549
Multiple R-squared: 0.8431, Adjusted R-squared: 0.8415
Convergence in 7 IRWLS iterations
Robustness weights:
9 observations c(27,40,47,52,56,76,80,91,100)
are outliers with |weight| = 0 ( < 0.001);
5 weights are ~= 1. The remaining 86 ones are summarized as
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.6673 0.9015 0.9703 0.9318 0.9914 0.9989
Algorithmic parameters:
tuning.chi bb tuning.psi refine.tol
1.548e+00 5.000e-01 4.685e+00 1.000e-07
rel.tol scale.tol solve.tol eps.outlier
1.000e-07 1.000e-10 1.000e-07 1.000e-03
eps.x warn.limit.reject warn.limit.meanrw
1.819e-12 5.000e-01 5.000e-01
nResample max.it best.r.s k.fast.s k.max
500 50 2 1 200
maxit.scale trace.lev mts compute.rd fast.s.large.n
200 0 1000 0 2000
psi subsampling cov
"bisquare" "nonsingular" ".vcov.avar1"
compute.outlier.stats
"SM"
seed : int(0)
10.3.2 Quantile regression
Quantile regressions don't fit a line with an error distribution spreading around it, but instead fit a given quantile of the response (e.g. the 0.5 quantile, i.e. the median) regardless of the distribution. Thus, they work even if the usual distributional assumptions don't hold.
library(qgam)
dat = data.frame(growth = growth, concentration = concentration)

fit = qgam(growth ~ concentration, data = dat, qu = 0.5)
Estimating learning rate. Each dot corresponds to a loss evaluation.
qu = 0.5................done
summary(fit)
Family: elf
Link function: identity
Formula:
growth ~ concentration
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.08167 0.05823 -1.403 0.161
concentration 2.04781 0.09500 21.556 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.427 Deviance explained = 48.8%
-REML = 157.82 Scale est. = 1 n = 100
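As an aside (a sketch that goes beyond the original example), several quantiles can be fitted at once, e.g. with quantreg::rq, to see how the relationship changes across the response distribution:

# Hypothetical extension: fit the 0.1, 0.5 and 0.9 quantiles with the quantreg package
library(quantreg)
fitQ = rq(growth ~ concentration, tau = c(0.1, 0.5, 0.9), data = dat)
plot(growth ~ concentration, data = dat)
for(i in 1:3) abline(coef = coef(fitQ)[, i], lty = i)  # one line per fitted quantile
summary(fitQ)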