9  Dispersion models

Modelling dispersion means that we describe how the variance of the response changes as a function of the mean or of predictor variables. The residual problem that we address in this way is known as heteroskedasticity.

9.1 Heteroskedasticity in the linear model

The easiest way to understand heteroskedasticity is in the linear model. Let’s look at an extreme example that we create by simulating some data

set.seed(125)

data = data.frame(treatment = factor(rep(c("A", "B", "C"), each = 15)))
data$observation = c(7, 2 ,4)[as.numeric(data$treatment)] +
  rnorm( length(data$treatment), sd = as.numeric(data$treatment)^2 )
boxplot(observation ~ treatment, data = data)

P-values and confidence intervals of lm() and ANOVA in particular can react quite strongly to such differences in residual variation. Running a standard lm() / ANOVA on these data is therefore not a good idea - in this case, none of the regression effects is significant, nor is the ANOVA, suggesting (wrongly) that there is no difference between the groups.

fit = lm(observation ~ treatment, data = data)
summary(fit)

Call:
lm(formula = observation ~ treatment, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.2897  -1.0514   0.3531   2.4465  19.8602 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7.043      1.731   4.069 0.000204 ***
treatmentB    -3.925      2.448  -1.603 0.116338    
treatmentC    -1.302      2.448  -0.532 0.597601    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.704 on 42 degrees of freedom
Multiple R-squared:  0.05973,   Adjusted R-squared:  0.01495 
F-statistic: 1.334 on 2 and 42 DF,  p-value: 0.2744
summary(aov(fit))
            Df Sum Sq Mean Sq F value Pr(>F)
treatment    2  119.9   59.95   1.334  0.274
Residuals   42 1887.6   44.94               
Important

A subtlety about heteroskedasticity in the normal model is that it mainly affects p-values and confidence intervals of predictors that are themselves correlated with the residual variance. The two simulations below illustrate this: in the first, the residual sd varies but is uncorrelated with the predictor; in the second, the residual sd is correlated with (here: identical to) the predictor, and the p-values under H0 are no longer reliable.


# error sd is variable, but uncorrelated with predictor

getP <- function(effect = 0, n = 1000){
  sd = rgamma(n, shape = 2)
  #hist(sd)
  x = runif(n)
  y = effect * x + rnorm(n, sd = sd)  
  fit <- lm(y ~ x)
  table <- summary(fit)
  return(table$coefficients[2,4])
}

out = replicate(1000, getP())
hist(out, breaks = 20)


# error sd is variable and correlated with predictor (identical in this case)

getP2 <- function(effect = 0, n = 1000){
  sd = rgamma(n, shape = 2)
  y = effect * sd + rnorm(n, sd = sd)  
  fit <- lm(y ~ sd)
  table <- summary(fit)
  return(table$coefficients[2,4])
}

out2 = replicate(1000, getP2())
hist(out2, breaks = 20)

So, what can we do?

9.1.1 Transformation

One option is to search for a transformation of the response that reduces the problem. If the heteroskedasticity correlates with the mean value, it can typically be decreased by a sqrt or log transformation. In practice this is often difficult, however, because the transformation may conflict with keeping the distribution normal.
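As a minimal sketch (with freshly simulated, strictly positive data - note that in the treatment example above, negative observations would make a log or sqrt transform impossible):

# a response with multiplicative (lognormal) errors, so the sd grows with the mean
set.seed(42)
x = runif(100, 1, 10)
y = exp(0.3 * x + rnorm(100, sd = 0.3))

par(mfrow = c(1, 2))
plot(lm(y ~ x), which = 1, main = "untransformed")         # spread increases with fitted values
plot(lm(log(y) ~ x), which = 1, main = "log-transformed")  # spread roughly constant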

9.1.2 Modelling the variance / dispersion

The second, more general option is to model the variance, i.e. to fit a model in which the residual variance is not fixed but depends on predictors. We will discuss two R packages that allow us to model the dispersion.

The first (traditional) option is to use nlme::gls. GLS stands for Generalized Least Squares. In the gls function, you can specify a dependency of the residual variance on a predictor or on the response via the weights argument. There are different types of dependencies that you can specify, see ?varFunc. In our case, we will use the varIdent function, which allows us to specify a different variance per treatment.

library(nlme)

fit = gls(observation ~ treatment, data = data, 
          weights = varIdent(form = ~ 1 | treatment))
summary(fit)
Generalized least squares fit by REML
  Model: observation ~ treatment 
  Data: data 
       AIC      BIC    logLik
  243.9258 254.3519 -115.9629

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | treatment 
 Parameter estimates:
        A         B         C 
 1.000000  4.714512 11.821868 

Coefficients:
                Value Std.Error   t-value p-value
(Intercept)  7.042667 0.2348387 29.989388  0.0000
treatmentB  -3.925011 1.1317816 -3.467994  0.0012
treatmentC  -1.302030 2.7861462 -0.467323  0.6427

 Correlation: 
           (Intr) trtmnB
treatmentB -0.207       
treatmentC -0.084  0.017

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-2.4587934 -0.6241702  0.1687727  0.6524558  1.9480170 

Residual standard error: 0.9095262 
Degrees of freedom: 45 total; 42 residual

If we check the ANOVA, we see that, unlike before, we get a significant effect of the treatment:

anova(fit)
Denom. DF: 42 
            numDF  F-value p-value
(Intercept)     1 899.3761  <.0001
treatment       2   6.0962  0.0047

The second option for modelling the variance is to use the glmmTMB package. Here, you can specify an extra regression formula for the dispersion (= residual variance). If we fit this:

library(glmmTMB)

fit = glmmTMB(observation ~ treatment, data = data, 
              dispformula = ~ treatment)
summary(fit)
 Family: gaussian  ( identity )
Formula:          observation ~ treatment
Dispersion:                   ~treatment
Data: data

     AIC      BIC   logLik deviance df.resid 
   248.7    259.5   -118.3    236.7       39 


Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   7.0427     0.2269  31.042  < 2e-16 ***
treatmentB   -3.9250     1.0934  -3.590 0.000331 ***
treatmentC   -1.3020     2.6917  -0.484 0.628582    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dispersion model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.2587     0.3651  -0.708    0.479    
treatmentB    3.1013     0.5164   6.006 1.91e-09 ***
treatmentC    4.9399     0.5164   9.566  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We get two regression tables as output - one for the effects (conditional model), and one for the dispersion (= residual variance). We see, as expected, that the dispersion is higher in groups B and C than in A. An advantage over gls is that we additionally get confidence intervals and p-values for these differences!
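If you want those confidence intervals explicitly, confint() works on glmmTMB objects; whether the full = TRUE flag also returns the dispersion-model parameters, as assumed here, should be checked against ?confint.glmmTMB:

# Wald confidence intervals for the fitted glmmTMB model
confint(fit)               # conditional (mean) model
confint(fit, full = TRUE)  # assumption: includes the dispersion-model parameters as well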

Exercise variance modelling

Take this plot of Ozone ~ Solar.R using the airquality data. Clearly there is heteroskedasticity in the relationship:

plot(Ozone ~ Solar.R, data = airquality)

We can also see this when we fit the regression model:

m1 = lm(Ozone ~ Solar.R, data = airquality)
par(mfrow = c(2, 2))
plot(m1)

We could of course consider other predictors, but let's say we want to fit this model specifically:

  1. Try to get the variance stable with a transformation.
  2. Use the gls function (package nlme) with the untransformed response to make the variance dependent on Solar.R. Hint: read ?varClasses and decide how to model the variance.
  3. Use glmmTMB to model heteroskedasticity.
Solution:

  1. Transformation
m2 = lm(sqrt(Ozone) ~ Solar.R, data = airquality)
  2. GLS
m3 = nlme::gls(Ozone ~ Solar.R, weights = varPower(form = ~ Solar.R),
               data = airquality[complete.cases(airquality), ])
  3. glmmTMB
m4 = glmmTMB(Ozone ~ Solar.R, dispformula = ~ Solar.R, data = airquality)
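To check whether the dispersion model has removed the residual pattern, we can inspect scaled residuals of the glmmTMB fit with DHARMa (a sketch, using the same workflow as for the Salamander example in the next section):

library(DHARMa)
res = simulateResiduals(m4, plot = TRUE)
# plot residuals against the predictor; model.frame() recovers the rows actually used by the fit
plotResiduals(res, form = model.frame(m4)$Solar.R)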

9.2 Heteroskedasticity in GLMMs

GLM(M)s can be heteroskedastic as well, i.e. the dispersion may depend on predictors. In glmmTMB, you can make the dispersion of the negative binomial dependent on predictors via the dispformula argument, in the same way as we did above for the linear model.

Variance problems show up when plotting residuals against predicted values and against predictors. On the previous page, we saw some variance problems in the Salamander model. We could add a variable dispersion model via

m3 = glmmTMB(count ~ spp + mined + (1|site), family = nbinom1,
             dispformula = ~ spp + mined ,  data = Salamanders)
summary(m3)
 Family: nbinom1  ( log )
Formula:          count ~ spp + mined + (1 | site)
Dispersion:             ~spp + mined
Data: Salamanders

     AIC      BIC   logLik deviance df.resid 
  1654.4   1730.3   -810.2   1620.4      627 

Random effects:

Conditional model:
 Groups Name        Variance Std.Dev.
 site   (Intercept) 0.2283   0.4778  
Number of obs: 644, groups:  site, 23

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.5288     0.2799  -5.462 4.70e-08 ***
sppPR        -1.3304     0.3480  -3.822 0.000132 ***
sppDM         0.2695     0.2004   1.345 0.178561    
sppEC-A      -0.7525     0.2772  -2.714 0.006641 ** 
sppEC-L       0.6228     0.2109   2.952 0.003155 ** 
sppDES-L      0.7113     0.1976   3.600 0.000318 ***
sppDF         0.1470     0.2171   0.677 0.498259    
minedno       2.1348     0.2825   7.557 4.14e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dispersion model:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.2834     0.6414  -0.442   0.6586  
sppPR         0.3160     0.7501   0.421   0.6735  
sppDM         0.1979     0.5712   0.346   0.7289  
sppEC-A       0.3592     0.6477   0.554   0.5792  
sppEC-L       1.0830     0.5215   2.077   0.0378 *
sppDES-L      0.7951     0.5370   1.481   0.1387  
sppDF         0.3769     0.6109   0.617   0.5373  
minedno       0.5583     0.4187   1.334   0.1823  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(DHARMa)
res = simulateResiduals(m3, plot = T)

par(mfrow = c(1, 2))
plotResiduals(res, Salamanders$spp)
plotResiduals(res, Salamanders$mined)

9.3 Excursion: outliers, robust and quantile regression

What can we do if, after accounting for the functional relationship, response transformation and variance modelling, residual diagnostic 2 still shows non-normality, in particular strong outliers? Here is some simulated example data with strong outliers / deviations from normality:

set.seed(123)

n = 100
concentration = runif(n, -1, 1)
growth = 2 * concentration + rnorm(n, sd = 0.5) +
  rbinom(n, 1, 0.05) * rnorm(n, mean = 6*concentration, sd = 6)
plot(growth ~ concentration)

Fitting the model, we see that the residual distribution is too wide:

fit = lm(growth ~ concentration)
par(mfrow = c(2, 2))
plot(fit)

What can we do to deal with such distributional problems and outliers?

  • Removing the outliers - A bad option that is hard to defend, and reviewers don't like it. If at all, better show that the results are robust with and without the outliers; sometimes, however, they are not.
  • Change the distribution - Fit a model with a different, e.g. heavier-tailed, distribution (a GLM or similar); see the sketch after this list.
  • Robust regression.
  • Quantile regression - A special type of regression that does not assume a particular residual distribution.
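As a sketch of the "change the distribution" option, one could fit a heavy-tailed (Student-t) response. Assumption: t_family() is available in your glmmTMB version (see ?family_glmmTMB); how its degrees-of-freedom parameter is handled may depend on the version.

library(glmmTMB)
dat = data.frame(growth = growth, concentration = concentration)
# Student-t errors downweight the influence of extreme observations compared to a Gaussian fit
fitT = glmmTMB(growth ~ concentration, family = t_family(), data = dat)
summary(fitT)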

9.3.1 Robust regression

Robust methods generally refer to methods that are robust to violations of assumptions, e.g. outliers. More specifically, standard robust regressions typically downweight data points that have too high an influence on the fit. See https://cran.r-project.org/web/views/Robust.html for a list of robust packages in R.

# This is the classic method.
library(MASS)

fit = rlm(growth ~ concentration) 
summary(fit)

Call: rlm(formula = growth ~ concentration)
Residuals:
    Min      1Q  Median      3Q     Max 
-7.1986 -0.3724  0.0377  0.3391  7.0902 

Coefficients:
              Value   Std. Error t value
(Intercept)   -0.0978  0.0594    -1.6453
concentration  2.0724  0.1048    19.7721

Residual standard error: 0.534 on 98 degrees of freedom
# No p-values and not sure if we can trust the confidence intervals.
# Would need to bootstrap by hand!
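
# A sketch of such a bootstrap (nonparametric, resampling observations;
# 999 resamples is an arbitrary choice here):
bootSlope <- function(){
  sel = sample(length(growth), replace = TRUE)
  coef(rlm(growth[sel] ~ concentration[sel]))[2]
}
slopes = replicate(999, bootSlope())
quantile(slopes, c(0.025, 0.975))  # approximate 95% bootstrap CI for the slope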

# This is another option that gives us p-values directly.
library(robustbase)

fit = lmrob(growth ~ concentration) 
summary(fit)

Call:
lmrob(formula = growth ~ concentration)
 \--> method = "MM"
Residuals:
    Min      1Q  Median      3Q     Max 
-7.2877 -0.4311 -0.0654  0.2788  7.0384 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.04448    0.05160  -0.862    0.391    
concentration  2.00588    0.08731  22.974   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robust residual standard error: 0.5549 
Multiple R-squared:  0.8431,    Adjusted R-squared:  0.8415 
Convergence in 7 IRWLS iterations

Robustness weights: 
 9 observations c(27,40,47,52,56,76,80,91,100)
     are outliers with |weight| = 0 ( < 0.001); 
 5 weights are ~= 1. The remaining 86 ones are summarized as
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.6673  0.9015  0.9703  0.9318  0.9914  0.9989 
Algorithmic parameters: 
       tuning.chi                bb        tuning.psi        refine.tol 
        1.548e+00         5.000e-01         4.685e+00         1.000e-07 
          rel.tol         scale.tol         solve.tol       eps.outlier 
        1.000e-07         1.000e-10         1.000e-07         1.000e-03 
            eps.x warn.limit.reject warn.limit.meanrw 
        1.819e-12         5.000e-01         5.000e-01 
     nResample         max.it       best.r.s       k.fast.s          k.max 
           500             50              2              1            200 
   maxit.scale      trace.lev            mts     compute.rd fast.s.large.n 
           200              0           1000              0           2000 
                  psi           subsampling                   cov 
           "bisquare"         "nonsingular"         ".vcov.avar1" 
compute.outlier.stats 
                 "SM" 
seed : int(0) 

9.3.2 Quantile regression

Quantile regressions don't fit a line with an error distribution spreading around it, but instead fit a given quantile (e.g. the 0.5 quantile, i.e. the median) directly, regardless of the distribution. Thus, they work even if the usual distributional assumptions don't hold.

library(qgam)

dat = data.frame(growth = growth, concentration = concentration)

fit = qgam(growth ~ concentration, data = dat, qu = 0.5) 
Estimating learning rate. Each dot corresponds to a loss evaluation. 
qu = 0.5................done 
summary(fit)

Family: elf 
Link function: identity 

Formula:
growth ~ concentration

Parametric coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.08167    0.05823  -1.403    0.161    
concentration  2.04781    0.09500  21.556   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


R-sq.(adj) =  0.427   Deviance explained = 48.8%
-REML = 157.82  Scale est. = 1         n = 100
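
Often one is interested in more than just the median. Several quantiles can be fitted jointly with qgam's mqgam function and extracted per quantile with qdo; the following is a sketch (check ?mqgam and ?qdo for the exact interface):

library(qgam)
fitM = mqgam(growth ~ concentration, data = dat, qu = c(0.1, 0.5, 0.9))

plot(growth ~ concentration, data = dat)
for(q in c(0.1, 0.5, 0.9)){
  pred = qdo(fitM, q, predict, newdata = dat)   # predictions for quantile q
  lines(sort(dat$concentration), pred[order(dat$concentration)])
}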