3  Bias-variance trade-off

3.1 Understanding the bias-variance trade-off

Which of the following statements about the bias-variance trade-off is correct? (see figure above)

3.2 Optimizing the bias-variance trade-off

Optimizing the bias-variance trade-off means adjusting the complexity of the model which can be achieved by:

  • Feature selection (adding more features increases the flexibility of the model)

  • Regularization

3.2.1 Feature selection

Adding features increases the flexibility of the model and the goodness of fit:

library(mlbench)
library(dplyr)

data(BostonHousing)
data = BostonHousing

summary(lm(medv~rm, data = data))

Call:
lm(formula = medv ~ rm, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.346  -2.547   0.090   2.986  39.433 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -34.671      2.650  -13.08   <2e-16 ***
rm             9.102      0.419   21.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.616 on 504 degrees of freedom
Multiple R-squared:  0.4835,    Adjusted R-squared:  0.4825 
F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16
summary(lm(medv~rm+dis, data = data))$r.squared
[1] 0.4955246
summary(lm(medv~., data = data))$r.squared
[1] 0.7406427
# Main effects + all potential interactions:
summary(lm(medv~.^2, data = data))$r.squared
[1] 0.9211876

The model with all features and their potential interactions has the highest \(R^2\), but it also has the highest uncertainty, because there are on average only about 5 observations for each parameter (92 parameters and 506 observations). So how do we decide which level of complexity is appropriate for our task? For the data we use to train the model, \(R^2\) will always improve with higher model complexity, so it is a poor decision criterion. We will show this in Section 3.2.4. In short, the idea is that we need to split the data so that we have an evaluation (test) dataset that wasn't used to train the model, which we can then use to check whether our model generalizes well to new data.
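
As a quick check of the parameter count mentioned above:

# Count the estimated coefficients of the full interaction model:
length(coef(lm(medv ~ .^2, data = data)))  # 92 coefficients for 506 observations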

3.2.2 Regularization

Regularization means adding information or structure to a system in order to solve an ill-posed optimization problem or to prevent overfitting. There are many ways of regularizing a machine learning model. The most important distinction is between shrinkage estimators and estimators based on model averaging.

Shrinkage estimators are based on the idea of adding a penalty to the loss function that penalizes deviations of the model parameters from a particular value (typically 0). In this way, estimates are “shrunk” towards the specified default value. In practice, the most important penalties are the least absolute shrinkage and selection operator (Lasso or LASSO), where the penalty is proportional to the sum of absolute deviations (\(L1\) penalty), and Tikhonov regularization, also known as Ridge regression, where the penalty is proportional to the sum of squared deviations from the reference value (\(L2\) penalty). Thus, the loss function that we optimize is given by

\[ loss = fit + \lambda \cdot d \]

where fit refers to the standard loss function, \(\lambda\) is the strength of the regularization, and \(d\) is the chosen penalty metric, e.g. \(L1\) or \(L2\):

\[ loss_{L1} = fit + \lambda \cdot \Vert weights \Vert_1 \]

\[ loss_{L2} = fit + \lambda \cdot \Vert weights \Vert_2^2 \]

\(\lambda\) and possibly \(d\) are typically optimized under cross-validation. The \(L1\) and \(L2\) penalties can also be combined, which is then called the elastic net (see Zou and Hastie (2005)).
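
To make this concrete, here is a minimal sketch (simulated data, base R optim) that fits a line under an \(L2\) penalty by hand; in practice you would use a package such as glmnet, as shown below:

set.seed(42)
x = rnorm(100)
y = 2 * x + rnorm(100)

# Penalized loss: residual sum of squares plus the L2 penalty on the slope.
penalized_loss = function(w, lambda) {
  sum((y - w[1] - w[2] * x)^2) + lambda * w[2]^2
}

# A stronger penalty shrinks the slope estimate towards zero:
optim(c(0, 0), penalized_loss, lambda = 0)$par
optim(c(0, 0), penalized_loss, lambda = 100)$par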

Model averaging refers to an entire set of techniques, including boosting, bagging and other averaging techniques. The general principle is that predictions are made by combining (= averaging) several models. This is based on the insight that it is often more efficient to have many simpler models and average them than to build one “super model”. The reasons are complicated, and explained in more detail in Dormann et al. (2018).

A particularly important application of averaging is boosting, where the idea is that many weak learners are combined into a model average, resulting in a strong learner. Another related method is bootstrap aggregating, also called bagging. The idea here is to bootstrap the data (i.e. use random sampling with replacement) and average the predictions of the models fitted to the bootstrap samples.
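
As a minimal sketch of bagging, using linear models on the BostonHousing data as (purely illustrative) base learners:

# Bagging: fit a model on each bootstrap sample and average the predictions.
library(mlbench)
data(BostonHousing)

preds = sapply(1:100, function(i) {
  boot = BostonHousing[sample.int(nrow(BostonHousing), replace = TRUE), ]
  predict(lm(medv ~ ., data = boot), newdata = BostonHousing)
})
bagged_prediction = rowMeans(preds)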

To see how these techniques work in practice, let's first focus on LASSO and Ridge regularization for weights in neural networks. We can imagine that the LASSO and Ridge act like a rubber band on the weights that pulls them to zero if the data does not strongly push them away from zero. This leads to important weights, which are supported by the data, being estimated as different from zero, whereas unimportant model structures are reduced (shrunk) to zero.

LASSO \(\left(penalty \propto \sum_{}^{} \mathrm{abs}(weights) \right)\) and Ridge \(\left(penalty \propto \sum_{}^{} weights^{2} \right)\) have slightly different properties. They are best understood if we express them as the effective prior preference they create on the parameters:
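
We can sketch these implied prior preferences ourselves; the \(L1\) penalty corresponds to a Laplace prior and the \(L2\) penalty to a Gaussian prior (densities shown unnormalized):

w = seq(-5, 5, length.out = 200)
plot(w, exp(-abs(w)), type = "l", xlab = "weight", ylab = "prior density")  # LASSO: Laplace
lines(w, exp(-w^2 / 2), lty = 2)                                            # Ridge: Gaussian
legend("topright", c("LASSO (Laplace)", "Ridge (Gaussian)"), lty = 1:2)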

As you can see, the LASSO creates a very strong preference towards exactly zero, but falls off less strongly towards the tails. This means that parameters tend to be estimated as exactly zero or, if not, they are left more free than under the Ridge. For this reason, the LASSO is often interpreted more as a model selection method.

The Ridge, on the other hand, has a certain area around zero where it is relatively indifferent to deviations from zero, thus rarely leading to exactly zero values. However, it creates a stronger shrinkage for values that deviate strongly from zero.

3.2.2.1 Ridge - Example

We can use the glmnet package for Ridge, LASSO, and elastic-net regressions.

We want to predict the house prices of Boston (see the help page of the dataset):

library(mlbench)
library(dplyr)
library(glmnet)
data(BostonHousing)
data = BostonHousing
Y = data$medv
X = data %>% select(-medv, -chas) %>% scale()

# Distribution of pairwise correlations among the predictors:
hist(cor(X))

m1 = glmnet(y = Y, x = X, alpha = 0)

The glmnet function automatically fits the model for a whole sequence of \(\lambda\) values; with the argument s we can extract the coefficients for specific values of \(\lambda\):

cbind(coef(m1, s = 0.001), coef(m1, s = 100.5))
13 x 2 sparse Matrix of class "dgCMatrix"
                     s1          s1
(Intercept) 22.53280632 22.53280632
crim        -0.79174957 -0.21113427
zn           0.76313031  0.18846808
indus       -0.17037817 -0.25120998
nox         -1.32794787 -0.21314250
rm           2.85780876  0.46463202
age         -0.05389395 -0.18279762
dis         -2.38716188  0.07906631
rad          1.42772476 -0.17967948
tax         -1.09026758 -0.24233282
ptratio     -1.93105019 -0.31587466
b            0.86718037  0.18764060
lstat       -3.43236617 -0.46055837
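
The fitted \(\lambda\) sequence itself is stored in the model object, so we can inspect which values were tried:

length(m1$lambda)  # number of lambda values on the regularization path
range(m1$lambda)   # smallest and largest lambda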

3.2.2.2 LASSO - Example

By changing \(\alpha\) to 1.0 we use a LASSO instead of a Ridge regression:

m2 = glmnet(y = Y, x = X, alpha = 1.0)
cbind(coef(m2, s = 0.001), coef(m2, s = 0.5))
13 x 2 sparse Matrix of class "dgCMatrix"
                     s1           s1
(Intercept) 22.53280632 22.532806324
crim        -0.95543108 -0.135047323
zn           1.06718108  .          
indus        0.21519500  .          
nox         -1.95945910 -0.000537715
rm           2.71666891  2.998520195
age          0.05184895  .          
dis         -3.10566908 -0.244045205
rad          2.73963771  .          
tax         -2.20279273  .          
ptratio     -2.13052857 -1.644234575
b            0.88420283  0.561686909
lstat       -3.80177809 -3.682148016

3.2.2.3 Elastic-net - Example

By setting \(\alpha\) to a value between 0 and 1.0, we use a combination of LASSO and Ridge:

m3 = glmnet(y = Y, x = X, alpha = 0.5)
cbind(coef(m3, s = 0.001), coef(m3, s = 0.5))
13 x 2 sparse Matrix of class "dgCMatrix"
                     s1         s1
(Intercept) 22.53280632 22.5328063
crim        -0.95716118 -0.3488473
zn           1.06836343  0.1995842
indus        0.21825187  .        
nox         -1.96211736 -0.7613698
rm           2.71859592  3.0137090
age          0.05299551  .        
dis         -3.10330132 -1.3011740
rad          2.73321635  .        
tax         -2.19638611  .        
ptratio     -2.13041090 -1.8051547
b            0.88458269  0.6897165
lstat       -3.79836182 -3.6136853

3.2.3 Hyperparameters

Generally, parameters such as \(\lambda\) and \(\alpha\) that control the complexity of the model, or other parameters that control the learning or the optimization, are called hyperparameters. Coming back to our glmnet example:

We can plot the effect of the regularization on the coefficient estimates:

plot(m1)

So which lambda should we choose now? Let's calculate the model fit for different lambdas (e.g. using the RMSE) on the data the model was trained on:

lambdas = seq(0.001, 1.5, length.out = 100)
RMSEs = 
  sapply(lambdas, function(l) {
    prediction = predict(m1, newx = X, s = l)
    RMSE = Metrics::rmse(Y, prediction)
    return(RMSE)
    })
plot(lambdas, RMSEs)

We see that the lowest lambda achieved the lowest RMSE - which is not surprising because the unconstrained model, i.e. the most complex model, has the highest fit on the training data: low bias, but probably high variance (with respect to the bias-variance trade-off).

3.2.3.1 Split data into training and testing

We want a model that generalizes well to new data, which we need to “simulate” here by splitting off a holdout set before training and then using this holdout to test our model:

set.seed(1)
library(mlbench)
library(dplyr)
library(glmnet)
data(BostonHousing)
data = BostonHousing
Y = data$medv
X = data %>% select(-medv, -chas) %>% scale()

# Split data (here: 20% training, 80% testing)
indices = sample.int(nrow(X), 0.2*nrow(X))
train_X = X[indices,]
test_X = X[-indices,]
train_Y = Y[indices]
test_Y = Y[-indices]

# Train model on train data
m1 = glmnet(y = train_Y, x = train_X, alpha = 0.5)

# Test model on test data
pred = predict(m1, newx = test_X, s = 0.01)

# Calculate performance on test data
Metrics::rmse(test_Y, pred)
[1] 5.063774

Let’s do it again for different values of lambdas:

lambdas = seq(0.0000001, 0.5, length.out = 100)
RMSEs = 
  sapply(lambdas, function(l) {
    prediction = predict(m1, newx = test_X, s = l)
    return(Metrics::rmse(test_Y, prediction))
    })
plot(lambdas, RMSEs, xlab = "Lambda", ylab = "RMSE", type = "l", las = 2)
abline(v = lambdas[which.min(RMSEs)], col = "red", lwd = 1.5)

Hyperparameter tuning describes the process of finding the optimal set of hyperparameters for a certain task. Optimal hyperparameters are usually data specific, so they have to be tuned for each dataset.

If we do only one split, it could happen that we find a set of hyperparameters that is best suited for this specific split only. We therefore usually do several splits, so that each observation appears exactly once in a test dataset; this procedure is called cross-validation.
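
A minimal manual sketch of such a 5-fold cross-validation for one fixed \(\lambda\) value (the cv.glmnet function used in the next section does this, plus the search over \(\lambda\), for us):

# Manual 5-fold cross-validation for a single lambda value:
k = 5
folds = sample(rep(1:k, length.out = nrow(X)))  # assign each observation to a fold
cv_rmse = sapply(1:k, function(i) {
  m = glmnet(y = Y[folds != i], x = X[folds != i, ], alpha = 0.5)
  pred = predict(m, newx = X[folds == i, ], s = 0.01)
  Metrics::rmse(Y[folds == i], pred)
})
mean(cv_rmse)  # average error over the k test folds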

3.2.4 Cross-validation

The cv.glmnet function performs a k-fold cross-validation; here we use 5 folds (so 5 splits), and within each split different values for \(\lambda\) are tested:

m1 = glmnet::cv.glmnet(x = X, y = Y, alpha = 0.5, nfolds = 5)
m1

Call:  glmnet::cv.glmnet(x = X, y = Y, nfolds = 5, alpha = 0.5) 

Measure: Mean-Squared Error 

    Lambda Index Measure    SE Nonzero
min 0.0105    78   23.80 3.247      12
1se 0.6905    33   26.88 4.014       8
plot(m1)

m1$lambda.min
[1] 0.01049538

So low values of \(\lambda\) seem to achieve the lowest error, and thus the highest predictive performance.

This procedure of optimizing \(\lambda\) under cross-validation is an example of hyperparameter tuning.

Dormann, Carsten F, Justin M Calabrese, Gurutzeta Guillera-Arroita, Eleni Matechou, Volker Bahn, Kamil Bartoń, Colin M Beale, et al. 2018. “Model Averaging in Ecology: A Review of Bayesian, Information-Theoretic, and Tactical Approaches for Predictive Inference.” Ecological Monographs 88 (4): 485–504.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20.