7  Artificial Neural Networks

Artificial neural networks are biologically inspired: inputs are weighted (analogous to synapses), the weighted signals accumulate at hidden nodes (the neurons), and only if the summed activation of a neuron exceeds a certain threshold is the signal passed on (along the axon).

library(cito)

cito allows us to fit fully-connected neural networks in one line of code. For other tasks such as image recognition, we have to switch to more flexible frameworks such as keras or torch.

Neural networks are harder to optimize (they are trained via backpropagation and gradient descent), and you should be familiar with a few hyperparameters that control the optimization:

Hyperparameter | Meaning | Range
learning rate | Step size of the parameter updates in the iterative optimization routine; if too high, the optimizer steps over good local optima, if too low, it gets stuck in a bad local optimum | [0.00001, 0.5]
batch size | NNs are optimized via stochastic gradient descent, i.e. only a batch of the data is used at a time to update the parameters | Depends on the data: 10-250
epochs | The data is fed into the optimization in batches; once the entire data set has been used, one epoch is complete (e.g. with n = 100 and batch size = 20, an epoch takes 5 steps) | 100+ (use early stopping)
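
To make the batch and epoch bookkeeping concrete, here is the arithmetic from the table as a quick sketch in plain R (independent of cito):

n = 100            # number of observations
batch_size = 20    # observations used per parameter update
steps_per_epoch = ceiling(n / batch_size)  # 5 updates to see the whole data once
epochs = 150
steps_per_epoch * epochs                   # 750 parameter updates in total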

Example:

data = airquality[complete.cases(airquality),]
data = scale(data)

model = dnn(Ozone~., 
            hidden = c(10L, 10L), # Architecture, number of hidden layers and nodes in each layer
            activation = c("selu", "selu"), # activation functions for the specific hidden layer
            loss = "mse", lr = 0.01, data = data, epochs = 150L, verbose = FALSE)
plot(model)

summary(model)
Deep Neural Network Model summary
Model generated on basis of: 
Feature Importance:
  variable importance
1  Solar.R   1.203977
2     Wind   1.989396
3     Temp   3.109649
4    Month   1.130403
5      Day   1.085794

The architecture of the NN can be specified via the hidden argument: it is a vector whose length corresponds to the number of hidden layers and whose entries give the number of hidden neurons in each layer (the same applies to the activation argument, which specifies the activation function of each hidden layer). It is hard to give general recommendations about the architecture; a rough rule of thumb is that the width of the hidden layers is more important than the depth of the NN.
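
For example, the following two calls fit a wide, shallow network and a narrow, deep network on the same data; this is just an illustration, all other arguments are taken from the example above:

# wide and shallow: one hidden layer with 100 nodes
model_wide = dnn(Ozone~., hidden = c(100L), activation = c("selu"),
                 loss = "mse", lr = 0.01, data = data, epochs = 150L, verbose = FALSE)

# narrow and deep: five hidden layers with 10 nodes each
model_deep = dnn(Ozone~., hidden = rep(10L, 5), activation = rep("selu", 5),
                 loss = "mse", lr = 0.01, data = data, epochs = 150L, verbose = FALSE)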

The loss function has to be adjusted to the response type:

Loss | Type | Example
mse (mean squared error) | Regression | Numeric values
mae (mean absolute error) | Regression | Numeric values, often used for skewed data
softmax | Classification, multi-class | Species
cross-entropy | Classification, binary or multi-class | Survived/not survived, multi-species/communities
binomial | Classification, binary or multi-class | Binomial likelihood
poisson | Regression | Count data
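
For example, a multi-class classification (loss = "softmax" in the table above) could look like the following sketch, which uses the iris data purely for illustration:

iris_scaled = iris
iris_scaled[,1:4] = scale(iris_scaled[,1:4])

model_iris = dnn(Species~., 
                 hidden = c(10L, 10L), 
                 activation = c("selu", "selu"), 
                 loss = "softmax", lr = 0.01, data = iris_scaled, 
                 epochs = 100L, verbose = FALSE)
head(predict(model_iris, newdata = iris_scaled)) # predictions for the three species
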
Importance of the learning rate

cito visualizes the training progress (see the loss plot) because training can easily fail if the learning rate (lr) is poorly chosen: if the lr is too high, the optimizer "jumps" over good local optima, while it gets stuck in a bad local optimum if the lr is too small:

model = dnn(Ozone~., 
            hidden = c(10L, 10L), 
            activation = c("selu", "selu"), 
            loss = "mse", lr = 0.4, data = data, epochs = 150L, verbose = FALSE)

If the learning rate is too high, training will either fail outright (because the loss jumps to infinity) or the loss will be very wiggly and will not decrease over the epochs.

model = dnn(Ozone~., 
            hidden = c(10L, 10L), 
            activation = c("selu", "selu"), 
            loss = "mse", lr = 0.0001, data = data, epochs = 150L, verbose = FALSE)

If the learning rate is too low, the parameter updates are too small and the loss barely decreases over the epochs.
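
The same behavior can be reproduced outside of a neural network. The following toy example (plain R, independent of cito) runs gradient descent on the one-dimensional loss f(x) = x^2 with different step sizes:

grad_descent = function(lr, steps = 20, x = 5) {
  for (i in 1:steps) x = x - lr * 2 * x  # gradient of f(x) = x^2 is 2x
  x
}
grad_descent(lr = 1.1)     # too high: the iterate overshoots and diverges
grad_descent(lr = 0.0001)  # too low: after 20 steps x has barely moved from 5
grad_descent(lr = 0.1)     # reasonable: x approaches the minimum at 0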

Learning rate scheduler

Adjusting (usually reducing) the learning rate during training is a common approach for neural networks. The idea is to start with a larger learning rate and then steadily decrease it during training, either on a fixed schedule or based on a property of the training, e.g. when the validation loss stops improving:

model = dnn(Ozone~., 
            hidden = c(10L, 10L), 
            activation = c("selu", "selu"), 
            loss = "mse", 
            lr = 0.1,
            lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1),
            # reduce the learning rate every 30 epochs (new lr = 0.1 * old lr)
            data = data, epochs = 150L, verbose = FALSE)
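
Alternatively, the learning rate can be reduced based on a property of the training, e.g. when the monitored loss stops improving; this "reduce_on_plateau" scheduler is also used in the exercises below. A sketch with otherwise unchanged arguments:

model = dnn(Ozone~., 
            hidden = c(10L, 10L), 
            activation = c("selu", "selu"), 
            loss = "mse", 
            lr = 0.1,
            validation = 0.2, # hold out 20% of the data to monitor the validation loss
            lr_scheduler = config_lr_scheduler("reduce_on_plateau", patience = 10, factor = 0.9),
            # if the monitored loss has not improved for 10 epochs, new lr = 0.9 * old lr
            data = data, epochs = 150L, verbose = FALSE)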

7.1 Regularization

We can use the lambda and alpha arguments to put L1 and L2 regularization on the weights of our NN (lambda controls the overall regularization strength, alpha the mix between L1 and L2):

model = dnn(Ozone~., 
            hidden = c(10L, 10L), 
            activation = c("selu", "selu"), 
            loss = "mse", 
            lr = 0.05,
            lambda = 0.1,
            alpha = 0.5,
            lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1),
            # reduce the learning rate every 30 epochs (new lr = 0.1 * old lr)
            data = data, epochs = 150L, verbose = FALSE)

summary(model)
Deep Neural Network Model summary
Model generated on basis of: 
Feature Importance:
  variable importance
1  Solar.R   1.170810
2     Wind   1.854874
3     Temp   2.703980
4    Month   1.001610
5      Day   1.014565

Be careful not to accidentally shrink all weights to 0 by setting the regularization too high. We can check the weights of the first layer:

fields::image.plot(coef(model)[[1]][[1]]) # weights of the first layer
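
A quick numerical check of the same thing, i.e. how many first-layer weights have been shrunk to (almost) zero; the threshold of 0.01 is an arbitrary choice for this sketch:

W1 = coef(model)[[1]][[1]]   # weight matrix of the first layer
mean(abs(W1) < 0.01)         # share of weights that are effectively zero
range(W1)                    # range of the weight values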

7.2 Exercise

Question: Hyperparameter tuning - Titanic dataset

Tune architecture

  • Play around with the architecture and try to improve the AUC on the submission server

Bonus:

  • Tune the architecture! (depth and width of the NN via the hidden argument)
library(EcoData)
library(dplyr)

library(missRanger)
data(titanic_ml)
data = titanic_ml
data = 
  data %>% select(survived, sex, age, fare, pclass)
data[,-1] = missRanger(data[,-1], verbose = 0)

data_sub =
  data %>%
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$survived),] # data with known response


model = dnn(survived~., 
          hidden = c(10L, 10L), # change
          activation = c("selu", "selu"), # change
          loss = "binomial", 
          lr = 0.05, #change
          validation = 0.2,
          lambda = 0.001, # change
          alpha = 0.1, # change
          lr_scheduler = config_lr_scheduler("reduce_on_plateau", patience = 10, factor = 0.9),
          data = data_obs, epochs = 40L, verbose = TRUE, plot= TRUE)
Loss at epoch 1: training: 0.726, validation: 0.659, lr: 0.05000

Loss at epoch 2: training: 0.666, validation: 0.660, lr: 0.05000
Loss at epoch 3: training: 0.655, validation: 0.722, lr: 0.05000
Loss at epoch 4: training: 0.634, validation: 0.635, lr: 0.05000
Loss at epoch 5: training: 0.621, validation: 0.657, lr: 0.05000
Loss at epoch 6: training: 0.608, validation: 0.574, lr: 0.05000
Loss at epoch 7: training: 0.598, validation: 0.609, lr: 0.05000
Loss at epoch 8: training: 0.575, validation: 0.639, lr: 0.05000
Loss at epoch 9: training: 0.570, validation: 0.534, lr: 0.05000
Loss at epoch 10: training: 0.551, validation: 0.526, lr: 0.05000
Loss at epoch 11: training: 0.547, validation: 0.503, lr: 0.05000
Loss at epoch 12: training: 0.537, validation: 0.487, lr: 0.05000
Loss at epoch 13: training: 0.534, validation: 0.510, lr: 0.05000
Loss at epoch 14: training: 0.529, validation: 0.539, lr: 0.05000
Loss at epoch 15: training: 0.524, validation: 0.483, lr: 0.05000
Loss at epoch 16: training: 0.521, validation: 0.528, lr: 0.05000
Loss at epoch 17: training: 0.534, validation: 0.490, lr: 0.05000
Loss at epoch 18: training: 0.519, validation: 0.520, lr: 0.05000
Loss at epoch 19: training: 0.526, validation: 0.616, lr: 0.05000
Loss at epoch 20: training: 0.529, validation: 0.492, lr: 0.05000
Loss at epoch 21: training: 0.530, validation: 0.513, lr: 0.05000
Loss at epoch 22: training: 0.523, validation: 0.485, lr: 0.05000
Loss at epoch 23: training: 0.523, validation: 0.481, lr: 0.05000
Loss at epoch 24: training: 0.520, validation: 0.490, lr: 0.05000
Loss at epoch 25: training: 0.526, validation: 0.462, lr: 0.05000
Loss at epoch 26: training: 0.516, validation: 0.571, lr: 0.05000
Loss at epoch 27: training: 0.533, validation: 0.621, lr: 0.05000
Loss at epoch 28: training: 0.515, validation: 0.523, lr: 0.05000
Loss at epoch 29: training: 0.521, validation: 0.665, lr: 0.05000
Loss at epoch 30: training: 0.517, validation: 0.531, lr: 0.05000
Loss at epoch 31: training: 0.510, validation: 0.629, lr: 0.05000
Loss at epoch 32: training: 0.515, validation: 0.570, lr: 0.05000
Loss at epoch 33: training: 0.514, validation: 0.492, lr: 0.05000
Loss at epoch 34: training: 0.511, validation: 0.584, lr: 0.05000
Loss at epoch 35: training: 0.518, validation: 0.618, lr: 0.05000
Loss at epoch 36: training: 0.519, validation: 0.569, lr: 0.04500
Loss at epoch 37: training: 0.508, validation: 0.489, lr: 0.04500
Loss at epoch 38: training: 0.512, validation: 0.576, lr: 0.04500
Loss at epoch 39: training: 0.512, validation: 0.482, lr: 0.04500
Loss at epoch 40: training: 0.511, validation: 0.626, lr: 0.04500
# Predictions:

predictions = predict(model, newdata = data_new)

write.csv(data.frame(y = predictions[,1]), file = "Max_titanic_ensemble.csv")
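
For the bonus task, a simple random search over the depth and width of the network (analogous to the tuning loop in the next exercise) could look like the following sketch; the holdout split, the number of tuning steps, and the candidate ranges are arbitrary choices:

set.seed(42)
split = sample(nrow(data_obs), 0.2*nrow(data_obs))
train = data_obs[-split, ]
test  = data_obs[split, ]

tuning_steps = 10
hyper_hidden = sample.int(5, tuning_steps, replace = TRUE) # number of hidden layers
hyper_nodes  = sample(seq(5, 100), size = tuning_steps)    # nodes per hidden layer

aucs = sapply(1:tuning_steps, function(k) {
  m = dnn(survived~.,
          hidden = rep(hyper_nodes[k], hyper_hidden[k]),
          activation = rep("selu", hyper_hidden[k]),
          loss = "binomial", lr = 0.05,
          data = train, epochs = 40L, verbose = FALSE, plot = FALSE)
  Metrics::auc(test$survived, predict(m, test)[,1])
})
cbind(hyper_hidden, hyper_nodes, AUC = aucs)[which.max(aucs), ] # best configuration
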
Question: Hyperparameter tuning - Plant-pollinator dataset

See Section A.2 for more information about the dataset.

Prepare the data:

library(EcoData)
library(missRanger)
library(dplyr)
data(plantPollinator_df)
plant_poll = plantPollinator_df

plant_poll_imputed = plant_poll %>% select(diameter,
                                           corolla,
                                           tongue,
                                           body,
                                           interaction,
                                           colour, 
                                           nectar,
                                           feeding,
                                           season)
# Impute missing values (the response variable interaction is excluded from the imputation)
plant_poll_imputed = missRanger::missRanger(data = plant_poll_imputed %>%
                                              select(-interaction), verbose = 0)

# scale numeric variables
plant_poll_imputed[,sapply(plant_poll_imputed, is.numeric)] = scale(plant_poll_imputed[,sapply(plant_poll_imputed, is.numeric)])

# Add the response back to the dataset after the imputation
plant_poll_imputed$interaction = plant_poll$interaction
plant_poll_imputed$colour = as.factor(plant_poll_imputed$colour)
plant_poll_imputed$nectar = as.factor(plant_poll_imputed$nectar)
plant_poll_imputed$feeding = as.factor(plant_poll_imputed$feeding)
plant_poll_imputed$season = as.factor(plant_poll_imputed$season)


data_new = plant_poll_imputed[is.na(plant_poll_imputed$interaction), ] # for which we want to make predictions at the end
data_obs = plant_poll_imputed[!is.na(plant_poll_imputed$interaction), ]# data with known response
dim(data_obs)
[1] 14690     9

The dataset is large: more than 10,000 observations. For now, let's switch to a simple holdout strategy for validating our model, e.g. use 80% of the data to train the model and 20% of the data to validate it.
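
Such a holdout split can be created with a single call to sample(); a minimal sketch (the same idea is used as outer_split in the minimal example further below):

set.seed(42)
holdout = sample(nrow(data_obs), 0.2 * nrow(data_obs))
train   = data_obs[-holdout, ]  # 80% for training
test    = data_obs[holdout, ]   # 20% for validation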

Moreover:

table(data_obs$interaction)

    0     1 
14095   595 

The data is strongly imbalanced, i.e. many 0s but only a few 1s. There are different strategies to deal with this, for example oversampling the 1s or undersampling the 0s.
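
Oversampling the 1s could, for example, be done by sampling the minority class with replacement until both classes are of similar size; a sketch only, we do not use it below:

ones  = which(data_obs$interaction == 1)
zeros = which(data_obs$interaction == 0)
data_over = data_obs[c(zeros, sample(ones, length(zeros), replace = TRUE)), ]
table(data_over$interaction)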

Undersampling the 0s:

data_obs = data_obs[c(sample(which(data_obs$interaction == 0), 2000), which(data_obs$interaction == 1)),]
table(data_obs$interaction)

   0    1 
2000  595 
data_obs$interaction = as.integer(data_obs$interaction)

Minimal example:

library(cito)
set.seed(42)
tuning_steps = 2
hyper_lambda = runif(tuning_steps,0.0001, 0.02)
hyper_alpha = runif(tuning_steps,0, 1.0)
hyper_hidden = sample.int(10, tuning_steps)
hyper_nodes = sample(seq(5, 100), size = tuning_steps)

outer_split = sample(nrow(data_obs), 0.2*nrow(data_obs))

results = data.frame(
  set = 1,
  lambda = rep(NA, 1),
  alpha = rep(NA, 1),
  hidden = rep(NA, 1),
  nodes = rep(NA, 1),
  AUC = rep(NA, 1)
)

train_outer = data_obs[-outer_split, ]
test_outer = data_obs[outer_split, ]

tuning_results = 
    sapply(1:length(hyper_lambda), function(k) {
      model = dnn(interaction~., 
          hidden = rep(hyper_nodes[k], hyper_hidden[k]), 
          activation = rep("selu", hyper_hidden[k]), 
          loss = "binomial", 
          lr = 0.05,
          lambda = hyper_lambda[k],
          alpha = hyper_alpha[k],
          batchsize = 100L, # increasing the batch size will reduce the runtime
          lr_scheduler = config_lr_scheduler("reduce_on_plateau", patience = 10, factor = 0.9),
          data = train_outer, epochs = 50L, verbose = FALSE, plot= FALSE)
      return(Metrics::auc(test_outer$interaction, predict(model, test_outer )[,1]))
    })
results[1, 1] = 1
results[1, 2] =  hyper_lambda[which.max(tuning_results)]
results[1, 3] =  hyper_alpha[which.max(tuning_results)]
results[1, 4] =  hyper_hidden[which.max(tuning_results)]
results[1, 5] =  hyper_nodes[which.max(tuning_results)]  
results[1, 6] = max(tuning_results)

print(results)
  set     lambda     alpha hidden nodes AUC
1   1 0.01830464 0.2861395     10    22 0.5

Make predictions:

k = 1
model = dnn(interaction~., 
    hidden = rep(results$nodes[k], results$hidden[k]), 
    activation = rep("selu", results$hidden[k]), 
    loss = "binomial", 
    lr = 0.05,
    lambda = results$lambda[k],
    alpha = results$alpha[k],
    batchsize = 100L, # increasing the batch size will reduce the runtime
    lr_scheduler = config_lr_scheduler("reduce_on_plateau", patience = 10, factor = 0.9),
    data = train_outer, epochs = 50L, verbose = FALSE, plot= FALSE)

predictions = predict(model, newdata = data_new)[,1]

write.csv(data.frame(y = predictions), file = "Max_plant_poll_ensemble.csv")