library(cito)
7 Artificial Neural Networks
Artificial neural networks are biologically inspired: inputs are processed by weights (the synapses), the signals then accumulate at hidden nodes (the neurons), and only if the summed activation of a neuron exceeds a certain threshold is the signal passed on.
cito allows us to fit fully-connected neural networks within one line of code. For other tasks such as image recognition, we have to use frameworks with higher flexibility such as keras or torch.
Neural networks are harder to optimize (they are optimized via backpropagation and gradient descent), and you should be familiar with a few hyperparameters that control the optimization:
Hyperparameter | Meaning | Range |
---|---|---|
learning rate | the step size of the parameter updating in the iterative optimization routine, if too high, the optimizer will step over good local optima, if too small, the optimizer will be stuck in a bad local optima | [0.00001, 0.5] |
batch size | NNs are optimized via stochastic gradient descent, i.e. only a batch of the data is used to update the parameters at a time | Depends on the data: 10-250 |
epoch | the data is fed into the optimization in batches, once the entire data set has been used in the optimization, the epoch is complete (so e.g. n = 100, batch size = 20, it takes 5 steps to complete an epoch) | 100+ (use early stopping) |
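To make the relationship between batch size and epochs concrete, here is a small base-R calculation that mirrors the numbers from the epoch row of the table (the variable names are chosen just for this sketch):
n = 100                                  # number of observations
batchsize = 20                           # observations used per parameter update
steps_per_epoch = ceiling(n / batchsize)
steps_per_epoch                          # 5 updates are needed to complete one epoch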
Example:
data = airquality[complete.cases(airquality),]
data = scale(data)
model = dnn(Ozone~.,
            hidden = c(10L, 10L), # architecture: number of hidden layers and nodes in each layer
            activation = c("selu", "selu"), # activation functions for the hidden layers
            loss = "mse", lr = 0.01, data = data, epochs = 150L, verbose = FALSE)
plot(model)
summary(model)
Deep Neural Network Model summary
Model generated on basis of:
Feature Importance:
variable importance
1 Solar.R 1.203977
2 Wind 1.989396
3 Temp 3.109649
4 Month 1.130403
5 Day 1.085794
The architecture of the NN can be specified by the hidden argument: it is a vector whose length corresponds to the number of hidden layers and whose entries give the number of hidden neurons in each layer (the same applies to the activation argument, which specifies the activation function of each hidden layer). It is hard to make general recommendations about the architecture; a rough rule of thumb is that the width of the hidden layers is more important than the depth of the NN.
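As an illustration only (a sketch, not a tuned architecture for this data set; model_wide is just a name chosen here), a wider and deeper network for the airquality example could be specified like this:
model_wide = dnn(Ozone~.,
                 hidden = c(50L, 50L, 50L),    # three hidden layers with 50 neurons each
                 activation = rep("relu", 3),  # relu activation in all hidden layers
                 loss = "mse", lr = 0.01, data = data, epochs = 150L, verbose = FALSE)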
The loss function has to be adjusted to the response type:
Loss | Type | Example |
---|---|---|
mse (mean squared error) | Regression | Numeric values |
mae (mean absolute error) | Regression | Numeric values, often used for skewed data |
softmax | Classification, multi-class | Species |
cross-entropy | Classification, binary or multi-class | Survived/not survived, multi-species/communities |
binomial | Classification, binary | Binomial likelihood |
poisson | Regression | Count data |
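As a sketch of how the loss argument changes for a classification task (using the built-in iris data, where Species is a factor with three levels; model_class is just a name chosen here):
model_class = dnn(Species~.,
                  hidden = c(10L, 10L),
                  activation = c("selu", "selu"),
                  loss = "softmax",  # multi-class classification
                  lr = 0.01, data = iris, epochs = 100L, verbose = FALSE)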
cito visualizes the training (see graphic). The reason for this is that the training can easily fail if the learning rate (lr) is poorly chosen. If the lr is too high, the optimizer “jumps” over good local optima, while it gets stuck in local optima if the lr is too small:
model = dnn(Ozone~.,
            hidden = c(10L, 10L),
            activation = c("selu", "selu"),
            loss = "mse", lr = 0.4, data = data, epochs = 150L, verbose = FALSE)
If the learning rate is too high, the training will either fail right away (because the loss jumps to infinity) or the loss will be very wiggly and will not decrease over the epochs.
model = dnn(Ozone~.,
            hidden = c(10L, 10L),
            activation = c("selu", "selu"),
            loss = "mse", lr = 0.0001, data = data, epochs = 150L, verbose = FALSE)
If the learning rate is too low, the loss decreases only very slowly and may appear not to decrease at all within the given number of epochs.
Adjusting / reducing the learning rate during training is a common approach in neural networks. The idea is to start with a larger learning rate and then steadily decrease it during training, either on a fixed schedule or based on the training progress:
model = dnn(Ozone~.,
            hidden = c(10L, 10L),
            activation = c("selu", "selu"),
            loss = "mse",
            lr = 0.1,
            lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1),
            # reduce the learning rate every 30 epochs (new lr = 0.1 * old lr)
            data = data, epochs = 150L, verbose = FALSE)
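To see what this step schedule does, the decay rule can be worked out by hand in base R (independent of cito; the lr is multiplied by gamma = 0.1 every step_size = 30 epochs):
epochs = 0:149
lr_path = 0.1 * 0.1^floor(epochs / 30)  # step decay: lr * gamma^(number of completed steps)
unique(lr_path)                         # 0.1, 0.01, 0.001, 1e-04, 1e-05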
7.1 Regularization
We can use \(\lambda\) and \(\alpha\) to put L1 and L2 regularization on the weights of our NN; \(\lambda\) controls the overall strength of the regularization and \(\alpha\) the mixing between the L1 and L2 penalties (elastic net):
model = dnn(Ozone~.,
            hidden = c(10L, 10L),
            activation = c("selu", "selu"),
            loss = "mse",
            lr = 0.05,
            lambda = 0.1,
            alpha = 0.5,
            lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1),
            # reduce the learning rate every 30 epochs (new lr = 0.1 * old lr)
            data = data, epochs = 150L, verbose = FALSE)
summary(model)
Deep Neural Network Model summary
Model generated on basis of:
Feature Importance:
variable importance
1 Solar.R 1.170810
2 Wind 1.854874
3 Temp 2.703980
4 Month 1.001610
5 Day 1.014565
Be careful that you don't accidentally set all weights to 0 because the regularization is too strong. We check the weights of the first layer:
fields::image.plot(coef(model)[[1]][[1]]) # weights of the first layer
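As a quick follow-up (a sketch based on the coefficient structure used above; W1 is just a name chosen here), we can also count how many first-layer weights were shrunk to (almost) zero:
W1 = coef(model)[[1]][[1]]   # weight matrix of the first layer
sum(abs(W1) < 1e-3)          # number of weights that are effectively zero
mean(abs(W1) < 1e-3)         # fraction of weights that are effectively zero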