7 Artificial Neural Networks
Artificial neural networks are biologically inspired: the inputs are processed by weights (the synapses), the weighted signals then accumulate at hidden nodes (the neurons), and only if the summed activation of a neuron exceeds a certain threshold is the signal passed on (via the axon).
7.1 Loss
Tasks such as regression and classification are fundamentally different; the former has continuous responses, while the latter has a discrete response. In ML algorithms, these different tasks can be represented by different loss functions. Moreover, the tasks can differ even within regression or classification (e.g., in classification, we have binary classification (0 or 1) or multi-class classification (0, 1, or 2)). As a result, especially in DL, we have different specialized loss functions available for specific response types. The table below shows a list of supported loss functions in cito:
Loss | Type | Example |
---|---|---|
mse (mean squared error) | Regression | Numeric values |
mae (mean absolute error) | Regression | Numeric values, often used for skewed data |
softmax | Classification, multi-class | Species |
cross-entropy | Classification, binary or multi-class | Survived/non-survived, multi-species/communities |
binomial | Classification, binary | Presence/absence (0/1), successes out of trials |
poisson | Regression | Count data |
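For example, a count response would use the poisson loss. Here is a minimal sketch on simulated data (the data and object names are made up purely for illustration):

library(cito)
set.seed(1)

# hypothetical example data: counts that depend on one predictor
dat = data.frame(x = runif(200))
dat$counts = rpois(200, lambda = exp(0.5 + 2 * dat$x))

# fit a small DNN with the "poisson" loss from the table above
fit_pois = dnn(counts ~ x, data = dat, loss = "poisson", verbose = FALSE)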
In the iris data, we model Species, which has 3 response levels, so this is what we call multi-class classification. It requires a softmax link and a cross-entropy loss function; in cito we specify this by using the softmax loss:
library(cito)
model <- dnn(Species~., data = datasets::iris, loss = "softmax", verbose = FALSE)
head(predict(model, type = "response"))
setosa versicolor virginica
[1,] 0.9956944 0.004305612 5.349870e-11
[2,] 0.9916688 0.008331116 2.680886e-10
[3,] 0.9944936 0.005506448 1.437312e-10
[4,] 0.9900427 0.009957277 6.517569e-10
[5,] 0.9960542 0.003945861 5.230870e-11
[6,] 0.9940950 0.005904986 1.127888e-10
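The predictions are class probabilities (one column per species). A small sketch of how to turn them into predicted class labels and compare them against the observed species (here on the training data):

probs = predict(model, type = "response")                 # matrix of class probabilities
pred_class = colnames(probs)[apply(probs, 1, which.max)]  # most probable class per observation
table(predicted = pred_class, observed = datasets::iris$Species)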
7.2 Validation
A holdout, or validation, dataset is important for detecting (and preventing) overfitting. In cito, we can directly tell the dnn function to automatically use a random subset of the data as validation data, which is evaluated after each epoch (each full pass of the optimization over the training data), allowing us to monitor the training:
data = airquality[complete.cases(airquality),] # DNN cannot handle NAs!
data = scale(data)

model = dnn(Ozone~.,
            validation = 0.2,
            loss = "mse", data = data, verbose = FALSE)
The validation argument ranges from 0 to 1 and gives the proportion of the data that should be used for validation.
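For example, with validation = 0.2 roughly 20% of the observations are held out and never used for the parameter updates:

nrow(data)               # 111 complete cases in airquality
round(0.2 * nrow(data))  # about 22 observations are used only for validation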
7.2.1 Baseline loss
Since training DNNs can be quite challenging, cito provides a baseline loss that is computed from an intercept-only model (e.g., just the mean of the response). The absolute minimum performance our DNN should achieve is to outperform this baseline model!
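For the mse loss, this baseline can be computed by hand as a sketch: it is simply the error we make when predicting the mean of the (scaled) response for every observation.

# intercept-only benchmark for the mse loss
baseline_mse = mean((data[, "Ozone"] - mean(data[, "Ozone"]))^2)
baseline_mse # the DNN's (validation) loss should end up clearly below this value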
7.3 Training parameters
In DL, the optimization (the training of the DNN) is challenging because we have to optimize up to millions of parameters (which are not really identifiable; it is accepted that the optimization does not find a global minimum but only a good local minimum). A few important hyperparameters affect only the optimization:
Hyperparameter | Meaning | Range |
---|---|---|
learning rate | the step size of the parameter updates in the iterative optimization routine; if too high, the optimizer steps over good local optima, if too small, the optimizer gets stuck in a bad local optimum | [0.00001, 0.5] |
batch size | NNs are optimized via stochastic gradient descent, i.e., only a batch of the data is used to update the parameters at a time | Depends on the data: 10-250 |
epoch | the data is fed into the optimization in batches; once the entire dataset has been used, one epoch is complete (e.g., with n = 100 and batch size = 20 it takes 5 steps to complete an epoch; see the sketch after this table) | 100+ (use early stopping) |
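To make the relationship between batch size, epochs, and the number of parameter updates concrete, a short sketch using the numbers from the epoch row above:

n = 100
batchsize = 20
epochs = 150
steps_per_epoch = ceiling(n / batchsize)  # 5 batches per epoch
steps_per_epoch * epochs                  # 750 parameter updates in total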
7.3.1 Learning rate
cito visualizes the training (see graphic). The reason for this is that the training can easily fail if the learning rate (lr) is poorly chosen. If the lr is too high, the optimizer “jumps” over good local optima, while it gets stuck in local optima if the lr is too small:
model = dnn(Ozone~.,
            hidden = c(10L, 10L),
            activation = c("selu", "selu"),
            loss = "mse", lr = 0.4, data = data, epochs = 150L, verbose = FALSE)
If the learning rate is too high, the training will either fail immediately (because the loss jumps to infinity) or the loss will be very wiggly and will not decrease over the epochs.
model = dnn(Ozone~.,
            hidden = c(10L, 10L),
            activation = c("selu", "selu"),
            loss = "mse", lr = 0.0001, data = data, epochs = 150L, verbose = FALSE)
If the learning rate is too low, the loss decreases only very slowly (or seems not to decrease at all).
Adjusting / reducing the learning rate during training is a common approach in neural networks. The idea is to start with a larger learning rate and then steadily decrease it during training (either systematically or based on specific properties):
model = dnn(Ozone~.,
            hidden = c(10L, 10L),
            activation = c("selu", "selu"),
            loss = "mse",
            lr = 0.1,
            # reduce the learning rate every 30 epochs (new lr = 0.1 * old lr)
            lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1),
            data = data, epochs = 150L, verbose = FALSE)
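To see what this schedule implies, a small sketch of the resulting learning rate per epoch (assuming the usual step-scheduler convention of multiplying the learning rate by gamma every step_size epochs):

epoch = 0:149
lr_t = 0.1 * 0.1^(epoch %/% 30)  # drops by a factor of 10 every 30 epochs
unique(lr_t)                     # 0.1, 0.01, 0.001, 1e-04, 1e-05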
7.4 Architecture
The architecture of the NN can be specified by the hidden argument: it is a vector whose length corresponds to the number of hidden layers and whose entries give the number of hidden neurons in each layer (the same applies to the activation argument, which specifies the activation function of each hidden layer). It is hard to make general recommendations about the architecture; a rough rule of thumb is that the width of the hidden layers is more important than the depth of the NN.
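As an illustration of how hidden and activation are specified (these two models are hypothetical and not used further in this chapter), a wider and a deeper network for the same task:

# wider: one hidden layer with 50 neurons
model_wide = dnn(Ozone~., hidden = c(50L), activation = c("selu"),
                 loss = "mse", lr = 0.01, data = data, epochs = 150L, verbose = FALSE)

# deeper: three hidden layers with 10 neurons each
model_deep = dnn(Ozone~., hidden = c(10L, 10L, 10L), activation = rep("selu", 3),
                 loss = "mse", lr = 0.01, data = data, epochs = 150L, verbose = FALSE)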
Example:
data = airquality[complete.cases(airquality),] # DNN cannot handle NAs!
data = scale(data)

model = dnn(Ozone~.,
            hidden = c(10L, 10L), # architecture: number of hidden layers and nodes in each layer
            activation = c("selu", "selu"), # activation function for each hidden layer
            loss = "mse", lr = 0.01, data = data, epochs = 150L, verbose = FALSE)
plot(model)
summary(model)
Summary of Deep Neural Network Model
Feature Importance:
variable importance_1
1 Solar.R 1.395183
2 Wind 2.399698
3 Temp 3.587700
4 Month 1.073583
5 Day 1.124274
Average Conditional Effects:
Response_1
Solar.R 0.17920597
Wind -0.33835856
Temp 0.49624949
Month -0.05472599
Day 0.07795687
Standard Deviation of Conditional Effects:
Response_1
Solar.R 0.10734868
Wind 0.25713754
Temp 0.29069498
Month 0.06664078
Day 0.08592126
7.5 Regularization
We can use \(\lambda\) (the regularization strength) and \(\alpha\) (the mix between the L1 and L2 penalties) to regularize the weights of our NN.
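This is, roughly, an elastic-net penalty; as a sketch of the usual parameterization (where \(\alpha = 1\) corresponds to pure L1 and \(\alpha = 0\) to pure L2; see ?dnn for the exact definition used by cito):

\[ \text{penalty} = \lambda \left( \alpha \sum_j |w_j| + (1 - \alpha) \sum_j w_j^2 \right) \]

For example: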
model = dnn(Ozone~.,
            hidden = c(10L, 10L),
            activation = c("selu", "selu"),
            loss = "mse",
            lr = 0.01,
            lambda = 0.01, # regularization strength
            alpha = 0.5,
            # reduce the learning rate every 30 epochs (new lr = 0.1 * old lr)
            lr_scheduler = config_lr_scheduler("step", step_size = 30, gamma = 0.1),
            data = data, epochs = 150L, verbose = FALSE)
summary(model)
Summary of Deep Neural Network Model
Feature Importance:
variable importance_1
1 Solar.R 1.185922
2 Wind 1.946549
3 Temp 2.801965
4 Month 1.129728
5 Day 1.066972
Average Conditional Effects:
Response_1
Solar.R 0.17293910
Wind -0.35702771
Temp 0.50822308
Month -0.12243401
Day 0.09675976
Standard Deviation of Conditional Effects:
Response_1
Solar.R 0.04150010
Wind 0.08836144
Temp 0.11741846
Month 0.05096023
Day 0.03556525
Be careful that you don't accidentally set all weights to 0 by using too strong a regularization. We check the weights of the first layer:
fields::image.plot(coef(model)[[1]][[1]]) # weights of the first layer
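A quick numerical sketch of the same check (the 1e-3 cutoff is an arbitrary choice for "practically zero"):

W1 = coef(model)[[1]][[1]]   # first-layer weight matrix, as plotted above
mean(abs(W1) < 1e-3)         # fraction of weights shrunk (almost) to zero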
7.6 Hyperparameter tuning
cito has a feature to automatically tune hyperparameters under cross-validation!
- If you pass the function tune(...) to a hyperparameter, this hyperparameter will be tuned automatically.
- In the tuning = config_tuning(...) argument, you can specify the cross-validation strategy and the number of hyperparameter combinations that should be tested (steps).
- After the tuning, cito will automatically fit a model with the best hyperparameters on the full data and return this model.
Minimal example with the iris dataset:
df = iris
df[, 1:4] = scale(df[, 1:4])

model_tuned = dnn(Species~.,
                  loss = "softmax",
                  data = df,
                  lambda = tune(lower = 0.0, upper = 0.2), # you can pass the "tune" function to a hyperparameter
                  tuning = config_tuning(CV = 3, steps = 20L),
                  verbose = FALSE
                  )
Starting hyperparameter tuning...
Fitting final model...
# tuning results
model_tuned$tuning
# A tibble: 20 × 5
steps test train models lambda
<int> <dbl> <dbl> <lgl> <dbl>
1 1 75.2 0 NA 0.0869
2 2 Inf 0 NA 0.136
3 3 Inf 0 NA 0.132
4 4 93.5 0 NA 0.126
5 5 77.9 0 NA 0.0765
6 6 52.3 0 NA 0.0418
7 7 32.7 0 NA 0.0239
8 8 Inf 0 NA 0.148
9 9 73.1 0 NA 0.0703
10 10 53.8 0 NA 0.0414
11 11 Inf 0 NA 0.130
12 12 27.9 0 NA 0.0196
13 13 90.6 0 NA 0.131
14 14 Inf 0 NA 0.179
15 15 25.5 0 NA 0.0154
16 16 Inf 0 NA 0.170
17 17 Inf 0 NA 0.140
18 18 76.3 0 NA 0.0749
19 19 Inf 0 NA 0.163
20 20 Inf 0 NA 0.187
# model_tuned is now already the best model!
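Since the tuning results are an ordinary tibble, we can also extract the best hyperparameter combination ourselves, for example:

res = model_tuned$tuning
res[which.min(res$test), ]  # row with the lowest cross-validated test loss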
7.7 Exercise
See Section A.3 for more information about the dataset.
Prepare the data:
library(EcoData)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
data(plantPollinator_df)
plant_poll = plantPollinator_df
summary(plant_poll)
crop insect type
Vaccinium_corymbosum: 256 Andrena_wilkella : 80 Length:20480
Brassica_napus : 256 Andrena_barbilabris: 80 Class :character
Carum_carvi : 256 Andrena_cineraria : 80 Mode :character
Coriandrum_sativum : 256 Andrena_flavipes : 80
Daucus_carota : 256 Andrena_gravida : 80
Malus_domestica : 256 Andrena_haemorrhoa : 80
(Other) :18944 (Other) :20000
season diameter corolla colour
Length:20480 Min. : 2.00 Length:20480 Length:20480
Class :character 1st Qu.: 5.00 Class :character Class :character
Mode :character Median : 19.00 Mode :character Mode :character
Mean : 27.03
3rd Qu.: 25.00
Max. :150.00
NA's :9472
nectar b.system s.pollination inflorescence
Length:20480 Length:20480 Length:20480 Length:20480
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
composite guild tongue body
Length:20480 Length:20480 Min. : 2.000 Min. : 2.00
Class :character Class :character 1st Qu.: 4.800 1st Qu.: 8.00
Mode :character Mode :character Median : 6.600 Median :10.50
Mean : 8.104 Mean :10.66
3rd Qu.:10.500 3rd Qu.:13.00
Max. :26.400 Max. :25.00
NA's :17040 NA's :6160
sociality feeding interaction
Length:20480 Length:20480 0 :14095
Class :character Class :character 1 : 595
Mode :character Mode :character NA's: 5790
# scale numeric features
plant_poll[, sapply(plant_poll, is.numeric)] = scale(plant_poll[, sapply(plant_poll, is.numeric)])

# remove NAs
df = plant_poll[complete.cases(plant_poll),]

# remove factors with only one level
data_obs = df %>% select(-crop, -insect, -season, -colour, -guild, -feeding, -composite)

# change response to integer (because cito wants integer 0/1 for binomial data)
data_obs$interaction = as.integer(data_obs$interaction) - 1

# prepare the test data
newdata = plant_poll[is.na(plantPollinator_df$interaction), ]
newdata_imputed = missRanger::missRanger(data = newdata[,-ncol(newdata)], verbose = 0) # fill NAs
newdata_imputed$interaction = NA
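A few quick sanity checks on the prepared objects (just a sketch; nothing here is required for the fit):

table(data_obs$interaction)  # response is now coded as 0/1
anyNA(data_obs)              # should be FALSE
anyNA(newdata_imputed[, colnames(newdata_imputed) != "interaction"])  # predictors imputed, should be FALSE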
Minimal example in cito:
library(cito)
set.seed(42)
model = dnn(interaction~.,
            hidden = c(50, 50),
            activation = "selu",
            loss = "binomial",
            lr = tune(values = seq(0.0001, 0.03, length.out = 10)),
            batchsize = 100L, # increasing the batch size will reduce the runtime
            data = data_obs,
            epochs = 200L,
            burnin = 200L,
            tuning = config_tuning(CV = 3, steps = 10))
Starting hyperparameter tuning...
Fitting final model...
print(model$tuning)
# A tibble: 10 × 5
steps test train models lr
<int> <dbl> <dbl> <lgl> <dbl>
1 1 307. 0 NA 0.0267
2 2 304. 0 NA 0.03
3 3 316. 0 NA 0.0101
4 4 Inf 0 NA 0.00342
5 5 306. 0 NA 0.0234
6 6 315. 0 NA 0.0134
7 7 314. 0 NA 0.0134
8 8 307. 0 NA 0.0200
9 9 304. 0 NA 0.03
10 10 Inf 0 NA 0.00342
# make final predictions
predictions = predict(model, newdata_imputed, type = "response")[,1]
# prepare submissions
write.csv(data.frame(y = predictions), file = "my_submission.csv")
Your Tasks:
- Use cito to tune learning parameters and the regularization
- Submit your predictions to http://rhsbio7.uni-regensburg.de:8500/
Minimal example:
library(cito)
set.seed(42)
model = dnn(interaction~.,
            hidden = c(50, 50),
            activation = "selu",
            loss = "binomial",
            lr = tune(values = seq(0.0001, 0.03, length.out = 10)),
            lambda = tune(values = seq(0.0001, 0.1, length.out = 10)),
            alpha = tune(),
            batchsize = 100L, # increasing the batch size will reduce the runtime
            data = data_obs,
            epochs = 100L,
            tuning = config_tuning(CV = 3, steps = 15))
Starting hyperparameter tuning...
Fitting final model...
print(model$tuning)
# A tibble: 15 × 7
steps test train models lambda alpha lr
<int> <dbl> <dbl> <lgl> <dbl> <dbl> <dbl>
1 1 Inf 0 NA 0.0889 0.197 0.00674
2 2 Inf 0 NA 0.1 0.501 0.0134
3 3 Inf 0 NA 0.0334 0.825 0.0200
4 4 Inf 0 NA 0.0112 0.0921 0.0134
5 5 Inf 0 NA 0.0778 0.470 0.0234
6 6 Inf 0 NA 0.0445 0.881 0.0134
7 7 Inf 0 NA 0.0445 0.436 0.00674
8 8 Inf 0 NA 0.0667 0.277 0.0200
9 9 Inf 0 NA 0.1 0.0571 0.0001
10 10 Inf 0 NA 0.0112 0.885 0.0234
11 11 Inf 0 NA 0.0112 0.148 0.0234
12 12 Inf 0 NA 0.0667 0.317 0.0167
13 13 Inf 0 NA 0.0001 0.681 0.00342
14 14 Inf 0 NA 0.1 0.283 0.0001
15 15 Inf 0 NA 0.0556 0.425 0.0101
Make predictions:
predictions = predict(model, newdata_imputed, type = "response")[,1]
write.csv(data.frame(y = predictions), file = "Max_plant_.csv")