We can use TensorFlow directly from R (see Chapter 8 for an introduction to TensorFlow), and we could use this knowledge to implement a neural network in TensorFlow directly in R. However, this can be quite cumbersome. For simple problems, it is usually faster to use a higher-level API that helps us implement the machine learning models in TensorFlow. The most common of these is Keras.
Keras is a powerful framework for building and training neural networks with just a few lines of code. As of the end of 2018, Keras and TensorFlow are fully interoperable, allowing us to take advantage of the best of both.
The goal of this lesson is to familiarize you with Keras. If you have TensorFlow installed, you can find Keras inside TensorFlow: tf.keras. However, the RStudio team has built an R package on top of tf.keras that is more convenient to use. To load the Keras package, type
```r
library(keras)
# or: library(torch)
```
9.1 Example workflow in Keras / Torch
We build a small classifier to predict the three species of the iris data set. Load the necessary packages and data sets:
```r
library(keras)
library(tensorflow)
library(torch)

set_random_seed(321L, disable_gpu = FALSE)  # Already sets R's random seed.

data(iris)
head(iris)
```
For neural networks, it is beneficial to scale the predictors (scaling = centering and standardization, see ?scale). We also split our data into predictors (X) and response (Y = the three species).
```r
X = scale(iris[, 1:4])
Y = iris[, 5]
```
Additionally, Keras/TensorFlow cannot handle factors, so we have to create contrasts (one-hot encoding). To do so, we have to specify the number of categories. This can be tricky for beginners because, unlike R, languages such as Python and C++ use zero-based indexing. So although we specify 3 as the number of classes for our three species, the class labels must be 0, 1, 2 rather than 1, 2, 3. Keep this in mind.
```r
Y = to_categorical(as.integer(Y) - 1L, 3)
head(Y)  # 3 columns, one for each level of the response.
```
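To see what to_categorical does under the hood, here is a base-R sketch of the same one-hot encoding (pure R, no Keras required; the helper name one_hot is our own):

```r
# Base-R equivalent of keras::to_categorical(labels, n_classes),
# where labels are integer class labels starting at 0.
one_hot = function(labels, n_classes) {
  out = matrix(0, nrow = length(labels), ncol = n_classes)
  out[cbind(seq_along(labels), labels + 1L)] = 1  # +1: back to R's 1-based columns.
  out
}

species = as.integer(iris$Species) - 1L  # Classes 0, 1, 2.
Y_manual = one_hot(species, 3L)
head(Y_manual)  # Exactly one 1 per row, one column per species.
```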
1. Initialize a sequential model:

A sequential Keras model is a higher-order model type within Keras: a linear stack of layers with one input and one output.
2. Add hidden layers to the model (we will learn more about hidden layers during the next days).
When specifying the hidden layers, we also have to specify their shape and a so-called activation function. You can think of the activation function as a decision about what is forwarded to the next neuron (we will learn more about it later). If you want to explore this topic in even more depth, consider watching the videos presented in section @ref(basicMath).
The shape of the input is the number of predictors (here 4) and the shape of the output is the number of classes (here 3).
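Put together, the Keras model for this task might look like the following sketch. The size of the hidden layer (20 units) and the relu activation are our own assumptions; only the input shape (4) and output shape (3) are fixed by the data:

```r
library(keras)

model = keras_model_sequential() %>%
  layer_dense(units = 20L, activation = "relu", input_shape = 4L) %>%  # Hidden layer (size is arbitrary).
  layer_dense(units = 3L, activation = "softmax")                      # Output: one unit per species.

summary(model)
```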
The Torch syntax is very similar: we pass a list of layers to the nn_sequential function. Here, we have to specify the softmax activation function as an extra layer:
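A matching Torch sketch (again, the 20 hidden units are our own assumption):

```r
library(torch)

model_torch = nn_sequential(
  nn_linear(4L, 20L),  # 4 predictors -> hidden layer.
  nn_relu(),
  nn_linear(20L, 3L),  # Hidden layer -> 3 classes.
  nn_softmax(dim = 2)  # Softmax specified as an extra layer.
)
```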
softmax scales a (potentially multidimensional) vector so that each component lies in the interval \((0, 1)\) and all components sum to 1. This is very useful, for example, for representing class probabilities. Ensure that the labels start at 0! Otherwise the softmax function does not work well.
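Softmax is easy to write in base R; this little sketch demonstrates the two properties just described:

```r
softmax = function(z) exp(z) / sum(exp(z))

scores = c(2.0, 1.0, 0.1)  # E.g. raw network outputs ("logits").
probs = softmax(scores)
probs       # Every component lies strictly between 0 and 1.
sum(probs)  # The components sum to 1.
```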
3. Compile the model with a loss function (here: cross entropy) and an optimizer (here: Adamax).
We will learn about other options later, so for now, do not worry about the “learning_rate” (“lr” in Torch or earlier in TensorFlow) argument, cross entropy or the optimizer.
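The compile step for this classifier might look like the following sketch; the learning rate of 0.05 mirrors the regression example later in this lesson and is an assumption, not a prescribed value:

```r
# Keras: attach loss (cross entropy) and optimizer (Adamax) to the model.
model %>%
  compile(loss = loss_categorical_crossentropy,
          optimizer = optimizer_adamax(learning_rate = 0.05))

# Torch: the optimizer instead receives the model's parameters directly.
optimizer_torch = optim_adamax(params = model_torch$parameters, lr = 0.05)
```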
```r
library(tensorflow)
library(keras)

set_random_seed(321L, disable_gpu = FALSE)  # Already sets R's random seed.

model_history =
  model %>%
    fit(x = X, y = apply(Y, 2, as.integer), epochs = 30L,
        batch_size = 20L, shuffle = TRUE)
```
In Torch, we jump directly to the training; however, here we have to write the training loop ourselves:
1. Get a batch of data.
2. Predict on the batch.
3. Calculate the loss between predictions and true labels.
4. Backpropagate the error.
5. Update the weights.
6. Go to step 1 and repeat.
```r
library(torch)
torch_manual_seed(321L)
set.seed(123)

# Calculate number of training steps.
epochs = 30
batch_size = 20
steps = round(nrow(X) / batch_size * epochs)

X_torch = torch_tensor(X)
Y_torch = torch_tensor(apply(Y, 1, which.max))

# Set model into training status.
model_torch$train()

log_losses = NULL

# Training loop.
for(i in 1:steps) {
  # Get batch.
  indices = sample.int(nrow(X), batch_size)

  # Reset backpropagation.
  optimizer_torch$zero_grad()

  # Predict and calculate loss.
  pred = model_torch(X_torch[indices, ])
  loss = nnf_cross_entropy(pred, Y_torch[indices])

  # Backpropagation and weight update.
  loss$backward()
  optimizer_torch$step()

  log_losses[i] = as.numeric(loss)
}
```
A quick look at the airquality data set (summary(airquality)):

```
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
```
There are NAs in the data, which we have to remove because Keras cannot handle NAs. If you don’t know how to remove NAs from a data.frame, use Google (e.g. with the query: “remove-rows-with-all-or-some-nas-missing-values-in-data-frame”).
```r
data = data[complete.cases(data), ]  # Remove NAs.
summary(data)
```
```
     Ozone          Solar.R           Wind            Temp      
 Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
 1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
 Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
 Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
 3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
 Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
     Month            Day       
 Min.   :5.000   Min.   : 1.00  
 1st Qu.:6.000   1st Qu.: 9.00  
 Median :7.000   Median :16.00  
 Mean   :7.216   Mean   :15.95  
 3rd Qu.:9.000   3rd Qu.:22.50  
 Max.   :9.000   Max.   :31.00  
```
Split the data in features (\(\boldsymbol{X}\)) and response (\(\boldsymbol{y}\), Ozone) and scale the \(\boldsymbol{X}\) matrix.
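A sketch of this step together with a small regression network (the hidden layer of 20 units and the relu activation are assumptions; the single linear output unit is what regression requires):

```r
x = scale(data[, 2:6])  # Features: Solar.R, Wind, Temp, Month, Day.
y = data[, 1]           # Response: Ozone.

model = keras_model_sequential() %>%
  layer_dense(units = 20L, activation = "relu", input_shape = 5L) %>%
  layer_dense(units = 1L)  # One linear output unit for regression.
```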
```r
model %>%
  compile(loss = loss_mean_squared_error,
          optimizer_adamax(learning_rate = 0.05))
```
What is the “mean_squared_error” loss?
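As a hint: the mean squared error is simply the average of the squared residuals. A base-R sketch (the function name mse is our own):

```r
mse = function(y_true, y_pred) mean((y_true - y_pred)^2)

mse(c(1, 2, 3), c(1, 2, 5))  # (0 + 0 + 4) / 3 = 1.333...
```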
Fit model:
```r
model_history =
  model %>%
    fit(x = x, y = as.numeric(y), epochs = 100L,
        batch_size = 20L, shuffle = TRUE)
```
Plot training history.
```r
plot(model_history)
```
Create predictions.
```r
pred_keras = predict(model, x)
```
Compare your Keras model with a linear model:
```r
fit = lm(Ozone ~ ., data = data)
pred_lm = predict(fit, data)

# Note: mean(sqrt((.)^2)) equals the mean absolute error; the classical
# RMSE would be sqrt(mean((.)^2)).
rmse_lm = mean(sqrt((y - pred_lm)^2))
rmse_keras = mean(sqrt((y - pred_keras)^2))

print(rmse_lm)
```

```
[1] 14.78897
```

```r
print(rmse_keras)
```

```
[1] 9.067499
```
Before we start, load and prepare the data set:
```r
library(torch)

data = airquality
summary(data)
```
```
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
```
```r
plot(data)
```
There are NAs in the data, which we have to remove because Keras cannot handle NAs. If you don’t know how to remove NAs from a data.frame, use Google (e.g. with the query: “remove-rows-with-all-or-some-nas-missing-values-in-data-frame”).
```r
data = data[complete.cases(data), ]  # Remove NAs.
summary(data)
```
```
     Ozone          Solar.R           Wind            Temp      
 Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
 1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
 Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
 Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
 3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
 Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
     Month            Day       
 Min.   :5.000   Min.   : 1.00  
 1st Qu.:6.000   1st Qu.: 9.00  
 Median :7.000   Median :16.00  
 Mean   :7.216   Mean   :15.95  
 3rd Qu.:9.000   3rd Qu.:22.50  
 Max.   :9.000   Max.   :31.00  
```
Split the data in features (\(\boldsymbol{X}\)) and response (\(\boldsymbol{y}\), Ozone) and scale the \(\boldsymbol{X}\) matrix.
```r
x = scale(data[, 2:6])
y = data[, 1]
```
Pass a list of layer objects to Torch's sequential network class (the input and output layers are already specified; you have to add the hidden layers between them):
We have to pass the network's parameters to the optimizer (how is this different from Keras?):
```r
optimizer_torch = optim_adam(params = model_torch$parameters, lr = 0.05)
```
Fit the model:

In Torch, we write the training loop ourselves. Complete the training loop:
```r
# Calculate number of training steps.
epochs = ...
batch_size = 32
steps = ...

X_torch = torch_tensor(x)
Y_torch = torch_tensor(y, ...)

# Set model into training status.
model_torch$train()

log_losses = NULL

# Training loop.
for(i in 1:steps) {
  # Get batch indices.
  indices = sample.int(nrow(x), batch_size)
  X_batch = ...
  Y_batch = ...

  # Reset backpropagation.
  optimizer_torch$zero_grad()

  # Predict and calculate loss.
  pred = model_torch(X_batch)
  loss = ...

  # Backpropagation and weight update.
  loss$backward()
  optimizer_torch$step()

  log_losses[i] = as.numeric(loss)
}
```
```r
# Calculate number of training steps.
epochs = 100
batch_size = 32
steps = round(nrow(x) / batch_size * epochs)

X_torch = torch_tensor(x)
Y_torch = torch_tensor(y, dtype = torch_float32())$view(list(-1, 1))

# Set model into training status.
model_torch$train()

log_losses = NULL

# Training loop.
for(i in 1:steps) {
  # Get batch indices.
  indices = sample.int(nrow(x), batch_size)
  X_batch = X_torch[indices, ]
  Y_batch = Y_torch[indices, ]

  # Reset backpropagation.
  optimizer_torch$zero_grad()

  # Predict and calculate loss.
  pred = model_torch(X_batch)
  loss = nnf_mse_loss(pred, Y_batch)

  # Backpropagation and weight update.
  loss$backward()
  optimizer_torch$step()

  log_losses[i] = as.numeric(loss)
}
```
Tips:

- Number of training steps: \(\text{steps} = \frac{\text{number of rows}}{\text{batch size}} \cdot \text{epochs}\)
- Search torch::nnf_… for the correct loss function (mse…).
- Make sure that X_torch and Y_torch have the same data type! (You can set the dtype via torch_tensor(…, dtype = …).)
- Check the dimensions of Y_torch; we need a matrix!
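For instance, with the cleaned airquality data (111 complete rows) and the settings above, the first tip works out as:

```r
epochs = 100
batch_size = 32
n = 111  # Rows in airquality after removing NAs.

steps = round(n / batch_size * epochs)
steps  # 347 weight updates in total.
```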
Plot training history.
```r
plot(y = log_losses, x = 1:steps, xlab = "Step", ylab = "MSE")
```
Create predictions.
```r
pred_torch = model_torch(X_torch)
pred_torch = as.numeric(pred_torch)  # Cast the torch tensor to an R object.
```
Compare your Torch model with a linear model:
```r
fit = lm(Ozone ~ ., data = data)
pred_lm = predict(fit, data)

# Note: mean(sqrt((.)^2)) equals the mean absolute error; the classical
# RMSE would be sqrt(mean((.)^2)).
rmse_lm = mean(sqrt((y - pred_lm)^2))
rmse_torch = mean(sqrt((y - pred_torch)^2))

print(rmse_lm)
```

```
[1] 14.78897
```

```r
print(rmse_torch)
```

```
[1] 6.872959
```
Task: Titanic dataset

Build a Keras DNN for the Titanic dataset.
Bonus Task: More details on the inner working of Keras
The next task differs for Torch and Keras users. Keras users will learn more about the inner working of training while Torch users will learn how to simplify and generalize the training loop.
```r
steps = floor(nrow(x) / 32) * epochs  # We need nrow(x)/32 steps for each epoch.

for(i in 1:steps) {
  # Get data.
  batch = get_batch()

  # Transform it into tensors.
  bX = tf$constant(batch$bX)
  bY = tf$constant(matrix(batch$bY, ncol = 1L))

  # Automatic differentiation:
  # Record computations with respect to our model variables.
  with(tf$GradientTape() %as% tape, {
    pred = model(bX)  # We record the operations for our model weights.
    loss = tf$reduce_mean(tf$keras$losses$mean_squared_error(bY, pred))
  })

  # Calculate the gradients for our model$weights at the loss / backpropagation.
  gradients = tape$gradient(loss, model$weights)

  # Update our model weights with the learning rate specified above.
  optimizer$apply_gradients(purrr::transpose(list(gradients, model$weights)))

  if(!i %% 30) {
    cat("Loss: ", loss$numpy(), "\n")  # Print loss every 30 steps (not epochs!).
  }
}
```
Keras and Torch use dataloaders to generate the data batches. Dataloaders are objects that return batches of data indefinitely. Keras creates the dataloader object automatically in the fit function; in Torch, we have to write it ourselves:
1. Define a dataset object. This object informs the dataloader function about the inputs, the outputs, the length (nrow), and how to sample from the data.
2. Create an instance of the dataset object by calling it and passing the actual data to it.
3. Pass the instantiated dataset to the dataloader function.

Our dataloader is again an object that has to be instantiated. The instantiated object returns a list of two elements, batch x and batch y, and it stops returning batches once the dataset has been traversed completely (no worries, we do not have to do all of this ourselves).
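A minimal sketch of these three steps for the airquality features x and response y (the dataset name airq_dataset and the batch size are our own choices):

```r
library(torch)

# Step 1: define the dataset object.
airq_dataset = dataset(
  name = "airq_dataset",
  initialize = function(x, y) {
    self$x = torch_tensor(x)
    self$y = torch_tensor(y, dtype = torch_float32())$view(list(-1, 1))
  },
  .getitem = function(i) list(x = self$x[i, ], y = self$y[i, ]),  # How to sample.
  .length = function() self$x$size(1)                             # Length (nrow).
)

# Step 2: create an instance with the actual data.
ds = airq_dataset(x, y)

# Step 3: pass the instantiated dataset to the dataloader function.
dl = dataloader(ds, batch_size = 32, shuffle = TRUE)

# The dataloader yields batches until the data is traversed completely.
coro::loop(for(batch in dl) {
  # batch$x and batch$y are the two elements described above.
})
```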
```
Loss at epoch 10: 15.999315
Loss at epoch 20: 8.324385
Loss at epoch 30: 5.632413
Loss at epoch 40: 4.256267
Loss at epoch 50: 3.425689
```
Remarks:

- Mind the different input and output layer sizes.
- The loss fluctuates from step to step because a different random subset of the data is drawn each time. This is a downside of stochastic gradient descent.
- A positive aspect of stochastic gradient descent is that this randomness can help the optimizer leave local valleys or hills of the loss surface and find better (possibly global) ones.
9.3 Underlying mathematical concepts - optional
If you are not yet familiar with the underlying concepts of neural networks and want to know more about them, we suggest reading/watching the following sites and videos. Consider the links and videos with descriptions in parentheses as optional bonus material. They might be useful for understanding the following concepts in more depth.
Short explanation of entropy, cross entropy and Kullback–Leibler divergence
Deep Learning (chapter 1)
How neural networks learn - Deep Learning (chapter 2)
Backpropagation - Deep Learning (chapter 3)
Another video about backpropagation (extends the previous one) - Deep Learning (chapter 4)
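To make the entropy/cross-entropy link above concrete, here is a small base-R sketch (the function name cross_entropy is our own):

```r
# Cross entropy between a true distribution p and a predicted distribution q.
cross_entropy = function(p, q) -sum(p * log(q))

p = c(1, 0, 0)        # One-hot true label.
q = c(0.7, 0.2, 0.1)  # Predicted class probabilities.
cross_entropy(p, q)   # -log(0.7), about 0.357
```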
9.3.1 Caveats of neural network optimization
Depending on the activation functions, it might occur that the network does not get updated, even with high learning rates (called vanishing gradients; this especially affects "sigmoid" functions). Furthermore, updates might overshoot (called exploding gradients), or activation functions may output many zeros (the dying ReLU problem, especially for "relu").
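The vanishing-gradient problem for sigmoid can be seen directly from its derivative, \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\), which is at most 0.25 and collapses toward zero for large \(|x|\):

```r
sigmoid = function(x) 1 / (1 + exp(-x))
sigmoid_grad = function(x) sigmoid(x) * (1 - sigmoid(x))

sigmoid_grad(0)   # 0.25, the maximum possible value.
sigmoid_grad(10)  # ~4.5e-05: almost no gradient flows backwards.
```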
In general, the first layers of a network tend to learn (much) more slowly than subsequent ones.