[Appendix B — Introduction to TensorFlow and Keras]{#sec-tensorflowintro .quarto-section-identifier}

Appendix B — Introduction to TensorFlow and Keras

B.1 Introduction to TensorFlow

One of the most commonly used frameworks for machine learning is TensorFlow. TensorFlow is an open source linear algebra library with focus on neural networks, published by Google in 2015. TensorFlow supports several interesting features, in particular automatic differentiation, several gradient optimizers and CPU and GPU parallelization.

These advantages are nicely explained in the following video:

To sum up the most important points of the video:

TensorFlow is a math library which is highly optimized for neural networks.
If a GPU is available, computations can be easily run on the GPU but even on a CPU TensorFlow is still very fast.
The “backend” (i.e. all the functions and all computations) are written in C++ and CUDA (CUDA is a programming language for NVIDIA GPUs).
The interface (the part of TensorFlow we use) is written in Python and is also available in R, which means, we can write the code in R/Python but it will be executed by the (compiled) C++ backend.

All operations in TensorFlow are written in C++ and are highly optimized. But don’t worry, we don’t have to use C++ to use TensorFlow because there are several bindings for other languages. TensorFlow officially supports a Python API, but meanwhile there are several community carried APIs for other languages:

R
Go
Rust
Swift
JavaScript

In this course we will use TensorFlow with the https://tensorflow.rstudio.com/ binding, that was developed and published 2017 by the RStudio team. First, they developed an R package (reticulate) for calling Python in R. Actually, we are using the Python TensorFlow module in R (more about this later).

TensorFlow offers different levels of API. We could implement a neural network completely by ourselves or we could use Keras which is provided as a submodule by TensorFlow. Keras is a powerful module for building and training neural networks. It allows us building and training neural networks in a few lines of codes. Since the end of 2018, Keras and TensorFlow are completly interoperable, allowing us to utilize the best of both. In this course, we will show how we can use Keras for neural networks but also how we can use the TensorFlow’s automatic differenation for using complex objective functions.

Useful links:

TensorFlow documentation (This is for the Python API, but just replace the “.” with “$”.)
Rstudio TensorFlow website

B.1.1 Data Containers

TensorFlow has two data containers (structures):

constant (tf$constant): Creates a constant (immutable) value in the computation graph.
variable (tf$Variable): Creates a mutable value in the computation graph (used as parameter/weight in models).

To get started with TensorFlow, we have to load the library and check if the installation worked.

library(tensorflow)
library(keras3)

# Don't worry about weird messages. TensorFlow supports additional optimizations.
exists("tf")

[1] TRUE

immutable = tf$constant(5.0)
mutable = tf$Variable(5.0)

Don’t worry about weird messages (they will only appear once at the start of the session).

B.1.2 Basic Operations

We now can define the variables and do some math with them:

a = tf$constant(5)
b = tf$constant(10)
print(a)

tf.Tensor(5.0, shape=(), dtype=float32)

print(b)

tf.Tensor(10.0, shape=(), dtype=float32)

c = tf$add(a, b)
print(c)

tf.Tensor(15.0, shape=(), dtype=float32)

tf$print(c) # Prints to stderr. For stdout, use k_print_tensor(..., message).

Normal R methods such as print() are provided by the R package “tensorflow”.

The TensorFlow library (created by the RStudio team) built R methods for all common operations:

`+.tensorflow.tensor` = function(a, b){ return(tf$add(a,b)) }
# Mind the backticks.
(a+b)

tf.Tensor(15.0, shape=(), dtype=float32)

Their operators also automatically transform R numbers into constant tensors when attempting to add a tensor to an R number:

d = c + 5  # 5 is automatically converted to a tensor.
print(d)

tf.Tensor(20.0, shape=(), dtype=float32)

TensorFlow containers are objects, what means that they are not just simple variables of type numeric (class(5)), but they instead have so called methods. Methods are changing the state of a class (which for most of our purposes here is the values of the object). For instance, there is a method to transform the tensor object back to an R object:

class(d)

[1] "tensorflow.tensor"                               
[2] "tensorflow.python.framework.ops.EagerTensor"     
[3] "tensorflow.python.framework.ops._EagerTensorBase"
[4] "tensorflow.python.framework.tensor.Tensor"       
[5] "tensorflow.python.types.internal.NativeObject"   
[6] "tensorflow.python.types.core.Symbol"             
[7] "tensorflow.python.types.core.Value"              
[8] "tensorflow.python.types.core.Tensor"             
[9] "python.builtin.object"

class(d$numpy())

[1] "numeric"

class(as.matrix(d))

[1] "matrix" "array"

B.1.3 Data Types

R uses dynamic typing, what means you can assign a number, character, function or whatever to a variable and the the type is automatically inferred. In other languages you have to state the type explicitly, e.g. in C:

int a = 5;
float a = 5.0;
char a = "a";

While TensorFlow tries to infer the type dynamically, you must often state it explicitly. Common important types:

float32 (floating point number with 32 bits, “single precision”)
float64 (floating point number with 64 bits, “double precision”)
int8 (integer with 8 bits)

The reason why TensorFlow is so explicit about types is that many GPUs (e.g. the NVIDIA GeForces) can handle only up to 32 bit numbers! (you do not need high precision in graphical modeling)

But let us see in practice what we have to do with these types and how to specifcy them:

r_matrix = matrix(runif(10*10), 10, 10)
m = tf$constant(r_matrix, dtype = "float32") 
b = tf$constant(2.0, dtype = "float64")
c = m / b # Doesn't work! We try to divide float32/float64.

So what went wrong here? We tried to divide a float32 by a float64 number, but we can only divide numbers of the same type!

r_matrix = matrix(runif(10*10), 10, 10)
m = tf$constant(r_matrix, dtype = "float64")
b = tf$constant(2.0, dtype = "float64")
c = m / b # Now it works.

We can also specify the type of the object by providing an object e.g. tf$float64.

r_matrix = matrix(runif(10*10), 10, 10)
m = tf$constant(r_matrix, dtype = tf$float64)

In TensorFlow, arguments often require exact/explicit data types: TensorFlow often expects integers as arguments. In R however an integer is normally saved as float. Thus, we have to use an “L” after an integer to tell the R interpreter that it should be treated as an integer:

is.integer(5)
is.integer(5L)
matrix(t(r_matrix), 5, 20, byrow = TRUE)
tf$reshape(r_matrix, shape = c(5, 20))$numpy()
tf$reshape(r_matrix, shape = c(5L, 20L))$numpy()

Skipping the “L” is one of the most common errors when using R-TensorFlow!

B.1.4 Exercises

Question: TensorFlow Operations

To run TensorFlow from R, note that you can access the different mathematical operations in TensorFlow via tf$…, e.g. there is a tf$math$… for all common math operations or the tf$linalg$… for different linear algebra operations. Tip: type tf$ and then hit the tab key to list all available options (sometimes you have to do this directly in the console).

An example: How to get the maximum value of a vector?

library(tensorflow)
library(keras3)

x = 100:1
y = as.double(100:1)

max(x)  # R solution. Integer!

[1] 100

tf$math$reduce_max(x) # TensorFlow solution. Integer!

tf.Tensor(100, shape=(), dtype=int32)

max(y)  # Float!

[1] 100

tf$math$reduce_max(y) # Float!

tf.Tensor(100.0, shape=(), dtype=float32)

Rewrite the following expressions (a to g) in TensorFlow:

x = 100:1
y = as.double(100:1)

# a)
min(x)

[1] 1

# b)
mean(x)

[1] 50.5

# c) Tip: Use Google!
which.max(x)

[1] 1

# d) 
which.min(x)

[1] 100

# e) Tip: Use Google! 
order(x)

  [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
 [19]  82  81  80  79  78  77  76  75  74  73  72  71  70  69  68  67  66  65
 [37]  64  63  62  61  60  59  58  57  56  55  54  53  52  51  50  49  48  47
 [55]  46  45  44  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
 [73]  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11
 [91]  10   9   8   7   6   5   4   3   2   1

# f) Tip: See tf$reshape.
m = matrix(y, 10, 10) # Mind: We use y here! (Float)
m_2 = abs(m %*% t(m))  # m %*% m is the normal matrix multiplication.
m_2_log = log(m_2)
print(m_2_log)

          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
 [1,] 10.55841 10.54402 10.52943 10.51461 10.49957 10.48431 10.46880 10.45305
 [2,] 10.54402 10.52969 10.51515 10.50040 10.48542 10.47022 10.45478 10.43910
 [3,] 10.52943 10.51515 10.50067 10.48598 10.47107 10.45593 10.44057 10.42496
 [4,] 10.51461 10.50040 10.48598 10.47135 10.45651 10.44144 10.42614 10.41061
 [5,] 10.49957 10.48542 10.47107 10.45651 10.44173 10.42674 10.41151 10.39605
 [6,] 10.48431 10.47022 10.45593 10.44144 10.42674 10.41181 10.39666 10.38127
 [7,] 10.46880 10.45478 10.44057 10.42614 10.41151 10.39666 10.38158 10.36628
 [8,] 10.45305 10.43910 10.42496 10.41061 10.39605 10.38127 10.36628 10.35105
 [9,] 10.43705 10.42317 10.40910 10.39482 10.38034 10.36565 10.35073 10.33559
[10,] 10.42079 10.40699 10.39299 10.37879 10.36439 10.34977 10.33495 10.31989
          [,9]    [,10]
 [1,] 10.43705 10.42079
 [2,] 10.42317 10.40699
 [3,] 10.40910 10.39299
 [4,] 10.39482 10.37879
 [5,] 10.38034 10.36439
 [6,] 10.36565 10.34977
 [7,] 10.35073 10.33495
 [8,] 10.33559 10.31989
 [9,] 10.32022 10.30461
[10,] 10.30461 10.28909

# g) Custom mean function i.e. rewrite the function using TensorFlow. 
mean_R = function(y){
  result = sum(y) / length(y)
  return(result)
}

mean_R(y) == mean(y)    # Test for equality.

[1] TRUE

library(tensorflow)
library(keras)

Registered S3 methods overwritten by 'keras':
  method                               from  
  as.data.frame.keras_training_history keras3
  plot.keras_training_history          keras3
  print.keras_training_history         keras3
  r_to_py.R6ClassGenerator             keras3


Attaching package: 'keras'

The following objects are masked from 'package:keras3':

    %<-active%, %py_class%, activation_elu, activation_exponential,
    activation_gelu, activation_hard_sigmoid, activation_linear,
    activation_relu, activation_selu, activation_sigmoid,
    activation_softmax, activation_softplus, activation_softsign,
    activation_tanh, adapt, application_densenet121,
    application_densenet169, application_densenet201,
    application_efficientnet_b0, application_efficientnet_b1,
    application_efficientnet_b2, application_efficientnet_b3,
    application_efficientnet_b4, application_efficientnet_b5,
    application_efficientnet_b6, application_efficientnet_b7,
    application_inception_resnet_v2, application_inception_v3,
    application_mobilenet, application_mobilenet_v2,
    application_mobilenet_v3_large, application_mobilenet_v3_small,
    application_nasnetlarge, application_nasnetmobile,
    application_resnet101, application_resnet101_v2,
    application_resnet152, application_resnet152_v2,
    application_resnet50, application_resnet50_v2, application_vgg16,
    application_vgg19, application_xception, bidirectional,
    callback_backup_and_restore, callback_csv_logger,
    callback_early_stopping, callback_lambda,
    callback_learning_rate_scheduler, callback_model_checkpoint,
    callback_reduce_lr_on_plateau, callback_remote_monitor,
    callback_tensorboard, clone_model, constraint_maxnorm,
    constraint_minmaxnorm, constraint_nonneg, constraint_unitnorm,
    count_params, custom_metric, dataset_boston_housing,
    dataset_cifar10, dataset_cifar100, dataset_fashion_mnist,
    dataset_imdb, dataset_imdb_word_index, dataset_mnist,
    dataset_reuters, dataset_reuters_word_index, freeze_weights,
    from_config, get_config, get_file, get_layer, get_vocabulary,
    get_weights, image_array_save, image_dataset_from_directory,
    image_load, image_to_array, imagenet_decode_predictions,
    imagenet_preprocess_input, initializer_constant,
    initializer_glorot_normal, initializer_glorot_uniform,
    initializer_he_normal, initializer_he_uniform,
    initializer_identity, initializer_lecun_normal,
    initializer_lecun_uniform, initializer_ones,
    initializer_orthogonal, initializer_random_normal,
    initializer_random_uniform, initializer_truncated_normal,
    initializer_variance_scaling, initializer_zeros, install_keras,
    keras, keras_model, keras_model_sequential, Layer,
    layer_activation, layer_activation_elu,
    layer_activation_leaky_relu, layer_activation_parametric_relu,
    layer_activation_relu, layer_activation_softmax,
    layer_activity_regularization, layer_add, layer_additive_attention,
    layer_alpha_dropout, layer_attention, layer_average,
    layer_average_pooling_1d, layer_average_pooling_2d,
    layer_average_pooling_3d, layer_batch_normalization,
    layer_category_encoding, layer_center_crop, layer_concatenate,
    layer_conv_1d, layer_conv_1d_transpose, layer_conv_2d,
    layer_conv_2d_transpose, layer_conv_3d, layer_conv_3d_transpose,
    layer_conv_lstm_1d, layer_conv_lstm_2d, layer_conv_lstm_3d,
    layer_cropping_1d, layer_cropping_2d, layer_cropping_3d,
    layer_dense, layer_depthwise_conv_1d, layer_depthwise_conv_2d,
    layer_discretization, layer_dot, layer_dropout, layer_embedding,
    layer_flatten, layer_gaussian_dropout, layer_gaussian_noise,
    layer_global_average_pooling_1d, layer_global_average_pooling_2d,
    layer_global_average_pooling_3d, layer_global_max_pooling_1d,
    layer_global_max_pooling_2d, layer_global_max_pooling_3d,
    layer_gru, layer_hashing, layer_input, layer_integer_lookup,
    layer_lambda, layer_layer_normalization, layer_lstm, layer_masking,
    layer_max_pooling_1d, layer_max_pooling_2d, layer_max_pooling_3d,
    layer_maximum, layer_minimum, layer_multi_head_attention,
    layer_multiply, layer_normalization, layer_permute,
    layer_random_brightness, layer_random_contrast, layer_random_crop,
    layer_random_flip, layer_random_rotation, layer_random_translation,
    layer_random_zoom, layer_repeat_vector, layer_rescaling,
    layer_reshape, layer_resizing, layer_rnn, layer_separable_conv_1d,
    layer_separable_conv_2d, layer_simple_rnn,
    layer_spatial_dropout_1d, layer_spatial_dropout_2d,
    layer_spatial_dropout_3d, layer_string_lookup, layer_subtract,
    layer_text_vectorization, layer_unit_normalization,
    layer_upsampling_1d, layer_upsampling_2d, layer_upsampling_3d,
    layer_zero_padding_1d, layer_zero_padding_2d,
    layer_zero_padding_3d, learning_rate_schedule_cosine_decay,
    learning_rate_schedule_cosine_decay_restarts,
    learning_rate_schedule_exponential_decay,
    learning_rate_schedule_inverse_time_decay,
    learning_rate_schedule_piecewise_constant_decay,
    learning_rate_schedule_polynomial_decay, loss_binary_crossentropy,
    loss_categorical_crossentropy, loss_categorical_hinge,
    loss_cosine_similarity, loss_hinge, loss_huber, loss_kl_divergence,
    loss_mean_absolute_error, loss_mean_absolute_percentage_error,
    loss_mean_squared_error, loss_mean_squared_logarithmic_error,
    loss_poisson, loss_sparse_categorical_crossentropy,
    loss_squared_hinge, mark_active, metric_auc,
    metric_binary_accuracy, metric_binary_crossentropy,
    metric_categorical_accuracy, metric_categorical_crossentropy,
    metric_categorical_hinge, metric_cosine_similarity,
    metric_false_negatives, metric_false_positives, metric_hinge,
    metric_mean, metric_mean_absolute_error,
    metric_mean_absolute_percentage_error, metric_mean_iou,
    metric_mean_squared_error, metric_mean_squared_logarithmic_error,
    metric_mean_wrapper, metric_poisson, metric_precision,
    metric_precision_at_recall, metric_recall,
    metric_recall_at_precision, metric_root_mean_squared_error,
    metric_sensitivity_at_specificity,
    metric_sparse_categorical_accuracy,
    metric_sparse_categorical_crossentropy,
    metric_sparse_top_k_categorical_accuracy,
    metric_specificity_at_sensitivity, metric_squared_hinge,
    metric_sum, metric_top_k_categorical_accuracy,
    metric_true_negatives, metric_true_positives, new_callback_class,
    new_layer_class, new_learning_rate_schedule_class, new_loss_class,
    new_metric_class, new_model_class, normalize, optimizer_adadelta,
    optimizer_adagrad, optimizer_adam, optimizer_adamax,
    optimizer_ftrl, optimizer_nadam, optimizer_rmsprop, optimizer_sgd,
    pad_sequences, pop_layer, predict_on_batch, regularizer_l1,
    regularizer_l1_l2, regularizer_l2, regularizer_orthogonal,
    set_vocabulary, set_weights, shape, test_on_batch,
    text_dataset_from_directory, time_distributed,
    timeseries_dataset_from_array, to_categorical, train_on_batch,
    unfreeze_weights, use_backend, with_custom_object_scope, zip_lists

x = 100:1
y = as.double(100:1)

# a)    min(x)
tf$math$reduce_min(x) # Integer!

tf.Tensor(1, shape=(), dtype=int32)

tf$math$reduce_min(y) # Float!

tf.Tensor(1.0, shape=(), dtype=float32)

# b)    mean(x)
# Check out the difference here:
mean(x)

[1] 50.5

mean(y)

[1] 50.5

tf$math$reduce_mean(x)  # Integer!

tf.Tensor(50, shape=(), dtype=int32)

tf$math$reduce_mean(y)  # Float!

tf.Tensor(50.5, shape=(), dtype=float32)

# c)    which.max(x)
tf$argmax(x)

tf.Tensor(0, shape=(), dtype=int64)

tf$argmax(y)

tf.Tensor(0, shape=(), dtype=int64)

# d)    which.min(x)
tf$argmin(x)

tf.Tensor(99, shape=(), dtype=int64)

# e)    order(x)
tf$argsort(x)

tf.Tensor(
[99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76
 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52
 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28
 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4
  3  2  1  0], shape=(100), dtype=int32)

# f)
# m = matrix(y, 10, 10)
# m_2 = abs(m %*% m)
# m_2_log = log(m_2)

# Mind: We use y here! TensorFlow just accepts floats in the following lines!
mTF = tf$reshape(y, list(10L, 10L))
m_2TF = tf$math$abs( tf$matmul(mTF, tf$transpose(mTF)) )
m_2_logTF = tf$math$log(m_2TF)
print(m_2_logTF)

tf.Tensor(
[[11.4217415 11.311237  11.186988  11.045079  10.87965   10.68132
  10.433675  10.103772   9.608109   8.582045 ]
 [11.311237  11.200746  11.076511  10.934624  10.769221  10.570932
  10.323349   9.993557   9.498147   8.473241 ]
 [11.186988  11.076511  10.952296  10.810434  10.645068  10.446829
  10.199324   9.869672   9.374583   8.351139 ]
 [11.045079  10.934624  10.810434  10.668607  10.503285  10.305112
  10.05771    9.728241   9.233568   8.212026 ]
 [10.87965   10.769221  10.645068  10.503285  10.338026  10.139942
   9.892679   9.563459   9.069353   8.0503845]
 [10.68132   10.570932  10.446829  10.305112  10.139942   9.941987
   9.694924   9.366061   8.872767   7.857481 ]
 [10.433675  10.323349  10.199324  10.05771    9.892679   9.694924
   9.448175   9.119868   8.62784    7.6182513]
 [10.103772   9.993557   9.869672   9.728241   9.563459   9.366061
   9.119868   8.79255    8.302762   7.30317  ]
 [ 9.608109   9.498147   9.374583   9.233568   9.069353   8.872767
   8.62784    8.302762   7.818028   6.8405466]
 [ 8.582045   8.473241   8.351139   8.212026   8.0503845  7.857481
   7.6182513  7.30317    6.8405466  5.9532433]], shape=(10, 10), dtype=float32)

# g)    # Custom mean function
mean_TF = function(y){
  result = tf$math$reduce_sum(y)
  return( result / length(y) )  # If y is an R object.
}
mean_TF(y) == mean(y)

tf.Tensor(True, shape=(), dtype=bool)

Question: Runtime

This exercise compares the speed of R to TensorFlow. The first exercise is to rewrite the following function in TensorFlow:

do_something_R = function(x = matrix(0.0, 10L, 10L)){
  mean_per_row = apply(x, 1, mean)
  result = x - mean_per_row
  return(result)
}

Here, we provide a skeleton for a TensorFlow function:

do_something_TF = function(x = matrix(0.0, 10L, 10L)){
   ...
}

We can compare the speed using the Microbenchmark package:

test = matrix(0.0, 100L, 100L)
microbenchmark::microbenchmark(do_something_R(test), do_something_TF(test))

Try different matrix sizes for the test matrix and compare the speed.

Tip: Have a look at the the tf.reduce_mean documentation and the “axis” argument.

Compare the following with different matrix sizes:

test = matrix(0.0, 1000L, 500L)
testTF = tf$constant(test)

Also try the following:

microbenchmark::microbenchmark(
   tf$matmul(testTF, tf$transpose(testTF)), # TensorFlow style.
   test %*% t(test)  # R style.
)

do_something_TF = function(x = matrix(0.0, 10L, 10L)){
  x = tf$constant(x)  # Remember, this is a local copy!
  mean_per_row = tf$reduce_mean(x, axis = 0L)
  result = x - mean_per_row
  return(result)
}

test = matrix(0.0, 100L, 100L)
microbenchmark::microbenchmark(do_something_R(test), do_something_TF(test))

test = matrix(0.0, 1000L, 500L)
microbenchmark::microbenchmark(do_something_R(test), do_something_TF(test))

Why is R faster (the first time)?

1. The R functions we used (apply, mean, “-”) are also implemented in C.
1. The problem is not large enough and TensorFlow has an overhead.

test = matrix(0.0, 1000L, 500L)
testTF = tf$constant(test)

microbenchmark::microbenchmark(
  tf$matmul(testTF, tf$transpose(testTF)),  # TensorFlow style.
  test %*% t(test) # R style.
)

Question: Linear Algebra

Google to find out how to write the following expressions in TensorFlow:

A = matrix(c(1, 2, 0, 0, 2, 0, 2, 5, 3), 3, 3)

# i)
solve(A)  # Solve equation AX = B. If just A  is given, invert it.

# j)
diag(A) # Diagonal of A, if no matrix is given, construct diagonal matrix.

# k)
diag(diag(A)) # Diagonal matrix with entries diag(A).

# l)
eigen(A)

# m)
det(A)

library(tensorflow)
library(keras3)

A = matrix(c(1., 2., 0., 0., 2., 0., 2., 5., 3.), 3, 3)
# Do not use the "L" form here!

# i)    solve(A)
tf$linalg$inv(A)

# j)    diag(A)
tf$linalg$diag_part(A)

# k)    diag(diag(A))
tf$linalg$diag(tf$linalg$diag_part(A))

# l)    eigen(A)
tf$linalg$eigh(A)

# m)    det(A)
tf$linalg$det(A)

Question: Automatic differentation

TensorFlow supports automatic differentiation (analytical and not numerical!). Let’s have a look at the function $f(x) = 5 x^2 + 3$ with derivative $f'(x) = 10x$. So for $f'(5)$ we will get $10$.

Let’s do this in TensorFlow. Define the function:

f = function(x){ return(5.0 * tf$square(x) + 3.0) }

We want to calculate the derivative for $x = 2.0$:

x = tf$constant(2.0)

To do automatic differentiation, we have to forward $x$ through the function within the tf$GradientTape() environment. We have also have to tell TensorFlow which value to “watch”:

with(tf$GradientTape() %as% tape,
  {
    tape$watch(x)
    y = f(x)
  }
)

To print the gradient:

(tape$gradient(y, x))

tf.Tensor(20.0, shape=(), dtype=float32)

We can also calculate the second order derivative $f''(x) = 10$:

with(tf$GradientTape() %as% first,
  {
    first$watch(x)
    with(tf$GradientTape() %as% second,
      {
        second$watch(x)
        y = f(x)
        g = first$gradient(y, x)
      }
    )
  }
)

(second$gradient(g, x))

tf.Tensor(10.0, shape=(), dtype=float32)

What is happening here? Think about and discuss it.

A more advanced example: Linear regression

In this case we first simulate data following $\boldsymbol{y} = \boldsymbol{X} \boldsymbol{w} + \boldsymbol{\epsilon}$ ($\boldsymbol{\epsilon}$ follows a normal distribution == error).

x = matrix(round(runif(500, -2, 2), 3), 100, 5)
w = round(rnorm(5, 2, 1), 3)
y = x %*% w + round(rnorm(100, 0, 0.25), 4)

In R we would do the following to fit a linear regression model:

summary(lm(y~x))


Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.60967 -0.14994 -0.01641  0.15907  0.72013 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.01616    0.02638  -0.613    0.542    
x1           1.95722    0.02190  89.368   <2e-16 ***
x2           3.50553    0.02409 145.530   <2e-16 ***
x3           2.16999    0.02254  96.269   <2e-16 ***
x4           1.92183    0.02312  83.122   <2e-16 ***
x5           2.09638    0.02273  92.246   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2585 on 94 degrees of freedom
Multiple R-squared:  0.9983,    Adjusted R-squared:  0.9982 
F-statistic: 1.117e+04 on 5 and 94 DF,  p-value: < 2.2e-16

Let’s build our own model in TensorFlow. Here, we use now the variable data container type (remember they are mutable and we need this type for the weights ($\boldsymbol{w}$) of the regression model). We want our model to learn these weights.

The input (predictors, independent variables or features, $\boldsymbol{X}$) and the observed (response, $\boldsymbol{y}$) are constant and will not be learned/optimized.

library(tensorflow)
library(keras3)

x = matrix(round(runif(500, -2, 2), 3), 100, 5)
w = round(rnorm(5, 2, 1), 3)
y = x %*% w + round(rnorm(100, 0, 0.25), 4)

# Weights we want to learn.
# We know the real weights but in reality we wouldn't know them.
# So use guessed ones.
wTF = tf$Variable(matrix(rnorm(5, 0, 0.01), 5, 1))

xTF = tf$constant(x)
yTF = tf$constant(y)

# We need an optimizer which updates the weights (wTF).
optimizer = tf$keras$optimizers$Adamax(learning_rate = 0.1)

for(i in 1:100){
  with(tf$GradientTape() %as% tape,
    {
      pred = tf$matmul(xTF, wTF)
      loss = tf$sqrt(tf$reduce_mean(tf$square(yTF - pred)))
    }
  )

  if(!i%%10){ print(as.numeric(loss), message = "Loss: ") }  # Every 10 times.
  grads = tape$gradient(loss, wTF)
  optimizer$apply_gradients(purrr::transpose(list(list(grads), list(wTF))))
}

[1] 2.48493
[1] 1.093104
[1] 0.44713
[1] 0.4174882
[1] 0.3037115
[1] 0.3142751
[1] 0.3034402
[1] 0.3028021
[1] 0.3022693
[1] 0.3016723

print(as.matrix(wTF), message = "Resulting weights:\n")

          [,1]
[1,] 0.9044624
[2,] 2.7220652
[3,] 1.6548279
[4,] 1.3135430
[5,] 0.1999556

cat("Original weights: ", w, "\n")

Original weights:  0.882 2.718 1.652 1.321 0.189

Discuss the code, go through the code line by line and try to understand it.

Additional exercise:

Play around with the simulation, increase/decrease the number of weights, add an intercept (you also need an additional variable in model).

library(tensorflow)
library(keras3)

numberOfWeights = 3
numberOfFeatures = 10000

x = matrix(round(runif(numberOfFeatures * numberOfWeights, -2, 2), 3),
           numberOfFeatures, numberOfWeights)
w = round(rnorm(numberOfWeights, 2, 1), 3)
intercept = round(rnorm(1, 3, .5), 3)
y = intercept + x %*% w + round(rnorm(numberOfFeatures, 0, 0.25), 4)

# Guessed weights and intercept.
wTF = tf$Variable(matrix(rnorm(numberOfWeights, 0, 0.01), numberOfWeights, 1))
interceptTF = tf$Variable(matrix(rnorm(1, 0, .5)), 1, 1) # Double, not float32.

xTF = tf$constant(x)
yTF = tf$constant(y)

optimizer = tf$keras$optimizers$Adamax(learning_rate = 0.05)

for(i in 1:100){
  with(tf$GradientTape(persistent = TRUE) %as% tape,
    {
      pred = tf$add(interceptTF, tf$matmul(xTF, wTF))
      loss = tf$sqrt(tf$reduce_mean(tf$square(yTF - pred)))
    }
  )

  if(!i%%10){ print(as.numeric(loss), message = "Loss: ") }  # Every 10 times.
  grads = tape$gradient(loss, list(wTF, interceptTF))
  optimizer$apply_gradients(purrr::transpose(list(grads, list(wTF, interceptTF))))
}

[1] 5.442895
[1] 4.558433
[1] 3.709029
[1] 2.902339
[1] 2.142875
[1] 1.43612
[1] 0.7944588
[1] 0.3106539
[1] 0.3024492
[1] 0.2686584

print(as.matrix(wTF), message = "Resulting weights:\n")

          [,1]
[1,] 2.0252784
[2,] 4.0646193
[3,] 0.3679218

cat("Original weights: ", w, "\n")

Original weights:  2.014 4.003 0.368

print(as.numeric(interceptTF), message = "Resulting intercept:\n")

[1] 3.072313

cat("Original intercept: ", intercept, "\n")

Original intercept:  3.112

B.2 Introduction to PyTorch

PyTorch is another famous library for deep learning. Like TensorFlow, Torch itself is written in C++ with an API for Python. In 2020, the RStudio team released R-Torch, and while R-TensorFlow calls the Python API in the background, the R-Torch API is built directly on the C++ Torch library!

Useful links:

PyTorch documentation (This is for the Python API, bust just replace the “.” with “$”.)
R-Torch website

To get started with Torch, we have to load the library and check if the installation worked.

library(torch)


Attaching package: 'torch'

The following object is masked from 'package:keras3':

    as_iterator

B.2.1 Data Containers

Unlike TensorFlow, Torch doesn’t have two data containers for mutable and immutable variables. All variables are initialized via the torch_tensor function:

a = torch_tensor(1.)

To mark variables as mutable (and to track their operations for automatic differentiation) we have to set the argument ‘requires_grad’ to true in the torch_tensor function:

mutable = torch_tensor(5, requires_grad = TRUE) # tf$Variable(...)
immutable = torch_tensor(5, requires_grad = FALSE) # tf$constant(...)

B.2.2 Basic Operations

We now can define the variables and do some math with them:

a = torch_tensor(5.)
b = torch_tensor(10.)
print(a)

torch_tensor
 5
[ CPUFloatType{1} ]

print(b)

torch_tensor
 10
[ CPUFloatType{1} ]

c = a$add(b)
print(c)

torch_tensor
 15
[ CPUFloatType{1} ]

The R-Torch package provides all common R methods (an advantage over TensorFlow).

a = torch_tensor(5.)
b = torch_tensor(10.)
print(a+b)

torch_tensor
 15
[ CPUFloatType{1} ]

print(a/b)

torch_tensor
 0.5000
[ CPUFloatType{1} ]

print(a*b)

torch_tensor
 50
[ CPUFloatType{1} ]

Their operators also automatically transform R numbers into tensors when attempting to add a tensor to a R number:

d = a + 5  # 5 is automatically converted to a tensor.
print(d)

torch_tensor
 10
[ CPUFloatType{1} ]

As for TensorFlow, we have to explicitly transform the tensors back to R:

class(d)

[1] "torch_tensor" "R7"

class(as.numeric(d))

[1] "numeric"

B.2.3 Data Types

Similar to TensorFlow:

r_matrix = matrix(runif(10*10), 10, 10)
m = torch_tensor(r_matrix, dtype = torch_float32()) 
b = torch_tensor(2.0, dtype = torch_float64())
c = m / b

But here’s a difference! With TensorFlow we would get an error, but with R-Torch, m is automatically casted to a double (float64). However, this is still bad practice!

During the course we will try to provide the corresponding PyTorch code snippets for all Keras/TensorFlow examples.

B.2.4 Exercises

Question: Torch Operations

Rewrite the following expressions (a to g) in torch:

x = 100:1
y = as.double(100:1)

# a)
min(x)

[1] 1

# b)
mean(x)

[1] 50.5

# c) Tip: Use Google!
which.max(x)

[1] 1

# d) 
which.min(x)

[1] 100

# e) Tip: Use Google! 
order(x)

  [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
 [19]  82  81  80  79  78  77  76  75  74  73  72  71  70  69  68  67  66  65
 [37]  64  63  62  61  60  59  58  57  56  55  54  53  52  51  50  49  48  47
 [55]  46  45  44  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
 [73]  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11
 [91]  10   9   8   7   6   5   4   3   2   1

# f) Tip: See tf$reshape.
m = matrix(y, 10, 10) # Mind: We use y here! (Float)
m_2 = abs(m %*% t(m))  # m %*% m is the normal matrix multiplication.
m_2_log = log(m_2)
print(m_2_log)

          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
 [1,] 10.55841 10.54402 10.52943 10.51461 10.49957 10.48431 10.46880 10.45305
 [2,] 10.54402 10.52969 10.51515 10.50040 10.48542 10.47022 10.45478 10.43910
 [3,] 10.52943 10.51515 10.50067 10.48598 10.47107 10.45593 10.44057 10.42496
 [4,] 10.51461 10.50040 10.48598 10.47135 10.45651 10.44144 10.42614 10.41061
 [5,] 10.49957 10.48542 10.47107 10.45651 10.44173 10.42674 10.41151 10.39605
 [6,] 10.48431 10.47022 10.45593 10.44144 10.42674 10.41181 10.39666 10.38127
 [7,] 10.46880 10.45478 10.44057 10.42614 10.41151 10.39666 10.38158 10.36628
 [8,] 10.45305 10.43910 10.42496 10.41061 10.39605 10.38127 10.36628 10.35105
 [9,] 10.43705 10.42317 10.40910 10.39482 10.38034 10.36565 10.35073 10.33559
[10,] 10.42079 10.40699 10.39299 10.37879 10.36439 10.34977 10.33495 10.31989
          [,9]    [,10]
 [1,] 10.43705 10.42079
 [2,] 10.42317 10.40699
 [3,] 10.40910 10.39299
 [4,] 10.39482 10.37879
 [5,] 10.38034 10.36439
 [6,] 10.36565 10.34977
 [7,] 10.35073 10.33495
 [8,] 10.33559 10.31989
 [9,] 10.32022 10.30461
[10,] 10.30461 10.28909

# g) Custom mean function i.e. rewrite the function using TensorFlow. 
mean_R = function(y){
  result = sum(y) / length(y)
  return(result)
}

mean_R(y) == mean(y)    # Test for equality.

[1] TRUE

library(torch)


x = 100:1
y = as.double(100:1)

# a)    min(x)
torch_min(x) # Integer!

torch_tensor
1
[ CPULongType{} ]

torch_min(y) # Float!

torch_tensor
1
[ CPUFloatType{} ]

# b)    mean(x)
# Check out the difference here:
mean(x)

[1] 50.5

mean(y)

[1] 50.5

torch_mean(torch_tensor(x, dtype = torch_float32()))  # Integer! Why?

torch_tensor
50.5
[ CPUFloatType{} ]

torch_mean(y)  # Float!

torch_tensor
50.5
[ CPUFloatType{} ]

# c)    which.max(x)
torch_argmax(x)

torch_tensor
1
[ CPULongType{} ]

torch_argmax(y)

torch_tensor
1
[ CPULongType{} ]

# d)    which.min(x)
torch_argmin(x)

torch_tensor
100
[ CPULongType{} ]

# e)    order(x)
torch_argsort(x)

torch_tensor
 100
  99
  98
  97
  96
  95
  94
  93
  92
  91
  90
  89
  88
  87
  86
  85
  84
  83
  82
  81
  80
  79
  78
  77
  76
  75
  74
  73
  72
  71
... [the output was truncated (use n=-1 to disable)]
[ CPULongType{100} ]

# f)
# m = matrix(y, 10, 10)
# m_2 = abs(m %*% m)
# m_2_log = log(m_2)

# Mind: We use y here! 
mTorch = torch_reshape(y, c(10, 10))
mTorch2 = torch_abs(torch_matmul(mTorch, torch_t(mTorch))) # hard to read!

# Better:
mTorch2 = mTorch$matmul( mTorch$t() )$abs()
mTorch2_log = mTorch$log()

print(mTorch2_log)

torch_tensor
 4.6052  4.5951  4.5850  4.5747  4.5643  4.5539  4.5433  4.5326  4.5218  4.5109
 4.4998  4.4886  4.4773  4.4659  4.4543  4.4427  4.4308  4.4188  4.4067  4.3944
 4.3820  4.3694  4.3567  4.3438  4.3307  4.3175  4.3041  4.2905  4.2767  4.2627
 4.2485  4.2341  4.2195  4.2047  4.1897  4.1744  4.1589  4.1431  4.1271  4.1109
 4.0943  4.0775  4.0604  4.0431  4.0254  4.0073  3.9890  3.9703  3.9512  3.9318
 3.9120  3.8918  3.8712  3.8501  3.8286  3.8067  3.7842  3.7612  3.7377  3.7136
 3.6889  3.6636  3.6376  3.6109  3.5835  3.5553  3.5264  3.4965  3.4657  3.4340
 3.4012  3.3673  3.3322  3.2958  3.2581  3.2189  3.1781  3.1355  3.0910  3.0445
 2.9957  2.9444  2.8904  2.8332  2.7726  2.7081  2.6391  2.5649  2.4849  2.3979
 2.3026  2.1972  2.0794  1.9459  1.7918  1.6094  1.3863  1.0986  0.6931  0.0000
[ CPUFloatType{10,10} ]

# g)    # Custom mean function
mean_Torch = function(y){
  result = torch_sum(y)
  return( result / length(y) )  # If y is an R object.
}
mean_Torch(y) == mean(y)

torch_tensor
 1
[ CPUBoolType{1} ]

Question: Runtime

What is the meaning of “An effect is not significant”?
Is an effect with three *** more significant / certain than an effect with one *?

This exercise compares the speed of R to torch The first exercise is to rewrite the following function in torch:

do_something_R = function(x = matrix(0.0, 10L, 10L)){
  mean_per_row = apply(x, 1, mean)
  result = x - mean_per_row
  return(result)
}

Here, we provide a skeleton for a TensorFlow function:

do_something_torch= function(x = matrix(0.0, 10L, 10L)){
   ...
}

We can compare the speed using the Microbenchmark package:

test = matrix(0.0, 100L, 100L)
microbenchmark::microbenchmark(do_something_R(test), do_something_torch(test))

Try different matrix sizes for the test matrix and compare the speed.

Tip: Have a look at the the torch_mean documentation and the “dim” argument.

Compare the following with different matrix sizes:

test = matrix(0.0, 1000L, 500L)
testTorch = torch_tensor(test)

Also try the following:

microbenchmark::microbenchmark(
   torch_matmul(testTorch, testTorch$t()), # Torch style.
   test %*% t(test)  # R style.
)

do_something_torch = function(x = matrix(0.0, 10L, 10L)){
  x = torch_tensor(x)  # Remember, this is a local copy!
  mean_per_row = torch_mean(x, dim = 1)
  result = x - mean_per_row
  return(result)
}

test = matrix(0.0, 100L, 100L)
microbenchmark::microbenchmark(do_something_R(test), do_something_torch(test))

test = matrix(0.0, 1000L, 500L)
microbenchmark::microbenchmark(do_something_R(test), do_something_torch(test))

Why is R faster (the first time)?

1. The R functions we used (apply, mean, “-”) are also implemented in C.
1. The problem is not large enough and torch has an overhead.

test = matrix(0.0, 1000L, 500L)
testTorch = torch_tensor(test)

microbenchmark::microbenchmark(
   torch_matmul(testTorch, testTorch$t()), # Torch style.
   test %*% t(test)  # R style.
)

:::

Question: Linear Algebra

Google to find out how to write the following tasks in torch:

A = matrix(c(1, 2, 0, 0, 2, 0, 2, 5, 3), 3, 3)

# i)
solve(A)  # Solve equation AX = B. If just A  is given, invert it.

# j)
diag(A) # Diagonal of A, if no matrix is given, construct diagonal matrix.

# k)
diag(diag(A)) # Diagonal matrix with entries diag(A).

# l)
eigen(A)

# m)
det(A)

library(torch)

A = matrix(c(1., 2., 0., 0., 2., 0., 2., 5., 3.), 3, 3)
# Do not use the "L" form here!

# i)    solve(A)
linalg_inv(A)

torch_tensor
 1.0000 -0.0000 -0.6667
-1.0000  0.5000 -0.1667
 0.0000  0.0000  0.3333
[ CPUFloatType{3,3} ]

# j)    diag(A)
torch_diag(A)

torch_tensor
 1
 2
 3
[ CPUFloatType{3} ]

# k)    diag(diag(A))
torch_diag(A)$diag()

torch_tensor
 1  0  0
 0  2  0
 0  0  3
[ CPUFloatType{3,3} ]

# l)    eigen(A)
linalg_eigh(A)

[[1]]
torch_tensor
-0.5616
 3.0000
 3.5616
[ CPUFloatType{3} ]

[[2]]
torch_tensor
-0.7882  0.0000  0.6154
 0.6154  0.0000  0.7882
 0.0000  1.0000  0.0000
[ CPUFloatType{3,3} ]

# m)    det(A)
linalg_det(A)

torch_tensor
6
[ CPUFloatType{} ]

Question: Automatic differentation

Torch supports automatic differentiation (analytical and not numerical!). Let’s have a look at the function $f(x) = 5 x^2 + 3$ with derivative $f'(x) = 10x$. So for $f'(5)$ we will get $10$.

Let’s do this in torch Define the function:

f = function(x){ return(5.0 * torch_pow(x, 2.) + 3.0) }

We want to calculate the derivative for $x = 2.0$:

x = torch_tensor(2.0, requires_grad = TRUE)

To do automatic differentiation, we have to forward $x$ through the function and call the $backward() method of the result:

y = f(x)
y$backward(retain_graph=TRUE )

To print the gradient:

x$grad

torch_tensor
 20
[ CPUFloatType{1} ]

We can also calculate the second order derivative $f''(x) = 10$:

x = torch_tensor(2.0, requires_grad = TRUE)
y = f(x)
grad = torch::autograd_grad(y, x, retain_graph = TRUE, create_graph = TRUE)[[1]] # first
(torch::autograd_grad(grad, x, retain_graph = TRUE, create_graph = TRUE)[[1]]) # second

torch_tensor
 10
[ CPUFloatType{1} ][ grad_fn = <MulBackward0> ]

What is happening here? Think about and discuss it.

A more advanced example: Linear regression

In this case we first simulate data following $\boldsymbol{y} = \boldsymbol{X} \boldsymbol{w} + \boldsymbol{\epsilon}$ ($\boldsymbol{\epsilon}$ follows a normal distribution == error).

set_random_seed(321L, disable_gpu = FALSE)  # Already sets R's random seed.

x = matrix(round(runif(500, -2, 2), 3), 100, 5)
w = round(rnorm(5, 2, 1), 3)
y = x %*% w + round(rnorm(100, 0, 0.25), 4)

In R we would do the following to fit a linear regression model:

summary(lm(y~x))


Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.67893 -0.16399  0.00968  0.15058  0.51099 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.004865   0.027447   0.177     0.86    
x1          2.191511   0.023243  94.287   <2e-16 ***
x2          2.741690   0.025328 108.249   <2e-16 ***
x3          1.179181   0.023644  49.872   <2e-16 ***
x4          0.591873   0.025154  23.530   <2e-16 ***
x5          2.302417   0.022575 101.991   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2645 on 94 degrees of freedom
Multiple R-squared:  0.9974,    Adjusted R-squared:  0.9972 
F-statistic:  7171 on 5 and 94 DF,  p-value: < 2.2e-16

Let’s build our own model in TensorFlow. Here, we use now the variable data container type (remember they are mutable and we need this type for the weights ($\boldsymbol{w}$) of the regression model). We want our model to learn these weights.

The input (predictors, independent variables or features, $\boldsymbol{X}$) and the observed (response, $\boldsymbol{y}$) are constant and will not be learned/optimized.

library(torch)
torch::torch_manual_seed(42L)

x = matrix(round(runif(500, -2, 2), 3), 100, 5)
w = round(rnorm(5, 2, 1), 3)
y = x %*% w + round(rnorm(100, 0, 0.25), 4)

# Weights we want to learn.
# We know the real weights but in reality we wouldn't know them.
# So use guessed ones.
wTorch = torch_tensor(matrix(rnorm(5, 0, 0.01), 5, 1), requires_grad = TRUE)

xTorch = torch_tensor(x)
yTorch = torch_tensor(y)

# We need an optimizer which updates the weights (wTF).
optimizer = optim_adam(params = list(wTorch), lr = 0.1)

for(i in 1:100){
  pred = xTorch$matmul(wTorch)
  loss = (yTorch - pred)$pow(2.0)$mean()$sqrt()

  if(!i%%10){ print(paste0("Loss: ", as.numeric(loss)))}  # Every 10 times.
  loss$backward()
  optimizer$step() # do optimization step
  optimizer$zero_grad() # reset gradients
}

[1] "Loss: 4.4065318107605"
[1] "Loss: 2.37926030158997"
[1] "Loss: 0.901207447052002"
[1] "Loss: 0.403193771839142"
[1] "Loss: 0.296265512704849"
[1] "Loss: 0.268377929925919"
[1] "Loss: 0.232994765043259"
[1] "Loss: 0.219554618000984"
[1] "Loss: 0.215328559279442"
[1] "Loss: 0.213282018899918"

cat("Inferred weights: ", round(as.numeric(wTorch), 3), "\n")

Inferred weights:  0.701 3.089 1.801 1.123 3.452

cat("Original weights: ", w, "\n")

Original weights:  0.67 3.085 1.787 1.121 3.455

Discuss the code, go through the code line by line and try to understand it.

Additional exercise:

Play around with the simulation, increase/decrease the number of weights, add an intercept (you also need an additional variable in model).

library(torch)
torch::torch_manual_seed(42L)

numberOfWeights = 3
numberOfFeatures = 10000

x = matrix(round(runif(numberOfFeatures * numberOfWeights, -2, 2), 3),
           numberOfFeatures, numberOfWeights)
w = round(rnorm(numberOfWeights, 2, 1), 3)
intercept = round(rnorm(1, 3, .5), 3)
y = intercept + x %*% w + round(rnorm(numberOfFeatures, 0, 0.25), 4)

# Guessed weights and intercept.
wTorch = torch_tensor(matrix(rnorm(numberOfWeights, 0, 0.01), numberOfWeights, 1), requires_grad = TRUE)
interceptTorch = torch_tensor(matrix(rnorm(1, 0, .5), 1, 1), requires_grad = TRUE) # Double, not float32.

xTorch = torch_tensor(x)
yTorch = torch_tensor(y)

# We need an optimizer which updates the weights (wTF).
optimizer = optim_adam(params = list(wTorch, interceptTorch), lr = 0.1)

for(i in 1:100){
  pred = xTorch$matmul(wTorch)$add(interceptTorch)
  loss = (yTorch - pred)$pow(2.0)$mean()$sqrt()

  if(!i%%10){ print(paste0("Loss: ", as.numeric(loss)))}  # Every 10 times.
  loss$backward()
  optimizer$step() # do optimization step
  optimizer$zero_grad() # reset gradients
}

[1] "Loss: 3.51533484458923"
[1] "Loss: 1.74870145320892"
[1] "Loss: 0.41416934132576"
[1] "Loss: 0.518697261810303"
[1] "Loss: 0.293963432312012"
[1] "Loss: 0.263338804244995"
[1] "Loss: 0.258341491222382"
[1] "Loss: 0.254723280668259"
[1] "Loss: 0.252453774213791"
[1] "Loss: 0.25116890668869"

cat("Inferred weights: ", round(as.numeric(wTorch), 3), "\n")

Inferred weights:  3.118 -0.349 2.107

cat("Original weights: ", w, "\n")

Original weights:  3.131 -0.353 2.11

cat("Inferred intercept: ", round(as.numeric(interceptTorch), 3), "\n")

Inferred intercept:  2.836

cat("Original intercept: ", intercept, "\n")

Original intercept:  2.832