As we have seen today, machine learning algorithms are distributed over many packages, but the general machine learning pipeline is very similar for all models: feature engineering, feature selection, hyperparameter tuning, and cross-validation.
Machine learning frameworks such as mlr3 or tidymodels provide a general interface for the ML pipeline, in particular the training and the hyperparameter tuning with nested CV. They support most ML packages/algorithms.
C.1 mlr3
The key features of mlr3 are:
All common machine learning packages are integrated into mlr3, so you can easily switch between different machine learning algorithms.
A common ‘language’/workflow to specify machine learning pipelines.
Support for different cross-validation strategies.
Hyperparameter tuning for all supported machine learning algorithms.
The mlr3 package actually consists of several packages for different tasks (e.g. mlr3tuning for hyperparameter tuning, mlr3pipelines for data preprocessing pipelines). But let’s start with the basic workflow:
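The code that loads the packages and the data is not shown here; a minimal sketch of how it might look (assuming the nasa data ships with the EcoData package and that the mlr3 helper packages are attached), after which str(nasa) produces the overview below:

library(mlr3)          # Core framework (tasks, learners, resampling).
library(mlr3learners)  # Learner collection, e.g. classif.glmnet.
library(mlr3pipelines) # Preprocessing graphs: po() and %>>%.
library(EcoData)       # Assumed source of the 'nasa' data.
library(dplyr)

str(nasa)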
'data.frame': 4687 obs. of 40 variables:
$ Neo.Reference.ID : int 3449084 3702322 3406893 NA 2363305 3017307 2438430 3653917 3519490 2066391 ...
$ Name : int NA 3702322 3406893 3082923 2363305 3017307 2438430 3653917 3519490 NA ...
$ Absolute.Magnitude : num 18.7 22.1 24.8 21.6 21.4 18.2 20 21 20.9 16.5 ...
$ Est.Dia.in.KM.min. : num 0.4837 0.1011 0.0291 0.1272 0.1395 ...
$ Est.Dia.in.KM.max. : num 1.0815 0.226 0.0652 0.2845 0.3119 ...
$ Est.Dia.in.M.min. : num 483.7 NA 29.1 127.2 139.5 ...
$ Est.Dia.in.M.max. : num 1081.5 226 65.2 284.5 311.9 ...
$ Est.Dia.in.Miles.min. : num 0.3005 0.0628 NA 0.0791 0.0867 ...
$ Est.Dia.in.Miles.max. : num 0.672 0.1404 0.0405 0.1768 0.1938 ...
$ Est.Dia.in.Feet.min. : num 1586.9 331.5 95.6 417.4 457.7 ...
$ Est.Dia.in.Feet.max. : num 3548 741 214 933 1023 ...
$ Close.Approach.Date : Factor w/ 777 levels "1995-01-01","1995-01-08",..: 511 712 472 239 273 145 428 694 87 732 ...
$ Epoch.Date.Close.Approach : num NA 1.42e+12 1.21e+12 1.00e+12 1.03e+12 ...
$ Relative.Velocity.km.per.sec: num 11.22 13.57 5.75 13.84 4.61 ...
$ Relative.Velocity.km.per.hr : num 40404 48867 20718 49821 16583 ...
$ Miles.per.hour : num 25105 30364 12873 30957 10304 ...
$ Miss.Dist..Astronomical. : num NA 0.0671 0.013 0.0583 0.0381 ...
$ Miss.Dist..lunar. : num 112.7 26.1 NA 22.7 14.8 ...
$ Miss.Dist..kilometers. : num 43348668 10030753 1949933 NA 5694558 ...
$ Miss.Dist..miles. : num 26935614 6232821 1211632 5418692 3538434 ...
$ Orbiting.Body : Factor w/ 1 level "Earth": 1 1 1 1 1 1 1 1 1 1 ...
$ Orbit.ID : int NA 8 12 12 91 NA 24 NA NA 212 ...
$ Orbit.Determination.Date : Factor w/ 2680 levels "2014-06-13 15:20:44",..: 69 NA 1377 1774 2275 2554 1919 731 1178 2520 ...
$ Orbit.Uncertainity : int 0 8 6 0 0 0 1 1 1 0 ...
$ Minimum.Orbit.Intersection : num NA 0.05594 0.00553 NA 0.0281 ...
$ Jupiter.Tisserand.Invariant : num 5.58 3.61 4.44 5.5 NA ...
$ Epoch.Osculation : num 2457800 2457010 NA 2458000 2458000 ...
$ Eccentricity : num 0.276 0.57 0.344 0.255 0.22 ...
$ Semi.Major.Axis : num 1.1 NA 1.52 1.11 1.24 ...
$ Inclination : num 20.06 4.39 5.44 23.9 3.5 ...
$ Asc.Node.Longitude : num 29.85 1.42 170.68 356.18 183.34 ...
$ Orbital.Period : num 419 1040 682 427 503 ...
$ Perihelion.Distance : num 0.794 0.864 0.994 0.828 0.965 ...
$ Perihelion.Arg : num 41.8 359.3 350 268.2 179.2 ...
$ Aphelion.Dist : num 1.4 3.15 2.04 1.39 1.51 ...
$ Perihelion.Time : num 2457736 2456941 2457937 NA 2458070 ...
$ Mean.Anomaly : num 55.1 NA NA 297.4 310.5 ...
$ Mean.Motion : num 0.859 0.346 0.528 0.843 0.716 ...
$ Equinox : Factor w/ 1 level "J2000": 1 1 NA 1 1 1 1 1 1 1 ...
$ Hazardous : int 0 0 0 1 1 0 0 0 1 1 ...
Let’s drop the time, name, and ID variables and create a classification task:
data = nasa %>%
  select(-Orbit.Determination.Date, -Close.Approach.Date, -Name, -Neo.Reference.ID)
data$Hazardous = as.factor(data$Hazardous)

# Create a classification task.
task = TaskClassif$new(id = "nasa", backend = data,
                       target = "Hazardous", positive = "1")
Create a generic pipeline of data transformation (imputation \(\rightarrow\) scaling \(\rightarrow\) encoding of categorical variables):
set.seed(123)

# Let's create the preprocessing graph.
preprocessing = po("imputeoor") %>>% po("scale") %>>% po("encode")

# Run the task.
transformed_task = preprocessing$train(task)[[1]]
transformed_task$missings()
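The learner, resampling strategy, and performance measure used in the next chunk (EN, cv10, measurement) are not defined above; they might be created as follows. The rows without an observed response are additionally set to the "holdout" role (as done explicitly in the titanic example further below) so that only labeled rows enter the cross-validation; the names and settings here are assumptions:

# Elastic net learner (glmnet) that predicts class probabilities.
EN = lrn("classif.glmnet", predict_type = "prob")

# 10-fold cross-validation and AUC as performance measure.
cv10 = rsmp("cv", folds = 10L)
measurement = msr("classif.auc")

# Keep the unlabeled (submission) rows out of the cross-validation.
transformed_task$set_row_roles((1:nrow(data))[is.na(data$Hazardous)], "holdout")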
result = mlr3::resample(transformed_task, EN, resampling = cv10, store_models = TRUE)

# Calculate the average AUC of the holdouts.
result$aggregate(measurement)
Very cool! Preprocessing + 10-fold cross-validation model evaluation in a few lines of code!
Let’s create the final predictions:
pred = sapply(1:10, function(i)
  result$learners[[i]]$predict(transformed_task,
                               row_ids = (1:nrow(data))[is.na(data$Hazardous)])$data$prob[, "1", drop = FALSE])
dim(pred)
predictions = apply(pred, 1, mean)
But we are still not happy with the results, so let’s do some hyperparameter tuning!
C.1.2 mlr3 - Hyperparameter Tuning
With mlr3, we can easily extend the above example to do hyperparameter tuning within nested cross-validation (the tuning has its own inner cross-validation).
Print the hyperparameter space of our glmnet learner:
EN$param_set
<ParamSet>
Key: <id>
                       id    class lower upper nlevels     default parents
 1:                 alpha ParamDbl     0     1     Inf           1
 2:                   big ParamDbl  -Inf   Inf     Inf     9.9e+35
 3:                devmax ParamDbl     0     1     Inf       0.999
 4:                 dfmax ParamInt     0   Inf     Inf <NoDefault>
 5:                   eps ParamDbl     0     1     Inf       1e-06
 6:                 epsnr ParamDbl     0     1     Inf       1e-08
 7:                 exact ParamLgl    NA    NA       2       FALSE
 8:               exclude ParamInt     1   Inf     Inf <NoDefault>
 9:                  exmx ParamDbl  -Inf   Inf     Inf         250
10:                  fdev ParamDbl     0     1     Inf       1e-05
11:                 gamma ParamDbl  -Inf   Inf     Inf           1   relax
12:             intercept ParamLgl    NA    NA       2        TRUE
13:                lambda ParamUty    NA    NA     Inf <NoDefault>
14:      lambda.min.ratio ParamDbl     0     1     Inf <NoDefault>
15:          lower.limits ParamUty    NA    NA     Inf <NoDefault>
16:                 maxit ParamInt     1   Inf     Inf      100000
17:                 mnlam ParamInt     1   Inf     Inf           5
18:                  mxit ParamInt     1   Inf     Inf         100
19:                mxitnr ParamInt     1   Inf     Inf          25
20:             newoffset ParamUty    NA    NA     Inf <NoDefault>
21:               nlambda ParamInt     1   Inf     Inf         100
22:                offset ParamUty    NA    NA     Inf
23:        penalty.factor ParamUty    NA    NA     Inf <NoDefault>
24:                  pmax ParamInt     0   Inf     Inf <NoDefault>
25:                  pmin ParamDbl     0     1     Inf       1e-09
26:                  prec ParamDbl  -Inf   Inf     Inf       1e-10
27:                 relax ParamLgl    NA    NA       2       FALSE
28:                     s ParamDbl     0   Inf     Inf        0.01
29:           standardize ParamLgl    NA    NA       2        TRUE
30:  standardize.response ParamLgl    NA    NA       2       FALSE
31:                thresh ParamDbl     0   Inf     Inf       1e-07
32:              trace.it ParamInt     0     1       2           0
33:         type.gaussian ParamFct    NA    NA       2 <NoDefault>
34:         type.logistic ParamFct    NA    NA       2 <NoDefault>
35:      type.multinomial ParamFct    NA    NA       2 <NoDefault>
36:          upper.limits ParamUty    NA    NA     Inf <NoDefault>
Define the hyperparameter space of the random forest:
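The code that defines the search space and the self-tuning learner is not shown in this excerpt. As an illustration with the glmnet learner EN from above, a search space and an AutoTuner might be built with mlr3tuning like this (the bounds, the random search, and the evaluation budget are assumptions, not the original settings):

library(mlr3tuning)
library(paradox)

# Search space: mixing parameter alpha and penalty strength lambda.
search_space = ps(
  alpha  = p_dbl(lower = 0, upper = 1),
  lambda = p_dbl(lower = 0.001, upper = 0.5)
)

# AutoTuner = learner + inner CV + tuning strategy; it behaves like a normal learner.
EN_tuned = AutoTuner$new(
  learner = EN,
  resampling = rsmp("cv", folds = 3L),      # inner CV used for tuning
  measure = msr("classif.auc"),
  search_space = search_space,
  terminator = trm("evals", n_evals = 10L), # number of configurations to try
  tuner = tnr("random_search")
)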
Now we can wrap it into the cross-validated setup as done previously:
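The outer resampling call itself is missing above; assuming the self-tuning learner EN_tuned from the sketch and a 3-fold outer CV (the prediction code below averages over three models), it might look like this:

# Outer CV around the self-tuning learner -> nested cross-validation.
result = mlr3::resample(transformed_task, EN_tuned,
                        resampling = rsmp("cv", folds = 3L),
                        store_models = TRUE)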
# Calculate the average AUC of the holdouts.
result$aggregate(measurement)
classif.auc
0.6767554
Let’s create the final predictions:
pred = sapply(1:3, function(i)
  result$learners[[i]]$predict(transformed_task,
                               row_ids = (1:nrow(data))[is.na(data$Hazardous)])$data$prob[, "1", drop = FALSE])
dim(pred)
predictions = apply(pred, 1, mean)
C.2 Exercises
C.2.1 Tuning Regularization
Question: Hyperparameter tuning - Titanic dataset
Tune architecture
Tune training parameters (learning rate, batch size) and regularization
Hints
cito has a feature to automatically tune hyperparameters under cross-validation!
Passing tune(...) to a hyperparameter tells cito to tune this specific hyperparameter.
The tuning = config_tuning(...) argument lets you specify the cross-validation strategy and the number of hyperparameter combinations that should be tried (steps).
After tuning, cito automatically fits a model with the best hyperparameters on the full data and returns this model.
Minimal example with the iris dataset:
library(cito)

df = iris
df[,1:4] = scale(df[,1:4])

model_tuned = dnn(Species~.,
                  loss = "softmax",
                  data = df,
                  lambda = tune(lower = 0.0, upper = 0.2), # you can pass the tune() function to a hyperparameter
                  tuning = config_tuning(CV = 3, steps = 20L))

# Tuning results:
model_tuned$tuning

# model_tuned is now already the best model!
library(EcoData)
library(dplyr)
library(missRanger)

data(titanic_ml)
data = titanic_ml
data = data %>% select(survived, sex, age, fare, pclass)
data[,-1] = missRanger(data[,-1], verbose = 0)

data_sub =
  data %>%
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),]  # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$survived),] # data with known response

model = dnn(survived~.,
            hidden = c(10L, 10L),           # change
            activation = c("selu", "selu"), # change
            loss = "binomial",
            lr = 0.05,                      # change
            validation = 0.2,
            lambda = 0.001,                 # change
            alpha = 0.1,                    # change
            lr_scheduler = config_lr_scheduler("reduce_on_plateau", patience = 10, factor = 0.9),
            data = data_obs, epochs = 40L, verbose = TRUE, plot = TRUE)
Loss at epoch 1: training: 0.720, validation: 0.673, lr: 0.05000
Loss at epoch 2: training: 0.658, validation: 0.811, lr: 0.05000
Loss at epoch 3: training: 0.654, validation: 0.600, lr: 0.05000
Loss at epoch 4: training: 0.647, validation: 0.824, lr: 0.05000
Loss at epoch 5: training: 0.641, validation: 0.586, lr: 0.05000
Loss at epoch 6: training: 0.611, validation: 0.563, lr: 0.05000
Loss at epoch 7: training: 0.639, validation: 0.729, lr: 0.05000
Loss at epoch 8: training: 0.618, validation: 0.565, lr: 0.05000
Loss at epoch 9: training: 0.604, validation: 0.754, lr: 0.05000
Loss at epoch 10: training: 0.602, validation: 0.566, lr: 0.05000
Loss at epoch 11: training: 0.571, validation: 0.510, lr: 0.05000
Loss at epoch 12: training: 0.593, validation: 0.529, lr: 0.05000
Loss at epoch 13: training: 0.594, validation: 0.499, lr: 0.05000
Loss at epoch 14: training: 0.564, validation: 0.560, lr: 0.05000
Loss at epoch 15: training: 0.575, validation: 0.484, lr: 0.05000
Loss at epoch 16: training: 0.559, validation: 0.503, lr: 0.05000
Loss at epoch 17: training: 0.539, validation: 0.457, lr: 0.05000
Loss at epoch 18: training: 0.537, validation: 0.553, lr: 0.05000
Loss at epoch 19: training: 0.590, validation: 0.565, lr: 0.05000
Loss at epoch 20: training: 0.592, validation: 0.549, lr: 0.05000
Loss at epoch 21: training: 0.535, validation: 0.962, lr: 0.05000
Loss at epoch 22: training: 0.515, validation: 0.461, lr: 0.05000
Loss at epoch 23: training: 0.564, validation: 0.453, lr: 0.05000
Loss at epoch 24: training: 0.514, validation: 0.416, lr: 0.05000
Loss at epoch 25: training: 0.517, validation: 0.648, lr: 0.05000
Loss at epoch 26: training: 0.483, validation: 0.573, lr: 0.05000
Loss at epoch 27: training: 0.492, validation: 0.406, lr: 0.05000
Loss at epoch 28: training: 0.501, validation: 0.458, lr: 0.05000
Loss at epoch 29: training: 0.466, validation: 0.491, lr: 0.05000
Loss at epoch 30: training: 0.594, validation: 0.423, lr: 0.05000
Loss at epoch 31: training: 0.525, validation: 0.726, lr: 0.05000
Loss at epoch 32: training: 0.482, validation: 0.397, lr: 0.05000
Loss at epoch 33: training: 0.524, validation: 0.965, lr: 0.05000
Loss at epoch 34: training: 0.525, validation: 0.885, lr: 0.05000
Loss at epoch 35: training: 0.520, validation: 0.430, lr: 0.05000
Loss at epoch 36: training: 0.546, validation: 0.577, lr: 0.05000
Loss at epoch 37: training: 0.493, validation: 0.570, lr: 0.05000
Loss at epoch 38: training: 0.465, validation: 1.023, lr: 0.05000
Loss at epoch 39: training: 0.470, validation: 0.493, lr: 0.05000
Loss at epoch 40: training: 0.482, validation: 0.392, lr: 0.05000
# Predictions:
predictions = predict(model, newdata = data_new, type = "response") # change prediction type to response so that cito predicts probabilities

write.csv(data.frame(y = predictions[,1]), file = "Max_titanic_dnn.csv")
C.2.2 Bonus: mlr3
Task: Use mlr3 for the titanic dataset
Use mlr3 to tune glmnet for the titanic dataset using nested CV
Submit single predictions and multiple predictions
If you need help, take a look at the solution, go through it line by line and try to understand it.
Prepare data
data = titanic_ml %>% select(-name, -ticket, -body)
data$pclass = as.factor(data$pclass)
data$sex = as.factor(data$sex)
data$survived = as.factor(data$survived)

# Change easy things manually:
data$embarked[data$embarked == ""] = "S"  # Fill in "empty" values.
data$embarked = droplevels(as.factor(data$embarked)) # Remove unused levels ("").
data$cabin = (data$cabin != "") * 1 # Dummy code the availability of a cabin.
data$fare[is.na(data$fare)] = mean(data$fare, na.rm = TRUE)
levels(data$home.dest)[levels(data$home.dest) == ""] = "unknown"
levels(data$boat)[levels(data$boat) == ""] = "none"

# Create a classification task.
task = TaskClassif$new(id = "titanic", backend = data,
                       target = "survived", positive = "1")
task$missings()
survived age boat cabin embarked fare home.dest parch
655 263 0 0 0 0 0 0
pclass sex sibsp
0 0 0
# Let's create the preprocessing graph.
preprocessing = po("imputeoor") %>>% po("scale") %>>% po("encode")

# Run the task.
transformed_task = preprocessing$train(task)[[1]]
transformed_task$set_row_roles((1:nrow(data))[is.na(data$survived)], "holdout")
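From here on, the exercise mirrors the nasa example: wrap a glmnet learner into an AutoTuner and resample it on transformed_task. A minimal sketch (search-space bounds, fold numbers, and the evaluation budget are assumptions):

EN = lrn("classif.glmnet", predict_type = "prob")

EN_tuned = AutoTuner$new(
  learner = EN,
  resampling = rsmp("cv", folds = 3L),                               # inner CV (tuning)
  measure = msr("classif.auc"),
  search_space = ps(alpha = p_dbl(0, 1), lambda = p_dbl(0.001, 0.5)),
  terminator = trm("evals", n_evals = 10L),
  tuner = tnr("random_search")
)

result = mlr3::resample(transformed_task, EN_tuned,
                        resampling = rsmp("cv", folds = 3L),         # outer CV (evaluation)
                        store_models = TRUE)
result$aggregate(msr("classif.auc"))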