4  Machine learning pipeline

The Standard Machine Learning Pipeline at the example of the Titanic Data set

Before we specialize on any tuning, it is important to understand that machine learning always consists of a pipeline of actions.

The typical machine learning workflow consist of:

Here is an (optional) video that explains the entire pipeline from a slightly different perspective:

In the following example, we use tidyverse, a collection of R packages for data science / data manipulation mainly developed by Hadley Wickham. A video that explains the basics can be found here :

Another good reference is “R for data science” by Hadley Wickham: .

For this lecture you need the Titanic data set provided by us. You can find it in GRIPS (datasets.RData in the data set and submission section) or at http://rhsbio6.uni-regensburg.de:8500.

We have split the data set already into training and test/prediction data sets (the test/prediction split has one column less than the train split, as the result is not known a priori).

4.1 Data preparation

Load necessary libraries:

library(tidyverse)

Load data set:

library(EcoData)
data(titanic_ml)
data = titanic_ml

Standard summaries:

str(data)
'data.frame':   1309 obs. of  14 variables:
 $ pclass   : int  2 1 3 3 3 3 3 1 3 1 ...
 $ survived : int  1 1 0 0 0 0 0 1 0 1 ...
 $ name     : chr  "Sinkkonen, Miss. Anna" "Woolner, Mr. Hugh" "Sage, Mr. Douglas Bullen" "Palsson, Master. Paul Folke" ...
 $ sex      : Factor w/ 2 levels "female","male": 1 2 2 2 2 2 2 1 1 1 ...
 $ age      : num  30 NA NA 6 30.5 38.5 20 53 NA 42 ...
 $ sibsp    : int  0 0 8 3 0 0 0 0 0 0 ...
 $ parch    : int  0 0 2 1 0 0 0 0 0 0 ...
 $ ticket   : Factor w/ 929 levels "110152","110413",..: 221 123 779 542 589 873 472 823 588 834 ...
 $ fare     : num  13 35.5 69.55 21.07 8.05 ...
 $ cabin    : Factor w/ 187 levels "","A10","A11",..: 1 94 1 1 1 1 1 1 1 1 ...
 $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 2 4 2 ...
 $ boat     : Factor w/ 28 levels "","1","10","11",..: 3 28 1 1 1 1 1 19 1 15 ...
 $ body     : int  NA NA NA NA 50 32 NA NA NA NA ...
 $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 121 213 1 1 1 1 322 350 1 1 ...
summary(data)
     pclass         survived          name               sex     
 Min.   :1.000   Min.   :0.0000   Length:1309        female:466  
 1st Qu.:2.000   1st Qu.:0.0000   Class :character   male  :843  
 Median :3.000   Median :0.0000   Mode  :character               
 Mean   :2.295   Mean   :0.3853                                  
 3rd Qu.:3.000   3rd Qu.:1.0000                                  
 Max.   :3.000   Max.   :1.0000                                  
                 NA's   :655                                     
      age              sibsp            parch            ticket    
 Min.   : 0.1667   Min.   :0.0000   Min.   :0.000   CA. 2343:  11  
 1st Qu.:21.0000   1st Qu.:0.0000   1st Qu.:0.000   1601    :   8  
 Median :28.0000   Median :0.0000   Median :0.000   CA 2144 :   8  
 Mean   :29.8811   Mean   :0.4989   Mean   :0.385   3101295 :   7  
 3rd Qu.:39.0000   3rd Qu.:1.0000   3rd Qu.:0.000   347077  :   7  
 Max.   :80.0000   Max.   :8.0000   Max.   :9.000   347082  :   7  
 NA's   :263                                        (Other) :1261  
      fare                     cabin      embarked      boat    
 Min.   :  0.000                  :1014    :  2           :823  
 1st Qu.:  7.896   C23 C25 C27    :   6   C:270    13     : 39  
 Median : 14.454   B57 B59 B63 B66:   5   Q:123    C      : 38  
 Mean   : 33.295   G6             :   5   S:914    15     : 37  
 3rd Qu.: 31.275   B96 B98        :   4            14     : 33  
 Max.   :512.329   C22 C26        :   4            4      : 31  
 NA's   :1         (Other)        : 271            (Other):308  
      body                      home.dest  
 Min.   :  1.0                       :564  
 1st Qu.: 72.0   New York, NY        : 64  
 Median :155.0   London              : 14  
 Mean   :160.8   Montreal, PQ        : 10  
 3rd Qu.:256.0   Cornwall / Akron, OH:  9  
 Max.   :328.0   Paris, France       :  9  
 NA's   :1188    (Other)             :639  

The name variable consists of 1309 unique factors (there are 1309 observations…) and could be now transformed. If you are interested in how to do that, take a look at the following box.

length(unique(data$name))
[1] 1307

However, there is a title in each name. Let’s extract the titles:

  1. We will extract all names and split each name after each comma “,”.
  2. We will split the second split of the name after a point “.” and extract the titles.
first_split = sapply(data$name,
                     function(x) stringr::str_split(x, pattern = ",")[[1]][2])
titles = sapply(first_split,
                function(x) strsplit(x, ".",fixed = TRUE)[[1]][1])

We get 18 unique titles:

table(titles)
titles
         Capt           Col           Don          Dona            Dr 
            1             4             1             1             8 
     Jonkheer          Lady         Major        Master          Miss 
            1             1             2            61           260 
         Mlle           Mme            Mr           Mrs            Ms 
            2             1           757           197             2 
          Rev           Sir  the Countess 
            8             1             1 

A few titles have a very low occurrence rate:

titles = stringr::str_trim((titles))
titles %>%
 fct_count()
# A tibble: 18 × 2
   f                n
   <fct>        <int>
 1 Capt             1
 2 Col              4
 3 Don              1
 4 Dona             1
 5 Dr               8
 6 Jonkheer         1
 7 Lady             1
 8 Major            2
 9 Master          61
10 Miss           260
11 Mlle             2
12 Mme              1
13 Mr             757
14 Mrs            197
15 Ms               2
16 Rev              8
17 Sir              1
18 the Countess     1

We will combine titles with low occurrences into one title, which we can easily do with the forcats package.

titles2 =
  forcats::fct_collapse(titles,
                        officer = c("Capt", "Col", "Major", "Dr", "Rev"),
                        royal = c("Jonkheer", "Don", "Sir",
                                  "the Countess", "Dona", "Lady"),
                        miss = c("Miss", "Mlle"),
                        mrs = c("Mrs", "Mme", "Ms")
                        )

We can count titles again to see the new number of titles:

titles2 %>%  
   fct_count()
# A tibble: 6 × 2
  f           n
  <fct>   <int>
1 officer    23
2 royal       6
3 Master     61
4 miss      262
5 mrs       200
6 Mr        757

Add new title variable to data set:

data =
  data %>%
    mutate(title = titles2)

4.1.1 Imputation

NAs are a common problem in ML. For example, the age variable has 20% NAs:

summary(data$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.1667 21.0000 28.0000 29.8811 39.0000 80.0000     263 
sum(is.na(data$age)) / nrow(data)
[1] 0.2009167

Either we remove all observations with NAs, or we impute (fill) the missing values, e.g. with the median age. However, age itself might depend on other variables such as sex, class and title. We want to fill the NAs with the median age of these groups. In tidyverse we can easily “group” the data, i.e. we will nest the observations (here: group_by after sex, pclass and title). After grouping, all operations (such as our median(age….)) will be done within the specified groups.

data =
  data %>%
    select(survived, sex, age, fare, pclass) %>% 
    group_by(sex, pclass) %>%
    mutate(age2 = ifelse(is.na(age), median(age, na.rm = TRUE), age)) %>%
    mutate(fare2 = ifelse(is.na(fare), median(fare, na.rm = TRUE), fare)) %>%
    ungroup()

4.1.2 Preprocessing and Feature Selection

Later (tomorrow), we want to use Keras in our example, but it cannot handle factors and requires the data to be scaled.

Normally, one would do this for all predictors, but as we only show the pipeline here, we have sub-selected a bunch of predictors and do this only for them. We first scale the numeric predictors and change the factors with only two groups/levels into integers (this can be handled by Keras).

data_sub =
  data %>%
    select(survived, sex, age2, fare2, pclass) %>%
    mutate(age2 = scales::rescale(age2, c(0, 1)),
           fare2 = scales::rescale(fare2, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))

Factors with more than two levels should be one hot encoded (Make columns for every different factor level and write 1 in the respective column for every taken feature value and 0 else. For example: \(\{red, green, green, blue, red\} \rightarrow \{(0,0,1), (0,1,0), (0,1,0), (1,0,0), (0,0,1)\}\)):

one_title = model.matrix(~0+as.factor(title), data = data)
colnames(one_title) = levels(data$title)

one_sex = model.matrix(~0+as.factor(sex), data = data)
colnames(one_sex) = levels(data$sex)

one_pclass = model.matrix(~0+as.factor(pclass), data = data)
colnames(one_pclass) = paste0("pclass", 1:length(unique(data$pclass)))

And we have to add the dummy encoded variables to the data set:

data = cbind(data.frame(survived= data$survived),
                 one_title, one_sex, age = data$age2,
                 fare = data$fare2, one_pclass)
head(data)

4.2 Modelling

4.2.1 Split data for final predictions

To tune our hyperparameters and evaluate our models, we need to split the data as we learned in the CV section. Before doing so, however, we must split off the new observations in the data set :

summary(data_sub$survived)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.3853  1.0000  1.0000     655 

655 observations have NAs in our response variable, these are the observations for which we want to make predictions at the end of our pipeline (we have no information about their actual values!).

data_new = data_sub[is.na(data_sub$survived),]
data_obs = data_sub[!is.na(data_sub$survived),]

4.2.2 Training and evaluation

Now, we can do a simple 10xCV with the observed_data:

library(glmnet)
library(glmnetUtils)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 10

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = 1:cv,
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  model = glmnet(survived~.,data = train_outer, family = "binomial",alpha = 0.2)
  results[i, 2] = Metrics::auc(test_outer$survived, predict(model, test_outer, 
                                                            alpha = 0.2,
                                                            s = 0.01,
                                                            type = "response"))

}
print(results)
   set       AUC
1    1 0.8650794
2    2 0.8723197
3    3 0.7632114
4    4 0.7541463
5    5 0.8984221
6    6 0.7200000
7    7 0.7829268
8    8 0.8701737
9    9 0.8787879
10  10 0.7867647
print(mean(results$AUC))
[1] 0.8191832

4.2.3 Hyperparameter optimization

We did a simple 10xCV to evaluate our model but we didn’t tune our hyperparameters (\(\lambda\) and \(\alpha\)). If we want to tune them, we need do another CV within each split of the model evaulation CV, which is called nested CV.

We used only one split (the split for the submission server doesn’t count) to evaluate the performance of the model before we made the final predictions. If we test many different hyperparameter combinations, how do we ensure that a certain hyperparameter is not only good for our training dataset but also good for the new data (our outer split on the submission server)? You may have guessed it already, we need to do another CV within the previous CV to check whether a certain hyperparameter solution generalizes to the whole data. To tune \(\lambda\), we would need to split the data another time (called nested CV).

Why is it important to tune hyperparameters? Hyperparameters (configuration parameters of our ML algorithms that (mostly) control their complexity) are usually tuned (optimized) in an automatic / systematic way. A common procedure, called random search, is to sample random configuration combinations from the set of hyperparameters and test for each combination the prediction error.

Let’s implement manually a nested CV to tune the \(\alpha\). Let’s start with a 5CVx5CV and 20x different alpha values:

library(glmnet)
library(glmnetUtils)
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 5
cv_inner = 5
hyper_alpha = runif(20,0, 1)

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = rep(NA, cv),
  alpha = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  
  best_alpha = NULL
  best_auc = NULL
  
  # inner split
  for(j in 1:cv_inner) {
    inner_split = as.integer(cut(1:nrow(train_outer), breaks = cv_inner))
    train_inner = train_outer[inner_split != j, ]
    test_inner = train_outer[inner_split == j, ]
    
    tuning_results_inner = 
      sapply(1:length(hyper_alpha), function(k) {
        model = glmnet(survived~.,data = train_inner, family = "binomial",alpha = hyper_alpha[k])
        return(Metrics::auc(test_inner$survived, predict(model, test_inner, 
                                                         alpha = hyper_alpha[k],
                                                         s = 0.01,
                                                         type = "response")))
      })
    best_alpha[j] = hyper_alpha[which.max(tuning_results_inner)]
    best_auc[j] = max(tuning_results_inner)
  }
  best_alpha = best_alpha[which.max(best_auc)]
  model = glmnet(survived~., data = train_outer, alpha = best_alpha, family = "binomial")
  results[i, 1] = i
  results[i, 2] = best_alpha
  results[i, 3] = Metrics::auc(test_outer$survived, predict(model, test_outer, s = 0.01, alpha = best_alpha, type = "response"))
}

print(results)
  set     alpha       AUC
1   1 0.1346666 0.8720588
2   2 0.4577418 0.7554754
3   3 0.9148060 0.8178054
4   4 0.2861395 0.8238994
5   5 0.9148060 0.8156749

4.3 Predictions and Submission

When we are satisfied with the performance of our model, we will create predictions for the new observations on the submission server. But first we will train now our model on the full observed dataset:

\(\alpha = 0.915\) has the highest AUC, let’s use it to train the model on the full dataset:

model = glmnet(survived~.,data = data_obs, family = "binomial",alpha = 0.915)

We cannot assess the performance for the new observations because the true survival ratio is unknown, however, we can now submit our predictions to the submission server at http://rhsbio7.uni-regensburg.de:8500.

For the submission it is critical to change the predictions into a data.frame, select the second column (the probability to survive), and save it with the write.csv function:

data_new = data_sub[is.na(data_sub$survived),]
write.csv(data.frame(y = predict(model, data_new, alpha = 0.915, s = 0.01, type = "response")[,1] ), file = "Max_1.csv")

We have now used the \(\alpha\) value with the highest AUC here, but our tuning has shown that the best value of \(\alpha\) depends on the partitioning, so it would probably be better to build ten models and combine their predictions (e.g., by averaging the predictions):

prediction_ensemble = 
  sapply(results$alpha, function(alpha) {
    model = glmnet(survived~.,data = data_obs, family = "binomial",alpha = alpha)
    return(predict(model, data_new, alpha = alpha, s = 0.01, type = "response")[,1])
  })
write.csv(data.frame(y = apply(prediction_ensemble, 1, mean)), file = "Max_1.csv")

4.4 Exercises

Task: Tuning \(\alpha\) and \(\lambda\)
  1. Extend the code from above and tune \(\alpha\) and \(\lambda\) (Nested-CV or via a simple CV)

  2. Train the model with best set of hyperparameters and submit your predictions

  3. Compare the predictive performance from the single best model with the ensemble model

Submit both predictions (http://rhsbio7.uni-regensburg.de:8500/), which model has a higher AUC?

library(EcoData)
library(dplyr)
library(missRanger)
library(glmnet)
library(glmnetUtils)
data(titanic_ml)
data = titanic_ml
data = 
  data %>% select(survived, sex, age, fare, pclass)

# missRanger uses a random forest to impute NAs (RF is trained on the data to predict values for the NAs)
data[,-1] = missRanger(data[,-1], verbose = 0)

data_sub =
  data %>%
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$survived),] # data with known response

Bonus:

  • Try different features
  • Try cito
  • Try different datasets (see Appendix A)

Code template for a simple CV (only \(\alpha\) is tuned, add the tuning for \(\lambda\):

library(glmnet)
library(glmnetUtils)
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 5
hyper_alpha = runif(20,0, 1)

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = rep(NA, cv),
  alpha = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  
  tuning_results = 
      sapply(1:length(hyper_alpha), function(k) {
        model = glmnet(survived~.,data = train_outer, family = "binomial",alpha = hyper_alpha[k])
        return(Metrics::auc(test_outer$survived, predict(model, test_outer, 
                                                         alpha = hyper_alpha[k],
                                                         s = 0.01,
                                                         type = "response")))
      })
  best_alpha = hyper_alpha[which.max(tuning_results)]
  results[i, 1] = i
  results[i, 2] = best_alpha
  results[i, 3] = max(tuning_results)
}

print(results)
  set     alpha       AUC
1   1 0.2861395 0.8745098
2   2 0.8304476 0.7569686
3   3 0.8304476 0.8206522
4   4 0.1346666 0.8270440
5   5 0.6569923 0.8144654

Nested CV:

set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 5
cv_inner = 5
hyper_alpha = runif(30,0, 1)
hyper_lambda = runif(30,0, 1)

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = rep(NA, cv),
  alpha = rep(NA, cv),
  lambda = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  
  best_alpha = NULL
  best_lambda = NULL
  best_auc = NULL
  # inner split
  for(j in 1:cv_inner) {
    inner_split = as.integer(cut(1:nrow(train_outer), breaks = cv_inner))
    train_inner = train_outer[inner_split != j, ]
    test_inner = train_outer[inner_split == j, ]
    
    tuning_results_inner = 
      sapply(1:length(hyper_alpha), function(k) {
        model = glmnet(survived~.,data = train_inner, family = "binomial",alpha = hyper_alpha[k])
        return(Metrics::auc(test_inner$survived, predict(model, test_inner, 
                                                         alpha = hyper_alpha[k],
                                                         s = hyper_lambda[k],
                                                         type = "response")))
      })
    best_alpha[j] = hyper_alpha[which.max(tuning_results_inner)]
    best_lambda[j] = hyper_lambda[which.max(tuning_results_inner)]
    best_auc[j] = max(tuning_results_inner)
  }
  best_alpha = best_alpha[which.max(best_auc)]
  best_lambda = best_lambda[which.max(best_auc)]
  
  model = glmnet(survived~., data = train_outer, alpha = best_alpha, family = "binomial")
  results[i, 1] = i
  results[i, 2] = best_alpha
  results[i, 3] = best_lambda
  results[i, 4] = Metrics::auc(test_outer$survived, predict(model, test_outer, s = best_lambda, alpha = best_alpha, type = "response"))
}

print(results)
  set     alpha      lambda       AUC
1   1 0.6417455 0.003948339 0.8745098
2   2 0.7365883 0.007334147 0.7557242
3   3 0.7365883 0.007334147 0.8193582
4   4 0.6417455 0.003948339 0.8253507
5   5 0.6417455 0.003948339 0.8132559

Simple CV:

set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 5
hyper_alpha = runif(20,0, 1)
hyper_lambda = runif(20, 0, 1)

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = rep(NA, cv),
  alpha = rep(NA, cv),
  lambda = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  
  tuning_results = 
      sapply(1:length(hyper_alpha), function(k) {
        model = glmnet(survived~.,data = train_outer, family = "binomial",alpha = hyper_alpha[k])
        return(Metrics::auc(test_outer$survived, predict(model, test_outer, 
                                                         alpha = hyper_alpha[k],
                                                         s = hyper_lambda[k],
                                                         type = "response")))
      })
  best_alpha = hyper_alpha[which.max(tuning_results)]
  best_lambda = hyper_lambda[which.max(tuning_results)]
  results[i, 1] = i
  results[i, 2] = best_alpha
  results[i, 3] = best_lambda
  results[i, 4] = max(tuning_results)
}

print(results)
  set     alpha      lambda       AUC
1   1 0.4622928 0.003948339 0.8740196
2   2 0.4622928 0.003948339 0.7567198
3   3 0.1174874 0.207658973 0.8237578
4   4 0.4622928 0.003948339 0.8255926
5   5 0.9782264 0.007334147 0.8144654

Predictions:

prediction_ensemble = 
  sapply(1:nrow(results), function(i) {
    model = glmnet(survived~.,data = data_obs, family = "binomial",alpha = results$alpha[i])
    return(predict(model, data_new, alpha = results$alpha[i], s = results$lambda[i], type = "response")[,1])
  })

# Single predictions from the model with the highest AUC:
write.csv(data.frame(y = prediction_ensemble[,which.max(results$AUC)]), file = "Max_titanic_best_model.csv")

# Single predictions from the ensemble model:
write.csv(data.frame(y = apply(prediction_ensemble, 1, mean)), file = "Max_titanic_ensemble.csv")

4.5 Machine learning frameworks

As we have seen today, many of the machine learning algorithms are distributed over several packages but the general machine learning pipeline is very similar for all models: feature engineering, feature selection, hyperparameter tuning and cross-validation.

Machine learning frameworks such as mlr3 or tidymodels provide a general interface for the ML pipeline, in particular the training and the hyperparameter tuning with nested CV. They support most ML packages/algorithms.

4.5.1 mlr3

The key features of mlr3 are:

  • All common machine learning packages are integrated into mlr3, you can easily switch between different machine learning algorithms.
  • A common ‘language’/workflow to specify machine learning pipelines.
  • Support for different cross-validation strategies.
  • Hyperparameter tuning for all supported machine learning algorithms.
  • Ensemble models.

Useful links:

4.5.1.1 mlr3 - The Basic Workflow

The mlr3 package actually consists of several packages for different tasks (e.g. mlr3tuning for hyperparameter tuning, mlr3pipelines for data preparation pipes). But let’s start with the basic workflow:

library(EcoData)
library(tidyverse)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3tuning)
library(mlr3measures)
data(nasa)
str(nasa)
'data.frame':   4687 obs. of  40 variables:
 $ Neo.Reference.ID            : int  3449084 3702322 3406893 NA 2363305 3017307 2438430 3653917 3519490 2066391 ...
 $ Name                        : int  NA 3702322 3406893 3082923 2363305 3017307 2438430 3653917 3519490 NA ...
 $ Absolute.Magnitude          : num  18.7 22.1 24.8 21.6 21.4 18.2 20 21 20.9 16.5 ...
 $ Est.Dia.in.KM.min.          : num  0.4837 0.1011 0.0291 0.1272 0.1395 ...
 $ Est.Dia.in.KM.max.          : num  1.0815 0.226 0.0652 0.2845 0.3119 ...
 $ Est.Dia.in.M.min.           : num  483.7 NA 29.1 127.2 139.5 ...
 $ Est.Dia.in.M.max.           : num  1081.5 226 65.2 284.5 311.9 ...
 $ Est.Dia.in.Miles.min.       : num  0.3005 0.0628 NA 0.0791 0.0867 ...
 $ Est.Dia.in.Miles.max.       : num  0.672 0.1404 0.0405 0.1768 0.1938 ...
 $ Est.Dia.in.Feet.min.        : num  1586.9 331.5 95.6 417.4 457.7 ...
 $ Est.Dia.in.Feet.max.        : num  3548 741 214 933 1023 ...
 $ Close.Approach.Date         : Factor w/ 777 levels "1995-01-01","1995-01-08",..: 511 712 472 239 273 145 428 694 87 732 ...
 $ Epoch.Date.Close.Approach   : num  NA 1.42e+12 1.21e+12 1.00e+12 1.03e+12 ...
 $ Relative.Velocity.km.per.sec: num  11.22 13.57 5.75 13.84 4.61 ...
 $ Relative.Velocity.km.per.hr : num  40404 48867 20718 49821 16583 ...
 $ Miles.per.hour              : num  25105 30364 12873 30957 10304 ...
 $ Miss.Dist..Astronomical.    : num  NA 0.0671 0.013 0.0583 0.0381 ...
 $ Miss.Dist..lunar.           : num  112.7 26.1 NA 22.7 14.8 ...
 $ Miss.Dist..kilometers.      : num  43348668 10030753 1949933 NA 5694558 ...
 $ Miss.Dist..miles.           : num  26935614 6232821 1211632 5418692 3538434 ...
 $ Orbiting.Body               : Factor w/ 1 level "Earth": 1 1 1 1 1 1 1 1 1 1 ...
 $ Orbit.ID                    : int  NA 8 12 12 91 NA 24 NA NA 212 ...
 $ Orbit.Determination.Date    : Factor w/ 2680 levels "2014-06-13 15:20:44",..: 69 NA 1377 1774 2275 2554 1919 731 1178 2520 ...
 $ Orbit.Uncertainity          : int  0 8 6 0 0 0 1 1 1 0 ...
 $ Minimum.Orbit.Intersection  : num  NA 0.05594 0.00553 NA 0.0281 ...
 $ Jupiter.Tisserand.Invariant : num  5.58 3.61 4.44 5.5 NA ...
 $ Epoch.Osculation            : num  2457800 2457010 NA 2458000 2458000 ...
 $ Eccentricity                : num  0.276 0.57 0.344 0.255 0.22 ...
 $ Semi.Major.Axis             : num  1.1 NA 1.52 1.11 1.24 ...
 $ Inclination                 : num  20.06 4.39 5.44 23.9 3.5 ...
 $ Asc.Node.Longitude          : num  29.85 1.42 170.68 356.18 183.34 ...
 $ Orbital.Period              : num  419 1040 682 427 503 ...
 $ Perihelion.Distance         : num  0.794 0.864 0.994 0.828 0.965 ...
 $ Perihelion.Arg              : num  41.8 359.3 350 268.2 179.2 ...
 $ Aphelion.Dist               : num  1.4 3.15 2.04 1.39 1.51 ...
 $ Perihelion.Time             : num  2457736 2456941 2457937 NA 2458070 ...
 $ Mean.Anomaly                : num  55.1 NA NA 297.4 310.5 ...
 $ Mean.Motion                 : num  0.859 0.346 0.528 0.843 0.716 ...
 $ Equinox                     : Factor w/ 1 level "J2000": 1 1 NA 1 1 1 1 1 1 1 ...
 $ Hazardous                   : int  0 0 0 1 1 0 0 0 1 1 ...

Let’s drop time, name and ID variable and create a classification task:

data = nasa %>% select(-Orbit.Determination.Date,
                       -Close.Approach.Date, -Name, -Neo.Reference.ID)
data$Hazardous = as.factor(data$Hazardous)

# Create a classification task.
task = TaskClassif$new(id = "nasa", backend = data,
                       target = "Hazardous", positive = "1")

Create a generic pipeline of data transformation (imputation \(\rightarrow\) scaling \(\rightarrow\) encoding of categorical variables):

set.seed(123)

# Let's create the preprocessing graph.
preprocessing = po("imputeoor") %>>% po("scale") %>>% po("encode") 

# Run the task.
transformed_task = preprocessing$train(task)[[1]]

transformed_task$missings()
                   Hazardous           Absolute.Magnitude 
                        4187                            0 
               Aphelion.Dist           Asc.Node.Longitude 
                           0                            0 
                Eccentricity    Epoch.Date.Close.Approach 
                           0                            0 
            Epoch.Osculation         Est.Dia.in.Feet.max. 
                           0                            0 
        Est.Dia.in.Feet.min.           Est.Dia.in.KM.max. 
                           0                            0 
          Est.Dia.in.KM.min.            Est.Dia.in.M.max. 
                           0                            0 
           Est.Dia.in.M.min.        Est.Dia.in.Miles.max. 
                           0                            0 
       Est.Dia.in.Miles.min.                  Inclination 
                           0                            0 
 Jupiter.Tisserand.Invariant                 Mean.Anomaly 
                           0                            0 
                 Mean.Motion               Miles.per.hour 
                           0                            0 
  Minimum.Orbit.Intersection     Miss.Dist..Astronomical. 
                           0                            0 
      Miss.Dist..kilometers.            Miss.Dist..lunar. 
                           0                            0 
           Miss.Dist..miles.                     Orbit.ID 
                           0                            0 
          Orbit.Uncertainity               Orbital.Period 
                           0                            0 
              Perihelion.Arg          Perihelion.Distance 
                           0                            0 
             Perihelion.Time  Relative.Velocity.km.per.hr 
                           0                            0 
Relative.Velocity.km.per.sec              Semi.Major.Axis 
                           0                            0 
               Equinox.J2000             Equinox..MISSING 
                           0                            0 
         Orbiting.Body.Earth       Orbiting.Body..MISSING 
                           0                            0 

We can even visualize the preprocessing graph:

preprocessing$plot()

To test our model (glmnet) with 10-fold cross-validated, we will do:

  • Specify the missing target rows as validation so that they will be ignored.
  • Specify the cross-validation, the learner (the machine learning model we want to use), and the measurement (AUC).
  • Run (benchmark) our model.
set.seed(123)

transformed_task$data()[1,]
   Hazardous Absolute.Magnitude Aphelion.Dist Asc.Node.Longitude Eccentricity
1:         0         -0.8132265    -0.3804201          -1.140837    -0.315606
   Epoch.Date.Close.Approach Epoch.Osculation Est.Dia.in.Feet.max.
1:                 -4.792988        0.1402677            0.2714179
   Est.Dia.in.Feet.min. Est.Dia.in.KM.max. Est.Dia.in.KM.min. Est.Dia.in.M.max.
1:            0.3134076          0.3007134          0.2565687         0.2710953
   Est.Dia.in.M.min. Est.Dia.in.Miles.max. Est.Dia.in.Miles.min. Inclination
1:         0.2916245             0.2620443              0.258651   0.5442288
   Jupiter.Tisserand.Invariant Mean.Anomaly Mean.Motion Miles.per.hour
1:                   0.3840868    -1.028761   0.3193953     -0.2541306
   Minimum.Orbit.Intersection Miss.Dist..Astronomical. Miss.Dist..kilometers.
1:                  -5.459119                -7.076926              0.2512296
   Miss.Dist..lunar. Miss.Dist..miles.  Orbit.ID Orbit.Uncertainity
1:         0.2398625         0.2381077 -9.651472          -1.007087
   Orbital.Period Perihelion.Arg Perihelion.Distance Perihelion.Time
1:     -0.3013135      -1.170536         -0.01831583       0.1052611
   Relative.Velocity.km.per.hr Relative.Velocity.km.per.sec Semi.Major.Axis
1:                  -0.2816782                   -0.2841407      -0.2791037
   Equinox.J2000 Equinox..MISSING Orbiting.Body.Earth Orbiting.Body..MISSING
1:             1                0                   1                      0
transformed_task$set_row_roles((1:nrow(data))[is.na(data$Hazardous)],
                               "holdout")

cv10 = mlr3::rsmp("cv", folds = 10L)
EN = lrn("classif.glmnet", predict_type = "prob")
measurement =  msr("classif.auc")
result = mlr3::resample(transformed_task,
                        EN, resampling = cv10, store_models = TRUE)

# Calculate the average AUC of the holdouts.
result$aggregate(measurement)

Very cool! Preprocessing + 10-fold cross-validation model evaluation in a few lines of code!

Let’s create the final predictions:

pred = sapply(1:10, function(i) result$learners[[i]]$predict(transformed_task,
row_ids = (1:nrow(data))[is.na(data$Hazardous)])$data$prob[, "1", drop = FALSE])
dim(pred)
predictions = apply(pred, 1, mean)

You could now submit the predictions here.

But we are still not happy with the results, let’s do some hyperparameter tuning!

4.5.1.2 mlr3 - Hyperparameter Tuning

With mlr3, we can easily extend the above example to do hyperparameter tuning within nested cross-validation (the tuning has its own inner cross-validation).

Print the hyperparameter space of our glmnet learner:

EN$param_set
<ParamSet>
                      id    class lower upper nlevels        default parents
 1:                alpha ParamDbl     0     1     Inf              1        
 2:                  big ParamDbl  -Inf   Inf     Inf        9.9e+35        
 3:               devmax ParamDbl     0     1     Inf          0.999        
 4:                dfmax ParamInt     0   Inf     Inf <NoDefault[3]>        
 5:                  eps ParamDbl     0     1     Inf          1e-06        
 6:                epsnr ParamDbl     0     1     Inf          1e-08        
 7:                exact ParamLgl    NA    NA       2          FALSE        
 8:              exclude ParamInt     1   Inf     Inf <NoDefault[3]>        
 9:                 exmx ParamDbl  -Inf   Inf     Inf            250        
10:                 fdev ParamDbl     0     1     Inf          1e-05        
11:                gamma ParamDbl  -Inf   Inf     Inf              1   relax
12:            intercept ParamLgl    NA    NA       2           TRUE        
13:               lambda ParamUty    NA    NA     Inf <NoDefault[3]>        
14:     lambda.min.ratio ParamDbl     0     1     Inf <NoDefault[3]>        
15:         lower.limits ParamUty    NA    NA     Inf <NoDefault[3]>        
16:                maxit ParamInt     1   Inf     Inf         100000        
17:                mnlam ParamInt     1   Inf     Inf              5        
18:                 mxit ParamInt     1   Inf     Inf            100        
19:               mxitnr ParamInt     1   Inf     Inf             25        
20:            newoffset ParamUty    NA    NA     Inf <NoDefault[3]>        
21:              nlambda ParamInt     1   Inf     Inf            100        
22:               offset ParamUty    NA    NA     Inf                       
23:       penalty.factor ParamUty    NA    NA     Inf <NoDefault[3]>        
24:                 pmax ParamInt     0   Inf     Inf <NoDefault[3]>        
25:                 pmin ParamDbl     0     1     Inf          1e-09        
26:                 prec ParamDbl  -Inf   Inf     Inf          1e-10        
27:                relax ParamLgl    NA    NA       2          FALSE        
28:                    s ParamDbl     0   Inf     Inf           0.01        
29:          standardize ParamLgl    NA    NA       2           TRUE        
30: standardize.response ParamLgl    NA    NA       2          FALSE        
31:               thresh ParamDbl     0   Inf     Inf          1e-07        
32:             trace.it ParamInt     0     1       2              0        
33:        type.gaussian ParamFct    NA    NA       2 <NoDefault[3]>        
34:        type.logistic ParamFct    NA    NA       2 <NoDefault[3]>        
35:     type.multinomial ParamFct    NA    NA       2 <NoDefault[3]>        
36:         upper.limits ParamUty    NA    NA     Inf <NoDefault[3]>        
                      id    class lower upper nlevels        default parents
    value
 1:      
 2:      
 3:      
 4:      
 5:      
 6:      
 7:      
 8:      
 9:      
10:      
11:      
12:      
13:      
14:      
15:      
16:      
17:      
18:      
19:      
20:      
21:      
22:      
23:      
24:      
25:      
26:      
27:      
28:      
29:      
30:      
31:      
32:      
33:      
34:      
35:      
36:      
    value

Define the hyperparameter space of the random forest:

library(paradox)

EN_pars = 
    paradox::ParamSet$new(
      list(paradox::ParamDbl$new("alpha", lower = 0, upper = 1L),
           paradox::ParamDbl$new("lambda", lower = 0, upper = 0.5 )) )
print(EN_pars)
<ParamSet>
       id    class lower upper nlevels        default value
1:  alpha ParamDbl     0   1.0     Inf <NoDefault[3]>      
2: lambda ParamDbl     0   0.5     Inf <NoDefault[3]>      

To set up the tuning pipeline we need:

  • Inner cross-validation resampling object.
  • Tuning criterion (e.g. AUC).
  • Tuning method (e.g. random or block search).
  • Tuning terminator (When should we stop tuning? E.g. after \(n\) iterations).
set.seed(123)

inner3 = mlr3::rsmp("cv", folds = 3L)
measurement =  msr("classif.auc")
tuner =  mlr3tuning::tnr("random_search") 
terminator = mlr3tuning::trm("evals", n_evals = 5L)
EN = lrn("classif.glmnet", predict_type = "prob")

learner_tuner = AutoTuner$new(learner = EN, 
                              measure = measurement, 
                              tuner = tuner, 
                              terminator = terminator,
                              search_space = EN_pars,
                              resampling = inner3)
print(learner_tuner)
<AutoTuner:classif.glmnet.tuned>
* Model: list
* Search Space:
<ParamSet>
       id    class lower upper nlevels        default value
1:  alpha ParamDbl     0   1.0     Inf <NoDefault[3]>      
2: lambda ParamDbl     0   0.5     Inf <NoDefault[3]>      
* Packages: mlr3, mlr3tuning, mlr3learners, glmnet
* Predict Type: prob
* Feature Types: logical, integer, numeric
* Properties: multiclass, twoclass, weights

Now we can wrap it normally into the 10-fold cross-validated setup as done previously:

# Calculate the average AUC of the holdouts.
result$aggregate(measurement)
classif.auc 
  0.6767554 

Let’s create the final predictions:

pred = sapply(1:3, function(i) result$learners[[i]]$predict(transformed_task,
row_ids = (1:nrow(data))[is.na(data$Hazardous)])$data$prob[, "1", drop = FALSE])
dim(pred)
predictions = apply(pred, 1, mean)

4.6 Exercises

Question: Use mlr3 for the titanic dataset
  1. Use mlr3 to tune glmnet for the titanic dataset using nested CV
  2. Submit single predictions and multiple predictions

If you need help, take a look at the solution, go through it line by line and try to understand it.

Prepare data

data = titanic_ml %>% select(-name, -ticket, -name, -body)
data$pclass = as.factor(data$pclass)
data$sex = as.factor(data$sex)
data$survived = as.factor(data$survived)

# Change easy things manually:
data$embarked[data$embarked == ""] = "S"  # Fill in "empty" values.
data$embarked = droplevels(as.factor(data$embarked)) # Remove unused levels ("").
data$cabin = (data$cabin != "") * 1 # Dummy code the availability of a cabin.
data$fare[is.na(data$fare)] = mean(data$fare, na.rm = TRUE)
levels(data$home.dest)[levels(data$home.dest) == ""] = "unknown"
levels(data$boat)[levels(data$boat) == ""] = "none"

# Create a classification task.
task = TaskClassif$new(id = "titanic", backend = data,
                       target = "survived", positive = "1")
task$missings()
 survived       age      boat     cabin  embarked      fare home.dest     parch 
      655       263         0         0         0         0         0         0 
   pclass       sex     sibsp 
        0         0         0 
# Let's create the preprocessing graph.
preprocessing = po("imputeoor") %>>% po("scale") %>>% po("encode") 

# Run the task.
transformed_task = preprocessing$train(task)[[1]]

transformed_task$set_row_roles((1:nrow(data))[is.na(data$survived)], "holdout")

Hyperparameter tuning:

cv10 = mlr3::rsmp("cv", folds = 10L)

inner3 = mlr3::rsmp("cv", folds = 3L)
measurement =  msr("classif.auc")
tuner =  mlr3tuning::tnr("random_search") 
terminator = mlr3tuning::trm("evals", n_evals = 5L)
EN = lrn("classif.glmnet", predict_type = "prob")
EN_pars = 
    paradox::ParamSet$new(
      list(paradox::ParamDbl$new("alpha", lower = 0, upper = 1L),
           paradox::ParamDbl$new("lambda", lower = 0, upper = 0.5 )) )

learner_tuner = AutoTuner$new(learner = EN, 
                              measure = measurement, 
                              tuner = tuner, 
                              terminator = terminator,
                              search_space = EN_pars,
                              resampling = inner3)


result = mlr3::resample(transformed_task, learner_tuner,
                        resampling = cv10, store_models = TRUE)

Evaluation:

measurement =  msr("classif.auc")
result$aggregate(measurement)
classif.auc 
  0.9924459 

Predictions:

We can extract a learner with optimized hyperparameters:

model = result$learners[[1]]$learner$clone()
model$param_set$values
$alpha
[1] 0.2353052

$lambda
[1] 0.03832512

And we can fit it then on the full data set:

model$train(transformed_task)
predictions = model$predict(transformed_task, row_ids = transformed_task$row_roles$holdout)
predictions = predictions$prob[,1]
head(predictions)
[1] 0.94428932 0.07481128 0.21730284 0.86504280 0.94653467 0.95414457

And submit to http://rhsbio7.uni-regensburg.de:8500

write.csv(data.frame(y = predictions), file = "glmnet.csv")