4  Machine learning pipeline

The Standard Machine Learning Pipeline using the Titanic Data set

Before we focus on any tuning, it is important to understand that machine learning always consists of a pipeline of actions.

The typical machine learning workflow consists of:

[Figure: The machine learning pipeline]

In the following example, we use tidyverse, a collection of R packages for data science / data manipulation mainly developed by Hadley Wickham.

The dplyr package is part of a framework called tidyverse. Unique features of the tidyverse are the pipe %>% operator and tibble objects.

  • The %>% operator:

    Applying several functions in sequence to an object often results in a confusing number of nested round brackets:

    library(tidyverse)
    ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ✔ dplyr     1.1.4     ✔ readr     2.1.5
    ✔ forcats   1.0.0     ✔ stringr   1.5.1
    ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
    ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
    ✔ purrr     1.0.4     
    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()
    ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    max(mean(range(c(5, 3, 2, 1))))
    [1] 3

    The pipe operator simplifies that by saying “apply the next function on the result of the current function”:

    c(5, 3, 2, 1) %>% range %>% mean %>% max
    [1] 3

    This is easier to write, read, and understand! (A more complete example combining the pipe with dplyr functions follows after this list.)

  • tibble objects are just an extension of data.frames. In this course we will mostly use data.frames, so it is better to transform the tibbles back into data.frames:

    air_grouped = airquality %>% group_by(Month)
    
    class(air_grouped)
    [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
    air_grouped = as.data.frame(air_grouped)
    class(air_grouped)
    [1] "data.frame"

Another good reference is “R for Data Science” by Hadley Wickham.

For this lecture you need the Titanic data set provided by us (via the EcoData package).

You can find it in GRIPS (datasets.RData in the data set and submission section) or at http://rhsbio7.uni-regensburg.de:8500 (VPN for University of Regensburg is required!).

Motivation - We want a model to predict the survival probability of new passengers.

We have split the data set into a training set and an outer test/prediction set (the test/prediction split has one column less than the training split, because the response for the test/outer split is unknown).

The goal is to build a predictive model that can accurately predict the chances of survival for Titanic passengers!

The dataset:

library(tidyverse)
library(EcoData)
data(titanic_ml)
data = titanic_ml

The response variable:

unique(data$survived)
[1]  1  0 NA

0 = passenger died

1 = passenger survived

NA = we don’t have information about the survival of this passenger; these are the passengers for which we will make predictions at the end!

Important: Preprocessing of the data must be done for the training and testing data together!!
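A minimal sketch of why this matters (the ages below are made up for illustration): if the two splits are scaled separately, the same raw value is mapped to different scaled values.

train_age = c(20, 30, 40, 25)
test_age  = c(25, 60)

# scaled separately, age 25 gets a different value in each split:
scales::rescale(train_age, c(0, 1))
scales::rescale(test_age, c(0, 1))

# scaled together, both splits share the same minimum and maximum:
scales::rescale(c(train_age, test_age), c(0, 1))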

4.1 Data preparation

Load necessary libraries:

library(tidyverse)

Load data set:

library(EcoData)
data(titanic_ml)
data = titanic_ml

Standard summaries:

str(data)
'data.frame':   1309 obs. of  13 variables:
 $ pclass   : int  2 1 3 3 3 3 3 1 3 1 ...
 $ survived : int  1 1 0 0 0 0 0 1 0 1 ...
 $ name     : chr  "Sinkkonen, Miss. Anna" "Woolner, Mr. Hugh" "Sage, Mr. Douglas Bullen" "Palsson, Master. Paul Folke" ...
 $ sex      : Factor w/ 2 levels "female","male": 1 2 2 2 2 2 2 1 1 1 ...
 $ age      : num  30 NA NA 6 30.5 38.5 20 53 NA 42 ...
 $ sibsp    : int  0 0 8 3 0 0 0 0 0 0 ...
 $ parch    : int  0 0 2 1 0 0 0 0 0 0 ...
 $ ticket   : Factor w/ 929 levels "110152","110413",..: 221 123 779 542 589 873 472 823 588 834 ...
 $ fare     : num  13 35.5 69.55 21.07 8.05 ...
 $ cabin    : Factor w/ 187 levels "","A10","A11",..: 1 94 1 1 1 1 1 1 1 1 ...
 $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 2 4 2 ...
 $ body     : int  NA NA NA NA 50 32 NA NA NA NA ...
 $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 121 213 1 1 1 1 322 350 1 1 ...
summary(data)
     pclass         survived          name               sex     
 Min.   :1.000   Min.   :0.0000   Length:1309        female:466  
 1st Qu.:2.000   1st Qu.:0.0000   Class :character   male  :843  
 Median :3.000   Median :0.0000   Mode  :character               
 Mean   :2.295   Mean   :0.3853                                  
 3rd Qu.:3.000   3rd Qu.:1.0000                                  
 Max.   :3.000   Max.   :1.0000                                  
                 NA's   :655                                     
      age              sibsp            parch            ticket    
 Min.   : 0.1667   Min.   :0.0000   Min.   :0.000   CA. 2343:  11  
 1st Qu.:21.0000   1st Qu.:0.0000   1st Qu.:0.000   1601    :   8  
 Median :28.0000   Median :0.0000   Median :0.000   CA 2144 :   8  
 Mean   :29.8811   Mean   :0.4989   Mean   :0.385   3101295 :   7  
 3rd Qu.:39.0000   3rd Qu.:1.0000   3rd Qu.:0.000   347077  :   7  
 Max.   :80.0000   Max.   :8.0000   Max.   :9.000   347082  :   7  
 NA's   :263                                        (Other) :1261  
      fare                     cabin      embarked      body      
 Min.   :  0.000                  :1014    :  2    Min.   :  1.0  
 1st Qu.:  7.896   C23 C25 C27    :   6   C:270    1st Qu.: 72.0  
 Median : 14.454   B57 B59 B63 B66:   5   Q:123    Median :155.0  
 Mean   : 33.295   G6             :   5   S:914    Mean   :160.8  
 3rd Qu.: 31.275   B96 B98        :   4            3rd Qu.:256.0  
 Max.   :512.329   C22 C26        :   4            Max.   :328.0  
 NA's   :1         (Other)        : 271            NA's   :1188   
                home.dest  
                     :564  
 New York, NY        : 64  
 London              : 14  
 Montreal, PQ        : 10  
 Cornwall / Akron, OH:  9  
 Paris, France       :  9  
 (Other)             :639  

The name variable has almost as many unique values as there are observations (1307 unique names among 1309 observations), so it cannot be used directly as a predictor, but it can be transformed into something useful. If you are interested in how to do that, take a look at the following box.

length(unique(data$name))
[1] 1307

However, there is a title in each name. Let’s extract the titles:

  1. We split each name at the comma “,” and keep the second part (everything after the surname).
  2. We split this second part at the point “.” and keep the first piece, which is the title.
first_split = sapply(data$name,
                     function(x) stringr::str_split(x, pattern = ",")[[1]][2])
titles = sapply(first_split,
                function(x) strsplit(x, ".",fixed = TRUE)[[1]][1])

We get 18 unique titles:

table(titles)
titles
         Capt           Col           Don          Dona            Dr 
            1             4             1             1             8 
     Jonkheer          Lady         Major        Master          Miss 
            1             1             2            61           260 
         Mlle           Mme            Mr           Mrs            Ms 
            2             1           757           197             2 
          Rev           Sir  the Countess 
            8             1             1 

A few titles have a very low occurrence rate:

titles = stringr::str_trim(titles)
titles %>%
 fct_count()
# A tibble: 18 × 2
   f                n
   <fct>        <int>
 1 Capt             1
 2 Col              4
 3 Don              1
 4 Dona             1
 5 Dr               8
 6 Jonkheer         1
 7 Lady             1
 8 Major            2
 9 Master          61
10 Miss           260
11 Mlle             2
12 Mme              1
13 Mr             757
14 Mrs            197
15 Ms               2
16 Rev              8
17 Sir              1
18 the Countess     1

We will combine titles with low occurrences into broader categories, which we can easily do with the forcats package.

titles2 =
  forcats::fct_collapse(titles,
                        officer = c("Capt", "Col", "Major", "Dr", "Rev"),
                        royal = c("Jonkheer", "Don", "Sir",
                                  "the Countess", "Dona", "Lady"),
                        miss = c("Miss", "Mlle"),
                        mrs = c("Mrs", "Mme", "Ms")
                        )

We can count titles again to see the new number of titles:

titles2 %>%  
   fct_count()
# A tibble: 6 × 2
  f           n
  <fct>   <int>
1 officer    23
2 royal       6
3 Master     61
4 miss      262
5 mrs       200
6 Mr        757

Add new title variable to data set:

data =
  data %>%
    mutate(title = titles2)

4.1.1 Imputation

NAs are a common problem in ML and most ML algorithms cannot handle NAs. For example, the age variable has 20% NAs:

summary(data$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.1667 21.0000 28.0000 29.8811 39.0000 80.0000     263 
sum(is.na(data$age)) / nrow(data)
[1] 0.2009167

There are a few options for handling NAs:

  • Drop observations with NAs; however, we may lose many observations (not what we want!)

  • Imputation: fill in the missing values

We impute (fill) the missing values, for example with the median age. However, age itself might depend on other variables such as sex, class, and title. Thus, instead of filling the NAs with the overall median over all passengers, we want to fill the NAs with the median age of these groups, so that the associations with the other variables are preserved (or, in other words, so that the new values are hopefully closer to the unknown true values).

In tidyverse we can “group” the data, i.e. we nest the observations within categorical variables for which we assume an association with age (here: group_by sex and pclass). After grouping, all operations (such as our median(age, ...)) are carried out within the specified groups (to get better estimates for the missing values).

data =
  data %>%
    select(survived, sex, age, fare, pclass) %>% 
    group_by(sex, pclass) %>%
    mutate(age2 = ifelse(is.na(age), median(age, na.rm = TRUE), age)) %>%
    mutate(fare2 = ifelse(is.na(fare), median(fare, na.rm = TRUE), fare)) %>%
    ungroup()

4.1.2 Preprocessing and Feature Selection

Later (tomorrow), we want to use Keras in our example, but it cannot handle factors and requires the data to be scaled.

Normally, one would do this for all predictors, but as we only demonstrate the pipeline here, we have sub-selected a few predictors and do this only for them. We first scale the numeric predictors and convert the factors with only two groups/levels into integers (this can be handled by Keras).

data_sub =
  data %>%
    select(survived, sex, age2, fare2, pclass) %>%
    mutate(age2 = scales::rescale(age2, c(0, 1)),
           fare2 = scales::rescale(fare2, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))

Factors with more than two levels should be one-hot encoded (create one column per factor level and, for every observation, write 1 in the column of its level and 0 in all other columns. For example: \(\{red, green, green, blue, red\} \rightarrow \{(0,0,1), (0,1,0), (0,1,0), (1,0,0), (0,0,1)\}\)).
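As a minimal illustration, the color example can be reproduced with model.matrix (the variable name colors is only used for this demonstration; the columns appear in alphabetical level order, i.e. blue, green, red):

colors = factor(c("red", "green", "green", "blue", "red"))
model.matrix(~ 0 + colors)   # one column per level, 1 marks the observed level

Applied to the Titanic data, the title, sex and pclass variables can be encoded in the same way: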

# Note: this assumes that 'title' is still a column of 'data', i.e. that the
# column selection in the imputation step above has not yet dropped it.
one_title = model.matrix(~0+as.factor(title), data = data)
colnames(one_title) = levels(data$title)

one_sex = model.matrix(~0+as.factor(sex), data = data)
colnames(one_sex) = levels(data$sex)

one_pclass = model.matrix(~0+as.factor(pclass), data = data)
colnames(one_pclass) = paste0("pclass", 1:length(unique(data$pclass)))

And we have to add the dummy encoded variables to the data set:

data = cbind(data.frame(survived= data$survived),
                 one_title, one_sex, age = data$age2,
                 fare = data$fare2, one_pclass)
head(data)

4.2 Modelling

4.2.1 Split data for final predictions

To tune our hyperparameters and evaluate our models, we split the data into the training and testing data. The testing data are the observations where the response is NA:

summary(data_sub$survived)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.3853  1.0000  1.0000     655 

655 observations have an NA in the response variable; these are the observations for which we want to make predictions at the end of our pipeline.

data_new = data_sub[is.na(data_sub$survived),]
data_obs = data_sub[!is.na(data_sub$survived),]

4.2.2 Hyperparameter optimization

We want to tune our hyperparameters (here: the learning rate). Normally, we would do a nested CV on our training data (data_obs); however, we assume that the test data on the submission server is our outer split, so we can tune our hyperparameters using an n-fold cross-validation, which serves as our inner CV.

Again, why is it important to tune hyperparameters? Hyperparameters (configuration parameters of our ML algorithms that (mostly) control their complexity) are usually tuned (optimized) in an automatic / systematic way. A common procedure, called random search, is to sample random hyperparameter combinations from plausible ranges and to estimate, for each combination, the prediction error (e.g. with a cross-validation).
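To make the idea concrete, here is a minimal, self-contained sketch of a random search combined with a 3-fold CV, shown on the built-in iris data with a random forest (the hyperparameter ranges and the use of accuracy are arbitrary choices for this illustration); in the next code block, cito automates this kind of loop for us:

library(randomForest)
set.seed(42)

k = 3
folds = sample(rep(1:k, length.out = nrow(iris)))   # random fold assignment

# random search: draw 10 random combinations of two hyperparameters
hyper = data.frame(mtry = sample(1:4, 10, replace = TRUE),
                   nodesize = sample(1:30, 10, replace = TRUE))

cv_accuracy = sapply(1:nrow(hyper), function(i) {
  mean(sapply(1:k, function(j) {
    fit = randomForest(Species ~ ., data = iris[folds != j, ],
                       mtry = hyper$mtry[i], nodesize = hyper$nodesize[i])
    pred = predict(fit, newdata = iris[folds == j, ])
    mean(pred == iris$Species[folds == j])           # accuracy on the held-out fold
  }))
})

hyper[which.max(cv_accuracy), ]   # best combination found by the random search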

We tune the learning rate using a 3-fold CV over 10 different learning rates; cito's built-in tuning (the tune() and config_tuning() arguments below) takes care of the cross-validation for us:

library(cito)
set.seed(42)
model = dnn(survived~.,
            data = data_obs, 
            loss = "binomial",
            lr = tune(0.001, 0.1),
            tuning = config_tuning(CV = 3, steps = 10)
            )
Registered S3 methods overwritten by 'reformulas':
  method       from
  head.call    cito
  head.formula cito
  head.name    cito
Starting hyperparameter tuning...
Fitting final model...
model$tuning
# A tibble: 10 × 5
   steps  test train models      lr
   <int> <dbl> <dbl> <lgl>    <dbl>
 1     1  329.     0 NA     0.0394 
 2     2  319.     0 NA     0.0623 
 3     3  325.     0 NA     0.0627 
 4     4  326.     0 NA     0.0804 
 5     5  326.     0 NA     0.0658 
 6     6  315.     0 NA     0.0683 
 7     7  321.     0 NA     0.0417 
 8     8  317.     0 NA     0.0667 
 9     9  346.     0 NA     0.00368
10    10  333.     0 NA     0.0818 

4.3 Predictions and Submission

When we are satisfied with the performance of our model, we will create predictions for the new observations on the submission server. cito directly returns the final model refitted with the best hyperparameters, so we do not have to fit it ourselves.

We submit our predictions to the submission server at http://rhsbio7.uni-regensburg.de:8500.

For the submission it is critical to turn the predictions into a data.frame containing the survival probabilities and to save it with the write.csv function:

data_new = data_sub[is.na(data_sub$survived),]
predictions = predict(model, data_new, type = "response")[,1] 
write.csv(data.frame(y = predictions), file = "Max_1.csv")

4.4 Exercises

Question: Hyperparameter tuning rf
  • mtry: Subset of features randomly selected at each node (from which the algorithm can select the feature that will be used to split the data).
  • minimum node size: Minimal number of observations allowed in a node (below this, no further splits are made).
  • max depth: Maximum depth of a tree.

Coming back to the titanic dataset from the morning, we want to optimize nodesize, max depth, and mtry in our RF using a simple CV.

Prepare the data:

library(EcoData)
library(dplyr)
library(missRanger)
data(titanic_ml)
data = titanic_ml
data = 
  data |> select(survived, sex, age, fare, pclass)

# data imputation without the response variable!
data[,-1] = missRanger(data[,-1], verbose = 0) 

data$survived = as.factor(data$survived)

data_sub =
  data |> 
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) |> 
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$survived),] # data with known response
data_sub$survived = as.factor(data_sub$survived)
data_obs$survived = as.factor(data_obs$survived)

Hints:

  • adjust the ‘type’ argument in the predict(…) method (the default is to predict classes)
  • when predicting probabilities, randomForest returns a matrix with one column per class; we are interested in the probability of surviving (so the second column)
  • tune nodesize, mtry, and maxnodes
  • use more features and do feature engineering!
library(randomForest)
data_obs = data_sub[!is.na(data_sub$survived),] 
set.seed(42)

cv = 3
hyper_minnodesize = ... # 

tuning_results =
    sapply(1:length(hyper_minnodesize), function(k) {
        auc_inner = NULL
        for(j in 1:cv) {
          inner_split = as.integer(cut(1:nrow(data_obs), breaks = cv))
          train_inner = data_obs[inner_split != j, ]
          test_inner = data_obs[inner_split == j, ]
          
          model = randomForest(survived~.,data = train_inner, nodesize = hyper_minnodesize[k])
          predictions = predict(model, newdata=test_inner, type = "prob")[,2]
          
          auc_inner[j]= Metrics::auc(test_inner$survived, predictions)
        }
      return(mean(auc_inner))
    })

results = data.frame(minnodesize = hyper_minnodesize, AUC = tuning_results)

print(results)
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':

    combine
The following object is masked from 'package:ggplot2':

    margin
data_obs = data_sub[!is.na(data_sub$survived),] 
set.seed(42)

cv = 3
hyper_minnodesize = sample(300, 20)
hyper_mtry = sample(4, 20, replace = TRUE)

tuning_results =
    sapply(1:length(hyper_minnodesize), function(k) {
        auc_inner = NULL
        for(j in 1:cv) {
          inner_split = as.integer(cut(1:nrow(data_obs), breaks = cv))
          train_inner = data_obs[inner_split != j, ]
          test_inner = data_obs[inner_split == j, ]
          model = randomForest(survived~.,data = train_inner, 
                               nodesize = hyper_minnodesize[k], 
                               mtry = hyper_mtry[k])
          predictions = predict(model, test_inner, type="prob")[,2]
          
          auc_inner[j]= Metrics::auc(test_inner$survived, predictions)
        }
      return(mean(auc_inner))
    })

results = data.frame(minnodesize = hyper_minnodesize, mtry = hyper_mtry, AUC = tuning_results)

print(results)
   minnodesize mtry       AUC
1           49    2 0.8207934
2          153    3 0.8054506
3           74    3 0.8169019
4          228    2 0.8051004
5          146    2 0.8082379
6          122    4 0.7948290
7          300    4 0.7615291
8          128    4 0.7905482
9           24    2 0.8294439
10          89    2 0.8175564
11         165    1 0.7951680
12         110    4 0.7898386
13          20    1 0.8266383
14         291    4 0.7444037
15         283    2 0.8017500
16         109    4 0.8004647
17           5    2 0.8319759
18         212    3 0.7996817
19         259    4 0.7444037
20         292    3 0.7970839
# highest AUC / best hyperparameters
best_hyper = results[which.max(results$AUC),]
print(best_hyper)
   minnodesize mtry       AUC
17           5    2 0.8319759

Make predictions for the submission server:

model = randomForest(survived~.,data = data_obs, 
                     nodesize = best_hyper[1,1], 
                     mtry = best_hyper[1,2])

write.csv(data.frame(y = predict(model, newdata=data_new, type="prob")[,2]), file = "Max_titanic_rf.csv")