4  Machine learning pipeline

Before we specialize on any tuning, it is important to understand that machine learning always consists of a pipeline of actions, illustrated here on the Titanic dataset.

The typical machine learning workflow consists of:

Machine Learning pipeline

In the following example, we use tidyverse, a collection of R packages for data science / data manipulation mainly developed by Hadley Wickham.

The dplyr package is part of a framework called tidyverse. Unique features of the tidyverse are the pipe |> operator and tibble objects.

  • The |> operator:

    Applying several functions in sequence on an object often results in uncountable/confusing number of round brackets:

    library(tidyverse)
    ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ✔ dplyr     1.2.1     ✔ readr     2.2.0
    ✔ forcats   1.0.1     ✔ stringr   1.6.0
    ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
    ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
    ✔ purrr     1.2.2     
    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()
    ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    max(mean(range(c(5, 3, 2, 1))))
    [1] 3

    The pipe operator simplifies that by saying “apply the next function on the result of the current function”:

    c(5, 3, 2, 1) |> range() |> mean() |> max()
    [1] 3

    Which is easier to write, read, and to understand!

  • tibble objects are just an extension of data.frames. In the course we will use mostly data.frames, so it is better to transform the tibbles back to data.frames:

    air_grouped = airquality |> group_by(Month)
    
    class(air_grouped)
    [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
    air_grouped = as.data.frame(air_grouped)
    class(air_grouped)
    [1] "data.frame"

Another good reference is “R for data science” by Hadley Wickham: .

For this lecture you need the Titanic data set provided by us (via the EcoData package).

You can find it in GRIPS (datasets.RData in the data set and submission section) or at http://132.199.73.15:8500 (VPN for University of Regensburg is required!).

Motivation - We want a model to predict the survival probability of new passengers.

We have split the data set into training and an outer test/prediction data sets (the test/prediction split has one column less than the train split, as the response for the test/outer split is unknown).

The goal is to build a predictive model that can accurately predict the chances of survival for Titanic passengers!

The dataset:

library(tidyverse)
library(EcoData)
data(titanic_ml)
data = titanic_ml

The response variable:

unique(data$survived)
[1]  1  0 NA

0 = passenger died

1 = passenger survived

NA = we don’t have information about the passenger, at the end, we will make predictions for these passengers!

Important: Preprocessing of the data must be done for the training and testing data together!!

4.1 Data preparation

Load necessary libraries:

library(tidyverse)

Load data set:

library(EcoData)
data(titanic_ml)
data = titanic_ml

Standard summaries:

str(data)
'data.frame':   1309 obs. of  13 variables:
 $ pclass   : int  2 1 3 3 3 3 3 1 3 1 ...
 $ survived : int  1 1 0 0 0 0 0 1 0 1 ...
 $ name     : chr  "Sinkkonen, Miss. Anna" "Woolner, Mr. Hugh" "Sage, Mr. Douglas Bullen" "Palsson, Master. Paul Folke" ...
 $ sex      : Factor w/ 2 levels "female","male": 1 2 2 2 2 2 2 1 1 1 ...
 $ age      : num  30 NA NA 6 30.5 38.5 20 53 NA 42 ...
 $ sibsp    : int  0 0 8 3 0 0 0 0 0 0 ...
 $ parch    : int  0 0 2 1 0 0 0 0 0 0 ...
 $ ticket   : Factor w/ 929 levels "110152","110413",..: 221 123 779 542 589 873 472 823 588 834 ...
 $ fare     : num  13 35.5 69.55 21.07 8.05 ...
 $ cabin    : Factor w/ 187 levels "","A10","A11",..: 1 94 1 1 1 1 1 1 1 1 ...
 $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 2 4 2 ...
 $ body     : int  NA NA NA NA 50 32 NA NA NA NA ...
 $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 121 213 1 1 1 1 322 350 1 1 ...
summary(data)
     pclass         survived             name          sex     
 Min.   :1.000   Min.   :0.0000   Length   :1309   female:466  
 1st Qu.:2.000   1st Qu.:0.0000   N.unique :1307   male  :843  
 Median :3.000   Median :0.0000   N.blank  :   0               
 Mean   :2.295   Mean   :0.3853   Min.nchar:  12               
 3rd Qu.:3.000   3rd Qu.:1.0000   Max.nchar:  82               
 Max.   :3.000   Max.   :1.0000                                
                 NAs    :655                                   
      age              sibsp            parch            ticket    
 Min.   : 0.1667   Min.   :0.0000   Min.   :0.000   CA. 2343:  11  
 1st Qu.:21.0000   1st Qu.:0.0000   1st Qu.:0.000   1601    :   8  
 Median :28.0000   Median :0.0000   Median :0.000   CA 2144 :   8  
 Mean   :29.8811   Mean   :0.4989   Mean   :0.385   3101295 :   7  
 3rd Qu.:39.0000   3rd Qu.:1.0000   3rd Qu.:0.000   347077  :   7  
 Max.   :80.0000   Max.   :8.0000   Max.   :9.000   347082  :   7  
 NAs    :263                                        (Other) :1261  
      fare                     cabin      embarked      body      
 Min.   :  0.000                  :1014    :  2    Min.   :  1.0  
 1st Qu.:  7.896   C23 C25 C27    :   6   C:270    1st Qu.: 72.0  
 Median : 14.454   B57 B59 B63 B66:   5   Q:123    Median :155.0  
 Mean   : 33.295   G6             :   5   S:914    Mean   :160.8  
 3rd Qu.: 31.275   B96 B98        :   4            3rd Qu.:256.0  
 Max.   :512.329   C22 C26        :   4            Max.   :328.0  
 NAs    :1         (Other)        : 271            NAs    :1188   
                home.dest  
                     :564  
 New York, NY        : 64  
 London              : 14  
 Montreal, PQ        : 10  
 Cornwall / Akron, OH:  9  
 Paris, France       :  9  
 (Other)             :639  

The name variable consists of 1309 unique factors (there are 1309 observations…) and could be now transformed. If you are interested in how to do that, take a look at the following box.

length(unique(data$name))
[1] 1307

However, there is a title in each name. Let’s extract the titles:

  1. We will extract all names and split each name after each comma “,”.
  2. We will split the second split of the name after a point “.” and extract the titles.
first_split = sapply(data$name,
                     function(x) stringr::str_split(x, pattern = ",")[[1]][2])
titles = sapply(first_split,
                function(x) strsplit(x, ".",fixed = TRUE)[[1]][1])

We get 18 unique titles:

table(titles)
titles
         Capt           Col           Don          Dona            Dr 
            1             4             1             1             8 
     Jonkheer          Lady         Major        Master          Miss 
            1             1             2            61           260 
         Mlle           Mme            Mr           Mrs            Ms 
            2             1           757           197             2 
          Rev           Sir  the Countess 
            8             1             1 

A few titles have a very low occurrence rate:

titles = stringr::str_trim((titles))
titles |>
 fct_count()
# A tibble: 18 × 2
   f                n
   <fct>        <int>
 1 Capt             1
 2 Col              4
 3 Don              1
 4 Dona             1
 5 Dr               8
 6 Jonkheer         1
 7 Lady             1
 8 Major            2
 9 Master          61
10 Miss           260
11 Mlle             2
12 Mme              1
13 Mr             757
14 Mrs            197
15 Ms               2
16 Rev              8
17 Sir              1
18 the Countess     1

We will combine titles with low occurrences into one title, which we can easily do with the forcats package.

titles2 =
  forcats::fct_collapse(titles,
                        officer = c("Capt", "Col", "Major", "Dr", "Rev"),
                        royal = c("Jonkheer", "Don", "Sir",
                                  "the Countess", "Dona", "Lady"),
                        miss = c("Miss", "Mlle"),
                        mrs = c("Mrs", "Mme", "Ms")
                        )

We can count titles again to see the new number of titles:

titles2 |>  
   fct_count()
# A tibble: 6 × 2
  f           n
  <fct>   <int>
1 officer    23
2 royal       6
3 Master     61
4 miss      262
5 mrs       200
6 Mr        757

Add new title variable to data set:

data =
  data |>
    mutate(title = titles2)

4.1.1 Imputation

Missing values (NAs) are a common problem in ML, and most algorithms cannot handle them. We deal with this through imputation — filling in the missing values. For example, the age variable has 20% NAs:

summary(data$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
 0.1667 21.0000 28.0000 29.8811 39.0000 80.0000     263 
sum(is.na(data$age)) / nrow(data)
[1] 0.2009167

There are few options how to handle NAs:

  • Drop observations with NAs, however, we may lose many observations (not what we want!)

  • Imputation, fill the missing values

We impute (fill) the missing values, for example with the median age. However, age itself might depend on other variables such as sex, class and title. Thus, instead of filling the NAs with the overall median of the passengers, we want to fill the NAs with the median age of these groups so that the associations with the other groups are preserved (or in other words, that the new values are hopefully closer to the unknown true values).

In tidyverse we can “group” the data, i.e. we will nest the observations within categorical variables for which we assume that there may be an association with age (here: group_by after sex, pclass and title). After grouping, all operations (such as our median(age....)) will be done within the specified groups (to get better estimates of these missing NAs).

data =
  data |>
    select(survived, sex, age, fare, pclass) |> 
    group_by(sex, pclass) |>
    mutate(age2 = ifelse(is.na(age), median(age, na.rm = TRUE), age)) |>
    mutate(fare2 = ifelse(is.na(fare), median(fare, na.rm = TRUE), fare)) |>
    ungroup()

4.1.2 Preprocessing and Feature Selection

Later (tomorrow), we want to use Keras in our example, but it cannot handle factors and requires the data to be scaled.

Normally, one would do this for all predictors, but as we only show the pipeline here, we have sub-selected a bunch of predictors and do this only for them. We first scale the numeric predictors and change the factors with only two groups/levels into integers (this can be handled by Keras).

data_sub =
  data |>
    select(survived, sex, age2, fare2, pclass) |>
    mutate(age2 = scales::rescale(age2, c(0, 1)),
           fare2 = scales::rescale(fare2, c(0, 1))) |>
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))

Factors with more than two levels should be one hot encoded (Make columns for every different factor level and write 1 in the respective column for every taken feature value and 0 else. For example: \(\{red, green, green, blue, red\} \rightarrow \{(0,0,1), (0,1,0), (0,1,0), (1,0,0), (0,0,1)\}\)):

one_title = model.matrix(~0+as.factor(title), data = data)
colnames(one_title) = levels(data$title)

one_sex = model.matrix(~0+as.factor(sex), data = data)
colnames(one_sex) = levels(data$sex)

one_pclass = model.matrix(~0+as.factor(pclass), data = data)
colnames(one_pclass) = paste0("pclass", 1:length(unique(data$pclass)))

And we have to add the dummy encoded variables to the data set:

data = cbind(data.frame(survived= data$survived),
                 one_title, one_sex, age = data$age2,
                 fare = data$fare2, one_pclass)
head(data)

4.2 Modelling

4.2.1 Split data for final predictions

The concepts behind training/validation splits, cross-validation, and hyperparameter tuning were covered in Section 3.3; here we simply apply them inside the pipeline. To tune and evaluate our models, we split the data into training and testing data. The testing data are the observations where the response is NA:

summary(data_sub$survived)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
 0.0000  0.0000  0.0000  0.3853  1.0000  1.0000     655 

655 observations have NAs in our response variable, these are the observations for which we want to make predictions at the end of our pipeline.

data_new = data_sub[is.na(data_sub$survived),]
data_obs = data_sub[!is.na(data_sub$survived),]

4.3 Exercises

4.3.0.1 Plant-pollinator - Tune random forest

Hyperparameter Explanation
minimum node size Minimal number of observations allowed in a node (before the branching is canceled)

Using the plant-pollinator dataset (see Section A.3), we want to optimize nodesize in our RF using a simple CV. The task is to predict whether a plant and a pollinator interact, based on their traits.

Prepare the data:

library(EcoData)
library(dplyr)
library(missRanger)
data(plantPollinator_df)
data = plantPollinator_df
data =
  data |> select(interaction, diameter, corolla, tongue, body)

# data imputation without the response variable!
data[,-1] = missRanger(data[,-1], verbose = 0)

data$interaction = as.factor(data$interaction)

data_sub = data
data_new = data_sub[is.na(data_sub$interaction),] # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$interaction),] # data with known response

Hints:

  • adjust the ‘type’ argument in the predict(…) method (the default is to predict classes)
  • when predicting probabilities, the randomForest will return a matrix, a column for each class, we are interested in the probability of an interaction (so the second column)
  • tune nodesize and mtry
  • use more features and do feature engineering!
library(randomForest)
data_obs = data_sub[!is.na(data_sub$interaction),]
set.seed(42)

cv = 3
hyper_minnodesize = ... #
split = sample.int(cv, nrow(data_obs), replace = TRUE)

tuning_results =
    sapply(1:length(hyper_minnodesize), function(k) {
        auc_inner = NULL
        for(j in 1:cv) {
          inner_split = split == j
          train_inner = data_obs[!inner_split, ]
          test_inner = data_obs[inner_split, ]

          model = randomForest(interaction~.,data = train_inner, nodesize = hyper_minnodesize[k])
          predictions = predict(model, newdata=test_inner, type = "prob")[,2]

          auc_inner[j]= Metrics::auc(test_inner$interaction, predictions)
        }
      return(mean(auc_inner))
    })

results = data.frame(minnodesize = hyper_minnodesize, AUC = tuning_results)

print(results)
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':

    combine
The following object is masked from 'package:ggplot2':

    margin
data_obs = data_sub[!is.na(data_sub$interaction),]
set.seed(42)

cv = 3
hyper_minnodesize = sample(300, 20)
split = sample.int(cv, nrow(data_obs), replace = TRUE)

tuning_results =
    sapply(1:length(hyper_minnodesize), function(k) {
        auc_inner = NULL
        for(j in 1:cv) {
          inner_split = split == j
          train_inner = data_obs[!inner_split, ]
          test_inner = data_obs[inner_split, ]
          model = randomForest(interaction~.,data = train_inner,
                               nodesize = hyper_minnodesize[k])
          predictions = predict(model, test_inner, type="prob")[,2]

          auc_inner[j]= Metrics::auc(test_inner$interaction, predictions)
        }
      return(mean(auc_inner))
    })

results = data.frame(minnodesize = hyper_minnodesize, AUC = tuning_results)

print(results)
   minnodesize       AUC
1           49 0.6292065
2          153 0.6089005
3           74 0.6239616
4          228 0.6066147
5          146 0.6072871
6          122 0.6212592
7          300 0.5852408
8          128 0.6177919
9           24 0.6398260
10          89 0.6196561
11         165 0.6050228
12         110 0.6191791
13          20 0.6485062
14         291 0.5883411
15         283 0.5869634
16         109 0.6175374
17           5 0.6440715
18         212 0.6030497
19         259 0.5892946
20         292 0.5795048
# highest AUC / best hyperparameters
best_hyper = results[which.max(results$AUC),]
print(best_hyper)
   minnodesize       AUC
13          20 0.6485062

Make predictions for the submission server:

model = randomForest(interaction~.,data = data_obs,
                     nodesize = best_hyper[1,1])

write.csv(data.frame(y = predict(model, newdata=data_new, type="prob")[,2]), file = "Max_plant_poll_rf.csv")

4.3.0.2 Multi-species SDM with Random Forest

The multi-species SDM dataset requires some data tinkering. For example:

  • Think about oversampling the rare species
  • Aggregate human footprint variables and land cover variables

Modelling strategies:

  • random forest implementations have weights arguments that can be used to put higher weights on specific levels (e.g. the rare species?)
  • tune nodesize

The dataset is available for download on the submission server or here (for external participants)

Here is a minimal example of how to prepare the dataset, fit a simple model, and prepare a submission:

Prepare the data:

load("sdm.RData")

library(missRanger)
library(tidyverse)

sdm_long |> summary()

sdm_long[,-3] = sdm_long[,-3]  |>  mutate(across(everything(), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))

train = sdm_long[!is.na(sdm_long$present), ]
test = sdm_long[is.na(sdm_long$present), ]

# naive sub sampling of the training data due to computational constraints / we loose rare species?!!!
indices = sample.int(nrow(train), 5000)
train$present = as.factor(train$present) # crucial if you use the ranger package!

Fit the model

library(ranger)
set.seed(123)

rf = ranger(present ~ ., data = train[indices, -c(1, 4)], # without community and survey ID
            classification = TRUE, probability = TRUE, num.trees = 100L) # only 100 trees to speed up the predictions of the test dataset

# presence probability per community x species row:
preds = predict(rf, data = test)$predictions[, 2]

Prepare the submissions!

write.csv(data.frame(y = as.vector(preds)),
          file = "sdm.csv", row.names = FALSE)

4.3.0.3 Multi-species SDM with cito

Install the development version of cito via

devtools::install_github('citoverse/cito')

Minimal example

load("sdm.RData")

library(cito)


trainX = sdm_env$train
testX = sdm_env$test

trainX = missRanger::missRanger(trainX, num.trees = 50L)
testX = missRanger::missRanger(testX, num.trees = 50L)

trainXs = scale(trainX)[, 1:10]
testXs = scale(testX)[, 1:10]

colnames(trainXs) = stringr::str_remove(colnames(trainXs), "-")
colnames(trainXs) = stringr::str_remove(colnames(trainXs), "-")
colnames(testXs) = colnames(trainXs)

trainXs = trainXs[, - c(1)]
testXs = testXs[, - c(1)]

Fit the model

library(cito)
m = dnn(X = as.data.frame(trainXs), Y = Y_train, loss = "bernoulli", ce = FALSE, lr = 0.1, epochs = 200L )

probs = predict(m, newdata = testXs, type = "response") 

Prepare the submissions!

write.csv(data.frame(y = as.vector(probs)),
          file = "sdm.csv", row.names = FALSE)