4 Machine learning pipeline
The Standard Machine Learning Pipeline, Illustrated with the Titanic Data Set
Before we turn to any tuning, it is important to understand that machine learning always consists of a pipeline of actions.
The typical machine learning workflow consists of the following steps (a minimal code sketch follows the list):
- Data cleaning and exploration (EDA = exploratory data analysis), for example with tidyverse.
- Preprocessing and feature selection.
- Splitting data set into training and test set for evaluation.
- Model fitting.
- Model evaluation.
- New predictions.
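To make these steps concrete before we dive into the Titanic example, here is a minimal sketch on the built-in iris data, with a logistic regression as a stand-in for an arbitrary ML model (all variable names are illustrative and not part of the lecture code):
set.seed(42)
df = iris
df$versicolor = as.integer(df$Species == "versicolor")  # binary response
df$Species = NULL

train_idx = sample.int(nrow(df), round(0.7 * nrow(df))) # train/test split
train = df[train_idx, ]
test = df[-train_idx, ]

model = glm(versicolor ~ ., data = train, family = "binomial")  # model fitting
preds = predict(model, newdata = test, type = "response")       # new predictions
mean((preds > 0.5) == test$versicolor)                          # evaluation: accuracy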
Here is an (optional) video that explains the entire pipeline from a slightly different perspective.
In the following example, we use tidyverse, a collection of R packages for data science / data manipulation, mainly developed by Hadley Wickham. A video that explains the basics can be found here.
Another good reference is “R for Data Science” by Hadley Wickham.
For this lecture you need the Titanic data set provided by us. You can find it in GRIPS (datasets.RData in the data set and submission section) or at http://rhsbio6.uni-regensburg.de:8500.
We have already split the data set into training and test/prediction sets (the test/prediction split has one column fewer than the train split, as the response is not known a priori).
4.1 Data preparation
Load necessary libraries:
library(tidyverse)
Load data set:
library(EcoData)
data(titanic_ml)
data = titanic_ml
Standard summaries:
str(data)
'data.frame': 1309 obs. of 14 variables:
$ pclass : int 2 1 3 3 3 3 3 1 3 1 ...
$ survived : int 1 1 0 0 0 0 0 1 0 1 ...
$ name : chr "Sinkkonen, Miss. Anna" "Woolner, Mr. Hugh" "Sage, Mr. Douglas Bullen" "Palsson, Master. Paul Folke" ...
$ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 2 2 1 1 1 ...
$ age : num 30 NA NA 6 30.5 38.5 20 53 NA 42 ...
$ sibsp : int 0 0 8 3 0 0 0 0 0 0 ...
$ parch : int 0 0 2 1 0 0 0 0 0 0 ...
$ ticket : Factor w/ 929 levels "110152","110413",..: 221 123 779 542 589 873 472 823 588 834 ...
$ fare : num 13 35.5 69.55 21.07 8.05 ...
$ cabin : Factor w/ 187 levels "","A10","A11",..: 1 94 1 1 1 1 1 1 1 1 ...
$ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 2 4 2 ...
$ boat : Factor w/ 28 levels "","1","10","11",..: 3 28 1 1 1 1 1 19 1 15 ...
$ body : int NA NA NA NA 50 32 NA NA NA NA ...
$ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 121 213 1 1 1 1 322 350 1 1 ...
summary(data)
pclass survived name sex
Min. :1.000 Min. :0.0000 Length:1309 female:466
1st Qu.:2.000 1st Qu.:0.0000 Class :character male :843
Median :3.000 Median :0.0000 Mode :character
Mean :2.295 Mean :0.3853
3rd Qu.:3.000 3rd Qu.:1.0000
Max. :3.000 Max. :1.0000
NA's :655
age sibsp parch ticket
Min. : 0.1667 Min. :0.0000 Min. :0.000 CA. 2343: 11
1st Qu.:21.0000 1st Qu.:0.0000 1st Qu.:0.000 1601 : 8
Median :28.0000 Median :0.0000 Median :0.000 CA 2144 : 8
Mean :29.8811 Mean :0.4989 Mean :0.385 3101295 : 7
3rd Qu.:39.0000 3rd Qu.:1.0000 3rd Qu.:0.000 347077 : 7
Max. :80.0000 Max. :8.0000 Max. :9.000 347082 : 7
NA's :263 (Other) :1261
fare cabin embarked boat
Min. : 0.000 :1014 : 2 :823
1st Qu.: 7.896 C23 C25 C27 : 6 C:270 13 : 39
Median : 14.454 B57 B59 B63 B66: 5 Q:123 C : 38
Mean : 33.295 G6 : 5 S:914 15 : 37
3rd Qu.: 31.275 B96 B98 : 4 14 : 33
Max. :512.329 C22 C26 : 4 4 : 31
NA's :1 (Other) : 271 (Other):308
body home.dest
Min. : 1.0 :564
1st Qu.: 72.0 New York, NY : 64
Median :155.0 London : 14
Mean :160.8 Montreal, PQ : 10
3rd Qu.:256.0 Cornwall / Akron, OH: 9
Max. :328.0 Paris, France : 9
NA's :1188 (Other) :639
The name variable consists almost entirely of unique values (1307 unique names for 1309 observations) and could now be transformed. If you are interested in how to do that, take a look at the following box.
length(unique(data$name))
[1] 1307
However, there is a title in each name. Let’s extract the titles:
- We will extract all names and split each name at the comma “,”.
- We will then split the second part at the period “.” and extract the title.
first_split = sapply(data$name,
                     function(x) stringr::str_split(x, pattern = ",")[[1]][2])
titles = sapply(first_split,
                function(x) strsplit(x, ".", fixed = TRUE)[[1]][1])
We get 18 unique titles:
table(titles)
titles
Capt Col Don Dona Dr
1 4 1 1 8
Jonkheer Lady Major Master Miss
1 1 2 61 260
Mlle Mme Mr Mrs Ms
2 1 757 197 2
Rev Sir the Countess
8 1 1
A few titles have a very low occurrence rate:
titles = stringr::str_trim(titles)
titles %>%
  fct_count()
# A tibble: 18 × 2
f n
<fct> <int>
1 Capt 1
2 Col 4
3 Don 1
4 Dona 1
5 Dr 8
6 Jonkheer 1
7 Lady 1
8 Major 2
9 Master 61
10 Miss 260
11 Mlle 2
12 Mme 1
13 Mr 757
14 Mrs 197
15 Ms 2
16 Rev 8
17 Sir 1
18 the Countess 1
We will combine titles with low occurrences into one title, which we can easily do with the forcats package.
titles2 =
  forcats::fct_collapse(titles,
                        officer = c("Capt", "Col", "Major", "Dr", "Rev"),
                        royal = c("Jonkheer", "Don", "Sir",
                                  "the Countess", "Dona", "Lady"),
                        miss = c("Miss", "Mlle"),
                        mrs = c("Mrs", "Mme", "Ms"))
We can count titles again to see the new number of titles:
titles2 %>%
  fct_count()
# A tibble: 6 × 2
f n
<fct> <int>
1 officer 23
2 royal 6
3 Master 61
4 miss 262
5 mrs 200
6 Mr 757
Add new title variable to data set:
data =
  data %>%
  mutate(title = titles2)
4.1.1 Imputation
NAs are a common problem in ML. For example, the age variable has 20% NAs:
summary(data$age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.1667 21.0000 28.0000 29.8811 39.0000 80.0000 263
sum(is.na(data$age)) / nrow(data)
[1] 0.2009167
Either we remove all observations with NAs, or we impute (fill in) the missing values, e.g. with the median age. However, age itself might depend on other variables such as sex, class and title, so we want to fill the NAs with the median age of the corresponding groups. In tidyverse we can easily “group” the data, i.e. we nest the observations (here: group_by sex and pclass). After grouping, all operations (such as our median(age, na.rm = TRUE)) are carried out within the specified groups.
data =
  data %>%
  select(survived, sex, age, fare, pclass, title) %>%  # keep title for the one-hot encoding below
  group_by(sex, pclass) %>%
  mutate(age2 = ifelse(is.na(age), median(age, na.rm = TRUE), age)) %>%
  mutate(fare2 = ifelse(is.na(fare), median(fare, na.rm = TRUE), fare)) %>%
  ungroup()
4.1.2 Preprocessing and Feature Selection
Later (tomorrow), we want to use Keras in our example, but it cannot handle factors and requires the data to be scaled.
Normally, one would do this for all predictors, but as we only demonstrate the pipeline here, we have selected a subset of predictors and do this only for them. We first scale the numeric predictors and turn the factors with only two groups/levels into integers (these can be handled by Keras).
data_sub =
  data %>%
  select(survived, sex, age2, fare2, pclass) %>%
  mutate(age2 = scales::rescale(age2, c(0, 1)),
         fare2 = scales::rescale(fare2, c(0, 1))) %>%
  mutate(sex = as.integer(sex) - 1L,
         pclass = as.integer(pclass) - 1L)
Factors with more than two levels should be one-hot encoded (make a column for each factor level and write a 1 in the respective column for the level an observation takes, and 0 otherwise. For example: \(\{red, green, green, blue, red\} \rightarrow \{(0,0,1), (0,1,0), (0,1,0), (1,0,0), (0,0,1)\}\)):
one_title = model.matrix(~0+as.factor(title), data = data)
colnames(one_title) = levels(data$title)

one_sex = model.matrix(~0+as.factor(sex), data = data)
colnames(one_sex) = levels(data$sex)

one_pclass = model.matrix(~0+as.factor(pclass), data = data)
colnames(one_pclass) = paste0("pclass", 1:length(unique(data$pclass)))
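As a quick check of what model.matrix does, here is the color example from above (a hypothetical colors vector, encoded with the same function):
colors = factor(c("red", "green", "green", "blue", "red"))
model.matrix(~0 + colors)  # columns blue, green, red; e.g. red becomes (0, 0, 1)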
And we have to add the dummy encoded variables to the data set:
data = cbind(data.frame(survived = data$survived),
             age = data$age2,
             one_title, one_sex, fare = data$fare2, one_pclass)
head(data)
4.2 Modelling
4.2.1 Split data for final predictions
To tune our hyperparameters and evaluate our models, we need to split the data as we learned in the CV section. Before doing so, however, we must split off the new observations in the data set:
summary(data_sub$survived)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.0000 0.0000 0.3853 1.0000 1.0000 655
655 observations have NAs in our response variable; these are the observations for which we want to make predictions at the end of our pipeline (we have no information about their actual values!).
data_new = data_sub[is.na(data_sub$survived), ]
data_obs = data_sub[!is.na(data_sub$survived), ]
4.2.2 Training and evaluation
Now, we can do a simple 10-fold CV with the observed data:
library(glmnet)
library(glmnetUtils)

data_obs = data_sub[!is.na(data_sub$survived), ]
cv = 10

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = 1:cv,
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  model = glmnet(survived~., data = train_outer, family = "binomial", alpha = 0.2)
  results[i, 2] = Metrics::auc(test_outer$survived,
                               predict(model, test_outer,
                                       alpha = 0.2,
                                       s = 0.01,
                                       type = "response"))
}
print(results)
set AUC
1 1 0.8650794
2 2 0.8723197
3 3 0.7632114
4 4 0.7541463
5 5 0.8984221
6 6 0.7200000
7 7 0.7829268
8 8 0.8701737
9 9 0.8787879
10 10 0.7867647
print(mean(results$AUC))
[1] 0.8191832
4.2.3 Hyperparameter optimization
We did a simple 10-fold CV to evaluate our model, but we didn’t tune our hyperparameters (\(\lambda\) and \(\alpha\)). If we want to tune them, we need to do another CV within each split of the model evaluation CV, which is called nested CV.
We used only one split (the split for the submission server doesn’t count) to evaluate the performance of the model before we made the final predictions. If we test many different hyperparameter combinations, how do we ensure that a certain hyperparameter is not only good for our training dataset but also good for new data (our outer split on the submission server)? You may have guessed it already: we need to do another CV within the previous CV to check whether a certain hyperparameter solution generalizes to the whole data. To tune \(\lambda\), we would need to split the data one more time (hence the name nested CV).
Why is it important to tune hyperparameters? Hyperparameters (configuration parameters of our ML algorithms that (mostly) control their complexity) are usually tuned (optimized) in an automatic / systematic way. A common procedure, called random search, is to sample random configuration combinations from the set of hyperparameters and to test the prediction error for each combination.
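To make this concrete, here is a minimal sketch of random search on a single train/validation split (the split fraction and the number of sampled values are illustrative choices; the properly nested version follows below):
set.seed(1)
hyper_alpha = runif(10, 0, 1)  # sample random configurations
val_idx = sample.int(nrow(data_obs), round(0.2 * nrow(data_obs)))
train = data_obs[-val_idx, ]
val = data_obs[val_idx, ]
aucs = sapply(hyper_alpha, function(a) {
  model = glmnet(survived~., data = train, family = "binomial", alpha = a)
  Metrics::auc(val$survived, predict(model, val, s = 0.01, type = "response"))
})
hyper_alpha[which.max(aucs)]   # best configuration on the validation split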
Let’s implement a nested CV manually to tune \(\alpha\), using a 5x5 nested CV and 20 different alpha values:
library(glmnet)
library(glmnetUtils)
set.seed(42)

data_obs = data_sub[!is.na(data_sub$survived), ]
cv = 5
cv_inner = 5
hyper_alpha = runif(20, 0, 1)

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = rep(NA, cv),
  alpha = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]

  best_alpha = NULL
  best_auc = NULL

  # inner split
  for(j in 1:cv_inner) {
    inner_split = as.integer(cut(1:nrow(train_outer), breaks = cv_inner))
    train_inner = train_outer[inner_split != j, ]
    test_inner = train_outer[inner_split == j, ]

    tuning_results_inner =
      sapply(1:length(hyper_alpha), function(k) {
        model = glmnet(survived~., data = train_inner, family = "binomial",
                       alpha = hyper_alpha[k])
        return(Metrics::auc(test_inner$survived,
                            predict(model, test_inner,
                                    alpha = hyper_alpha[k],
                                    s = 0.01,
                                    type = "response")))
      })
    best_alpha[j] = hyper_alpha[which.max(tuning_results_inner)]
    best_auc[j] = max(tuning_results_inner)
  }
  best_alpha = best_alpha[which.max(best_auc)]
  model = glmnet(survived~., data = train_outer, alpha = best_alpha, family = "binomial")
  results[i, 1] = i
  results[i, 2] = best_alpha
  results[i, 3] = Metrics::auc(test_outer$survived,
                               predict(model, test_outer, s = 0.01,
                                       alpha = best_alpha, type = "response"))
}
print(results)
set alpha AUC
1 1 0.1346666 0.8720588
2 2 0.4577418 0.7554754
3 3 0.9148060 0.8178054
4 4 0.2861395 0.8238994
5 5 0.9148060 0.8156749
4.3 Predictions and Submission
When we are satisfied with the performance of our model, we can create predictions for the new observations and submit them to the submission server. But first, we train the model on the full observed dataset. \(\alpha = 0.915\) had the highest AUC in our tuning, so let’s use it:
model = glmnet(survived~., data = data_obs, family = "binomial", alpha = 0.915)
We cannot assess the performance for the new observations because the true survival ratio is unknown; however, we can now submit our predictions to the submission server at http://rhsbio7.uni-regensburg.de:8500.
For the submission it is critical to change the predictions into a data.frame, select the second column (the probability to survive), and save it with the write.csv function:
data_new = data_sub[is.na(data_sub$survived), ]
write.csv(data.frame(y = predict(model, data_new, alpha = 0.915,
                                 s = 0.01, type = "response")[,1]),
          file = "Max_1.csv")
We have used the \(\alpha\) value with the highest AUC here, but our tuning has shown that the best value of \(\alpha\) depends on the partitioning, so it would probably be better to build five models (one per outer fold) and combine their predictions (e.g., by averaging them):
prediction_ensemble =
  sapply(results$alpha, function(alpha) {
    model = glmnet(survived~., data = data_obs, family = "binomial", alpha = alpha)
    return(predict(model, data_new, alpha = alpha, s = 0.01, type = "response")[,1])
  })
write.csv(data.frame(y = apply(prediction_ensemble, 1, mean)), file = "Max_1.csv")
4.4 Exercises
4.5 Machine learning frameworks
As we have seen today, many machine learning algorithms are distributed over several packages, but the general machine learning pipeline is very similar for all models: feature engineering, feature selection, hyperparameter tuning and cross-validation.
Machine learning frameworks such as mlr3 or tidymodels provide a general interface for the ML pipeline, in particular the training and the hyperparameter tuning with nested CV. They support most ML packages/algorithms.
4.5.1 mlr3
The key features of mlr3 are:
- All common machine learning packages are integrated into mlr3; you can easily switch between different machine learning algorithms.
- A common ‘language’/workflow to specify machine learning pipelines.
- Support for different cross-validation strategies.
- Hyperparameter tuning for all supported machine learning algorithms.
- Ensemble models.
Useful links:
- mlr3-book (still a work in progress)
- mlr3 website
- mlr3 cheatsheet
4.5.1.1 mlr3 - The Basic Workflow
The mlr3 package actually consists of several packages for different tasks (e.g. mlr3tuning for hyperparameter tuning, mlr3pipelines for data preparation pipes). But let’s start with the basic workflow:
library(EcoData)
library(tidyverse)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3tuning)
library(mlr3measures)
data(nasa)
str(nasa)
'data.frame': 4687 obs. of 40 variables:
$ Neo.Reference.ID : int 3449084 3702322 3406893 NA 2363305 3017307 2438430 3653917 3519490 2066391 ...
$ Name : int NA 3702322 3406893 3082923 2363305 3017307 2438430 3653917 3519490 NA ...
$ Absolute.Magnitude : num 18.7 22.1 24.8 21.6 21.4 18.2 20 21 20.9 16.5 ...
$ Est.Dia.in.KM.min. : num 0.4837 0.1011 0.0291 0.1272 0.1395 ...
$ Est.Dia.in.KM.max. : num 1.0815 0.226 0.0652 0.2845 0.3119 ...
$ Est.Dia.in.M.min. : num 483.7 NA 29.1 127.2 139.5 ...
$ Est.Dia.in.M.max. : num 1081.5 226 65.2 284.5 311.9 ...
$ Est.Dia.in.Miles.min. : num 0.3005 0.0628 NA 0.0791 0.0867 ...
$ Est.Dia.in.Miles.max. : num 0.672 0.1404 0.0405 0.1768 0.1938 ...
$ Est.Dia.in.Feet.min. : num 1586.9 331.5 95.6 417.4 457.7 ...
$ Est.Dia.in.Feet.max. : num 3548 741 214 933 1023 ...
$ Close.Approach.Date : Factor w/ 777 levels "1995-01-01","1995-01-08",..: 511 712 472 239 273 145 428 694 87 732 ...
$ Epoch.Date.Close.Approach : num NA 1.42e+12 1.21e+12 1.00e+12 1.03e+12 ...
$ Relative.Velocity.km.per.sec: num 11.22 13.57 5.75 13.84 4.61 ...
$ Relative.Velocity.km.per.hr : num 40404 48867 20718 49821 16583 ...
$ Miles.per.hour : num 25105 30364 12873 30957 10304 ...
$ Miss.Dist..Astronomical. : num NA 0.0671 0.013 0.0583 0.0381 ...
$ Miss.Dist..lunar. : num 112.7 26.1 NA 22.7 14.8 ...
$ Miss.Dist..kilometers. : num 43348668 10030753 1949933 NA 5694558 ...
$ Miss.Dist..miles. : num 26935614 6232821 1211632 5418692 3538434 ...
$ Orbiting.Body : Factor w/ 1 level "Earth": 1 1 1 1 1 1 1 1 1 1 ...
$ Orbit.ID : int NA 8 12 12 91 NA 24 NA NA 212 ...
$ Orbit.Determination.Date : Factor w/ 2680 levels "2014-06-13 15:20:44",..: 69 NA 1377 1774 2275 2554 1919 731 1178 2520 ...
$ Orbit.Uncertainity : int 0 8 6 0 0 0 1 1 1 0 ...
$ Minimum.Orbit.Intersection : num NA 0.05594 0.00553 NA 0.0281 ...
$ Jupiter.Tisserand.Invariant : num 5.58 3.61 4.44 5.5 NA ...
$ Epoch.Osculation : num 2457800 2457010 NA 2458000 2458000 ...
$ Eccentricity : num 0.276 0.57 0.344 0.255 0.22 ...
$ Semi.Major.Axis : num 1.1 NA 1.52 1.11 1.24 ...
$ Inclination : num 20.06 4.39 5.44 23.9 3.5 ...
$ Asc.Node.Longitude : num 29.85 1.42 170.68 356.18 183.34 ...
$ Orbital.Period : num 419 1040 682 427 503 ...
$ Perihelion.Distance : num 0.794 0.864 0.994 0.828 0.965 ...
$ Perihelion.Arg : num 41.8 359.3 350 268.2 179.2 ...
$ Aphelion.Dist : num 1.4 3.15 2.04 1.39 1.51 ...
$ Perihelion.Time : num 2457736 2456941 2457937 NA 2458070 ...
$ Mean.Anomaly : num 55.1 NA NA 297.4 310.5 ...
$ Mean.Motion : num 0.859 0.346 0.528 0.843 0.716 ...
$ Equinox : Factor w/ 1 level "J2000": 1 1 NA 1 1 1 1 1 1 1 ...
$ Hazardous : int 0 0 0 1 1 0 0 0 1 1 ...
Let’s drop the time, name and ID variables and create a classification task:
data = nasa %>% select(-Orbit.Determination.Date,
                       -Close.Approach.Date, -Name, -Neo.Reference.ID)
data$Hazardous = as.factor(data$Hazardous)

# Create a classification task.
task = TaskClassif$new(id = "nasa", backend = data,
                       target = "Hazardous", positive = "1")
Create a generic pipeline of data transformation (imputation \(\rightarrow\) scaling \(\rightarrow\) encoding of categorical variables):
set.seed(123)

# Let's create the preprocessing graph.
preprocessing = po("imputeoor") %>>% po("scale") %>>% po("encode")

# Run the task.
transformed_task = preprocessing$train(task)[[1]]

transformed_task$missings()
Hazardous Absolute.Magnitude
4187 0
Aphelion.Dist Asc.Node.Longitude
0 0
Eccentricity Epoch.Date.Close.Approach
0 0
Epoch.Osculation Est.Dia.in.Feet.max.
0 0
Est.Dia.in.Feet.min. Est.Dia.in.KM.max.
0 0
Est.Dia.in.KM.min. Est.Dia.in.M.max.
0 0
Est.Dia.in.M.min. Est.Dia.in.Miles.max.
0 0
Est.Dia.in.Miles.min. Inclination
0 0
Jupiter.Tisserand.Invariant Mean.Anomaly
0 0
Mean.Motion Miles.per.hour
0 0
Minimum.Orbit.Intersection Miss.Dist..Astronomical.
0 0
Miss.Dist..kilometers. Miss.Dist..lunar.
0 0
Miss.Dist..miles. Orbit.ID
0 0
Orbit.Uncertainity Orbital.Period
0 0
Perihelion.Arg Perihelion.Distance
0 0
Perihelion.Time Relative.Velocity.km.per.hr
0 0
Relative.Velocity.km.per.sec Semi.Major.Axis
0 0
Equinox.J2000 Equinox..MISSING
0 0
Orbiting.Body.Earth Orbiting.Body..MISSING
0 0
We can even visualize the preprocessing graph:
preprocessing$plot()
To test our model (glmnet) with 10-fold cross-validation, we will:
- Specify the missing target rows as validation so that they will be ignored.
- Specify the cross-validation, the learner (the machine learning model we want to use), and the measurement (AUC).
- Run (benchmark) our model.
set.seed(123)
transformed_task$data()[1,]
Hazardous Absolute.Magnitude Aphelion.Dist Asc.Node.Longitude Eccentricity
1: 0 -0.8132265 -0.3804201 -1.140837 -0.315606
Epoch.Date.Close.Approach Epoch.Osculation Est.Dia.in.Feet.max.
1: -4.792988 0.1402677 0.2714179
Est.Dia.in.Feet.min. Est.Dia.in.KM.max. Est.Dia.in.KM.min. Est.Dia.in.M.max.
1: 0.3134076 0.3007134 0.2565687 0.2710953
Est.Dia.in.M.min. Est.Dia.in.Miles.max. Est.Dia.in.Miles.min. Inclination
1: 0.2916245 0.2620443 0.258651 0.5442288
Jupiter.Tisserand.Invariant Mean.Anomaly Mean.Motion Miles.per.hour
1: 0.3840868 -1.028761 0.3193953 -0.2541306
Minimum.Orbit.Intersection Miss.Dist..Astronomical. Miss.Dist..kilometers.
1: -5.459119 -7.076926 0.2512296
Miss.Dist..lunar. Miss.Dist..miles. Orbit.ID Orbit.Uncertainity
1: 0.2398625 0.2381077 -9.651472 -1.007087
Orbital.Period Perihelion.Arg Perihelion.Distance Perihelion.Time
1: -0.3013135 -1.170536 -0.01831583 0.1052611
Relative.Velocity.km.per.hr Relative.Velocity.km.per.sec Semi.Major.Axis
1: -0.2816782 -0.2841407 -0.2791037
Equinox.J2000 Equinox..MISSING Orbiting.Body.Earth Orbiting.Body..MISSING
1: 1 0 1 0
transformed_task$set_row_roles((1:nrow(data))[is.na(data$Hazardous)],
                               "holdout")

cv10 = mlr3::rsmp("cv", folds = 10L)
EN = lrn("classif.glmnet", predict_type = "prob")
measurement = msr("classif.auc")

result = mlr3::resample(transformed_task, EN,
                        resampling = cv10, store_models = TRUE)

# Calculate the average AUC of the holdouts.
result$aggregate(measurement)
Very cool! Preprocessing + 10-fold cross-validation model evaluation in a few lines of code!
Let’s create the final predictions:
pred = sapply(1:10, function(i)
  result$learners[[i]]$predict(transformed_task,
                               row_ids = (1:nrow(data))[is.na(data$Hazardous)])$data$prob[, "1", drop = FALSE])
dim(pred)
predictions = apply(pred, 1, mean)
You could now submit the predictions here.
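For example, analogous to the Titanic submission above (the file name is only a placeholder):
write.csv(data.frame(y = predictions), file = "Max_nasa.csv")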
But we are still not happy with the results, let’s do some hyperparameter tuning!
4.5.1.2 mlr3 - Hyperparameter Tuning
With mlr3, we can easily extend the above example to do hyperparameter tuning within nested cross-validation (the tuning has its own inner cross-validation).
Print the hyperparameter space of our glmnet learner:
EN$param_set
<ParamSet>
id class lower upper nlevels default parents
1: alpha ParamDbl 0 1 Inf 1
2: big ParamDbl -Inf Inf Inf 9.9e+35
3: devmax ParamDbl 0 1 Inf 0.999
4: dfmax ParamInt 0 Inf Inf <NoDefault[3]>
5: eps ParamDbl 0 1 Inf 1e-06
6: epsnr ParamDbl 0 1 Inf 1e-08
7: exact ParamLgl NA NA 2 FALSE
8: exclude ParamInt 1 Inf Inf <NoDefault[3]>
9: exmx ParamDbl -Inf Inf Inf 250
10: fdev ParamDbl 0 1 Inf 1e-05
11: gamma ParamDbl -Inf Inf Inf 1 relax
12: intercept ParamLgl NA NA 2 TRUE
13: lambda ParamUty NA NA Inf <NoDefault[3]>
14: lambda.min.ratio ParamDbl 0 1 Inf <NoDefault[3]>
15: lower.limits ParamUty NA NA Inf <NoDefault[3]>
16: maxit ParamInt 1 Inf Inf 100000
17: mnlam ParamInt 1 Inf Inf 5
18: mxit ParamInt 1 Inf Inf 100
19: mxitnr ParamInt 1 Inf Inf 25
20: newoffset ParamUty NA NA Inf <NoDefault[3]>
21: nlambda ParamInt 1 Inf Inf 100
22: offset ParamUty NA NA Inf
23: penalty.factor ParamUty NA NA Inf <NoDefault[3]>
24: pmax ParamInt 0 Inf Inf <NoDefault[3]>
25: pmin ParamDbl 0 1 Inf 1e-09
26: prec ParamDbl -Inf Inf Inf 1e-10
27: relax ParamLgl NA NA 2 FALSE
28: s ParamDbl 0 Inf Inf 0.01
29: standardize ParamLgl NA NA 2 TRUE
30: standardize.response ParamLgl NA NA 2 FALSE
31: thresh ParamDbl 0 Inf Inf 1e-07
32: trace.it ParamInt 0 1 2 0
33: type.gaussian ParamFct NA NA 2 <NoDefault[3]>
34: type.logistic ParamFct NA NA 2 <NoDefault[3]>
35: type.multinomial ParamFct NA NA 2 <NoDefault[3]>
36: upper.limits ParamUty NA NA Inf <NoDefault[3]>
id class lower upper nlevels default parents
Define the hyperparameter space of the elastic net:
library(paradox)

EN_pars =
  paradox::ParamSet$new(
    list(paradox::ParamDbl$new("alpha", lower = 0, upper = 1L),
         paradox::ParamDbl$new("lambda", lower = 0, upper = 0.5)))
print(EN_pars)
<ParamSet>
id class lower upper nlevels default value
1: alpha ParamDbl 0 1.0 Inf <NoDefault[3]>
2: lambda ParamDbl 0 0.5 Inf <NoDefault[3]>
To set up the tuning pipeline we need:
- Inner cross-validation resampling object.
- Tuning criterion (e.g. AUC).
- Tuning method (e.g. random or grid search).
- Tuning terminator (When should we stop tuning? E.g. after \(n\) iterations).
set.seed(123)

inner3 = mlr3::rsmp("cv", folds = 3L)
measurement = msr("classif.auc")
tuner = mlr3tuning::tnr("random_search")
terminator = mlr3tuning::trm("evals", n_evals = 5L)
EN = lrn("classif.glmnet", predict_type = "prob")

learner_tuner = AutoTuner$new(learner = EN,
                              measure = measurement,
                              tuner = tuner,
                              terminator = terminator,
                              search_space = EN_pars,
                              resampling = inner3)
print(learner_tuner)
<AutoTuner:classif.glmnet.tuned>
* Model: list
* Search Space:
<ParamSet>
id class lower upper nlevels default value
1: alpha ParamDbl 0 1.0 Inf <NoDefault[3]>
2: lambda ParamDbl 0 0.5 Inf <NoDefault[3]>
* Packages: mlr3, mlr3tuning, mlr3learners, glmnet
* Predict Type: prob
* Feature Types: logical, integer, numeric
* Properties: multiclass, twoclass, weights
Now we can wrap the tuned learner into an outer cross-validation, as done previously for the plain learner:
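A sketch of the resampling call, assuming a 3-fold outer CV (consistent with the three learners used for the predictions below) to keep the runtime manageable:
outer3 = mlr3::rsmp("cv", folds = 3L)
result = mlr3::resample(transformed_task, learner_tuner,
                        resampling = outer3, store_models = TRUE)  # tuning runs in the inner CV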
# Calculate the average AUC of the holdouts.
result$aggregate(measurement)
classif.auc
0.6767554
Let’s create the final predictions:
pred = sapply(1:3, function(i)
  result$learners[[i]]$predict(transformed_task,
                               row_ids = (1:nrow(data))[is.na(data$Hazardous)])$data$prob[, "1", drop = FALSE])
dim(pred)
predictions = apply(pred, 1, mean)