library(EcoData)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
set.seed(42)
= sample.int(nrow(birdabundance), 30)
indices = birdabundance[-indices,]
train = birdabundance[indices,]
test # ABUND is the response variable
12 Exercise
12.1 birdabundance dataset
Task:
Fit random forest on train data
Predict for test data
Calculate R2
Do the same with a lm and compare the predictive performance of both models
= randomForest(ABUND~., data = train)
rf = lm(ABUND~., data = train)
m
= predict(rf, newdata = test)
pred1 = predict(m, newdata = test)
pred2
cor(pred1, test$ABUND)**2
## [1] 0.6596678
cor(pred2, test$ABUND)**2
## [1] 0.1983452
RF clearly outperforms the linear regression model!
12.2 titantic dataset
library(EcoData)
library(randomForest)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(42)
= titanic %>% select(survived, age, pclass, sex, fare)
titanic_sub = titanic_sub[complete.cases(titanic_sub),]
titanic_sub
= sample.int(nrow(titanic_sub), 200)
indices = titanic_sub[-indices,]
train = titanic_sub[indices,] test
Task:
Fit random forest on train data
Predict for test data
Calculate Accuracy
Do the same with a glm (binomial) and compare the predictive performance of both models
What is the most important variable?
= randomForest(as.factor(survived)~., data = train)
rf = glm(survived~., data = train, family = binomial)
m
= predict(rf, newdata = test)
pred1 = predict(m, newdata = test, type = "response")
pred2
# pred2 are probabilities, we have to change them to levels
= ifelse(pred2 < 0.5, 0, 1)
pred2
mean(pred1 == test$survived) # RF
## [1] 0.82
mean(pred2 == test$survived) # glm
## [1] 0.765
RF is better than the glm!
varImpPlot(rf)
Sex is the most important variable!
12.3 Bias-variance tradeoff
An important concept of statistics and, in particular, ML is the concept of the bias-variance tradeoff - or in other words, finding the right complexity of the model. So how flexible should our model be so that it generalizes well to other/new observations. Many ML algorithms have complexity parameters (e.g. nodesize or mtry in RF) that control their complexity. Have a look at the following youtube video about the bias-variance tradeoff:
Let’s see how we can control the complexity in the Random Forest algorithm:
library(randomForest)
set.seed(123)
= airquality[complete.cases(airquality),]
data
= randomForest(Ozone~., data = data)
rf
= predict(rf, data)
pred importance(rf)
## IncNodePurity
## Solar.R 17969.59
## Wind 31978.36
## Temp 34176.71
## Month 10753.73
## Day 15436.47
#> IncNodePurity
#> Solar.R 17969.59
#> Wind 31978.36
#> Temp 34176.71
#> Month 10753.73
#> Day 15436.47
cat("RMSE: ", sqrt(mean((data$Ozone - pred)^2)), "\n")
## RMSE: 9.507848
#> RMSE: 9.507848
plot(data$Temp, data$Ozone)
lines(data$Temp[order(data$Temp)], pred[order(data$Temp)], col = "red")
Try different values for the nodesize and mtry and describe how the predictions depend on these parameters. (randomForest(..., nodesize = ..., mtry = ...)
(the exercise was taken from the ML course book)
library(randomForest)
set.seed(123)
= airquality[complete.cases(airquality),]
data
for(nodesize in c(1, 5, 15, 50, 100)){
for(mtry in c(1, 3, 5)){
= randomForest(Ozone~., data = data, mtry = mtry, nodesize = nodesize)
rf
= predict(rf, data)
pred
plot(data$Temp, data$Ozone, main = paste0(
"mtry: ", mtry, " nodesize: ", nodesize,
"\nRMSE: ", round(sqrt(mean((data$Ozone - pred)^2)), 2)
)
)lines(data$Temp[order(data$Temp)], pred[order(data$Temp)], col = "red")
} }
12.4 Deep Neural Networks
Task description: Predict the spatial distribution of the African elephant. In Ecology we call such model a species distribution model.
library(EcoData)
?elephant
The object elephant contains two subdatasets
elephant$occurenceData contains presence / absence data as well as bioclim variables (environmental predictors) for the African elephant
elephant$predictionData data with environmental predictors for spatial predictions
The environmental data consists of 19 environmental variables, called bio1 through bio19, which are public and globally available bioclimatic variables (see https://www.worldclim.org/data/bioclim.html for a description of the variables). For example, bio1 is the mean annual temperature. No understanding of these variables is required for the task, the only difficulty is that many of them are highly correlated because they encode similar information (e.g. there are several temperature variables).
The goal of this exercise is to fit a deep neural network based on the observed presence / absences, and then make new predictions of habitat suitability in space across Africa based on the fitted model. Thus, our workflow consists of two steps:
- building and optimizing the predictive model, and
- using the predictive model to make predictions on new data and visualizing the results.
Here an example of how you could do this
Build predictive model:
# Use subsample of data because too many observations and use the rest of data to validate our model
= sample.int(nrow(elephant$occurenceData), 500)
train_indices = elephant$occurenceData[train_indices, ]
df
library(cito)
= dnn(Presence~bio1,
model data = df,
loss = 'binomial',
verbose = FALSE)
plot(model)
To check the predictive power of the model for the observations we have not used to train the model ([-train_indices,]
)
library(pROC)
auc(df$Presence[-train_indices],
predict(model, newdata = df[-train_indices,],type = "response"))
## Warning in roc.default(response, predictor, auc = TRUE, ...): Deprecated use a
## matrix as predictor. Unexpected results may be produced, please pass a numeric
## vector.
## Area under the curve: 0.7117
The AUC is a common measure of goodness of fit for binary classification.
Drop some of the highly correlated variables (don’t use all of them).
Change architecture of the dnn (using the
hidden=c(...)
argument)Change the number of epochs and the learning rate (see documentation of
dnn
)
Make new predictions
The data for making spatial predictions is in elephant$predictionData. This new dataset is not a data.frame but a raster object, which is a special data class for spatial data. You can plot one of the predictors in the following way.
library(sp)
library(raster)
plot(elephant$predictionData$bio1)
As our new_data object is not a typical data.frame, we are not using the standard predict function for a dnn, which is ?predict.citodnn
, but the predict function from the raster object (which internally transforms the new_data into a classical data.frame, pass then the data.frame to our model, and then transforms the output back to a raster object). Therefore, the syntax is slightly different to how we previously used predict()
.
= predict(elephant$predictionData, model = model, type = "response")
predictions head(as.data.frame(predictions))
## layer
## 1 0.04989933
## 2 0.04631172
## 3 0.04300154
## 4 0.04146853
## 5 0.03866339
## 6 0.03732348
The advantage of the raster object is that we can directly use it to create a map (the raster object has coordinates for each observation):
spplot(predictions, colorkey = list(space = "left") )
Task: play around with the DNN to improve predictive accuracy. You can check predictive accuracy by looking the AUC of the test data. When improving the predictive power of the model, does the map change?
12.5 ML pipeline
If you want to know more about a typical ML pipeline/workflow, read this chapter from the ML course!