14  Explainable AI

The goal of explainable AI (xAI, also known as interpretable machine learning) is to explain why a fitted machine learning model makes certain predictions. A typical example is to understand how important different variables are for predictions. The motivations for doing so range from gaining a better technical understanding of the models, to understanding which data are important for improving predictions, to questions of fairness and discrimination (e.g. understanding whether an algorithm uses skin color to make a decision).

14.1 A Practical Example

In this lecture we will work with an African Elephant occurrence dataset.

We will fit a random forest and use the iml package for xAI, see https://christophm.github.io/interpretable-ml-book/.

library(iml)
library(ranger) # a different random forest package!
library(EcoData)
set.seed(123)


data = EcoData::elephant$occurenceData
head(data)
      Presence       bio1       bio2       bio3       bio4        bio5
3364         0 -0.4981747 -0.2738045  0.5368968 -0.5409999 -0.36843571
6268         0  0.6085908 -0.5568352  1.0340686 -1.2492050 -0.11835651
10285        0 -0.7973005  1.4648130 -1.0540532  2.0759423  0.07614953
2247         0  0.6385034  1.3435141 -0.1591439 -0.5107148  1.10425291
9821         0  0.6684160 -0.6781341  0.6363311 -0.9906170  0.15950927
1351         0  0.9675418 -0.6781341 -0.3580126 -0.3748202  0.77081398
            bio6       bio7       bio8       bio9       bio10       bio11
3364   0.2947850 -0.5260099 -1.2253960  0.2494100 -0.64527314 -0.06267842
6268   0.8221087 -0.8938475  0.4233787  0.7746249  0.09168503  0.94419518
10285 -1.5860029  1.6284678  0.2768209 -1.5153122 -0.03648161 -1.44165748
2247  -0.1622288  0.8577603  0.4600181  0.5855475  0.54026827  0.68153250
9821   0.9099960 -0.8062671  0.3867393  0.8586593  0.31597665  0.94419518
1351   0.8748411 -0.3858812  0.3134604  1.0477367  0.98885151  0.94419518
           bio12      bio13       bio14        bio15      bio16      bio17
3364   0.6285371  0.6807958 -0.29703736 -0.008455252  0.7124535 -0.2949994
6268   1.1121516  0.5918442  0.01619202 -0.884507980  0.5607328  0.3506918
10285 -1.2351482 -1.3396742 -0.50585695  0.201797403 -1.3499999 -0.5616980
2247   0.5951165  0.8714061 -0.55806185  0.236839512  1.1012378 -0.5616980
9821   1.1003561  0.5537222  0.59044589 -1.024676416  0.6413344  0.7437213
1351   0.7287986  1.1255533 -0.50585695  0.236839512  1.2956300 -0.4494038
            bio18       bio19
3364  -1.06812752  1.96201807
6268   1.22589281 -0.36205814
10285 -0.42763181 -0.62895735
2247  -0.20541902 -0.58378979
9821   0.06254347 -0.05409751
1351  -0.90473576  2.47939193
?EcoData::elephant

Meaning of the bioclim variables:

Bioclim variable   Meaning
bio1               Annual Mean Temperature
bio2               Mean Diurnal Range (Mean of monthly (max temp - min temp))
bio3               Isothermality (BIO2/BIO7) (×100)
bio4               Temperature Seasonality (standard deviation ×100)
bio5               Max Temperature of Warmest Month
bio6               Min Temperature of Coldest Month
bio7               Temperature Annual Range (BIO5-BIO6)
bio8               Mean Temperature of Wettest Quarter
bio9               Mean Temperature of Driest Quarter
bio10              Mean Temperature of Warmest Quarter
bio11              Mean Temperature of Coldest Quarter
bio12              Annual Precipitation
bio13              Precipitation of Wettest Month
bio14              Precipitation of Driest Month
bio15              Precipitation Seasonality (Coefficient of Variation)
bio16              Precipitation of Wettest Quarter
bio17              Precipitation of Driest Quarter
bio18              Precipitation of Warmest Quarter
bio19              Precipitation of Coldest Quarter
We fit a random forest that predicts the occurrence probability:

rf = ranger(as.factor(Presence) ~ ., data = data, probability = TRUE)

xAI packages are written generically, i.e. they can handle almost all machine learning models. To use them, we first have to create a predictor object that holds the model and the data. The iml package uses R6 classes, which means that new objects are created by calling Predictor$new(). (Do not worry if you do not know what R6 classes are, just use the command.)

We often have to wrap our predict function inside a so-called wrapper function so that its output fits iml (iml expects the predict function to return a plain vector of predictions):

predict_wrapper = function(model, newdata) predict(model, data=newdata)$predictions[,2]

predictor = Predictor$new(rf, data = data[,-1], y = data[,1], predict.function = predict_wrapper)
predictor$task = "classif" # set task to classification
# "Predictor" is an object generator.

14.2 Feature Importance

Feature importance should not be confused with the random forest variable importance, though they are related. It tells us how important the individual variables are for predictions, can be calculated for all machine learning models, and is based on a permutation approach (have a look at the book):

imp = FeatureImp$new(predictor, loss = "ce")
plot(imp)

bio9 (Mean Temperature of Driest Quarter) is the most important variable.
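To build intuition for what FeatureImp computes, here is a minimal hand-rolled sketch of the permutation idea (assuming the rf and data objects from above; iml repeats the permutation several times and offers different loss functions):

# Classification error of the random forest on a data set:
err = function(d) mean(round(predict(rf, data = d)$predictions[,2]) != d$Presence)
baseline = err(data)
permuted = data
permuted$bio9 = sample(permuted$bio9) # destroy the information in bio9
err(permuted) - baseline # increase in error when bio9 is permuted

Note that we evaluate on the training data here, so the baseline error is optimistic; the logic, however, is the same.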

14.3 Partial Dependencies

Partial dependence plots are similar to allEffects plots for normal regressions. The idea is to visualize the “marginal effect” of a predictor (with the “feature” argument we specify the variable we want to visualize):

eff = FeatureEffect$new(predictor, feature = "bio9", method = "pdp",
                        grid.size = 30)
plot(eff)
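Conceptually, the partial dependence at a grid value is simply the average prediction when that value is imposed on all observations. A minimal hand-rolled sketch (assuming the rf and data objects from above):

grid = seq(min(data$bio9), max(data$bio9), length.out = 30)
pdp = sapply(grid, function(v) {
  d = data
  d$bio9 = v # impose the grid value on all observations
  mean(predict(rf, data = d)$predictions[,2])
})
plot(grid, pdp, type = "l", xlab = "bio9", ylab = "Average predicted probability")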

One disadvantage of partial dependence plots is that they are sensitive to correlated predictors. Accumulated local effects can be used to account for correlations between predictors.

14.4 Accumulated Local Effects

Accumulated local effects (ALE) are basically partial dependence plots that attempt to correct for correlations between predictors.

ale = FeatureEffect$new(predictor, feature = "bio9", method = "ale")
ale$plot()

If there is no collinearity, you shouldn’t see much difference between partial dependence and ALE plots.
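Whether PDP and ALE should be expected to differ can be judged by checking the correlations among the predictors, which tend to be substantial for bioclim variables (a quick sketch):

correlations = cor(data[,-1])
max(abs(correlations[upper.tri(correlations)])) # strongest pairwise correlation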

14.5 Friedman’s H-statistic

The H-statistic can be used to find interactions between predictors. However, again, keep in mind that the H-statistic is sensitive to correlations between predictors:

interact = Interaction$new(predictor, "bio9",grid.size = 5L)
plot(interact)
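If the feature argument is omitted, Interaction$new() computes the overall (one-versus-all) interaction strength for each predictor, which is a useful first screening step before looking at specific two-way interactions:

overall = Interaction$new(predictor, grid.size = 5L)
plot(overall)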

14.6 Global Explainer - Simplifying the Machine Learning Model

Another idea is to simplify the machine learning model with a simpler, interpretable model such as a decision tree. We create predictions with the machine learning model for many different input values and then fit a decision tree to these predictions. We can then interpret this simpler model.

library(partykit)

tree = TreeSurrogate$new(predictor, maxdepth = 2)
plot(tree$tree)
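A surrogate is only useful if it approximates the original model reasonably well; TreeSurrogate stores the R squared between the surrogate's predictions and those of the random forest, which we can use as a faithfulness check:

tree$r.squared # how much of the random forest's behavior the tree captures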

14.7 Local Explainer - LIME: Explaining Single Instances (Observations)

The global approach is to simplify the entire black-box machine learning model via a simpler model, which is then interpretable.

However, sometimes we are only interested in understanding how single predictions are generated. The LIME (Local Interpretable Model-agnostic Explanations) approach explores the feature space around one observation and, based on this, locally fits a simpler model (e.g. a linear model):

lime.explain = LocalModel$new(predictor, x.interest = data[1,-1])
lime.explain$results
             beta x.recoded       effect        x.original feature
bio9  -0.03972318 0.2494100 -0.009907356 0.249409955204759    bio9
bio16 -0.12035200 0.7124535 -0.085745198 0.712453479144842   bio16
                feature.value
bio9   bio9=0.249409955204759
bio16 bio16=0.712453479144842
plot(lime.explain)
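The same can be done for any other observation; the k argument of LocalModel$new() controls how many features the local model may use. A small sketch, assuming the objects from above:

lime2 = LocalModel$new(predictor, x.interest = data[5, -1], k = 3)
plot(lime2)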

14.8 Local Explainer - Shapley

The Shapley method computes the so-called Shapley value, i.e. feature contributions for single predictions, and is based on an approach from cooperative game theory. The idea is that each feature value of the instance is a “player” in a game, where the prediction is the reward. The Shapley value tells us how to fairly distribute the reward among the features.

shapley = Shapley$new(predictor, x.interest = data[1,-1])
shapley$plot()
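The individual contributions (phi) can also be inspected directly; they are estimated by sampling, and their sum approximates the difference between this instance's prediction and the average prediction:

shapley$results # one estimated phi value per feature
sum(shapley$results$phi)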

14.9 Exercises

Question

Use one of the non-image-based data sets (preferably Wine, which is described in the data sets section in Appendix A but hasn’t been used yet; you can also use Nasa or Titanic) and fit a random forest or a BRT using xgboost. Explore / interpret the fitted model using iml (see also the book: https://christophm.github.io/interpretable-ml-book/).

Tip:

If you use iml, you need to provide a proper prediction function wrapper:

# random Forest (ranger), regression:
predict_wrapper = function(model, newdata) predict(model, data=newdata)$predictions

# random Forest (ranger), classification:
predict_wrapper = function(model, newdata) predict(model, data=newdata)$predictions[,2]

# xgboost:
predict_wrapper = function(model, newdata) predict(model, as.matrix(newdata))
library(ranger)
library("iml")
set.seed(1234)

data = as.data.frame(EcoData::wine)
submission = data[which(is.na(data$quality)), -which(colnames(data) == "quality")]
data = data[complete.cases(data), ] # Removes submission data as well.

# Remark: Features don't need to be scaled for regression trees.

rf = ranger(quality ~ ., data = data, importance = "impurity")
pred = round(predict(rf, data)$predictions)
table(pred, data$quality)
    
pred   3   4   5   6   7   8
   4   2   6   0   0   0   0
   5   0   4 133   1   0   0
   6   0   0   3 112   9   0
   7   0   0   0   0  26   3
(accuracy = mean(pred == data$quality)) # Fits pretty well (on the training data...)
[1] 0.9264214
# For submission:
#write.csv(round(predict(rf, submission)), file = "wine_RF.csv")

# Standard depiction of importance
# (note: rf$importance would partial-match rf$importance.mode and just
# return "impurity"; the values are stored in rf$variable.importance):
sort(rf$variable.importance, decreasing = TRUE)
# Setup wrapper
predict_wrapper = function(model, newdata) predict(model, data=newdata)$predictions


# IML:
predictor = Predictor$new(
    rf, data = data[,which(names(data) != "quality")], y = data$quality,
    predict.function = predict_wrapper
    )

# Mind: This is stochastic!
importance = FeatureImp$new(predictor, loss = "mae")

plot(importance)

# Comparison between standard importance and IML importance:
importanceRf = names(rf$variable.importance)[order(rf$variable.importance, decreasing = TRUE)]
importanceIML = importance$results[1]
comparison = cbind(importanceIML, importanceRf)
colnames(comparison) = c("IML", "RF")
as.matrix(comparison)
      IML                    RF                    
 [1,] "alcohol"              "alcohol"             
 [2,] "sulphates"            "volatile.acidity"    
 [3,] "volatile.acidity"     "sulphates"           
 [4,] "citric.acid"          "density"             
 [5,] "total.sulfur.dioxide" "total.sulfur.dioxide"
 [6,] "density"              "citric.acid"         
 [7,] "fixed.acidity"        "fixed.acidity"       
 [8,] "pH"                   "chlorides"           
 [9,] "free.sulfur.dioxide"  "pH"                  
[10,] "chlorides"            "residual.sugar"      
[11,] "residual.sugar"       "free.sulfur.dioxide" 

Mind that feature importance and the random forest’s variable importance are related but not equal! Variable importance is a measure computed while creating the forest (i.e. during fitting). Feature importance measures how important a variable is for prediction.

You may want to look at other explanation methods as well; you can apply the other techniques of this section on your own.

library(xgboost)
library("iml")
set.seed(1234)

data = as.data.frame(EcoData::wine)
submission = data[which(is.na(data$quality)), -which(colnames(data) == "quality")]
data = data[complete.cases(data), ] # Removes submission data as well.


data_xg = xgb.DMatrix(
  data = as.matrix(data[,which(names(data) != "quality")]),
  label = data$quality
)
brt = xgboost(data_xg, nrounds = 24)
[1] train-rmse:3.656523 
[2] train-rmse:2.609494 
[3] train-rmse:1.884807 
[4] train-rmse:1.384918 
[5] train-rmse:1.037362 
[6] train-rmse:0.800110 
[7] train-rmse:0.629324 
[8] train-rmse:0.508917 
[9] train-rmse:0.426155 
[10]    train-rmse:0.369580 
[11]    train-rmse:0.313017 
[12]    train-rmse:0.274227 
[13]    train-rmse:0.236959 
[14]    train-rmse:0.207364 
[15]    train-rmse:0.195811 
[16]    train-rmse:0.182500 
[17]    train-rmse:0.173310 
[18]    train-rmse:0.154747 
[19]    train-rmse:0.144045 
[20]    train-rmse:0.139083 
[21]    train-rmse:0.129605 
[22]    train-rmse:0.118541 
[23]    train-rmse:0.110689 
[24]    train-rmse:0.097798 
pred = round(predict(brt, newdata = data_xg)) # Only integers are allowed.

table(pred, data$quality)
    
pred   3   4   5   6   7   8
   3   2   0   0   0   0   0
   4   0  10   0   0   0   0
   5   0   0 136   0   0   0
   6   0   0   0 113   1   0
   7   0   0   0   0  34   0
   8   0   0   0   0   0   3
(accuracy = mean(pred == data$quality)) # Fits pretty well (on the training data...)
[1] 0.9966555
# For submission:
#write.csv(round(predict(brt, as.matrix(submission))), file = "wine_BRT.csv")

# Standard depiction of importance:
xgboost::xgb.importance(model = brt)
                 Feature       Gain      Cover  Frequency
 1:              alcohol 0.28363854 0.13667721 0.07073509
 2:            sulphates 0.11331809 0.07015405 0.06657420
 3:        fixed.acidity 0.09844424 0.11359510 0.19278779
 4:     volatile.acidity 0.09582787 0.07098397 0.12760055
 5: total.sulfur.dioxide 0.09207959 0.09147259 0.07212205
 6:              density 0.07374571 0.14910006 0.08321775
 7:            chlorides 0.06025507 0.07972405 0.08876560
 8:       residual.sugar 0.05307941 0.07202137 0.08044383
 9:  free.sulfur.dioxide 0.04602735 0.04743503 0.06518724
10:                   pH 0.04477571 0.12562892 0.07489598
11:          citric.acid 0.03880842 0.04320764 0.07766990
# Setup wrapper
predict_wrapper = function(model, newdata) predict(model, as.matrix(newdata))


# IML:
predictor = Predictor$new(
    brt, data = data[,which(names(data) != "quality")], y = data$quality,
    predict.function = predict_wrapper
    )

# Mind: This is stochastic!
importance = FeatureImp$new(predictor, loss = "mae")

plot(importance)

Question

As we show in Section 13 of this chapter, a random forest will partition the importance of variables across collinear predictors, while a linear regression model (lm()) can identify which predictor causally affects the response (at least in theory, if all confounders are controlled). What about a boosted regression tree or an artificial neural network? Take the random forest example, add a boosted regression tree (easier; you can use e.g. https://rdrr.io/cran/xgboost/man/xgb.importance.html) or an artificial neural network, and see if they are better than the random forest at identifying causal predictors.

library(xgboost)
set.seed(1234)

data = as.data.frame(EcoData::wine)
submission = data[which(is.na(data$quality)), -which(colnames(data) == "quality")]
data = data[complete.cases(data), ] # Removes submission data as well.

data_xg = xgb.DMatrix(
  data = as.matrix(data[,which(names(data) != "quality")]),
  label = data$quality
)
brt = xgboost(data_xg, nrounds = 24)
[1] train-rmse:3.656523 
[2] train-rmse:2.609494 
[3] train-rmse:1.884807 
[4] train-rmse:1.384918 
[5] train-rmse:1.037362 
[6] train-rmse:0.800110 
[7] train-rmse:0.629324 
[8] train-rmse:0.508917 
[9] train-rmse:0.426155 
[10]    train-rmse:0.369580 
[11]    train-rmse:0.313017 
[12]    train-rmse:0.274227 
[13]    train-rmse:0.236959 
[14]    train-rmse:0.207364 
[15]    train-rmse:0.195811 
[16]    train-rmse:0.182500 
[17]    train-rmse:0.173310 
[18]    train-rmse:0.154747 
[19]    train-rmse:0.144045 
[20]    train-rmse:0.139083 
[21]    train-rmse:0.129605 
[22]    train-rmse:0.118541 
[23]    train-rmse:0.110689 
[24]    train-rmse:0.097798 
pred = round(predict(brt, newdata = data_xg)) # Only integers are allowed.
table(pred, data$quality)
    
pred   3   4   5   6   7   8
   3   2   0   0   0   0   0
   4   0  10   0   0   0   0
   5   0   0 136   0   0   0
   6   0   0   0 113   1   0
   7   0   0   0   0  34   0
   8   0   0   0   0   0   3
(accuracy = mean(pred == data$quality)) # Fits very well (on the training data...)
[1] 0.9966555
# For submission:
#write.csv(round(predict(brt, as.matrix(submission))), file = "wine_BRT.csv")

# Look at variable importance:
xgboost::xgb.importance(model = brt)
                 Feature       Gain      Cover  Frequency
 1:              alcohol 0.28363854 0.13667721 0.07073509
 2:            sulphates 0.11331809 0.07015405 0.06657420
 3:        fixed.acidity 0.09844424 0.11359510 0.19278779
 4:     volatile.acidity 0.09582787 0.07098397 0.12760055
 5: total.sulfur.dioxide 0.09207959 0.09147259 0.07212205
 6:              density 0.07374571 0.14910006 0.08321775
 7:            chlorides 0.06025507 0.07972405 0.08876560
 8:       residual.sugar 0.05307941 0.07202137 0.08044383
 9:  free.sulfur.dioxide 0.04602735 0.04743503 0.06518724
10:                   pH 0.04477571 0.12562892 0.07489598
11:          citric.acid 0.03880842 0.04320764 0.07766990

Every method yields slightly different results, but the main ingredient is alcohol (and sulphates).
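The exercise also asks whether a BRT partitions importance across collinear predictors the way a random forest does. A hedged toy simulation (not part of the original solution; all variables here are illustrative) makes this testable:

set.seed(42)
n = 1000
x1 = runif(n)
x2 = x1 + rnorm(n, sd = 0.05) # collinear proxy of x1
y = x1 + rnorm(n, sd = 0.3)   # only x1 causally affects y
sim_xg = xgb.DMatrix(data = cbind(x1 = x1, x2 = x2), label = y)
brt_sim = xgboost(sim_xg, nrounds = 24)
xgboost::xgb.importance(model = brt_sim) # is the importance split across x1 and x2?
summary(lm(y ~ x1 + x2))                 # lm should attribute the effect to x1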

Bonus Task

If you’re done with the previous tasks and still have time and appetite, improve the submissions for our competition, in particular for the Wine data set. Possible ideas:

  • Use the MLR framework (Section 4.5.1).

  • Try transfer learning (Section 10.4.2). The winner from last year used transfer learning to win the flower competition.

  • Search on Kaggle for more ideas / try to copy the ideas. This was the winner two years ago.

A minimal example for the (unbalanced!) Wine data set:

library(tensorflow)
library(keras)
set_random_seed(123L, disable_gpu = FALSE)  # Already sets R's random seed.

readin = function(percentageTest = 0.2, aggregate = 0){
    # The parameter "aggregate" packs the classes with very low abundance into one.
    # If "aggregate" is NA, NaN, NULL, 0 or FALSE, no aggregation is performed.
    # Otherwise, the given number is the boundary:
    # every class with fewer elements than the boundary is aggregated into one.
    
    # WARNING: These classes cannot be distinguished from then on!
    # Using the predictions for submission needs further processing!
    
    # Just for random selection of features, independent of the amount of function calls.
    set.seed(12345)
    
    train = as.data.frame(EcoData::wine)
    indicesTrain = which(!is.na(train$quality))
    labelsTrain = train$quality[indicesTrain]
    labelsTrain = labelsTrain - min(labelsTrain)  # Start at 0 (for softmax).
    train = train[, -which(colnames(train) == "quality")]
    
    if(!is.na(aggregate) & aggregate){
        indices = names(table(labelsTrain)[
            table(labelsTrain) < aggregate & table(labelsTrain) > 0
        ])
        if(length(indices)){
            labelsTrain[labelsTrain %in% indices] = -1
            # Recode to consecutive integers starting at 0 (for softmax).
            # Note: as.integer() on a factor returns the underlying codes
            # (1, 2, ...), not the level labels, so we subtract 1:
            labelsTrain = as.integer(as.factor(labelsTrain)) - 1L
        }
    }
    
    # Impute missing values (before any splitting, to get the highest power):
    train = missRanger::missRanger(
        data = train,
        maxiter = 10L,
        seed = 123,
        num.trees = 200L
    )
    
    # Separate submission data (mind scaling!):
    submission = scale(train[-indicesTrain,])
    train = scale(train[indicesTrain,])
    
    # Very asymmetric training data:
    cat(paste0("Size of training set: ", length(labelsTrain), "\n"))
    print(table(labelsTrain))
    
    if(percentageTest == 0){
      return(list(
        "labelsTrain" = labelsTrain,
        "labelsTest" = list(),
        "train" = train,
        "test" = list(),
        "submission" = submission
      ))
    }
    
    # Split into training and test set:
    len = nrow(train)
    indicesTest = sample(x = 1:len, size = percentageTest * len, replace = FALSE)
    test = as.data.frame(train[indicesTest,])
    labelsTest = labelsTrain[indicesTest]
    train = as.data.frame(train[-indicesTest,])
    labelsTrain = labelsTrain[-indicesTest]
    
    return(list(
        "labelsTrain" = labelsTrain,
        "labelsTest" = labelsTest,
        "train" = train,
        "test" = test,
        "submission" = submission
    ))
}

retVal = readin(aggregate = 0)
labelsTrain = retVal[["labelsTrain"]]
labelsTest = retVal[["labelsTest"]]
train = retVal[["train"]]
test = retVal[["test"]]
submission = retVal[["submission"]]
rm(retVal)

classNumber = length(table(labelsTrain))

model = keras_model_sequential()
model %>%
    layer_dense(units = 200L, activation = "leaky_relu",
    kernel_regularizer = regularizer_l2(0.00035),
    input_shape = ncol(train)) %>%
    layer_dropout(0.45) %>%
    layer_dense(units = 100L, activation = "relu",
    bias_regularizer = regularizer_l1_l2(0.5)) %>%
    layer_dropout(0.2) %>%
    layer_dense(units = 100L, activation = "leaky_relu",
    kernel_regularizer = regularizer_l2(0.00035),
    bias_regularizer = regularizer_l1_l2(0.1)) %>%
    layer_dropout(0.25) %>%
    layer_dense(units = 50L, activation = "gelu") %>%
    layer_dense(units = 25L, activation = "elu") %>%
    layer_dropout(0.35) %>%
    # We need probabilities. So we use the softmax function.
    # Remember, the labels MUST start at 0!
    layer_dense(units = classNumber, activation = "softmax")

model %>%
    keras::compile(loss = loss_categorical_crossentropy, # matches one-hot labels + softmax
                   optimizer = optimizer_adamax(learning_rate = 0.015))

model_history = 
    model %>% # Mind the matrix property (no data.frame)!
        fit(x = as.matrix(train), y = k_one_hot(labelsTrain, classNumber),
            epochs = 80L, batch_size = 12L, shuffle = TRUE)

plot(model_history)

# Accuracy on training set (!)
pred = predict(model, as.matrix(train)) %>% apply(1, which.max) - 1
Metrics::accuracy(pred, labelsTrain)
table(pred, labelsTrain)

# Accuracy on test set
pred = predict(model, as.matrix(test)) %>% apply(1, which.max) - 1
Metrics::accuracy(pred, labelsTest)
table(pred, labelsTest)

Recognize overfitting of your model selection strategy by changing the seed a few times (while keeping the model constant) and increasing the percentage of test data. Furthermore, consider fitting a random forest as well; it also achieves good quality predictions.
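Because the Wine data set is unbalanced, a useful reference point for any of these accuracies is the majority-class baseline; a quick sketch (assuming the labelsTest object from above):

max(table(labelsTest)) / length(labelsTest) # accuracy of always predicting the majority class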

For the final predictions, we use the whole data set without holdouts:

library(tensorflow)
library(keras)
set_random_seed(321L, disable_gpu = FALSE)  # Already sets R's random seed.

retVal = readin(percentageTest = 0, aggregate = 0)
labelsTrain = retVal[["labelsTrain"]]
labelsTest = retVal[["labelsTest"]]
train = retVal[["train"]]
test = retVal[["test"]]
submission = retVal[["submission"]]
rm(retVal)

classNumber = length(table(labelsTrain))

model = keras_model_sequential()
model %>%
    layer_dense(units = 200L, activation = "leaky_relu",
    kernel_regularizer = regularizer_l2(0.00035),
    input_shape = ncol(train)) %>%
    layer_dropout(0.45) %>%
    layer_dense(units = 100L, activation = "relu",
    bias_regularizer = regularizer_l1_l2(0.5)) %>%
    layer_dropout(0.2) %>%
    layer_dense(units = 100L, activation = "leaky_relu",
    kernel_regularizer = regularizer_l2(0.00035),
    bias_regularizer = regularizer_l1_l2(0.1)) %>%
    layer_dropout(0.25) %>%
    layer_dense(units = 50L, activation = "gelu") %>%
    layer_dense(units = 25L, activation = "elu") %>%
    layer_dropout(0.35) %>%
    # We need probabilities. So we use the softmax function.
    # Remember, the labels MUST start at 0!
    layer_dense(units = classNumber, activation = "softmax")

model %>%
    keras::compile(loss = loss_categorical_crossentropy, # matches one-hot labels + softmax
                   optimizer = optimizer_adamax(learning_rate = 0.015))

model_history = 
    model %>% # Mind the matrix property (no data.frame)!
        fit(x = as.matrix(train), y = k_one_hot(labelsTrain, classNumber),
            epochs = 80L, batch_size = 12L, shuffle = TRUE)

plot(model_history)

# Accuracy on training set (!)
pred = predict(model, as.matrix(train)) %>% apply(1, which.max) - 1
Metrics::accuracy(pred, labelsTrain)
table(pred, labelsTrain)

# Reverse the subtraction (labels were shifted to start at 0) and create the submission file.
write.csv(pred + min(as.data.frame(EcoData::wine)$quality, na.rm = TRUE),
          file = "wine_NN.csv")