6  Distance-based Algorithms

In this chapter, we introduce support-vector machines (SVMs) and other distance-based methods. Hint: Distance-based models need scaling!

6.1 K-Nearest-Neighbor

K-nearest-neighbor (kNN) is a simple algorithm that stores all the available cases and classifies the new data based on a similarity measure. It is mostly used to classify a data point based on how its \(k\) nearest neighbors are classified.

Let us first see an example:

x = scale(iris[,1:4])
y = iris[,5]
plot(x[-100,1], x[-100, 3], col = y[-100])
points(x[100,1], x[100, 3], col = "blue", pch = 18, cex = 1.3)

Which class would you assign to the blue point? What are the classes of its nearest points? This is exactly the procedure used by the k-nearest-neighbors classifier, and thus there is actually no “real” learning in a k-nearest-neighbors classification.
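To make this concrete, here is a minimal sketch of what the classifier does for the blue point (observation 100), using only the two plotted traits. The choice of k = 5 and the helper names dists and nearest are made up for this illustration.

k = 5
dists = sqrt((x[-100, 1] - x[100, 1])^2 + (x[-100, 3] - x[100, 3])^2)
nearest = order(dists)[1:k]   # Indices (within x[-100,]) of the k closest points.
table(y[-100][nearest])       # Majority vote over their classes.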

Before applying a k-nearest-neighbors classification, we first have to scale the data set, because we deal with distances and want all predictors to have the same influence. Imagine one variable has values from -10,000 to 10,000 and another from -1 to 1. Then the first variable influences the distances to other points much more strongly than the second.
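To see the effect, consider a small, hypothetical example with two variables on very different scales (the names a and b are made up for this illustration):

a = c(-10000, 0, 10000)
b = c(-1, 1, -1)
dist(cbind(a, b))                 # Distances are dominated by a.
dist(cbind(scale(a), scale(b)))   # After scaling, both variables contribute comparably.

For the iris data set, we have to split the data into training and test sets ourselves. Then we follow the usual pipeline.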

data = iris
data[,1:4] = apply(data[,1:4],2, scale)
indices = sample.int(nrow(data), 0.7*nrow(data))
train = data[indices,]
test = data[-indices,]

Fit model and create predictions:

library(kknn)
set.seed(123)

knn = kknn(Species~., train = train, test = test)
summary(knn)

Call:
kknn(formula = Species ~ ., train = train, test = test)

Response: "nominal"
          fit prob.setosa prob.versicolor prob.virginica
1      setosa           1      0.00000000     0.00000000
2      setosa           1      0.00000000     0.00000000
3      setosa           1      0.00000000     0.00000000
4      setosa           1      0.00000000     0.00000000
5      setosa           1      0.00000000     0.00000000
6      setosa           1      0.00000000     0.00000000
7      setosa           1      0.00000000     0.00000000
8      setosa           1      0.00000000     0.00000000
9      setosa           1      0.00000000     0.00000000
10     setosa           1      0.00000000     0.00000000
11     setosa           1      0.00000000     0.00000000
12     setosa           1      0.00000000     0.00000000
13     setosa           1      0.00000000     0.00000000
14     setosa           1      0.00000000     0.00000000
15 versicolor           0      1.00000000     0.00000000
16 versicolor           0      1.00000000     0.00000000
17 versicolor           0      1.00000000     0.00000000
18 versicolor           0      1.00000000     0.00000000
19 versicolor           0      1.00000000     0.00000000
20 versicolor           0      1.00000000     0.00000000
21  virginica           0      0.27541743     0.72458257
22 versicolor           0      1.00000000     0.00000000
23  virginica           0      0.06450608     0.93549392
24 versicolor           0      1.00000000     0.00000000
25 versicolor           0      1.00000000     0.00000000
26 versicolor           0      0.98430840     0.01569160
27 versicolor           0      1.00000000     0.00000000
28 versicolor           0      1.00000000     0.00000000
29 versicolor           0      1.00000000     0.00000000
30 versicolor           0      0.82711887     0.17288113
31  virginica           0      0.04881448     0.95118552
32  virginica           0      0.00000000     1.00000000
33 versicolor           0      0.93549392     0.06450608
34  virginica           0      0.00000000     1.00000000
35  virginica           0      0.00000000     1.00000000
36  virginica           0      0.00000000     1.00000000
37  virginica           0      0.00000000     1.00000000
38 versicolor           0      0.65252004     0.34747996
39  virginica           0      0.12578435     0.87421565
40  virginica           0      0.01569160     0.98430840
41  virginica           0      0.00000000     1.00000000
42  virginica           0      0.00000000     1.00000000
43  virginica           0      0.00000000     1.00000000
44  virginica           0      0.00000000     1.00000000
45  virginica           0      0.08512700     0.91487300
table(test$Species, fitted(knn))
            
             setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         13         2
  virginica       0          3        13
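We can also compute the accuracy on the test set directly (a small addition for later comparison with the SVM):

mean(fitted(knn) == test$Species)   # Proportion of correctly classified test observations.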

6.2 Support Vector Machines (SVMs)

Support-vector machines take a different approach. They try to divide the predictor space into sectors, one for each class. To do so, a support-vector machine fits the parameters of a hyperplane (an \((n-1)\)-dimensional subspace of an \(n\)-dimensional space) in the predictor space by maximizing the distance (margin) between the hyperplane and the nearest points from each class.

Fitting a support-vector machine:

library(e1071)

data = iris
data[,1:4] = apply(data[,1:4], 2, scale)
indices = sample.int(nrow(data), 0.7*nrow(data))
train = data[indices,]
test = data[-indices,]

sm = svm(Species~., data = train, kernel = "linear")
pred = predict(sm, newdata = test)
oldpar = par(mfrow = c(1, 2))
plot(test$Sepal.Length, test$Petal.Length,
     col =  pred, main = "predicted")
plot(test$Sepal.Length, test$Petal.Length,
     col =  test$Species, main = "observed")

par(oldpar)

mean(pred == test$Species) # Accuracy.
[1] 0.9777778
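For a linear kernel and a binary problem, the fitted hyperplane can be recovered from the model object. Here is a sketch for a two-species subset; the names binary, sm2, w and b are introduced only for this illustration, and note that e1071::svm scales the predictors internally by default, so w and b refer to that internal scale.

binary = droplevels(subset(train, Species != "virginica"))
sm2 = svm(Species~., data = binary, kernel = "linear")
w = t(sm2$coefs) %*% sm2$SV   # Normal vector of the separating hyperplane.
b = -sm2$rho                  # Intercept, so that the boundary is w %*% x + b = 0.
w
b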

A support-vector machine with a linear kernel only works well on (approximately) linearly separable problems. (A problem is called linearly separable if there exists at least one hyperplane with all points of one class on one side and all points of the other classes on the other side.)

If this is not the case, however, we can use the so-called kernel trick, which maps the predictors into a (higher-dimensional) space in which the problem is linearly separable. After the boundaries have been identified in the higher-dimensional space, they can be projected back into the original dimensions.
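To build intuition for what such a mapping does, here is a sketch with an explicit feature map; the simulated data and the names d2, d3 and r2 are made up for this illustration. Points inside a circle are not linearly separable in the original two dimensions, but adding the feature \(r^2 = x_1^2 + x_2^2\) makes them separable by a plane.

set.seed(1)
x1 = runif(200, -1, 1)
x2 = runif(200, -1, 1)
class = factor(ifelse(x1^2 + x2^2 < 0.5, "inside", "outside"))

d2 = data.frame(x1, x2, class)                     # Original 2-D predictor space.
d3 = data.frame(x1, x2, r2 = x1^2 + x2^2, class)   # Explicitly mapped 3-D space.

m2 = e1071::svm(class~., data = d2, kernel = "linear")
m3 = e1071::svm(class~., data = d3, kernel = "linear")
mean(predict(m2, newdata = d2) == class)   # Poor: not linearly separable in 2-D.
mean(predict(m3, newdata = d3) == class)   # (Nearly) perfect in the mapped space.

In practice, the kernel trick computes such mappings implicitly. The following example compares different kernels on a similar, radially symmetric problem: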

x1 = seq(-3, 3, length.out = 100)
x2 = seq(-3, 3, length.out = 100)
X = expand.grid(x1, x2)
y = apply(X, 1, function(t) exp(-t[1]^2 - t[2]^2))
y = ifelse(1/(1+exp(-y)) < 0.62, 0, 1)

image(matrix(y, 100, 100))
animation::saveGIF(
  {
    for(i in c("truth", "linear", "radial", "sigmoid")){
      if(i == "truth"){
        image(matrix(y, 100,100),
        main = "Ground truth", axes = FALSE, las = 2)
      }else{
        sv = e1071::svm(x = X, y = factor(y), kernel = i)
        image(matrix(as.numeric(as.character(predict(sv, X))), 100, 100),
        main = paste0("Kernel: ", i), axes = FALSE, las = 2)
        axis(1, at = seq(0,1, length.out = 10),
        labels = round(seq(-3, 3, length.out = 10), 1))
        axis(2, at = seq(0,1, length.out = 10),
        labels = round(seq(-3, 3, length.out = 10), 1), las = 2)
      }
    }
  },
  movie.name = "svm.gif", autobrowse = FALSE, interval = 2
)

As you have seen, not every kernel can separate this problem. Hence, the task is to find a suitable kernel and its parameters, which is itself an optimization problem and is usually approached via hyperparameter tuning.
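A common way to do this is a cross-validated grid search over kernels and their parameters. Here is a sketch using e1071::tune.svm on the iris training data from above; the parameter ranges are arbitrary choices for illustration:

tuned = e1071::tune.svm(Species~., data = train, kernel = "radial",
                        gamma = 10^(-2:1), cost = 10^(-1:2))
tuned$best.parameters   # Best gamma/cost combination found by 10-fold cross-validation.
tuned$best.performance  # Corresponding cross-validated error.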

6.3 Clustering methods

In unsupervised learning, we want to identify patterns in data without having any examples (supervision) about what the correct patterns / classes are. As an example, consider the iris data set. Here, we have 150 observations of 4 floral traits:

iris = datasets::iris
colors = hcl.colors(3)
traits = as.matrix(iris[,1:4])
species = iris$Species
image(y = 1:4, x = 1:length(species) , z = traits,
      ylab = "Floral trait", xlab = "Individual")
segments(50.5, 0, 50.5, 5, col = "black", lwd = 2)
segments(100.5, 0, 100.5, 5, col = "black", lwd = 2)

Trait distributions of iris dataset

The observations are from 3 species and indeed those species tend to have different traits, meaning that the observations form 3 clusters.

pairs(traits, pch = as.integer(species), col = colors[as.integer(species)])

Scatterplots for trait-trait combinations.

However, imagine we didn’t know what the species are, which is basically the situation people in antiquity were in: they simply noticed that some plants have different flowers than others and decided to give them different names. This kind of process is what unsupervised learning does.

6.3.1 Hierarchical Clustering

A cluster refers to a collection of data points aggregated together because of certain similarities.

In hierarchical clustering, a hierarchy (tree) between data points is built.

  • Agglomerative: Start with each data point in its own cluster and merge clusters hierarchically (bottom-up).
  • Divisive: Start with all data points in one cluster and split it hierarchically (top-down).

Merges / splits are done according to a linkage criterion, which measures the distance between (potential) clusters. Cutting the tree at a certain height yields the clusters.
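As a small illustration of a linkage criterion, consider “complete” linkage (the method used in the example below), which measures the distance between two candidate clusters as the largest pairwise distance between their members. The toy data and the names clusterA, clusterB and d_all are made up here.

clusterA = matrix(c(0, 0,
                    0, 1), ncol = 2, byrow = TRUE)
clusterB = matrix(c(5, 5,
                    6, 5), ncol = 2, byrow = TRUE)
d_all = as.matrix(dist(rbind(clusterA, clusterB)))
max(d_all[1:2, 3:4])   # Complete linkage distance between clusterA and clusterB.
min(d_all[1:2, 3:4])   # For comparison: "single" linkage uses the smallest distance.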

Here is an example of hierarchical clustering on the iris traits:

set.seed(123)

#Reminder: traits = as.matrix(iris[,1:4]).

d = dist(traits)
hc = hclust(d, method = "complete")

plot(hc, main="")
rect.hclust(hc, k = 3)  # Draw rectangles around the branches.

Results of hierarchical clustering. Red rectangles are drawn around the corresponding clusters.

Same plot, but with colors for true species identity

library(ape)

plot(as.phylo(hc),
     tip.color = colors[as.integer(species)],
     direction = "downwards")

Results of hierarchical clustering. Colors correspond to the three species classes.
hcRes3 = cutree(hc, k = 3)   #Cut a dendrogram tree into groups.

Calculate the confusion matrix. Note that we are switching labels here so that they match the species order.

tmp = hcRes3
tmp[hcRes3 == 2] = 3
tmp[hcRes3 == 3] = 2
hcRes3 = tmp
table(hcRes3, species)
Confusion matrix for predicted and observed species classes.
      species
hcRes3 setosa versicolor virginica
     1     50          0         0
     2      0         27         1
     3      0         23        49

Note that results might change if you choose a different agglomeration method, distance metric or scale of your variables. Compare, e.g. to this example:

hc = hclust(d, method = "ward.D2")

plot(as.phylo(hc),
     tip.color = colors[as.integer(species)],
     direction = "downwards")

Results of hierarchical clustering with a different agglomeration method (ward.D2). Colors correspond to the three species classes.
hcRes3 = cutree(hc, k = 3)   #Cut a dendrogram tree into groups.
table(hcRes3, species)
Confusion matrix for predicted and observed species classes.
      species
hcRes3 setosa versicolor virginica
     1     50          0         0
     2      0         49        15
     3      0          1        35

Which method is best?

library(dendextend)
set.seed(123)

methods = c("ward.D", "single", "complete", "average",
             "mcquitty", "median", "centroid", "ward.D2")
out = dendlist()   # Create a dendlist object from several dendrograms.
for(method in methods){
  res = hclust(d, method = method)
  out = dendlist(out, as.dendrogram(res))
}
names(out) = methods
print(out)
$ward.D
'dendrogram' with 2 branches and 150 members total, at height 199.6205 

$single
'dendrogram' with 2 branches and 150 members total, at height 1.640122 

$complete
'dendrogram' with 2 branches and 150 members total, at height 7.085196 

$average
'dendrogram' with 2 branches and 150 members total, at height 4.062683 

$mcquitty
'dendrogram' with 2 branches and 150 members total, at height 4.497283 

$median
'dendrogram' with 2 branches and 150 members total, at height 2.82744 

$centroid
'dendrogram' with 2 branches and 150 members total, at height 2.994307 

$ward.D2
'dendrogram' with 2 branches and 150 members total, at height 32.44761 

attr(,"class")
[1] "dendlist"
get_ordered_3_clusters = function(dend){
  # order.dendrogram function returns the order (index)
  # or the "label" attribute for the leaves.
  # cutree: Cut the tree (dendrogram) into groups of data.
  cutree(dend, k = 3)[order.dendrogram(dend)]
}
dend_3_clusters = lapply(out, get_ordered_3_clusters)

# Calculate Fowlkes-Mallows Index (determine the similarity between clusterings)
compare_clusters_to_iris = function(clus){
  FM_index(clus, rep(1:3, each = 50), assume_sorted_vectors = TRUE)
}

clusters_performance = sapply(dend_3_clusters, compare_clusters_to_iris)
dotchart(sort(clusters_performance), xlim = c(0.3, 1),
         xlab = "Fowlkes-Mallows index",
         main = "Performance of linkage methods
         in detecting the 3 species \n in our example",
         pch = 19)

We might conclude that ward.D2 works best here. However, as we will learn later, optimizing the method without a hold-out for testing implies that our model may be overfitting. We should check this using cross-validation.

6.3.2 K-means Clustering

Another example for an unsupervised learning algorithm is k-means clustering, one of the simplest and most popular unsupervised machine learning algorithms.

To start the algorithm, you first have to specify the number of clusters (for our example, the number of species). Each cluster has a centroid, which is the assumed or real location representing the center of the cluster (for our example, this would be what an average plant of a specific species looks like). The algorithm starts by placing the centroids randomly. Each data point is then assigned to the cluster whose centroid increases the overall within-cluster sum of squares (the squared distances to the centroid) the least, i.e. usually the nearest centroid. After all data points have been assigned to a cluster, the centroids are updated, i.e. recomputed as the mean of their assigned points. Iterating this procedure until the assignments no longer change yields the (locally) optimal centroids and the data points belonging to each cluster. Note that results might differ depending on the initial positions of the centroids, so several (locally) optimal solutions might be found.

The “k” in K-means refers to the number of clusters and the ‘means’ refers to averaging the data-points to find the centroids.
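To make the procedure described above concrete, here is a minimal, illustrative sketch of the algorithm (Lloyd's algorithm). The function name my_kmeans and its internals are made up for this illustration, and edge cases such as empty clusters are not handled; in practice, one uses the built-in kmeans() function, as we do below.

my_kmeans = function(X, k, iterations = 100){
  # 1. Start with k randomly chosen observations as initial centroids.
  centroids = X[sample.int(nrow(X), k), , drop = FALSE]
  assignment = rep(0L, nrow(X))
  for(i in 1:iterations){
    # 2. Assign each point to its nearest centroid (Euclidean distance).
    D = as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    new_assignment = apply(D, 1, which.min)
    if(all(new_assignment == assignment)) break   # Assignments stable: converged.
    assignment = new_assignment
    # 3. Update each centroid as the mean of its assigned points.
    for(j in 1:k) centroids[j, ] = colMeans(X[assignment == j, , drop = FALSE])
  }
  return(list(cluster = assignment, centers = centroids))
}

set.seed(123)
res = my_kmeans(as.matrix(iris[,1:4]), k = 3)
table(iris$Species, res$cluster)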

A typical pipeline for using k-means clustering looks the same as for other algorithms: after visualizing the data, we fit the model, visualize the results, and assess the performance using a confusion matrix. By setting a fixed seed, we can ensure that the results are reproducible.

set.seed(123)

#Reminder: traits = as.matrix(iris[,1:4]).

kc = kmeans(traits, 3)
print(kc)
K-means clustering with 3 clusters of sizes 50, 62, 38

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053

Clustering vector:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
[112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
[149] 3 2

Within cluster sum of squares by cluster:
[1] 15.15100 39.82097 23.87947
 (between_SS / total_SS =  88.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

Visualizing the results. Color codes true species identity, symbol shows cluster result.

plot(iris[c("Sepal.Length", "Sepal.Width")],
     col =  colors[as.integer(species)], pch = kc$cluster)
points(kc$centers[, c("Sepal.Length", "Sepal.Width")],
       col = colors, pch = 1:3, cex = 3)

We see that there are some discrepancies. Confusion matrix:

table(iris$Species, kc$cluster)
            
              1  2  3
  setosa     50  0  0
  versicolor  0 48  2
  virginica   0 14 36

If you want to animate the clustering process, you could run

library(animation)

saveGIF(kmeans.ani(x = traits[,1:2], col = colors),
        interval = 1, ani.width = 800, ani.height = 800)

The elbow technique can be used to determine a suitable number of clusters:

set.seed(123)

getSumSq = function(k){ kmeans(traits, k, nstart = 25)$tot.withinss }

#Perform algorithm for different cluster sizes and retrieve variance.
iris.kmeans1to10 = sapply(1:10, getSumSq)
plot(1:10, iris.kmeans1to10, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares",
     col = c("black", "red", rep("black", 8)))

Often, one is interested in sparse models, and a higher k than necessary tends to overfit. At the kink (“elbow”) in the plot, the sum of squares has already dropped considerably while k is still low. But keep in mind that this is only a rule of thumb and might be misleading in some cases.

6.3.3 Density-based Clustering

Density-based clustering determines the cluster membership of a data point from the density of data points in its neighborhood, for example among its k nearest neighbors. This is a very general description, as there are many ways to do so; below we use the DBSCAN algorithm.
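The core idea of DBSCAN can be sketched in a few lines (the names D, n_within_eps, core and border are made up for this illustration; eps and minPts match the values used below): a point is a “core” point if at least minPts points (including itself) lie within distance eps of it, a non-core point within eps of a core point is a “border” point, and all remaining points are noise.

# traits = as.matrix(iris[,1:4]), as defined above.
eps = 0.4
minPts = 6
D = as.matrix(dist(traits))
n_within_eps = rowSums(D <= eps)   # Neighbors within eps (including the point itself).
core = n_within_eps >= minPts
border = !core & apply(D <= eps, 1, function(nb) any(nb & core))
table(ifelse(core, "core", ifelse(border, "border", "noise")))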

#Reminder: traits = as.matrix(iris[,1:4]).

library(dbscan)

Attaching package: 'dbscan'
The following object is masked from 'package:stats':

    as.dendrogram
set.seed(123)

kNNdistplot(traits, k = 4)   # Calculate and plot k-nearest-neighbor distances.
abline(h = 0.4, lty = 2)

dc = dbscan(traits, eps = 0.4, minPts = 6)
print(dc)
DBSCAN clustering for 150 objects.
Parameters: eps = 0.4, minPts = 6
Using euclidean distances and borderpoints = TRUE
The clustering contains 4 cluster(s) and 32 noise points.

 0  1  2  3  4 
32 46 36 14 22 

Available fields: cluster, eps, minPts, dist, borderPoints
library(factoextra)
fviz_cluster(dc, traits, geom = "point", ggtheme = theme_light())

6.3.4 Model-based Clustering

The last class of methods for unsupervised clustering are so-called model-based clustering methods.

library(mclust)
Package 'mclust' version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.
mb = Mclust(traits)

Mclust automatically compares a number of candidate models (number of clusters, cluster shape) according to the BIC (the BIC is a model selection criterion that balances model fit against the number of parameters used). We can look at the selected model via:

mb$G # Two clusters.
[1] 2
mb$modelName # > Ellipsoidal, equal shape.
[1] "VEV"

We see that the algorithm prefers 2 clusters. For better comparability with the other methods, we override this by setting the number of clusters to 3:

mb3 = Mclust(traits, 3)

Result in terms of the predicted densities for 3 clusters

plot(mb3, "density")

Predicted clusters:

plot(mb3, what=c("classification"), add = T)

Confusion matrix:

table(iris$Species, mb3$classification)
    setosa versicolor virginica
  1     50          0         0
  2      0         49        15
  3      0          1        35

6.3.5 Ordination

Ordination is used in exploratory analysis; similar to clustering, similar objects are placed (ordered) close together, so there is a relationship between clustering and ordination. Here is a PCA ordination of the iris data set.

pcTraits = prcomp(traits, center = TRUE, scale. = TRUE)
biplot(pcTraits, xlim = c(-0.25, 0.25), ylim = c(-0.25, 0.25))

You can cluster the results of this ordination, ordinate before clustering, or superimpose one on the other.
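For example, here is a sketch of clustering the ordination results by running k-means on the first two principal components (kc_pca is a name introduced only for this illustration):

kc_pca = kmeans(pcTraits$x[, 1:2], centers = 3, nstart = 25)
table(iris$Species, kc_pca$cluster)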

6.4 Exercise - kNN and SVM

Question: Hyperparameter tuning of kNN

Coming back to the titanic dataset from the morning, we want to optimize the number of neighbors (k) and the kernel of the kNN:

Prepare the data:

library(EcoData)
library(dplyr)

Attaching package: 'dplyr'
The following object is masked from 'package:ape':

    where
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(missRanger)
data(titanic_ml)
data = titanic_ml
data = 
  data %>% select(survived, sex, age, fare, pclass)
data[,-1] = missRanger(data[,-1], verbose = 0)

data_sub =
  data %>%
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$survived),] # data with known response

Hints:

  • check the help of the kNN function to get an idea about the hyperparameters
library(kknn)
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 3

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

hyper_k = ... # must be integer vector
hyper_kernel = ... # must be character vector

results = data.frame(
  set = rep(NA, cv),
  k = rep(NA, cv),
  kernel = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  
  tuning_results = 
      sapply(1:length(hyper_k), function(k) {
        predictions = kknn(as.factor(survived)~., train = train_outer, test = test_outer, k = hyper_k[k], scale = FALSE, kernel = hyper_kernel[k])
        return(Metrics::auc(test_outer$survived, predictions$prob[,2]))
      })
  
  results[i, 1] = i
  results[i, 2] = hyper_k[which.max(tuning_results)]
  results[i, 3] = hyper_kernel[which.max(tuning_results)]  
  results[i, 4] = max(tuning_results)
}

print(results)
library(kknn)
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 3

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

# Sample k values (must be integers) and kernel types.
hyper_k = sample(10, 10)
hyper_kernel = sample(c("triangular", "inv", "gaussian", "rank"), 10, replace = TRUE)

results = data.frame(
  set = rep(NA, cv),
  k = rep(NA, cv),
  kernel = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  
  tuning_results = 
      sapply(1:length(hyper_k), function(k) {
        predictions = kknn(as.factor(survived)~., train = train_outer, test = test_outer, k = hyper_k[k], scale = FALSE, kernel = hyper_kernel[k])
        return(Metrics::auc(test_outer$survived, predictions$prob[,2]))
      })
  
  results[i, 1] = i
  results[i, 2] = hyper_k[which.max(tuning_results)]
  results[i, 3] = hyper_kernel[which.max(tuning_results)]  
  results[i, 4] = max(tuning_results)
}

print(results)
  set  k   kernel       AUC
1   1 10 gaussian 0.8078358
2   2  9     rank 0.8047081
3   3  6      inv 0.8299800

Make predictions:

prediction_ensemble = 
  sapply(1:nrow(results), function(i) {
    predictions = kknn(as.factor(survived)~., train = data_obs, test = data_new, k = results$k[i], scale = FALSE, kernel = results$kernel[i])
    return(predictions$prob[,2])
  })

# Single predictions from the ensemble model:
write.csv(data.frame(y = apply(prediction_ensemble, 1, mean)), file = "Max_titanic_ensemble.csv")
Question: kNN and SVM

Fit a standard k-nearest-neighbor classifier and a support-vector machine with a linear kernel (check the help) on the Sonar dataset, and report which fitted better.

Prepare dataset:

library(mlbench)
set.seed(123)

data(Sonar)
data = Sonar
#str(data)

# Do not forget scaling! This may be done implicitly by most functions.
# Here, it's done explicitly for teaching purposes.
data = cbind.data.frame(
  scale(data[,-length(data)]),
  "class" = data[,length(data)]
)

n = length(data[,1])
indicesTrain = sample.int(n, (n+1) %/% 2) # Take (at least) 50 % of the data.

train = data[indicesTrain,]
test = data[-indicesTrain,]

Tasks:

  • Fit an SVM (from the e1071 package) on the train dataset and make predictions for the test dataset
  • Fit a kNN (from the kknn package) on the train dataset and make predictions for the test dataset
  • Calculate confusion matrices to compare the performance
library(e1071)
library(kknn)

knn = kknn(class~., train = train, test = test, scale = FALSE,
           kernel = "rectangular")
predKNN = predict(knn, newdata = test)

sm = svm(class~., data = train, scale = FALSE, kernel = "linear")
predSVM = predict(sm, newdata = test)
K-nearest-neighbor, standard (rectangular) kernel:
       labelsTest
predKNN  M  R
      M 46 29
      R  8 21
Correctly classified:  67  /  104
Support-vector machine, linear kernel:
       labelsTest
predSVM  M  R
      M 41 15
      R 13 35
Correctly classified:  76  /  104

The support-vector machine fitted better here.

6.5 Exercise - Unsupervised learning

Task

Go through the 4(5) unsupervised algorithms covered above (see also Section 2.2) and check whether they are sensitive, i.e. whether the results change, if you scale the input features (= predictors) instead of using the raw data.

Discuss in your group: Which is more appropriate for this analysis and/or in general: Scaling or not scaling?

library(dendextend)

methods = c("ward.D", "single", "complete", "average",
            "mcquitty", "median", "centroid", "ward.D2")

cluster_all_methods = function(distances){
  out = dendlist()
  for(method in methods){
    res = hclust(distances, method = method)
    out = dendlist(out, as.dendrogram(res))
  }
  names(out) = methods

  return(out)
}

get_ordered_3_clusters = function(dend){
  return(cutree(dend, k = 3)[order.dendrogram(dend)])
}

compare_clusters_to_iris = function(clus){
  return(FM_index(clus, rep(1:3, each = 50), assume_sorted_vectors = TRUE))
}

do_clustering = function(traits, scale = FALSE){
  set.seed(123)
  headline = "Performance of linkage methods\nin detecting the 3 species\n"

  if(scale){
    traits = scale(traits)  # Do scaling on copy of traits.
    headline = paste0(headline, "Scaled")
  }else{ headline = paste0(headline, "Not scaled") }

  distances = dist(traits)
  out = cluster_all_methods(distances)
  dend_3_clusters = lapply(out, get_ordered_3_clusters)
  clusters_performance = sapply(dend_3_clusters, compare_clusters_to_iris)
  dotchart(sort(clusters_performance), xlim = c(0.3,1),
           xlab = "Fowlkes-Mallows index",
           main = headline,
           pch = 19)
}

traits = as.matrix(iris[,1:4])

# Do clustering on unscaled data.
do_clustering(traits, FALSE)

# Do clustering on scaled data.
do_clustering(traits, TRUE)

It seems that scaling is harmful for hierarchical clustering here. But this might be misleading. Be careful: If your variables are on different units or magnitudes, scaling is definitely useful! Otherwise, variables with larger values get a higher influence.

do_clustering = function(traits, scale = FALSE){
  set.seed(123)

  if(scale){
    traits = scale(traits)  # Do scaling on copy of traits.
    headline = "K-means Clustering\nScaled\nSum of all tries: "
  }else{ headline = "K-means Clustering\nNot scaled\nSum of all tries: " }

  getSumSq = function(k){ kmeans(traits, k, nstart = 25)$tot.withinss }
  iris.kmeans1to10 = sapply(1:10, getSumSq)

  headline = paste0(headline, round(sum(iris.kmeans1to10), 2))

  plot(1:10, iris.kmeans1to10, type = "b", pch = 19, frame = FALSE,
       main = headline,
       xlab = "Number of clusters K",
       ylab = "Total within-clusters sum of squares",
       col = c("black", "red", rep("black", 8)) )
}

traits = as.matrix(iris[,1:4])

# Do clustering on unscaled data.
do_clustering(traits, FALSE)

# Do clustering on scaled data.
do_clustering(traits, TRUE)

It seems that scaling is harmful for K-means clustering here. But this might be misleading. Be careful: If your variables are on different units or magnitudes, scaling is definitely useful! Otherwise, variables with larger values get a higher influence.

library(dbscan)

correct = as.factor(iris[,5])
# Start at 1. Noise points will get 0 later.
levels(correct) = 1:length(levels(correct))
correct
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
[112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[149] 3 3
Levels: 1 2 3
do_clustering = function(traits, scale = FALSE){
  set.seed(123)

  if(scale){ traits = scale(traits) } # Do scaling on copy of traits.

  #####
  # Play around with the parameters "eps" and "minPts" on your own!
  #####
  dc = dbscan(traits, eps = 0.41, minPts = 4)

  labels = as.factor(dc$cluster)
  noise = sum(dc$cluster == 0)
  levels(labels) = c("noise", 1:( length(levels(labels)) - 1))

  tbl = table(correct, labels)
  correct_classified = 0
  for(i in 1:length(levels(correct))){
    correct_classified = correct_classified + tbl[i, i + 1]
  }

  cat( if(scale){ "Scaled" }else{ "Not scaled" }, "\n\n" )
  cat("Confusion matrix:\n")
  print(tbl)
  cat("\nCorrect classified points: ", correct_classified, " / ", length(iris[,5]))
  cat("\nSum of noise points: ", noise, "\n")
}

traits = as.matrix(iris[,1:4])

# Do clustering on unscaled data.
do_clustering(traits, FALSE)
Not scaled 

Confusion matrix:
       labels
correct noise  1  2  3  4
      1     3 47  0  0  0
      2     5  0 38  3  4
      3    17  0  0 33  0

Correct classified points:  118  /  150
Sum of noise points:  25 
# Do clustering on scaled data.
do_clustering(traits, TRUE)
Scaled 

Confusion matrix:
       labels
correct noise  1  2  3  4
      1     9 41  0  0  0
      2    14  0 36  0  0
      3    36  0  1  4  9

Correct classified points:  81  /  150
Sum of noise points:  59 

It seems that scaling is harmful for density-based clustering here. But this might be misleading. Be careful: If your variables are on different units or magnitudes, scaling is definitely useful! Otherwise, variables with larger values get a higher influence.

library(mclust)

do_clustering = function(traits, scale = FALSE){
  set.seed(123)

  if(scale){ traits = scale(traits) } # Do scaling on copy of traits.

  mb3 = Mclust(traits, 3)

  tbl = table(iris$Species, mb3$classification)

  cat( if(scale){ "Scaled" }else{ "Not scaled" }, "\n\n" )
  cat("Confusion matrix:\n")
  print(tbl)
  cat("\nCorrect classified points: ", sum(diag(tbl)), " / ", length(iris[,5]))
}

traits = as.matrix(iris[,1:4])

# Do clustering on unscaled data.
do_clustering(traits, FALSE)
Not scaled 

Confusion matrix:
            
              1  2  3
  setosa     50  0  0
  versicolor  0 45  5
  virginica   0  0 50

Correct classified points:  145  /  150
# Do clustering on scaled data.
do_clustering(traits, TRUE)
Scaled 

Confusion matrix:
            
              1  2  3
  setosa     50  0  0
  versicolor  0 45  5
  virginica   0  0 50

Correct classified points:  145  /  150

For model-based clustering, scaling does not matter, at least in this example.

traits = as.matrix(iris[,1:4])

biplot(prcomp(traits, center = TRUE, scale. = TRUE),
       main = "Use integrated scaling")

biplot(prcomp(scale(traits), center = FALSE, scale. = FALSE),
       main = "Scale explicitly")

biplot(prcomp(traits, center = FALSE, scale. = FALSE),
       main = "No scaling at all")

For PCA ordination, scaling matters. Because we are interested in directions of maximal variance, all variables should be scaled, or the one with the largest values might dominate all others.