6 Distance-based Algorithms

In this chapter, we introduce support-vector machines (SVMs) and other distance-based methods. Hint: Distance-based models need scaling!

6.1 K-Nearest-Neighbor

K-nearest-neighbor (kNN) is a simple algorithm that stores all available cases and classifies new data based on a similarity measure. It is mostly used to classify a data point based on how its \(k\) nearest neighbors are classified.

Let us first see an example:

x = scale(iris[,1:4])
y = iris[,5]

plot(x[-100,1], x[-100, 3], col = y)
points(x[100,1], x[100, 3], col = "blue", pch = 18, cex = 1.3)
Which class would you decide for the blue point? What are the classes of the nearest points? Well, this procedure is used by the k-nearest-neighbors classifier and thus there is actually no “real” learning in a k-nearest-neighbors classification.
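To make this concrete, here is a minimal sketch (assuming x and y from the code above; the choice k = 5 is arbitrary) that classifies the blue point (observation 100) by hand: compute its distances to all other observations and take a majority vote among the k closest ones.

k = 5
dists = sqrt(colSums((t(x[-100, ]) - x[100, ])^2))  # Euclidean distances from point 100 to all others.
nearest = order(dists)[1:k]                         # Indices of the k closest points.
table(y[-100][nearest])                             # Majority vote decides the class.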
For applying a k-nearest-neighbors classification, we first have to scale the data set, because we deal with distances and want all predictors to have the same influence. Imagine one variable has values from -10,000 to 10,000 and another from -1 to 1: the influence of the first variable on the distances to the other points is then much stronger than the influence of the second (a small numeric illustration follows below). On the iris data set, we have to split the data into training and test sets ourselves. Then we follow the usual pipeline.
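A small, hypothetical numeric illustration of this effect (the values are made up for this sketch):

x1 = c(-10000, 0, 10000)       # Predictor on a large scale.
x2 = c(-1, 0, 1)               # Predictor on a small scale.
dist(cbind(x1, x2))            # Distances are dominated by x1.
dist(scale(cbind(x1, x2)))     # After scaling, both predictors contribute equally.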
data = iris
data[,1:4] = apply(data[,1:4], 2, scale)
indices = sample.int(nrow(data), 0.7*nrow(data))
train = data[indices,]
test = data[-indices,]
Fit model and create predictions:
library(kknn)
set.seed(123)
knn = kknn(Species~., train = train, test = test)
summary(knn)
Call:
kknn(formula = Species ~ ., train = train, test = test)
Response: "nominal"
fit prob.setosa prob.versicolor prob.virginica
1 setosa 1 0.00000000 0.00000000
2 setosa 1 0.00000000 0.00000000
3 setosa 1 0.00000000 0.00000000
4 setosa 1 0.00000000 0.00000000
5 setosa 1 0.00000000 0.00000000
6 setosa 1 0.00000000 0.00000000
7 setosa 1 0.00000000 0.00000000
8 setosa 1 0.00000000 0.00000000
9 setosa 1 0.00000000 0.00000000
10 setosa 1 0.00000000 0.00000000
11 setosa 1 0.00000000 0.00000000
12 setosa 1 0.00000000 0.00000000
13 setosa 1 0.00000000 0.00000000
14 setosa 1 0.00000000 0.00000000
15 versicolor 0 1.00000000 0.00000000
16 versicolor 0 1.00000000 0.00000000
17 versicolor 0 1.00000000 0.00000000
18 versicolor 0 1.00000000 0.00000000
19 versicolor 0 1.00000000 0.00000000
20 versicolor 0 1.00000000 0.00000000
21 virginica 0 0.27541743 0.72458257
22 versicolor 0 1.00000000 0.00000000
23 virginica 0 0.06450608 0.93549392
24 versicolor 0 1.00000000 0.00000000
25 versicolor 0 1.00000000 0.00000000
26 versicolor 0 0.98430840 0.01569160
27 versicolor 0 1.00000000 0.00000000
28 versicolor 0 1.00000000 0.00000000
29 versicolor 0 1.00000000 0.00000000
30 versicolor 0 0.82711887 0.17288113
31 virginica 0 0.04881448 0.95118552
32 virginica 0 0.00000000 1.00000000
33 versicolor 0 0.93549392 0.06450608
34 virginica 0 0.00000000 1.00000000
35 virginica 0 0.00000000 1.00000000
36 virginica 0 0.00000000 1.00000000
37 virginica 0 0.00000000 1.00000000
38 versicolor 0 0.65252004 0.34747996
39 virginica 0 0.12578435 0.87421565
40 virginica 0 0.01569160 0.98430840
41 virginica 0 0.00000000 1.00000000
42 virginica 0 0.00000000 1.00000000
43 virginica 0 0.00000000 1.00000000
44 virginica 0 0.00000000 1.00000000
45 virginica 0 0.08512700 0.91487300
table(test$Species, fitted(knn))
setosa versicolor virginica
setosa 14 0 0
versicolor 0 13 2
virginica 0 3 13
6.2 Support Vector Machines (SVMs)
Support-vector machines take a different approach: they try to divide the predictor space into sectors, one for each class. To do so, a support-vector machine fits the parameters of a hyperplane (an \((n-1)\)-dimensional subspace of an \(n\)-dimensional space) in the predictor space by maximizing the margin, i.e. the distance between the hyperplane and the nearest points of each class.
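To make the hyperplane idea concrete, here is a minimal base-R sketch (this is not the SVM fit itself, which follows below; the weight vector w and intercept b are made up for illustration): in two dimensions, a hyperplane is just the line \(w_1 x_1 + w_2 x_2 + b = 0\), and a point is classified by the side of that line it falls on.

w = c(1, -1)                               # Hypothetical normal vector of the hyperplane.
b = 0.5                                    # Hypothetical intercept.
pts = matrix(rnorm(20), ncol = 2)          # Some random 2D points.
side = sign(pts %*% w + b)                 # +1 or -1: which side of the hyperplane.
plot(pts, col = ifelse(side > 0, "red", "blue"), pch = 19)
abline(a = -b/w[2], b = -w[1]/w[2])        # The hyperplane (a line in 2D).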
Fitting a support-vector machine:
library(e1071)
data = iris
data[,1:4] = apply(data[,1:4], 2, scale)
indices = sample.int(nrow(data), 0.7*nrow(data))
train = data[indices,]
test = data[-indices,]

sm = svm(Species~., data = train, kernel = "linear")
pred = predict(sm, newdata = test)
oldpar = par(mfrow = c(1, 2))
plot(test$Sepal.Length, test$Petal.Length,
col = pred, main = "predicted")
plot(test$Sepal.Length, test$Petal.Length,
col = test$Species, main = "observed")
par(oldpar)
mean(pred == test$Species) # Accuracy.
[1] 0.9777778
Support-vector machines can only solve linearly separable problems directly. (A problem is called linearly separable if there exists at least one hyperplane with all points of one class on one side and all points of the other classes on the other side.)
If this is not the case, we can use the so-called kernel trick, which maps the predictor space into a (higher-dimensional) space in which the problem is linearly separable. After the boundaries have been identified in the higher-dimensional space, we can project them back into the original dimensions.
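As a toy illustration of this idea (a sketch with simulated data, not the example used below): a class that depends only on the distance from the origin is not linearly separable in the original two dimensions, but it becomes separable after adding the mapped feature \(z = x_1^2 + x_2^2\).

set.seed(1)
x1 = runif(200, -1, 1)
x2 = runif(200, -1, 1)
cls = ifelse(x1^2 + x2^2 < 0.5, 1, 2)     # Inner disc vs. outer ring.

plot(x1, x2, col = cls, pch = 19)         # No straight line separates the two classes here.

z = x1^2 + x2^2                           # Additional (mapped) dimension.
plot(z, col = cls, pch = 19)              # A simple threshold on z separates them.
abline(h = 0.5, lty = 2)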
x1 = seq(-3, 3, length.out = 100)
x2 = seq(-3, 3, length.out = 100)
X = expand.grid(x1, x2)
y = apply(X, 1, function(t) exp(-t[1]^2 - t[2]^2))
y = ifelse(1/(1+exp(-y)) < 0.62, 0, 1)
image(matrix(y, 100, 100))
animation::saveGIF(
  {
    for(i in c("truth", "linear", "radial", "sigmoid")){
      if(i == "truth"){
        image(matrix(y, 100, 100),
              main = "Ground truth", axes = FALSE, las = 2)
      }else{
        sv = e1071::svm(x = X, y = factor(y), kernel = i)
        image(matrix(as.numeric(as.character(predict(sv, X))), 100, 100),
              main = paste0("Kernel: ", i), axes = FALSE, las = 2)
        axis(1, at = seq(0, 1, length.out = 10),
             labels = round(seq(-3, 3, length.out = 10), 1))
        axis(2, at = seq(0, 1, length.out = 10),
             labels = round(seq(-3, 3, length.out = 10), 1), las = 2)
      }
    }
  },
  movie.name = "svm.gif", autobrowse = FALSE, interval = 2
)
As you have seen, this does not work equally well with every kernel. Hence, the challenge is to find a suitable kernel; this is again an optimization (tuning) problem and can only be solved approximately.
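A simple (hedged) way to approach this is to treat the kernel as a hyperparameter and compare candidates on held-out data, for example with the iris train/test split from above; in practice you would do this with proper cross-validation.

kernels = c("linear", "polynomial", "radial", "sigmoid")
sapply(kernels, function(k) {
  fit = e1071::svm(Species~., data = train, kernel = k)
  mean(predict(fit, newdata = test) == test$Species)   # Hold-out accuracy per kernel.
})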
6.3 Clustering methods
In unsupervised learning, we want to identify patterns in data without having any examples (supervision) about what the correct patterns / classes are. As an example, consider the iris data set. Here, we have 150 observations of 4 floral traits:
iris = datasets::iris
colors = hcl.colors(3)
traits = as.matrix(iris[,1:4])
species = iris$Species

image(y = 1:4, x = 1:length(species), z = traits,
      ylab = "Floral trait", xlab = "Individual")
segments(50.5, 0, 50.5, 5, col = "black", lwd = 2)
segments(100.5, 0, 100.5, 5, col = "black", lwd = 2)
The observations are from 3 species and indeed those species tend to have different traits, meaning that the observations form 3 clusters.
pairs(traits, pch = as.integer(species), col = colors[as.integer(species)])
However, imagine we did not know what species are, which is essentially the situation people in antiquity were in: they simply noticed that some plants have different flowers than others and decided to give them different names. This kind of process is what unsupervised learning does.
6.3.1 Hierarchical Clustering
A cluster refers to a collection of data points aggregated together because of certain similarities.
In hierarchical clustering, a hierarchy (tree) between data points is built.
- Agglomerative: Start with each data point in their own cluster, merge them up hierarchically.
- Divisive: Start with all data points in one cluster, and split hierarchically.
Merges / splits are done according to a linkage criterion, which measures the distance between (potential) clusters. Cut the tree at a certain height to get the clusters.
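As a small illustration of what a linkage criterion measures, consider two tiny made-up clusters of one-dimensional points: single linkage uses the smallest pairwise distance between the clusters, complete linkage the largest.

clusterA = c(1.0, 1.5, 2.0)                      # Hypothetical cluster A.
clusterB = c(4.0, 4.5)                           # Hypothetical cluster B.
pairwise = abs(outer(clusterA, clusterB, "-"))   # All between-cluster distances.
min(pairwise)   # Single linkage distance (2.0).
max(pairwise)   # Complete linkage distance (3.5).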
Here is an example:
set.seed(123)
#Reminder: traits = as.matrix(iris[,1:4]).
d = dist(traits)
hc = hclust(d, method = "complete")
plot(hc, main="")
rect.hclust(hc, k = 3) # Draw rectangles around the branches.
The same plot, but with colors for the true species identity:
library(ape)
plot(as.phylo(hc),
tip.color = colors[as.integer(species)],
direction = "downwards")
hcRes3 = cutree(hc, k = 3)  # Cut a dendrogram tree into groups.
Calculate the confusion matrix. Note that we are switching labels here so that they match the species.
tmp = hcRes3
tmp[hcRes3 == 2] = 3
tmp[hcRes3 == 3] = 2
hcRes3 = tmp
table(hcRes3, species)
      species
hcRes3 setosa versicolor virginica
     1     50          0         0
     2      0         27         1
     3      0         23        49
Note that the results might change if you choose a different agglomeration method, distance metric, or scaling of your variables. Compare, e.g., with this example:
hc = hclust(d, method = "ward.D2")
plot(as.phylo(hc),
tip.color = colors[as.integer(species)],
direction = "downwards")
hcRes3 = cutree(hc, k = 3)  # Cut a dendrogram tree into groups.
table(hcRes3, species)
      species
hcRes3 setosa versicolor virginica
     1     50          0         0
     2      0         49        15
     3      0          1        35
Which method is best?
library(dendextend)
set.seed(123)
= c("ward.D", "single", "complete", "average",
methods "mcquitty", "median", "centroid", "ward.D2")
= dendlist() # Create a dendlist object from several dendrograms.
out for(method in methods){
= hclust(d, method = method)
res = dendlist(out, as.dendrogram(res))
out
}names(out) = methods
print(out)
$ward.D
'dendrogram' with 2 branches and 150 members total, at height 199.6205
$single
'dendrogram' with 2 branches and 150 members total, at height 1.640122
$complete
'dendrogram' with 2 branches and 150 members total, at height 7.085196
$average
'dendrogram' with 2 branches and 150 members total, at height 4.062683
$mcquitty
'dendrogram' with 2 branches and 150 members total, at height 4.497283
$median
'dendrogram' with 2 branches and 150 members total, at height 2.82744
$centroid
'dendrogram' with 2 branches and 150 members total, at height 2.994307
$ward.D2
'dendrogram' with 2 branches and 150 members total, at height 32.44761
attr(,"class")
[1] "dendlist"
get_ordered_3_clusters = function(dend){
  # order.dendrogram returns the order (index)
  # or the "label" attribute of the leaves.
  # cutree: Cut the tree (dendrogram) into groups of data.
  cutree(dend, k = 3)[order.dendrogram(dend)]
}
dend_3_clusters = lapply(out, get_ordered_3_clusters)

# Calculate the Fowlkes-Mallows index (measures the similarity between two clusterings).
compare_clusters_to_iris = function(clus){
  FM_index(clus, rep(1:3, each = 50), assume_sorted_vectors = TRUE)
}

clusters_performance = sapply(dend_3_clusters, compare_clusters_to_iris)
dotchart(sort(clusters_performance), xlim = c(0.3, 1),
xlab = "Fowlkes-Mallows index",
main = "Performance of linkage methods
in detecting the 3 species \n in our example",
pch = 19)
We might conclude that ward.D2 works best here. However, as we will learn later, optimizing the method without a hold-out for testing implies that our model may be overfitting. We should check this using cross-validation.
6.3.2 K-means Clustering
Another example of an unsupervised learning algorithm is k-means clustering, one of the simplest and most popular unsupervised machine learning algorithms.

To start the algorithm, you first have to specify the number of clusters (for our example, the number of species). Each cluster has a centroid, which is the assumed or real location of the center of the cluster (for our example, this would be what an average plant of a specific species looks like). The algorithm starts by placing the centroids at random. Afterwards, each data point is assigned to the cluster whose centroid it is closest to, i.e. the assignment that increases the within-cluster sum of squares (variance) the least. After all data points have been assigned to a cluster, the centroids are updated. By iterating this procedure until the assignments no longer change, the algorithm finds the (locally) optimal centroids and the data points belonging to each cluster. Note that the results might differ depending on the initial positions of the centroids; thus, several (locally) optimal solutions might be found.

The "k" in k-means refers to the number of clusters, and the "means" refers to averaging the data points to find the centroids.
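To illustrate the procedure described above, here is a minimal from-scratch sketch of the k-means loop (random initialization, assignment step, centroid update). It is for illustration only; a real implementation would also handle empty clusters and check for convergence, and below we simply use the built-in kmeans() function.

set.seed(123)
k = 3
centroids = traits[sample(nrow(traits), k), ]     # Random initial centroids.
for(iter in 1:10){
  # Assignment step: each point goes to its closest centroid.
  assignment = apply(traits, 1, function(p) which.min(colSums((t(centroids) - p)^2)))
  # Update step: each centroid becomes the mean of its assigned points.
  centroids = t(sapply(1:k, function(j) colMeans(traits[assignment == j, , drop = FALSE])))
}
table(assignment, iris$Species)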
A typical pipeline for using k-means clustering looks the same as for other algorithms. After having visualized the data, we fit a model, visualize the results and have a look at the performance by use of the confusion matrix. By setting a fixed seed, we can ensure that results are reproducible.
set.seed(123)
#Reminder: traits = as.matrix(iris[,1:4]).
kc = kmeans(traits, 3)
print(kc)
K-means clustering with 3 clusters of sizes 50, 62, 38
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.006000 3.428000 1.462000 0.246000
2 5.901613 2.748387 4.393548 1.433871
3 6.850000 3.073684 5.742105 2.071053
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
[112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
[149] 3 2
Within cluster sum of squares by cluster:
[1] 15.15100 39.82097 23.87947
(between_SS / total_SS = 88.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
Visualizing the results: color codes the true species identity, the symbol shows the cluster assignment.
plot(iris[c("Sepal.Length", "Sepal.Width")],
col = colors[as.integer(species)], pch = kc$cluster)
points(kc$centers[, c("Sepal.Length", "Sepal.Width")],
col = colors, pch = 1:3, cex = 3)
We see that there are some discrepancies. Confusion matrix:
table(iris$Species, kc$cluster)
1 2 3
setosa 50 0 0
versicolor 0 48 2
virginica 0 14 36
If you want to animate the clustering process, you could run
library(animation)
saveGIF(kmeans.ani(x = traits[,1:2], col = colors),
interval = 1, ani.width = 800, ani.height = 800)
Elbow technique to determine the most suitable number of clusters:
set.seed(123)
getSumSq = function(k){ kmeans(traits, k, nstart = 25)$tot.withinss }

# Perform the algorithm for different cluster sizes and retrieve the variance.
iris.kmeans1to10 = sapply(1:10, getSumSq)
plot(1:10, iris.kmeans1to10, type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Total within-clusters sum of squares",
col = c("black", "red", rep("black", 8)))
Often, one is interested in sparse models. Furthermore, a higher k than necessary tends to overfit. At the kink ("elbow") in the plot, the sum of squares has already dropped considerably while k is still small. But keep in mind that this is only a rule of thumb and might be misleading in some cases.
6.3.3 Density-based Clustering
Density-based clustering determines the cluster membership of a data point according to the density of its neighborhood, e.g. based on its k nearest neighbors. This is a very general description, as there are many ways to do so.
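The density idea can be sketched as follows (a simplified illustration, not the exact DBSCAN algorithm used below): a point is a "core point" if at least minPts observations lie within a radius eps around it; clusters are grown from neighboring core points, and points that are not density-reachable remain noise.

eps = 0.4
minPts = 6
D = as.matrix(dist(traits))
neighborhood_size = rowSums(D <= eps)   # Number of points within eps (including the point itself).
table(neighborhood_size >= minPts)      # TRUE = candidate core points, FALSE = border or noise points.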
#Reminder: traits = as.matrix(iris[,1:4]).
library(dbscan)
Attaching package: 'dbscan'
The following object is masked from 'package:stats':
as.dendrogram
set.seed(123)
kNNdistplot(traits, k = 4) # Calculate and plot k-nearest-neighbor distances.
abline(h = 0.4, lty = 2)
dc = dbscan(traits, eps = 0.4, minPts = 6)
print(dc)
DBSCAN clustering for 150 objects.
Parameters: eps = 0.4, minPts = 6
Using euclidean distances and borderpoints = TRUE
The clustering contains 4 cluster(s) and 32 noise points.
0 1 2 3 4
32 46 36 14 22
Available fields: cluster, eps, minPts, dist, borderPoints
library(factoextra)
fviz_cluster(dc, traits, geom = "point", ggtheme = theme_light())
6.3.4 Model-based Clustering
The last class of methods for unsupervised clustering are so-called model-based clustering methods.
library(mclust)
Package 'mclust' version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.
mb = Mclust(traits)
Mclust automatically compares a number of candidate models (number of clusters, cluster shape) according to the BIC (the BIC is a model-selection criterion that trades off how well a model fits the data against the number of parameters it uses). We can look at the selected model via:
mb$G  # Two clusters.
[1] 2
mb$modelName  # Ellipsoidal, equal shape.
[1] "VEV"
We see that the algorithm prefers having 2 clusters. For better comparability with the other methods, we will override this by setting:
mb3 = Mclust(traits, 3)
The result in terms of the predicted densities for 3 clusters:
plot(mb3, "density")
Predicted clusters:
plot(mb3, what=c("classification"), add = T)
Confusion matrix:
table(iris$Species, mb3$classification)
              1  2  3
  setosa     50  0  0
  versicolor  0 49 15
  virginica   0  1 35
6.3.5 Ordination
Ordination is used in exploratory analysis, and as in clustering, similar objects are arranged close to each other, so there is a relationship between clustering and ordination. Here is a PCA ordination of the iris data set.
pcTraits = prcomp(traits, center = TRUE, scale. = TRUE)
biplot(pcTraits, xlim = c(-0.25, 0.25), ylim = c(-0.25, 0.25))
You can cluster the results of this ordination, ordinate before clustering, or superimpose one on the other.
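For example, a short sketch of clustering on the ordination: run k-means on the first two principal components and compare the result with the species.

set.seed(123)
pcScores = pcTraits$x[, 1:2]            # Scores on the first two principal components.
kcPC = kmeans(pcScores, 3)
table(iris$Species, kcPC$cluster)       # Compare clusters in PCA space with the species.
plot(pcScores, col = kcPC$cluster, pch = as.integer(iris$Species))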
6.4 Exercise - kNN and SVM
Coming back to the titanic dataset from the morning, we want to optimize the number of neighbors (k) and the kernel of the kNN:
Prepare the data:
library(EcoData)
library(dplyr)
Attaching package: 'dplyr'
The following object is masked from 'package:ape':
where
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(missRanger)
data(titanic_ml)
data = titanic_ml
data =
  data %>% select(survived, sex, age, fare, pclass)
data[,-1] = missRanger(data[,-1], verbose = 0)

data_sub =
  data %>%
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),]  # Data for which we want to make predictions at the end.
data_obs = data_sub[!is.na(data_sub$survived),] # Data with known response.
Hints:
- check the help of the kNN function to get an idea about the hyperparameters
library(kknn)
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),]
cv = 3

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

# Sample hyperparameter candidates (k must be an integer).
hyper_k = sample(10, 10)
hyper_kernel = sample(c("triangular", "inv", "gaussian", "rank"), 10, replace = TRUE)

results = data.frame(
  set = rep(NA, cv),
  k = rep(NA, cv),
  kernel = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]

  tuning_results =
    sapply(1:length(hyper_k), function(k) {
      predictions = kknn(as.factor(survived)~., train = train_outer, test = test_outer,
                         k = hyper_k[k], scale = FALSE, kernel = hyper_kernel[k])
      return(Metrics::auc(test_outer$survived, predictions$prob[,2]))
    })

  results[i, 1] = i
  results[i, 2] = hyper_k[which.max(tuning_results)]
  results[i, 3] = hyper_kernel[which.max(tuning_results)]
  results[i, 4] = max(tuning_results)
}
print(results)
set k kernel AUC
1 1 10 gaussian 0.8078358
2 2 9 rank 0.8047081
3 3 6 inv 0.8299800
Make predictions:
prediction_ensemble =
  sapply(1:nrow(results), function(i) {
    predictions = kknn(as.factor(survived)~., train = data_obs, test = data_new,
                       k = results$k[i], scale = FALSE, kernel = results$kernel[i])
    return(predictions$prob[,2])
  })
# Single predictions from the ensemble model:
write.csv(data.frame(y = apply(prediction_ensemble, 1, mean)), file = "Max_titanic_ensemble.csv")
Fit a standard k-nearest-neighbor classifier and a support-vector machine with a linear kernel (check the help) on the Sonar dataset, and report which fitted better.
Prepare dataset:
library(mlbench)
set.seed(123)
data(Sonar)
data = Sonar
#str(data)

# Do not forget scaling! This may be done implicitly by most functions.
# Here, it's done explicitly for teaching purposes.
data = cbind.data.frame(
  scale(data[,-length(data)]),
  "class" = data[,length(data)]
)

n = length(data[,1])
indicesTrain = sample.int(n, (n+1) %/% 2)  # Take (at least) 50 % of the data.

train = data[indicesTrain,]
test = data[-indicesTrain,]
Tasks:
- Fit an SVM (from the e1071 package) on the train dataset and make predictions for the test dataset
- Fit a kNN (from the kknn package) on the train dataset and make predictions for the test dataset
- Calculate confusion matrices to compare the performance
library(e1071)
library(kknn)
knn = kknn(class~., train = train, test = test, scale = FALSE,
           kernel = "rectangular")
predKNN = predict(knn, newdata = test)

sm = svm(class~., data = train, scale = FALSE, kernel = "linear")
predSVM = predict(sm, newdata = test)
K-nearest-neighbor, standard (rectangular) kernel:
labelsTest
predKNN M R
M 46 29
R 8 21
Correctly classified: 67 / 104
Support-vector machine, linear kernel:
labelsTest
predSVM M R
M 41 15
R 13 35
Correctly classified: 76 / 104
The support-vector machine fitted better (76 versus 67 correctly classified test points).
6.5 Exercise - Unsupervised learning
Go through the 4(5) unsupervised algorithms from this chapter and check whether they are sensitive to scaling, i.e. whether the results change when you scale the input features (= predictors) instead of using the raw data.
Discuss in your group: Which is more appropriate for this analysis and/or in general: scaling or not scaling?
library(dendextend)
= c("ward.D", "single", "complete", "average",
methods "mcquitty", "median", "centroid", "ward.D2")
= function(distances){
cluster_all_methods = dendlist()
out for(method in methods){
= hclust(distances, method = method)
res = dendlist(out, as.dendrogram(res))
out
}names(out) = methods
return(out)
}
= function(dend){
get_ordered_3_clusters return(cutree(dend, k = 3)[order.dendrogram(dend)])
}
= function(clus){
compare_clusters_to_iris return(FM_index(clus, rep(1:3, each = 50), assume_sorted_vectors = TRUE))
}
= function(traits, scale = FALSE){
do_clustering set.seed(123)
= "Performance of linkage methods\nin detecting the 3 species\n"
headline
if(scale){
= scale(traits) # Do scaling on copy of traits.
traits = paste0(headline, "Scaled")
headline else{ headline = paste0(headline, "Not scaled") }
}
= dist(traits)
distances = cluster_all_methods(distances)
out = lapply(out, get_ordered_3_clusters)
dend_3_clusters = sapply(dend_3_clusters, compare_clusters_to_iris)
clusters_performance dotchart(sort(clusters_performance), xlim = c(0.3,1),
xlab = "Fowlkes-Mallows index",
main = headline,
pch = 19)
}
= as.matrix(iris[,1:4])
traits
# Do clustering on unscaled data.
do_clustering(traits, FALSE)
# Do clustering on scaled data.
do_clustering(traits, TRUE)
It seems that scaling is harmful for hierarchical clustering here. But this might be deceptive. Be careful: If you have data on different units or magnitudes, scaling is definitely useful! Otherwise, variables with larger values get a higher influence.
do_clustering = function(traits, scale = FALSE){
  set.seed(123)

  if(scale){
    traits = scale(traits)  # Do scaling on copy of traits.
    headline = "K-means Clustering\nScaled\nSum of all tries: "
  }else{ headline = "K-means Clustering\nNot scaled\nSum of all tries: " }

  getSumSq = function(k){ kmeans(traits, k, nstart = 25)$tot.withinss }
  iris.kmeans1to10 = sapply(1:10, getSumSq)

  headline = paste0(headline, round(sum(iris.kmeans1to10), 2))

  plot(1:10, iris.kmeans1to10, type = "b", pch = 19, frame = FALSE,
       main = headline,
       xlab = "Number of clusters K",
       ylab = "Total within-clusters sum of squares",
       col = c("black", "red", rep("black", 8)))
}

traits = as.matrix(iris[,1:4])
# Do clustering on unscaled data.
do_clustering(traits, FALSE)
# Do clustering on scaled data.
do_clustering(traits, TRUE)
It seems that scaling is harmful for k-means clustering here. But this might be deceptive. Be careful: If you have data on different units or magnitudes, scaling is definitely useful! Otherwise, variables with larger values get a higher influence.
library(dbscan)
correct = as.factor(iris[,5])
# Start at 1. Noise points will get 0 later.
levels(correct) = 1:length(levels(correct))
correct
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
[112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[149] 3 3
Levels: 1 2 3
do_clustering = function(traits, scale = FALSE){
  set.seed(123)

  if(scale){ traits = scale(traits) }  # Do scaling on copy of traits.

  #####
  # Play around with the parameters "eps" and "minPts" on your own!
  #####

  dc = dbscan(traits, eps = 0.41, minPts = 4)

  labels = as.factor(dc$cluster)
  noise = sum(dc$cluster == 0)
  levels(labels) = c("noise", 1:(length(levels(labels)) - 1))

  tbl = table(correct, labels)
  correct_classified = 0
  for(i in 1:length(levels(correct))){
    correct_classified = correct_classified + tbl[i, i + 1]
  }

  cat(if(scale){ "Scaled" }else{ "Not scaled" }, "\n\n")
  cat("Confusion matrix:\n")
  print(tbl)
  cat("\nCorrect classified points: ", correct_classified, " / ", length(iris[,5]))
  cat("\nSum of noise points: ", noise, "\n")
}

traits = as.matrix(iris[,1:4])
# Do clustering on unscaled data.
do_clustering(traits, FALSE)
Not scaled
Confusion matrix:
labels
correct noise 1 2 3 4
1 3 47 0 0 0
2 5 0 38 3 4
3 17 0 0 33 0
Correct classified points: 118 / 150
Sum of noise points: 25
# Do clustering on scaled data.
do_clustering(traits, TRUE)
Scaled
Confusion matrix:
labels
correct noise 1 2 3 4
1 9 41 0 0 0
2 14 0 36 0 0
3 36 0 1 4 9
Correct classified points: 81 / 150
Sum of noise points: 59
It seems that scaling is harmful for density-based clustering here. But this might be deceptive. Be careful: If you have data on different units or magnitudes, scaling is definitely useful! Otherwise, variables with larger values get a higher influence.
library(mclust)
do_clustering = function(traits, scale = FALSE){
  set.seed(123)

  if(scale){ traits = scale(traits) }  # Do scaling on copy of traits.

  mb3 = Mclust(traits, 3)

  tbl = table(iris$Species, mb3$classification)

  cat(if(scale){ "Scaled" }else{ "Not scaled" }, "\n\n")
  cat("Confusion matrix:\n")
  print(tbl)
  cat("\nCorrect classified points: ", sum(diag(tbl)), " / ", length(iris[,5]))
}

traits = as.matrix(iris[,1:4])
# Do clustering on unscaled data.
do_clustering(traits, FALSE)
Not scaled
Confusion matrix:
1 2 3
setosa 50 0 0
versicolor 0 45 5
virginica 0 0 50
Correct classified points: 145 / 150
# Do clustering on scaled data.
do_clustering(traits, TRUE)
Scaled
Confusion matrix:
1 2 3
setosa 50 0 0
versicolor 0 45 5
virginica 0 0 50
Correct classified points: 145 / 150
For model-based clustering, scaling does not matter here.
traits = as.matrix(iris[,1:4])
biplot(prcomp(traits, center = TRUE, scale. = TRUE),
main = "Use integrated scaling")
biplot(prcomp(scale(traits), center = FALSE, scale. = FALSE),
main = "Scale explicitly")
biplot(prcomp(traits, center = FALSE, scale. = FALSE),
main = "No scaling at all")
For PCA ordination, scaling matters: because we are interested in the directions of maximal variance, all variables should be scaled, or the ones with the largest values might dominate all others.