11 Multivariate Statistics
Multivariate statistics help us analyze datasets that have many variables at once. Instead of looking at each variable separately, we look at how the variables relate to each other and to the samples in our data. In ecology, for example, this helps us understand patterns in species composition or environmental variables. In this class, we'll explore two common multivariate methods:
- Ordination (to reduce complexity and visualize patterns), and
- Clustering (to group similar observations).
11.1 Ordination
11.1.1 PCA
We'll start with Principal Component Analysis (PCA). This is a method to reduce the number of variables while keeping most of the variation in the data. It's helpful for visualizing and summarizing complex datasets. The focus of PCA is on the relationships among the variables!
We'll use the classic iris dataset, which has flower measurements for three iris species.
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
The pairs() plot shows the relationships between all pairs of variables, and how they differ among the species.
pairs(iris, col = iris$Species)
We run PCA using the four numeric columns and scale the variables to give them equal weight. The summary shows how much variance is explained by each principal component. The “cumulative proportion” shows how much of the total variation is captured as we include more components.
pca = prcomp(iris[, 1:4], scale = TRUE) # always set scale = TRUE
# when data is very skewed --> better transform first, e.g. log
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
# standard deviation^2 is variance!!!
# cum prop of PC2 is the variance that is visualized in a biplot
We can visualize the variance explained by each component. The scree plot shows the absolute variance, while the barplot shows the proportion explained by each component. These plots help us decide how many principal components to keep. Often, we look for a small number of components that explain most of the variation.
# absolute variance of each component
plot(pca) # see row1 of the summary(pca): (sd)^2 = variance
# rel variance of each component
barplot(summary(pca)$importance[2, ],
        ylab = "proportion of variance explained") # displays % of variance explained by each PC
The biplot shows how the samples are positioned (numbers) in the reduced space (based on the first two components), and how the variables contribute to those components (red arrows).
biplot(pca) # displays PC1 and PC2 AND rotation (vectors) of the different variables AND observations
abline(v = 0, lty = "dashed") # adding reference lines at 0,0 to make the plot easier to read
abline(h = 0, lty = "dashed")
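The arrow directions in the biplot come from the variable loadings, and the sample positions from the scores; both are stored in the prcomp object. A quick way to inspect them:
pca$rotation # loadings: contribution of each variable to each PC
head(pca$x)  # scores: coordinates of each sample in PC space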
11.1.2 NMDS
Now let’s try Non-metric Multidimensional Scaling (NMDS). This is another type of ordination that works with distances (or dissimilarities) between samples (observations) instead of raw data. It’s often used in ecology to explore patterns in species composition across sites.
The focus of NMDS is on the observations: the distances between observations given their properties (the variables).
We’ll use the dune dataset from the vegan package, which is a community dataset for plants in dunes. Each row is a site, and each column is a species. The values are species abundances (cover classes).
library(vegan)
## Loading required package: permute
## Loading required package: lattice
#?vegan
data("dune")
str(dune)
## 'data.frame': 20 obs. of 30 variables:
## $ Achimill: num 1 3 0 0 2 2 2 0 0 4 ...
## $ Agrostol: num 0 0 4 8 0 0 0 4 3 0 ...
## $ Airaprae: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Alopgeni: num 0 2 7 2 0 0 0 5 3 0 ...
## $ Anthodor: num 0 0 0 0 4 3 2 0 0 4 ...
## $ Bellpere: num 0 3 2 2 2 0 0 0 0 2 ...
## $ Bromhord: num 0 4 0 3 2 0 2 0 0 4 ...
## $ Chenalbu: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Cirsarve: num 0 0 0 2 0 0 0 0 0 0 ...
## $ Comapalu: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Eleopalu: num 0 0 0 0 0 0 0 4 0 0 ...
## $ Elymrepe: num 4 4 4 4 4 0 0 0 6 0 ...
## $ Empenigr: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyporadi: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Juncarti: num 0 0 0 0 0 0 0 4 4 0 ...
## $ Juncbufo: num 0 0 0 0 0 0 2 0 4 0 ...
## $ Lolipere: num 7 5 6 5 2 6 6 4 2 6 ...
## $ Planlanc: num 0 0 0 0 5 5 5 0 0 3 ...
## $ Poaprat : num 4 4 5 4 2 3 4 4 4 4 ...
## $ Poatriv : num 2 7 6 5 6 4 5 4 5 4 ...
## $ Ranuflam: num 0 0 0 0 0 0 0 2 0 0 ...
## $ Rumeacet: num 0 0 0 0 5 6 3 0 2 0 ...
## $ Sagiproc: num 0 0 0 5 0 0 0 2 2 0 ...
## $ Salirepe: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Scorautu: num 0 5 2 2 3 3 3 3 2 3 ...
## $ Trifprat: num 0 0 0 0 2 5 2 0 0 0 ...
## $ Trifrepe: num 0 5 2 1 2 5 2 2 3 6 ...
## $ Vicilath: num 0 0 0 0 0 0 0 0 0 1 ...
## $ Bracruta: num 0 0 2 2 2 6 2 2 2 2 ...
## $ Callcusp: num 0 0 0 0 0 0 0 0 0 0 ...
#?dune
summary(dune)
## Achimill Agrostol Airaprae Alopgeni Anthodor
## Min. :0.0 Min. :0.0 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00
## Median :0.0 Median :1.5 Median :0.00 Median :0.00 Median :0.00
## Mean :0.8 Mean :2.4 Mean :0.25 Mean :1.80 Mean :1.05
## 3rd Qu.:2.0 3rd Qu.:4.0 3rd Qu.:0.00 3rd Qu.:3.25 3rd Qu.:2.25
## Max. :4.0 Max. :8.0 Max. :3.00 Max. :8.00 Max. :4.00
## Bellpere Bromhord Chenalbu Cirsarve Comapalu
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.0 Min. :0.0
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.0 1st Qu.:0.0
## Median :0.00 Median :0.00 Median :0.00 Median :0.0 Median :0.0
## Mean :0.65 Mean :0.75 Mean :0.05 Mean :0.1 Mean :0.2
## 3rd Qu.:2.00 3rd Qu.:0.50 3rd Qu.:0.00 3rd Qu.:0.0 3rd Qu.:0.0
## Max. :3.00 Max. :4.00 Max. :1.00 Max. :2.0 Max. :2.0
## Eleopalu Elymrepe Empenigr Hyporadi Juncarti
## Min. :0.00 Min. :0.0 Min. :0.0 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :0.0 Median :0.0 Median :0.00 Median :0.00
## Mean :1.25 Mean :1.3 Mean :0.1 Mean :0.45 Mean :0.90
## 3rd Qu.:1.00 3rd Qu.:4.0 3rd Qu.:0.0 3rd Qu.:0.00 3rd Qu.:0.75
## Max. :8.00 Max. :6.0 Max. :2.0 Max. :5.00 Max. :4.00
## Juncbufo Lolipere Planlanc Poaprat Poatriv
## Min. :0.00 Min. :0.0 Min. :0.0 Min. :0.0 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.00
## Median :0.00 Median :2.0 Median :0.0 Median :3.0 Median :4.00
## Mean :0.65 Mean :2.9 Mean :1.3 Mean :2.4 Mean :3.15
## 3rd Qu.:0.00 3rd Qu.:6.0 3rd Qu.:3.0 3rd Qu.:4.0 3rd Qu.:5.00
## Max. :4.00 Max. :7.0 Max. :5.0 Max. :5.0 Max. :9.00
## Ranuflam Rumeacet Sagiproc Salirepe Scorautu
## Min. :0.0 Min. :0.0 Min. :0 Min. :0.00 Min. :0.0
## 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0 1st Qu.:0.00 1st Qu.:2.0
## Median :0.0 Median :0.0 Median :0 Median :0.00 Median :2.0
## Mean :0.7 Mean :0.9 Mean :1 Mean :0.55 Mean :2.7
## 3rd Qu.:2.0 3rd Qu.:0.5 3rd Qu.:2 3rd Qu.:0.00 3rd Qu.:3.0
## Max. :4.0 Max. :6.0 Max. :5 Max. :5.00 Max. :6.0
## Trifprat Trifrepe Vicilath Bracruta Callcusp
## Min. :0.00 Min. :0.00 Min. :0.0 Min. :0.00 Min. :0.0
## 1st Qu.:0.00 1st Qu.:1.00 1st Qu.:0.0 1st Qu.:1.50 1st Qu.:0.0
## Median :0.00 Median :2.00 Median :0.0 Median :2.00 Median :0.0
## Mean :0.45 Mean :2.35 Mean :0.2 Mean :2.45 Mean :0.5
## 3rd Qu.:0.00 3rd Qu.:3.00 3rd Qu.:0.0 3rd Qu.:4.00 3rd Qu.:0.0
## Max. :5.00 Max. :6.00 Max. :2.0 Max. :6.00 Max. :4.0
First, let’s choose the distance/dissimilarity metric. Here we use Bray-Curtis dissimilarity (the default in the metaMDS function), which is commonly used for community data. It measures how different the species compositions of two sites are.
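If you want to look at the dissimilarity matrix itself, you can compute it explicitly with vegan's vegdist() (metaMDS does this internally); a small sketch:
dis = vegdist(dune, method = "bray") # Bray-Curtis dissimilarities between all pairs of sites
as.matrix(dis)[1:3, 1:3] # dissimilarities among the first three sites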
The metaMDS function runs NMDS and tries to represent the distances between sites in 2D space as accurately as possible.
NMDS = metaMDS(dune)
## Run 0 stress 0.1192678
## Run 1 stress 0.1886532
## Run 2 stress 0.1192678
## ... Procrustes: rmse 1.258026e-05 max resid 3.164268e-05
## ... Similar to previous best
## Run 3 stress 0.1183186
## ... New best solution
## ... Procrustes: rmse 0.02027014 max resid 0.06496123
## Run 4 stress 0.1183186
## ... Procrustes: rmse 4.230384e-06 max resid 1.475206e-05
## ... Similar to previous best
## Run 5 stress 0.1192678
## Run 6 stress 0.1809578
## Run 7 stress 0.1889638
## Run 8 stress 0.1808911
## Run 9 stress 0.3680059
## Run 10 stress 0.1183186
## ... Procrustes: rmse 1.159455e-05 max resid 3.442588e-05
## ... Similar to previous best
## Run 11 stress 0.1183186
## ... Procrustes: rmse 4.643337e-06 max resid 1.388314e-05
## ... Similar to previous best
## Run 12 stress 0.2361935
## Run 13 stress 0.1192678
## Run 14 stress 0.1192679
## Run 15 stress 0.1183186
## ... Procrustes: rmse 1.447333e-05 max resid 4.510486e-05
## ... Similar to previous best
## Run 16 stress 0.1192678
## Run 17 stress 0.1192678
## Run 18 stress 0.1192679
## Run 19 stress 0.1808911
## Run 20 stress 0.1808911
## *** Best solution repeated 4 times
NMDS # gives information on the NMDS: distance measure, stress (should be low)
##
## Call:
## metaMDS(comm = dune)
##
## global Multidimensional Scaling using monoMDS
##
## Data: dune
## Distance: bray
##
## Dimensions: 2
## Stress: 0.1183186
## Stress type 1, weak ties
## Best solution was repeated 4 times in 20 tries
## The best solution was from try 3 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'dune'
The STRESS (standardized residual sum of squares) is a measure of how much the site positions in the two-dimensional configuration deviate from the original distances (in the distance matrix). It can be used as a measure of how adequate the analysis is for the dataset. As a rule of thumb, we have:
- stress < 0.05: great representation;
- stress < 0.1: good ordination;
- stress < 0.2: reasonable ordination;
- stress > 0.2: be suspicious;
- stress >= 0.3: the ordination is arbitrary (do not interpret it).
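A quick visual check of the fit is vegan's stressplot(), which draws a Shepard diagram comparing the ordination distances to the original dissimilarities:
stressplot(NMDS) # Shepard diagram: ordination distances vs. original dissimilarities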
The NMDS plot shows how similar or different the sites are in terms of species composition. Sites that are close together have more similar species compositions.
ordiplot(NMDS, type = "t") # "t" = text labels for sites and species
Why should we be careful when interpreting patterns in ordination plots? Because an ordination may display patterns even when there are no real relationships between the variables:
set.seed(123)
random = data.frame(pollution = rnorm(30),
                    temperature = rnorm(30),
                    moisture = rnorm(30),
                    tourists = rnorm(30),
                    wind = rnorm(30),
                    dogs = rnorm(30))
head(random)
## pollution temperature moisture tourists wind dogs
## 1 -0.56047565 0.4264642 0.3796395 0.9935039 0.1176466 0.7877388
## 2 -0.23017749 -0.2950715 -0.5023235 0.5483970 -0.9474746 0.7690422
## 3 1.55870831 0.8951257 -0.3332074 0.2387317 -0.4905574 0.3322026
## 4 0.07050839 0.8781335 -1.0185754 -0.6279061 -0.2560922 -1.0083766
## 5 0.12928774 0.8215811 -1.0717912 1.3606524 1.8438620 -0.1194526
## 6 1.71506499 0.6886403 0.3035286 -0.6002596 -0.6519499 -0.2803953
pca = prcomp(random, scale = TRUE)
biplot(pca)
summary(pca) # similar variance on all axes
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.2054 1.1078 1.0649 0.9707 0.8530 0.71817
## Proportion of Variance 0.2422 0.2046 0.1890 0.1570 0.1213 0.08596
## Cumulative Proportion 0.2422 0.4467 0.6357 0.7928 0.9140 1.00000
11.2 Constrained Ordination
A constrained ordination is similar to a PCA or NMDS, but considers additional predictors for the ordination.
Here is an example, using the dune dataset (see ?dune or str(dune)) together with the dune.env environmental data:
data(dune.env)
RDA = rda(dune ~ as.numeric(Manure) + as.numeric(Moisture),
          data = dune.env)
plot(RDA)
summary(RDA)
##
## Call:
## rda(formula = dune ~ as.numeric(Manure) + as.numeric(Moisture), data = dune.env)
##
## Partitioning of variance:
## Inertia Proportion
## Total 84.12 1.0000
## Constrained 31.20 0.3709
## Unconstrained 52.92 0.6291
##
## Eigenvalues, and their contribution to the variance
##
## Importance of components:
## RDA1 RDA2 PC1 PC2 PC3 PC4 PC5
## Eigenvalue 19.0185 12.1864 10.7889 7.85393 6.67657 6.2084 5.12380
## Proportion Explained 0.2261 0.1449 0.1283 0.09336 0.07937 0.0738 0.06091
## Cumulative Proportion 0.2261 0.3709 0.4992 0.59255 0.67192 0.7457 0.80663
## PC6 PC7 PC8 PC9 PC10 PC11 PC12
## Eigenvalue 3.50320 3.10988 2.47760 1.83130 1.63276 1.03305 0.87265
## Proportion Explained 0.04164 0.03697 0.02945 0.02177 0.01941 0.01228 0.01037
## Cumulative Proportion 0.84827 0.88524 0.91469 0.93646 0.95587 0.96815 0.97852
## PC13 PC14 PC15 PC16 PC17
## Eigenvalue 0.642862 0.469072 0.301387 0.215853 0.177610
## Proportion Explained 0.007642 0.005576 0.003583 0.002566 0.002111
## Cumulative Proportion 0.986164 0.991740 0.995323 0.997889 1.000000
##
## Accumulated constrained eigenvalues
## Importance of components:
## RDA1 RDA2
## Eigenvalue 19.0185 12.1864
## Proportion Explained 0.6095 0.3905
## Cumulative Proportion 0.6095 1.0000
- the important part is at the top
- variance explained by the two predictors = proportion constrained = 37.09%
- how much each RDA axis explains = see "Proportion Explained" under importance of components
- the PCs are the unconstrained axes
- species scores = coordinates of the species in the plot
- site scores = coordinates of the sites in the plot
- biplot scores = coordinates of the environmental variable vectors
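If you want to test whether the constrained variance is larger than expected by chance, vegan provides permutation tests via anova() on the rda object; a brief sketch (output not shown here):
anova(RDA) # overall permutation test for the constrained model
anova(RDA, by = "terms") # sequential significance of each predictor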
barplot(summary(RDA)$cont$importance[2, ], las = 2,
        col = c(rep("red", 2),
                rep("black", length(summary(RDA)$cont$importance[2, ]) - 2)),
        ylab = "proportion of variance explained")
# displays % of variance explained by the constrained axes (red) and unconstrained PCs (black)
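A related one-number summary is the adjusted R2, which corrects the constrained proportion for the number of predictors; vegan offers it via RsquareAdj() (a sketch):
RsquareAdj(RDA) # raw and adjusted R2 of the constrained model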
11.3 Cluster analysis
Cluster analysis groups similar observations. It can be useful when we want to classify sites or samples based on their characteristics (variables).
There are two main types of clustering:
- Hierarchical clustering: builds a tree (dendrogram) of nested clusters.
- Non-hierarchical clustering (like K-means): partitions the data into a fixed number of clusters.
Feature | Hierarchical | K-means (Non-hierarchical)
---|---|---
Number of groups | Decided after tree is built | Must be chosen before running
Input/data type | Distance matrix | Raw (standardized) data
Result | Dendrogram (tree) | Cluster assignments
Flexibility | Can inspect different levels | Fixed number of groups
Sensitive to scaling? | Yes | Yes
Good for | Exploring nested group structure | Partitioning into defined clusters
Hierarchical clustering gives a full tree of nested groups, which is great for exploring data. K-means is simpler and faster, but you need to choose the number of groups first.
11.3.1 Hierarchical clustering
Let’s pretend we don’t know which species the iris individuals belong to. We will use a hierarchical clustering technique to see how the individuals are clustered based on their morphological variables. We plot the results and then color the individuals by their known species to see whether the clustering could provide reliable classifications.
For clustering you also need a dissimilarity matrix; here we use the default of dist, the Euclidean distance.
library(cluster) # clustering
# example for distance matrix
dist(iris[1:3, 1:4]) # creates a distance matrix (comparison of all possible sample pairs)
## 1 2
## 2 0.5385165
## 3 0.5099020 0.3000000
Clustering and plotting.
hc = hclust(dist(iris[, 1:4]))
plot(hc)
The default plot.hclust output is ugly and doesn’t allow coloring the tips (individuals). For that, we will use the ape package for phylogenetic analyses to get a pretty dendrogram.
library(ape)
new.hc <- as.phylo(hc) # converts an object of class "hclust" into class "phylo"
plot(new.hc, tip.color = as.numeric(iris$Species))
# change plotting type:
plot(as.phylo(hc), tip.color = as.numeric(iris$Species), type = "fan")
The clustering algorithm did a good job of classifying the 3 species. Only 4 individuals (71, 73, 8, 107) were wrongly attributed to a cluster formed mostly by another species. However, the only species with an isolated branch of its own was setosa (black); the other species had individuals spread across different branches.
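If we want discrete group assignments from the dendrogram, we can cut the tree at a chosen number of groups with cutree(); a small sketch:
groups = cutree(hc, k = 3) # cut the dendrogram into 3 groups
table(clusters = groups, real = iris$Species) # compare with the known species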
11.3.2 Non-hierarchical clustering
The most common method for non-hierarchical clustering is K-means clustering, which partitions the data into a fixed number of groups. K-means clustering requires you to choose the number of clusters in advance. It works by assigning observations to the nearest cluster center, then updating the centers.
Note: K-means uses raw data (not a distance matrix), so we need to prepare the data differently.
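Since the number of clusters must be chosen in advance, a common heuristic is to look for an "elbow" in the total within-cluster sum of squares as k increases; a sketch (the maximum of 10 clusters is an arbitrary choice):
# total within-cluster sum of squares for k = 1..10
wss = sapply(1:10, function(k) kmeans(iris[, 1:4], centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster SS")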
set.seed(123) # choice of first k centers is random and depends on (random) seed
cl = kmeans(iris[, 1:4], centers = 3) # centers = number of clusters
cl$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
table(cl$cluster) # number of individuals in each cluster
##
## 1 2 3
## 50 62 38
as.numeric(iris$Species)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3
Let’s compare the clustering assignment with the real species data:
table(clusters = cl$cluster, real = iris$Species)
## real
## clusters setosa versicolor virginica
## 1 50 0 0
## 2 0 48 14
## 3 0 2 36
We see again that the setosa species had all its individuals correctly clustered. The species versicolor and virginica presented some overlap.
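Since K-means is sensitive to scaling (see the table above), it is often worth standardizing the variables first; a minimal sketch (the clustering may differ slightly from the raw-data result):
set.seed(123)
cl.scaled = kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25) # standardized data, several random starts
table(clusters = cl.scaled$cluster, real = iris$Species)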
Let’s see the individuals that were wrongly attributed to another species in the PCA plot (now using the function ordiplot from the vegan package):
wrong.class <- data.frame(cluster = cl$cluster, species = iris$Species)
wrong.class$wrong <- "black" # creating a column to color the points
# color red for the wrong classifications
wrong.class$wrong[wrong.class$species == "virginica" & wrong.class$cluster != 3 |
                  wrong.class$species == "versicolor" & wrong.class$cluster != 2] <- "red"
pca <- prcomp(iris[, 1:4], scale = TRUE) # re-create the iris PCA (pca was overwritten by the random-data example)
ordiplot(pca, display = "sites") |>
  points("site", col = wrong.class$wrong, pch = 16)
abline(v = 0, lty = "dashed") # adding reference lines at 0,0 to make the plot easier to read
abline(h = 0, lty = "dashed")
This was just a brief overview of clustering methods. Here is a good online source for you to learn more.