[11]{.chapter-number}  [Multivariate Statistics]{.chapter-title}

11 Multivariate Statistics

11.1 Unconstrained

## PCA
# we use the same dataset of flower characteristics of three species of iris
pairs(iris, col = iris$Species)


head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
pca = prcomp(iris[, 1:4], scale = T) # always set scale = T
# when data is very skewed --> better transform e.g. log

summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
# standard deviation^2 is variance!!!
# cum prop of PC2 is the variance that is visualized in a biplot


# plot the result
# absolute variance of each component
plot(pca) # see row1 of the summary(pca): (sd)^2 = variance


# rel variance of each component
barplot(summary(pca)$importance[2, ], 
        ylab="proportion of variance explained") # displays % of variance explained by PCs


# Biplot
biplot(pca) # displays PC1 and PC2 AND rotation (vectors) of the different variables AND observations





## distance-based approach: NMDS

library(vegan)
## Carregando pacotes exigidos: permute
## Carregando pacotes exigidos: lattice
?vegan

# community dataset for plants in dunes (included in vegan package):

data("dune")
str(dune) # display structure of the dataset
## 'data.frame':    20 obs. of  30 variables:
##  $ Achimill: num  1 3 0 0 2 2 2 0 0 4 ...
##  $ Agrostol: num  0 0 4 8 0 0 0 4 3 0 ...
##  $ Airaprae: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Alopgeni: num  0 2 7 2 0 0 0 5 3 0 ...
##  $ Anthodor: num  0 0 0 0 4 3 2 0 0 4 ...
##  $ Bellpere: num  0 3 2 2 2 0 0 0 0 2 ...
##  $ Bromhord: num  0 4 0 3 2 0 2 0 0 4 ...
##  $ Chenalbu: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cirsarve: num  0 0 0 2 0 0 0 0 0 0 ...
##  $ Comapalu: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Eleopalu: num  0 0 0 0 0 0 0 4 0 0 ...
##  $ Elymrepe: num  4 4 4 4 4 0 0 0 6 0 ...
##  $ Empenigr: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyporadi: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Juncarti: num  0 0 0 0 0 0 0 4 4 0 ...
##  $ Juncbufo: num  0 0 0 0 0 0 2 0 4 0 ...
##  $ Lolipere: num  7 5 6 5 2 6 6 4 2 6 ...
##  $ Planlanc: num  0 0 0 0 5 5 5 0 0 3 ...
##  $ Poaprat : num  4 4 5 4 2 3 4 4 4 4 ...
##  $ Poatriv : num  2 7 6 5 6 4 5 4 5 4 ...
##  $ Ranuflam: num  0 0 0 0 0 0 0 2 0 0 ...
##  $ Rumeacet: num  0 0 0 0 5 6 3 0 2 0 ...
##  $ Sagiproc: num  0 0 0 5 0 0 0 2 2 0 ...
##  $ Salirepe: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Scorautu: num  0 5 2 2 3 3 3 3 2 3 ...
##  $ Trifprat: num  0 0 0 0 2 5 2 0 0 0 ...
##  $ Trifrepe: num  0 5 2 1 2 5 2 2 3 6 ...
##  $ Vicilath: num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Bracruta: num  0 0 2 2 2 6 2 2 2 2 ...
##  $ Callcusp: num  0 0 0 0 0 0 0 0 0 0 ...
?dune
summary(dune) # display summary of the dataset (summary statistics for each variable)
##     Achimill      Agrostol      Airaprae       Alopgeni       Anthodor   
##  Min.   :0.0   Min.   :0.0   Min.   :0.00   Min.   :0.00   Min.   :0.00  
##  1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.00  
##  Median :0.0   Median :1.5   Median :0.00   Median :0.00   Median :0.00  
##  Mean   :0.8   Mean   :2.4   Mean   :0.25   Mean   :1.80   Mean   :1.05  
##  3rd Qu.:2.0   3rd Qu.:4.0   3rd Qu.:0.00   3rd Qu.:3.25   3rd Qu.:2.25  
##  Max.   :4.0   Max.   :8.0   Max.   :3.00   Max.   :8.00   Max.   :4.00  
##     Bellpere       Bromhord       Chenalbu       Cirsarve      Comapalu  
##  Min.   :0.00   Min.   :0.00   Min.   :0.00   Min.   :0.0   Min.   :0.0  
##  1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.0   1st Qu.:0.0  
##  Median :0.00   Median :0.00   Median :0.00   Median :0.0   Median :0.0  
##  Mean   :0.65   Mean   :0.75   Mean   :0.05   Mean   :0.1   Mean   :0.2  
##  3rd Qu.:2.00   3rd Qu.:0.50   3rd Qu.:0.00   3rd Qu.:0.0   3rd Qu.:0.0  
##  Max.   :3.00   Max.   :4.00   Max.   :1.00   Max.   :2.0   Max.   :2.0  
##     Eleopalu       Elymrepe      Empenigr      Hyporadi       Juncarti   
##  Min.   :0.00   Min.   :0.0   Min.   :0.0   Min.   :0.00   Min.   :0.00  
##  1st Qu.:0.00   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.00   1st Qu.:0.00  
##  Median :0.00   Median :0.0   Median :0.0   Median :0.00   Median :0.00  
##  Mean   :1.25   Mean   :1.3   Mean   :0.1   Mean   :0.45   Mean   :0.90  
##  3rd Qu.:1.00   3rd Qu.:4.0   3rd Qu.:0.0   3rd Qu.:0.00   3rd Qu.:0.75  
##  Max.   :8.00   Max.   :6.0   Max.   :2.0   Max.   :5.00   Max.   :4.00  
##     Juncbufo       Lolipere      Planlanc      Poaprat       Poatriv    
##  Min.   :0.00   Min.   :0.0   Min.   :0.0   Min.   :0.0   Min.   :0.00  
##  1st Qu.:0.00   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.00  
##  Median :0.00   Median :2.0   Median :0.0   Median :3.0   Median :4.00  
##  Mean   :0.65   Mean   :2.9   Mean   :1.3   Mean   :2.4   Mean   :3.15  
##  3rd Qu.:0.00   3rd Qu.:6.0   3rd Qu.:3.0   3rd Qu.:4.0   3rd Qu.:5.00  
##  Max.   :4.00   Max.   :7.0   Max.   :5.0   Max.   :5.0   Max.   :9.00  
##     Ranuflam      Rumeacet      Sagiproc    Salirepe       Scorautu  
##  Min.   :0.0   Min.   :0.0   Min.   :0   Min.   :0.00   Min.   :0.0  
##  1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0   1st Qu.:0.00   1st Qu.:2.0  
##  Median :0.0   Median :0.0   Median :0   Median :0.00   Median :2.0  
##  Mean   :0.7   Mean   :0.9   Mean   :1   Mean   :0.55   Mean   :2.7  
##  3rd Qu.:2.0   3rd Qu.:0.5   3rd Qu.:2   3rd Qu.:0.00   3rd Qu.:3.0  
##  Max.   :4.0   Max.   :6.0   Max.   :5   Max.   :5.00   Max.   :6.0  
##     Trifprat       Trifrepe       Vicilath      Bracruta       Callcusp  
##  Min.   :0.00   Min.   :0.00   Min.   :0.0   Min.   :0.00   Min.   :0.0  
##  1st Qu.:0.00   1st Qu.:1.00   1st Qu.:0.0   1st Qu.:1.50   1st Qu.:0.0  
##  Median :0.00   Median :2.00   Median :0.0   Median :2.00   Median :0.0  
##  Mean   :0.45   Mean   :2.35   Mean   :0.2   Mean   :2.45   Mean   :0.5  
##  3rd Qu.:0.00   3rd Qu.:3.00   3rd Qu.:0.0   3rd Qu.:4.00   3rd Qu.:0.0  
##  Max.   :5.00   Max.   :6.00   Max.   :2.0   Max.   :6.00   Max.   :4.0

NMDS = metaMDS(dune)
## Run 0 stress 0.1192678 
## Run 1 stress 0.1886532 
## Run 2 stress 0.1192678 
## ... Procrustes: rmse 1.258026e-05  max resid 3.164268e-05 
## ... Similar to previous best
## Run 3 stress 0.1183186 
## ... New best solution
## ... Procrustes: rmse 0.02027014  max resid 0.06496123 
## Run 4 stress 0.1183186 
## ... Procrustes: rmse 4.230384e-06  max resid 1.475206e-05 
## ... Similar to previous best
## Run 5 stress 0.1192678 
## Run 6 stress 0.1809578 
## Run 7 stress 0.1889638 
## Run 8 stress 0.1808911 
## Run 9 stress 0.3680059 
## Run 10 stress 0.1183186 
## ... Procrustes: rmse 1.159455e-05  max resid 3.442588e-05 
## ... Similar to previous best
## Run 11 stress 0.1183186 
## ... Procrustes: rmse 4.643337e-06  max resid 1.388314e-05 
## ... Similar to previous best
## Run 12 stress 0.2361935 
## Run 13 stress 0.1192678 
## Run 14 stress 0.1192679 
## Run 15 stress 0.1183186 
## ... Procrustes: rmse 1.447333e-05  max resid 4.510486e-05 
## ... Similar to previous best
## Run 16 stress 0.1192678 
## Run 17 stress 0.1192678 
## Run 18 stress 0.1192679 
## Run 19 stress 0.1808911 
## Run 20 stress 0.1808911 
## *** Best solution repeated 4 times
# algorithm is iterative

NMDS # gives information on NMDS: distance measure, stress (should be low)
## 
## Call:
## metaMDS(comm = dune) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     dune 
## Distance: bray 
## 
## Dimensions: 2 
## Stress:     0.1183186 
## Stress type 1, weak ties
## Best solution was repeated 4 times in 20 tries
## The best solution was from try 3 (random start)
## Scaling: centring, PC rotation, halfchange scaling 
## Species: expanded scores based on 'dune'
# stress of >= 0.2 = be suspicious, stress >=0.3 indicates that ordination is arbitrary
# increase Dimensions if k is too high
ordiplot(NMDS, type = "t") #"t" = text



# if we have time:
# distance measure can be changed (default is Bray-Curtis): see 
?vegdist # some recommendations there
NMDS2 = metaMDS(dune, distance="euclidean")
## Run 0 stress 0.1174523 
## Run 1 stress 0.1174523 
## ... Procrustes: rmse 3.580249e-06  max resid 1.150625e-05 
## ... Similar to previous best
## Run 2 stress 0.1174523 
## ... Procrustes: rmse 2.054645e-06  max resid 3.818306e-06 
## ... Similar to previous best
## Run 3 stress 0.1174523 
## ... Procrustes: rmse 2.128171e-06  max resid 5.656827e-06 
## ... Similar to previous best
## Run 4 stress 0.1174523 
## ... Procrustes: rmse 2.022272e-06  max resid 6.26572e-06 
## ... Similar to previous best
## Run 5 stress 0.1174523 
## ... Procrustes: rmse 2.488733e-06  max resid 8.218612e-06 
## ... Similar to previous best
## Run 6 stress 0.1177339 
## ... Procrustes: rmse 0.01706434  max resid 0.05525565 
## Run 7 stress 0.1174523 
## ... New best solution
## ... Procrustes: rmse 1.353662e-06  max resid 2.894393e-06 
## ... Similar to previous best
## Run 8 stress 0.1174523 
## ... Procrustes: rmse 1.40202e-06  max resid 4.031046e-06 
## ... Similar to previous best
## Run 9 stress 0.1174523 
## ... Procrustes: rmse 2.805574e-06  max resid 7.787809e-06 
## ... Similar to previous best
## Run 10 stress 0.1174523 
## ... Procrustes: rmse 1.140966e-06  max resid 2.993735e-06 
## ... Similar to previous best
## Run 11 stress 0.1177339 
## ... Procrustes: rmse 0.0170643  max resid 0.05525551 
## Run 12 stress 0.1174523 
## ... Procrustes: rmse 1.964847e-06  max resid 5.765074e-06 
## ... Similar to previous best
## Run 13 stress 0.1174523 
## ... Procrustes: rmse 1.231259e-06  max resid 2.345061e-06 
## ... Similar to previous best
## Run 14 stress 0.1174523 
## ... Procrustes: rmse 1.055035e-06  max resid 2.452641e-06 
## ... Similar to previous best
## Run 15 stress 0.1177339 
## ... Procrustes: rmse 0.01706422  max resid 0.05525614 
## Run 16 stress 0.1177339 
## ... Procrustes: rmse 0.01706442  max resid 0.05525521 
## Run 17 stress 0.1177339 
## ... Procrustes: rmse 0.01706456  max resid 0.05525899 
## Run 18 stress 0.1177339 
## ... Procrustes: rmse 0.01706413  max resid 0.05525604 
## Run 19 stress 0.1177339 
## ... Procrustes: rmse 0.01706403  max resid 0.05525579 
## Run 20 stress 0.1174523 
## ... Procrustes: rmse 2.007087e-06  max resid 5.652376e-06 
## ... Similar to previous best
## *** Best solution repeated 8 times
NMDS2 
## 
## Call:
## metaMDS(comm = dune, distance = "euclidean") 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     dune 
## Distance: euclidean 
## 
## Dimensions: 2 
## Stress:     0.1174523 
## Stress type 1, weak ties
## Best solution was repeated 8 times in 20 tries
## The best solution was from try 7 (random start)
## Scaling: centring, PC rotation 
## Species: expanded scores based on 'dune'
ordiplot(NMDS2, type = "t")

Caution

# Why we should be careful when interpreting patterns in ordination plots
set.seed(123)

random = data.frame(pollution = rnorm(30),
                    temperature = rnorm(30),
                    moisture = rnorm(30),
                    tourists = rnorm(30),
                    wind = rnorm(30),
                    dogs = rnorm(30))
head(random)
##     pollution temperature   moisture   tourists       wind       dogs
## 1 -0.56047565   0.4264642  0.3796395  0.9935039  0.1176466  0.7877388
## 2 -0.23017749  -0.2950715 -0.5023235  0.5483970 -0.9474746  0.7690422
## 3  1.55870831   0.8951257 -0.3332074  0.2387317 -0.4905574  0.3322026
## 4  0.07050839   0.8781335 -1.0185754 -0.6279061 -0.2560922 -1.0083766
## 5  0.12928774   0.8215811 -1.0717912  1.3606524  1.8438620 -0.1194526
## 6  1.71506499   0.6886403  0.3035286 -0.6002596 -0.6519499 -0.2803953
pca = prcomp(random, scale = T)
biplot(pca)

summary(pca) # similar variance on all axes
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6
## Standard deviation     1.2054 1.1078 1.0649 0.9707 0.8530 0.71817
## Proportion of Variance 0.2422 0.2046 0.1890 0.1570 0.1213 0.08596
## Cumulative Proportion  0.2422 0.4467 0.6357 0.7928 0.9140 1.00000

11.2 Constrained

# 2 multivariate datasets (abundances + environment)


## RDA
str(dune) # species composition
## 'data.frame':    20 obs. of  30 variables:
##  $ Achimill: num  1 3 0 0 2 2 2 0 0 4 ...
##  $ Agrostol: num  0 0 4 8 0 0 0 4 3 0 ...
##  $ Airaprae: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Alopgeni: num  0 2 7 2 0 0 0 5 3 0 ...
##  $ Anthodor: num  0 0 0 0 4 3 2 0 0 4 ...
##  $ Bellpere: num  0 3 2 2 2 0 0 0 0 2 ...
##  $ Bromhord: num  0 4 0 3 2 0 2 0 0 4 ...
##  $ Chenalbu: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cirsarve: num  0 0 0 2 0 0 0 0 0 0 ...
##  $ Comapalu: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Eleopalu: num  0 0 0 0 0 0 0 4 0 0 ...
##  $ Elymrepe: num  4 4 4 4 4 0 0 0 6 0 ...
##  $ Empenigr: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyporadi: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Juncarti: num  0 0 0 0 0 0 0 4 4 0 ...
##  $ Juncbufo: num  0 0 0 0 0 0 2 0 4 0 ...
##  $ Lolipere: num  7 5 6 5 2 6 6 4 2 6 ...
##  $ Planlanc: num  0 0 0 0 5 5 5 0 0 3 ...
##  $ Poaprat : num  4 4 5 4 2 3 4 4 4 4 ...
##  $ Poatriv : num  2 7 6 5 6 4 5 4 5 4 ...
##  $ Ranuflam: num  0 0 0 0 0 0 0 2 0 0 ...
##  $ Rumeacet: num  0 0 0 0 5 6 3 0 2 0 ...
##  $ Sagiproc: num  0 0 0 5 0 0 0 2 2 0 ...
##  $ Salirepe: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Scorautu: num  0 5 2 2 3 3 3 3 2 3 ...
##  $ Trifprat: num  0 0 0 0 2 5 2 0 0 0 ...
##  $ Trifrepe: num  0 5 2 1 2 5 2 2 3 6 ...
##  $ Vicilath: num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Bracruta: num  0 0 2 2 2 6 2 2 2 2 ...
##  $ Callcusp: num  0 0 0 0 0 0 0 0 0 0 ...
data("dune.env")
str(dune.env) # environmental variables
## 'data.frame':    20 obs. of  5 variables:
##  $ A1        : num  2.8 3.5 4.3 4.2 6.3 4.3 2.8 4.2 3.7 3.3 ...
##  $ Moisture  : Ord.factor w/ 4 levels "1"<"2"<"4"<"5": 1 1 2 2 1 1 1 4 3 2 ...
##  $ Management: Factor w/ 4 levels "BF","HF","NM",..: 4 1 4 4 2 2 2 2 2 1 ...
##  $ Use       : Ord.factor w/ 3 levels "Hayfield"<"Haypastu"<..: 2 2 2 2 1 2 3 3 1 1 ...
##  $ Manure    : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 3 5 5 3 3 4 4 2 2 ...

RDA = rda(dune ~ as.numeric(Manure) + as.numeric(Moisture), 
         data = dune.env)
plot(RDA)


summary(RDA)
## 
## Call:
## rda(formula = dune ~ as.numeric(Manure) + as.numeric(Moisture),      data = dune.env) 
## 
## Partitioning of variance:
##               Inertia Proportion
## Total           84.12     1.0000
## Constrained     31.20     0.3709
## Unconstrained   52.92     0.6291
## 
## Eigenvalues, and their contribution to the variance 
## 
## Importance of components:
##                          RDA1    RDA2     PC1     PC2     PC3    PC4     PC5
## Eigenvalue            19.0185 12.1864 10.7889 7.85393 6.67657 6.2084 5.12380
## Proportion Explained   0.2261  0.1449  0.1283 0.09336 0.07937 0.0738 0.06091
## Cumulative Proportion  0.2261  0.3709  0.4992 0.59255 0.67192 0.7457 0.80663
##                           PC6     PC7     PC8     PC9    PC10    PC11    PC12
## Eigenvalue            3.50320 3.10988 2.47760 1.83130 1.63276 1.03305 0.87265
## Proportion Explained  0.04164 0.03697 0.02945 0.02177 0.01941 0.01228 0.01037
## Cumulative Proportion 0.84827 0.88524 0.91469 0.93646 0.95587 0.96815 0.97852
##                           PC13     PC14     PC15     PC16     PC17
## Eigenvalue            0.642862 0.469072 0.301387 0.215853 0.177610
## Proportion Explained  0.007642 0.005576 0.003583 0.002566 0.002111
## Cumulative Proportion 0.986164 0.991740 0.995323 0.997889 1.000000
## 
## Accumulated constrained eigenvalues
## Importance of components:
##                          RDA1    RDA2
## Eigenvalue            19.0185 12.1864
## Proportion Explained   0.6095  0.3905
## Cumulative Proportion  0.6095  1.0000
# important part at the top
# variance explained by the two variables = prop constrained = 37.09%
# how much is explained by each RDA = see importance of components prop explained
# PCs are the unconstrained axes

# species scores = coordinates of species in the plot
# site scores = coordinates of sites in the plot
# biplot scores = coordinates of environmental variable vectors


barplot(summary(RDA)$cont$importance[2, ],  las = 2,
        col = c(rep ('red', 2), 
                rep ('black', length(summary(RDA)$cont$importance[2, ])-2)),
        ylab="proportion of variance explained") # displays % of variance explained by PCs

11.3 Clustering

# hierarchical clustering: 

library(cluster) # clustering
library(ape) # phylogenetic analyses, here to get pretty dendrogram

# example for distance matrix
dist(iris[1:3, 1:4]) # creates a distance matrix (comparison of all possible sample pairs), 
##           1         2
## 2 0.5385165          
## 3 0.5099020 0.3000000
# default method ="euclidean", but can be changed

# get Hierarchical Clustering
hc = hclust(dist(iris[, 1:4]))
plot(hc)


# for colors use package ape
plot(as.phylo(hc), tip.color = as.numeric(iris$Species)) # as.phylo converts object type "hcclust" into object type "phylo"

# change plotting type:
plot(as.phylo(hc), tip.color = as.numeric(iris$Species), type = "fan")




# lets try another clustering algorithm
data(animals)
str(animals)
## 'data.frame':    20 obs. of  6 variables:
##  $ war: int  1 1 2 1 2 2 2 2 2 1 ...
##  $ fly: int  1 2 1 1 1 1 2 2 1 2 ...
##  $ ver: int  1 1 2 1 2 2 2 2 2 1 ...
##  $ end: int  1 1 1 1 2 1 1 2 2 1 ...
##  $ gro: int  2 2 1 1 2 2 2 1 2 1 ...
##  $ hai: int  1 2 2 2 2 2 1 1 1 1 ...

# Agglomerative Nesting 
aa <- agnes(animals)
plot(aa, which.plots = 2) #which.plots: plots only plot 2

# first is banner plot...



### non-hierarchical
# kmeans 

set.seed(123) # choice of first k centers is random and depends on (random) seed
cl = kmeans(iris[, 1:4], centers = 3) # centers = number of clusters to be generated

cl$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
as.numeric(iris$Species)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3


# without the same seed you may have to exchange cluster 2 and 3 to be able to compare assignment to species with real species in the plot
temp = cl$cluster
# temp[cl$cluster==2] = 3
# temp[cl$cluster==3] = 2

temp
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
as.numeric(iris$Species)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3

# original species
pairs(iris[, 1:4], col = as.numeric(iris$Species))


# species found by cluster
pairs(iris[, 1:4], col = temp)


# display samples that have been assigned to the wrong species
same = as.numeric(iris$Species) == temp
pairs(iris[, 1:4], col = as.numeric(same) + 1)

palette()
## [1] "black"   "#DF536B" "#61D04F" "#2297E6" "#28E2E5" "#CD0BBC" "#F5C710"
## [8] "gray62"

11.4 Exercises

In this exercise you will practice analyzing multivariate datasets using:

ordination methods which explore higher-order correlations among variables and similarities between observations
clustering methods which aggregate individual observations into clusters.

11.4.0.1 Useful functions for multivariate data analyses

pairs() - create pairplot (plot all variables against each other)
prcomp() - calculate PCA
plot(pca_object) - plot variance explained by PC axes
biplot() - plot PCA
metaMDS() - calculate NMDS
ordiplot(nmds_object) - plot NMDS
rda() - calculate RDA
plot(rda_object) - plot RDA
dist() - create distance matrix
hclust() - perform hierarchical clustering; apply to distance matrix
plot(as.phylo(hc_object)) - plot dendrogram from hclust output
agnes() - another clustering algorithm
kmeans() - perform kmeans non-hierarchical clustering; access cluster assignments using kmeans_object$cluster

11.4.0.2 Background on the dataset

Marine coastal ecosystems harbor the most productive and diverse communities on Earth. However, they are extremely vulnerable to climate change and human activities (such as landclaim, pollution, recreation purposes, …) As a consequence, the performance of this ecosystem has decreased considerably. To better understand these ecosystems, the dutch governmental institute (RIKZ: Rijksinstituut voor Kust en Zee) started a research project on sandy beaches to investigate the relationship between environmental factors and how they might affect benthic fauna. (from the LMU and Zuur, Ieno, Smith (2007), Chapter 12.8-12.10 and 27)

Question
We now want to use ordination methods to explore

the variability in environmental conditions
the variability in species richness

We also want to use clustering methods to define 3 environmental types and hierarchically cluster the samples with respect to their species richness.

Dataset

Read in the dataset as follows:

dat = read.table(file = "http://evol.bio.lmu.de/_statgen/Multivariate/11SS/RIKZGroups.txt", header = T)
head(dat)

You already know the functions str() and summary() to get an overview of a dataset and to see which variables the dataset contains.

columns 2:5 is species richness within four larger faunal groups
columns 7:16 are different environmental variables

Not important for us are:

column 6 (week), a time stamp
column 17 (sorting1), variable from the observational design

Let’s get into the analysis!

11.4.1 Unconstrained ordination (PCA)

Carry out the following analyses:

Make two PCAs, one for the environmental and one for the species richness data (see columns above).
Name the results pca_environment and pca_species.
Create a biplot for each PCA.
Create a barplot for the proportion of variance explained by each component.

For example, the result for species richness should look like this:

You need the following functions and the package vegan:

prcomp()
biplot()
barplot()
summary()

library(vegan)

Hints

Don’t forget to scale the variables in the PCA.
In order to get the proportion of the explained variance, have a look at the summary of your analysis. str() shows you what the summary contains. You want to specifically look at the importance, here the second row contains the proportion of variance of all components. So what we want to plot is: summary(pca_species)$importance[2,].
Don’t forget to give your plots a title.

Now, use your results to answer the questions on elearning-extern (Q 1-3) (“07_Test for Exercise in R”).

In which multivariate dataset do the first and second components explain more variance?
What are the two samples that have the highest score on the respective third PCA axes?
From looking at the biplot: Which environmental variables would you choose, if you have only the resource to measure three variables?
1. NAP, angle 1 and chalk
2. salinity, penetrability and grain size
3. grain size, chalk and exposure
Describe the following correlations:
1. salinity and humus
2. angle1 and exposure
3. Polychaeta and Mollusca

Solution

## 1.) Conduct the principle component analyses (PCAs):

# a) PCA of environmental data
pca_environment = prcomp(dat[,7:16], scale = T)
summary(pca_environment)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5    PC6     PC7
## Standard deviation     1.8476 1.3300 1.0919 1.0297 0.99521 0.7746 0.72784
## Proportion of Variance 0.3414 0.1769 0.1192 0.1060 0.09904 0.0600 0.05298
## Cumulative Proportion  0.3414 0.5182 0.6375 0.7435 0.84254 0.9025 0.95551
##                            PC8     PC9    PC10
## Standard deviation     0.52299 0.34727 0.22541
## Proportion of Variance 0.02735 0.01206 0.00508
## Cumulative Proportion  0.98286 0.99492 1.00000

op <- par(mfrow = c(1,2))
biplot(pca_environment, main = "PCA")            # plot the results of the PCA as a rotation matrix
barplot(summary(pca_environment)$importance[2,], # get the importance measure
     main = "Components of environment",
     ylab = "Proportion of variance explained")


# b) PCA of species richness data
pca_species = prcomp(dat[,2:5], scale = T)
summary(pca_species)
## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.1177 1.0251 1.0095 0.8251
## Proportion of Variance 0.3123 0.2627 0.2548 0.1702
## Cumulative Proportion  0.3123 0.5750 0.8298 1.0000

biplot(pca_species, main = "PCA")            # plot the results of the PCA as a rotation matrix
barplot(summary(pca_species)$importance[2,], # get the importance measure
     main = "Components of species",
     ylab = "Proportion of variance explained")

par(op)

# From the *summary()* output we can see that the first and second components explain more variance in the species PCA.


##  2.) What are the two samples that have the highest score on the third PCA axis?

# -> order the samples by their PC3 coordinate:
order(pca_environment$x[,'PC3'], decreasing = T) # -> 24 is highest
##  [1] 24 21 23 28 25 16 22 40 36 19  8 12  6 30  9  7 39 32 44  1 15  4 31 13 14
## [26]  5 10 35 37 45 27 20 18 34 11 17 26 29 38  3 41 33  2 42 43
order(pca_species$x[,'PC3'], decreasing = T) # -> 7 is highest
##  [1]  7  8  6  1 37  5 13  3 35  4 14 38 11 42 29 39 28 43  2 45 26 34 17 27 36
## [26] 18 20 15 23 30 41 21 19 31 12 16 33 25 32 24 40 44 22 10  9


## 3.) Which environmental variables would you choose, if you have only the resource to measure three variables? 

# From looking at the biplot, we choose 3 variables that describe a lot of variation (i.e. have a large length in the biplot) and have little collinearity. 
# -> For example, an appropriate choice would be salinity, penetrability and grain size.


## 4.) We can get information on the correlations of variables by looking at their representation in the biplot: 
  #  a) salinity and humus:  same direction  ->  positively correlated
  #  b) angle1 and exposure:  opposite directions  ->  negatively correlated
  #  c) Polychaeta and Mollusca:  almost perpendicular  ->  uncorrelated

11.4.2 Clustering

11.4.2.1 K-means

We want to use clustering methods to define 3 environmental types. Use the function kmeans() with centers = 3 (number of clusters to be generated). Remember to set a seed; the choice of the first k centers is random.

set.seed(467456)
cl = #...

Compare the three clusters with the result of the PCA using the following code that uses another plotting framework called ggplot2:

library(ggfortify)
autoplot(pca_environment, colour = cl$cluster, size = 2, loadings = TRUE, loadings.colour = 'black',
         loadings.label = TRUE, loadings.label.size = 3, loadings.label.colour = "black")

The colors of the points represent the three clusters. Answer the following question on elearning-extern (Q 4-5):

How is it possible that four observations in the middle (in red - if you have used the same seed) belong to a different cluster than the observations around them (in black)?
Which environmental variables are on average high within the black cluster (cluster 1)?

Solution


# Create clusters using all environmental variables
set.seed(467456)
cl <- kmeans(dat[,7:16], centers = 3) # 3 clusters to be generated

# Plot
library(ggfortify)
autoplot(pca_environment, colour = cl$cluster, size = 2, loadings = TRUE, loadings.colour = 'black',
         loadings.label = TRUE, loadings.label.size = 3, loadings.label.colour = "black")


# To understand why the four observations in the middle belong to a different cluster than the observations around them, we must take into account, that the biplot is only a 2-dimensional representation of a more-than-2 dimensional space. Therefore, the four points in the middle will be dissimilar to the points around them with respect to a variable that is not well represented by the first two PCA axes.

# Environmental variables that are on average high within cluster 1:
# -> looking at the plot we find that exposure and grain size are high on average within cluster 1

11.4.2.2 Hierarchical clustering

Now we want to hierarchically cluster the samples with respect to their species richness, as shown in the following plot:

Create the same plot using the functions:

hclust()
plot()
as.phylo()

Load the package ape. Have a look at the help for hclust() to read what the function does and look at the examples for further help on how to use the function. Then have a look at what the function as.phylo() does. Now, color the labels using the variable week. You can do this using the argument “tip.color =”.

Choose the correct statement(s) about the species richness and its sampling on elearning-extern(Q6). To be able to read the plot more easily, you can click Zoom in the top pane of the Plots window in RStudio.

Solution

library(ape)

hc = hclust(dist(dat[, 2:5]))
plot(as.phylo(hc), tip.color = dat$week)