4 Summary statistics

4.1 Numeric values

For continuous measurements such as size, weight, time, length, etc.

Typical summary statistics:

Mean, range, max, min, variance, standard deviation, etc.
Quantiles (the share of the data below/above a given value): e.g. the median is the 50% quantile, with 50% of the data below it and 50% above it.
Moments:
- First moment: mean, measure of location
- Second central moment: variance (its square root is the standard deviation), measure of spread
- Third standardized moment: skewness, measure of asymmetry
- Fourth standardized moment: kurtosis, measure of peakedness and tail weight
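
The moments listed above can also be computed by hand in base R. The following is a minimal sketch: the helper name `central_moment` is made up for this illustration, and skewness and kurtosis use the simple population formulas (dividing by n), so dedicated packages may report slightly different values.

```r
# Moment-based summaries for the temperature data (base R only)
x <- airquality$Temp   # this column has no missing values

# k-th central moment; the helper name is hypothetical
central_moment <- function(x, k) mean((x - mean(x))^k)

m1   <- mean(x)                          # first moment: location
m2   <- central_moment(x, 2)             # second central moment: variance (divides by n)
skew <- central_moment(x, 3) / m2^(3/2)  # third standardized moment
kurt <- central_moment(x, 4) / m2^2      # fourth standardized moment (3 for a normal distribution)

# var() divides by n - 1, so rescale it before comparing with m2
n <- length(x)
all.equal(m2, var(x) * (n - 1) / n)
```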

set.seed(10)
A = scale(exp(rnorm(1000,sd = 0.5)))
hist(A, main = "")
abline(v = quantile(A, probs = c(0.05, 0.5, 0.95)), col = "#AA0366", lwd = 1.4)
text(quantile(A, probs = c(0.05, 0.5, 0.95)), 
     y = 500, labels = c("0.05Q", "0.5Q", "0.95Q"), pos = 3, xpd = NA)
abline(v = mean(A), col = "#AA0366", lwd = 1.4)
text(x = mean(A), y = 530, labels="Mean", pos = 3, xpd = NA)

How to calculate them in R:

mean(airquality$Temp)
## [1] 77.88235
median(airquality$Temp)
## [1] 79
min(airquality$Temp)
## [1] 56
max(airquality$Temp)
## [1] 97
range(airquality$Temp)
## [1] 56 97
var(airquality$Temp)
## [1] 89.59133
sd(airquality$Temp)
## [1] 9.46527
#the same:
sqrt(var(airquality$Temp))
## [1] 9.46527

Quantiles and mean:

quantile(airquality$Temp)
##   0%  25%  50%  75% 100% 
##   56   72   79   85   97
mean(airquality$Ozone, na.rm = TRUE) #na.rm: removes NAs
## [1] 42.12931
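
By default `quantile()` returns the quartiles. Other quantiles can be requested through the `probs` argument, and `na.rm = TRUE` works as it does for `mean()`. A short sketch (the object names `q_temp` and `q_ozone` are chosen here only for illustration):

```r
# Custom quantiles: 5%, 50% and 95% of the temperature data
q_temp <- quantile(airquality$Temp, probs = c(0.05, 0.5, 0.95))
q_temp

# Ozone contains NAs; without na.rm = TRUE quantile() stops with an error
q_ozone <- quantile(airquality$Ozone, probs = c(0.25, 0.75), na.rm = TRUE)
q_ozone

# Interquartile range: distance between the 75% and the 25% quantile
IQR(airquality$Ozone, na.rm = TRUE)
```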

How to look for NA values:

airquality[is.na(airquality$Ozone),] 
##     Ozone Solar.R Wind Temp Month Day
## 5      NA      NA 14.3   56     5   5
## 10     NA     194  8.6   69     5  10
## 25     NA      66 16.6   57     5  25
## 26     NA     266 14.9   58     5  26
## 27     NA      NA  8.0   57     5  27
## 32     NA     286  8.6   78     6   1
## 33     NA     287  9.7   74     6   2
## 34     NA     242 16.1   67     6   3
## 35     NA     186  9.2   84     6   4
## 36     NA     220  8.6   85     6   5
## 37     NA     264 14.3   79     6   6
## 39     NA     273  6.9   87     6   8
## 42     NA     259 10.9   93     6  11
## 43     NA     250  9.2   92     6  12
## 45     NA     332 13.8   80     6  14
## 46     NA     322 11.5   79     6  15
## 52     NA     150  6.3   77     6  21
## 53     NA      59  1.7   76     6  22
## 54     NA      91  4.6   76     6  23
## 55     NA     250  6.3   76     6  24
## 56     NA     135  8.0   75     6  25
## 57     NA     127  8.0   78     6  26
## 58     NA      47 10.3   73     6  27
## 59     NA      98 11.5   80     6  28
## 60     NA      31 14.9   77     6  29
## 61     NA     138  8.0   83     6  30
## 65     NA     101 10.9   84     7   4
## 72     NA     139  8.6   82     7  11
## 75     NA     291 14.9   91     7  14
## 83     NA     258  9.7   81     7  22
## 84     NA     295 11.5   82     7  23
## 102    NA     222  8.6   92     8  10
## 103    NA     137 11.5   86     8  11
## 107    NA      64 11.5   79     8  15
## 115    NA     255 12.6   75     8  23
## 119    NA     153  5.7   88     8  27
## 150    NA     145 13.2   77     9  27
# or
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##
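
Instead of printing every affected row, it is often enough to count the missing values. A minimal sketch using base R only (the object names are illustrative):

```r
# Number of missing Ozone values
sum(is.na(airquality$Ozone))   # 37, as reported by summary() above

# Missing values per column
na_per_column <- colSums(is.na(airquality))
na_per_column

# Keep only the rows without any missing value
complete_rows <- airquality[complete.cases(airquality), ]
nrow(complete_rows)
```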

We can visualize variables with two plots, a histogram and boxplot:

hist(airquality$Temp)

boxplot(airquality$Temp)

In a box plot, the box ranges from the 0.25 to the 0.75 quantile (the interquartile range, IQR), with the median as the thick line in the box. The whiskers (the lines extending from the box) reach to the most extreme data points that lie within 1.5*IQR of the box edges; if no points lie beyond that distance, the whiskers end at the min/max values. Points outside this range are drawn individually and are considered outliers.
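
The whisker rule can be reproduced by hand. The sketch below uses the Ozone values because they contain a few large observations. Note that `boxplot()` internally works with hinges rather than the default `quantile()` definition, so the hand-computed numbers may differ slightly from what the plot shows; the names `lower_fence` and `upper_fence` are chosen only for this illustration.

```r
x <- airquality$Ozone[!is.na(airquality$Ozone)]

q   <- quantile(x, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Fences: 1.5 * IQR beyond the box edges
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[2]) + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])
outliers <- x[x < lower_fence | x > upper_fence]

whiskers
outliers

# Values used by boxplot() itself (based on hinges)
boxplot.stats(x)$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
boxplot.stats(x)$out     # points drawn as outliers
```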

4.2 Categorical data

How often do different events occur?

table(mtcars$gear) # one-way table
## 
##  3  4  5 
## 15 12  5

table(mtcars$cyl, mtcars$gear) # two-way table
##    
##      3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2
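
Counts are often easier to compare as proportions. A small sketch using `prop.table()` and `addmargins()` from base R (the object names are illustrative):

```r
gear_counts <- table(mtcars$gear)

# Relative frequencies of the one-way table
gear_share <- prop.table(gear_counts)
gear_share

# Two-way table with row proportions (each row sums to 1)
cyl_gear  <- table(cyl = mtcars$cyl, gear = mtcars$gear)
row_share <- prop.table(cyl_gear, margin = 1)
round(row_share, 2)

# Add row and column totals to the counts
addmargins(cyl_gear)
```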

4.3 Numeric with categorical

We can also calculate summary statistics per level of a categorical variable for a numerical variable of interest, in different ways. Let’s calculate the mean temperature per month of the airquality dataset.

Using aggregate:

aggregate(Temp ~ Month, data = airquality, FUN = mean) #or
##   Month     Temp
## 1     5 65.54839
## 2     6 79.10000
## 3     7 83.90323
## 4     8 83.96774
## 5     9 76.90000
aggregate(airquality$Temp, by = list(airquality$Month), FUN = mean) #the same
##   Group.1        x
## 1       5 65.54839
## 2       6 79.10000
## 3       7 83.90323
## 4       8 83.96774
## 5       9 76.90000

NOTE the formula syntax: Temp ~ Month means “calculate summary statistics for Temp per level of Month”. You specify the dataset in the data argument.

Or using tapply:

tapply(airquality$Temp, airquality$Month, mean)
##        5        6        7        8        9 
## 65.54839 79.10000 83.90323 83.96774 76.90000

NOTE the different output formats of both functions aggregate and tapply: the first one gives a data.frame, the second one gives a vector with names.
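
Both functions can also cope with missing values and with more than one statistic. The following sketch uses the Ozone column, which contains NAs; the object names `ozone_formula` and `ozone_tapply` are chosen only for illustration.

```r
# Formula interface: rows with NA in Ozone are dropped by default (na.action = na.omit)
ozone_formula <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)

# tapply passes additional arguments on to the function, here na.rm = TRUE
ozone_tapply <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)

ozone_formula
ozone_tapply

# Several statistics at once: FUN may return a named vector
aggregate(Temp ~ Month, data = airquality,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```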

We can do this with base R, but nowadays you will see a lot of code using the dplyr package:

library(dplyr) # uses tidyverse syntax
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

air_grouped <- airquality %>% # define dataset to be summarised
  group_by(Month) %>% # define grouping factor (note that you don't need to tell which dataset to look in here (no $))
  summarise(meanwind = mean(Wind), 
            meansolar = mean(Solar.R), 
            maxOzone = max(Ozone), 
            varTemp = var(Temp)) # define which summary statistics you want to be calculated
air_grouped
## # A tibble: 5 × 5
##   Month meanwind meansolar maxOzone varTemp
##   <int>    <dbl>     <dbl>    <int>   <dbl>
## 1     5    11.6        NA        NA    47.0
## 2     6    10.3       190.       NA    43.5
## 3     7     8.94      216.       NA    18.6
## 4     8     8.79       NA        NA    43.4
## 5     9    10.2       167.       NA    69.8
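
The NA entries in this table arise because `mean()` and `max()` return NA as soon as a group contains a missing value. One possible variant, assuming dplyr is installed, passes `na.rm = TRUE` and also counts how many values were ignored. The names `air_grouped_narm` and `nOzoneNA` are chosen here only for illustration.

```r
library(dplyr)

air_grouped_narm <- airquality %>%
  group_by(Month) %>%
  summarise(meanwind  = mean(Wind),
            meansolar = mean(Solar.R, na.rm = TRUE),  # ignore missing Solar.R values
            maxOzone  = max(Ozone, na.rm = TRUE),     # ignore missing Ozone values
            varTemp   = var(Temp),
            nOzoneNA  = sum(is.na(Ozone)))            # missing Ozone values per month
air_grouped_narm
```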

dplyr and tidyverse

The dplyr package is part of a big software environment called tidyverse. Unique features of the tidyverse are the pipe %>% operator and tibble objects.

The %>% operator:

Applying several functions in sequence to an object often results in a confusing number of nested round brackets:
```
max(mean(range(c(5, 3, 2, 1))))
## [1] 3
```
The pipe operator simplifies that by saying “apply the next function on the result of the current function”:
```
c(5, 3, 2, 1) %>% range %>% mean %>% max
## [1] 3
```
This is easier to write, read, and understand!

tibble objects are just an extension of data.frames. In this course we will mostly use data.frames, so it is better to transform the tibbles back to data.frames:
```
class(air_grouped)
## [1] "tbl_df"     "tbl"        "data.frame"
air_grouped = as.data.frame(air_grouped)
class(air_grouped)
## [1] "data.frame"
```
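
As a side note, base R has its own pipe operator `|>` since version 4.1.0, which works without loading any package. A minimal sketch, assuming a sufficiently recent R version; unlike with `%>%`, the function calls need their parentheses:

```r
# Native pipe (R >= 4.1.0), no package required
result <- c(5, 3, 2, 1) |> range() |> mean() |> max()
result
```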

4.4 Correlations

If two variables A and B are related, we say that they are correlated (correlation != causality!). We can quantify the strength of this relationship with correlation factors (also called correlation coefficients). A correlation factor is a normalized covariance: each variable has its own variance (how much it varies on its own), the two variables share a covariance (how much they vary together), and dividing the covariance by the product of the two standard deviations gives a value in the range \([-1,1]\).

The Pearson’s correlation coefficient/factor can be calculated by:

\[ r = cor(x_1, x_2) = \frac{ \frac{1}{N} \sum_{i=1}^N (x_{1i} - \overline{x_{1}}) (x_{2i} - \overline{x_{2}}) }{ \sqrt{\frac{1}{N} \sum_{i=1}^N (x_{1i} - \overline{x_{1}})^2} \sqrt{\frac{1}{N} \sum_{i=1}^N (x_{2i} - \overline{x_{2}})^2} } \]
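
Pearson's r is the covariance of the two variables divided by the product of their standard deviations. As a quick check, this can be computed by hand and compared with `cor()`. The sketch below uses the Wind and Temp columns because they contain no NAs; the object names are illustrative.

```r
x1 <- airquality$Wind
x2 <- airquality$Temp

# Numerator: covariance (dividing by N)
cov_N <- mean((x1 - mean(x1)) * (x2 - mean(x2)))

# Denominator: product of the two standard deviations (dividing by N)
sd1_N <- sqrt(mean((x1 - mean(x1))^2))
sd2_N <- sqrt(mean((x2 - mean(x2))^2))

r_manual <- cov_N / (sd1_N * sd2_N)
r_manual

# The 1/N factors cancel, so the result matches cor(), which divides by N - 1
all.equal(r_manual, cor(x1, x2))
```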

The Pearson correlation works well for linear relationships, but it captures non-linear relationships poorly and is sensitive to outliers. Transforming one of the variables also changes the correlation:

A = runif(1000, -6, 6) # creating A variable
B = plogis(A*3.5)      # creating B variable 
par(mfrow = c(1, 2)) # for plotting the plots side by side
plot(B, A)
text(x = 0.6, y = -4, paste0("Pearson: ", 
                             round(cor(A, B, method = "pearson"), 3))) 
plot(log(B), A)
text(x = -12, y = 0, paste0("Pearson: ", 
                            round(cor(A, log(B), method = "pearson"), 3)))

The correlation factor for the untransformed data is about 0.91 (left graph), even though B is a perfectly monotonic function of A.

Rank correlation factors

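Idea: replace the values of x and y by their ranks 1…n (their positions in the sorted data) and measure the association between the ranks. There are two different rank correlation factors:

Spearman: the Pearson correlation calculated on the ranks.
Kendall: compares the number of data pairs that are ordered in the same direction in both variables with the number of pairs ordered in opposite directions.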

cor(A, B)
## [1] 0.9096673
cor(rank(A), rank(B)) # Spearman by hand: Pearson on the ranks (use rank(), not order())
## [1] 1
cor(A, B, method = "spearman") # Spearman
## [1] 1
cor(A, B, method = "kendall")
## [1] 1
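
Spearman's coefficient is defined on ranks, which `rank()` returns. `order()` returns the indices that would sort the data, which is a different thing and matches the ranks only in special cases. A small sketch of the difference, followed by a hand-computed Spearman correlation (the vector `v` and the other object names are illustrative):

```r
v <- c(30, 10, 20)
rank(v)    # position of each value in the sorted data: 3 1 2
order(v)   # indices that would sort the data:          2 3 1

# Spearman by hand: Pearson correlation of the ranks
x1 <- airquality$Wind
x2 <- airquality$Temp
rho_manual <- cor(rank(x1), rank(x2))
all.equal(rho_manual, cor(x1, x2, method = "spearman"))
```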

Example:

cor(airquality$Solar.R, airquality$Ozone, use = "complete.obs") # pearson = default
## [1] 0.3483417
cor(airquality$Solar.R, airquality$Ozone, use = "complete.obs", 
    method = "spearman")
## [1] 0.3481865

Pairs plot for many variables: plots every variable against every other variable:

plot(airquality)

pairs(airquality) #the same

You can change what is plotted in upper and lower panel:

# adapted from the examples in ?pairs
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr = usr)) # restore the plot coordinates on exit
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y, use = "complete.obs")) ### complete.obs must be added manually
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(airquality, lower.panel = panel.smooth, upper.panel = panel.cor)
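
The correlations can also be obtained as a matrix, which is handy when the numbers are needed for further processing. A minimal sketch: `use = "pairwise.complete.obs"` computes each coefficient from all rows where both variables are present, and the object names are illustrative.

```r
# Pearson correlation for every pair of columns
cor_mat <- cor(airquality, use = "pairwise.complete.obs")
round(cor_mat, 2)

# Rank-based (Spearman) version
cor_spear <- cor(airquality, use = "pairwise.complete.obs", method = "spearman")
round(cor_spear, 2)
```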

Warning

Descriptive statistics characterize properties of the data without providing any test or statement of probability.