1  Getting Started

1.1 Organization of this book

The aim of this book is to introduce you to the main techniques and concepts that are used when performing regression analyses in applied settings. This book is organized in three parts:

  1. Modelling the mean: The first part of this book is focusing on how to model the mean response as a function of the one or several predictor variables. For this chapter, we will mainly stick to the assumptions of the linear regression, which is that residuals are i.i.d. normally distributed. Topics we will cover here are linear regression, ANOVA and mixed models
  2. Model Choice The second part covers model choice, including how to handle missing data, model selection, causal inference and non-parametric methods
  3. Modelling the Distribution The third part of the book is about modelling different residual distributions. We will relax the iid normal assumptions of the LM, and move to GLMs and modelling variance and correlation of residuals.

1.2 Your R System

In this course, we work with the combination of R + RStudio.

  • R is the calculation engine that performs the computations.
  • RStudio is the editor that helps you sending inputs to R and collect outputs.

Make sure you have a recent version of R + RStudio installed on your computer. If you have never used RStudio, here is a good video introducing the basic system and how R and RStudio interact.

1.3 Libraries that you will need

The R engine comes with a number of base functions, but one of the great things about R is that you can extend these base functions by libraries that can be programmed by anyone. In principle, you can install libraries from any website or file. In practice, however, most commonly used libraries are distributed via two major repositories. For statistical methods, this is CRAN, and for bioinformatics, this is Bioconductor.

To install a package from a library, use the command

install.packages(LIBRARY)

Exchange “LIBRARY” with the name of the library you want to install. The default is to search the package in CRAN, but you can specify other repositories or file locations in the function. For Windows / Mac, R should work out of the box. For other UNIX based systems, may also need to install

build-essential
gfortran
libmagick++-dev
r-base-dev
cmake

If you are new to installing packages on Debian / Ubuntu, etc., type the following:

sudo apt update && sudo apt install -y --install-recommends build-essential gfortran libmagick++-dev r-base-dev cmake

In this book, we will often use data sets from the EcoData package, which is not on CRAN, but on a GitHub page. To install the package, if you don’t have the devtools package installed already, first install devtools from CRAN by running

install.packages("devtools")

Then install the EcoData package via

devtools::install_github(repo = "TheoreticalEcology/EcoData",
                         dependencies = T, build_vignettes = T)

For your convenience, the EcoData installation also forces the installation of most of the packages needed in this book, so this may take a while. If you want to load only the EcoData package, or if you encounter problems during the install, set dependencies = F, build_vignettes = F.

1.4 Assumed R knowledge

As mentioned in the preface, this book assumes that you have basic knowledge about data manipulation (reading in data, removing or selecting columns or rows, calculating means per group etc.) and plotting in R. Note that for both purposes, there are currently two main schools in the R environment which do the same things, but with very different syntax:

  1. base R, which uses functions such as plot(), apply(), aggregate()
  2. tidyverse, with packages such as dplyr and ggplot2, which provide functions such as mutate(), filter() and heavily rely on the %>% pipe operator.

There are many opinions about advantages and disadvantages of the two schools. I’m agnostic about this, or more precisely, I think you should get to know both schools and then decide based on the purpose. I see advantages of tidyverse in particular for data manipulation, while I often prefer baseR plots over ggplot2. To keep it simple, however, all code in this course uses base R.

Note

The tidyverse framework is currently trying to expand to the tasks of statistical / machine learning models as well, trying to streamline statistical workflows. While this certainly has a lot of potential, I don’t see it as general / mature enough to recommend it as a default for the statistical workflow.

1.5 Reminder basic R programming

In the remainder of this section, I will summarize basic plots and data manipulations skills in R, as they would be taught in an introductory R course (for example here). Please read through them and make sure you know these commands in advance of the course. At the end of this part, there is a small exercise about data manipulation. If you are confindent that you can do it, you can also go there directly without reading the content of this section.

1.5.1 Representing Data in R

1.5.1.1 Exploring Data Structures

A fundamental requirement for working with data is representing it in a computer. In R, we can either read in data (e.g. with functions such as read.table()), or we can assign variables certain values. For example, if I type

x <- 1

the variable x now contains some data, namely the value 1, and I can use x in as a placeholder for the data it contains in further calculations.

Alternatively to the <- operator, you can also use = (in all circumstances that you are likely to encounter, it’s the same).

x = 1

If you have worked with R previously, this should all be familiar to you, and you should also know that the commands

class(x)
dim(x)
str(x)

allow you to explore the structure of variables and the data they contain. Ask yourself, or discuss with your partner(s):

Excercise

What is the meaning of the three functions, and what is the structure / properties of the following data types in R:

  • Atomic types (which atomic types exist),
  • list,
  • vector,
  • data.frame,
  • matrix,
  • array.

Atomic types: e.g. numeric, factor, boolean …; List can have severa tyes, vector not! data.frame is list of vectors. matrix is 2-dim array, array can have any dim, only one type.

Excercise

What is the data type of the iris data set, which is built-in in R under the name

iris

Iris, like most simple datasets, is of type data.frame

1.5.1.2 Dynamic Typing

R is a dynamically typed language, which means that the type of variables is determined automatically depending on what values you supply. Try this:

x = 1
class(x)
x = "dog"
class(x)

This also works if a data set already exists, i.e. if you assign a different value, the type will automatically be changed. Look at what happens when we assign a character value to a previously numeric column in a data.frame:

iris$Sepal.Length[2] = "dog"
str(iris)

Note that all numeric values are changed to characters as well. You can try to force back the values to numeric by:

iris$Sepal.Length = as.numeric(iris$Sepal.Length)

Have a look at what this does to the values in iris$Sepal.Length.

Note: The actions above operate on a local copy of the iris data set. You don’t overwrite the base data and can use it again in a new R session or reset it with data(iris).

1.5.2 Data Selection, Slicing and Subsetting

1.5.2.1 Subsetting and Slicing for Single Data Types

We often want to select only a subset of our data. You can generally subset from data structures using indices and TRUE/FALSE (or T/F). Here for a vector:

vector = 1:6
vector[1] # First element.
vector[1:3] # Elements 1, 2, 3.
vector[c(1, 5, 6)]  # Elements 1, 5, 6.
vector[c(T, T, F, F, T)]  # Elements 1, 2, 5.

Careful, special behavior of R: If you specify fewer values than needed, the input vector will be repeated. This is called “recycling”.

vector[c(T, F)] # works but repeats T,F 3 times - error prone!

For a list, it’s basically the same, except the following points:

  • Elements in lists usually have a name, so you can also access those via list$name.
  • Lists accessed with [] return a list. If you want to select a single element, you have to access it via [[]], as in list[[2]].
myList = list(a = 1, b = "dog", c = TRUE)
myList[1]
$a
[1] 1
myList[[1]]
[1] 1
myList$a
[1] 1

For data.frames and other objects with dimension > 2, the same is true, except that you have several indices.

matrix = matrix(1:16, nrow = 4)
matrix[1, 2]  # Element in first row, second column.
matrix[1:2,]  # First two rows, all columns.
matrix[, c(T, F ,T)]  # All rows, 1st and 3rd column.

The syntax matrix[1,] is also called slicing, for obvious reasons.

Data.frames are the same as matrices, except that, like with lists of vectors, you can also access columns via names as in data.frame$column. This is because a data.frame ist a list of vectors.

1.5.2.2 Logic and Slicing

Slicing is very powerful if you combine it with logical operators, such as “&” (logical and), “|” (logical or), “==” (equal), “!=” (not equal), “<=”, “>”, etc. Here are a few examples:

head(iris[iris$Species == "virginica", ])  
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
104          6.3         2.9          5.6         1.8 virginica
105          6.5         3.0          5.8         2.2 virginica
106          7.6         3.0          6.6         2.1 virginica

Note that this is identical to using the subset command:

head(subset(iris, Species == "virginica"))  
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
104          6.3         2.9          5.6         1.8 virginica
105          6.5         3.0          5.8         2.2 virginica
106          7.6         3.0          6.6         2.1 virginica

You can also combine several logical commands:

iris[iris$Species == "virginica" & iris$Sepal.Length > 7, ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
103          7.1         3.0          5.9         2.1 virginica
106          7.6         3.0          6.6         2.1 virginica
108          7.3         2.9          6.3         1.8 virginica
110          7.2         3.6          6.1         2.5 virginica
118          7.7         3.8          6.7         2.2 virginica
119          7.7         2.6          6.9         2.3 virginica
123          7.7         2.8          6.7         2.0 virginica
126          7.2         3.2          6.0         1.8 virginica
130          7.2         3.0          5.8         1.6 virginica
131          7.4         2.8          6.1         1.9 virginica
132          7.9         3.8          6.4         2.0 virginica
136          7.7         3.0          6.1         2.3 virginica

1.5.3 Applying Functions and Aggregates Across a Data Set

Apart from selecting data, you will often combine or calculate statistics on data.

1.5.3.1 Functions

Maybe this is a good time to remind you about functions. The two basic options we use in R are:

  • Variables / data structures.
  • Functions.

We have already used variables / data structures. Variables have a name and if you type this name in R, you get the values that are inside the respective data structure.

Functions are algorithms that are called like:

function(variable)

For example, you can do:

summary(iris)

If you want to know what the summary function does, type ?summary, or put your mouse on the function and press “F1”.

To be able to work properly with data, you have to know how to define your own functions. This works like the following:

squareValue = function(x){
  temp = x * x 
  return(temp)
}

Tasks
  1. Try what happens if you type in squareValue(2).
  2. Write a function for multiplying 2 values. Hint: This should start with function(x1, x2).
  3. Change the first line of the squareValue function to function(x = 3) and try out the following commands: squareValue(2), squareValue(). What is the sense of this syntax?
Solution

1

multiply = function(x1, x2){
  return(x1 * x2)
}

2

squareValue(2)
[1] 4

3

squareValue = function(x = 3){
  temp = x * x 
  return(temp)
}

squareValue(2)
[1] 4
squareValue()
[1] 9

The given value (3 in the example above) is the default value. This value is used automatically, if no value is supplied for the respective variable. Default values can be specified for all variables, but you should put them to the end of the function definition. Hint: In R, it is always useful to name the parameters when using functions.

Look at the following example:

testFunction = function(a = 1, b, c = 3){
  return(a * b + c)
}

testFunction()
Error in testFunction(): argument "b" is missing, with no default
testFunction(10)
Error in testFunction(10): argument "b" is missing, with no default
testFunction(10, 20)
[1] 203
testFunction(10, 20, 30)
[1] 230
testFunction(b = 10, c = 20, a = 30)
[1] 320



1.5.3.2 The apply() Function

Now that we know functions, we can introduce functions that use functions. One of the most important is the apply function. The apply function applies a function of a data structure, typically a matrix or data.frame.

Try the following:

apply(iris[,1:4], 2, mean)

Tasks
  1. Check the help of apply to understand what this does.
  2. Why is the first result of apply(iris[,1:4], 2, mean) NA? Check the help of mean to understand this.
  3. Try apply(iris[,1:4], 1, mean). Think about what has changed here.
  4. What would happen if you use iris instead of iris[,1:4]?
Solution

1

?apply

2

Remember, what we have done above (if you run this part separately, execute the following lines again):

iris$Sepal.Length[2] = "Hund"
iris$Sepal.Length = as.numeric(iris$Sepal.Length)
Warning: NAs introduced by coercion
apply(iris[,1:4], 2, mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
          NA     3.057333     3.758000     1.199333 

Taking the mean of a character sequence is not possible, so the result is NA (Not Available, missing value(s)).

But you can skip missing values with the option na.rm = TRUE of the mean function. To use it with the apply function, pass the argument(s) after.

apply(iris[,1:4], 2, mean, na.rm = T)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.849664     3.057333     3.758000     1.199333 

3

apply(iris[,1:4], 1, mean)
  [1] 2.550    NA 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400 2.700 2.500
 [13] 2.325 2.125 2.800 3.000 2.750 2.575 2.875 2.675 2.675 2.675 2.350 2.650
 [25] 2.575 2.450 2.600 2.600 2.550 2.425 2.425 2.675 2.725 2.825 2.425 2.400
 [37] 2.625 2.500 2.225 2.550 2.525 2.100 2.275 2.675 2.800 2.375 2.675 2.350
 [49] 2.675 2.475 4.075 3.900 4.100 3.275 3.850 3.575 3.975 2.900 3.850 3.300
 [61] 2.875 3.650 3.300 3.775 3.350 3.900 3.650 3.400 3.600 3.275 3.925 3.550
 [73] 3.800 3.700 3.725 3.850 3.950 4.100 3.725 3.200 3.200 3.150 3.400 3.850
 [85] 3.600 3.875 4.000 3.575 3.500 3.325 3.425 3.775 3.400 2.900 3.450 3.525
 [97] 3.525 3.675 2.925 3.475 4.525 3.875 4.525 4.150 4.375 4.825 3.400 4.575
[109] 4.200 4.850 4.200 4.075 4.350 3.800 4.025 4.300 4.200 5.100 4.875 3.675
[121] 4.525 3.825 4.800 3.925 4.450 4.550 3.900 3.950 4.225 4.400 4.550 5.025
[133] 4.250 3.925 3.925 4.775 4.425 4.200 3.900 4.375 4.450 4.350 3.875 4.550
[145] 4.550 4.300 3.925 4.175 4.325 3.950

Arrays (and thus matrices, data.frame(s), etc.) have several dimensions. For a simple 2D array (or matrix), the first dimension is the rows and the second dimension is the columns. The second parameter of the “apply” function specifies the dimension of which the mean should be computed. If you use 1, you demand the row means (150), if you use 2, you request the column means (5, resp. 4).

4

apply(iris, 2, mean)
Warning in mean.default(newX[, i], ...): argument is not numeric or logical:
returning NA

Warning in mean.default(newX[, i], ...): argument is not numeric or logical:
returning NA

Warning in mean.default(newX[, i], ...): argument is not numeric or logical:
returning NA

Warning in mean.default(newX[, i], ...): argument is not numeric or logical:
returning NA

Warning in mean.default(newX[, i], ...): argument is not numeric or logical:
returning NA
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
          NA           NA           NA           NA           NA 

The 5th column is “Species”. These values are not numeric. So the whole data.frame is taken as a data.frame full of characters.

apply(iris[,1:4], 2, str)
 num [1:150] 5.1 NA 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
NULL
apply(iris, 2, str)
 chr [1:150] "5.1" NA "4.7" "4.6" "5.0" "5.4" "4.6" "5.0" "4.4" "4.9" "5.4" ...
 chr [1:150] "3.5" "3.0" "3.2" "3.1" "3.6" "3.9" "3.4" "3.4" "2.9" "3.1" ...
 chr [1:150] "1.4" "1.4" "1.3" "1.5" "1.4" "1.7" "1.4" "1.5" "1.4" "1.5" ...
 chr [1:150] "0.2" "0.2" "0.2" "0.2" "0.2" "0.4" "0.3" "0.2" "0.2" "0.1" ...
 chr [1:150] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" ...
NULL

Remark: The “NULL” statement is the return value of apply. str returns nothing (but prints something out), so the returned vector (or array, list, …) is empty, just like:

c()
NULL



1.5.3.3 The aggregate() Function

aggregate() calculates a function per grouping variable. Try out this example:

aggregate(. ~ Species, data = iris, FUN = max)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.8         4.4          1.9         0.6
2 versicolor          7.0         3.4          5.1         1.8
3  virginica          7.9         3.8          6.9         2.5

Note that max` is the function to get the maximum value, and has nothing to do with your lecturer, who should be spelled Max.

The dot is general R syntax and usually refers to “use all columns in the data set”.

1.5.3.4 For loops

Apply and aggregate are convenience function for a far more general concept that exists in all programming language, which is the for loop. In R, a for loop look like this:

for (i in 1:10){
  #doSomething
}

and if it is executed, it will excecute 10 times the main block in the curly brackes, while counting the index variable i from 1:10. To demonstrate this, let’s execute a shorter for lool, going from 1:3, and printing i

for (i in 1:3){
  print(i)
}
[1] 1
[1] 2
[1] 3

For loops are very useful when you want to execute the same task many times. This can be for plotting, but also for data manipulation. For example, if I would like to re-programm the apply function with a for loop, it would look like that:

apply(iris[,1:4], 2, mean, na.rm = T)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.849664     3.057333     3.758000     1.199333 
out = rep(NA, 4)
for (i in 1:4){
  out[i] = mean(iris[,i])
}
out
[1]       NA 3.057333 3.758000 1.199333

1.5.4 Plotting

I assume that you have already made plots with R. Else here is a super-quick 5-min introduction video. In this course, we will not be using a lot graphics, but it will be useful for you to know the basic plot commands. Note in particular that the following two commands are identical:

  • plot(iris$Sepal.Length, iris$Sepal.Width)
  • plot(Sepal.Width ~ Sepal.Length, data = iris)

The second option is preferable, because it allows you to subset data easier and can be directly copied to regression functions.

plot(Sepal.Width ~ Sepal.Length, data = iris[iris$Species == "versicolor", ])

The plot command will use the standard plot depending on the type of variable supplied. For example, if the x axis is a factor, a boxplot will be produced.

plot(Sepal.Width ~ Species, data = iris)

You can change color, size, shape etc. and this is often useful for visualization.

plot(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species,
     cex = iris$Petal.Length)

For more help on plotting, I recommend:

1.5.5 Final exercise

Exercise - Data wrangling

We work with the airquality dataset:

dat = airquality
  1. Before working with a dataset, you should always get an overview of it. Helpful functions for this are str(), View(), summary(), head(), and tail(). Apply them to dat and make sure to understand what they do.
  2. What is the data type of the variable ‘Month’? Transform it to a factor
  3. Scale the variable Wind and save it as a new variable in dat
  4. Transform the variable ‘Temp’ (log-transform) and save it as a new variable in dat
  5. Extract the first 100 rows of dat and remove the NAs from the subsetted dataset
  6. Plot the variables of the dataset, and against each other (e.g. Wind, Wind vs Temp, Temp vs Month, all simultaneously)
  7. Calculate correlation indices between the numerical variables (e.g. Wind and Temp, Temp and Ozone). What is the difference between Spearman and Pearson correlation?
  1. str() helps us to check the data types of the variables, ensure that they are correct, e.g. categorical variables should be factors and continuous variables should be either num (numeric) or int (integer). summary()returns important summary statistics of our variables and informs us about NAs in the data

    str(dat)
    'data.frame':   153 obs. of  6 variables:
     $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
     $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
     $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
     $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
     $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
     $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
    summary(dat)
         Ozone           Solar.R           Wind             Temp      
     Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
     1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
     Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
     Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
     3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
     Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
     NA's   :37       NA's   :7                                       
         Month            Day      
     Min.   :5.000   Min.   : 1.0  
     1st Qu.:6.000   1st Qu.: 8.0  
     Median :7.000   Median :16.0  
     Mean   :6.993   Mean   :15.8  
     3rd Qu.:8.000   3rd Qu.:23.0  
     Max.   :9.000   Max.   :31.0  
    

    There are NAs in Ozone and Solar.R! Also, Month is not a factor!

  2. We have to transform Month into a factor:

    dat$Month = as.factor(dat$Month)
    str(dat)
    'data.frame':   153 obs. of  6 variables:
     $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
     $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
     $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
     $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
     $ Month  : Factor w/ 5 levels "5","6","7","8",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
  3. Scaling means that the variables are centered and standardized (divided by their standard deviation):

    dat$sWind = scale(dat$Wind)
    summary(dat)
         Ozone           Solar.R           Wind             Temp       Month 
     Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   5:31  
     1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   6:30  
     Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   7:31  
     Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   8:31  
     3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   9:30  
     Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00         
     NA's   :37       NA's   :7                                              
          Day             sWind.V1      
     Min.   : 1.0   Min.   :-2.3438868  
     1st Qu.: 8.0   1st Qu.:-0.7259482  
     Median :16.0   Median :-0.0730957  
     Mean   :15.8   Mean   : 0.0000000  
     3rd Qu.:23.0   3rd Qu.: 0.4378323  
     Max.   :31.0   Max.   : 3.0492420  
    
  4. Use logfunction to transform the variable (be aware of NAs!)

    dat$logTemp = log(dat$Temp)
  5. Use [rows, cols] to subset the data and complete.cases() to remove observations with NAs

    dat_sub = dat[1:100,]
    summary(dat_sub)
         Ozone           Solar.R           Wind            Temp       Month 
     Min.   :  1.00   Min.   :  7.0   Min.   : 1.70   Min.   :56.00   5:31  
     1st Qu.: 16.00   1st Qu.:101.0   1st Qu.: 7.40   1st Qu.:69.00   6:30  
     Median : 34.00   Median :223.0   Median : 9.70   Median :79.50   7:31  
     Mean   : 41.59   Mean   :193.3   Mean   :10.07   Mean   :76.87   8: 8  
     3rd Qu.: 63.00   3rd Qu.:274.0   3rd Qu.:12.00   3rd Qu.:84.00   9: 0  
     Max.   :135.00   Max.   :334.0   Max.   :20.70   Max.   :93.00         
     NA's   :31       NA's   :7                                             
          Day              sWind.V1          logTemp     
     Min.   : 1.00   Min.   :-2.3438868   Min.   :4.025  
     1st Qu.: 7.00   1st Qu.:-0.7259482   1st Qu.:4.234  
     Median :14.50   Median :-0.0730957   Median :4.376  
     Mean   :14.93   Mean   : 0.0313607   Mean   :4.334  
     3rd Qu.:23.00   3rd Qu.: 0.5797567   3rd Qu.:4.431  
     Max.   :31.00   Max.   : 3.0492420   Max.   :4.533  
    
    dat_sub = dat_sub[complete.cases(dat_sub),]
    summary(dat_sub)
         Ozone          Solar.R            Wind            Temp       Month 
     Min.   :  1.0   Min.   :  7.00   Min.   : 4.00   Min.   :57.00   5:24  
     1st Qu.: 16.0   1st Qu.: 97.25   1st Qu.: 7.40   1st Qu.:67.75   6: 9  
     Median : 33.0   Median :223.00   Median : 9.70   Median :81.00   7:26  
     Mean   : 41.5   Mean   :192.53   Mean   :10.15   Mean   :76.61   8: 5  
     3rd Qu.: 61.5   3rd Qu.:274.25   3rd Qu.:12.00   3rd Qu.:84.25   9: 0  
     Max.   :135.0   Max.   :334.00   Max.   :20.70   Max.   :92.00         
          Day              sWind.V1          logTemp     
     Min.   : 1.00   Min.   :-1.6910344   Min.   :4.043  
     1st Qu.: 7.75   1st Qu.:-0.7259482   1st Qu.:4.216  
     Median :15.50   Median :-0.0730957   Median :4.394  
     Mean   :14.97   Mean   : 0.0550798   Mean   :4.330  
     3rd Qu.:21.00   3rd Qu.: 0.5797567   3rd Qu.:4.434  
     Max.   :31.00   Max.   : 3.0492420   Max.   :4.522  
  6. Single continuous variables can be visualized using a histogram (hist) , for two variables, it depends on their data types:

    Scenario Which plot R command
    Numeric Histogram or boxplot hist() andboxplot
    Numeric with numeric Scatterplot plot
    Numeric with categorical Boxplot boxplot(numeric~categorical)
    Categorical with categorical mosaicplot or grouped barplot mosaicplot(table(categorical, categorical)) or barplot(data, beside=TRUE)
    # Numeric
    hist(dat$Wind, main = "Wind")

    # Numeric vs numeric
    plot(dat$Wind, dat$Solar.R)

    # Numeric with categorical
    boxplot(Wind~Month, data = dat)

    # All with all
    pairs(dat)

  7. Spearman is a rank correlation factor, less sensitive against outliers and non-linearity:

    # Pearson
    cor(dat$Wind, dat$Temp, use = "complete.obs")
    [1] -0.4579879
    # Spearman
    cor(dat$Wind, dat$Temp, use = "complete.obs", method = "spearman")
    [1] -0.4465408