2  Preparing your data

Before we can start with the analysis, we must prepare our data, but how is our data represented?

Data typically consist of a table with several observations (rows) of one or more variables (columns):

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
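The table above appears to be the first rows of the iris dataset that ships with R (an assumption based on the column names and values). If so, a minimal way to reproduce it and check its dimensions would be:

```r
# iris is part of base R (package 'datasets'), so nothing needs to be imported
head(iris)   # first six rows: one row per observation, one column per variable
dim(iris)    # number of observations (rows) and variables (columns)
```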

One row per observation is the standard format for most programs. There are different types of variables (with respect to the analysis):

Tip

Example: We measure plant growth and nitrogen in the soil. Growth is the dependent variable and nitrogen is the explanatory variable.

Variables differ in range and scale:

But before we can start with our analysis, we first need to prepare our raw data, which includes:

2.1 Importing data

The recommended data format for your raw data is csv. You can export to csv from Excel. If you have a csv file in standard (international) format, the command to import is simply

dat = read.csv(file = "../data/myData.csv")

If your csv file departs from the standard settings (e.g. you use a , instead of a . as the decimal mark), you will have to modify the function call. Place the cursor on the read.csv function and press F1 to get the help, which explains all the options. Alternatively, you can use the import menu at the top right in RStudio.
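As a sketch of what such a modification can look like, the example below first writes a small made-up file with semicolons as field separator and commas as decimal mark (file name and values are purely illustrative), and then reads it in two equivalent ways:

```r
# Write a small example file in a non-standard csv format:
# semicolon as field separator, comma as decimal mark (made-up values)
tmp = tempfile(fileext = ".csv")
writeLines(c("plot;growth", "A;1,5", "B;2,25"), tmp)

# Option 1: state the separator and the decimal mark explicitly
dat = read.csv(file = tmp, sep = ";", dec = ",")

# Option 2: read.csv2() uses sep = ";" and dec = "," as its defaults
dat2 = read.csv2(file = tmp)

str(dat)  # growth should be numeric, not character
```

If the decimal mark is not specified correctly, the column is typically imported as character instead of numeric, which str() will reveal.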

You can open the documentation of an R function by pressing F1 while the cursor is on the function name, or by running ?read.csv

Here is a video of how to read in csv data in R.

Typically, data will be recorded electronically with a measurement device, or you have to enter it manually using a spreadsheet program, e.g. MS Excel. The best format for data storage is csv (comma separated values) because it offers long-term compatibility with all kinds of programs / systems (Excel can export to csv).

After raw data is entered, it should never be manipulated by hand! If you do modify data by hand, make a copy and document all changes (in an additional text file). Better: make changes using a script.

Data handling in R:

  • create R script “dataprep.R” or similar and import dataset

  • possibly combine different datasets

  • clean data (remove NAs, impossible values etc.)

  • save as Rdata (derived data)
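The steps above can be sketched as a short script. This is only an illustration under stated assumptions: the two input files are created on the fly with made-up values so that the example runs on its own, and all file and column names (growth.csv, soil.csv, plot, growth, nitrogen, cleanData.RData) are hypothetical.

```r
# dataprep.R: minimal sketch with made-up data; all names are hypothetical

# Stand-ins for two raw csv files (in a real project these already exist)
rawDir = tempdir()
write.csv(data.frame(plot = c("A", "B", "C", "D"),
                     growth = c(1.2, NA, 2.4, -5)),  # one NA, one impossible value
          file.path(rawDir, "growth.csv"), row.names = FALSE)
write.csv(data.frame(plot = c("A", "B", "C", "D"),
                     nitrogen = c(10, 12, 15, 11)),
          file.path(rawDir, "soil.csv"), row.names = FALSE)

# 1. import
growth = read.csv(file.path(rawDir, "growth.csv"))
soil = read.csv(file.path(rawDir, "soil.csv"))

# 2. combine the datasets by their shared key column
dat = merge(growth, soil, by = "plot")

# 3. clean: drop rows with NAs, then drop impossible (negative) growth values
dat = na.omit(dat)
dat = dat[dat$growth >= 0, ]

# 4. save the derived data; the raw files are left untouched
save(dat, file = file.path(rawDir, "cleanData.RData"))
```

Because every change is made in the script, the cleaning can be repeated at any time from the unmodified raw data, and the script itself documents what was done.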

R can also import data from nearly any data source, including xls or xlsx files. Here and here are two websites with import explanations for many different data formats.

2.2 Cleaning the data

Checking / cleaning means that you ensure that your data have been read in correctly, and that you resolve issues with the data. Most real data has some problems, e.g. missing values.

The basic checks that I would recommend will usually uncover some problems immediately. The exact solution will depend very much on the nature of the data; common problems are typos in the raw data (e.g. letters in a column that should be numeric). Minimally, you should:

  • Look at your data (double click in RStudio, or View() to see if anything is weird)
  • Run summary() and str() to check range, NAs, and type of all variables (e.g. categorical variables are often imported as character, change them to factors with the as.factor() function)
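These checks can be illustrated on a small made-up data frame (the column names and values are invented for this sketch) that contains a typo in a numeric column and a categorical variable stored as character:

```r
# Made-up example data with typical problems (all names are hypothetical)
dat = data.frame(species = c("oak", "beech", "oak", "beech"),
                 height = c("12.5", "1o.3", "14.1", NA),  # typo: letter o instead of 0
                 stringsAsFactors = FALSE)

str(dat)      # height is character although it should be numeric
summary(dat)

# Categorical variable: convert character to factor
dat$species = as.factor(dat$species)

# Numeric variable: locate entries that cannot be converted to a number
bad = which(!is.na(dat$height) &
            is.na(suppressWarnings(as.numeric(dat$height))))
dat$height[bad]  # the offending raw entries

# Here, unparseable entries simply become NA; in a real project the typo
# would rather be corrected in the script and the change documented
dat$height = suppressWarnings(as.numeric(dat$height))
summary(dat)  # now shows the range and the number of NAs for height
```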

Here is a video that shows an example of a cleaning process.

2.3 Subsetting, aggregating or re-structuring your data

Often, you just want to use a part of your data, or copy, merge or split data. All you need to know is explained in Section 1.3.
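As a quick reminder, the following sketch shows one possible base R approach to each of these operations, using the built-in iris dataset. The object names (setosa, meanLength, bySpecies, merged) and the new column name Mean.Sepal.Length are chosen here for illustration only.

```r
# Subset: only the rows of one species, and only two of the columns
setosa = iris[iris$Species == "setosa", c("Sepal.Length", "Species")]

# Aggregate: mean sepal length per species
meanLength = aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Split: a list with one data frame per species
bySpecies = split(iris, iris$Species)

# Merge: add the per-species mean back to every observation
names(meanLength)[2] = "Mean.Sepal.Length"
merged = merge(iris, meanLength, by = "Species")

head(merged)
```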