[3]{.chapter-number}  [Preparing your data]{.chapter-title}

Data typically consist of a table with several observations of one or more variables:

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

The standard of the tabular format is that columns are the variables and the rows are the observations, with one row per observation.
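The table above shows the first six rows of the `iris` dataset that ships with R, so this layout can be inspected directly. A minimal sketch (the object name `firstRows` is only chosen for illustration):

```r
# iris is part of the 'datasets' package that is loaded with base R
firstRows = head(iris)  # first six observations (rows)
firstRows

dim(iris)    # number of observations (rows) and variables (columns)
names(iris)  # the variable names, one per column
```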

There are different types of variables (with respect to the analysis):

Response variable (also called outcome or dependent variable): the variable of interest, for which we want to know what influences it. Typically, there is one variable of particular interest.
Predictor variable (also called explanatory variable, independent variable or covariate): a variable that potentially influences the response variable.

Tip

Example: We measure plant growth and nitrogen in the soil. Growth is the dependent variable and nitrogen is the explanatory variable.

Variables differ in range and scale:

Scale of measure: nominal (unordered), ordinal (ordered), metric (differences can be interpreted)
Continuous numeric variables (ordered and continuous / real), e.g. temperature
Integer numeric variables (ordered, integer). An important special case of those are count data, i.e. 0,1,2,3, …
Categorical variables (e.g. a fixed set of options such as red, green, blue), which can be further divided into:
- Unordered categorical variables (Nominal) such as red, green, blue
- Binary (Dichotomous) variables (dead / survived, 0/1)
- Ordered categorical variables (small, medium, large)

It is important that you record the variables according to their nature. You have to make sure that the type is properly recognized after reading in the data, because many methods treat a variable differently depending on whether it is numeric or categorical.
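As a rough illustration of how these variable types can be represented and checked in R, consider the following sketch. The data frame and its column names are made up for this example.

```r
# Hypothetical example data: one column per variable type
dat = data.frame(
  temperature = c(12.5, 14.1, 9.8),             # continuous numeric
  nSeedlings  = c(0L, 3L, 1L),                  # integer counts
  colour      = c("red", "green", "blue"),      # unordered categorical
  size        = c("small", "large", "medium"),  # ordered categorical
  survived    = c(1, 0, 1)                      # binary (0/1)
)

# Unordered categorical: convert to a factor
dat$colour = as.factor(dat$colour)

# Ordered categorical: state the order of the levels explicitly
dat$size = factor(dat$size, levels = c("small", "medium", "large"),
                  ordered = TRUE)

str(dat)  # check that each variable has the intended type
```

Without the explicit `levels` argument, R would sort the levels alphabetically (large, medium, small), which is usually not the intended order.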

Before we can start with our analysis, we first need to prepare our raw data, which includes:

Reading in the data
Cleaning the data
Subsetting, aggregating or re-structuring your data

Remarks on data handling

Typically, data will be recorded electronically with a measurement device, or you have to enter it manually using a spreadsheet program, e.g. MS Excel. The best format for data storage is csv (comma-separated values) because it offers long-term compatibility with all kinds of programs and systems (MS Excel can export to csv).

After raw data is entered, it should never be manipulated by hand! If you modify data by hand, make a copy and document all changes (additional text file). Better: make changes using a script.

Data handling in R:

create R script “dataprep.R” or similar and import dataset
possibly combine different datasets
clean data (remove NAs, impossible values etc.)
save as Rdata (derived data)
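The steps above could look roughly like this in a script. This is only a sketch: the raw file is written to a temporary location so that the example runs on its own, and the column names and the rule for "impossible values" (negative heights) are assumptions for illustration.

```r
# dataprep.R (sketch): import -> clean -> save derived data

# Assumption: a small raw csv file is created here so the example is runnable;
# in practice you would point to your own raw data file instead.
rawFile = tempfile(fileext = ".csv")
writeLines(c("plot,height",
             "A,10.2",
             "B,NA",
             "C,-5",
             "D,12.7"), rawFile)

# import
dat = read.csv(file = rawFile)

# clean: remove rows with NAs and impossible values (here: negative heights)
dat = dat[!is.na(dat$height) & dat$height >= 0, ]

# save as derived data; the raw file itself is never modified
derivedFile = tempfile(fileext = ".RData")
save(dat, file = derivedFile)
```

Because every change is made in the script, the cleaning can be repeated and checked at any time by re-running it on the untouched raw file.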

3.1 Importing data

The recommended data format for your raw data is csv. You can export to csv from Excel. If you have a csv file in standard (international) format, the command to import is simply:

dat = read.csv(file = "../data/myData.csv")

If your csv file departs from the standard settings (e.g. you use a , instead of a . as the decimal mark), you will have to modify the arguments of the function. Place the cursor on the read.csv function and press F1 to get the help, which explains all options. Alternatively, you can use the import menu at the top right in RStudio.

You can open the documentation of an R function by pressing F1 while the cursor is on the function name or by running ?read.csv

Here is a video of how to read in csv data in R.
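For example, a file that uses ; as separator and , as decimal mark could be read as sketched below. The file content is made up and written to a temporary file only so that the example runs on its own.

```r
# Assumption: a hypothetical csv file with ";" as separator and "," as decimal mark
csvFile = tempfile(fileext = ".csv")
writeLines(c("site;temperature",
             "a;12,5",
             "b;9,8"), csvFile)

# Option 1: state separator and decimal mark explicitly
dat = read.csv(file = csvFile, sep = ";", dec = ",")

# Option 2: read.csv2() uses these two settings as defaults
dat2 = read.csv2(file = csvFile)

str(dat)  # temperature should be numeric, not character
```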

R can also import data from nearly any data source, including xls or xlsx files. Here and here are two websites with import explanations for many different data formats.

3.2 Cleaning the data

Checking / cleaning means that you ensure that your data was read in correctly, and that you resolve issues with the data. Most real data has some problems, e.g. missing values. The basic checks that I would recommend are listed below.

Usually, these checks will immediately uncover some problems. The exact solution depends very much on the nature of the data; a common problem is typos in the raw data (e.g. letters in a column that should be numeric). Minimally, you should:

Look at your data (double click in RStudio, or View() to see if anything is weird)
Run summary() and str() to check range, NAs, and type of all variables (e.g. categorical variables are often imported as character, change them to factors with the as.factor() function)
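A minimal sketch of these checks, using a made-up data frame that contains typical problems (a categorical variable stored as character and typos in a numeric column):

```r
# Hypothetical raw data with typical problems
dat = data.frame(
  species = c("oak", "beech", "oak", "beech"),
  height  = c("10.2", "12,7", "abc", "8.9"),  # typos turn the column into character
  stringsAsFactors = FALSE
)

str(dat)      # height is character although it should be numeric
summary(dat)

# categorical variable: character -> factor
dat$species = as.factor(dat$species)

# numeric variable: entries that cannot be parsed become NA
# (R would issue a warning here, which is suppressed for the example)
dat$height = suppressWarnings(as.numeric(dat$height))

summary(dat)  # now shows the range and the number of NAs for height
```

Whether such entries should be corrected (e.g. "12,7" to 12.7) or treated as missing has to be decided case by case; the sketch only makes the problem visible.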

Here is a video that shows an example of a cleaning process.

3.3 Subsetting, aggregating or re-structuring your data

Often, you just want to use a part of your data, or copy, merge or split data. All you need to know is explained in ?sec-datamanipulation
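To give a first impression, here is a minimal sketch of subsetting, aggregating and merging with base R functions, using the built-in `iris` data. The `info` table and its `habitat` values are invented purely for illustration.

```r
# Subsetting: keep only some rows and columns
setosa = iris[iris$Species == "setosa", c("Sepal.Length", "Species")]

# Aggregating: mean sepal length per species
meanLength = aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Merging: add species-level information via the shared column "Species"
info = data.frame(Species = c("setosa", "versicolor", "virginica"),
                  habitat = c("wet", "mixed", "dry"))  # made-up values
merged = merge(meanLength, info, by = "Species")

merged
```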