## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
2 Preparing your data
Before we can start with the analysis, we must prepare our data, but how is our data represented?
Data typically consist of a table with several observations with one or more variables:
One row per observation is the standard format for most programs. There are different types of variables (with respect to the analysis):
- Dependent / response variable: the variable of interest, for which we want to know what influences it. Typically, there is one variable of particular interest.
- Explanatory or independent variables, a.k.a. predictors / covariates / treatments: variables that potentially influence the response variable.
Example: We measure plant growth and nitrogen in the soil. Growth is the dependent variable and nitrogen is the explanatory variable.
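As a minimal sketch of this layout, the plant example could be stored as a data frame with one row per observation. The column names `nitrogen` and `growth` and all values are invented for illustration.

```r
# Hypothetical measurements: one row per observation (e.g. one plot)
plants = data.frame(
  nitrogen = c(0.5, 1.0, 1.5, 2.0),  # explanatory variable (made-up values)
  growth   = c(2.1, 3.4, 4.0, 5.2)   # dependent / response variable (made-up values)
)

nrow(plants)   # number of observations
names(plants)  # the variables
```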
Variables differ in range and scale:
- Scale of measure: nominal (unordered), ordinal (ordered), metric (differences can be interpreted)
- Numeric or metric variables. Examples: body size, conductivity
- Non-numeric = categorical variables
  - unordered / nominal (red, green, blue)
  - ordered / ordinal (tiny, small, large). Attention: categories small = 1m, medium = 1.5m, large = 2m would be interpreted as numeric, because their relative differences are defined
- Special case: binary variable (0, 1). Technically it is also categorical, but with only two levels we call it binary
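In R, these variable types roughly correspond to numeric vectors and (ordered) factors. The following is a small sketch; all variable names and values are made up for illustration.

```r
# Numeric / metric variable
body_size = c(1.62, 1.75, 1.80)

# Unordered (nominal) categorical variable
colour = factor(c("red", "green", "blue"))

# Ordered (ordinal) categorical variable: levels have a rank,
# but the distances between them are not defined
size_class = factor(c("tiny", "large", "small"),
                    levels = c("tiny", "small", "large"),
                    ordered = TRUE)

# Binary variable: categorical with two levels only
survived = factor(c(0, 1, 1))

is.numeric(body_size)
is.ordered(colour)
is.ordered(size_class)
size_class[1] < size_class[2]  # comparisons are allowed for ordered factors
nlevels(survived)
```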
Before we can start with our analysis, we first need to prepare our raw data, which includes:
- Reading in the data
- Cleaning the data
- Subsetting, aggregating or re-structuring your data
2.1 Importing data
The recommended data format for your raw data is csv. You can export to csv from Excel. If you have a csv file in standard (international) format, the command to import is simply
dat = read.csv(file = "../data/myData.csv")
If your csv file departs from standard settings (e.g. you use a , instead of a . as decimal point), you will have to modify the function call. Place the cursor on the read.csv function and press F1 to get the help, which explains all that. Alternatively, you can use the import menu to the top right in RStudio.
You can open the documentation of an R function by pressing F1 while the cursor is on the function name or by running ?read.csv
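As a sketch of the non-standard case, the example below writes a small temporary file that uses ; as separator and , as decimal mark, and then reads it in two equivalent ways. The file content and column names are made up for illustration.

```r
# Write a small example file with ";" as field separator
# and "," as decimal mark (made-up values)
tmp = tempfile(fileext = ".csv")
writeLines(c("site;temperature", "A;12,5", "B;9,8"), tmp)

# Option 1: set separator and decimal mark explicitly
dat1 = read.csv(file = tmp, sep = ";", dec = ",")

# Option 2: read.csv2() uses sep = ";" and dec = "," as defaults
dat2 = read.csv2(file = tmp)

str(dat1)  # temperature should be numeric, not character
```

If a numeric column shows up as character after import, a wrong decimal mark or separator is a likely cause.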
Here is a video of how to read in csv data in R.
Typically, data will be recorded electronically with a measurement device, or you have to enter it manually using a spreadsheet program, e.g. MS Excel. The best format for data storage is csv (comma separated values) because of its long-term compatibility with all kinds of programs / systems (Excel can export to csv).
After raw data is entered, it should never be manipulated by hand! If you do modify data by hand, make a copy and document all changes (additional text file). Better: make changes using a script.
Data handling in R:
- create an R script “dataprep.R” or similar and import the dataset
- possibly combine different datasets
- clean data (remove NAs, impossible values etc.)
- save as Rdata (derived data)
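The steps above can be sketched as a short script. This is only an illustration: two small made-up tables stand in for the imported raw data, and the column names, the threshold for impossible values and the output file name are assumptions.

```r
# dataprep.R: a sketch; in practice you would import with read.csv()
# Two small made-up tables stand in for the imported raw data
growth = data.frame(plot = c(1, 2, 3, 4), height = c(10.2, NA, 250.0, 11.5))
soil   = data.frame(plot = c(1, 2, 3, 4), nitrogen = c(0.5, 1.0, 1.5, 2.0))

# Combine the datasets by their shared column
dat = merge(growth, soil, by = "plot")

# Clean: drop rows with NAs and impossible values
# (the threshold of 100 is an assumption for this example)
dat = dat[!is.na(dat$height) & dat$height < 100, ]

# Save the derived data; the file name is hypothetical
outfile = file.path(tempdir(), "cleanedData.RData")
save(dat, file = outfile)
```

Because every change is made in the script, the raw files stay untouched and the cleaning can be repeated at any time.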
2.2 Cleaning the data
Checking / cleaning means that you ensure that your data were read in correctly, and that you resolve issues with the data. Most real data has some problems, e.g. missing values. The exact solution will depend very much on the nature of the data; a common problem is typos in the raw data (e.g. letters in a column that should be numeric). Minimally, you should:
- Look at your data (double click in RStudio, or View() to see if anything is weird)
- Run summary() and str() to check range, NAs, and type of all variables (e.g. categorical variables are often imported as character; change them to factors with the as.factor() function)
Usually, this will immediately uncover some problems.
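A minimal sketch of these basic checks, using a made-up data frame as it might look right after import (column names and values are invented):

```r
# Made-up data as it might look right after import
dat = data.frame(
  species = c("oak", "beech", "oak", "beech"),
  height  = c(12.1, 15.3, NA, 14.8),
  stringsAsFactors = FALSE
)

summary(dat)  # ranges and number of NAs per variable
str(dat)      # type of each variable; species is character here

# Convert the categorical variable to a factor
dat$species = as.factor(dat$species)
levels(dat$species)

# Count missing values per column
colSums(is.na(dat))
```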
Here is a video that shows an example of a cleaning process.
2.3 Subsetting, aggregating or re-structuring your data
Often, you just want to use a part of your data, or copy, merge or split data. All you need to know is explained in Section 1.3.
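As a brief reminder, here is a sketch of subsetting and aggregating with the built-in iris dataset, using base R only; other approaches are equally possible.

```r
# Subset: keep only the rows of one species and two of the columns
setosa = iris[iris$Species == "setosa", c("Sepal.Length", "Species")]

# Aggregate: mean sepal length per species
means = aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

nrow(setosa)
means
```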