R data analysis and visualisation for beginners Part 1

Introduction

R is open-source software: a programming language and environment that is well suited to statistical analysis. It can be downloaded from R-project. Several GUIs and editors are available for R. We will use the development environment RStudio (free software, available at RStudio). Both R and RStudio are available for Linux, Windows and macOS.

Here are a few simple tutorials and books that are available online: Paradis, Short-Intro, Quick-R, Introduction to R, R for Data Science

In RStudio, we can either type commands directly at the R command line (the console) or write scripts in the editor and run them in R. We will focus on the command line. Some important basic commands are:

q()                     # Quits R (important if you use R without a GUI)
?command_name           # Calls up the manual for a command (try ?q)
getwd()                 # Shows the working directory
setwd("directory_name") # Sets the working directory

The working directory can also be set in RStudio by clicking on Session → Set Working Directory → Choose Directory. The working directory is important since it is the directory to which R writes its output by default. Note that everything written after a # is not evaluated by R; we will use this for commenting.

There are two ways to write code in R: typing directly into the console or writing a script. Both have their advantages and disadvantages. By typing into the console, you immediately run your code and see the output. On the other hand, writing a script makes it easier to correct typos and to reproduce your work. In the end, it comes down to personal preference and the task at hand.

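### Write and run this chunk of code directly in the console
2
2 + 2
a      # In a fresh session: Error, object 'a' not found (nothing has been assigned to a yet)
a <- 3
a
a + 2

### Write this chunk of code in an R script
### This is a comment ###
2      # I can write comments wherever I want
2 + 2  # Sum of two numbers
b      # In a fresh session: Error, object 'b' not found
b <- 5 # Assigning a value to an object
b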

To navigate through your code, a few shortcuts are good to know. Ctrl+L clears the console, while Ctrl+Enter runs the current line of the script. Tab autocompletes the command you are writing, and the up arrow recalls the previous command you entered in the console.

Objects, values and classes

To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on those.

This is the most important step in understanding and successfully using R because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners (and advanced users).

To understand computations in R, two slogans are helpful:

    Everything that exists is an object.
    Everything that happens is a function call.

John Chambers

When using R, your data, functions, results etc. are stored in the active memory of the PC in the form of objects of different classes to which you assign names.

Object	Class
vector	numeric, integer, character, complex or logical
factor	numeric, integer, character
array	numeric, integer, character, complex or logical
matrix	numeric, integer, character, complex or logical
data frame	numeric, integer, character, complex or logical
list	numeric, character, complex, logical, function, expression…

First let us focus on numerical objects, which R stores as type double by default. To keep a result, we assign it to a variable with the assignment operator <-. A variable is an object to which we have given a name and assigned a value. Here’s an example:

x <- 1
y <- 2
z <- 10
Peter <- x + y
Bernard <- y - x
Rabbit <- z * Peter

Now we can display the values of these variables simply by typing their name.

Name	Value
x	1
y	2
z	10
Peter	3
Bernard	1
Rabbit	30

To see which type a variable has:

typeof(x)
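As a small sketch of what typeof() reports: numbers typed without a suffix are stored as double, while the suffix L creates an integer. The names whole and ratio are made up for this illustration.

```r
x <- 1
typeof(x)            # "double": plain numbers are stored as double
class(x)             # "numeric"
whole <- 1L          # The suffix L asks R for an integer
typeof(whole)        # "integer"
ratio <- whole / 2L  # Dividing two integers
typeof(ratio)        # "double": the operator / always returns a double
```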

R as a pocket calculator, first steps

The user can do a lot of things with these objects, like basic arithmetical calculations \(+, -, *\) and \(/\).

Exercise: Do basic arithmetical calculations for any combination of x, y and z. Can you use one variable more than once? Is the result of the operation saved somewhere? Why?

Many more functions are available in R. Here is a small list:

^: power (so 2 to the power of 2 would be 2^2 in R)
sqrt(x): square root of x
log(x), exp(x): (natural) logarithm of x, exponential function of x
sin(x), cos(x), tan(x): trigonometric functions of x
abs(x): absolute value |x| of x
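The functions listed above could be tried out as follows. This is only a sketch with arbitrarily chosen numbers, and the name root is made up for the illustration.

```r
2^3                    # 8
sqrt(16)               # 4
log(exp(1))            # 1, because log() is the natural logarithm
sin(0)                 # 0
abs(-2.5)              # 2.5
root <- sqrt(abs(-9))  # Function calls can be nested
root                   # 3
```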

Exercise: Sum the values of x, y and z into a variable space. Try doing it in more than one way; there is a hint in the previous sentence.

Exercise: Make a new variable n. Assign it a negative value and then add it to one of the existing variables. Now do the same with its absolute value.

Other object classes

So far, we have used one object class, numeric. Four other important object classes are character, logical, integer and function. The class character consists of objects that are strings of symbols (e.g. words, but also something like A?7Fd). The class logical is reserved for the logical values TRUE and FALSE. Such objects are often useful in programming (more on that later). Integer is a class which only allows integer-valued entries. NA is the value reserved for missing data. It is not an object class of its own; it takes its class from the class of the object in which the value is missing. If no such information is available, NA is treated as logical.

word <- "hello"
word2 <- "A?7Fd"
mv <- NA
Bool <- TRUE
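The behaviour of NA described above can be checked with a short sketch. NA_integer_ and NA_character_ are the typed missing-value constants built into R; mv is the variable from the example.

```r
mv <- NA
class(mv)             # "logical": no further information is available
class(NA_integer_)    # "integer": a missing value of class integer
class(NA_character_)  # "character": a missing value of class character
is.na(mv)             # TRUE: is.na() tests for missing values
mv + 1                # NA: calculations with a missing value stay missing
```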

To display the mode (the basic type) of an object, you can use the function mode().

mode(x)
mode(sin)
mode(word)
mode(Peter)
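For orientation, here is a sketch of what mode() returns for a few objects, with class() added for comparison. The values are chosen only for illustration.

```r
word <- "hello"
mode(1)     # "numeric"
mode(word)  # "character"
mode(TRUE)  # "logical"
mode(sin)   # "function"
class(1L)   # "integer"
mode(1L)    # "numeric": mode() groups integer and double together
```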

We shall talk about functions a bit later.

Note that R (temporarily) saves all defined objects in the so-called workspace. The workspace can be saved (permanently) after a session and reloaded to use objects defined in an earlier session. In RStudio, the workspace can be saved using the menu in the upper right corner. When you quit R, you will by default be asked whether the workspace should be saved (it is saved in the working directory). In RStudio, you can also easily delete all defined objects by clicking on Clear all in the upper right quarter.

Vectors, matrices, lists and data frames

Often, we will not only deal with single objects, but with several objects at once. For example, SNP data of one individual consists of all nucleotides at each SNP position, hence consists of a vector (an ordered set) of entries.

Vectors

For example, the vector (1,2,3,4) can be defined by

v1 <- c(1,2,3,4)

Note that a vector consists of objects of the same class and that by using c() on vectors, you can concatenate vectors. Try using the command str instead of mode to get further information about an object.

v2 <- c("a", 1, 2, 3)
mode(v2)
str(v2)

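Here we saw what happens when we mix different classes in a vector: R coerces all entries to one common class, the most general one among them. For the classes used here, the order from least to most general is logical, integer, numeric, character, so a vector that contains at least one string becomes a character vector.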

d <- c(1, 2, 3, TRUE)
e <- c("a", 2, 3, TRUE)

With some imagination you can see why that has the potential to induce many a headache. Luckily, it is easy in R to change the class of a vector with a function of the form as.<class.name>(). For example, as.numeric(e) converts the vector e from the previous example into a numeric one; entries that cannot be read as a number become NA.
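A short sketch of coercion and conversion, using the vectors d and e from above. The name e_num is made up for the illustration, and suppressWarnings() is only used here to hide the warning "NAs introduced by coercion" that R would otherwise print.

```r
d <- c(1, 2, 3, TRUE)
e <- c("a", 2, 3, TRUE)
typeof(d)      # "double": TRUE was coerced to 1
typeof(e)      # "character": every entry was coerced to a string
e              # "a" "2" "3" "TRUE"
e_num <- suppressWarnings(as.numeric(e))
e_num          # NA 2 3 NA: "a" and "TRUE" cannot be read as numbers
as.logical(d)  # TRUE TRUE TRUE TRUE: every non-zero number becomes TRUE
```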

Exercise: Try converting some of your vectors to other classes and try to comprehend the results.

Vectors can also be made by concatenating other vectors.

v3 <- c(v1, v1, v1)
v3

Instead of typing all entries by hand, vectors consisting of copies of the same element or of equidistant values can be defined by the following commands:

rep(4, 6)       # The first argument gives the object to repeat, the second argument the number of repetitions
seq(2, 4, 0.5)  # Consists of values with distance 0.5 from 2 to 4 (including 2 and 4)
1:7             # A vector consisting of 1,...,7
seq_along(v2)   # Vector (1,...,length(v2))
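Both rep() and seq() accept further named arguments. The following sketch shows a few of them (times, each, length.out and by) with arbitrarily chosen values.

```r
rep(c("A", "B"), times = 2)  # "A" "B" "A" "B": the whole vector is repeated
rep(c("A", "B"), each = 2)   # "A" "A" "B" "B": each entry is repeated
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00: 5 equidistant values
seq(10, 1, by = -3)          # 10 7 4 1: a negative step counts downwards
5:1                          # 5 4 3 2 1
```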

We can use various functions to examine our vectors.

length(v2)
class(v2)
mode(v2)
summary(v2)
str(v2)
mean(v1)    # mean() and var() need numerical entries, so we use v1 here
var(v1)
# ... and many more

Working with vectors

Let v1 be a vector with numerical entries, for example

v1 <- 1:5
mode(v1)
v1 + 5

A certain entry, say the i-th entry of v1, can be accessed by v1[i]. Note that you can also access a subvector by specifying all of the entries you want to access.

v1[1]        # The first entry of the vector
v1[1] + 5    # The first entry of the vector increased by 5
v1[c(1, 2)]  # The first two entries of the vector
v1[-c(1, 2)] # All entries of the vector apart from the first two
v1[v1<3]     # All entries smaller than 3

Exercise: Make a vector of all numbers between 3 and 21. Multiply the second element of the vector by 3. Multiply the whole vector by 4. Multiply all but the last element of the vector by 5.

Exercise: Construct a vector of length 300 consisting of 100 copies of numbers 1,2,3.

Exercise: What kind of an object is v1<3?

In the last example, we introduced a vector consisting of objects of class logical. Let’s look at it: v1<3

Such an operation (a vector, a comparison operator and an object the vector is compared to) produces a logical vector giving the result of the comparison for each vector entry. Here’s a list of comparison and logical operators:

==, !=: equal, unequal
>, >=: greater, greater or equal
<, <=: smaller, smaller or equal
&, |: and, or (to combine logical expressions, each expression has to be put into ())
!(logical condition): negation of a logical condition

logicv1 <- (v1>1) & (v1<4)
v1[logicv1]
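The operators from the list can be combined as in the following sketch. The vector v is made up for this illustration.

```r
v <- 1:5
v < 3                 # TRUE TRUE FALSE FALSE FALSE
(v < 2) | (v > 4)     # TRUE FALSE FALSE FALSE TRUE
!(v == 3)             # TRUE TRUE FALSE TRUE TRUE
v[(v < 2) | (v > 4)]  # 1 5
sum(v < 3)            # 2: in a sum, TRUE counts as 1 and FALSE as 0
```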

So far, accessing entries of a vector resulted in the output of the accessed entries, discarding the information at which position of the original vector the entries are placed. This information can be retrieved by which.

v4 <- -7:3  # Defines v4 as the vector (-7, -6, ..., 1, 2, 3). Note the blank!
which(v4>0) # Gives the positions of all entries of v4 bigger than 0
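As a sketch of how the positions returned by which() relate to the entries themselves (pos is a name made up for the illustration):

```r
v4 <- -7:3
pos <- which(v4 > 0)
pos      # 9 10 11: the positions of the positive entries
v4[pos]  # 1 2 3: the entries at these positions
```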

Note that to get the position of a minimal or maximal entry of a numerical vector v1, you can use which.min(v1) and which.max(v1) (if there are several entries tied for minimum or maximum, the entry with lowest position is shown). To overwrite entries in a vector, you just have to assign a new value to the entry:

v1[1] <- 100
v1[1]
v1
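A small sketch of how ties are handled, and of overwriting several entries at once by combining assignment with a logical condition. The vector w is made up for this illustration.

```r
w <- c(4, 1, 7, 1, 7)
which.min(w)    # 2: the first of the two tied minima
which.max(w)    # 3: the first of the two tied maxima
w[w == 7] <- 0  # Overwrites every entry equal to 7
w               # 4 1 0 1 0
```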

Here’s a list of some useful functions for a numerical vector v:

max(v), min(v): Gives the maximal/minimal value of a vector
sum(v): Sums the entries of v
mean(v): arithmetic mean of v
sort(v): sort entries of v in increasing order. To sort in decreasing order, add the second argument decreasing=TRUE
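Applied to a small example vector (the values are arbitrary), these functions could be used like this:

```r
v <- c(3, 8, 1, 6)
max(v)                      # 8
min(v)                      # 1
sum(v)                      # 18
mean(v)                     # 4.5
sort(v)                     # 1 3 6 8
sort(v, decreasing = TRUE)  # 8 6 3 1
```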

For a logical vector v, the following commands might be useful:

any(v): Is at least one entry of v TRUE?
all(v): Are all entries of v TRUE?
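In practice, any() and all() are usually applied to the result of a comparison, as in this sketch with a made-up vector:

```r
v <- c(2, 5, 9)
any(v > 8)  # TRUE: at least one entry is bigger than 8
all(v > 8)  # FALSE: not every entry is bigger than 8
all(v > 0)  # TRUE: every entry is positive
```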

Matrices

Another important structure in R is the matrix. A matrix is a rectangular array of \(n \times m\) values, where \(n\) is the number of rows and \(m\) is the number of columns. A matrix can be defined by listing all entries in a vector, specifying the number of rows and stating whether the entries are ordered by rows or columns. The \(n \times n\) identity matrix can be defined by diag(n).

matrix1 <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)  # Ordered by rows
matrix2 <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = FALSE) # Ordered by columns
D <- diag(2)                                              # The 2 x 2 identity matrix
matrix3 <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 4,
                  byrow = TRUE)                           # The 4 values are recycled to fill all 8 entries
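To make the effect of byrow visible, here is a sketch with six entries. The names by_row and by_col are made up for the illustration, and the commented lines show the printed matrices.

```r
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
by_col <- matrix(1:6, nrow = 2)  # byrow = FALSE is the default
by_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
dim(by_row)  # 2 3: number of rows, number of columns
```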

Note that since the first argument of matrix is a vector, you can use the commands written down in the chapter about vectors to easily build matrices with specific patterns in their entries. To access an entry of a matrix, you have to specify its row and column. As in the case of vectors, you can also access several entries at once:

matrix1[1,2] # Accesses the entry in the 1st row, 2nd column
matrix1[ ,1] # Accesses the first column
matrix1[1, ] # Accesses the first row

You can assign names to the rows and columns of a matrix.

colnames(matrix1) <- c("A","B")  # Assign names to the columns of matrix1
rownames(matrix1) <- c("C", "D") # Assign names to the rows of matrix1
matrix1
matrix1["C","B"]                 # Accesses the entry in row "C", column "B"
matrix1[,"B"]

There are many operations available to manipulate matrices. The following list shows some important commands for matrices:

v3 <- c(1, 1)       # Defines a 2-dimensional vector
t(matrix1)          # Transposes matrix1 (switches rows with columns)
matrix1 %*% matrix2 # Multiplies matrix1 with matrix2
matrix1 %*% v3      # Multiplies matrix1 with the vector v3
solve(matrix1)      # Inverts the matrix
solve(matrix1, v3)  # Solves the system of linear equations matrix1 %*% x = v3
cbind(matrix1, v3)  # Adds v3 as a new column (also works for adding matrices)
rbind(matrix1, v3)  # Adds v3 as a new row (also works for adding matrices)
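The results of solve() can be checked by multiplying back, as in the following sketch. The names A, b, A_inv and x are made up for this illustration; A has the same entries as matrix1.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
b <- c(1, 1)
A_inv <- solve(A)  # The inverse of A
A %*% A_inv        # The 2 x 2 identity matrix (up to rounding)
x <- solve(A, b)   # The solution of A %*% x = b
x                  # -1 1
A %*% x            # Reproduces b as a one-column matrix
A * A              # Careful: * multiplies entry by entry, unlike %*%
```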

Exercise: Make a 3x3 matrix (3 rows, 3 columns), where the first column contains only the number 4.

Exercise: From the above matrix, extract the columns where the sum of the numbers in the column is bigger than 10. Use the colSums() function.