R data analysis and visualisation for beginners part 2

Data frames

Normally, we will deal with data collected from different individuals. Such data can be laid out as a table with rows and columns (similar to a matrix), where the rows stand for the individuals and the columns for the measured variables or for information about experimental factors. In contrast to a matrix, different columns may hold different types of objects (numbers, text, logical values), but within one column the type is the same for all rows/individuals. This type of data structure is called a data frame in R. It can be defined by the command data.frame() as follows:

data1 <- data.frame(height = c(3, 4, 5, 3), weight = c(5, 5, 4, 2), 
                    treatment = c("c", "a", "a", "b"),
                    stringsAsFactors = TRUE) # Since R 4.0, text only becomes a factor if asked
str(data1) # Note that the third column is referred to as a factor

If we compare defining a data frame to defining a matrix, we see that we enter each column as a separate vector and that we can name the columns (similar naming is possible for matrices and vectors). Non-numerical values in a data frame are mostly experimental conditions and are best treated as factors. Before R 4.0, data.frame() converted text columns to factors automatically; in current versions this happens only if you set stringsAsFactors = TRUE or wrap the column in factor(). To access entries, there are two possibilities: we can index as with matrices or address the columns directly.

data1[1, 2]        # Accesses the entry in the 1st row, second column
data1$weight[1]    # Does the same
data1$weight       # The column weight
data1[[1]]         # The first column
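
Both styles can be combined by putting column names inside the square brackets. The following is only a small sketch; the data frame df_demo is made up for illustration and is not part of the course data.

```r
# Hypothetical example data, not part of the course material
df_demo <- data.frame(height = c(3, 4, 5, 3), weight = c(5, 5, 4, 2))

df_demo[1, "weight"]              # Entry in the 1st row of the column weight
df_demo[["weight"]]               # The column weight as a vector
df_demo[, c("height", "weight")]  # Several columns, the result is a data frame
df_demo[2, ]                      # The whole second row
```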

The benefit of using the column names is that you don’t have to memorize the exact structure of the data frame, but just the column names (thus, use reasonable column names). For programming though, it’s often easier to address the columns by number and not by name. If you are about to work with one data frame a lot, you can use attach() to add the data frame to the search path of R. This means that R knows that if you type in a column name, it’s from said data frame. You can detach by using detach().

data1$height
height        # Doesn't work
attach(data1)
height        # Works now
data1$height
detach(data1)
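
attach() can cause confusion when a column name clashes with another object in the workspace. One possible alternative is with(), which evaluates a single expression inside the data frame without changing the search path. A minimal sketch, using a made-up data frame df_demo:

```r
# Hypothetical example data
df_demo <- data.frame(height = c(3, 4, 5, 3), weight = c(5, 5, 4, 2))

# Column names are visible only inside the with() call
mean_height <- with(df_demo, mean(height))
ratio <- with(df_demo, weight / height)
mean_height
ratio
```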

We know how to access different columns of a data frame. However, we will often want to work with a subset of the data, for example only data from individuals/rows under a certain experimental condition (e.g., a certain factor level of an experimental factor). Such subsets of data can be accessed by subset().

subset(data1, treatment == "a")                       # Chooses all rows with treatment a
data_sub <- subset(data1, treatment %in% c("a", "b")) # Chooses all rows with treatment a or b
str(data_sub)
table(data_sub$treatment)                             # Unused factor levels are kept by subset
data_sub2 <- droplevels(data_sub$treatment)           # Kicks out unused factor levels
table(data_sub2)
subset(data1, treatment == "a", select = height)      # Shows the height values for all individuals with treatment a
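
The same selection can be written with a logical vector inside square brackets, which may help to see what subset() does behind the scenes. This is a sketch on a made-up data frame df_demo, not a replacement for the commands above.

```r
# Hypothetical example data
df_demo <- data.frame(height = c(3, 4, 5, 3), weight = c(5, 5, 4, 2),
                      treatment = c("c", "a", "a", "b"))

keep <- df_demo$treatment == "a"      # Logical vector, one entry per row
keep
rows_a <- df_demo[keep, ]             # Same rows as subset(df_demo, treatment == "a")
heights_a <- df_demo[keep, "height"]  # Height values only, as a plain vector
rows_a
heights_a
```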

As with matrices, rbind() and cbind() can be used to glue data sets together. A more flexible command is merge() (which we don’t cover here); to learn about it, type ?merge. A similar but more flexible class than the data frame is the list, created with list(). All defined objects are displayed in a list in the upper right corner in RStudio.
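
Although merge() is not covered in detail here, a brief sketch may show how the three commands differ. All object names (part1, part2, weights) are invented for this illustration.

```r
# Hypothetical example data
part1 <- data.frame(id = 1:2, height = c(3, 4))
part2 <- data.frame(id = 3:4, height = c(5, 3))
weights <- data.frame(id = c(1, 2, 4), weight = c(5, 5, 2))

stacked <- rbind(part1, part2)                # Appends rows; column names must match
widened <- cbind(stacked, site = "field")     # Appends a column; the value is recycled
merged <- merge(stacked, weights, by = "id")  # Keeps only ids present in both
stacked
widened
merged
```

By default merge() drops rows without a partner in the other data frame; see the argument all in ?merge if you want to keep them.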

Exercise: Make a data frame with 3 columns and 5 rows. Make sure the first column is the sequence of numbers 1:5, the second column is a character vector, and the third one is a logical vector. Now add a fourth column of numbers. Extract those rows where the fourth column is bigger than 2. Add the first column to the fourth column.

Missing data

As already mentioned, missing data should be coded as NA. One way to exclude missing data is to keep only the data rows that are complete for all variables. This can be done by na.exclude(). For vectors it drops the missing entries, and for data frames it drops every row that contains a missing value, so that they are ignored in further computations. To just show the positions of NA, use is.na(). is.na() returns an object of the same shape with logical values indicating whether a value is missing (TRUE) or not (FALSE). To check whether there is any missing data at all in an object x, type any(is.na(x)).

v_na <- c(NA, 2, 4, NA)
data_na <- data.frame(b1 = 1:4, b2 = c(NA, 3, 3, NA))
data_na
is.na(v_na)
any(is.na(v_na))
is.na(data_na)
data_good <- na.exclude(data_na) # Throw out all rows with NA
data_good
str(data_good)
mean(na.exclude(v_na))           # Compute the mean of present values, no permanent change in v_na
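
Many summary functions can also skip missing values themselves through the argument na.rm, and complete.cases() marks the rows that na.exclude() would keep. A short sketch with made-up objects v_demo and df_demo:

```r
# Hypothetical example data
v_demo <- c(NA, 2, 4, NA)
df_demo <- data.frame(b1 = 1:4, b2 = c(NA, 3, 3, NA))

mean(v_demo)                 # NA, because missing values propagate
mean(v_demo, na.rm = TRUE)   # Mean of the present values only
sum(is.na(v_demo))           # Number of missing values
complete.cases(df_demo)      # TRUE for rows without any NA
df_complete <- df_demo[complete.cases(df_demo), ]
df_complete
```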

Input and output of data

As already seen, entering data by hand is tedious in R. Thus, we will mostly import data from other sources, for example tables from OpenOffice or Excel, text files containing measurement data, output data from other programs etc. We start with a data set in .txt format, which we will create using the built-in editor in RStudio. Open a new .txt file by clicking on the button marked with \(+\) in the upper left corner of the GUI and choosing a new text file. Type in:

height;weight;sex
170;65;F
180;80;M
177;81;M

Note here that we have a header containing the column names and semicolons which separate the values (imagine this being some randomly generated patient data). Save the file as data2.txt in your working directory. We will now import this data set as a data frame in R. This can be done either by clicking on Import Dataset in the upper right corner of RStudio (and specifying the header, the separation symbol etc.) or by using the command read.csv() with the right parameters (learn about it by reading the manual with ?read.csv).

data2 <- read.csv("data2.txt", sep = ";")

Note that the argument header has the default value TRUE, meaning that the first row of the file is read as the header containing the column names (note also that read.csv() expects a point . as the decimal symbol). A similar command is read.table(), which just has different default values for the arguments (for example header = FALSE). To save data and objects produced in R permanently, there are several possibilities. Text files can be produced by write.table(), which takes arguments similar to those of read.table() (such as sep and dec), although its defaults differ: row names are written unless you set row.names = FALSE. You only have to specify the object and the output file. Here’s an example:

write.table(data1, "data1.txt", sep = ";") # Separation symbol ;

You can look at the output file in the built-in editor of RStudio. Note that we don’t use write.csv() here since it fixes some arguments, such as sep and dec, and ignores attempts to change them (it’s made this way to enable problem-free export to Excel, so use it for that purpose).
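
To check that exported data can be read back unchanged, a round trip through a file is useful. The sketch below writes to a temporary file so that nothing in your working directory is overwritten; df_demo and demo_file are names invented for this example.

```r
# Hypothetical example data
df_demo <- data.frame(height = c(170, 180, 177), weight = c(65, 80, 81),
                      sex = c("F", "M", "M"))
demo_file <- tempfile(fileext = ".txt")   # Temporary file for the demonstration

# row.names = FALSE omits the row numbers, quote = FALSE omits the quotation marks
write.table(df_demo, demo_file, sep = ";", row.names = FALSE, quote = FALSE)
readLines(demo_file)                      # Shows the raw text that was written

# read.table() needs header = TRUE, read.csv() has it as default
back1 <- read.table(demo_file, header = TRUE, sep = ";")
back2 <- read.csv(demo_file, sep = ";")
all.equal(back1, back2)                   # Compares the two imported data frames
```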

To save an R object as an R object, you can use the command save(). For example, we write \(v1\) and \(v2\) to a binary file by passing all objects to be saved as arguments and specifying the file to write to. Since the written file is binary, you can’t read it in a text editor; load it back into R with load().

save(v1, v2, file = "vector")
v1 <- 0                      # Change v1
v2 <- 0                      # Change v2
load("vector")               # Load the previous definitions of the vectors
v1
v2

The workspace including all defined variables can be saved by save.image("file_name") and loaded by load("file_name").
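
The following self-contained sketch repeats the save() and load() cycle with a temporary file. The objects u1 and u2 are hypothetical stand-ins, chosen so that none of your own objects are touched.

```r
demo_file <- tempfile(fileext = ".RData")  # Temporary file for the demonstration
u1 <- c(1, 2, 3)                           # Hypothetical example objects
u2 <- c("a", "b")

save(u1, u2, file = demo_file)
rm(u1, u2)                                 # Remove both objects from the workspace
exists("u1")                               # FALSE if no other object called u1 exists
restored <- load(demo_file)                # load() returns the names it restored
restored
u1
```

Note that load() silently overwrites existing objects of the same name, so it is worth checking the returned names.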

Now that we have learned how to import and export data, let us look into the data itself. It’s essential to inspect any imported data before analysis. Here are four key functions for doing so (head(), tail(), summary() and str()), applied to tuberculosis metadata from Mozambique:

install.packages("readxl")   #install the package necessary to read excel files, needed only once
install.packages("writexl")  #install the package necessary to write excel files, needed only once
library(readxl)              #load the package necessary to read excel files, needed every time you start RStudio
library(writexl)             #load the package necessary to write excel files, needed every time you start RStudio
Moz <- read_excel("Mozambique.xlsx", na = "NA") #import the excel file, reading "NA" cells as missing
head(Moz)                    #show first six rows of every column
head(Moz, n = 10)            #show first ten rows of every column
tail(Moz)                    #show last six rows of every column
summary(Moz)                 #show summary statistics for every column
str(Moz)                     #show the type and first values of every column
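
If the Mozambique file is not at hand, the same kind of inspection can be practised on a data set that ships with R. The sketch below uses the built-in airquality data, which is an assumption of this example and not part of the course material, and adds three further helper functions.

```r
data(airquality)            # Built-in data set with some missing values
dim(airquality)             # Number of rows and columns
names(airquality)           # Column names
head(airquality)            # First six rows
colSums(is.na(airquality))  # Number of missing values per column
```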

Exercise: Write data1 to a file data3.txt this time using tab \t as a separator. Open the file in a text editor to see the difference between it and data2.txt.

Functions

Remember what we said at the beginning. To understand computations in R, two slogans are helpful:

Everything that exists is an object.
Everything that happens is a function call.

v2 <- c("a", 1, 2, 3)
str(v2)

So if we go back to this example from before, we now know that \(v2\) is an object. str(), on the other hand, is a function that provides us with some information on \(v2\).

How to make a function?

All R functions have three parts:

the body, the code inside the function.
the formals, the list of arguments which controls how you can call the function.
the environment, the map of the location of the function’s variables.

Here we will focus on the first two components. To define a new function, we have to specify a function name, the arguments of the function and the function body. For example, we can define the function that calculates \(2*b + 2\) by:

a <- function(b) {
  2 * b + 2
  }
formals(a)
body(a)
mode(a) # shows the mode of a
a(4)    # computes the value of the function a for an input number 4

A function may have more than one argument, and the arguments don’t necessarily have to be objects of the class numeric. For example, R can also plot mathematical functions by using the function curve(). curve() has many possible arguments; type ?curve to get an overview. Note that some arguments have a predefined default value, meaning that if you don’t specify a value for such an argument, the default value is used. For starters, we will focus on the arguments:

expr: The function to plot
from: Lower bound of the \(x\)-coordinate of the plot
to: Upper bound of the \(x\)-coordinate of the plot
xlab: Label of the \(x\)-axis, can either be written as text ("text") or as a mathematical expression (using expression())
ylab: Label of the \(y\)-axis, can either be written as text ("text") or as a mathematical expression (using expression())

Here’s the command to let R plot the function \(\sin(2x)\) from \(-2\pi\) to \(2\pi\) (with labelled axes).

sin2x <- function(x) {
  sin(2 * x)   # Multiply x by 2 before taking the sine
}
curve(sin2x, from = -2 * pi, to = 2 * pi, xlab = "x",
      ylab = expression(sin(2 * x)))
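
Default values can be given to the arguments of your own functions in the same way as in curve(). A minimal sketch follows; the function sin_kx and its argument k are hypothetical names chosen for this example.

```r
# Hypothetical function with a default value for the second argument
sin_kx <- function(x, k = 2) {
  sin(k * x)
}
sin_kx(pi / 4)          # Uses the default k = 2
sin_kx(pi / 4, k = 1)   # Overrides the default

# curve() also accepts an expression written in terms of x
curve(sin_kx(x, k = 3), from = -pi, to = pi, xlab = "x", ylab = "sin(3x)")
```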

Note that R doesn’t keep track of objects defined in a function unless you force it to return their values. By default, just the last evaluated expression is returned (as seen). Using return() at the end of your function, you can specify the return values. Here’s an example:

testf <- function(x) {
  a <- 1
  c <- 2
  return(c(a, c))
  }
testf(1)
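
When a function should hand back several values of different kinds, one common option is to return a named list, so that each result can be picked out by name. The helper summarise_vec below is a made-up example of this pattern.

```r
# Hypothetical helper that returns two summary values at once
summarise_vec <- function(x) {
  list(mean = mean(x), range = max(x) - min(x))
}
res <- summarise_vec(c(2, 4, 9))
res$mean    # Mean of the input values
res$range   # Difference between largest and smallest value
```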

Exercise: Create a function that converts Fahrenheit to Celsius.

Exercise: Plot the function \(x*x\) or \(x^2\) (both will of course give the same result) for \(x\) from -100 to 100.

Plotting in R

One of the main strengths of R is its graphics. Here we will learn the basics of the plotting functions; if you want to learn more, have a look at the many online plotting tutorials, for example:

basic plotting

Mother of graphs!

For a basic scatterplot we will use one of R’s built-in datasets, since we need numerical values for a nice scatterplot.

data(iris)                                #loading a plant dataset already existing in R
class(iris)                               #let's see what type of data this is
summary(iris)                             #summary of the dataset
plot(iris$Sepal.Length, iris$Sepal.Width) #plots sepal width against sepal length
?plot                                     #shows the help page with the options of the plot function
plot(iris$Sepal.Length, iris$Petal.Length, col=iris$Species,pch=19)
plot(iris$Sepal.Length, iris$Petal.Length, col=iris$Species,pch=19, xlim=c(0, 10), ylim=c(0, 8))
plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Iris_Data", xlab="Petal_length",ylab="Petal_Width")
par(mfrow = c(3,1))
plot(iris$Sepal.Length, iris$Petal.Length, col=iris$Species,pch=19)
plot(iris$Sepal.Length, iris$Petal.Width, col=iris$Species,pch=19)
plot(iris$Sepal.Width, iris$Petal.Length, col=iris$Species,pch=19)
par(mfrow= c(1,1))
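
Coloured points are easier to read with a legend. The sketch below adds one with legend(); the colour vector species_cols is an assumption of this example, chosen to match the colours used above.

```r
data(iris)
species_cols <- c("red", "green3", "blue")     # One colour per species
plot(iris$Petal.Length, iris$Petal.Width, pch = 19,
     col = species_cols[as.integer(iris$Species)],
     xlab = "Petal length", ylab = "Petal width")
# The legend entries follow the order of the factor levels
legend("topleft", legend = levels(iris$Species), col = species_cols, pch = 19)
```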

There are many different plotting functions in R, for example hist(), which produces a histogram, barplot() and boxplot(). Let’s explore the Mozambique data with their help.

hist(Moz$Age, breaks = 20, col="lightblue", main="Age distribution",
     xlab="Patient age", border="black" ) #using a histogram to see the patient age distribution
boxplot(Age ~ Res_Summary, data = Moz, col = c("lightblue", "green", "pink", "red", "orange","darkblue"), 
        main="Resistance status by Patient age", xlab="Resistance", ylab="Age")
barplot(table(Moz$Lineage), col=c("red", "blue", "green","orange"), main="MTB lineage",
        xlab="Lineage", ylab="Count")
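
The three plot types can also be tried without the Mozambique file. The following sketch uses the built-in mtcars data set instead, which is an assumption of this example and unrelated to the tuberculosis data.

```r
data(mtcars)                                  # Built-in data set on car models
hist(mtcars$mpg, breaks = 10, col = "lightblue",
     main = "Fuel consumption", xlab = "Miles per gallon")
boxplot(mpg ~ cyl, data = mtcars,             # One box per number of cylinders
        xlab = "Cylinders", ylab = "Miles per gallon")
cyl_counts <- table(mtcars$cyl)               # Counts per category
barplot(cyl_counts, xlab = "Cylinders", ylab = "Count")
```

barplot() expects the heights of the bars, which is why the raw column is first counted with table().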

You can also plot linear models and, if you are brave, delve deep into regression. To fit a linear model to a data set, we first plot the data with the plot() function and then specify the linear model we want to fit.

x <- c(1,2,3,4,5)
y <- c(1.6,4,6.5,7.5,10)
plot(x,y)

Here, you can again add graphical arguments to plot. Now we want to add a regression line. We define the regression of y on x as:

reg <- lm(y~x)
reg
str(reg)
abline(reg)     #draws the regression line into the plot
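
The fitted object holds more than the line itself. As a short sketch, the standard extractor functions below pull out the coefficients, the goodness of fit and a prediction for a new value of x (the value 6 is chosen arbitrarily).

```r
x <- c(1, 2, 3, 4, 5)
y <- c(1.6, 4, 6.5, 7.5, 10)
reg <- lm(y ~ x)

coef(reg)                                  # Intercept and slope
summary(reg)$r.squared                     # Proportion of variance explained
predict(reg, newdata = data.frame(x = 6))  # Fitted value at a new x
```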

We could spend hours talking about how to modify our plots and graphs in R. In the end, the best way is to read the help files for specific functions, follow online tutorials and browse for the code you need. Whatever you want to do has probably already been done by someone; don’t forget that. For example, pairs() produces a matrix of scatter plots.

pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

Exercise: Do a scatterplot of sepal length and sepal width from the iris data set, use different symbols to represent the sample.

Bonus Exercise: Plot a histogram of petal length for each species separately (remember subset()). Plot all three in one image in one row.