9 Handling data

Base R refers to functions that come with an installation of R, not those that are in packages that are must be installed. We will learn how to handle data both using the base R functions, and using the package dplyr.

The dplyr package is part of the tidyverse, which you can learn more about from watching this video by SHARCNET (High Performance Computing Ontario).

9.1 Base R

9.1.1 Useful functions

max() outputs the maximum value in a vector (recall, a vector is a list of numbers, e.g., c(0,2,5)), matrix, or data frame;
min() outputs the minimum value in a vector, matrix, or data frame;
length() outputs the total number of elements, in a vector, matrix, or data frame;
sum() adds or takes the sum of all values in a vector, matrix, or data frame;
mean() calculates the mean of all values in a vector or matrix;
colMeans() calculates the mean of a column in a matrix or data frame.
head() prints the first 6 rows of the argument to your console. This can be helpful if you want to check your data without having to sort through a long output.
tail() similar to head() but prints the last 6 rows of the argument.

TRY IT! In the Console do some tests on these functions by creating a test matrix and a test data frame:

test = data.frame(col1 = c(0,2), col2 = c(5,6))
test2 = matrix(c(2,3),c(5,1),nrow=2)
mean(test)

## Warning in mean.default(test): argument is not numeric or logical: returning NA

## [1] NA

colMeans(test)

## col1 col2 
##  1.0  5.5

mean(test2)

## [1] 2.5

colMeans(test2)

## [1] 2.5 2.5 2.5 2.5 2.5

Which command generates an error? Why? How is mean() different to colMeans()? Is the length(), when applied to a data frame or matrix, equal to the number of rows?

9.1.2 Extracting rows and columns

After we have loaded/imported our data we want to be able to work with it. Typical operations are to extract specific columns, rows, or values, and to add new columns and rows. Recall that columns run top to bottom, while rows run left to right.

In addition, columns will sometimes have names:

names(data)

## [1] "site"       "experiment" "length"     "width"      "height"

To extract a column, we use the symbol $ followed by the name of the column (in this case site):

  data$site

##  [1] 1 1 1 2 2 2 3 3 3 4 4 4

If the columns do not have names you can specify the column you require by using data[,n], where n is the number of column you require (beginning from the left). When no number precedes the ,n this means all rows:

data[,1]

##  [1] 1 1 1 2 2 2 3 3 3 4 4 4

To extract row n, and all the columns, we use data[n,]. Row 1 (all columns) is extracted as:

  data[1,]

##   site experiment length width height
## 1    1          1    2.2   1.3    9.6

If we want one specific value, data[n,m] returns the value in row n and column m:

  data[1,2]

## [1] 1

Furthermore, data[,n:m] returns all the rows for columns n to m:

  data[2:4,]

##   site experiment length width height
## 2    1          2    2.1   2.2    7.6
## 3    1          3    2.7   1.5    2.2
## 4    2          1    3.0   4.5    1.5

while data[n:m,] returns all the columns for rows n to m:

  data[,2:3]

##    experiment length
## 1           1    2.2
## 2           2    2.1
## 3           3    2.7
## 4           1    3.0
## 5           2    3.1
## 6           3    2.5
## 7           1    1.9
## 8           2    1.1
## 9           3    3.5
## 10          1    2.9
## 11          2    4.5
## 12          3    1.2

Note that in the two examples above n must be bigger than m. This method is called slicing.

This are simple commands to work with data, but as data scientists frequently want to work with data in this way, the dplyr package was created to handle data more efficiently.

9.2 dplyr

We recommend using dplyr to handle data as the commands are concise, making the code easier to read and write. Install and load the dplyr R package.

The dplyr package contains several different function that are very useful when manipulating and cleaning large datasets. We will be highlighting the use of the select(), mutate(), length(), arrange(), and filter().

9.2.1 Selecting a column

The command select(data, length) selects the length column of the dataframe data:

select(data, length)

##    length
## 1     2.2
## 2     2.1
## 3     2.7
## 4     3.0
## 5     3.1
## 6     2.5
## 7     1.9
## 8     1.1
## 9     3.5
## 10    2.9
## 11    4.5
## 12    1.2

This achieves the same result as the data$length command we learned in the previous section.

9.2.2 Adding a new column

There are several different methods to add new columns to a data set in R, but when using the dplyr package, the function mutate() is appropriate. Below we create a new column called site_double that is equal to the site column times 2.

data2 = mutate(data, site_double = site * 2)

We can see that data2 has a new column called site_double.

9.2.3 Sorting the data

The arrange() function orders the rows of a data frame by the values in a specific column. Below we arrange data by increasing order of the width column.:

arrange(data, width)

##    site experiment length width height
## 1     3          2    1.1   0.5    2.3
## 2     1          1    2.2   1.3    9.6
## 3     1          3    2.7   1.5    2.2
## 4     3          1    1.9   1.8    4.5
## 5     3          3    3.5   2.0    7.5
## 6     1          2    2.1   2.2    7.6
## 7     4          1    2.9   2.7    3.2
## 8     2          3    2.5   2.8    3.0
## 9     2          2    3.1   3.1    4.0
## 10    2          1    3.0   4.5    1.5
## 11    4          2    4.5   4.8    6.5
## 12    4          3    1.2    NA    2.7

9.2.4 Filtering data

The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. For a row to be retained, TRUE must be produced for all conditions. To extract all rows of data with length greater than 7 cm:

filter(data, height > 7)

##   site experiment length width height
## 1    1          1    2.2   1.3    9.6
## 2    1          2    2.1   2.2    7.6
## 3    3          3    3.5   2.0    7.5

We can also filter with two (or more) conditions. We use & (meaning and) if we require both conditions to be satisfied, or | (meaning or) if we require at least one condition is satisfied. Below, we filter the data so that data3 contains only plants that have length greater than 8 cm and width greater than 5 cm.

data3=filter(data, height >8 & width > 5)

However, no plants met both conditions so data3 is empty.

We can filter to retain plants from Site 1 or Site 3:

filter(data, site == 1 |site == 3)

##   site experiment length width height
## 1    1          1    2.2   1.3    9.6
## 2    1          2    2.1   2.2    7.6
## 3    1          3    2.7   1.5    2.2
## 4    3          1    1.9   1.8    4.5
## 5    3          2    1.1   0.5    2.3
## 6    3          3    3.5   2.0    7.5

In R, == evaluates whether something is TRUE or FALSE, while = assigns the value on the right hand side to the variable on the left hand side. TRY IT In the Console try the following:

# Assigns x the value 0
x = 0
# Evaluates TRUE if x = 0 (TRUE since we assigned x=0 on the above line of code)
x == 0

## [1] TRUE

# Evalues TRUE if x = 1 (FALSE since we assigned x=0 on the first line of code)
x == 1

## [1] FALSE

Therefore, when we write site == 1, TRUE is returned when the value of site is 1, and these rows are retained in the output. If we wrote site = 1, then R would try to assign the variable site with the value 1, which would return an error message since the second argument of the filter() function requires a logical statement that evaluates as TRUE or FALSE and site = 1 assigns a value; it is not a logical statement.

9.2.5 Removing NA entries in data

Often experimental data is not as neat or pretty as example data. There is often missing data, or empty or NA entries which may cause issues in graphing or data analysis. R does not allow empty values, so empty values in .csv or .xlsx files are often converted to NA when these data are uploaded into R. Luckily, again the filter function has a solution to this! We can filter the data to remove rows with NULL or NA entries.

#filter the data to remove rows with null values in the height column
filter(data, !is.na(width))

##    site experiment length width height
## 1     1          1    2.2   1.3    9.6
## 2     1          2    2.1   2.2    7.6
## 3     1          3    2.7   1.5    2.2
## 4     2          1    3.0   4.5    1.5
## 5     2          2    3.1   3.1    4.0
## 6     2          3    2.5   2.8    3.0
## 7     3          1    1.9   1.8    4.5
## 8     3          2    1.1   0.5    2.3
## 9     3          3    3.5   2.0    7.5
## 10    4          1    2.9   2.7    3.2
## 11    4          2    4.5   4.8    6.5

Observe that row 12 of data has now been removed because it contained NA in the width column. To understand this command, note that is.na(x) returns TRUE if x = NA. The ! means ‘not’ and reverses the TRUE and FALSE outputs. We illustrate ! below:

# 1 equal to 2 (FALSE)
1 == 2

## [1] FALSE

# 1 is not equal to 2 (TRUE)
1 != 2

## [1] TRUE

Recall, that filter() retains the rows that evaluate as TRUE: we wish to remove the rows with NA, so we need to write a command that evaluates these to FALSE. Therefore, we use !is.na().

9.2.6 Grouping

The functiongroup_by() can be used with summarize() and tally(). Recall that our data consists of different sites:

head(data)

##   site experiment length width height
## 1    1          1    2.2   1.3    9.6
## 2    1          2    2.1   2.2    7.6
## 3    1          3    2.7   1.5    2.2
## 4    2          1    3.0   4.5    1.5
## 5    2          2    3.1   3.1    4.0
## 6    2          3    2.5   2.8    3.0

The code below calculates the mean length for each site (because we group data by site):

by_site <- group_by(data,site)
data2 <- summarize(by_site,length=mean(length))
print(data2)

## # A tibble: 4 × 2
##    site length
##   <int>  <dbl>
## 1     1   2.33
## 2     2   2.87
## 3     3   2.17
## 4     4   2.87

Applying the function group_by() doesn’t make the data noticeably different aside from noting the grouping:

by_site <- group_by(data,site)
head(print(by_site))

## # A tibble: 12 × 5
## # Groups:   site [4]
##     site experiment length width height
##    <int>      <int>  <dbl> <dbl>  <dbl>
##  1     1          1    2.2   1.3    9.6
##  2     1          2    2.1   2.2    7.6
##  3     1          3    2.7   1.5    2.2
##  4     2          1    3     4.5    1.5
##  5     2          2    3.1   3.1    4  
##  6     2          3    2.5   2.8    3  
##  7     3          1    1.9   1.8    4.5
##  8     3          2    1.1   0.5    2.3
##  9     3          3    3.5   2      7.5
## 10     4          1    2.9   2.7    3.2
## 11     4          2    4.5   4.8    6.5
## 12     4          3    1.2  NA      2.7

## # A tibble: 6 × 5
## # Groups:   site [2]
##    site experiment length width height
##   <int>      <int>  <dbl> <dbl>  <dbl>
## 1     1          1    2.2   1.3    9.6
## 2     1          2    2.1   2.2    7.6
## 3     1          3    2.7   1.5    2.2
## 4     2          1    3     4.5    1.5
## 5     2          2    3.1   3.1    4  
## 6     2          3    2.5   2.8    3

The first argument, x, of summarize(x,y) is the grouped data, and the second argument, y, has the structure name of the new column = function to be applied to(column of grouped data). Recalling that the function min() calculates the minimum of its argument, we might also calculate the minimum width for each of the sites:

summarize(by_site, min.width = min(width))

## # A tibble: 4 × 2
##    site min.width
##   <int>     <dbl>
## 1     1       1.3
## 2     2       2.8
## 3     3       0.5
## 4     4      NA

The function tally() counts the number of records in each group.

tally(by_site)

## # A tibble: 4 × 2
##    site     n
##   <int> <int>
## 1     1     3
## 2     2     3
## 3     3     3
## 4     4     3

Equivalently, we could apply the function length() using summarize():

summarize(by_site,n=length(site))

## # A tibble: 4 × 2
##    site     n
##   <int> <int>
## 1     1     3
## 2     2     3
## 3     3     3
## 4     4     3

9.2.7 Joining

A good description of joining data sets using inner_join() is provided here.

9.2.8 Pipes

The pipe function, %>%, allows for compact coding when the output on the lefthand size of %>% is the input for the calculations performed on the righthand size of %>%. Consider:

data2 <- data %>%
         filter(!is.na(width)) %>%
         mutate(height_cm=height*100) %>%
         select(site, experiment, height_cm)
head(data2)

##   site experiment height_cm
## 1    1          1       960
## 2    1          2       760
## 3    1          3       220
## 4    2          1       150
## 5    2          2       400
## 6    2          3       300

We end up with a new dataframe, data2, that is data with three operations sequentially performed. The advantage of using %>% is the compactness and clarity of the code to achieve this result. Let’s understand these commands further.

data2 is the name of the output when we are finished all our operations;
data is the data that we start with;
filter(!is.na(width)) removes all rows with NA in the width column of data. In the previous sections, when we demonstarted filter(x,y), the first argument, x, was a dataframe, and the second argument was a logical statement, i.e., !is.na(width). Above, when %>% is used only the logical statement is provided and the data that the function applies to is supplied on the left hand side of %>%.
mutate(height_cm=height*100) converts height in meters to centimeters, and this operation is applied to all the data that passed through the pipe from the previous operation where NAs in the width column were removed.
select(site, experiment, height_cm) is used to output only columns of interest, and again, the data argument to the select() function is supplied by the lefthand size of the pipe.

Note that the pipes will not work if the output of one operation is of the wrong type to be the input to the next operation.

To understand the value of pipes, consider the equivalent code written without using pipes:

data2 <- filter(data,!is.na(width))
data3 <- mutate(data2,height_cm=height*100)
data4 <- select(data3, site, experiment, height_cm)
head(data4)

##   site experiment height_cm
## 1    1          1       960
## 2    1          2       760
## 3    1          3       220
## 4    2          1       150
## 5    2          2       400
## 6    2          3       300

Without using pipes, we create variables data2 and data3 that only serve as intermediaries, that we may never use again in our subsequent lines of code. Creating many variables can bog down both the computer memory and your own memory, when you are trying to remember the variable name you gave to a particular output when writing subsequent commands.

9.3 Reshaping Data

Data frequently come in formats that are unsuitable for analysis. You would be required to do modifications on the data in some way. The process of cleaning or modifying the data to make it ready for analysis is known as tidying, or munging. The goal is generally to get the data into a tidy format. You often get data where items that should be in the rows are in the columns or vice versa yet it’s not in a transpose form. Some plotting functions in R have a special way the data should be formatted before it can produce the required plot. In these circumstances, you would have to reshape the data into a format that is easier to work with. This section considers the various ways we can format data into a wide or long format. Depending on your goal, you may want the data frame to be in one of these specific formats. We hereby going to learn how to reshape data into one of these formats.

9.3.1 Reshaping data from wide to long format

We use the stack method in base R to transform data available in the form of separate columns within a data frame or a list into a single column. The stack method uses the two functions cbind() and stack() produces a result in the form of a data frame with two columns. We will demonstrate how this method is done by creating a sample data frame as follows and following through the lines of code below.

# creating a data frame
data <- data.frame(name=c('Josh', 'Josh', 'Jake',
                          'Jake', 'Gorge', 'Gorge'),
                   semester=c(rep(LETTERS[1:2],3)),
                   quiz_BIOL1001=c(1, 3, 4, 8, 9, 1),
                   quiz_BIOL1002=c(2, 4, 6, 2, 7, 3))

data

##    name semester quiz_BIOL1001 quiz_BIOL1002
## 1  Josh        A             1             2
## 2  Josh        B             3             4
## 3  Jake        A             4             6
## 4  Jake        B             8             2
## 5 Gorge        A             9             7
## 6 Gorge        B             1             3

We would now bind the first two columns of the data frame named data and stake the third and fourth columns and save it in a new name called data_long.

data_long <- cbind(data[1:2], stack(data[3:4]))

You can query the the new data created in the console to see the content.

data_long

##     name semester values           ind
## 1   Josh        A      1 quiz_BIOL1001
## 2   Josh        B      3 quiz_BIOL1001
## 3   Jake        A      4 quiz_BIOL1001
## 4   Jake        B      8 quiz_BIOL1001
## 5  Gorge        A      9 quiz_BIOL1001
## 6  Gorge        B      1 quiz_BIOL1001
## 7   Josh        A      2 quiz_BIOL1002
## 8   Josh        B      4 quiz_BIOL1002
## 9   Jake        A      6 quiz_BIOL1002
## 10  Jake        B      2 quiz_BIOL1002
## 11 Gorge        A      7 quiz_BIOL1002
## 12 Gorge        B      3 quiz_BIOL1002

Using melt method in reshape2 package in R. As we already know, we will have to install and load the reshape2 package before it can be used in R if not already done.

#install.packages("reshape2")
library(reshape2)

Now we will see how the melt method in this package can be used to stack data frame columns together by going through an example. You can read more about the syntax for the melt() function by typing ?melt in the console. We would use the data frame named data for this ecample.

# binding the first two columns as it is and stacking the third and fourth columns
data_melt <- reshape2::melt(data, id.var = c('name', 'semester'),
                           variable.name = 'course_name')
data_melt

##     name semester   course_name value
## 1   Josh        A quiz_BIOL1001     1
## 2   Josh        B quiz_BIOL1001     3
## 3   Jake        A quiz_BIOL1001     4
## 4   Jake        B quiz_BIOL1001     8
## 5  Gorge        A quiz_BIOL1001     9
## 6  Gorge        B quiz_BIOL1001     1
## 7   Josh        A quiz_BIOL1002     2
## 8   Josh        B quiz_BIOL1002     4
## 9   Jake        A quiz_BIOL1002     6
## 10  Jake        B quiz_BIOL1002     2
## 11 Gorge        A quiz_BIOL1002     7
## 12 Gorge        B quiz_BIOL1002     3

Here is another example on how to use the melt() function in the reshape2 package to reshape data.

#create data frame
data2 <- data.frame(person=c('A', 'A', 'B', 'B', 'C', 'C'),
                   trial=c(1, 2, 1, 2, 1, 2),
                   outcome1=c(7, 6, 6, 5, 4, 4),
                   outcome2=c(4, 4, 5, 5, 3, 2))
data2

##   person trial outcome1 outcome2
## 1      A     1        7        4
## 2      A     2        6        4
## 3      B     1        6        5
## 4      B     2        5        5
## 5      C     1        4        3
## 6      C     2        4        2

Reshape data2 from the wide format to a long format by staking the third and fourth columns.

#melt columns of data frame
data2_long <- melt(data2, id.var = c('person', 'trial'), variable.name = 'outcome')
data2_long

##    person trial  outcome value
## 1       A     1 outcome1     7
## 2       A     2 outcome1     6
## 3       B     1 outcome1     6
## 4       B     2 outcome1     5
## 5       C     1 outcome1     4
## 6       C     2 outcome1     4
## 7       A     1 outcome2     4
## 8       A     2 outcome2     4
## 9       B     1 outcome2     5
## 10      B     2 outcome2     5
## 11      C     1 outcome2     3
## 12      C     2 outcome2     2

Using the tidyr package. The tidyr package make use of the following functions:

pivot_longer(): Reshapes a data frame from wide to long format.

pivot_wider(): Reshapes a data frame from long to wide format.

We will study how to use pivot_longer() function in the next example and study how to use pivot_wider() function in the section that follows.

To use these functions, we would be required to run install.packages("tidyr") to get the package installed if you have not done that previously. Also remember to load the package. I would also use the pipe operator, %>% in the code. %>% is not required to use tidyr, but it does make things easier. To use the pipe operator, you must also install dplyr package.

# I have "commented out" these two lines of code with the # sign. Just take away the # and run the lines. 
#install.packages("tidyr") 
#install.packages("dplyr")

library(tidyr)
library(dplyr)

We pivot the data frame into a long format. Note that we use the pipe operator to make the work much easier. You can read more about how to use the pipe operator from subsection 8.2.8 of the Quantitative Guide.

data %>% pivot_longer(cols=c('quiz_BIOL1001', 'quiz_BIOL1002'),
                    names_to='quiz',
                    values_to='marks')

## # A tibble: 12 × 4
##    name  semester quiz          marks
##    <chr> <chr>    <chr>         <dbl>
##  1 Josh  A        quiz_BIOL1001     1
##  2 Josh  A        quiz_BIOL1002     2
##  3 Josh  B        quiz_BIOL1001     3
##  4 Josh  B        quiz_BIOL1002     4
##  5 Jake  A        quiz_BIOL1001     4
##  6 Jake  A        quiz_BIOL1002     6
##  7 Jake  B        quiz_BIOL1001     8
##  8 Jake  B        quiz_BIOL1002     2
##  9 Gorge A        quiz_BIOL1001     9
## 10 Gorge A        quiz_BIOL1002     7
## 11 Gorge B        quiz_BIOL1001     1
## 12 Gorge B        quiz_BIOL1002     3

Another example with data2 data frame using the pivot_longer() function in the tidyr package.

#pivot the data frame into a long format.
data2 %>% pivot_longer(cols=c('outcome1', 'outcome2'),
                    names_to='outcomes',
                    values_to='values')

## # A tibble: 12 × 4
##    person trial outcomes values
##    <chr>  <dbl> <chr>     <dbl>
##  1 A          1 outcome1      7
##  2 A          1 outcome2      4
##  3 A          2 outcome1      6
##  4 A          2 outcome2      4
##  5 B          1 outcome1      6
##  6 B          1 outcome2      5
##  7 B          2 outcome1      5
##  8 B          2 outcome2      5
##  9 C          1 outcome1      4
## 10 C          1 outcome2      3
## 11 C          2 outcome1      4
## 12 C          2 outcome2      2

Another function of tidyr that can be used to transform the data from wide to long is the gather() function. We again show how to use this function.

df_long <- data %>% gather(quiz, marks, quiz_BIOL1001:quiz_BIOL1002)
df_long

##     name semester          quiz marks
## 1   Josh        A quiz_BIOL1001     1
## 2   Josh        B quiz_BIOL1001     3
## 3   Jake        A quiz_BIOL1001     4
## 4   Jake        B quiz_BIOL1001     8
## 5  Gorge        A quiz_BIOL1001     9
## 6  Gorge        B quiz_BIOL1001     1
## 7   Josh        A quiz_BIOL1002     2
## 8   Josh        B quiz_BIOL1002     4
## 9   Jake        A quiz_BIOL1002     6
## 10  Jake        B quiz_BIOL1002     2
## 11 Gorge        A quiz_BIOL1002     7
## 12 Gorge        B quiz_BIOL1002     3

9.3.2 Reshaping data from long to wide format

We will be using the tidyr package to reshape data from long to wide. Let us study how to use pivot_wider() function in the next example to pivot data frame into a wide format. We will be using the example data frame created in the previous examples.

#pivot the data frame into a wide format
df_wide <- df_long %>% pivot_wider(names_from = quiz, values_from = marks)
df_wide

## # A tibble: 6 × 4
##   name  semester quiz_BIOL1001 quiz_BIOL1002
##   <chr> <chr>            <dbl>         <dbl>
## 1 Josh  A                    1             2
## 2 Josh  B                    3             4
## 3 Jake  A                    4             6
## 4 Jake  B                    8             2
## 5 Gorge A                    9             7
## 6 Gorge B                    1             3

Another function of tidyr that can be used to transform the data from long back to wide is the spread() function. We show how to use this function.

df_wide2 <- df_long %>% spread(quiz, marks)
df_wide2

##    name semester quiz_BIOL1001 quiz_BIOL1002
## 1 Gorge        A             9             7
## 2 Gorge        B             1             3
## 3  Jake        A             4             6
## 4  Jake        B             8             2
## 5  Josh        A             1             2
## 6  Josh        B             3             4

We show another example of using the spread() function to transform back data2 from long to wide.

data2_long %>% spread(outcome, value)

##   person trial outcome1 outcome2
## 1      A     1        7        4
## 2      A     2        6        4
## 3      B     1        6        5
## 4      B     2        5        5
## 5      C     1        4        3
## 6      C     2        4        2

There are many other functions to reshape data in R but these are the few handy ones that you can always stick to in your data analysis and plotting.