9 Handling data
Base R refers to functions that come with an installation of R
, not those that are in packages that are must be installed. We will learn how to handle data both using the base R functions, and using the package dplyr
.
The dplyr
package is part of the tidyverse
, which you can learn more about from watching this video by SHARCNET (High Performance Computing Ontario).
9.1 Base R
9.1.1 Useful functions
max()
outputs the maximum value in a vector (recall, a vector is a list of numbers, e.g.,c(0,2,5)
), matrix, or data frame;min()
outputs the minimum value in a vector, matrix, or data frame;length()
outputs the total number of elements, in a vector, matrix, or data frame;sum()
adds or takes the sum of all values in a vector, matrix, or data frame;mean()
calculates the mean of all values in a vector or matrix;colMeans()
calculates the mean of a column in a matrix or data frame.head()
prints the first 6 rows of the argument to your console. This can be helpful if you want to check your data without having to sort through a long output.tail()
similar tohead()
but prints the last 6 rows of the argument.
TRY IT! In the Console
do some tests on these functions by creating a test matrix and a test data frame:
## Warning in mean.default(test): argument is not numeric or logical: returning NA
## [1] NA
## col1 col2
## 1.0 5.5
## [1] 2.5
## [1] 2.5 2.5 2.5 2.5 2.5
Which command generates an error? Why? How is mean()
different to colMeans()
? Is the length()
, when applied to a data frame or matrix, equal to the number of rows?
9.1.2 Extracting rows and columns
After we have loaded/imported our data we want to be able to work with it. Typical operations are to extract specific columns, rows, or values, and to add new columns and rows. Recall that columns run top to bottom, while rows run left to right.
In addition, columns will sometimes have names:
## [1] "site" "experiment" "length" "width" "height"
To extract a column, we use the symbol $
followed by the name of the column (in this case site
):
## [1] 1 1 1 2 2 2 3 3 3 4 4 4
If the columns do not have names you can specify the column you require by using data[,n]
, where n
is the number of column you require (beginning from the left). When no number precedes the ,n
this means all rows:
## [1] 1 1 1 2 2 2 3 3 3 4 4 4
To extract row n
, and all the columns, we use data[n,]
. Row 1 (all columns) is extracted as:
## site experiment length width height
## 1 1 1 2.2 1.3 9.6
If we want one specific value, data[n,m]
returns the value in row n
and column m
:
## [1] 1
Furthermore, data[,n:m]
returns all the rows for columns n
to m
:
## site experiment length width height
## 2 1 2 2.1 2.2 7.6
## 3 1 3 2.7 1.5 2.2
## 4 2 1 3.0 4.5 1.5
while data[n:m,]
returns all the columns for rows n
to m
:
## experiment length
## 1 1 2.2
## 2 2 2.1
## 3 3 2.7
## 4 1 3.0
## 5 2 3.1
## 6 3 2.5
## 7 1 1.9
## 8 2 1.1
## 9 3 3.5
## 10 1 2.9
## 11 2 4.5
## 12 3 1.2
Note that in the two examples above n
must be bigger than m
. This method is called slicing.
This are simple commands to work with data, but as data scientists frequently want to work with data in this way, the dplyr
package was created to handle data more efficiently.
9.2 dplyr
We recommend using dplyr
to handle data as the commands are concise, making the code easier to read and write. Install and load the dplyr
R package.
The dplyr
package contains several different function that are very useful when manipulating and cleaning large datasets. We will be highlighting the use of the select()
, mutate()
, length()
, arrange()
, and filter()
.
9.2.1 Selecting a column
The command select(data, length)
selects the length
column of the dataframe data
:
## length
## 1 2.2
## 2 2.1
## 3 2.7
## 4 3.0
## 5 3.1
## 6 2.5
## 7 1.9
## 8 1.1
## 9 3.5
## 10 2.9
## 11 4.5
## 12 1.2
This achieves the same result as the data$length
command we learned in the previous section.
9.2.2 Adding a new column
There are several different methods to add new columns to a data set in R, but when using the dplyr
package, the function mutate()
is appropriate. Below we create a new column called site_double
that is equal to the site
column times 2.
We can see that data2
has a new column called site_double
.
9.2.3 Sorting the data
The arrange()
function orders the rows of a data frame by the values in a specific column. Below we arrange data
by increasing order of the width
column.:
## site experiment length width height
## 1 3 2 1.1 0.5 2.3
## 2 1 1 2.2 1.3 9.6
## 3 1 3 2.7 1.5 2.2
## 4 3 1 1.9 1.8 4.5
## 5 3 3 3.5 2.0 7.5
## 6 1 2 2.1 2.2 7.6
## 7 4 1 2.9 2.7 3.2
## 8 2 3 2.5 2.8 3.0
## 9 2 2 3.1 3.1 4.0
## 10 2 1 3.0 4.5 1.5
## 11 4 2 4.5 4.8 6.5
## 12 4 3 1.2 NA 2.7
9.2.4 Filtering data
The filter()
function is used to subset a data frame, retaining all rows that satisfy your conditions. For a row to be retained, TRUE
must be produced for all conditions. To extract all rows of data
with length greater than 7 cm:
## site experiment length width height
## 1 1 1 2.2 1.3 9.6
## 2 1 2 2.1 2.2 7.6
## 3 3 3 3.5 2.0 7.5
We can also filter with two (or more) conditions. We use &
(meaning and) if we require both conditions to be satisfied, or |
(meaning or) if we require at least one condition is satisfied. Below, we filter the data
so that data3
contains only plants that have length greater than 8 cm and width greater than 5 cm.
However, no plants met both conditions so data3
is empty.
We can filter to retain plants from Site 1 or Site 3:
## site experiment length width height
## 1 1 1 2.2 1.3 9.6
## 2 1 2 2.1 2.2 7.6
## 3 1 3 2.7 1.5 2.2
## 4 3 1 1.9 1.8 4.5
## 5 3 2 1.1 0.5 2.3
## 6 3 3 3.5 2.0 7.5
In R, ==
evaluates whether something is TRUE
or FALSE
, while =
assigns the value on the right hand side to the variable on the left hand side. TRY IT In the Console
try the following:
# Assigns x the value 0
x = 0
# Evaluates TRUE if x = 0 (TRUE since we assigned x=0 on the above line of code)
x == 0
## [1] TRUE
## [1] FALSE
Therefore, when we write site == 1
, TRUE
is returned when the value of site
is 1
, and these rows are retained in the output. If we wrote site = 1
, then R
would try to assign the variable site
with the value 1
, which would return an error message since the second argument of the filter()
function requires a logical statement that evaluates as TRUE
or FALSE
and site = 1
assigns a value; it is not a logical statement.
9.2.5 Removing NA entries in data
Often experimental data is not as neat or pretty as example data. There is often missing data, or empty or NA
entries which may cause issues in graphing or data analysis. R
does not allow empty values, so empty values in .csv
or .xlsx
files are often converted to NA
when these data are uploaded into R
. Luckily, again the filter
function has a solution to this! We can filter the data to remove rows with NULL
or NA
entries.
## site experiment length width height
## 1 1 1 2.2 1.3 9.6
## 2 1 2 2.1 2.2 7.6
## 3 1 3 2.7 1.5 2.2
## 4 2 1 3.0 4.5 1.5
## 5 2 2 3.1 3.1 4.0
## 6 2 3 2.5 2.8 3.0
## 7 3 1 1.9 1.8 4.5
## 8 3 2 1.1 0.5 2.3
## 9 3 3 3.5 2.0 7.5
## 10 4 1 2.9 2.7 3.2
## 11 4 2 4.5 4.8 6.5
Observe that row 12 of data
has now been removed because it contained NA
in the width
column. To understand this command, note that is.na(x)
returns TRUE
if x = NA
. The !
means ‘not’ and reverses the TRUE
and FALSE
outputs. We illustrate !
below:
## [1] FALSE
## [1] TRUE
Recall, that filter()
retains the rows that evaluate as TRUE
: we wish to remove the rows with NA
, so we need to write a command that evaluates these to FALSE
. Therefore, we use !is.na()
.
9.2.6 Grouping
The functiongroup_by()
can be used with summarize()
and tally()
. Recall that our data consists of different sites:
## site experiment length width height
## 1 1 1 2.2 1.3 9.6
## 2 1 2 2.1 2.2 7.6
## 3 1 3 2.7 1.5 2.2
## 4 2 1 3.0 4.5 1.5
## 5 2 2 3.1 3.1 4.0
## 6 2 3 2.5 2.8 3.0
The code below calculates the mean length
for each site
(because we group data
by site
):
## # A tibble: 4 × 2
## site length
## <int> <dbl>
## 1 1 2.33
## 2 2 2.87
## 3 3 2.17
## 4 4 2.87
Applying the function group_by()
doesn’t make the data noticeably different aside from noting the grouping:
## # A tibble: 12 × 5
## # Groups: site [4]
## site experiment length width height
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 1 2.2 1.3 9.6
## 2 1 2 2.1 2.2 7.6
## 3 1 3 2.7 1.5 2.2
## 4 2 1 3 4.5 1.5
## 5 2 2 3.1 3.1 4
## 6 2 3 2.5 2.8 3
## 7 3 1 1.9 1.8 4.5
## 8 3 2 1.1 0.5 2.3
## 9 3 3 3.5 2 7.5
## 10 4 1 2.9 2.7 3.2
## 11 4 2 4.5 4.8 6.5
## 12 4 3 1.2 NA 2.7
## # A tibble: 6 × 5
## # Groups: site [2]
## site experiment length width height
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 1 2.2 1.3 9.6
## 2 1 2 2.1 2.2 7.6
## 3 1 3 2.7 1.5 2.2
## 4 2 1 3 4.5 1.5
## 5 2 2 3.1 3.1 4
## 6 2 3 2.5 2.8 3
The first argument, x
, of summarize(x,y)
is the grouped data, and the second argument, y
, has the structure name of the new column
= function to be applied to(column of grouped data)
. Recalling that the function min()
calculates the minimum of its argument, we might also calculate the minimum width for each of the sites:
## # A tibble: 4 × 2
## site min.width
## <int> <dbl>
## 1 1 1.3
## 2 2 2.8
## 3 3 0.5
## 4 4 NA
The function tally()
counts the number of records in each group.
## # A tibble: 4 × 2
## site n
## <int> <int>
## 1 1 3
## 2 2 3
## 3 3 3
## 4 4 3
Equivalently, we could apply the function length()
using summarize()
:
## # A tibble: 4 × 2
## site n
## <int> <int>
## 1 1 3
## 2 2 3
## 3 3 3
## 4 4 3
9.2.7 Joining
A good description of joining data sets using inner_join()
is provided here.
9.2.8 Pipes
The pipe function, %>%
, allows for compact coding when the output on the lefthand size of %>%
is the input for the calculations performed on the righthand size of %>%
. Consider:
data2 <- data %>%
filter(!is.na(width)) %>%
mutate(height_cm=height*100) %>%
select(site, experiment, height_cm)
head(data2)
## site experiment height_cm
## 1 1 1 960
## 2 1 2 760
## 3 1 3 220
## 4 2 1 150
## 5 2 2 400
## 6 2 3 300
We end up with a new dataframe, data2
, that is data
with three operations sequentially performed. The advantage of using %>%
is the compactness and clarity of the code to achieve this result. Let’s understand these commands further.
data2
is the name of the output when we are finished all our operations;data
is the data that we start with;filter(!is.na(width))
removes all rows withNA
in thewidth
column ofdata
. In the previous sections, when we demonstartedfilter(x,y)
, the first argument,x
, was a dataframe, and the second argument was a logical statement, i.e.,!is.na(width)
. Above, when%>%
is used only the logical statement is provided and the data that the function applies to is supplied on the left hand side of%>%
.mutate(height_cm=height*100)
converts height in meters to centimeters, and this operation is applied to all the data that passed through the pipe from the previous operation whereNA
s in thewidth
column were removed.select(site, experiment, height_cm)
is used to output only columns of interest, and again, the data argument to theselect()
function is supplied by the lefthand size of the pipe.
Note that the pipes will not work if the output of one operation is of the wrong type to be the input to the next operation.
To understand the value of pipes, consider the equivalent code written without using pipes:
data2 <- filter(data,!is.na(width))
data3 <- mutate(data2,height_cm=height*100)
data4 <- select(data3, site, experiment, height_cm)
head(data4)
## site experiment height_cm
## 1 1 1 960
## 2 1 2 760
## 3 1 3 220
## 4 2 1 150
## 5 2 2 400
## 6 2 3 300
Without using pipes, we create variables data2
and data3
that only serve as intermediaries, that we may never use again in our subsequent lines of code. Creating many variables can bog down both the computer memory and your own memory, when you are trying to remember the variable name you gave to a particular output when writing subsequent commands.
9.3 Reshaping Data
Data frequently come in formats that are unsuitable for analysis. You would be required to do modifications on the data in some way. The process of cleaning or modifying the data to make it ready for analysis is known as tidying
, or munging
. The goal is generally to get the data into a tidy
format. You often get data where items that should be in the rows are in the columns or vice versa yet it’s not in a transpose form. Some plotting functions in R have a special way the data should be formatted before it can produce the required plot. In these circumstances, you would have to reshape the data into a format that is easier to work with. This section considers the various ways we can format data into a wide or long format. Depending on your goal, you may want the data frame to be in one of these specific formats. We hereby going to learn how to reshape data into one of these formats.
9.3.1 Reshaping data from wide to long format
- We use the stack method in base R to transform data available in the form of separate columns within a data frame or a list into a single column. The stack method uses the two functions
cbind()
andstack()
produces a result in the form of a data frame with two columns. We will demonstrate how this method is done by creating a sample data frame as follows and following through the lines of code below.
# creating a data frame
data <- data.frame(name=c('Josh', 'Josh', 'Jake',
'Jake', 'Gorge', 'Gorge'),
semester=c(rep(LETTERS[1:2],3)),
quiz_BIOL1001=c(1, 3, 4, 8, 9, 1),
quiz_BIOL1002=c(2, 4, 6, 2, 7, 3))
data
## name semester quiz_BIOL1001 quiz_BIOL1002
## 1 Josh A 1 2
## 2 Josh B 3 4
## 3 Jake A 4 6
## 4 Jake B 8 2
## 5 Gorge A 9 7
## 6 Gorge B 1 3
We would now bind the first two columns of the data frame named data
and stake the third and fourth columns and save it in a new name called data_long
.
You can query the the new data created in the console to see the content.
## name semester values ind
## 1 Josh A 1 quiz_BIOL1001
## 2 Josh B 3 quiz_BIOL1001
## 3 Jake A 4 quiz_BIOL1001
## 4 Jake B 8 quiz_BIOL1001
## 5 Gorge A 9 quiz_BIOL1001
## 6 Gorge B 1 quiz_BIOL1001
## 7 Josh A 2 quiz_BIOL1002
## 8 Josh B 4 quiz_BIOL1002
## 9 Jake A 6 quiz_BIOL1002
## 10 Jake B 2 quiz_BIOL1002
## 11 Gorge A 7 quiz_BIOL1002
## 12 Gorge B 3 quiz_BIOL1002
- Using melt method in
reshape2
package in R. As we already know, we will have to install and load thereshape2
package before it can be used in R if not already done.
Now we will see how the melt method in this package can be used to stack data frame columns together by going through an example. You can read more about the syntax for the melt()
function by typing ?melt in the console. We would use the data frame named data
for this ecample.
# binding the first two columns as it is and stacking the third and fourth columns
data_melt <- reshape2::melt(data, id.var = c('name', 'semester'),
variable.name = 'course_name')
data_melt
## name semester course_name value
## 1 Josh A quiz_BIOL1001 1
## 2 Josh B quiz_BIOL1001 3
## 3 Jake A quiz_BIOL1001 4
## 4 Jake B quiz_BIOL1001 8
## 5 Gorge A quiz_BIOL1001 9
## 6 Gorge B quiz_BIOL1001 1
## 7 Josh A quiz_BIOL1002 2
## 8 Josh B quiz_BIOL1002 4
## 9 Jake A quiz_BIOL1002 6
## 10 Jake B quiz_BIOL1002 2
## 11 Gorge A quiz_BIOL1002 7
## 12 Gorge B quiz_BIOL1002 3
Here is another example on how to use the melt()
function in the reshape2
package to reshape data.
#create data frame
data2 <- data.frame(person=c('A', 'A', 'B', 'B', 'C', 'C'),
trial=c(1, 2, 1, 2, 1, 2),
outcome1=c(7, 6, 6, 5, 4, 4),
outcome2=c(4, 4, 5, 5, 3, 2))
data2
## person trial outcome1 outcome2
## 1 A 1 7 4
## 2 A 2 6 4
## 3 B 1 6 5
## 4 B 2 5 5
## 5 C 1 4 3
## 6 C 2 4 2
Reshape data2
from the wide format to a long format by staking the third and fourth columns.
#melt columns of data frame
data2_long <- melt(data2, id.var = c('person', 'trial'), variable.name = 'outcome')
data2_long
## person trial outcome value
## 1 A 1 outcome1 7
## 2 A 2 outcome1 6
## 3 B 1 outcome1 6
## 4 B 2 outcome1 5
## 5 C 1 outcome1 4
## 6 C 2 outcome1 4
## 7 A 1 outcome2 4
## 8 A 2 outcome2 4
## 9 B 1 outcome2 5
## 10 B 2 outcome2 5
## 11 C 1 outcome2 3
## 12 C 2 outcome2 2
- Using the tidyr package. The
tidyr
package make use of the following functions:
pivot_longer()
: Reshapes a data frame from wide to long format.
pivot_wider()
: Reshapes a data frame from long to wide format.
We will study how to use pivot_longer()
function in the next example and study how to use pivot_wider()
function in the section that follows.
To use these functions, we would be required to run install.packages("tidyr")
to get the package installed if you have not done that previously. Also remember to load the package. I would also use the pipe operator, %>%
in the code. %>%
is not required to use tidyr
, but it does make things easier. To use the pipe operator, you must also install dplyr
package.
# I have "commented out" these two lines of code with the # sign. Just take away the # and run the lines.
#install.packages("tidyr")
#install.packages("dplyr")
library(tidyr)
library(dplyr)
We pivot the data frame into a long format. Note that we use the pipe operator to make the work much easier. You can read more about how to use the pipe operator from subsection 8.2.8 of the Quantitative Guide.
## # A tibble: 12 × 4
## name semester quiz marks
## <chr> <chr> <chr> <dbl>
## 1 Josh A quiz_BIOL1001 1
## 2 Josh A quiz_BIOL1002 2
## 3 Josh B quiz_BIOL1001 3
## 4 Josh B quiz_BIOL1002 4
## 5 Jake A quiz_BIOL1001 4
## 6 Jake A quiz_BIOL1002 6
## 7 Jake B quiz_BIOL1001 8
## 8 Jake B quiz_BIOL1002 2
## 9 Gorge A quiz_BIOL1001 9
## 10 Gorge A quiz_BIOL1002 7
## 11 Gorge B quiz_BIOL1001 1
## 12 Gorge B quiz_BIOL1002 3
Another example with data2
data frame using the pivot_longer()
function in the tidyr
package.
#pivot the data frame into a long format.
data2 %>% pivot_longer(cols=c('outcome1', 'outcome2'),
names_to='outcomes',
values_to='values')
## # A tibble: 12 × 4
## person trial outcomes values
## <chr> <dbl> <chr> <dbl>
## 1 A 1 outcome1 7
## 2 A 1 outcome2 4
## 3 A 2 outcome1 6
## 4 A 2 outcome2 4
## 5 B 1 outcome1 6
## 6 B 1 outcome2 5
## 7 B 2 outcome1 5
## 8 B 2 outcome2 5
## 9 C 1 outcome1 4
## 10 C 1 outcome2 3
## 11 C 2 outcome1 4
## 12 C 2 outcome2 2
- Another function of
tidyr
that can be used to transform the data from wide to long is thegather()
function. We again show how to use this function.
## name semester quiz marks
## 1 Josh A quiz_BIOL1001 1
## 2 Josh B quiz_BIOL1001 3
## 3 Jake A quiz_BIOL1001 4
## 4 Jake B quiz_BIOL1001 8
## 5 Gorge A quiz_BIOL1001 9
## 6 Gorge B quiz_BIOL1001 1
## 7 Josh A quiz_BIOL1002 2
## 8 Josh B quiz_BIOL1002 4
## 9 Jake A quiz_BIOL1002 6
## 10 Jake B quiz_BIOL1002 2
## 11 Gorge A quiz_BIOL1002 7
## 12 Gorge B quiz_BIOL1002 3
9.3.2 Reshaping data from long to wide format
- We will be using the tidyr package to reshape data from long to wide. Let us study how to use
pivot_wider()
function in the next example to pivot data frame into a wide format. We will be using the example data frame created in the previous examples.
#pivot the data frame into a wide format
df_wide <- df_long %>% pivot_wider(names_from = quiz, values_from = marks)
df_wide
## # A tibble: 6 × 4
## name semester quiz_BIOL1001 quiz_BIOL1002
## <chr> <chr> <dbl> <dbl>
## 1 Josh A 1 2
## 2 Josh B 3 4
## 3 Jake A 4 6
## 4 Jake B 8 2
## 5 Gorge A 9 7
## 6 Gorge B 1 3
- Another function of
tidyr
that can be used to transform the data from long back to wide is thespread()
function. We show how to use this function.
## name semester quiz_BIOL1001 quiz_BIOL1002
## 1 Gorge A 9 7
## 2 Gorge B 1 3
## 3 Jake A 4 6
## 4 Jake B 8 2
## 5 Josh A 1 2
## 6 Josh B 3 4
We show another example of using the spread()
function to transform back data2
from long to wide.
## person trial outcome1 outcome2
## 1 A 1 7 4
## 2 A 2 6 4
## 3 B 1 6 5
## 4 B 2 5 5
## 5 C 1 4 3
## 6 C 2 4 2
There are many other functions to reshape data in R but these are the few handy ones that you can always stick to in your data analysis and plotting.