3 PART II: Managing biological data

Getting started

Review best practices for guidance on how to organize your files.
Download, to an appropriate folder, the following .csv files from Brightspace:
- shrub-volume-data.csv
- shrub-volume-experiments.csv
- shrub-volume-sites.csv
- surveys.csv
- species.csv
- plots.csv
Open and save a new R Script. Complete the information at the start of the R Script template.
Set your working directory.
Install and load the dplyr R package.

If you’re unsure how to install a package, see R Packages.

Throughout these instructions are links to the relevant sections in Quantitative Skills for Biology.

If you do not complete these steps, you will not be able to complete the HAND IN questions.

TO HAND IN. Hand in a .R file, formatted as shown in Best Practices - Template, in which the lines of code that produce the output requested under each HAND IN question appear in the order the questions are asked.

3.1 Grouping and Joining Data

3.1.1 Shrub Volume Aggregation

In this section, we revisit grouping data and using pipes. The following code calculates the mean width of plants at each site, first without pipes and then with pipes.

# attach dplyr
library(dplyr)

# load the data 
shrub_dims <- read.csv("lab_data/shrub-volume-data.csv")

# Calculate the mean width for each site:
## 1. Without using pipes:
by_site <- group_by(shrub_dims, site)
avg_width <- summarize(by_site, avg_width=mean(width))

## 2. Using pipes:
avg_width_pipe <- shrub_dims %>%
  group_by(site) %>%
  summarize(mean(width))

# Showing the first six rows with both methods
head(avg_width)

## # A tibble: 5 x 2
##    site avg_width
##   <int>     <dbl>
## 1     1      1.67
## 2     2      3.47
## 3     3      1.43
## 4     4     NA   
## 5    11      1.2

head(avg_width_pipe)

## # A tibble: 5 x 2
##    site `mean(width)`
##   <int>         <dbl>
## 1     1          1.67
## 2     2          3.47
## 3     3          1.43
## 4     4         NA   
## 5    11          1.2
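The two tables above differ only in the column name: when summarize() is given an unnamed expression, the column is labelled with the expression itself. The following is a minimal sketch, using a small made-up data frame (toy_shrubs and its values are hypothetical, not the lab data), of how naming the summary and passing na.rm = TRUE affect the result.

```r
library(dplyr)

# Hypothetical toy data (not the lab data): two sites, one missing width
toy_shrubs <- data.frame(
  site  = c(1, 1, 2, 2),
  width = c(1.0, 3.0, 2.0, NA)
)

# Naming the summary gives the column a readable name;
# na.rm = TRUE drops missing values before averaging
toy_summary <- toy_shrubs %>%
  group_by(site) %>%
  summarize(avg_width = mean(width, na.rm = TRUE))

# Nesting the calls without a pipe gives the same table
toy_nested <- summarize(group_by(toy_shrubs, site),
                        avg_width = mean(width, na.rm = TRUE))

print(toy_summary)
```

Without na.rm = TRUE, any group that contains a missing value returns NA, which is what happens for site 4 in the output above.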

HAND IN. [Q1] Add a line of code to your R script where you use pipes to calculate the mean height of plants in each experiment, and use head() to print the output (the above code is intended as a guide you can modify).

HAND IN. [Q2] Add a line of code to your R script where you use pipes and max() to calculate the maximum height of a plant at each site, and use head() to print the output.

3.1.2 Shrub Volume Join

In addition to the main data table on shrub dimensions, there are two additional data tables:

- shrub-volume-experiments.csv, which describes the manipulation for each experiment; and
- shrub-volume-sites.csv, which provides information about the different sites.

You are to use inner_join() to create a new data frame with the site details added. You can read about inner_join() here.
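As a minimal sketch of the general pattern, the example below joins two small made-up tables on a shared key column. The table names, the plant_id column, and the values are hypothetical and are not part of the lab data.

```r
library(dplyr)

# Hypothetical toy tables (not the lab data) sharing a "plant_id" column
toy_heights <- data.frame(plant_id = c(1, 2, 3),
                          height = c(9.6, 7.6, 2.2))
toy_labels  <- data.frame(plant_id = c(1, 2, 3),
                          treatment = c("control", "burn", "rainout"))

# by = names the shared key column explicitly; without it, dplyr
# joins on every column name the two tables have in common
toy_joined <- inner_join(toy_heights, toy_labels, by = "plant_id")

print(toy_joined)
```

Stating by = explicitly makes the join key visible to anyone reading the script, which helps when the two tables share more column names than intended.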

HAND IN. [Q3]. In your R script, indicate with a comment that you are starting a new section ## Shrub volume join.

Add lines of code to your R script where you:

Import the experiments data, shrub-volume-experiments.csv, and use head() to view the first six rows of data.
View the first six rows of the data shrub-volume-data.csv (note that in the above code this was assigned the variable name shrub_dims).

HAND IN. [Q4] In your R script, add a commented out line of code where you provide the names of the columns that appear in both data sets.

HAND IN. [Q5] In your R script, add a commented out line of code where you provide the values, from the column with the same name, that appear in both data sets.

HAND IN. [Q6] In your R script, add a line of code where you use the inner_join() function to combine the experiments data with the shrub dimensions data (shrub_dims) to produce:

##   site experiment length width height manipulation
## 1    1          1    2.2   1.3    9.6      control
## 2    1          2    2.1   2.2    7.6         burn
## 3    1          3    2.7   1.5    2.2      rainout
## 4    2          1    3.0   4.5    1.5      control
## 5    2          2    3.1   3.1    4.0         burn
## 6    2          3    2.5   2.8    3.0      rainout

HAND IN. [Q7] During data entry, one occurrence of site 1 was accidentally entered as 11. When the data are combined, what happens to this row of data? In your R script, add a commented out line of code answering this question.

HAND IN. [Q8] In your R script, add lines of code to import the sites data, shrub-volume-sites.csv, and then combine these data with both the data on shrub dimensions and the experiment data to produce a single data frame that contains all of the data. Hand in the tail() output of the new data frame. Your output should be as below. (Hint: you’ve just created a single data frame that contains both the shrub and experimental data.)

##    site experiment length width height manipulation latitude longitude
## 7     3          1    1.9   1.8    4.5      control    29.80    -82.15
## 8     3          2    1.1   0.5    2.3         burn    29.80    -82.15
## 9     3          3    3.5   2.0    7.5      rainout    29.80    -82.15
## 10    4          1    2.9   2.7    3.2      control    29.99    -82.62
## 11    4          2    4.5   4.8    6.5         burn    29.99    -82.62
## 12    4          3    1.2    NA    2.7      rainout    29.99    -82.62
##    elevation
## 7         57
## 8         57
## 9         57
## 10        62
## 11        62
## 12        62

3.2 Portal Data Aggregation

In this section you will be using the group_by() and filter() functions. More information about these functions can be found by clicking the links or through the ?function and ??function commands. Examples of how to use the ? and ?? commands are below.

?group_by
??group_by

Hint: For this section it may be useful to use pipe %>% operations and the tally() function. For example, the following code tallies the number of individuals recorded as M or F in the sex column.

survey <- read.csv("lab_data/surveys.csv")
survey %>% 
  group_by(sex) %>%
  tally()

## # A tibble: 3 x 2
##   sex       n
##   <chr> <int>
## 1 ""     2511
## 2 "F"   15690
## 3 "M"   17348
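The following is a minimal sketch, on a small made-up data frame (toy_survey and its values are hypothetical, not the lab data), showing that dplyr's count() gives the same counts as group_by() followed by tally().

```r
library(dplyr)

# Hypothetical toy data (not the lab data); "" stands for a blank entry
toy_survey <- data.frame(sex = c("F", "M", "M", "", "F", "M"))

# group_by() followed by tally() counts the rows in each group
by_tally <- toy_survey %>%
  group_by(sex) %>%
  tally()

# count() is shorthand for the same two steps
by_count <- toy_survey %>%
  count(sex)

print(by_count)
```

Note that the blank entry forms its own group, as it does in the surveys output above.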

HAND IN. [Q9] In your R script indicate with a comment that you are starting a new section ## Portal data aggregation.

HAND IN. [Q9] In your R script add a line of code that imports the data surveys.csv (Portal Teaching Database survey table).

HAND IN. [Q10] In your R script add a line of code that uses the group_by() function to count the number of individuals in each species ID.

HAND IN. [Q11] In your R script add a line of code that uses the group_by() function to count the number of individuals in each species ID in each year. (Hint: you can group_by two variables.)

HAND IN. [Q12] In your R script add a line of code where you use the filter(), group_by(), and summarize() functions to calculate the mean mass of species DM in each year. Include the tail() output. (Hint: double check that the survey data are ready to use.)

3.3 Fix the code

This section is a follow up to the Shrub Volume Aggregation.

The following code is supposed to import the shrub volume data and calculate the average shrub volume for each site, and separately, for each experiment.

read.csv(shrub-volume-data.csv)
shrub_data %>%
  mutate(volume = length * width * height) %>%
  group_by(site) %>%
  summarise(mean_volume= max(volume))

shrub_data %>%
  mutate(volume = length * width * height) %>%
  group_by(experiment) %>%
  sumarize(mean_volume = mean(volume))

# This is an example of a comment within code

HAND IN. [Q13] In your R script indicate with a comment that you are starting a new section ## Fix the code.

HAND IN. [Q14] Copy and paste the code with errors into your R script. Fix the errors in the code so that it does what it’s supposed to.

HAND IN. [Q15] Add comments # throughout to indicate what you have corrected. A portion of your mark for this question will be determined by the quality of the comments.

3.4 Portal Data Joins

HAND IN. [Q16] In your R script indicate with a comment that you are starting a new section ## Portal data joins.

HAND IN. [Q17] In your R script add lines of code that load and assign a variable name to these data sets: species.csv and plots.csv. This exercise also uses surveys.csv, which was loaded earlier in Section 3.2.

HAND IN. [Q18] In your R script, add lines of code that use the data sets above, and answer the following questions:

Use inner_join() to create a table that contains the information for both the survey table and the species table. Include both the head and tail output of the new data frame.
Use inner_join() twice to create a table that contains the information from all three tables. Include both the head and the tail output of the new data frame.
Use inner_join() and filter() to get a data frame with the information from the survey and plots tables where the plot_type is Control. Include only the tail output of the new data frame.

3.5 Portal Data `dplyr` Review

The following questions use the same data frames as Section 3.4 above: surveys.csv, species.csv, and plots.csv.

HAND IN [Q19] In your R script indicate with a comment that you are starting a new section ## Portal data dplyr review.

HAND IN [Q20] In your R script add lines of code to achieve the following:

We want you to compare the size of individuals on Control plots with those on Long-term Krat Exclosure plots. Create a data frame with the year, genus, species, weight, and plot_type for all cases where the plot type is either Control or Long-term Krat Exclosure. Only include cases where Taxa is Rodent. Remove any records where the weight is missing. Include the head() output and the length() of the new data set. Hint: it may be easiest to use the pipe %>% operation to create the entire data frame.

3.6 Extracting vectors from data frames

This section uses functions from Base R (i.e., functions that do not need to be loaded from a package). See here if you are stuck.

HAND IN [Q21] In your R script indicate with a comment that you are starting a new section ## Extracting vectors from data frames.

HAND IN [Q22] In your R script add lines of code to do the following:

Using surveys.csv (Portal Teaching Database survey table):

Use $ to extract the weight column into a vector and assign it to weight_vector.
Use [,] to extract the month column into a vector and assign it to month_vector. Recall that in [x,y], x corresponds to the row and y corresponds to the column.
Extract the hindfoot_length column into a vector and calculate the mean hindfoot_length, ignoring missing (NA) values.
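As a minimal sketch of these Base R operations, the example below uses a small made-up data frame. The name toy_df, its columns, and its values are hypothetical and are not part of the lab data.

```r
# Hypothetical toy data frame (not the lab data)
toy_df <- data.frame(
  id     = c(1, 2, 3),
  length = c(10.5, NA, 12.5)
)

# $ extracts a column by name as a vector
id_vector <- toy_df$id

# [rows, columns]: leaving the row position empty keeps every row
length_vector <- toy_df[, "length"]

# na.rm = TRUE drops missing (NA) values before the mean is taken
mean_length <- mean(length_vector, na.rm = TRUE)
print(mean_length)
```

Selecting a single column with [, ] returns a plain vector for a data frame created by data.frame() or read.csv().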

3.7 Building data frames from vectors

In this section you will make a data frame.
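The general pattern can be sketched with data.frame(), shown here on made-up vectors. The names mass_g, age_yr, and toy_frame and all values are hypothetical and are not part of the lab data.

```r
# Hypothetical toy vectors (not the lab data)
mass_g <- c(12.1, 15.3, 9.8)
age_yr <- c(2, 3, 1)

# Each named argument becomes a column; a single value such as
# "bird" is recycled to fill every row
toy_frame <- data.frame(group = "bird", mass_g = mass_g, age_yr = age_yr)

print(toy_frame)
```

The columns appear in the order the arguments are given, and all vectors longer than one element must have the same length.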

You have data on the length, width, and height of 10 individuals of the English yew Taxus baccata stored in the following vectors:

length <- c(2.2, 2.1, 2.7, 3.0, 3.1, 2.5, 1.9, 1.1, 3.5, 2.9)
width <- c(1.3, 2.2, 1.5, 4.5, 3.1, NA, 1.8, 0.5, 2.0, 2.7)
height <- c(9.6, 7.6, 2.2, 1.5, 4.0, 3.0, 4.5, 2.3, 7.5, 3.2)

HAND IN [Q23] In your R script indicate with a comment that you are starting a new section ## Building data frames from vectors.

HAND IN [Q24] In your R script add lines of code to make a data frame that contains these three vectors as columns, along with a genus column containing the name Taxus for all rows and a species column containing the word baccata for all rows. The output your code should produce is below.

##   Genus species length width height
## 1 Taxus baccata    2.2   1.3    9.6
## 2 Taxus baccata    2.1   2.2    7.6
## 3 Taxus baccata    2.7   1.5    2.2
## 4 Taxus baccata    3.0   4.5    1.5
## 5 Taxus baccata    3.1   3.1    4.0
## 6 Taxus baccata    2.5    NA    3.0