3 PART II: Managing biological data
Getting started
Review best practices for guidance on how to organize your files.
Download, to an appropriate folder, the following .csv files from Brightspace:
shrub-volume-data.csv
shrub-volume-experiments.csv
shrub-volume-sites.csv
surveys.csv
species.csv
plots.csv
Open and save a new R Script. Complete the information at the start of the R Script template.
Set your working directory
Install and load the dplyr R package.
If you’re unsure how to install a package, see R Packages.
Throughout these instructions are links to the relevant sections in Quantitative Skills for Biology.
If you do not complete these steps, you will not be able to complete the HAND IN questions.
TO HAND-IN
You are to hand in a .R
file, formatted as shown in Best Practices - Template where the lines of code to produce the output asked for under HAND-IN are produced in the order they are asked for.
3.1 Grouping and Joining Data
3.1.1 Shrub Volume Aggregation
In this section, we revisit grouping data and using pipes. The following code calculates the mean width of plants at each site. Both the pipe operation method and the non-pipe operation method will be highlighted.
# attach dplyr
library(dplyr)
# load the data
shrub_dims <- read.csv("lab_data/shrub-volume-data.csv")
# Calculate the mean width for each site:
## 1. Without using pipes:
by_site <- group_by(shrub_dims, site)
avg_width <- summarize(by_site, avg_width=mean(width))
## 2. Using pipes:
avg_width_pipe <- shrub_dims %>%
group_by(site) %>%
summarize(mean(width))
# Showing the first six rows with both methods
head(avg_width)
## # A tibble: 5 x 2
## site avg_width
## <int> <dbl>
## 1 1 1.67
## 2 2 3.47
## 3 3 1.43
## 4 4 NA
## 5 11 1.2
## # A tibble: 5 x 2
## site `mean(width)`
## <int> <dbl>
## 1 1 1.67
## 2 2 3.47
## 3 3 1.43
## 4 4 NA
## 5 11 1.2
HAND IN. [Q1] Add a line of code to your R script where you use pipes to calculate the mean height of plants in each experiment, and use head()
to print the output (the above code is intended as a guide you can modify).
HAND IN. [Q2] Add a line of code to your R script where you use pipes to calculate the maximum, max()
, height of a plant at each site and use head()
to print the output.
3.1.2 Shrub Volume Join
In addition to the main data table on shrub dimensions, there are two additional data tables:
shrub-to-volume-experiment.csv
, which describes the manipulation for each experiment; and -shrub-volume-sites.csv
, which provides information about the different sites.
Your are to use inner_join()
to create a new data frame with the site details added. You can read about inner_join()
here.
HAND IN. [Q3]. In your R script, indicate with a comment that you are starting a new section ## Shrub volume join
.
Add lines of code to your R script where you:
Import the experiments data,
shrub-volume-experiment.csv
and usehead()
to view the first six rows of data.View the first six rows of the data
shrub-volume-data.csv
(note that in the above code this was assigned the variable nameshrub_dims
).
HAND IN. [Q4] In your R script, add a commented out line of code where you provide the name of columns that appear in both data sets.
HAND IN. [Q5] In your R script, add a commented out line of code where you provide the values, from the column with the same name, that appear in both data sets.
HAND IN. [Q6] In your R script, add a line of code where you use the inner_join()
function to combine the experiments data with the shrub dimensions, shrub_dims
data, to produce:
## site experiment length width height manipulation
## 1 1 1 2.2 1.3 9.6 control
## 2 1 2 2.1 2.2 7.6 burn
## 3 1 3 2.7 1.5 2.2 rainout
## 4 2 1 3.0 4.5 1.5 control
## 5 2 2 3.1 3.1 4.0 burn
## 6 2 3 2.5 2.8 3.0 rainout
HAND IN. [Q7] During the data entry, one occurrence of site 1, was accidently entered as 11. When the data are combined, what happens too this row of data? In your R script, add a commented out line of code answering this question.
HAND IN. [Q8] In your R script, add lines of code to import the sites data, shrub-to-volume-sites
, and then combine these data with both the data on shrub dimensions and the experiment data to produce a single data frame that contains all of the data. Hand in the tail()
output of the new data frame. Your output should be as below (Hint: you’ve just created a single data frame that contains both the shrub and experimental data.)
## site experiment length width height manipulation latitude longitude
## 7 3 1 1.9 1.8 4.5 control 29.80 -82.15
## 8 3 2 1.1 0.5 2.3 burn 29.80 -82.15
## 9 3 3 3.5 2.0 7.5 rainout 29.80 -82.15
## 10 4 1 2.9 2.7 3.2 control 29.99 -82.62
## 11 4 2 4.5 4.8 6.5 burn 29.99 -82.62
## 12 4 3 1.2 NA 2.7 rainout 29.99 -82.62
## elevation
## 7 57
## 8 57
## 9 57
## 10 62
## 11 62
## 12 62
3.2 Portal Data Aggregation
In this section you will be using the group_by()
and filter()
functions. More information about these functions can be found by clicking the links or through the ?function
and ??function
commands. Examples for how to use the ?
and ??
command are below.
Hint: For this section it maybe useful to use pipe %>%
operations and the tally()
function. For example, I can tally the number of individuals that either identify as M
or F
in the sex column.
## # A tibble: 3 x 2
## sex n
## <chr> <int>
## 1 "" 2511
## 2 "F" 15690
## 3 "M" 17348
HAND IN. [Q9] In your R script indicate with a comment that you are starting a new section ## Portal data aggregation
.
HAND IN. [Q9] In your R script add a line of code that imports the data survey.csv
(Portal Teaching Database survey table).
HAND IN. [Q10] In your R script add a line of code that uses the group_by()
function to count of the number of individuals in each species ID
.
HAND IN. [Q11] In your R script add a line of code that uses the group_by()
function to count of the number of individuals in each species ID in each year. (Hint: you can group_by two variables).
HAND IN. [Q12] In your R script add a line of code where you use the filter()
, group_by()
, and summarize()
functions to calculate the mean mass of species DM
in each year. Include the tail
output. (Hint: double check the survey
data is ready to use.)
3.3 Fix the code
This section is a follow up to the Shrub Volume Aggregation.
The following code is supposed to import the shrub volume data and calculate the average shrub volume for each site, and separately, for each experiment.
read.csv(shrub-volume-data.csv)
shrub_data %>%
mutate(volume = length * width * height) %>%
group_by(site) %>%
summarise(mean_volume= max(volume))
shrub_data %>%
mutate(volume = length * width * height) %>%
group_by(experiment) %>%
sumarize(mean_volume = mean(volume))
# This is an example of a comment within code
HAND IN. [Q13] In your R script indicate with a comment that you are starting a new section ## Fix the code
.
HAND IN. [Q14] Copy and paste the code with error into your R script. Fix the errors in the code so that it does what it’s supposed to.
HAND IN. [Q15] Add comments #
throughout to indicate what you have corrected. A portion of your mark for this question will be determined by the quality of the comments.
3.4 Portal Data Joins
HAND IN. [Q16] In your R script indicate with a comment that you are starting a new section ## Portal data joins
.
HAND IN. [Q17] In your R script add a lines of code that load and assign a variable name to these data sets: species.csv
and plot.csv
. This exercise also uses surveys.csv
, which was loaded during the last question.
HAND IN. [Q18] In your R script, add lines of code that use the data sets above, and answer the following questions:
Use
inner_join()
to create a table that contains the information for both thesurvey
table and thespecies
table. Include both the head and tail output of the new data frame.Use
inner_join()
twice to create a table that contains the information from all three tables. Include both the head and the tail output of the new data frame.Use
inner_join()
andfilter()
to get a data frame with the information from thesurvey
andplots
tables where theplot_type
isControl
. Include only the tail output of the new data frame.
3.5 Portal Data dplyr
Review
The following questions will be using the same data frames as 3.4 above:surveys.csv
, species.csv
, and plot.csv
.
HAND IN [Q19] In your R script indicate with a comment that you are starting a new section ## Portal data dplyr review
.
HAND IN [Q20] In your R script add lines of code to achieve the following:
We want you to do an analysis comparing the size of individuals on the Control
plot to the Long-term Krat Exclosure
. Create a data frame with the year
, genus
, species
, weight
, and plot_type
for all cases where the plot type is either Control
or Long-term Krat Exclosure
. Only include cases where Taxa
is Rodent
. Remove any records where the weight
is missing. Include the head output and the length()
of the new data set. entire Hint: it maybe easiest to use the pipe %>% operation to create
3.6 Extracting vectors from data frames
This section uses functions from Base R (i.e., that didn’t need to be loaded as a package). See here if you are stuck.
HAND IN [Q21] In your R script indicate with a comment that you are starting a new section ## Extracting vectors from data frames
.
HAND IN [Q22] In your R script add lines of code to do the following:
Using the survey.csv
(Portal Teaching Database survey table):
Use
$
to extract the weight column into a vector and rename intoweight_vector
.Use [,] to extract the month column into a vector and rename into
month_vector
. Recall that in [x,y] x corresponds to the row, and y corresponds to the column.Extract the
hindfoot_length
column into a vector and calculate the meanhindfoot_length
ignoring null values.
3.7 Building data frames from vectors
In this section you will make a data frame.
You have data on the length
, width
, and height
of 10 individuals of the English yew Taxus baccata stored in the following vectors:
length <- c(2.2, 2.1, 2.7, 3.0, 3.1, 2.5, 1.9, 1.1, 3.5, 2.9)
width <- c(1.3, 2.2, 1.5, 4.5, 3.1, NA, 1.8, 0.5, 2.0, 2.7)
height <- c(9.6, 7.6, 2.2, 1.5, 4.0, 3.0, 4.5, 2.3, 7.5, 3.2)
HAND IN [Q23] In your R script indicate with a comment that you are starting a new section ## Building data frames from vectors
.
HAND IN [Q24] In your R script add lines of code to make a data frame that contains these three vectors as columns along with a genus column containing the name Taxus for all rows and a species column containing the word baccata for all rows. The output you code should produce is below.
## Genus species length width height
## 1 Taxus baccata 2.2 1.3 9.6
## 2 Taxus baccata 2.1 2.2 7.6
## 3 Taxus baccata 2.7 1.5 2.2
## 4 Taxus baccata 3.0 4.5 1.5
## 5 Taxus baccata 3.1 3.1 4.0
## 6 Taxus baccata 2.5 NA 3.0