23 Mar 24 ASSIGNMENT Selection for COVID-19 variants

Class will be in CSF 2238.

Visit the CoVaRRnet’s website for duotang

Below is an analysis that estimates the selection coefficient related to this plot estimating the selection coefficients for the BQ* and XBB* variants relative to each other for all of Canada:

Note that BQ* and XBB* are aggregations of finer variant classifications.

ASSIGNMENT 4: Your assignment is to write an .Rmd notebook that estimates the selection coefficient for a subvariant. You need to include a graph, and state your estimate of the selection coefficient. This assignment is due FRIDAY MARCH 31

The duotong website contains several of the graphs for the other variants and/or regions to give you an idea of how data rich your chosen analysis might be before you start it.

First we load the packages we will need:

library("dplyr")
library("ggplot2")

The data can be download from Duotang. You can do this by clicking on Frequency table download. However, I have already downloaded the data and uploaded it to a website so it can also be loaded into R as follows:

data = read.delim("https://raw.githubusercontent.com/ahurford/biol-4605-data/main/data2/Last%20120%20days.tsv", header=TRUE)

Note that the file is .tsv format, so I needed to look up what R function could read that in.

Now that the data is loaded, the challenge in replicating the analysis, is cleaning the data up into the right form so that we can do the linear regression. We use the tidyr package commands to clean our data. It may be very helpful to review the tidyr commands here.

data = data%>%rename(percent = X..Frequency)

The code above renames a column to make it a bit easier to work with in subsequent coding.

data$Frequency=as.numeric(data$Frequency)
data$Date = as.Date(data$Date)
data$percent =as.numeric(sub("%", "", data$percent))

The above commands are to correctly designate the type of variables in the columns. Generally, dates don’t read in as dates, and you need to explicitly state that they are dates using as.Date() before R will treat them as such. When data is stored with a %, then R tends initially understand that as a word, and fails to complete problems such as 10% + 10% until we remove the % and tag the resulting quantity as a number.

df= data.frame(Date = data$Date, Number = data$Frequency, Lineage = data$Lineage)

g1=ggplot(df, aes(x = Date, y = Number, fill=Lineage)) + 
  geom_bar(stat = "identity")+
  scale_fill_manual(values = rainbow(26))
g1

(I even tried to replicate the color palette of the plot, however, that didn’t work and the colors are a bit ghastly!)

To replicate the plot I am trying to replicate, I need to aggregate some lineages. (This may not be necessary for your analysis. Additionally, there may be a more elegant way of achieve the same thing. Below, is what worked for me).

data2=data%>%mutate(group=case_when(Lineage == "BQ.1" ~ "BQ*",
                                    Lineage == "BQ.1.1" ~ "BQ*",
                                    Lineage=="BQ.1.1.1" ~ "BQ*",
                                    Lineage=="BQ.1.1.18" ~ "BQ*",
                                    Lineage=="BQ.1.1.3" ~ "BQ*",
                                    Lineage=="BQ.1.1.4" ~ "BQ*",
                                    Lineage=="BQ.1.1.5" ~ "BQ*",
                                    Lineage=="BQ.1.10.1"~ "BQ*",
                                    Lineage=="BQ.1.12"~"BQ*",
                                    Lineage=="BQ.1.13"~ "BQ*",
                                    Lineage=="BQ.1.14"~ "BQ*",
                                    Lineage=="BQ.1.2"~ "BQ*",
                                    Lineage=="BQ.1.25"~ "BQ*",
                                    Lineage=="BQ.1.3"~ "BQ*",
                                    Lineage=="BQ.1.5"~ "BQ*",
                                    Lineage=="XBB.1"~ "XBB*",
                                    Lineage=="XBB.1.5"~ "XBB*",
                              ))
head(data2,10)
##          Date         Lineage Frequency percent group
## 1  2022-11-07  other lineages       799      36  <NA>
## 2  2022-11-07          BA.4.6       116       5  <NA>
## 3  2022-11-07          BA.5.1       111       5  <NA>
## 4  2022-11-07          BA.5.2       185       8  <NA>
## 5  2022-11-07        BA.5.2.1       260      12  <NA>
## 6  2022-11-07       BA.5.2.13        25       1  <NA>
## 7  2022-11-07        BE.1.2.1        27       1  <NA>
## 8  2022-11-07            BF.7       122       5  <NA>
## 9  2022-11-07            BQ.1       131       6   BQ*
## 10 2022-11-07          BQ.1.1       233      10   BQ*

This has created a new column called group. I need to filter out just the BQ* and XBB* variants, and I need to sum the proportions attributed to each of the lineages aggregated into BQ* or XBB* (such aggregation may not be needed for your analysis).

data2 = data2%>%group_by(group,Date) %>% 
  summarise(total_prop=sum(percent)/100)%>%filter(group == "XBB*"|group=="BQ*")
## `summarise()` has grouped output by
## 'group'. You can override using the
## `.groups` argument.

Additionally, I need to calculate the logarithm of the proportion of the variants, and filter out any reports of 0 because the logarithm of 0 is undefined. Also, here, I need to convert from percentage to proportion.

data2 = data2%>%filter(total_prop>0)%>%mutate(log_percent = log(total_prop))

Make a plot:

g3 =ggplot(data2, aes(x = Date, y = total_prop, group=group, colour=group)) + 
  geom_point() +
  geom_line() +
  ylab('proportion of all variants')
g3

I want to do the linear regression, but to do this I ended up splitting out the XBB* and QB* variants:

XBB = filter(data2, group == "XBB*")
BQ = filter(data2, group == "BQ*")

Below, we perform the linear regressions:

lm.XBB = lm(log_percent ~ Date,data=XBB)
lm.BQ = lm(log_percent ~ Date,data=BQ)

To plot the linear regressions, and the 95% confidence interval, we use the predict() function on the linear regression objects, lm.XBB and lm.BQ.

PI.XBB = predict(lm.XBB, se.fit=TRUE, interval="confidence", level=0.95)$fit
PI.BQ = predict(lm.BQ, se.fit=TRUE, interval="confidence", level=0.95)$fit

Below, I organize the data, and the linear regression outputs into a data frame to make a plot:

XBB.fit = data.frame(Date = XBB$Date, fit = PI.XBB[,1], lwr = PI.XBB[,2], uppr = PI.XBB[,3])
BQ.fit = data.frame(Date = BQ$Date, fit = PI.BQ[,1], lwr = PI.BQ[,2], uppr = PI.BQ[,3])

g4=ggplot(data=XBB.fit, aes(x=Date))+
  geom_ribbon(aes(ymin=lwr, ymax=uppr, fill = "#F8766D"), alpha = 0.1)+
  geom_line(aes(y=fit), color = "#F8766D")+
  geom_point(aes(y=log_percent), data=BQ, color = "#00BA38")+
geom_ribbon(aes(ymin=lwr, ymax=uppr, fill = "#00BA38"), alpha = 0.1, data=BQ.fit)+
  geom_line(aes(y=fit), data = BQ.fit, color="#00BA38")+
  geom_point(aes(y=log_percent), data=XBB,  color = "#F8766D")+
  ylab("log(proportion of all variants)")
  g4

That is a good plot, but we still need to know the slope of the regression to estimate the selection coefficients for the variants:

summary(lm.XBB)
## 
## Call:
## lm(formula = log_percent ~ Date, data = XBB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57728 -0.19989  0.01419  0.21737  0.41425 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.663e+02  5.633e+01  -17.15 8.31e-10 ***
## Date         4.976e-02  2.908e-03   17.11 8.55e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3071 on 12 degrees of freedom
## Multiple R-squared:  0.9606, Adjusted R-squared:  0.9573 
## F-statistic: 292.8 on 1 and 12 DF,  p-value: 8.545e-10
summary(lm.BQ)
## 
## Call:
## lm(formula = log_percent ~ Date, data = BQ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63967 -0.24884  0.09348  0.20036  0.28267 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.3785411 36.8420520   0.037    0.971
## Date        -0.0001081  0.0019031  -0.057    0.955
## 
## Residual standard error: 0.2691 on 15 degrees of freedom
## Multiple R-squared:  0.0002149,  Adjusted R-squared:  -0.06644 
## F-statistic: 0.003224 on 1 and 15 DF,  p-value: 0.9555

This tells me the slope of my regression for the XBB* variant is 0.05 and the intercept is -966.29. For the BQ* variant, the slope is 0 and the intercept is 1.379. You can get these values to print out in your Rmarkdown by writing round(coef(lm.BQ)[1],3) (for example) but enclosed within downward dashes, and where the first dash is followed by r.

These estimates of the slope are close to the slope of 0.08 for XBB* and 0.02 for BQ* shown in the figure. I expect they are not exact, because the data shown in the figure is more finely resolved on the Date axis than the data that could be downloaded.

The selection coefficient is the slope of this regression, see slide 23. The selection coefficient is \(s_{XBB*} = r_{XBB*} - r_{all}\), where \(r_i\) is the exponential growth rate of the \(i\) variant.