diff --git a/config.yaml b/config.yaml index cab81e68..615e9bfb 100644 --- a/config.yaml +++ b/config.yaml @@ -64,6 +64,7 @@ episodes: #- describe-cases.Rmd #- simple-analysis.Rmd - delays-reuse.Rmd +- quantify-transmissibility.Rmd - delays-functions.Rmd # Information for Learners diff --git a/episodes/quantify-transmissibility.Rmd b/episodes/quantify-transmissibility.Rmd new file mode 100644 index 00000000..cea7484d --- /dev/null +++ b/episodes/quantify-transmissibility.Rmd @@ -0,0 +1,488 @@ +--- +title: 'Quantifying transmission' +teaching: 30 +exercises: 0 +--- + +```{r setup, echo = FALSE, warning = FALSE, message = FALSE} +library(EpiNow2) +library(ggplot2) +withr::local_options(list(mc.cores = 4)) +``` + +:::::::::::::::::::::::::::::::::::::: questions + +- How can I estimate the time-varying reproduction number ($Rt$) and growth rate from a time series of case data? +- How can I quantify geographical heterogeneity from these transmission metrics? + + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Learn how to estimate transmission metrics from a time series of case data using the R package `EpiNow2` + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +Learners should familiarise themselves with following concepts before working through this tutorial: + +**Statistics**: probability distributions, principle of Bayesian analysis. + +**Epidemic theory**: Effective reproduction number. + +::::::::::::::::::::::::::::::::: + + + +::::::::::::::::::::::::::::::::::::: callout +### Reminder: the Effective Reproduction Number, $R_t$ + +The [basic reproduction number](../learners/reference.md#basic), $R_0$, is the average number of cases caused by one infectious individual in a entirely susceptible population. + +But in an ongoing outbreak, the population does not remain entirely susceptible as those that recover from infection are typically immune. Moreover, there can be changes in behaviour or other factors that affect transmission. When we are interested in monitoring changes in transmission we are therefore more interested in the value of the **effective reproduction number**, $R_t$, the average number of cases caused by one infectious individual in the population at time $t$. + +:::::::::::::::::::::::::::::::::::::::::::::::: + + +## Introduction + +The transmission intensity of an outbreak is quantified using two key metrics: the reproduction number, which informs on the strength of the transmission by indicating how many new cases are expected from each existing case; and the [growth rate](../learners/reference.md#growth), which informs on the speed of the transmission by indicating how rapidly the outbreak is spreading or declining (doubling/halving time) within a population. To estimate these key metrics using case data we must account for delays between the date of infections and date of reported cases. In an outbreak situation, data are usually available on reported dates only, therefore we must use estimation methods to account for these delays when trying to understand changes in transmission over time. For more details on the distinction between speed and strength of transmission and implications for control, see [Dushoff & Park, 2021](https://royalsocietypublishing.org/doi/full/10.1098/rspb.2020.1556). + +In the next tutorials we will focus on how to use the functions in `{EpiNow2}` to estimate transmission metrics of case data. We will not cover the theoretical background of the models or inference framework, for details on these concepts see the [vignette](https://epiforecasts.io/EpiNow2/dev/articles/estimate_infections.html). + + +::::::::::::::::::::::::::::::::::::: callout +### Bayesian inference + +The R package `EpiNow2` uses a [Bayesian inference](../learners/reference.md#bayesian) framework to estimate reproduction numbers and infection times based on reporting dates. + +In Bayesian inference, we use prior knowledge (prior distributions) with data (in a likelihood function) to find the posterior probability. + +

Posterior probability $\propto$ likelihood $\times$ prior probability +

+ +:::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::: instructor + +Refer to the prior probability distribution and the [posterior probability](https://en.wikipedia.org/wiki/Posterior_probability) distribution. + +In the ["`Expected change in daily cases`" callout](#expected-change-in-daily-cases), by "the posterior probability that $R_t < 1$", we refer specifically to the [area under the posterior probability distribution curve](https://www.nature.com/articles/nmeth.3368/figures/1). + +:::::::::::::::::::::::::::::::::::::::::::::::: + + +The first step is to load the `{EpiNow2}` package: + +```{r, eval = FALSE} +library(EpiNow2) +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor + +This tutorial illustrates the usage of `epinow()` to estimate the time-varying reproduction number and infection times. Learners should understand the necessary inputs to the model and the limitations of the model output. + +:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: + + +## Delay distributions and case data +### Case data + +To illustrate the functions of `EpiNow2` we will use outbreak data of the start of the COVID-19 pandemic from the United Kingdom. The data are available in the R package `{incidence2}`. + +```{r} +dplyr::as_tibble(incidence2::covidregionaldataUK) +``` + +To use the data, we must format the data to have two columns: + ++ `date`: the date (as a date object see `?is.Date()`), ++ `confirm`: number of confirmed cases on that date. + +Let's use `{dplyr}` for this: + +```{r, warning = FALSE, message = FALSE} +library(dplyr) + +cases <- incidence2::covidregionaldataUK %>% + select(date, cases_new) %>% + group_by(date) %>% + summarise(confirm = sum(cases_new, na.rm = TRUE)) %>% + ungroup() +``` + +::::::::::::::::::::::::: spoiler + +### When to use incidence2? + +We can also use the `{incidence2}` package to aggregate cases. However, if you ever need to aggregate you data in a different time **interval** (i.e., days, weeks or months) or per **group** categories, we recommend you to explore the `incidence2::incidence()` function: + +```{r, warning = FALSE, message = FALSE, eval=FALSE} +library(tidyr) +library(dplyr) + +incidence2::covidregionaldataUK %>% + # preprocess missing values + tidyr::replace_na(list(cases_new = 0)) %>% + # compute the daily incidence + incidence2::incidence( + date_index = "date", + counts = "cases_new", + groups = "region", + interval = "week" + ) +``` + +You can also estimate transmission metrics from {incidence2} objects using the `{i2extras}` package. Read further in the [Fitting curves](https://www.reconverse.org/i2extras/articles/fitting_epicurves.html) vignette! + +::::::::::::::::::::::::: + +There are case data available for `r dim(cases)[1]` days, but in an outbreak situation it is likely we would only have access to the beginning of this data set. Therefore we assume we only have the first 90 days of this data. + +```{r echo = FALSE} +ggplot(cases[1:90, ], aes(x = date, y = confirm)) + + geom_col() + + theme_grey( + base_size = 15 + ) +``` + + + +### Delay distributions +We assume there are delays from the time of infection until the time a case is reported. We specify these delays as distributions to account for the uncertainty in individual level differences. The delay can consist of multiple types of delays/processes. A typical delay from time of infection to case reporting may consist of: + +

**time from infection to symptom onset** (the [incubation period](../learners/reference.md#incubation)) + **time from symptom onset to case notification** (the reporting time) +.

+ +The delay distribution for each of these processes can either estimated from data or obtained from the literature. We can express uncertainty about what the correct parameters of the distributions by assuming the distributions have **fixed** parameters or whether they have **variable** parameters. To understand the difference between **fixed** and **variable** distributions, let's consider the incubation period. + +::::::::::::::::::::::::::::::::::::: callout +### Delays and data +The number of delays and type of delay are a flexible input that depend on the data. The examples below highlight how the delays can be specified for different data sources: + +
+ +| Data source | Delay(s) | +| ------------- |-------------| +|Time of symptom onset |Incubation period | +|Time of case report |Incubation period + time from symptom onset to case notification | +|Time of hospitalisation |Incubation period + time from symptom onset to hospitalisation | + +
+ + +:::::::::::::::::::::::::::::::::::::::::::::::: + + + +#### Incubation period distribution + +The distribution of incubation period for many diseases can usually be obtained from the literature. The package `{epiparameter}` contains a library of epidemiological parameters for different diseases obtained from the literature. + +We specify a (fixed) gamma distribution with mean $\mu = 4$ and standard deviation $\sigma= 2$ (shape = $4$, scale = $1$) using the function `dist_spec()` as follows: + +```{r} +incubation_period_fixed <- dist_spec( + mean = 4, sd = 2, + max = 20, distribution = "gamma" +) +incubation_period_fixed +``` + +The argument `max` is the maximum value the distribution can take, in this example 20 days. + +::::::::::::::::::::::::::::::::::::: callout +### Why a gamma distrubution? + +The incubation period has to be positive in value. Therefore we must specific a distribution in `dist_spec` which is for positive values only. + +`dist_spec()` supports log normal and gamma distributions, which are distributions for positive values only. + +For all types of delay, we will need to use distributions for positive values only - we don't want to include delays of negative days in our analysis! + +:::::::::::::::::::::::::::::::::::::::::::::::: + + + +#### Including distribution uncertainty + +To specify a **variable** distribution, we include uncertainty around the mean $\mu$ and standard deviation $\sigma$ of our gamma distribution. If our incubation period distribution has a mean $\mu$ and standard deviation $\sigma$, then we assume the mean ($\mu$) follows a Normal distribution with standard deviation $\sigma_{\mu}$: + +$$\mbox{Normal}(\mu,\sigma_{\mu}^2)$$ + +and a standard deviation ($\sigma$) follows a Normal distribution with standard deviation $\sigma_{\sigma}$: + +$$\mbox{Normal}(\sigma,\sigma_{\sigma}^2).$$ + +We specify this using `dist_spec` with the additional arguments `mean_sd` ($\sigma_{\mu}$) and `sd_sd` ($\sigma_{\sigma}$). + +```{r} +incubation_period_variable <- dist_spec( + mean = 4, sd = 2, + mean_sd = 0.5, sd_sd = 0.5, + max = 20, distribution = "gamma" +) +incubation_period_variable +``` + + + +#### Reporting delays + +After the incubation period, there will be an additional delay of time from symptom onset to case notification: the reporting delay. We can specify this as a fixed or variable distribution, or estimate a distribution from data. + +When specifying a distribution, it is useful to visualise the probability density to see the peak and spread of the distribution, in this case we will use a log normal distribution. We can use the functions `convert_to_logmean()` and `convert_to_logsd()` to convert the mean and standard deviation of a normal distribution to that of a log normal distribution. + +If we want to assume that the mean reporting delay is 2 days (with a standard deviation of 1 day), the log normal distribution will look like: + +```{r} +log_mean <- convert_to_logmean(2, 1) +log_sd <- convert_to_logsd(2, 1) +x <- seq(from = 0, to = 10, length = 1000) +df <- data.frame(x = x, density = dlnorm(x, meanlog = log_mean, sdlog = log_sd)) +ggplot(df) + + geom_line( + aes(x, density) + ) + + theme_grey( + base_size = 15 + ) +``` + +Using the mean and standard deviation for the log normal distribution, we can specify a fixed or variable distribution using `dist_spec()` as before: + +```{r} +reporting_delay_variable <- dist_spec( + mean = log_mean, sd = log_sd, + mean_sd = 0.5, sd_sd = 0.5, + max = 10, distribution = "lognormal" +) +``` + +If data is available on the time between symptom onset and reporting, we can use the function `estimate_delay()` to estimate a log normal distribution from a vector of delays. The code below illustrates how to use `estimate_delay()` with synthetic delay data. + +```{r, eval = FALSE } +delay_data <- rlnorm(500, log(5), 1) # synthetic delay data +reporting_delay <- estimate_delay( + delay_data, + samples = 1000, + bootstraps = 10 +) +``` + + +#### Generation time + +We also must specify a distribution for the generation time. Here we will use a log normal distribution with mean 3.6 and standard deviation 3.1 ([Ganyani et al. 2020](https://doi.org/10.2807/1560-7917.ES.2020.25.17.2000257)). + + +```{r} +generation_time_variable <- dist_spec( + mean = 3.6, sd = 3.1, + mean_sd = 0.5, sd_sd = 0.5, + max = 20, distribution = "lognormal" +) +``` + + +## Finding estimates + +The function `epinow()` is a wrapper for the function `estimate_infections()` used to estimate cases by date of infection. The generation time distribution and delay distributions must be passed using the functions ` generation_time_opts()` and `delay_opts()` respectively. + +There are numerous other inputs that can be passed to `epinow()`, see `EpiNow2::?epinow()` for more detail. +One optional input is to specify a log normal prior for the effective reproduction number $R_t$ at the start of the outbreak. We specify a mean and standard deviation as arguments of `prior` within `rt_opts()`: + +```{r, eval = FALSE} +rt_log_mean <- convert_to_logmean(2, 1) +rt_log_sd <- convert_to_logsd(2, 1) +rt <- rt_opts(prior = list(mean = rt_log_mean, sd = rt_log_sd)) +``` + +::::::::::::::::::::::::::::::::::::: callout +### Bayesian inference using Stan + +The Bayesian inference is performed using MCMC methods with the program [Stan](https://mc-stan.org/). There are a number of default inputs to the Stan functions including the number of chains and number of samples per chain (see `?EpiNow2::stan_opts()`). + +To reduce computation time, we can run chains in parallel. To do this, we must set the number of cores to be used. By default, 4 MCMC chains are run (see `stan_opts()$chains`), so we can set an equal number of cores to be used in parallel as follows: + +```{r} +withr::local_options(list(mc.cores = 4)) +``` + +To find the maximum number of available cores on your machine, use `parallel::detectCores()`. + +:::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, echo = FALSE} +rt_log_mean <- convert_to_logmean(2, 1) +rt_log_sd <- convert_to_logsd(2, 1) + +incubation_period_fixed <- dist_spec( + mean = 4, sd = 2, + max = 20, distribution = "gamma" +) + +log_mean <- convert_to_logmean(2, 1) +log_sd <- convert_to_logsd(2, 1) +reporting_delay_fixed <- dist_spec( + mean = log_mean, sd = log_sd, + max = 10, distribution = "lognormal" +) + +generation_time_fixed <- dist_spec( + mean = 3.6, sd = 3.1, + max = 20, distribution = "lognormal" +) +``` + +*Note: in the code below fixed distributions are used instead of variable. This is to speed up computation time. It is generally recommended to use variable distributions that account for additional uncertainty.* + +::::::::::::::::::::::::::::::::: spoiler + +### On reducing computation time + +Using an appropriate number of samples and chains is crucial for ensuring convergence and obtaining reliable estimates in Bayesian computations using Stan. Inadequate sampling or insufficient chains may lead to issues such as divergent transitions, impacting the accuracy and stability of the inference process. + +For the purpose of this tutorial, we can add more configuration details to get an useful output in less time. You can specify a fixed number of `samples` and `chains` to the `stan` argument using the `stan_opts()` function: + +The code in the proposed code chunk can take around 10 minutes. We expect this alternative code chunk below using `stan_opts()` to take approximately 3 minutes: + +```{r,eval=FALSE} +estimates <- epinow( + # same code as previous chunk + reported_cases = reported_cases, + generation_time = generation_time_opts(generation_time_fixed), + delays = delay_opts( + incubation_period_fixed + reporting_delay_fixed + ), + rt = rt_opts( + prior = list(mean = rt_log_mean, sd = rt_log_sd) + ), + # [new] set a fixed number of samples and chains + stan = stan_opts(samples = 1000, chains = 3) +) +``` + +::::::::::::::::::::::::::::::::: + +```{r, message = FALSE, eval = TRUE} +reported_cases <- cases[1:90, ] + +estimates <- epinow( + reported_cases = reported_cases, + generation_time = generation_time_opts(generation_time_fixed), + delays = delay_opts( + incubation_period_fixed + reporting_delay_fixed + ), + rt = rt_opts( + prior = list(mean = rt_log_mean, sd = rt_log_sd) + ) +) +``` + +### Results + +We can extract and visualise estimates of the effective reproduction number through time: + +```{r} +estimates$plots$R +``` + +The uncertainty in the estimates increases through time. This is because estimates are informed by data in the past - within the delay periods. This difference in uncertainty is categorised into **Estimate** (green) utilises all data and **Estimate based on partial data** (orange) estimates that are based on less data (because infections that happened at the time are more likely to not have been observed yet) and therefore have increasingly wider intervals towards the date of the last data point. Finally, the **Forecast** (purple) is a projection ahead of time. + +We can also visualise the growth rate estimate through time: +```{r} +estimates$plots$growth_rate +``` + +To extract a summary of the key transmission metrics at the *latest date* in the data: + +```{r} +summary(estimates) +``` + +As these estimates are based on partial data, they have a wide uncertainty interval. + ++ From the summary of our analysis we see that the expected change in daily cases is `r summary(estimates)$estimate[summary(estimates)$measure=="Expected change in daily cases"]` with the estimated new confirmed cases `r summary(estimates)$estimate[summary(estimates)$measure=="New confirmed cases by infection date"]`. + ++ The effective reproduction number $R_t$ estimate (on the last date of the data) is `r summary(estimates)$estimate[summary(estimates)$measure=="Effective reproduction no."]`. + ++ The exponential growth rate of case numbers is `r summary(estimates)$estimate[summary(estimates)$measure=="Rate of growth"]`. + ++ The doubling time (the time taken for case numbers to double) is `r summary(estimates)$estimate[summary(estimates)$measure=="Doubling/halving time (days)"]`. + +::::::::::::::::::::::::::::::::::::: callout +### `Expected change in daily cases` + +A factor describing expected change in daily cases based on the posterior probability that $R_t < 1$. + +
+| Probability ($p$) | Expected change | +| ------------- |-------------| +|$p < 0.05$ |Increasing | +|$0.05 \leq p< 0.4$ |Likely increasing | +|$0.4 \leq p< 0.6$ |Stable | +|$0.6 \leq p < 0.95$ |Likely decreasing | +|$0.95 \leq p \leq 1$ |Decreasing | +
+ +:::::::::::::::::::::::::::::::::::::::::::::::: + + + + +## Quantify geographical heterogeneity + +The outbreak data of the start of the COVID-19 pandemic from the United Kingdom from the R package `{incidence2}` includes the region in which the cases were recorded. To find regional estimates of the effective reproduction number and cases, we must format the data to have three columns: + ++ `date`: the date, ++ `region`: the region, ++ `confirm`: number of confirmed cases for a region on a given date. + +```{r} +regional_cases <- + incidence2::covidregionaldataUK[, c("date", "cases_new", "region")] +colnames(regional_cases) <- c("date", "confirm", "region") + +# extract the first 90 dates for all regions +dates <- sort(unique(regional_cases$date))[1:90] +regional_cases <- regional_cases[which(regional_cases$date %in% dates), ] + +head(regional_cases) +``` + +To find regional estimates, we use the same inputs as `epinow()` to the function `regional_epinow()`: + +```{r, message = FALSE, eval = TRUE} +estimates_regional <- regional_epinow( + reported_cases = regional_cases, + generation_time = generation_time_opts(generation_time_fixed), + delays = delay_opts( + incubation_period_fixed + reporting_delay_fixed + ), + rt = rt_opts( + prior = list(mean = rt_log_mean, sd = rt_log_sd) + ) +) + +estimates_regional$summary$summarised_results$table + +estimates_regional$summary$plots$R +``` + + +## Summary + +`EpiNow2` can be used to estimate transmission metrics from case data at any time in the course of an outbreak. The reliability of these estimates depends on the quality of the data and appropriate choice of delay distributions. In the next tutorial we will learn how to make forecasts and investigate some of the additional inference options available in `EpiNow2`. + +::::::::::::::::::::::::::::::::::::: keypoints + +- Transmission metrics can be estimated from case data after accounting for delays +- Uncertainty can be accounted for in delay distributions + +::::::::::::::::::::::::::::::::::::::::::::::::