jhu-data.csv
un-population.csv
home-office.csv
extra-uk-arrivals.csv
clusters_DTA_MCC_0.5.csv
clusters_DTA.csv
IATA_CountryLevel_Dec_May.csv
Due to ethical and legal restrictions we cannot upload IATA_CountryLevel_Dec_May.csv. To download the JHU and UN data we have the following script, download-data.sh.
wget "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
mv time_series_covid19_deaths_global.csv raw-data/jhu-deaths.csv
wget "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv"
mv WPP2019_TotalPopulationBySex.csv raw-data/un-population.csv
We retrieved the percentage of each country's population living in poverty from the World Bank, as curated by the team at Our World in Data, and extracted estimates of the same quantity from Elvidge et al (2009) to supplement the World Bank measurements.
The spatial resolution in the JHU data set varies between countries. In some instances we need to sum over the different states, such as Australia, and in some instances we only select the region with the most cases, such as with France, where various provinces are not included in the primary French death toll.
The path to the data has the 2020-08-19 directory in it because this is a time-stamped version of the data that was downloaded on that date.
library(dplyr)
library(magrittr)
library(purrr)
library(reshape2)
x <- read.csv("../../data/epidemiological/jhu-deaths.csv",
header = TRUE,
stringsAsFactors = FALSE) %>%
select(Province.State,
Country.Region,
starts_with("X")) %>%
filter(Country.Region != "Diamond Princess",
Country.Region != "MS Zaandam") %>%
melt(id.vars = c("Province.State","Country.Region"),
value.name = "cumulative_deaths",
variable.name = "date_string") %>%
mutate(date = as.Date(date_string, format = "X%m.%d.%y"))
countries_needing_cleaning <- x %>%
filter(Province.State != "") %>%
use_series("Country.Region") %>%
unique
subset_aus <- x %>%
filter(Country.Region == "Australia") %>%
group_by(Country.Region, date) %>%
summarise(cumulative_deaths = sum(cumulative_deaths))
subset_can <- x %>%
filter(Country.Region == "Canada",
Province.State != "Grand Princess",
Province.State != "Diamond Princess") %>%
group_by(Country.Region, date) %>%
summarise(cumulative_deaths = sum(cumulative_deaths))
subset_chn <- x %>%
filter(Country.Region == "China") %>%
group_by(Country.Region, date) %>%
summarise(cumulative_deaths = sum(cumulative_deaths))
subset_dnk <- x %>%
filter(Country.Region == "Denmark", Province.State == "") %>%
group_by(Country.Region, date) %>%
summarise(cumulative_deaths = sum(cumulative_deaths))
subset_fra <- x %>%
filter(Country.Region == "France", Province.State == "") %>%
group_by(Country.Region, date) %>%
summarise(cumulative_deaths = sum(cumulative_deaths))
subset_nld <- x %>%
filter(Country.Region == "Netherlands", Province.State == "") %>%
group_by(Country.Region, date) %>%
summarise(cumulative_deaths = sum(cumulative_deaths))
subset_gbr <- x %>%
filter(Country.Region == "United Kingdom", Province.State == "") %>%
group_by(Country.Region, date) %>%
summarise(cumulative_deaths = sum(cumulative_deaths))
result_ <- x %>%
filter(not(is.element(el = Country.Region, set = countries_needing_cleaning))) %>%
select(Country.Region, cumulative_deaths, date)
result <- rbind(result_,
subset_aus,
subset_can,
subset_chn,
subset_dnk,
subset_fra,
subset_nld,
subset_gbr) %>%
rename(location = Country.Region)
Some cruise ship data is scattered throughout the JHU data so this needs to be filtered out.
The JHU data has the cumulative number of deaths in each location on each date, but we want the actual number of deaths on each day, so for each location we calculate the daily difference in the cumulative number of deaths and use that. On some days the cumulative count decreases due to changes in the official numbers; on these days we set the daily value to zero. Since differencing reduces the length of a vector by one, we remove the first date from each country's data.
first_date <- min(result$date)
diffed_deaths <- function(location_str) {
tmp <- filter(result, location == location_str)
death_diff <- diff(tmp$cumulative_deaths)
tmp <- filter(tmp, date > first_date)
tmp$daily_deaths <- pmax(0, death_diff)
return(tmp)
}
location_names <- unique(result$location)
diffed_result <- location_names %>%
map(diffed_deaths) %>%
bind_rows
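As a quick check of the differencing step, the following sketch applies the same `diff` and `pmax` combination to a made-up cumulative series. The numbers are hypothetical and not taken from the JHU data; the dip from 5 to 4 stands in for a downward revision of the official count.

```r
# Hypothetical cumulative death counts for one location over five days.
cumulative_deaths <- c(2, 5, 4, 4, 9)

# Day-on-day change; the result is one element shorter than the input.
death_diff <- diff(cumulative_deaths)

# Negative changes (downward revisions) are truncated to zero.
daily_deaths <- pmax(0, death_diff)

print(death_diff)
print(daily_deaths)
```

One consequence of truncating at zero is that the daily values need not sum to the overall change in the cumulative count: here they sum to 8, while the cumulative count only rises by 7.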
There is a spike in the number of deaths from China on the 17th of April after a large number of deaths had been added to the official numbers.
> diffed_result %>% filter(location == "China", date > "2020-04-14", date < "2020-04-20")
  location cumulative_deaths       date daily_deaths
1    China              3346 2020-04-15            1
2    China              3346 2020-04-16            0
3    China              4636 2020-04-17         1290
4    China              4636 2020-04-18            0
5    China              4636 2020-04-19            0
> nrow(filter(diffed_result, location == "China", date < "2020-04-17"))
[1] 85
> head(filter(diffed_result, location == "China"))
  location cumulative_deaths       date daily_deaths
1    China                18 2020-01-23            1
2    China                26 2020-01-24            8
3    China                42 2020-01-25           16
4    China                56 2020-01-26           14
5    China                82 2020-01-27           26
6    China               131 2020-01-28           49
We want to account for these deaths, so we will uniformly distribute them over
the previous 85 days in the data set. To see why it is 85 days, recall that the
data only goes back to the 23rd of January. We will use a new variable result
to distinguish between the data frames pre- and post-adjustment.
result <- diffed_result
result[result$location == "China" & result$date == "2020-04-17", "daily_deaths"] <- 0
.mask <- result$location == "China" & result$date < "2020-04-17"
death_adjustment <- 1290 / nrow(result[.mask,])
result[.mask, "daily_deaths"] <- result[.mask, "daily_deaths"] + death_adjustment
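The arithmetic behind this adjustment can be checked in isolation. The sketch below uses only the two numbers reported above (1290 deaths and 85 days); the variable names `spike_deaths`, `num_prior_days` and `prior_dates` are introduced here for illustration.

```r
# Numbers taken from the console output above.
spike_deaths <- 1290
num_prior_days <- 85

# Amount added to the daily count on each of the prior days.
death_adjustment <- spike_deaths / num_prior_days
print(death_adjustment)

# The 85 days run from 23 January to 16 April 2020 inclusive.
prior_dates <- seq(from = as.Date("2020-01-23"),
                   to = as.Date("2020-04-16"),
                   by = 1)
stopifnot(length(prior_dates) == num_prior_days)

# Spreading the spike uniformly leaves the total number of deaths unchanged.
stopifnot(isTRUE(all.equal(death_adjustment * num_prior_days, spike_deaths)))
```

The adjustment is roughly 15.18 deaths per day, which is why the adjusted daily values are no longer whole numbers.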
Then we can do the same check to make sure things look sensible.
> result %>% filter(location == "China", date > "2020-04-14", date < "2020-04-20")
  location cumulative_deaths       date daily_deaths
1    China              3346 2020-04-15     16.17647
2    China              3346 2020-04-16     15.17647
3    China              4636 2020-04-17      0.00000
4    China              4636 2020-04-18      0.00000
5    China              4636 2020-04-19      0.00000
Since some datasets consider Kosovo as part of Serbia, we will aggregate these locations in the deaths time series.
.kosovo_mask <- result$location == "Kosovo"
.kosovo_cumulative_deaths <- result[.kosovo_mask, "cumulative_deaths"]
.kosovo_daily_deaths <- result[.kosovo_mask, "daily_deaths"]
result_rm_kosovo <- filter(result, location != "Kosovo")
.serbia_mask <- result_rm_kosovo$location == "Serbia"
result_rm_kosovo[.serbia_mask, "cumulative_deaths"] <- result_rm_kosovo[.serbia_mask, "cumulative_deaths"] + .kosovo_cumulative_deaths
result_rm_kosovo[.serbia_mask, "daily_deaths"] <- result_rm_kosovo[.serbia_mask, "daily_deaths"] + .kosovo_daily_deaths
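The mask-based addition above relies on the Serbia and Kosovo rows covering the same dates in the same order. A minimal base R sketch of an alternative that does not depend on row order is to relabel and then sum within each date. The data frame `toy` and its values are hypothetical, and this is not the approach used in the script.

```r
# Hypothetical daily records for two locations on the same two dates.
toy <- data.frame(
  location = c("Serbia", "Serbia", "Kosovo", "Kosovo"),
  date = as.Date(c("2020-04-01", "2020-04-02", "2020-04-01", "2020-04-02")),
  cumulative_deaths = c(10, 12, 1, 3),
  daily_deaths = c(2, 2, 0, 2),
  stringsAsFactors = FALSE
)

# Relabel Kosovo as Serbia, then sum within each (location, date) pair.
toy$location[toy$location == "Kosovo"] <- "Serbia"
merged <- aggregate(cbind(cumulative_deaths, daily_deaths) ~ location + date,
                    data = toy,
                    FUN = sum)
print(merged)
```

If the two locations had different date ranges, this version would keep the unmatched dates rather than recycling or failing, which may or may not be the desired behaviour.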
Then we save the results to a file output/clean-jhu-deaths.csv
for use in the
epidemiological model.
write.table(x = result_rm_kosovo,
file = "results/clean-jhu-deaths.csv",
sep = ",",
row.names = FALSE)
The Medium variant of the data corresponds to a sensible choice of assumptions; more details can be found at the site below
https://population.un.org/wpp/DefinitionOfProjectionVariants/
A suggested citation for this data set can be obtained at the following page
https://population.un.org/wpp/Download/Standard/CSV/
library(dplyr)
x <- read.csv("../../data/epidemiological/un-population.csv",
header = TRUE,
stringsAsFactors = FALSE) %>%
filter(Variant == "Medium", Time == 2020) %>%
select(Location, PopTotal) %>%
rename(location = Location, population_size = PopTotal)
jhu_deaths <- read.csv("results/clean-jhu-deaths.csv") %>%
mutate(date = as.Date(date),
deaths = daily_deaths) %>%
filter(location != "United Kingdom",
location != "West Bank and Gaza")
mutate_location_name <- function(df, old_name, new_name) {
mask <- df$location == old_name
df[mask,"location"] <- new_name
return(df)
}
x <- mutate_location_name(x, "Russian Federation", "Russia")
x <- mutate_location_name(x, "Bolivia (Plurinational State of)", "Bolivia")
x <- mutate_location_name(x, "Republic of Korea", "Korea, South")
x <- mutate_location_name(x, "United States of America", "US")
x <- mutate_location_name(x, "Iran (Islamic Republic of)", "Iran")
x <- mutate_location_name(x, "Brunei Darussalam", "Brunei")
x <- mutate_location_name(x, "United Republic of Tanzania", "Tanzania")
x <- mutate_location_name(x, "Syrian Arab Republic", "Syria")
x <- mutate_location_name(x, "China, Taiwan Province of China", "Taiwan*")
x <- mutate_location_name(x, "Venezuela (Bolivarian Republic of)", "Venezuela")
x <- mutate_location_name(x, "Republic of Moldova", "Moldova")
x <- mutate_location_name(x, "Viet Nam", "Vietnam")
x <- mutate_location_name(x, "Myanmar", "Burma")
x <- mutate_location_name(x, "Congo", "Congo (Brazzaville)")
x <- mutate_location_name(x, "Democratic Republic of the Congo", "Congo (Kinshasa)")
x <- mutate_location_name(x, "Côte d'Ivoire", "Cote d'Ivoire")
x <- mutate_location_name(x, "Lao People's Democratic Republic", "Laos")
stopifnot(length(setdiff(jhu_deaths$location, x$location)) == 0)
write.table(x = x,
file = "results/clean-un-population.csv",
sep = ",",
row.names = FALSE)
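The renaming and the final `setdiff` check can be illustrated on a small example. In the sketch below the population table `pop`, its values, and the vector `deaths_locations` are made up for illustration; only `mutate_location_name` is copied from the script.

```r
mutate_location_name <- function(df, old_name, new_name) {
  mask <- df$location == old_name
  df[mask, "location"] <- new_name
  return(df)
}

# Hypothetical population table using UN-style names and made-up sizes.
pop <- data.frame(location = c("Viet Nam", "Russian Federation", "Chile"),
                  population_size = c(100, 200, 300),
                  stringsAsFactors = FALSE)

# Hypothetical location names as they appear in the deaths data.
deaths_locations <- c("Vietnam", "Russia", "Chile")

# Before renaming, two of the names fail to match.
unmatched_before <- setdiff(deaths_locations, pop$location)
print(unmatched_before)

pop <- mutate_location_name(pop, "Viet Nam", "Vietnam")
pop <- mutate_location_name(pop, "Russian Federation", "Russia")

# After renaming, every deaths location has a population record.
stopifnot(length(setdiff(deaths_locations, pop$location)) == 0)
```

Printing the `setdiff` result before adding the `stopifnot` is a convenient way to discover which names still need a mapping.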
Cleaning the IATA data is a little bit tricky because it requires expanding each record of the original data frame into multiple records in the clean one and inserting zeros where there are no records of travellers. The resulting code is very inefficient but only takes a couple of minutes to run, so there is no point in optimising it.
library(magrittr)
library(dplyr)
library(purrr)
library(memoise)
x <- read.csv("../../data/epidemiological/IATA_CountryLevel_Dec_May.csv",
header = TRUE,
stringsAsFactors = FALSE) %>%
filter(month != "", country_origin != "United Kingdom") %>%
select(country_origin, month, total_volume) %>%
rename(location = country_origin, num_travellers = total_volume)
month_total_factory <- function(x) {
function(country_str, month_str) {
maybe_count <- x %>%
filter(location == country_str,
grepl(pattern = month_str, x = month)) %>%
use_series("num_travellers")
switch(length(maybe_count)+1,
0,
maybe_count,
stop("Bad country and month: ", country_str, month_str))
}
}
month_total <- month_total_factory(x)
m_month_total <- memoise(month_total)
location_names <- unique(x$location)
num_locations <- length(location_names)
dates <- seq(from = as.Date("2019-12-01"),
to = as.Date("2020-04-30"),
by = 1)
month_global_total_factory <- function(x) {
month_totals_df <- x %>%
group_by(month) %>%
summarise(total_travellers = sum(num_travellers))
function(month_str) {
maybe_count <- month_totals_df %>%
filter(grepl(pattern = month_str, x = month)) %>%
use_series("total_travellers")
if (length(maybe_count) == 1) {
return(maybe_count)
} else {
stop("Bad month: ", month_str)
}
}
}
month_global_total <- month_global_total_factory(x)
m_month_global_total <- memoise(month_global_total)
record_list <- function(country_str, date_obj) {
month_str <- format(date_obj, format = "%b")
month_passenger_count <- m_month_total(country_str, month_str)
month_total_count <- m_month_global_total(month_str)
daily_average <- month_passenger_count / month_total_count
data.frame(location = country_str,
date = date_obj,
daily_average = daily_average)
}
result <- cross2(location_names, dates) %>%
map(lift_dl(record_list)) %>%
bind_rows
write.table(x = result,
file = "results/clean-iata.csv",
sep = ",",
row.names = FALSE)
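The expansion from monthly records to one record per location per day can also be sketched with base R joins. This is an illustrative alternative on made-up data, not the script's method; the data frame `monthly` and its values are hypothetical. Note that `cross2` and `lift_dl` are deprecated in purrr 1.0.0 and later, so a join-based version may be easier to maintain.

```r
# Hypothetical monthly traveller counts; "B" has no record for January.
monthly <- data.frame(location = c("A", "B", "A"),
                      month = c("Dec", "Dec", "Jan"),
                      num_travellers = c(30, 10, 20),
                      stringsAsFactors = FALSE)

# One row for every combination of location and date.
dates <- seq(from = as.Date("2019-12-30"), to = as.Date("2020-01-02"), by = 1)
grid <- expand.grid(location = unique(monthly$location),
                    date = dates,
                    stringsAsFactors = FALSE)
# month.abb gives English abbreviations regardless of the locale.
grid$month <- month.abb[as.integer(format(grid$date, format = "%m"))]

# Left join; missing (location, month) pairs come back as NA and are set to zero.
expanded <- merge(grid, monthly, by = c("location", "month"), all.x = TRUE)
expanded$num_travellers[is.na(expanded$num_travellers)] <- 0

# Divide by that month's global total, as in the script above.
month_totals <- aggregate(num_travellers ~ month, data = monthly, FUN = sum)
names(month_totals)[2] <- "total_travellers"
expanded <- merge(expanded, month_totals, by = "month")
expanded$daily_average <- expanded$num_travellers / expanded$total_travellers
expanded <- expanded[order(expanded$location, expanded$date), ]
print(expanded[, c("location", "date", "daily_average")])
```

As in the script, the column named `daily_average` holds the location's share of the month's travellers rather than a count.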
We only need a couple of columns for the Home Office data and that is already in a reasonably tidy state so it is a simple select and rename.
library(dplyr)
result <- read.csv("../../data/epidemiological/home-office.csv") %>%
mutate(total_air_travels = as.numeric(gsub(pattern = ",",
replacement = "",
x = Total.air.arrivals))) %>%
select(Date,total_air_travels) %>%
rename(date = Date)
write.table(x = result,
file = "results/clean-home-office.csv",
sep = ",",
row.names = FALSE)
We can re-use a lot of the code from the IATA cleaning script to get a CSV of the number of passengers arriving via methods other than air.
library(reshape2)
library(dplyr)
library(magrittr)
library(purrr)
library(memoise)
x <- read.csv("../../data/epidemiological/extra-uk-arrivals.csv",
header = TRUE,
stringsAsFactors = FALSE) %>%
mutate(country = gsub(pattern = " total", replacement = "", x = X)) %>%
select(country, matches("*daily")) %>%
melt(id.vars = "country", variable.name = "month_var", value.name = "daily_count")
daily_number_factory <- function(x) {
function(country_str, month_str) {
maybe_count <- x %>%
filter(country == country_str,
grepl(pattern = month_str, x = month_var)) %>%
use_series("daily_count")
switch(length(maybe_count)+1,
NA,
maybe_count,
stop("Bad country and month: ", country_str, month_str))
}
}
daily_number <- daily_number_factory(x)
m_daily_number <- memoise(daily_number)
record_list <- function(country_str, date_obj) {
month_str <- format(date_obj, format = "%b")
daily_passenger_count <- m_daily_number(country_str, month_str)
data.frame(location = country_str,
date = date_obj,
daily_average = daily_passenger_count)
}
location_names <- unique(x$country)
dates <- seq(from = as.Date("2019-12-01"),
to = as.Date("2020-04-30"),
by = 1)
result <- cross2(location_names, dates) %>% map(lift_dl(record_list)) %>% bind_rows
write.table(x = result,
file = "results/clean-non-air-travel.csv",
sep = ",",
row.names = FALSE)
We extracted Table 1 from Elvidge et al (2009) using tabula-1.2.1 and adjusted the header for clarity. This data was then stored in the file data/global_population_data/2020-09-07/elvidge2009global.csv. The following script, clean-elvidge-poverty.R, was then used to further clean this dataset.
We define a convenience function for adjusting the names of locations.
## @@ mutate-location-name-defn @@
mutate_location_name <- function(df, old_name, new_name) {
mask <- df$location == old_name
df[mask,"location"] <- new_name
return(df)
}
Then we read in the data and adjust some of the location names so that they match those used in the JHU database.
poverty_df <- read.csv("../../data/epidemiological/elvidge2009global.csv") %>%
rename(location = country) %>%
select(location,
estimated_percentage_in_poverty) %>%
mutate_location_name("Czech Republic", "Czechia") %>%
mutate_location_name("South Korea", "Korea, South") %>%
mutate_location_name("United States", "US") %>%
mutate_location_name("UAE", "United Arab Emirates")
write.table(x = poverty_df,
file = "results/clean-elvidge2009global.csv",
sep = ",",
row.names = FALSE)
Unfortunately, the Elvidge dataset has some dubious values in it, so we will primarily use a World Bank dataset that has been curated by Our World in Data.
The script for cleaning the second set of poverty data is called clean-owid-poverty.R. Where possible we adjust location names to match those in the JHU dataset. For locations where there is not a World Bank estimate, we default to the ones from the previous section.
poverty_df <- read.csv("../../data/epidemiological/share-of-the-population-living-in-extreme-poverty.csv") %>%
rename(location = Entity,
year = Year,
poverty_percentage = Share.of.the.population.living.in.extreme.poverty....) %>%
select(location,year,poverty_percentage) %>%
mutate_location_name("Czech Republic", "Czechia") %>%
mutate_location_name("South Korea", "Korea, South") %>%
mutate_location_name("United States", "US")
missing_locs <- setdiff(primary_source_locations, poverty_df$location)
poverty_elvidge_df <- read.csv("results/clean-elvidge2009global.csv") %>%
rename(poverty_percentage = estimated_percentage_in_poverty) %>%
filter(location %in% missing_locs) %>%
mutate(year = 2009)
poverty_df <- bind_rows(poverty_df, poverty_elvidge_df) %>%
group_by(location) %>%
summarise(latest_poverty_percentage = poverty_percentage[which.max(year)])
other_mean <- poverty_df %>%
filter(!(location %in% primary_source_locations)) %>%
use_series("latest_poverty_percentage") %>%
mean
poverty_other_record <- data.frame(location = "other",
latest_poverty_percentage = other_mean)
poverty_df <- poverty_df %>%
filter(location %in% primary_source_locations) %>%
bind_rows(poverty_other_record) %>%
as.data.frame
cat("clean-owid-poverty.R")
setdiff(primary_source_locations, poverty_df$location)
stopifnot(length(setdiff(primary_source_locations, poverty_df$location))==0)
write.table(x = poverty_df,
file = "results/clean-worldbankpoverty.csv",
sep = ",",
row.names = FALSE)
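The step that keeps the most recent estimate per location uses `which.max(year)` within each group. A minimal base R sketch of the same idea on made-up data follows; the data frame `toy` and its values are hypothetical.

```r
# Hypothetical poverty estimates for two locations across several years.
toy <- data.frame(location = c("A", "A", "B", "B", "B"),
                  year = c(2010, 2015, 2009, 2012, 2011),
                  poverty_percentage = c(12.0, 9.5, 30.0, 22.0, 25.0),
                  stringsAsFactors = FALSE)

# For each location keep the value recorded in the most recent year.
latest <- do.call(rbind, lapply(split(toy, toy$location), function(df) {
  data.frame(location = df$location[1],
             latest_poverty_percentage = df$poverty_percentage[which.max(df$year)],
             stringsAsFactors = FALSE)
}))
rownames(latest) <- NULL
print(latest)
```

The rows do not need to be sorted by year, since `which.max` picks the position of the latest year wherever it occurs. Here location A keeps its 2015 value and location B its 2012 value.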
To estimate the number of arrivals into the UK from each location, we combine the data sets and apply the following formula for each country on each day.
\[ \text{arrivals} = \text{proportion IATA} \times \text{Home Office number} + \text{non-air numbers} \]
library(magrittr)
library(dplyr)
iata_df <- read.csv("results/clean-iata.csv") %>%
mutate(date = as.Date(date, format = "%Y-%m-%d"))
home_office_df <- read.csv("results/clean-home-office.csv") %>%
mutate(date = as.Date(date, format = "%d-%b-%y"))
iata_and_ho_df <- left_join(iata_df, home_office_df, by = "date")
non_air_df <- read.csv("results/clean-non-air-travel.csv") %>%
mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
rename(non_air_num = daily_average)
all_arrivals_df <- left_join(iata_and_ho_df, non_air_df) %>%
mutate(non_air_num = ifelse(test = is.na(non_air_num), yes = 0, no = non_air_num),
estimate = daily_average * total_air_travels + non_air_num) %>%
filter(!is.na(estimate))
write.table(x = all_arrivals_df,
file = "results/estimated-arrivals.csv",
sep = ",",
row.names = FALSE)
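A worked instance of the arrivals formula for a single origin on a single day may help. All three input values below are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Hypothetical inputs for one origin country on one day.
iata_proportion <- 0.02      # share of that month's IATA travellers from this origin
home_office_total <- 50000   # total air arrivals into the UK on that day
non_air_num <- NA            # no non-air record for this origin

# Missing non-air counts are treated as zero, as in the script above.
non_air_num <- ifelse(test = is.na(non_air_num), yes = 0, no = non_air_num)

estimate <- iata_proportion * home_office_total + non_air_num
print(estimate)
```

With these inputs the estimate is about 1000 arrivals, all of them by air.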
Consider a matrix \(A\) where the entry \(A_{i,j}\) is the number of people on
day \(i\) who have a \(j-1\) day old infection (at this stage only infections
that end in death are counted). So the first column of \(A\) is the number of
people infected on each day and the final column is the number of people that
die on each day. The upper right corner of the matrix is set to zero, assuming
there were no deaths prior to those in the data, and the bottom left corner is
NA because of censoring of the data. The following function takes the number of
days from infection to death (among those who die from the infection) and a
vector of the number of deaths on each day, and returns the corresponding
\(A\) matrix.
## @@ age-of-infection-matrix-defn @@
age_of_infection_matrix <- function(inf_to_death_days, deaths_vector) {
num_days_in_matrix <- length(deaths_vector) + inf_to_death_days
result <- matrix(data = NA,
nrow = num_days_in_matrix,
ncol = inf_to_death_days + 1)
padded_deaths <- c(rep(0, inf_to_death_days),
deaths_vector,
rep(NA, inf_to_death_days))
for (i in 1:num_days_in_matrix) {
result[i,] <- rev(padded_deaths[i + (0:inf_to_death_days)])
}
return(result)
}
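To see the shape of this matrix, the function can be applied to a tiny example. The inputs below (three days of deaths and a two-day delay from infection to death) are hypothetical and chosen only so that the result is small enough to read.

```r
age_of_infection_matrix <- function(inf_to_death_days, deaths_vector) {
  num_days_in_matrix <- length(deaths_vector) + inf_to_death_days
  result <- matrix(data = NA,
                   nrow = num_days_in_matrix,
                   ncol = inf_to_death_days + 1)
  padded_deaths <- c(rep(0, inf_to_death_days),
                     deaths_vector,
                     rep(NA, inf_to_death_days))
  for (i in 1:num_days_in_matrix) {
    result[i,] <- rev(padded_deaths[i + (0:inf_to_death_days)])
  }
  return(result)
}

# Hypothetical input: deaths of 1, 2 and 3 on consecutive days, two-day delay.
A <- age_of_infection_matrix(inf_to_death_days = 2, deaths_vector = c(1, 2, 3))
print(A)
```

The result has five rows and three columns. Its rows are (1, 0, 0), (2, 1, 0), (3, 2, 1), (NA, 3, 2) and (NA, NA, 3), so the zeros sit in the upper right corner and the NA values in the lower left corner. The first column holds the deaths shifted two days earlier, which are the corresponding infections.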
We assume that once someone experiences symptoms, they will not be able to travel any more. So the people that are capable of seeding a cluster in the UK are those that are either incubating, or asymptomatic and within the first \(d_{\text{latent}} + d_{\text{infectious}}\) days of their infection. So all of the first \(d_{\text{incubating}} + 1\) columns plus a fraction of the remaining columns up to \(d_{\text{latent}} + d_{\text{infectious}} + 1\) make up the people who could potentially seed a cluster on each day.
The proportion of asymptomatic infections is very uncertain, so we keep this parameter variable for sensitivity analysis; since most infections happen earlier, it seems that we should opt for a smaller value.
The function padded_potential_seeders returns a vector of the total number of
potential seeders through time given the age matrix computed from the deaths on
each day. The time axis is padded in the same way as the result of the age
matrix function, age_of_infection_matrix.
## @@ padded-potential-seeders-defn @@
padded_potential_seeders <- function(age_matrix,
days_latent,
days_incubating,
days_infectious,
prop_asymptomatic) {
presymptomatic_cases <- (age_matrix[,0:days_incubating + 1])
asymptomatic_cases <- prop_asymptomatic * age_matrix[,(days_incubating + 1):(days_latent + days_infectious) + 1]
padded_total_seeders <- rowSums(cbind(presymptomatic_cases, asymptomatic_cases))
return(padded_total_seeders)
}
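The column indices in this function rely on R's operator precedence, where `:` binds more tightly than `+`. The sketch below spells out which columns are selected for the parameter values used later in this document. The one-row matrix `age_row` is a hypothetical input with one person in every age column.

```r
days_latent <- 3
days_incubating <- 5
days_infectious <- 7
prop_asymptomatic <- 0.31

# `:` is evaluated before `+`, so these are (0:5) + 1 and (6:10) + 1.
presymptomatic_cols <- 0:days_incubating + 1
asymptomatic_cols <- (days_incubating + 1):(days_latent + days_infectious) + 1
print(presymptomatic_cols)
print(asymptomatic_cols)

# Hypothetical one-row age matrix: 24 columns, one person in each.
age_row <- matrix(1, nrow = 1, ncol = 24)
seeders <- sum(age_row[, presymptomatic_cols]) +
  prop_asymptomatic * sum(age_row[, asymptomatic_cols])
print(seeders)
```

Columns 1 to 6 are counted in full and columns 7 to 11 are weighted by the asymptomatic proportion, giving 6 + 0.31 * 5 = 7.55 for this input.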
Now, to actually estimate the proportion of a country’s population that could potentially seed a cluster in the UK, we need the number of covid-19 deaths on each day in that country, the parameters of the infection, and the population of the country.
## @@ seeder-proportion-defn @@
seeder_proportion <- function(deaths_df,
location_population,
days_latent,
days_incubating,
days_infectious,
prop_asymptomatic,
days_infection_to_death,
infection_fatality_ratio) {
if (!setequal(names(deaths_df), c("deaths", "date"))) {
stop("Bad dataframe names: ", names(deaths_df))
}
age_matrix <- age_of_infection_matrix(days_infection_to_death, deaths_df$deaths)
potential_seeders <- padded_potential_seeders(age_matrix,
days_latent,
days_incubating,
days_infectious,
prop_asymptomatic)
start_date <- min(deaths_df$date)
padding_dates <- seq(from = start_date - days_infection_to_death,
to = start_date - 1,
by = 1)
total_dates <- c(padding_dates, deaths_df$date)
data.frame(date = total_dates,
seeder_proportion = infection_fatality_ratio * potential_seeders / location_population)
}
Finally, we need to use these functions along with the deaths data and the population data to estimate the infectious proportion in each country through time and write all of this to a CSV at the end. The first step is to read in the data, fix up some discrepancies in location names, and define parameters of covid-19.
The JHU data for West Bank and Gaza is removed because it is unclear how it should be merged into the rest of the data.
## @@ jhu-un-location-unification @@
library(dplyr)
library(purrr)
jhu_deaths <- read.csv("results/clean-jhu-deaths.csv") %>%
mutate(date = as.Date(date),
deaths = daily_deaths) %>%
filter(location != "United Kingdom",
location != "West Bank and Gaza")
un_populations <- read.csv("results/clean-un-population.csv")
stopifnot(length(setdiff(jhu_deaths$location, un_populations$location)) == 0)
When we are estimating the number of people in each state of infection we need the parameters for the average amount of time people spend in each state. The following point estimates were derived from parameter values reported in the literature.
- The number of days from symptom onset to death is estimated to be 18 days https://doi.org/10.1016/S1473-3099(20)30243-7
… we estimated the mean duration from onset of symptoms to death to be 17·8 days (95% credible interval [CrI] 16·9–19·2)…
- The incubation period was assumed to be 5 days https://www.nejm.org/doi/full/10.1056/NEJMOa2001316
The mean incubation period was 5.2 days (95% confidence interval [CI], 4.1 to 7.0)…
- \(23 = 5 + 18\) so the days from infection to death has a mean of 23 assuming the duration of the incubation period is independent of the time from symptom onset to death.
- The latent phase was assumed to finish 2 days before the onset of symptoms. In https://www.nature.com/articles/s41591-020-0869-5 it was estimated that, although transmission can occur substantially before symptom onset, \(<10\%\) of transmission occurs prior to 3 days before symptom onset, and the start of infectiousness likely has a mean closer to 2 days before onset. Since \(3 = 5 - 2\), the days spent latent is 3.
… start of infectiousness at least 2 days before onset and peak infectiousness at 2 days before to 1 day after onset would be most consistent with this observed proportion (Extended Data Fig. 3).
- From the same publication, “Infectiousness was estimated to decline quickly within 7 days.”, so given the quote below, we set the number of days for which an individual is infectious to 7.
…infectiousness may decline significantly 8 days after symptom onset, as live virus could no longer be cultured (according to Wölfel and colleagues).
We define variables storing these parameter values.
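## @@ model-parameters @@
days_latent <- 3
days_incubating <- 5
days_infectious <- 7
prop_asymptomatic <- 0.31
days_infection_to_death <- 23
## Used as a multiplier on deaths (infections per death), so a value of 100
## corresponds to one death per 100 infections.
infection_fatality_ratio <- 100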
Then we define a wrapper function that will compute the values for a specific
location based on un_populations and jhu_deaths. We map this over all the
locations, bind the results, and write them to a CSV,
output/estimated-proportion-seeders.csv, for use in subsequent calculations.
Note that when reading the population size there is an additional factor of \(10^3\) because the UN population values are reported in thousands.
location_seeder_props <- function(location_str) {
deaths_df <- filter(jhu_deaths, location == location_str) %>%
select(date,deaths)
if (is.element(location_str, un_populations$location)) {
pop_size <- 1e3 * un_populations[un_populations$location == location_str, "population_size"]
} else {
stop("Cannot find a population size for ", location_str)
}
seeder_props <- seeder_proportion(deaths_df,
pop_size,
days_latent,
days_incubating,
days_infectious,
prop_asymptomatic,
days_infection_to_death,
infection_fatality_ratio)
seeder_props$location <- location_str
return(seeder_props)
}
result <- map(.x = unique(jhu_deaths$location),
.f = location_seeder_props) %>%
bind_rows %>%
filter(date < "2020-07-01")
stopifnot(!any(is.na(result$seeder_proportion)))
write.table(x = result,
file = "results/estimated-proportion-seeders.csv",
sep = ",",
row.names = FALSE)
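The two scaling factors in this calculation, the factor of \(10^3\) on the population and the multiplier of 100 on the deaths, can be followed through on a single made-up value. The population size and the seeder count below are hypothetical, and `seeder_prop` is a name introduced here for illustration.

```r
# Hypothetical inputs for a single location on a single day.
population_size_thousands <- 25000   # as stored in the UN file, in thousands
potential_seeders_deaths <- 10       # one element of the padded_potential_seeders result
infection_fatality_ratio <- 100      # multiplier used in this document

# Convert the population to a head count.
pop_size <- 1e3 * population_size_thousands

# Scale deaths up to infections, then divide by the population.
seeder_prop <- infection_fatality_ratio * potential_seeders_deaths / pop_size
print(seeder_prop)
```

Here 10 eventual deaths among potential seeders scale to 1000 infections in a population of 25 million, giving a proportion of about 4 in 100,000.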
We might also be interested in the estimated incidence on each day under this
method. This is just a simple scaling and transformation of the data, but we do
it in a similar fashion to the functions above partly as a sanity check on the
code. Note that because we removed the UK from jhu_deaths above, this
result does not include the estimates for the UK; they are drawn from other work
in our figures.
location_num_infections <- function(location_str) {
deaths_df <- filter(jhu_deaths, location == location_str) %>%
select(date,deaths)
age_matrix <- age_of_infection_matrix(days_infection_to_death, deaths_df$deaths)
num_deaths_infs <- age_matrix[,1]
start_date <- min(deaths_df$date)
padding_dates <- seq(from = start_date - days_infection_to_death,
to = start_date - 1,
by = 1)
total_dates <- c(padding_dates, deaths_df$date)
data.frame(date = total_dates,
num_infs = infection_fatality_ratio * num_deaths_infs,
location = location_str)
}
result <- map(.x = unique(jhu_deaths$location),
.f = location_num_infections) %>%
bind_rows
write.table(x = result,
file = "results/estimated-daily-infections.csv",
sep = ",",
row.names = FALSE)
The current data set is unwieldy because there are lots of locations that have a very low probability of having seeded a cluster. To reduce this, we keep only those countries that together account for the top \(99\%\) of the cumulative number of deaths at the start of May; we exclude the UK to capture more of the external pandemic.
## @@ primary-source-locations-defn @@
threshold_date <- as.Date("2020-05-01")
threshold_level <- 0.99
jhu_deaths_df <- read.csv("results/clean-jhu-deaths.csv") %>%
mutate(date = as.Date(date)) %>%
filter(location != "United Kingdom")
final_count_df <- jhu_deaths_df %>%
filter(date == threshold_date) %>%
rename(final_count = cumulative_deaths)
sorted_final_counts <- sort(final_count_df$final_count, decreasing = TRUE)
cumulative_proportions <- cumsum(sorted_final_counts) / sum(sorted_final_counts)
mask <- cumulative_proportions <= threshold_level
threshold <- min(sorted_final_counts[mask])
primary_source_locations <- final_count_df %>%
filter(final_count >= threshold) %>%
use_series("location")
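The threshold selection can be traced on a small example. The counts below are hypothetical, and a threshold level of 0.9 is used instead of 0.99 so that the cut falls in the middle of the list; `primary` is a name introduced here for illustration.

```r
# Hypothetical cumulative death counts on the threshold date.
final_count_df <- data.frame(location = c("A", "B", "C", "D", "E"),
                             final_count = c(300, 500, 40, 150, 10),
                             stringsAsFactors = FALSE)
threshold_level <- 0.9

# Sort from largest to smallest and accumulate the share of the total.
sorted_final_counts <- sort(final_count_df$final_count, decreasing = TRUE)
cumulative_proportions <- cumsum(sorted_final_counts) / sum(sorted_final_counts)

# Keep the largest counts whose cumulative share does not exceed the level.
mask <- cumulative_proportions <= threshold_level
threshold <- min(sorted_final_counts[mask])
primary <- final_count_df$location[final_count_df$final_count >= threshold]
print(threshold)
print(primary)
```

The cumulative shares are 0.5, 0.8, 0.95, 0.99 and 1, so only the two largest locations stay under the level, the threshold is 300, and locations A and B are selected. The selected set therefore covers at most the threshold level (here 80%), not at least.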
The threshold is set to 94. Of the 184 locations in the JHU data set (excluding the UK), 53 contribute 99% of the cumulative deaths as of May 1 2020. We considered these primary sources and aggregated the remaining 131 locations into a single “other” category.
> print(threshold)
[1] 94
> print(length(primary_source_locations))
[1] 53
> print(length(unique(final_count_df$location)))
[1] 184
> print(length(filter(final_count_df, final_count < threshold)$location))
[1] 131
The file output/estimated-arrivals.csv contains our estimates of the total
number of arrivals into the UK from each country and the file
output/estimated-proportion-seeders.csv contains our estimates of the
proportion of people in each of those countries that could potentially seed a
cluster after coming to the UK. We assume that arrival in the UK is independent
of COVID-19 status if you are not symptomatic, which means that the estimate of
the number of people who entered the UK and are capable of seeding a cluster is
just the product of these estimates.
prop_potential_seeders <- read.csv("results/estimated-proportion-seeders.csv",
stringsAsFactors = FALSE) %>%
mutate(date = as.Date(date)) %>%
dplyr::select(date, location, seeder_proportion)
stopifnot(!any(is.na(prop_potential_seeders$seeder_proportion)))
stopifnot(all(intersect(primary_source_locations, prop_potential_seeders$location) == primary_source_locations))
estimated_arrivals <- read.csv("results/estimated-arrivals.csv") %>%
mutate(date = as.Date(date)) %>%
rename(num_arrivals = estimate) %>%
dplyr::select(date, location, num_arrivals)
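The product of the two estimates can be sketched as a join on date and location followed by a multiplication. The two small data frames and the column name `num_seeders` below are hypothetical, and base R `merge` is used here only to keep the example self-contained.

```r
# Hypothetical estimates for two locations on one date.
arrivals <- data.frame(date = as.Date("2020-03-01"),
                       location = c("A", "B"),
                       num_arrivals = c(1000, 250),
                       stringsAsFactors = FALSE)
seeders <- data.frame(date = as.Date("2020-03-01"),
                      location = c("A", "B"),
                      seeder_proportion = c(0.002, 0.01),
                      stringsAsFactors = FALSE)

# Match records on date and location, then multiply the two estimates.
combined <- merge(arrivals, seeders, by = c("date", "location"))
combined$num_seeders <- combined$num_arrivals * combined$seeder_proportion
print(combined)
```

An inner join like this silently drops any location whose name differs between the two tables, which is why the location names need to be unified first.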
Many locations are denoted with different strings in the arrivals data and the potential seeders data, so we need to unify these where possible. Locations with no matching COVID-19 deaths data need to be removed from the arrivals data.
<<mutate-location-name-defn>>
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Czech Republic", "Czechia")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "United States", "US")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Dominican Rep", "Dominican Republic")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Korea (South)", "Korea, South")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Antigua-Barbuda", "Antigua and Barbuda")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Bosnia-Herzegovina", "Bosnia and Herzegovina")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Brunei Darussalam", "Brunei")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Cape Verde", "Cabo Verde")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Central African Rep", "Central African Republic")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Cote D'Ivoire", "Cote d'Ivoire")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "St Kitts-Nevis", "Saint Kitts and Nevis")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "St Lucia", "Saint Lucia")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "St Vincent-Grenad", "Saint Vincent and the Grenadines")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Taiwan", "Taiwan*")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Trinidad-Tobago", "Trinidad and Tobago")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Viet Nam", "Vietnam")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Timor-leste", "Timor-Leste")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Sao Tome-Principe", "Sao Tome and Principe")
estimated_arrivals <- mutate_location_name(estimated_arrivals, "Macedonia", "North Macedonia")
estimated_arrivals <- filter(estimated_arrivals, location != "Aruba")
estimated_arrivals <- filter(estimated_arrivals, location != "Bermuda")
estimated_arrivals <- filter(estimated_arrivals, location != "Bonaire, Saint Eustatius and Saba")
estimated_arrivals <- filter(estimated_arrivals, location != "Cayman Islands")
estimated_arrivals <- filter(estimated_arrivals, location != "Cook Islands")
estimated_arrivals <- filter(estimated_arrivals, location != "Curacao")
estimated_arrivals <- filter(estimated_arrivals, location != "Falkland Islands")
estimated_arrivals <- filter(estimated_arrivals, location != "Faroe Islands")
estimated_arrivals <- filter(estimated_arrivals, location != "French Polynesia")
estimated_arrivals <- filter(estimated_arrivals, location != "Gibraltar")
estimated_arrivals <- filter(estimated_arrivals, location != "Greenland")
estimated_arrivals <- filter(estimated_arrivals, location != "Guadeloupe")
estimated_arrivals <- filter(estimated_arrivals, location != "Guam")
estimated_arrivals <- filter(estimated_arrivals, location != "Guernsey")
estimated_arrivals <- filter(estimated_arrivals, location != "Hong Kong (SAR)")
estimated_arrivals <- filter(estimated_arrivals, location != "Isle of Man")
estimated_arrivals <- filter(estimated_arrivals, location != "Jersey")
estimated_arrivals <- filter(estimated_arrivals, location != "Macao (SAR)")
estimated_arrivals <- filter(estimated_arrivals, location != "Martinique")
estimated_arrivals <- filter(estimated_arrivals, location != "Mayotte")
estimated_arrivals <- filter(estimated_arrivals, location != "Myanmar")
estimated_arrivals <- filter(estimated_arrivals, location != "New Caledonia")
estimated_arrivals <- filter(estimated_arrivals, location != "North Mariana Isl")
estimated_arrivals <- filter(estimated_arrivals, location != "Palau")
estimated_arrivals <- filter(estimated_arrivals, location != "Puerto Rico")
estimated_arrivals <- filter(estimated_arrivals, location != "Reunion")
estimated_arrivals <- filter(estimated_arrivals, location != "Samoa")
estimated_arrivals <- filter(estimated_arrivals, location != "Solomon Islands")
estimated_arrivals <- filter(estimated_arrivals, location != "St Barthelemy")
estimated_arrivals <- filter(estimated_arrivals, location != "St Helena")
estimated_arrivals <- filter(estimated_arrivals, location != "St Maarten (Dutch Part)")
estimated_arrivals <- filter(estimated_arrivals, location != "Svalbard")
estimated_arrivals <- filter(estimated_arrivals, location != "Swaziland")
estimated_arrivals <- filter(estimated_arrivals, location != "Tonga")
estimated_arrivals <- filter(estimated_arrivals, location != "Turkmenistan")
estimated_arrivals <- filter(estimated_arrivals, location != "Turks-Caicos")
estimated_arrivals <- filter(estimated_arrivals, location != "Vanuatu")
estimated_arrivals <- filter(estimated_arrivals, location != "Virgin Islands (GB)")
estimated_arrivals <- filter(estimated_arrivals, location != "Virgin Islands (US)")
estimated_arrivals <- filter(estimated_arrivals, location != "French Guiana")
# Every primary source location must appear in the arrivals data.
stopifnot(all(primary_source_locations %in% estimated_arrivals$location))
Then we can combine these to get the estimated number of arriving people capable of seeding a cluster. Since the time intervals covered by the different data sets do not match up entirely we take their intersection.
min_date_intersection <- max(min(estimated_arrivals$date), min(prop_potential_seeders$date))
max_date_intersection <- min(max(estimated_arrivals$date), max(prop_potential_seeders$date))
seeder_numbers <- left_join(prop_potential_seeders,
estimated_arrivals) %>%
mutate(num_seeders = seeder_proportion * num_arrivals) %>%
filter(date <= max_date_intersection, date >= min_date_intersection)
.mask <- is.na(seeder_numbers$num_seeders)
seeder_numbers[.mask, "num_arrivals"] <- 0
seeder_numbers[.mask, "num_seeders"] <- 0
With the relevant locations identified, the seeders can be split by origin: those from locations within this set are kept separate, and the contributions of the remaining locations are combined into an “other” class. The result is the epidemic introduction index.
eii_data <- seeder_numbers %>%
mutate(primary_location = ifelse(test = location %in% primary_source_locations,
yes = location, no = "other")) %>%
group_by(date,primary_location) %>%
summarise(num_intros = sum(num_seeders)) %>%
filter(!is.na(num_intros))
stopifnot(is.element(el = "Italy", set = unique(eii_data$primary_location)))
stopifnot(is.element(el = "other", set = unique(eii_data$primary_location)))
write.table(x = eii_data,
file = "results/estimated-introduction-index.csv",
sep = ",",
row.names = FALSE)
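As a small illustration of the aggregation step, the following sketch uses base R and made-up numbers to collapse non-primary locations into an "other" class and sum within each day. The locations, dates, seeder counts and the `toy_` names are hypothetical and are not taken from the data.

```r
# Hypothetical toy data: three origins on two days.
toy <- data.frame(
  date = rep(c("2020-03-01", "2020-03-02"), each = 3),
  location = rep(c("Italy", "Peru", "Chile"), times = 2),
  num_seeders = c(2.0, 1.0, 0.5, 3.0, 1.5, 0.25),
  stringsAsFactors = FALSE
)
toy_primary <- c("Italy")  # assumed primary source for this toy example

# Relabel anything outside the primary set as "other".
toy$primary_location <- ifelse(toy$location %in% toy_primary,
                               toy$location, "other")

# Sum the seeders for each date and primary location.
toy_eii <- aggregate(num_seeders ~ date + primary_location,
                     data = toy, FUN = sum)
print(toy_eii)
```

The total number of seeders is unchanged by the relabelling; only the attribution to locations becomes coarser.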
For this figure to be displayed, the CSV file with the EII estimates needs to be served at http://localhost:8000/results/estimated-introduction-index.csv. Running python3 -m http.server 8000 from the uk-lineages directory should achieve this.
We want to study the interval of time between the introduction of an infection into the UK and the time of the TMRCA of the resulting cluster. To allow us to estimate this, we make a simplifying assumption which seems reasonable in the given setting: we assume that the seeding of each cluster is independent of the seeding of other clusters. This allows us to treat the introduction times as a sample from the distribution of infection arrival times without replacement. Since there are far more introductions than clusters, this seems like a reasonable approximation to make.
If the introductions are independent, the day \(G\) on which they entered the UK is drawn from the distribution of daily introductions. Let \(f_G(g)\) be the probability of the infection being introduced on day \(g\). Then, if the TMRCA is on day \(k\) there must have been a lag, \(L\) of \(k-g\) days; let \(f_L(j)\) be the probability of a lag of \(j\) days. So the probability, \(v_k\), of a TMRCA on day \(k\) is
\[ \hat{v}_k := \sum_{g} f_G(g) f_L(k-g) \]
and then \(v = \hat{v} / |\hat{v}|\), which normalises \(\hat{v}\) so that it sums to one over the observed days. Since each TMRCA is independent, the total likelihood is just the product of the \(v_k\) for the days on which the TMRCAs occurred. We can write a factory for the likelihood function, model_1_llhd_factory, which takes a PMF of introductions on each day and the number of TMRCAs on each day, and returns the log-likelihood as a closure over the mean lag.
model_1_llhd_factory <- function(daily_introduction_prob, daily_tmrca_count) {
stopifnot(length(daily_introduction_prob) == length(daily_tmrca_count))
num_days <- length(daily_introduction_prob)
function(mean_lag) {
if (mean_lag > 0) {
lag_pmf <- dgeom(x = 0:(num_days - 1), prob = 1 / (1 + mean_lag))
tmrca_pmf <- numeric(num_days)
for (ix in 1:num_days) {
tmrca_pmf[ix] <- daily_introduction_prob[1:ix] %*% rev(lag_pmf[1:ix])
}
tmrca_pmf <- tmrca_pmf / sum(tmrca_pmf)
# independence assumption
as.numeric(daily_tmrca_count %*% log(tmrca_pmf))
} else {
-1e10
}
}
}
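To make the convolution inside the likelihood concrete, here is a minimal sketch on a three-day toy problem. The introduction probabilities and the mean lag are made-up values, and the names `v_hat` and `v_explicit` are introduced only for this illustration. It checks that the reversed-vector product used in the loop matches the explicit double sum over \(g\).

```r
# Hypothetical three-day introduction PMF and mean lag.
intro_prob <- c(0.2, 0.5, 0.3)
mean_lag <- 2
lag_pmf <- dgeom(x = 0:2, prob = 1 / (1 + mean_lag))

# Reversed-vector form, as in the loop of the likelihood function.
v_hat <- sapply(1:3, function(k) sum(intro_prob[1:k] * rev(lag_pmf[1:k])))

# Explicit double sum: introduction on day g followed by a lag of k - g days.
v_explicit <- numeric(3)
for (k in 1:3) {
  for (g in 1:k) {
    v_explicit[k] <- v_explicit[k] + intro_prob[g] * lag_pmf[k - g + 1]
  }
}

# Normalise so that the TMRCA PMF sums to one over the observed days.
v <- v_hat / sum(v_hat)
```

The normalisation matters because lags that would push the TMRCA beyond the last observed day are truncated, so the unnormalised values sum to less than one.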
It is well-known that additional sampling of a phylogeny can push the TMRCA backwards in time, closer to the origin of the phylogeny. Hence the TMRCA is correlated with the size of the reconstructed tree. Because there is no appreciable depletion of the susceptible proportion, the order of the clusters should not affect their size. Because we have been watching these clusters until they have gotten small, there should not be any truncation effects on their observed size. __For these reasons we make the simplifying assumption that the eventual size of the cluster is independent of when it was seeded.__
To test that the model we have defined behaves as expected, and that it is implemented correctly, we simulate a set of num_intros many TMRCA values conditional upon the EII.
eii_csv <- "results/estimated-introduction-index.csv"
x <- read.csv(eii_csv, stringsAsFactors = FALSE) %>%
mutate(date = as.Date(date)) %>%
group_by(date) %>%
summarise(total_intros = sum(num_intros))
num_intros <- 1000
intro_dates <- as.integer(x$date - min(x$date))
intro_dist <- x$total_intros / sum(x$total_intros)
rand_intro_dates <- sample(intro_dates,
size = num_intros,
replace = TRUE,
prob = intro_dist)
mean_delay <- 10 # days
rand_delays <- rgeom(n = num_intros, prob = 1 - mean_delay / (1 + mean_delay))
rand_tmrcas <- rand_delays + rand_intro_dates
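The simulation writes the geometric success probability as `1 - mean_delay / (1 + mean_delay)` while the likelihood uses `1 / (1 + mean_lag)`. The following sketch, with an arbitrary seed and sample size chosen only for illustration, checks that the two expressions agree and that the resulting draws have roughly the intended mean.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
mean_delay <- 10

prob_a <- 1 - mean_delay / (1 + mean_delay)  # form used in the simulation
prob_b <- 1 / (1 + mean_delay)               # form used in the likelihood

# rgeom counts failures before the first success, so its mean is (1 - p) / p.
draws <- rgeom(n = 1e5, prob = prob_b)
empirical_mean <- mean(draws)  # should be close to mean_delay
```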
Then we can take the likelihood function defined above and evaluate it for different values of the mean lag parameter to see how informative the data are about the mean lag. The resulting log-likelihood profile is shown in the following figure. Note that dplyr::full_join will introduce NAs, so we fill these in with zero, which is a sensible value in both cases: the probability of introduction is very small for those days, and for the TMRCAs the missing counts are literally zero.
tmrca_table <- table(rand_tmrcas)
tmrca_df <- data.frame(day_num = as.integer(names(tmrca_table)),
num_tmrca = as.integer(tmrca_table))
intro_probs_df <- data.frame(day_num = intro_dates,
intro_prob = intro_dist)
sim_data <- dplyr::full_join(intro_probs_df, tmrca_df)
na_mask <- is.na(sim_data$num_tmrca)
sim_data[na_mask,"num_tmrca"] <- 0
rm(na_mask)
na_mask <- is.na(sim_data$intro_prob)
sim_data[na_mask,"intro_prob"] <- 0
rm(na_mask)
model_1_llhd <- model_1_llhd_factory(sim_data$intro_prob, sim_data$num_tmrca)
The lag between an arrival of a case into the UK and the observed TMRCA should decrease with the size of the cluster. However, deriving the distribution of this lag is difficult, so we will settle for a phenomenological description where the average lag decreases to an asymptote in the size of the cluster. The functional form is \(\text{lag} = α + β/n \) where \(n\) is the size of the cluster.
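As a quick worked example of this functional form, the sketch below evaluates the mean lag for a few cluster sizes using \(α = 8\) and \(β = 20\), the values used in the simulation that follows. The helper name `lag_from_size` is introduced here only for illustration.

```r
# Mean lag as a function of cluster size: alpha + beta / n.
lag_from_size <- function(n, alpha, beta) {
  alpha + beta / n
}

sizes <- c(2, 5, 10, 100)
lags <- lag_from_size(sizes, alpha = 8, beta = 20)
print(lags)
# 18.0 12.0 10.0  8.2
```

The mean lag approaches \(α\) from above as the cluster grows, so \(α\) is the asymptotic lag and \(β\) controls how much longer the lag is for small clusters.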
eii_csv <- "results/estimated-introduction-index.csv"
x <- read.csv(eii_csv, stringsAsFactors = FALSE) %>%
mutate(date = as.Date(date)) %>%
group_by(date) %>%
summarise(total_intros = sum(num_intros))
num_intros <- 1000
intro_dates <- as.integer(x$date - min(x$date))
intro_dist <- x$total_intros / sum(x$total_intros)
rand_intro_dates <- sample(intro_dates,
size = num_intros,
replace = TRUE,
prob = intro_dist)
The start of the simulation is the same as in the previous example, but now we need to simulate the size of the cluster prior to simulating the lag.
mean_size <- 8
rand_sizes <- rgeom(n = num_intros,
prob = 1 - mean_size / (1 + mean_size)) + 2
mean_delay <- 8 + 20 / rand_sizes
rand_delays <- rgeom(n = num_intros,
prob = 1 - mean_delay / (1 + mean_delay))
rand_tmrcas <- rand_delays + rand_intro_dates
tmrca_df <- data.frame(date = rand_tmrcas,
delay = rand_delays,
size = rand_sizes)
intro_prob_df <- data.frame(date = intro_dates,
prob = intro_dist)
sim_data <- dplyr::full_join(tmrca_df,
intro_prob_df,
by = "date")
The following figure shows both what the raw data looks like as it would be observed, and the correlation between the hidden delay and the observed cluster size.
The definition of the simulated TMRCA values above can be abstracted into the following function, which will be helpful later for model criticism. Note that one of its parameters, r_size, is a function which returns random sizes; this will be important later when we want to generate realistic cluster sizes. There is also the mean_from_size parameter, a function which encodes the specified lag model.
r_tmrcas_function
#' Return a vector of TMRCA values
#'
#' @param n integer number of TMRCA values
#' @param intro_dates integer vector of dates
#' @param intro_dist numeric vector of date weights
#' @param r_size function which takes an integer and returns that many sizes
#' @param mean_from_size function from size to the mean lag
#'
r_tmrcas <- function(n, intro_dates, intro_dist, r_size, mean_from_size) {
r_intro_dates <- sample(intro_dates,
size = n,
replace = TRUE,
prob = intro_dist)
r_sizes <- r_size(n)
mean_delay <- mean_from_size(r_sizes)
r_delay <- rgeom(n = n,
prob = 1 / (1 + mean_delay))
return(r_intro_dates + r_delay)
}
The likelihood for this extended model is very similar, although we need to disaggregate the data because they now have an additional attribute of cluster size. This could be made a bit more efficient by pre-calculating only the required values of the log-PMF, but it is fast enough for our current purposes and is easier to read. We use the log_sum_exp function to avoid underflow. The functions here are pure, which is important because we will want to reuse them later.
llhd_factory
log_sum_exp <- function(x) {
m <- max(x)
log(sum(exp(x - m))) + m
}
model_2_llhd_factory <- function(daily_intro_log_probs, tmrca_date, cluster_size) {
stopifnot(length(tmrca_date) == length(cluster_size))
stopifnot(max(tmrca_date) <= length(daily_intro_log_probs))
num_intros <- length(tmrca_date)
max_possible_lag <- max(tmrca_date)
function(params) {
a <- params[1]
b <- params[2]
if (min(params) >= 0) {
llhd <- 0
mean_lags <- a + b / cluster_size
for (ix in 1:num_intros) {
tmrca <- tmrca_date[ix]
mean_lag <- mean_lags[ix]
lag_lpmf <- dgeom(x = 0:max_possible_lag,
prob = 1 / (1 + mean_lag),
log = TRUE)
llhd <- llhd +
log_sum_exp(daily_intro_log_probs[1:tmrca] +
rev(lag_lpmf[1:tmrca]))
}
} else {
llhd <- - .Machine$double.xmax
}
return(llhd)
}
}
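To see why log_sum_exp is needed, the following sketch compares it with the naive calculation on two very negative log-probabilities. The input values are arbitrary and chosen only to trigger underflow.

```r
log_sum_exp <- function(x) {
  m <- max(x)
  log(sum(exp(x - m))) + m
}

log_terms <- c(-1000, -1001)

# exp(-1000) underflows to zero in double precision, so the naive form is -Inf.
naive <- log(sum(exp(log_terms)))

# Subtracting the maximum first keeps the exponentials representable.
stable <- log_sum_exp(log_terms)  # equals -1000 + log(1 + exp(-1))
```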
Then we just need to reshape the data a little bit to be in a form suitable for use with the resulting llhd function. In doing so we fill in the introduction probability for very late arrivals with machine epsilon to avoid numerical issues resulting from taking the log of zero.
tmrca_dates <- sim_data %>%
filter(not(is.na(size))) %>%
use_series("date")
cluster_sizes <- sim_data %>%
filter(not(is.na(size))) %>%
use_series("size")
daily_intro_log_probs <- dplyr::left_join(data.frame(date = 0:229,
dummy = NA),
intro_prob_df,
by = "date") %>%
mutate(safe_prob = ifelse(test = is.na(prob),
yes = .Machine$double.eps,
no = prob)) %>%
use_series("safe_prob") %>%
log
model_2_llhd <- model_2_llhd_factory(daily_intro_log_probs,
tmrca_dates,
cluster_sizes)
We numerically estimated the maximum of this likelihood surface using nlm, starting the algorithm at several initial conditions randomly drawn from an exponential distribution to test the robustness of the estimate. Note that we use a log-transform on the parameters to ensure that the values given to the likelihood function are always positive.
numeric-mle
estimate_as_df <- function(est) {
est %>%
use_series("estimate") %>%
exp %>%
set_names(c("alpha", "beta")) %>%
as.list %>%
as.data.frame
}
numeric_mle <- 10 %>%
rerun(log(rexp(n = 2,
rate = 1 / c(10, 10)))) %>%
map(~ nlm(f = function(p) { - model_2_llhd(exp(p)) },
p = .x)) %>%
keep(~ .x$code == 1) %>%
map(estimate_as_df) %>%
bind_rows
The function estimate_as_df is a helper function to translate the results of the minimisation into a record of a data frame. Filtering by the code attribute ensures that we only record the values where the diagnostics of nlm indicate that a minimum was probably found.
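The log-transform trick can be seen in isolation on a toy problem. In this sketch the objective `neg_llhd` is a made-up quadratic with a known minimum at 3, standing in for a negative log-likelihood that is only defined for positive parameters; it is not the model likelihood.

```r
# Toy objective with a known minimum at theta = 3 (purely illustrative).
neg_llhd <- function(theta) {
  (theta - 3)^2
}

# Optimise over p = log(theta) so that theta = exp(p) is always positive.
obj <- function(p) {
  neg_llhd(exp(p))
}

fit <- nlm(f = obj, p = 0)
theta_hat <- exp(fit$estimate)  # back-transform to the original scale
```

Because the optimiser works on the unconstrained log scale, it can never propose a negative parameter, and the estimate is recovered by exponentiating.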
The following figure shows a heatmap of the likelihood surface under the simulated data. A red circle is used to indicate the true values of the parameters used to simulate the data.
For some bizarre reason, the heatmap breaks, possibly because there are too many different values being shown, but this figure is buggy in a few ways and could probably use an overhaul. The geometry is correct, but the implementation is a bit ugly.
The null hypothesis is that there is no effect of size on the lag, which corresponds to parameterisations where \(β = 0\). Since these are nested models we can use the likelihood ratio statistic to test this hypothesis. The null model has a single parameter \(α\) and the alternative model has two \((α,β)\), hence under the null the likelihood ratio statistic asymptotically has a chi-squared distribution with a single degree of freedom. We can numerically estimate the MLE and run the calculation for the p-value of the null.
llhd-ratio-statistic
null_obj <- function(p) { - model_2_llhd(c(exp(p), 0)) }
null_est <- nlm(f = null_obj, p = rnorm(1))
if (null_est$code == 1) {
null_llhd <- - null_est$minimum
} else {
stop("Optimisation non-unit code.")
}
altr_obj <- function(p) { - model_2_llhd(exp(p)) }
altr_est <- nlm(f = altr_obj, p = rnorm(2))
if (altr_est$code == 1) {
altr_llhd <- - altr_est$minimum
} else {
stop("Optimisation non-unit code.")
}
llhd_ratio_stat <- 2 * (altr_llhd - null_llhd)
p_val <- pchisq(q = llhd_ratio_stat,
df = 1, lower.tail = FALSE)
cat("The null LLHD is ", null_llhd, " at ", exp(null_est$estimate), "\n",
"The alternative LLHD is ", altr_llhd, " at ", exp(altr_est$estimate), "\n",
"The likelihood ratio statistic is ", llhd_ratio_stat, "with 1 degree of freedom\n",
"the p-value is ", p_val, "\n")
Which when run prints out the following…
The null LLHD is -4204.266 at 11.65166
The alternative LLHD is -4184.053 at 7.164843 24.13244
The likelihood ratio statistic is 40.42582 with 1 degree of freedom
the p-value is 2.042247e-10
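As a sanity check, the statistic and p-value can be recomputed by hand from the rounded log-likelihoods printed above. Because the inputs are rounded, the results agree with the printed values only to a few significant figures.

```r
# Rounded log-likelihoods copied from the printed output.
null_llhd <- -4204.266
altr_llhd <- -4184.053

# Likelihood ratio statistic: twice the gain in log-likelihood.
lr_stat <- 2 * (altr_llhd - null_llhd)

# Upper tail of a chi-squared distribution with one degree of freedom.
p_val <- pchisq(q = lr_stat, df = 1, lower.tail = FALSE)
```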
This section looks at estimating the lag model from the MCC tree data which has been thresholded at a 0.5 level.
The implementation of the likelihood above assumed that the dates were encoded as integers, so we define a helper function to convert the dates to integers describing the day of the year. We have to be careful to check that all the relevant dates fall in the year 2020 unless we want to deal with modular arithmetic. Note this function is vectorised so you can apply it to a vector of date objects.
date-as-day-of-year-function
date_as_day_of_year <- function(date_objs) {
as.integer(format.Date(date_objs, "%j"))
}
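A short usage sketch shows both the intended behaviour and the wrap-around that the caveat about the year 2020 refers to. The dates are arbitrary examples, and the function is restated here (using the generic format) so that the snippet runs on its own.

```r
date_as_day_of_year <- function(date_objs) {
  as.integer(format(date_objs, "%j"))
}

example_dates <- as.Date(c("2020-01-01", "2020-03-01",
                           "2020-12-31", "2021-01-01"))
day_nums <- date_as_day_of_year(example_dates)
print(day_nums)
# 1 61 366 1
```

The last value shows that a date in 2021 maps back to day 1, which is why all dates need to fall within a single year.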
We read in the EII data and extend it to capture the whole range of TMRCA dates, setting the number of introductions to 0 on days far enough in the future that interventions would have made the chance of additional introductions negligible.
eii-setup-1
<<date-as-day-of-year-function>>
eii_csv <- "results/estimated-introduction-index.csv"
eii_df <- read.csv(eii_csv,
stringsAsFactors = FALSE) %>%
mutate(date = as.Date(date)) %>%
group_by(date) %>%
summarise(total_intros = sum(num_intros))
tmrca_date_range <- as.Date(c("2020-01-01", "2020-06-30"))
eii_eps <- data.frame(date = seq(from = tmrca_date_range[1],
to = tmrca_date_range[2],
by = 1))
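# Sort by date so that row i corresponds to day-of-year i, as the likelihood
# indexes the daily introduction probabilities by position.
eii_df <- full_join(eii_df, eii_eps, by = "date") %>%
  arrange(date)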
eii_df[is.na(eii_df$total_intros),]$total_intros <- 0
We further process the EII data to construct the daily introduction probabilities. Then we construct the vectors of data that are expected by the LLHD function. The zeros we added before will now need to be handled slightly differently (i.e., we replace the log of zero, which is negative infinity, with the large negative finite value .Machine$double.min.exp) to ensure that we don’t get infinite values causing numerical issues later.
eii-setup-2
intro_dates <- eii_df$date
intro_date_nums <- date_as_day_of_year(eii_df$date)
intro_dist <- eii_df$total_intros / sum(eii_df$total_intros)
intro_prob_df <- data.frame(date = intro_dates,
date_num = intro_date_nums,
prob = intro_dist)
daily_intro_log_probs <- log(intro_prob_df$prob)
.inf_mask <- is.infinite(daily_intro_log_probs)
daily_intro_log_probs[.inf_mask] <- .Machine$double.min.exp
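The effect of this replacement can be seen on a toy vector. The probabilities here are made up for illustration.

```r
# Hypothetical daily introduction probabilities, one of which is zero.
probs <- c(0.5, 0, 0.5)

log_probs <- log(probs)  # the zero becomes -Inf
log_probs[is.infinite(log_probs)] <- .Machine$double.min.exp

print(log_probs)
```

The replaced entry is finite but far below the log of the smallest positive double, so the corresponding day still contributes essentially nothing to the likelihood.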
The data of the cluster TMRCA values is stored in the file data/phylogenetics_data/2020-09-14/clusters_DTA_MCC_0.5.csv. This file contains only a point estimate of cluster sizes. There are a couple of assertions here to ensure that the EII does encompass all of the TMRCA dates.
cluster_df <- read.csv("../../data/epidemiological/clusters_DTA_MCC_0.5.csv",
stringsAsFactors = FALSE) %>%
mutate(tmrca_date = as.Date(tmrca_calendar),
date_num = date_as_day_of_year(tmrca_date)) %>%
select(cluster, seqs, tmrca_date, date_num)
stopifnot(min(eii_df$date) >= as.Date("2020-01-01"))
stopifnot(min(eii_df$date) <= min(cluster_df$tmrca_date))
stopifnot(max(eii_df$date) >= max(cluster_df$tmrca_date))
The vectors tmrca_dates and cluster_sizes are the actual data we need regarding the clusters to specify the likelihood function.
tmrca_dates <- cluster_df$date_num
cluster_sizes <- cluster_df$seqs
We then define the same likelihood function as in the previous simulation study. This is done by the NOWEB syntax <<llhd_factory>>, which pulls in the model_2_llhd_factory function from above.
<<llhd_factory>>
model_2_llhd <- model_2_llhd_factory(daily_intro_log_probs,
tmrca_dates,
cluster_sizes)
Then we evaluated the log-likelihood function on a mesh of values of \(α\) and \(β\) to generate a heatmap of the likelihood surface. As expected, there is a tradeoff between \(α\) and \(β\) which we had seen in the previous heatmap.
We can then use the same snippets from above to estimate the MLE and to perform the LRT for \(β = 0\).
<<numeric-mle>>
write.table(x = numeric_mle,
file = "results/lag-estimate-mle-estimates.csv",
quote = FALSE,
sep = ",",
row.names = FALSE)
sink(file = "results/llhd-ratio-report-lag-estimate.txt")
<<llhd-ratio-statistic>>
sink()
The sink command pipes the output of the likelihood ratio test to a file that can then be included in the output. Based on this we see that there is a substantial difference in the likelihood between the two models, but with the parameters so close to the edge of the parameter space we might need to be a little bit careful not to overinterpret the p-value of the results.
Another method would be to use the AIC as a model selection method. Since the log-likelihood differs by about \(67\) and there is only a single parameter difference, this still gives very good support for the model with \(β ≠ 0\).
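To spell out the arithmetic, using the standard definition of the AIC with \(k\) parameters and maximised log-likelihood \(\ell\), and taking the difference in log-likelihood to be the approximate value of \(67\) quoted above:

\[ \mathrm{AIC} = 2k - 2\ell, \qquad \mathrm{AIC}_{β ≠ 0} - \mathrm{AIC}_{β = 0} \approx 2(2 - 1) - 2(67) = -132. \]

The penalty for the extra parameter is only \(2\), which is small compared with the improvement in fit, so the model with the lower AIC is the one with \(β ≠ 0\).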
To check that the optimisation has landed at a sensible place, we can compute the log-likelihood profiles about the optima.
mle_estimate <- exp(altr_est$estimate)
mesh_length <- 50
alpha_mesh <- seq(from = 0, to = mle_estimate[1] * 2, length = mesh_length)
alpha_profile <- alpha_mesh %>% map(~ model_2_llhd(c(.x, mle_estimate[2]))) %>% as_vector
beta_mesh <- seq(from = 0, to = mle_estimate[2] * 2, length = mesh_length)
beta_profile <- beta_mesh %>% map(~ model_2_llhd(c(mle_estimate[1], .x))) %>% as_vector
result <- data.frame(param = rep(c("alpha", "beta"), each = mesh_length),
value = c(alpha_mesh, beta_mesh),
llhd = c(alpha_profile, beta_profile))
write.table(result,
file = "results/lag-estimate-llhd-profiles.csv",
quote = FALSE,
sep = ",",
row.names = FALSE)
To check what the model fit actually looks like, we have the following code which estimates the 95% CI for when the introduction may have occurred.
model_criticism_df <- data.frame(cluster_size = cluster_sizes,
tmrca_date = tmrca_dates,
mean_lag_estimate = median(numeric_mle$alpha) + median(numeric_mle$beta) / cluster_sizes)
sorted_ixs <- order(model_criticism_df$tmrca_date, model_criticism_df$cluster_size)
model_criticism_df <- model_criticism_df[sorted_ixs,]
model_criticism_df$ix = 1:nrow(model_criticism_df)
model_criticism_df$lag_lower_bound <- qgeom(p = 0.025, prob = 1 / (1 + model_criticism_df$mean_lag_estimate))
model_criticism_df$lag_upper_bound <- qgeom(p = 0.975, prob = 1 / (1 + model_criticism_df$mean_lag_estimate))
write.table(x = model_criticism_df,
file = "results/lag-estimate-model-fig.csv",
sep = ",",
row.names = FALSE)
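To illustrate how the geometric quantiles give the interval, here is a minimal sketch for a single hypothetical cluster with a mean lag of 10 days. The value is chosen only for illustration and is not an estimate from the data.

```r
mean_lag <- 10            # hypothetical mean lag in days
p <- 1 / (1 + mean_lag)   # geometric success probability with this mean

# Central 95% interval for the lag, in whole days.
lower <- qgeom(p = 0.025, prob = p)
upper <- qgeom(p = 0.975, prob = p)

# Probability mass actually covered by the interval [lower, upper].
coverage <- pgeom(upper, prob = p) - (pgeom(lower, prob = p) - dgeom(lower, prob = p))
```

Since the TMRCA is the introduction day plus the lag, subtracting these bounds from a cluster's TMRCA day gives the corresponding interval for its introduction day. Because the distribution is discrete, the coverage is at least, rather than exactly, 95%.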
The second figure shows the TMRCA of each cluster ordered by date with size and colour mapping to the size of the cluster, the line segments next to each point indicate the 95% CI of the estimate for when that cluster was introduced into the UK.
A more nuanced way to test the legitimacy of the model is to look at simulated TMRCA values under the model fit to see if they are obviously different from the actual data set. Using the r_tmrcas function from before we can generate 19 replicates of the data under the fitted model. Note that to generate the cluster sizes we are just resampling from the empirical distribution of the observed cluster sizes.
<<r_tmrcas_function>>
r_size <- function(n) {
sample(x = cluster_sizes, size = n, replace = TRUE)
}
mean_from_size_factory <- function(alpha, beta) {
function(c_size) {
alpha + beta / c_size
}
}
mean_from_size <- mean_from_size_factory(median(numeric_mle$alpha),
median(numeric_mle$beta))
replicated_tmrca_sampes <- rerun(.n = 19,
r_tmrcas(length(cluster_sizes),
intro_date_nums,
intro_dist,
r_size,
mean_from_size))
The next step is to generate the figure of the TMRCAs to see how they compare to the actual values; any discrepancies are an indication of a deficiency of the model. There is a slight difference, but it is not large and it could be an artifact of the binning used.
foo <- c(list(tmrca_dates), replicated_tmrca_sampes)
bar <- as.list(c("truth", rep("simulation", 19)))
baz <- as.list(1:20)
# purrr has no map3; pmap iterates over the three lists in parallel.
plot_df <- pmap(list(foo, bar, baz),
                function(x, y, z) data.frame(tmrca = x,
                                             model = y,
                                             id = z)) %>%
  bind_rows
write.table(x = plot_df,
file = "results/lag-estimation-tmrca-replicates.csv",
sep = ",",
row.names = FALSE)
Now that we have an estimate of the parameters we can perform a simulation re-estimation study to obtain a CI and get a better understanding of the lag model. The script that will contain this analysis is simulation-study-3.R.
Most of the necessary code can be re-used from previous parts of this document.
library(dplyr)
library(purrr)
library(magrittr)
library(ggplot2)
<<llhd_factory>>
<<r_tmrcas_function>>
<<eii-setup-1>>
<<eii-setup-2>>
However, we do need a way to select random cluster sizes. Rather than approximating the distribution, we will just re-sample the existing cluster sizes. The r_cluster_sizes function takes no arguments and returns a vector of cluster sizes.
r_cluster_sizes_factory <- function(cluster_sizes) {
num_clusters <- length(cluster_sizes)
function() {
sample(cluster_sizes,
size = num_clusters,
replace = TRUE)
}
}
data_file <- "../../data/epidemiological/clusters_DTA_MCC_0.5.csv"
cluster_df <- read.csv(data_file,
stringsAsFactors = FALSE) %>%
mutate(tmrca_date = as.Date(tmrca_calendar),
date_num = date_as_day_of_year(tmrca_date))
r_cluster_sizes <- cluster_df %>%
use_series("seqs") %>%
r_cluster_sizes_factory
The mean_from_size_factory is there as a convenient way for us to generate the expected lag for each of the simulated data sets. We can generate the actual mean_from_size function by plugging in our previous estimate.
mean_from_size_factory <- function(alpha, beta) {
function(c_size) {
alpha + beta / c_size
}
}
mean_from_size <- mean_from_size_factory(0.7189865, 28.91369)
This provides us with all the components to generate multiple random data sets and construct their likelihood functions, so we just need a function to numerically approximate the MLE and we will have everything required for the simulation re-estimation study. The random_reestimate function generates a random data set and then attempts to compute the MLE.
estimate_as_df <- function(est) {
data.frame(alpha = exp(est$estimate[1]),
beta = exp(est$estimate[2]),
exit_code = est$code)
}
random_reestimate <- function() {
foo_n <- nrow(cluster_df)
valid_sim <- FALSE
loop_count <- 0
while (not(valid_sim) && loop_count < 20) {
foo_sizes <- r_cluster_sizes()
foo_tmrca_dates <- r_tmrcas(foo_n,
intro_dates,
intro_dist,
function(dummy) {foo_sizes},
mean_from_size) %>%
date_as_day_of_year
if (max(foo_tmrca_dates) <= length(daily_intro_log_probs)) {
valid_sim <- TRUE
} else {
loop_count <- loop_count + 1
warning("failed simulation...")
}
}
if (valid_sim) {
foo_llhd <- model_2_llhd_factory(daily_intro_log_probs,
foo_tmrca_dates,
foo_sizes)
foo_obj <- function(p) { - foo_llhd(exp(p)) }
foo_est <- nlm(f = foo_obj, p = rnorm(2))
estimate_as_df(foo_est)
} else {
data.frame(alpha = NA, beta = NA, exit_code = 5)
}
}
Then we just need to re-evaluate this expression and write the results to file. The exit code of the optimisation can be mapped to the colour of points to make sure no bad results slipped in.
num_replicates <- 100
plot_df <- rerun(.n = num_replicates, random_reestimate()) %>%
bind_rows
write.table(x = plot_df,
file = "results/sim-study-3-reestimates.csv",
sep = ",",
row.names = FALSE)
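One way the re-estimates could be turned into a CI is with empirical quantiles of the successful fits. The sketch below is an assumption about that post-processing step rather than the analysis actually used, and `toy_reestimates` is randomly generated placeholder data with the same columns as the table written above, not real results.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the table of re-estimates; not real results.
toy_reestimates <- data.frame(
  alpha = rlnorm(100, meanlog = 0, sdlog = 0.5),
  beta = rlnorm(100, meanlog = 3, sdlog = 0.2),
  exit_code = sample(1:3, size = 100, replace = TRUE,
                     prob = c(0.8, 0.15, 0.05))
)

# Keep only fits where nlm reported a plausible solution (code 1 or 2).
ok <- toy_reestimates[toy_reestimates$exit_code <= 2, ]

# Percentile intervals for each parameter.
alpha_ci <- quantile(ok$alpha, probs = c(0.025, 0.975))
beta_ci <- quantile(ok$beta, probs = c(0.025, 0.975))
```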
There is uncertainty in the estimated TMRCA dates and cluster sizes. The file data/phylogenetics_data/2020-09-14/clusters_DTA_MCC.csv contains estimates obtained by altering a threshold parameter in the MCC tree. To understand how the uncertainty propagates we want to estimate \(α\) and \(β\) for each of these estimates. The script that will contain this analysis is lag-estimation-2.R; most of the required code can be reused from the previous analysis.
eii-and-llhd-setup
<<eii-setup-1>>
<<eii-setup-2>>
<<llhd_factory>>
The schema of the data is the same as in clusters_DTA_MCC_0.5.csv, but now we also select the cutoff variable to indicate which threshold was used for those cluster sizes and TMRCAs.
read-in-all-mcc-cluster-data
all_clusters_df <- read.csv("../../data/epidemiological/clusters_DTA_MCC.csv",
stringsAsFactors = FALSE) %>%
mutate(tmrca_date = as.Date(tmrca_calendar),
date_num = date_as_day_of_year(tmrca_date),
cutoff = as.factor(cutoff)) %>%
select(cluster, seqs, tmrca_date, date_num, cutoff)
To get the LLHD function for each cutoff value, we need to extract the relevant cluster size data and give it to model_2_llhd_factory along with the daily introduction probabilities. The following function does this and wraps up the result with the cutoff value of the data used. The assertions have been duplicated from previous source blocks.
llhd-factory-applier-defn
llhd_factory_applier <- function(cutoff_factor, all_clusters_df,
daily_intro_log_probs, eii_df) {
cluster_df <- filter(.data = all_clusters_df,
cutoff == cutoff_factor)
stopifnot(nrow(cluster_df) > 0)
stopifnot(min(eii_df$date) >= as.Date("2020-01-01"))
stopifnot(min(eii_df$date) <= min(cluster_df$tmrca_date))
stopifnot(max(eii_df$date) >= max(cluster_df$tmrca_date))
list(llhd_func = model_2_llhd_factory(daily_intro_log_probs,
cluster_df$date_num,
cluster_df$seqs),
cutoff = cutoff_factor)
}
To actually estimate the MLE for these LLHD functions we re-use some of the code from <<numeric-mle>>, throwing a warning if the numeric optimisation may have failed (which we take to be the case if the exit code exceeds 2). We record the results along with the exit code for post-processing. Note that this uses a transform on the parameter space to make things easier for the optimisation algorithm and it starts with a random initial condition.
estimate-mle-defn
estimated_mle <- function(llhd_obj) {
stopifnot(setequal(names(llhd_obj),
c("llhd_func", "cutoff")))
p0 <- rnorm(n = 2)
.f <- function(p) {
- llhd_obj$llhd_func(exp(p))
}
est <- nlm(f = .f, p = p0)
result <- est$estimate %>%
exp %>%
set_names(c("alpha", "beta")) %>%
as.list
result$cutoff <- llhd_obj$cutoff
result$llhd <- - est$minimum
if (est$code > 2) {
warning("Inference failed for cutoff: ",
result$cutoff)
}
result$exit_code <- est$code
return(result)
}
Finally, we map these functions over the cluster data for all of the cutoff thresholds and save the resulting MLE estimates in cutoff-varying-lag-estimates.csv.
<<read-in-all-mcc-cluster-data>>
pipeline <- function(cutoff_factor) {
llhd_factory_applier(cutoff_factor, all_clusters_df,
daily_intro_log_probs, eii_df) %>%
estimated_mle %>%
as.data.frame
}
mle_df <- map(unique(all_clusters_df$cutoff), pipeline) %>%
bind_rows %>%
mutate(cutoff = as.numeric(as.character(cutoff)))
write.table(x = mle_df,
file = "results/cutoff-varying-lag-estimates.csv",
sep = ",",
row.names = FALSE)
Another way to consider the uncertainty in the TMRCA estimates is to consider multiple tree samples from the posterior. The file data/phylogenetics_data/2020-09-14/clusters_DTA.csv describes the TMRCA data under a set of posterior trees. The analysis is the same as above, but now instead of using cutoff as the variable we use tree, which describes which posterior tree the data comes from. The script that will contain this analysis is lag-estimation-3.R.
<<eii-and-llhd-setup>>
<<llhd-factory-applier-defn>>
<<estimate-mle-defn>>
To use the functions from before we can just rename tree to cutoff and change it back at the end before saving the results. This allows us to reuse the code from above without needing to change the internals. The results are written to the file tree-varying-lag-estimates.csv.
all_clusters_df <- read.csv("../../data/epidemiological/clusters_DTA.csv",
stringsAsFactors = FALSE) %>%
mutate(tmrca_date = as.Date(tmrca_calendar),
date_num = date_as_day_of_year(tmrca_date),
cutoff = as.factor(tree)) %>%
select(cluster, seqs, tmrca_date, date_num, cutoff)
pipeline <- function(cutoff_factor) {
llhd_factory_applier(cutoff_factor, all_clusters_df,
daily_intro_log_probs, eii_df) %>%
estimated_mle %>%
as.data.frame
}
<<run-parallel-pipeline>>
write.table(x = mle_df,
file = "results/tree-varying-lag-estimates.csv",
sep = ",",
row.names = FALSE)
Since there is a substantially larger number of trees than there were cutoffs, we will run the pipeline in parallel.
run-parallel-pipeline
library(future)
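# multiprocess is deprecated in recent versions of future; multisession is the
# portable replacement.
plan(multisession)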
mle_df <- furrr::future_map(unique(all_clusters_df$cutoff), pipeline) %>%
bind_rows %>%
rename(tree = cutoff) %>%
mutate(tree = as.numeric(as.character(tree)))
This choropleth shows the World Bank data that informs the majority of the poverty estimates
We are accounting for poverty as a proxy for access to healthcare and testing to correct for the bias that countries with greater access to healthcare are likely to ascertain more of the COVID-19 deaths than countries with lower access. To do this, we replicate most of the calculations previously used to estimate the proportion of seeders, but now we adjust the population size to just consist of those not living in poverty.
<<age-of-infection-matrix-defn>>
<<padded-potential-seeders-defn>>
<<seeder-proportion-defn>>
library(magrittr)
<<jhu-un-location-unification>>
<<model-parameters>>
Once we get to the population size component, we need to adjust it by the proportion living in extreme poverty.
get_primary_sources <- function() {
  <<primary-source-locations-defn>>
  return(primary_source_locations)
}
primary_source_locations <- get_primary_sources()

## Map from location name to the latest poverty percentage.
poverty_df <- read.csv("results/clean-worldbankpoverty.csv")
poverty_map <- as.list(poverty_df$latest_poverty_percentage)
names(poverty_map) <- poverty_df$location

location_seeder_props_adjusted <- function(location_str) {
  deaths_df <- filter(jhu_deaths, location == location_str) %>%
    select(date, deaths)
  if (is.element(location_str, un_populations$location)) {
    ## Use the location's own poverty percentage if there is one, otherwise
    ## fall back to the "other" entry.
    non_poverty_prop <- if (is.element(location_str, names(poverty_map))) {
      1 - 0.01 * poverty_map[[location_str]]
    } else {
      1 - 0.01 * poverty_map[["other"]]
    }
    ## The UN population sizes are in thousands, hence the factor of 1e3.
    pop_size <- 1e3 *
      un_populations[un_populations$location == location_str, "population_size"] *
      non_poverty_prop
  } else {
    stop("Cannot find a population size for ", location_str)
  }
  if (length(pop_size) == 0) {
    stop("Bad population size encountered for ", location_str)
  }
  seeder_props <- seeder_proportion(deaths_df,
                                    pop_size,
                                    days_latent,
                                    days_incubating,
                                    days_infectious,
                                    prop_asymptomatic,
                                    days_infection_to_death,
                                    infection_fatality_ratio)
  seeder_props$location <- location_str
  return(seeder_props)
}

result <- map(.x = unique(jhu_deaths$location),
              .f = location_seeder_props_adjusted) %>%
  bind_rows %>%
  filter(date < "2020-07-01")
stopifnot(!any(is.na(result$seeder_proportion)))
write.table(x = result,
            file = "results/estimated-proportion-seeders-poverty-adjusted.csv",
            sep = ",",
            row.names = FALSE)
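In symbols, the population adjustment made in location_seeder_props_adjusted can be summarised as follows. The notation is introduced here purely for illustration and is not used elsewhere in this document: \(N_{\text{UN}}\) is the UN population figure (in thousands), \(p\) is the percentage of the population living in poverty (the location's own value where available, otherwise the "other" value), and \(N_{\text{adj}}\) is the adjusted population size passed to seeder_proportion.

\[
N_{\text{adj}} = 10^{3} \, N_{\text{UN}} \left(1 - \frac{p}{100}\right)
\]

For example, a hypothetical location with \(N_{\text{UN}} = 50\,000\) and \(p = 10\) would get \(N_{\text{adj}} = 10^{3} \times 50\,000 \times 0.9 = 4.5 \times 10^{7}\).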
Then we need to re-calculate the EII with the adjusted seeding proportions.
To estimate the EII under the adjusted population sizes we just copy the file
estimate-introduction-index.R, replacing the input file
estimated-proportion-seeders.csv with
estimated-proportion-seeders-poverty-adjusted.csv and the output file
estimated-introduction-index.csv with
estimated-introduction-index-poverty-adjusted.csv. The new script for this is
called estimate-introduction-index-poverty-adjusted.R.
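If one wanted to derive the adjusted script mechanically instead of maintaining a second copy by hand, the copy-and-rename step could be sketched roughly as below. The function name make_poverty_adjusted_script is hypothetical and not part of this repository, and the sketch assumes the two file names appear literally in the original script.

```shell
# Hypothetical helper (not part of the repository): write a copy of the
# script at $1 to $2 with the input and output file names swapped for
# their poverty-adjusted counterparts.
make_poverty_adjusted_script() {
    src="$1"   # e.g. estimate-introduction-index.R
    dst="$2"   # e.g. estimate-introduction-index-poverty-adjusted.R
    sed -e 's/estimated-proportion-seeders\.csv/estimated-proportion-seeders-poverty-adjusted.csv/g' \
        -e 's/estimated-introduction-index\.csv/estimated-introduction-index-poverty-adjusted.csv/g' \
        "$src" > "$dst"
}
```

It would be called as make_poverty_adjusted_script estimate-introduction-index.R estimate-introduction-index-poverty-adjusted.R. Because each pattern ends in the literal .csv suffix, file names that are already adjusted are left untouched.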
The changes to the EII are shown in the following figure.
Estimates of the asymptomatic ratio in the literature vary widely. Since there is substantial uncertainty surrounding this parameter, we performed a sensitivity analysis to determine how it influences our results. Here are some of the estimates; clicking on a point estimate will open the corresponding article page.
We opted for \(31\%\) in the main results, but as a sensitivity analysis we
re-ran the calculations with both \(18\%\) and \(78\%\) to see how it would
influence the results. The simplest way to do this was just to change the
parameter value and then re-run all of the code, copying the file
estimated-introduction-index.csv each time as
estimated-introduction-index-<A>.csv for <A> as 18, 31 and 78.
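The copying step after each re-run could be sketched as the small helper below. The name save_eii_copy is hypothetical, and the directory holding the CSV is passed as an argument because its location is an assumption here.

```shell
# Hypothetical helper (not part of the repository): after re-running the
# analysis with asymptomatic ratio A (in percent), keep a copy of the EII
# estimates under a name that records A.
save_eii_copy() {
    dir="$1"   # directory containing estimated-introduction-index.csv (assumed)
    a="$2"     # asymptomatic ratio in percent: 18, 31 or 78
    cp "$dir/estimated-introduction-index.csv" \
       "$dir/estimated-introduction-index-$a.csv"
}
```

For instance, after setting the parameter to 18% and re-running all of the code, save_eii_copy results 18 would store that run's estimates, assuming the output lives in results.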
The following figure shows the changes in the total EII as the asymptomatic ratio is varied from \(18\%\) to \(78\%\) with the solid line indicating the values at \(31\%\).
There are a few steps involved in using this document and running the analysis which we document here for clarity.
The source code for the analysis is entirely contained within this document
(i.e., it is a literate program). To extract that code, run the following
command (assuming you have emacs installed).
emacs --batch -l org --eval '(org-babel-tangle-file "README.org")'
To export the HTML version of this document (as epi-mobility-lag.org) run the
following command. Note that the warnings generated are a known issue but do not
affect the result.
emacs --batch -l org epi-mobility-lag.org --eval '(org-html-export-to-html)'
Some of the data is already included in the raw-data directory, but a fresh
copy of the rest can be downloaded into raw-data.
wget "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
mv time_series_covid19_deaths_global.csv raw-data/jhu-deaths.csv
wget "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv"
mv WPP2019_TotalPopulationBySex.csv raw-data/un-population.csv
To get the data into a format that is easy to work with, run the cleaning scripts. Note that these need to be run in order, and they assume that certain files are available from within this repository.
# run-cleaning-scripts.sh
Rscript clean-jhu-deaths.R
Rscript clean-un-population.R
Rscript clean-iata.R
Rscript clean-home-office.R
Rscript clean-non-air-travel.R
Rscript clean-elvidge-poverty.R
Rscript clean-owid-poverty.R
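Since the order matters, one option is to stop at the first failing script so that later scripts never run against missing or partial inputs. The wrapper below is a minimal sketch of that idea. The function run_in_order and the RUNNER variable are hypothetical names introduced for illustration, not part of this repository.

```shell
# Hypothetical wrapper (not part of the repository): run each script in the
# order given and stop at the first failure.
# RUNNER is an assumption added for illustration; it defaults to Rscript.
run_in_order() {
    runner="${RUNNER:-Rscript}"
    for script in "$@"; do
        echo "Running $script"
        if ! "$runner" "$script"; then
            echo "Failed: $script" >&2
            return 1
        fi
    done
}
```

Calling run_in_order clean-jhu-deaths.R clean-un-population.R would run those two scripts in that order and return a non-zero status as soon as one of them fails.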
To estimate the epidemic introduction index the following scripts should be run.
# run-estimation-scripts.sh
Rscript estimate-arrivals.R
Rscript estimate-potential-seeders.R
Rscript estimate-introduction-index.R
Rscript estimate-potential-seeders-poverty-adjusted.R
Rscript estimate-introduction-index-poverty-adjusted.R
Rscript poverty-adjustment-size.R
Rscript asymptomatic-adjustment-size.R
Note that running both the poverty adjustment sensitivity analysis and the asymptomatic ratio sensitivity analysis requires some additional steps, but these are documented in the relevant sections.
Running the simulation studies produces output which is then displayed in the HTML export of this document.
Rscript simulation-study-1.R
Rscript simulation-study-2.R
Rscript simulation-study-3.R
To run the actual inference on the real data, run the following freshly
tangled R scripts. The main results are computed by lag-estimation.R, but
there are sensitivity analyses in lag-estimation-2.R and lag-estimation-3.R
which are also of interest.
Rscript lag-estimation.R
Rscript lag-estimation-2.R
Rscript lag-estimation-3.R
To serve the HTML exported version of this file run the following command after weaving it.
python3 -m http.server 8000