Merge pull request #42 from bolliger32/handle-jhu-errors
Final pass before tagging initial version
bolliger32 authored Mar 27, 2020
2 parents 7b5689f + 778772b commit 4f126f1
Showing 61 changed files with 188,342 additions and 187,619 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -34,9 +34,6 @@ ssc install coefplot, replace
ssc install filelist, replace
```

## Data Documentation
A detailed description of the epidemiological and policy data obtained and processed for this analysis can be found [here](https://www.dropbox.com/scl/fi/8djnxhj0wqqbyzg2qhiie/SI.gdoc?dl=0&rlkey=jnjy82ov2km7vc0q1k6190esp). This is a live document that may be updated as additional data becomes available. For a version that is fixed at the time this manuscript was submitted, please see the link to our paper at the top of this README.

## Code Structure
```text
codes
@@ -110,6 +107,9 @@ codes
└── utils.py
```

## Data Documentation
A detailed description of the epidemiological and policy data obtained and processed for this analysis can be found [here](https://www.dropbox.com/scl/fi/8djnxhj0wqqbyzg2qhiie/SI.gdoc?dl=0&rlkey=jnjy82ov2km7vc0q1k6190esp). This is a live document that may be updated as additional data becomes available. For a version that is fixed at the time this manuscript was submitted, please see the link to our paper at the top of this README.

## Replication Steps

There are four stages to our analysis:
@@ -121,7 +121,7 @@ There are four stages to our analysis:
### Data collection and processing
The steps to obtain all data in <data/raw>, and then process this data into datasets that can be ingested into a regression, are described below. Note that some of the data collection was performed through manual downloading and/or processing of datasets and is described in as much detail as possible. The sections should be run in the order listed, as some files from later sections will depend on those from earlier sections (e.g. the geographical and population data).

For detailed information on the manual collection of policy, epidemiological, and population information, see the [up-to-date](https://www.dropbox.com/scl/fi/8djnxhj0wqqbyzg2qhiie/SI.gdoc?dl=0&rlkey=jnjy82ov2km7vc0q1k6190esp) version of our paper’s Appendix. A version that was frozen at the time of submission is available with the article cited at the top of this README. Our epidemiological and policy data sources for all countries are listed [here](references/data_sources_static_20200321.xlsx), with a more frequently updated version [here](https://www.dropbox.com/scl/fi/v3o62qfrpam45ylaofekn/data_sources.gsheet?dl=0&rlkey=p3miruxmvq4cxqz7r3q7dc62t).
For detailed information on the manual collection of policy, epidemiological, and population information, see the [up-to-date](https://www.dropbox.com/scl/fi/8djnxhj0wqqbyzg2qhiie/SI.gdoc?dl=0&rlkey=jnjy82ov2km7vc0q1k6190esp) version of our paper’s Appendix. A version that was frozen at the time of submission is available with the article cited at the top of this README. Our epidemiological and policy data sources for all countries are listed [here](references/data_sources.xlsx), with a more frequently updated version [here](https://www.dropbox.com/scl/fi/v3o62qfrpam45ylaofekn/data_sources.gsheet?dl=0&rlkey=p3miruxmvq4cxqz7r3q7dc62t).

#### Geographical and population data
1. `python codes/data/multi_country/get_adm_info.py`: Generates shapefiles and csvs with administrative unit names, geographies, and populations (most countries). **Note:** To run this script, you will need a U.S. Census API key. See [Setup](#setup).
@@ -165,7 +165,7 @@ For the United States, pieces of this policy/testing regime data collection pipe
#### Epidemiological data

##### Multi-country
1. `Rscript codes/data/multi_country/download_6_countries_JHU.R`: Downloads 6 countries' data from the Johns Hopkins University data underlying the dashboard [here](https://coronavirus.jhu.edu/map.html).
1. `Rscript codes/data/multi_country/download_6_countries_JHU.R`: Downloads 6 countries' data from the Johns Hopkins University data underlying [their dashboard](https://coronavirus.jhu.edu/map.html). **Note:** The JHU dataset format has been changing frequently, so it is possible that this script will need to be modified.
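
Because of that instability, any consumer of the JHU CSV benefits from the same guard this commit adds to `codes/plotting/figA2.py`. A minimal sketch of the pattern in Python (the URL is the one referenced in figA2.py and may itself move upstream):

```python
import warnings
from urllib.error import HTTPError

import pandas as pd

# URL as referenced in codes/plotting/figA2.py; JHU has renamed files before
URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_19-covid-Confirmed.csv"
)


def fetch_jhu_confirmed(url: str = URL):
    """Return the JHU confirmed-cases table, or None if the URL has moved."""
    try:
        return pd.read_csv(url)
    except HTTPError:
        warnings.warn("JHU data no longer available at URL; skipping download.")
        return None
```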

##### China
1. For data from January 24, 2020 onwards, we relied on [an open source GitHub project](https://github.com/BlankerL/DXY-COVID-19-Data). Download the data and save it to [data/raw/china/DXYArea.csv](data/raw/china/DXYArea.csv).
15 changes: 10 additions & 5 deletions codes/data/multi_country/download_6_countries_JHU.R
@@ -1,6 +1,11 @@
suppressPackageStartupMessages(library(tidyverse))
for (i in c("china", "france", "iran", "italy", "korea", "usa"))
{
dir.create(paste("data/interim/",i, sep=""), recursive=TRUE, showWarnings=FALSE)
source(paste("codes/data/", i, "/download_and_clean_JHU_", i, ".R", sep=""))
}
tryCatch({
for (i in c("china", "france", "iran", "italy", "korea", "usa"))
{
dir.create(paste("data/interim/",i, sep=""), recursive=TRUE, showWarnings=FALSE)
source(paste("codes/data/", i, "/download_and_clean_JHU_", i, ".R", sep=""))
}
},
error=function(cond) {
message("SKIP ERROR: JHU download/processing not working. Data format/URL has likely changed and script will need to be updated")
})
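
Note that the `tryCatch()` wraps the entire six-country loop rather than each `source()` call, so a failure for one country also skips the countries after it; wrapping each iteration individually would let the remaining downloads proceed. The `SKIP ERROR` prefix makes these soft failures easy to spot in a long pipeline log.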
30 changes: 26 additions & 4 deletions codes/data/multi_country/get_adm_info.py
@@ -14,7 +14,7 @@

idx = pd.IndexSlice

if cutil.API_KEYS["census"] == "API_KEY_STRING":
if cutil.API_KEYS["census"] == "YOUR_API_KEY":
raise ValueError(
"""To run this script, you will need a U.S. Census API key, which can be obtained"""
"""here: https://api.census.gov/data/key_signup.html. You will need to save this """
@@ -405,10 +405,31 @@ def main():
)
us_gdf = us_gdf[us_gdf.HASC_2.notnull()]

us_pops = us_gdf.join(us_county_df, on="HASC_2", how="left")
us_pops = us_gdf.join(us_county_df, on="HASC_2", how="outer")
us_pops = us_pops[["NAME_1", "NAME_2", "fips", "population", "area_km2", "capital"]]
us_pops = us_pops.rename(columns={"NAME_1": "adm1_name", "NAME_2": "adm2_name"})
us_pops["adm0_name"] = "USA"

# Manual addition of names that are in the statoids dataset but not the gadm shapes
manual_names = {
"24005": ("Maryland", "Baltimore County"),
"02130": ("Alaska", "Ketchikan Gateway Borough"),
"29510": ("Missouri", "City of St. Louis"),
"51019": ("Virginia", "Bedford County"),
"51059": ("Virginia", "Fairfax County"),
"51161": ("Virginia", "Roanoke County"),
"51620": ("Virginia", "Franklin City"),
"02105": ("Alaska", "Hoonah-Angoon Census Area"),
"02195": ("Alaska", "Petersburg Borough"),
"02198": ("Alaska", "Prince of Wales-Hyder Census Area"),
"51159": ("Virginia", "Richmond County"),
"02230": ("Alaska", "Skagway Municipality"),
"02275": ("Alaska", "Wrangell City and Borough"),
"02282": ("Alaska", "Yakutat City and Borough")
}

for k,v in manual_names.items():
us_pops.loc[us_pops.fips==k,['adm1_name','adm2_name']] = v
us_pops = us_pops.set_index(["adm0_name", "adm1_name", "adm2_name"])

# save fips xwalk
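
The patch loop above uses a pandas idiom worth calling out: assigning a tuple to a two-column `.loc` selection fills both columns for the matching rows. A self-contained toy example (the frame and values below are illustrative, not the analysis data):

```python
import pandas as pd

df = pd.DataFrame({
    "fips": ["24005", "29510", "06037"],
    "adm1_name": [None, None, "California"],
    "adm2_name": [None, None, "Los Angeles"],
})

manual_names = {
    "24005": ("Maryland", "Baltimore County"),
    "29510": ("Missouri", "City of St. Louis"),
}

# each tuple broadcasts across the two selected columns for the matching row
for k, v in manual_names.items():
    df.loc[df.fips == k, ["adm1_name", "adm2_name"]] = v

print(df)  # the two patched rows now carry their state and county names
```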
@@ -418,7 +439,9 @@ def main():


# ##### Merge back into global adm datasets
adm2_gdf = adm2_gdf.fillna(us_pops)
adm2_gdf = adm2_gdf.join(us_pops.population, rsuffix='_r', how="outer")
adm2_gdf['population'] = adm2_gdf.population.fillna(adm2_gdf.population_r)
adm2_gdf = adm2_gdf.drop(columns='population_r')

st_pops = (
adm2_gdf.loc[:, "population"]
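
The switch from `fillna(us_pops)` to an explicit outer join plus column-wise coalesce is a reusable pattern: join the patch table under a suffix, fill the gaps, then drop the helper column. A minimal sketch with toy data (names illustrative):

```python
import pandas as pd

left = pd.DataFrame({"population": [100.0, None]}, index=["a", "b"])
patch = pd.DataFrame({"population": [999.0, 250.0]}, index=["b", "c"])

# outer join keeps rows present in either table; '_r' marks the patch column
merged = left.join(patch, rsuffix="_r", how="outer")
merged["population"] = merged.population.fillna(merged.population_r)
merged = merged.drop(columns="population_r")
print(merged)  # 'a' keeps its value, 'b' is filled from the patch, 'c' is appended
```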
@@ -470,7 +493,6 @@ def main():
pop2.name = "population"

provinces_as_regions = pop2.loc[idx[:, ["Bolzano", "Trento"]]]
pop2 = pop2.drop(index=["Bolzano", "Trento"], level="adm2_name")
provinces_as_regions.index = provinces_as_regions.index.set_names(
"adm1_name", level="adm2_name"
)
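
With the `drop` call removed, Bolzano and Trento remain in the province-level (adm2) population series even after being copied to the region (adm1) level, so these two provinces now appear at both administrative levels.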
9 changes: 0 additions & 9 deletions codes/data/usa/merge_policy_and_cases.py
@@ -143,15 +143,6 @@ def download_and_process_policy_csv():

df_rows_merged = policy_data_adm1_only.groupby(['date','adm0_name','adm1_name'], as_index=False).agg(aggregation_styles)

# fix the travel ban countries list
#df_rows_merged['travel_ban_intl_out_country_list'].map(lambda x: [i for i in x if not np.isnan(i)])

formated_policy_data = df_rows_merged.sort_values(['date_to_sort','adm0_name','adm1_name']).drop(['date_to_sort'],axis=1)

# save intermediate version
print('writing intermediate policy file to ', os.path.join(int_data_dir,"US_COVID-19_policies_reformatted.csv"))
formated_policy_data.to_csv(os.path.join(int_data_dir,"US_COVID-19_policies_reformatted.csv"),index=False)

return df_rows_merged , policy_keys


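
For context, the `groupby(...).agg(aggregation_styles)` call retained at the top of this hunk applies a per-column aggregation map when collapsing duplicate policy rows. A toy sketch of the idiom (column names and aggregation map are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02"],
    "adm0_name": ["USA", "USA", "USA"],
    "adm1_name": ["California", "California", "California"],
    "school_closure": [0, 1, 1],
    "source_notes": ["a", "b", "c"],
})

# per-column aggregation map, analogous to aggregation_styles
aggregation_styles = {"school_closure": "max", "source_notes": "first"}
merged = df.groupby(["date", "adm0_name", "adm1_name"], as_index=False).agg(aggregation_styles)
print(merged)  # one row per (date, adm0, adm1); policy columns collapsed by max
```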
12 changes: 10 additions & 2 deletions codes/plotting/fig1.R
@@ -17,11 +17,15 @@ if (!dir.exists(output_dir)){ #make dir if it doesn't exist
dir.create(output_dir, recursive=TRUE)
}

notify <- function(country) {
message("Plotting map and timeseries for ",country)
}

#################

# ITALY
country <- "ITA"

notify(country)
#################

#### (1) Epidemiological timeseries ####
@@ -158,7 +162,7 @@ dev.off()
#######################################################################

country <- "IRN"

notify(country)
#####

#### (1) Epidemiological timeseries ####
@@ -291,6 +295,7 @@ dev.off()
######################################################################

country <- "CHN"
notify(country)

#####

@@ -419,6 +424,7 @@ dev.off()
##########################################################

country <- "USA"
notify(country)

policylist <- c("no_gathering_popwt",
"travel_ban_local_popwt",
@@ -538,6 +544,7 @@ dev.off()
##########################################################

country <- "FRA"
notify(country)

policylist <- c("no_gathering_national_100",
"home_isolation",
@@ -697,6 +704,7 @@ dev.off()
###############################################################

country <- "KOR"
notify(country)

policylist <- c("emergency_declaration",
"no_demonstration",
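
The `notify()` helper added at the top of `fig1.R` gives each of the six country blocks a one-line progress message, making it easier to tell which country's map or timeseries panel failed when the script stops partway through a long plotting run.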
148 changes: 79 additions & 69 deletions codes/plotting/figA2.py
@@ -1,5 +1,7 @@
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from urllib.error import HTTPError

import matplotlib
import datetime
@@ -8,95 +10,103 @@
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['axes.linewidth'] = 2

out_dir = cutil.HOME / 'results' / 'figures' / 'appendix'
out_dir.mkdir(parents=True, exist_ok=True)
def main():
out_dir = cutil.HOME / 'results' / 'figures' / 'appendix'
out_dir.mkdir(parents=True, exist_ok=True)

df = pd.read_csv(cutil.DATA_PROCESSED / 'adm2' / 'CHN_processed.csv')
df.loc[:, 'date'] = pd.to_datetime(df['date'])
df = pd.read_csv(cutil.DATA_PROCESSED / 'adm2' / 'CHN_processed.csv')
df.loc[:, 'date'] = pd.to_datetime(df['date'])

# Validate with JHU provincial data
# Validate with JHU provincial data

# validate with JHU
url = (
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/'
'csse_covid_19_time_series/time_series_19-covid-Confirmed.csv')
jhu = pd.read_csv(url)
jhu = jhu.loc[jhu['Country/Region'] == 'China', :]
jhu = jhu.melt(
id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
var_name='date',
value_name='cum_confirmed_jhu')
jhu.loc[:, 'date'] = pd.to_datetime(jhu['date'])
jhu.set_index('Province/State', inplace=True)
# validate with JHU
url = (
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/'
'csse_covid_19_time_series/time_series_19-covid-Confirmed.csv')
try:
jhu = pd.read_csv(url)
except HTTPError:
warnings.warn("JHU data no longer available at URL. Unable to scrape data for Fig A2.")
return None
jhu = jhu.loc[jhu['Country/Region'] == 'China', :]
jhu = jhu.melt(
id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
var_name='date',
value_name='cum_confirmed_jhu')
jhu.loc[:, 'date'] = pd.to_datetime(jhu['date'])
jhu.set_index('Province/State', inplace=True)

# agg for visualization
df_viz = df.groupby(['adm0_name', 'adm1_name', 'date']).sum().reset_index().set_index(['adm1_name'])
# agg for visualization
df_viz = df.groupby(['adm0_name', 'adm1_name', 'date']).sum().reset_index().set_index(['adm1_name'])

# plot visualization
fig, ax = plt.subplots(
nrows=2, ncols=2, figsize=(8, 6), sharex=True)
for i, province_viz in enumerate(['Hubei', 'Zhejiang', 'Guangdong', 'Henan']):
ax_i = ax[i // 2, i % 2]
jhu.loc[province_viz, :].plot(
x='date', y='cum_confirmed_jhu',
# plot visualization
fig, ax = plt.subplots(
nrows=2, ncols=2, figsize=(8, 6), sharex=True)
for i, province_viz in enumerate(['Hubei', 'Zhejiang', 'Guangdong', 'Henan']):
ax_i = ax[i // 2, i % 2]
jhu.loc[province_viz, :].plot(
x='date', y='cum_confirmed_jhu',
alpha=0.3, ax=ax_i, linewidth=3, color='dimgray',
legend=False)
df_viz.loc[province_viz, :].plot(
x='date', y='cum_confirmed_cases_imputed', style='.',
alpha=0.7, ax=ax_i, color='black',
legend=False)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_recoveries', ax=ax)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_deaths', ax=ax)
ax_i.set_title(province_viz)
ax_i.xaxis.set_major_formatter(mdates.DateFormatter(''))
ax_i.set_xlabel('')
ax_i.spines["top"].set_visible(False)
ax_i.spines["right"].set_visible(False)
ax_i.spines['bottom'].set_color('dimgray')
ax_i.spines['left'].set_color('dimgray')
ax_i.tick_params(direction='out', length=6, width=2, colors='dimgray')
x_ticks = [20200110, 20200201, 20200301, 20200318]
x_ticklabels = ['Jan 10', 'Feb 1', 'Mar 1', 'Mar 18']
x_ticks = [datetime.datetime.strptime(str(x), '%Y%m%d') for x in x_ticks]
ax_i.set_xticks(x_ticks)
ax_i.set_xticklabels(x_ticklabels)
ax_i.minorticks_off()
fig.tight_layout()
fig.savefig(out_dir / 'figA2-1.pdf')


df_kor = pd.read_csv(cutil.DATA_INTERIM / 'korea' / 'KOR_JHU_data_comparison.csv')

df_kor = df_kor.iloc[0:56, :].copy()

df_kor.loc[:, 'date'] = pd.to_datetime(df_kor['date'])


# plot visualization
fig, ax_i = plt.subplots(figsize=(4, 3))
df_kor.plot(
x='date', y='cum_confirmed_cases_JHU',
alpha=0.3, ax=ax_i, linewidth=3, color='dimgray',
legend=False)
df_viz.loc[province_viz, :].plot(
x='date', y='cum_confirmed_cases_imputed', style='.',
df_kor.plot(
x='date', y='cum_confirmed_cases_data', style='.',
alpha=0.7, ax=ax_i, color='black',
legend=False)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_recoveries', ax=ax)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_deaths', ax=ax)
ax_i.set_title(province_viz)
ax_i.set_title('Korea')
ax_i.xaxis.set_major_formatter(mdates.DateFormatter(''))
ax_i.set_xlabel('')
ax_i.spines["top"].set_visible(False)
ax_i.spines["right"].set_visible(False)
ax_i.spines['bottom'].set_color('dimgray')
ax_i.spines['left'].set_color('dimgray')
ax_i.tick_params(direction='out', length=6, width=2, colors='dimgray')
x_ticks = [20200110, 20200201, 20200301, 20200318]
x_ticklabels = ['Jan 10', 'Feb 1', 'Mar 1', 'Mar 18']
x_ticks = [20200122, 20200201, 20200301, 20200318]
x_ticklabels = ['Jan 22', 'Feb 1', 'Mar 1', 'Mar 18']
x_ticks = [datetime.datetime.strptime(str(x), '%Y%m%d') for x in x_ticks]
ax_i.set_xticks(x_ticks)
ax_i.set_xticklabels(x_ticklabels)
ax_i.minorticks_off()
fig.tight_layout()
fig.savefig(out_dir / 'figA2-1.pdf')


df_kor = pd.read_csv(cutil.DATA_INTERIM / 'korea' / 'KOR_JHU_data_comparison.csv')

df_kor = df_kor.iloc[0:56, :].copy()

df_kor.loc[:, 'date'] = pd.to_datetime(df_kor['date'])

fig.tight_layout()
fig.savefig(out_dir / 'figA2-2.pdf')

# plot visualization
fig, ax_i = plt.subplots(figsize=(4, 3))
df_kor.plot(
x='date', y='cum_confirmed_cases_JHU',
alpha=0.3, ax=ax_i, linewidth=3, color='dimgray',
legend=False)
df_kor.plot(
x='date', y='cum_confirmed_cases_data', style='.',
alpha=0.7, ax=ax_i, color='black',
legend=False)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_recoveries', ax=ax)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_deaths', ax=ax)
ax_i.set_title('Korea')
ax_i.xaxis.set_major_formatter(mdates.DateFormatter(''))
ax_i.set_xlabel('')
ax_i.spines["top"].set_visible(False)
ax_i.spines["right"].set_visible(False)
ax_i.spines['bottom'].set_color('dimgray')
ax_i.spines['left'].set_color('dimgray')
ax_i.tick_params(direction='out', length=6, width=2, colors='dimgray')
x_ticks = [20200122, 20200201, 20200301, 20200318]
x_ticklabels = ['Jan 22', 'Feb 1', 'Mar 1', 'Mar 18']
x_ticks = [datetime.datetime.strptime(str(x), '%Y%m%d') for x in x_ticks]
ax_i.set_xticks(x_ticks)
ax_i.set_xticklabels(x_ticklabels)
ax_i.minorticks_off()
fig.tight_layout()
fig.savefig(out_dir / 'figA2-2.pdf')
if __name__ == "__main__":
main()
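
Wrapping the module-level plotting code in `main()` is not just stylistic: `return` is only legal inside a function, so the early exit when the JHU URL disappears requires the wrapper. A minimal sketch of the resulting structure (the URL here is hypothetical):

```python
import warnings
from urllib.error import HTTPError

import pandas as pd


def main():
    url = "https://example.com/time_series.csv"  # hypothetical stand-in for the JHU URL
    try:
        jhu = pd.read_csv(url)
    except HTTPError:
        # soft-fail: warn and abort this figure without killing the pipeline
        warnings.warn("Upstream data moved; skipping this figure.")
        return None
    print(jhu.head())  # plotting code would go here


if __name__ == "__main__":
    main()
```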