Merge pull request #42 from bolliger32/handle-jhu-errors
Final pass before tagging initial version
bolliger32 authored Mar 27, 2020
2 parents 7b5689f + 778772b commit 4f126f1
Showing 61 changed files with 188,342 additions and 187,619 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -34,9 +34,6 @@ ssc install coefplot, replace
ssc install filelist, replace
```

## Data Documentation
A detailed description of the epidemiological and policy data obtained and processed for this analysis can be found [here](https://www.dropbox.com/scl/fi/8djnxhj0wqqbyzg2qhiie/SI.gdoc?dl=0&rlkey=jnjy82ov2km7vc0q1k6190esp). This is a live document that may be updated as additional data becomes available. For a version that is fixed at the time this manuscript was submitted, please see the link to our paper at the top of this README.

## Code Structure
```text
codes
@@ -110,6 +107,9 @@ codes
└── utils.py
```

## Data Documentation
A detailed description of the epidemiological and policy data obtained and processed for this analysis can be found [here](https://www.dropbox.com/scl/fi/8djnxhj0wqqbyzg2qhiie/SI.gdoc?dl=0&rlkey=jnjy82ov2km7vc0q1k6190esp). This is a live document that may be updated as additional data becomes available. For a version that is fixed at the time this manuscript was submitted, please see the link to our paper at the top of this README.

## Replication Steps

There are four stages to our analysis:
@@ -121,7 +121,7 @@ There are four stages to our analysis:
### Data collection and processing
The steps to obtain all data in <data/raw>, and then process this data into datasets that can be ingested into a regression, are described below. Note that some of the data collection was performed through manual downloading and/or processing of datasets and is described in as much detail as possible. The sections should be run in the order listed, as some files from later sections will depend on those from earlier sections (e.g. the geographical and population data).

For detailed information on the manual collection of policy, epidemiological, and population information, see the [up-to-date](https://www.dropbox.com/scl/fi/8djnxhj0wqqbyzg2qhiie/SI.gdoc?dl=0&rlkey=jnjy82ov2km7vc0q1k6190esp) version of our paper’s Appendix. A version that was frozen at the time of submission is available with the article cited at the top of this README. Our epidemiological and policy data sources for all countries are listed [here](references/data_sources_static_20200321.xlsx), with a more frequently updated version [here](https://www.dropbox.com/scl/fi/v3o62qfrpam45ylaofekn/data_sources.gsheet?dl=0&rlkey=p3miruxmvq4cxqz7r3q7dc62t).
For detailed information on the manual collection of policy, epidemiological, and population information, see the [up-to-date](https://www.dropbox.com/scl/fi/8djnxhj0wqqbyzg2qhiie/SI.gdoc?dl=0&rlkey=jnjy82ov2km7vc0q1k6190esp) version of our paper’s Appendix. A version that was frozen at the time of submission is available with the article cited at the top of this README. Our epidemiological and policy data sources for all countries are listed [here](references/data_sources.xlsx), with a more frequently updated version [here](https://www.dropbox.com/scl/fi/v3o62qfrpam45ylaofekn/data_sources.gsheet?dl=0&rlkey=p3miruxmvq4cxqz7r3q7dc62t).

#### Geographical and population data
1. `python codes/data/multi_country/get_adm_info.py`: Generates shapefiles and csvs with administrative unit names, geographies, and populations (most countries). **Note:** To run this script, you will need a U.S. Census API key. See [Setup](#setup).
@@ -165,7 +165,7 @@ For the United States, pieces of this policy/testing regime data collection pipe
#### Epidemiological data

##### Multi-country
1. `Rscript codes/data/multi_country/download_6_countries_JHU.R`: Downloads 6 countries' data from the Johns Hopkins University data underlying the dashboard [here](https://coronavirus.jhu.edu/map.html).
1. `Rscript codes/data/multi_country/download_6_countries_JHU.R`: Downloads 6 countries' data from the Johns Hopkins University data underlying [their dashboard](https://coronavirus.jhu.edu/map.html). **Note:** The JHU dataset format has been changing frequently, so it is possible that this script will need to be modified.
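
Because of that instability, any consumer of the JHU CSV benefits from the same guard this commit adds to `codes/plotting/figA2.py`. A minimal sketch of the pattern in Python (the URL is the one referenced in figA2.py and may itself move upstream):

```python
import warnings
from urllib.error import HTTPError

import pandas as pd

# URL as referenced in codes/plotting/figA2.py; JHU has renamed files before
URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_19-covid-Confirmed.csv"
)


def fetch_jhu_confirmed(url: str = URL):
    """Return the JHU confirmed-cases table, or None if the URL has moved."""
    try:
        return pd.read_csv(url)
    except HTTPError:
        warnings.warn("JHU data no longer available at URL; skipping download.")
        return None
```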

##### China
1. For data from January 24, 2020 onwards, we relied on [an open source GitHub project](https://github.com/BlankerL/DXY-COVID-19-Data). Download the data and save it to [data/raw/china/DXYArea.csv](data/raw/china/DXYArea.csv).
15 changes: 10 additions & 5 deletions codes/data/multi_country/download_6_countries_JHU.R
@@ -1,6 +1,11 @@
suppressPackageStartupMessages(library(tidyverse))
for (i in c("china", "france", "iran", "italy", "korea", "usa"))
{
dir.create(paste("data/interim/",i, sep=""), recursive=TRUE, showWarnings=FALSE)
source(paste("codes/data/", i, "/download_and_clean_JHU_", i, ".R", sep=""))
}
tryCatch({
for (i in c("china", "france", "iran", "italy", "korea", "usa"))
{
dir.create(paste("data/interim/",i, sep=""), recursive=TRUE, showWarnings=FALSE)
source(paste("codes/data/", i, "/download_and_clean_JHU_", i, ".R", sep=""))
}
},
error=function(cond) {
message("SKIP ERROR: JHU download/processing not working. Data format/URL has likely changed and script will need to be updated")
})
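
Note that the `tryCatch()` wraps the entire six-country loop rather than each `source()` call, so a failure for one country also skips the countries after it; wrapping each iteration individually would let the remaining downloads proceed. The `SKIP ERROR` prefix makes these soft failures easy to spot in a long pipeline log.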
30 changes: 26 additions & 4 deletions codes/data/multi_country/get_adm_info.py
@@ -14,7 +14,7 @@

idx = pd.IndexSlice

if cutil.API_KEYS["census"] == "API_KEY_STRING":
if cutil.API_KEYS["census"] == "YOUR_API_KEY":
raise ValueError(
"""To run this script, you will need a U.S. Census API key, which can be obtained"""
"""here: https://api.census.gov/data/key_signup.html. You will need to save this """
@@ -405,10 +405,31 @@ def main():
)
us_gdf = us_gdf[us_gdf.HASC_2.notnull()]

us_pops = us_gdf.join(us_county_df, on="HASC_2", how="left")
us_pops = us_gdf.join(us_county_df, on="HASC_2", how="outer")
us_pops = us_pops[["NAME_1", "NAME_2", "fips", "population", "area_km2", "capital"]]
us_pops = us_pops.rename(columns={"NAME_1": "adm1_name", "NAME_2": "adm2_name"})
us_pops["adm0_name"] = "USA"

# Manual addition of names that are in the statoids dataset but not the gadm shapes
manual_names = {
"24005": ("Maryland", "Baltimore County"),
"02130": ("Alaska", "Ketchikan Gateway Borough"),
"29510": ("Missouri", "City of St. Louis"),
"51019": ("Virginia", "Bedford County"),
"51059": ("Virginia", "Fairfax County"),
"51161": ("Virginia", "Roanoke County"),
"51620": ("Virginia", "Franklin City"),
"02105": ("Alaska", "Hoonah-Angoon Census Area"),
"02195": ("Alaska", "Petersburg Borough"),
"02198": ("Alaska", "Prince of Wales-Hyder Census Area"),
"51159": ("Virginia", "Richmond County"),
"02230": ("Alaska", "Skagway Municipality"),
"02275": ("Alaska", "Wrangell City and Borough"),
"02282": ("Alaska", "Yakutat City and Borough")
}

for k,v in manual_names.items():
us_pops.loc[us_pops.fips==k,['adm1_name','adm2_name']] = v
us_pops = us_pops.set_index(["adm0_name", "adm1_name", "adm2_name"])

# save fips xwalk
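
The patch loop above uses a pandas idiom worth calling out: assigning a tuple to a two-column `.loc` selection fills both columns for the matching rows. A self-contained toy example (the frame and values below are illustrative, not the analysis data):

```python
import pandas as pd

df = pd.DataFrame({
    "fips": ["24005", "29510", "06037"],
    "adm1_name": [None, None, "California"],
    "adm2_name": [None, None, "Los Angeles"],
})

manual_names = {
    "24005": ("Maryland", "Baltimore County"),
    "29510": ("Missouri", "City of St. Louis"),
}

# each tuple broadcasts across the two selected columns for the matching row
for k, v in manual_names.items():
    df.loc[df.fips == k, ["adm1_name", "adm2_name"]] = v

print(df)  # the two patched rows now carry their state and county names
```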
@@ -418,7 +439,9 @@ def main():


# ##### Merge back into global adm datasets
adm2_gdf = adm2_gdf.fillna(us_pops)
adm2_gdf = adm2_gdf.join(us_pops.population, rsuffix='_r', how="outer")
adm2_gdf['population'] = adm2_gdf.population.fillna(adm2_gdf.population_r)
adm2_gdf = adm2_gdf.drop(columns='population_r')

st_pops = (
adm2_gdf.loc[:, "population"]
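
The switch from `fillna(us_pops)` to an explicit outer join plus column-wise coalesce is a reusable pattern: join the patch table under a suffix, fill the gaps, then drop the helper column. A minimal sketch with toy data (names illustrative):

```python
import pandas as pd

left = pd.DataFrame({"population": [100.0, None]}, index=["a", "b"])
patch = pd.DataFrame({"population": [999.0, 250.0]}, index=["b", "c"])

# outer join keeps rows present in either table; '_r' marks the patch column
merged = left.join(patch, rsuffix="_r", how="outer")
merged["population"] = merged.population.fillna(merged.population_r)
merged = merged.drop(columns="population_r")
print(merged)  # 'a' keeps its value, 'b' is filled from the patch, 'c' is appended
```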
@@ -470,7 +493,6 @@ def main():
pop2.name = "population"

provinces_as_regions = pop2.loc[idx[:, ["Bolzano", "Trento"]]]
pop2 = pop2.drop(index=["Bolzano", "Trento"], level="adm2_name")
provinces_as_regions.index = provinces_as_regions.index.set_names(
"adm1_name", level="adm2_name"
)
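
With the `drop` call removed, Bolzano and Trento remain in the province-level (adm2) population series even after being copied to the region (adm1) level, so these two provinces now appear at both administrative levels.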
9 changes: 0 additions & 9 deletions codes/data/usa/merge_policy_and_cases.py
@@ -143,15 +143,6 @@ def download_and_process_policy_csv():

df_rows_merged = policy_data_adm1_only.groupby(['date','adm0_name','adm1_name'], as_index=False).agg(aggregation_styles)

# fix the travel ban countries list
#df_rows_merged['travel_ban_intl_out_country_list'].map(lambda x: [i for i in x if not np.isnan(i)])

formated_policy_data = df_rows_merged.sort_values(['date_to_sort','adm0_name','adm1_name']).drop(['date_to_sort'],axis=1)

# save intermediate version
print('writing intermediate policy file to ', os.path.join(int_data_dir,"US_COVID-19_policies_reformatted.csv"))
formated_policy_data.to_csv(os.path.join(int_data_dir,"US_COVID-19_policies_reformatted.csv"),index=False)

return df_rows_merged , policy_keys


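
For context, the `groupby(...).agg(aggregation_styles)` call retained at the top of this hunk applies a per-column aggregation map when collapsing duplicate policy rows. A toy sketch of the idiom (column names and aggregation map are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02"],
    "adm0_name": ["USA", "USA", "USA"],
    "adm1_name": ["California", "California", "California"],
    "school_closure": [0, 1, 1],
    "source_notes": ["a", "b", "c"],
})

# per-column aggregation map, analogous to aggregation_styles
aggregation_styles = {"school_closure": "max", "source_notes": "first"}
merged = df.groupby(["date", "adm0_name", "adm1_name"], as_index=False).agg(aggregation_styles)
print(merged)  # one row per (date, adm0, adm1); policy columns collapsed by max
```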
12 changes: 10 additions & 2 deletions codes/plotting/fig1.R
@@ -17,11 +17,15 @@ if (!dir.exists(output_dir)){ #make dir if it doesn't exist
dir.create(output_dir, recursive=TRUE)
}

notify <- function(country) {
message("Plotting map and timeseries for ",country)
}

#################

# ITALY
country <- "ITA"

notify(country)
#################

#### (1) Epidemiological timeseries ####
@@ -158,7 +162,7 @@ dev.off()
#######################################################################

country <- "IRN"

notify(country)
#####

#### (1) Epidemiological timeseries ####
@@ -291,6 +295,7 @@ dev.off()
######################################################################

country <- "CHN"
notify(country)

#####

@@ -419,6 +424,7 @@ dev.off()
##########################################################

country <- "USA"
notify(country)

policylist <- c("no_gathering_popwt",
"travel_ban_local_popwt",
@@ -538,6 +544,7 @@ dev.off()
##########################################################

country <- "FRA"
notify(country)

policylist <- c("no_gathering_national_100",
"home_isolation",
@@ -697,6 +704,7 @@ dev.off()
###############################################################

country <- "KOR"
notify(country)

policylist <- c("emergency_declaration",
"no_demonstration",
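
The `notify()` helper added at the top of `fig1.R` gives each of the six country blocks a one-line progress message, making it easier to tell which country's map or timeseries panel failed when the script stops partway through a long plotting run.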
148 changes: 79 additions & 69 deletions codes/plotting/figA2.py
@@ -1,5 +1,7 @@
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from urllib.error import HTTPError

import matplotlib
import datetime
@@ -8,95 +10,103 @@
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['axes.linewidth'] = 2

out_dir = cutil.HOME / 'results' / 'figures' / 'appendix'
out_dir.mkdir(parents=True, exist_ok=True)
def main():
out_dir = cutil.HOME / 'results' / 'figures' / 'appendix'
out_dir.mkdir(parents=True, exist_ok=True)

df = pd.read_csv(cutil.DATA_PROCESSED / 'adm2' / 'CHN_processed.csv')
df.loc[:, 'date'] = pd.to_datetime(df['date'])
df = pd.read_csv(cutil.DATA_PROCESSED / 'adm2' / 'CHN_processed.csv')
df.loc[:, 'date'] = pd.to_datetime(df['date'])

# Validate with JHU provincial data
# Validate with JHU provincial data

# validate with JHU
url = (
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/'
'csse_covid_19_time_series/time_series_19-covid-Confirmed.csv')
jhu = pd.read_csv(url)
jhu = jhu.loc[jhu['Country/Region'] == 'China', :]
jhu = jhu.melt(
id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
var_name='date',
value_name='cum_confirmed_jhu')
jhu.loc[:, 'date'] = pd.to_datetime(jhu['date'])
jhu.set_index('Province/State', inplace=True)
# validate with JHU
url = (
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/'
'csse_covid_19_time_series/time_series_19-covid-Confirmed.csv')
try:
jhu = pd.read_csv(url)
except HTTPError:
warnings.warn("JHU data no longer available at URL. Unable to scrape data for Fig A2.")
return None
jhu = jhu.loc[jhu['Country/Region'] == 'China', :]
jhu = jhu.melt(
id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
var_name='date',
value_name='cum_confirmed_jhu')
jhu.loc[:, 'date'] = pd.to_datetime(jhu['date'])
jhu.set_index('Province/State', inplace=True)

# agg for visualization
df_viz = df.groupby(['adm0_name', 'adm1_name', 'date']).sum().reset_index().set_index(['adm1_name'])
# agg for visualization
df_viz = df.groupby(['adm0_name', 'adm1_name', 'date']).sum().reset_index().set_index(['adm1_name'])

# plot visualization
fig, ax = plt.subplots(
nrows=2, ncols=2, figsize=(8, 6), sharex=True)
for i, province_viz in enumerate(['Hubei', 'Zhejiang', 'Guangdong', 'Henan']):
ax_i = ax[i // 2, i % 2]
jhu.loc[province_viz, :].plot(
x='date', y='cum_confirmed_jhu',
# plot visualization
fig, ax = plt.subplots(
nrows=2, ncols=2, figsize=(8, 6), sharex=True)
for i, province_viz in enumerate(['Hubei', 'Zhejiang', 'Guangdong', 'Henan']):
ax_i = ax[i // 2, i % 2]
jhu.loc[province_viz, :].plot(
x='date', y='cum_confirmed_jhu',
alpha=0.3, ax=ax_i, linewidth=3, color='dimgray',
legend=False)
df_viz.loc[province_viz, :].plot(
x='date', y='cum_confirmed_cases_imputed', style='.',
alpha=0.7, ax=ax_i, color='black',
legend=False)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_recoveries', ax=ax)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_deaths', ax=ax)
ax_i.set_title(province_viz)
ax_i.xaxis.set_major_formatter(mdates.DateFormatter(''))
ax_i.set_xlabel('')
ax_i.spines["top"].set_visible(False)
ax_i.spines["right"].set_visible(False)
ax_i.spines['bottom'].set_color('dimgray')
ax_i.spines['left'].set_color('dimgray')
ax_i.tick_params(direction='out', length=6, width=2, colors='dimgray')
x_ticks = [20200110, 20200201, 20200301, 20200318]
x_ticklabels = ['Jan 10', 'Feb 1', 'Mar 1', 'Mar 18']
x_ticks = [datetime.datetime.strptime(str(x), '%Y%m%d') for x in x_ticks]
ax_i.set_xticks(x_ticks)
ax_i.set_xticklabels(x_ticklabels)
ax_i.minorticks_off()
fig.tight_layout()
fig.savefig(out_dir / 'figA2-1.pdf')


df_kor = pd.read_csv(cutil.DATA_INTERIM / 'korea' / 'KOR_JHU_data_comparison.csv')

df_kor = df_kor.iloc[0:56, :].copy()

df_kor.loc[:, 'date'] = pd.to_datetime(df_kor['date'])


# plot visualization
fig, ax_i = plt.subplots(figsize=(4, 3))
df_kor.plot(
x='date', y='cum_confirmed_cases_JHU',
alpha=0.3, ax=ax_i, linewidth=3, color='dimgray',
legend=False)
df_viz.loc[province_viz, :].plot(
x='date', y='cum_confirmed_cases_imputed', style='.',
df_kor.plot(
x='date', y='cum_confirmed_cases_data', style='.',
alpha=0.7, ax=ax_i, color='black',
legend=False)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_recoveries', ax=ax)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_deaths', ax=ax)
ax_i.set_title(province_viz)
ax_i.set_title('Korea')
ax_i.xaxis.set_major_formatter(mdates.DateFormatter(''))
ax_i.set_xlabel('')
ax_i.spines["top"].set_visible(False)
ax_i.spines["right"].set_visible(False)
ax_i.spines['bottom'].set_color('dimgray')
ax_i.spines['left'].set_color('dimgray')
ax_i.tick_params(direction='out', length=6, width=2, colors='dimgray')
x_ticks = [20200110, 20200201, 20200301, 20200318]
x_ticklabels = ['Jan 10', 'Feb 1', 'Mar 1', 'Mar 18']
x_ticks = [20200122, 20200201, 20200301, 20200318]
x_ticklabels = ['Jan 22', 'Feb 1', 'Mar 1', 'Mar 18']
x_ticks = [datetime.datetime.strptime(str(x), '%Y%m%d') for x in x_ticks]
ax_i.set_xticks(x_ticks)
ax_i.set_xticklabels(x_ticklabels)
ax_i.minorticks_off()
fig.tight_layout()
fig.savefig(out_dir / 'figA2-1.pdf')


df_kor = pd.read_csv(cutil.DATA_INTERIM / 'korea' / 'KOR_JHU_data_comparison.csv')

df_kor = df_kor.iloc[0:56, :].copy()

df_kor.loc[:, 'date'] = pd.to_datetime(df_kor['date'])

fig.tight_layout()
fig.savefig(out_dir / 'figA2-2.pdf')

# plot visualization
fig, ax_i = plt.subplots(figsize=(4, 3))
df_kor.plot(
x='date', y='cum_confirmed_cases_JHU',
alpha=0.3, ax=ax_i, linewidth=3, color='dimgray',
legend=False)
df_kor.plot(
x='date', y='cum_confirmed_cases_data', style='.',
alpha=0.7, ax=ax_i, color='black',
legend=False)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_recoveries', ax=ax)
# df_viz.loc[province_viz, :].plot(x='date', y='cum_deaths', ax=ax)
ax_i.set_title('Korea')
ax_i.xaxis.set_major_formatter(mdates.DateFormatter(''))
ax_i.set_xlabel('')
ax_i.spines["top"].set_visible(False)
ax_i.spines["right"].set_visible(False)
ax_i.spines['bottom'].set_color('dimgray')
ax_i.spines['left'].set_color('dimgray')
ax_i.tick_params(direction='out', length=6, width=2, colors='dimgray')
x_ticks = [20200122, 20200201, 20200301, 20200318]
x_ticklabels = ['Jan 22', 'Feb 1', 'Mar 1', 'Mar 18']
x_ticks = [datetime.datetime.strptime(str(x), '%Y%m%d') for x in x_ticks]
ax_i.set_xticks(x_ticks)
ax_i.set_xticklabels(x_ticklabels)
ax_i.minorticks_off()
fig.tight_layout()
fig.savefig(out_dir / 'figA2-2.pdf')
if __name__ == "__main__":
main()
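
Wrapping the module-level plotting code in `main()` is not just stylistic: `return` is only legal inside a function, so the early exit when the JHU URL disappears requires the wrapper. A minimal sketch of the resulting structure (the URL here is hypothetical):

```python
import warnings
from urllib.error import HTTPError

import pandas as pd


def main():
    url = "https://example.com/time_series.csv"  # hypothetical stand-in for the JHU URL
    try:
        jhu = pd.read_csv(url)
    except HTTPError:
        # soft-fail: warn and abort this figure without killing the pipeline
        warnings.warn("Upstream data moved; skipping this figure.")
        return None
    print(jhu.head())  # plotting code would go here


if __name__ == "__main__":
    main()
```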