CHC Signal #320
# Change Healthcare Indicator

COVID-19 indicator using outpatient visits from Change Healthcare claims data.
Reads claims data into a pandas dataframe, makes appropriate date shifts,
adjusts for backfilling, and smooths estimates. Writes results to CSVs.

## Running the Indicator
The indicator is run by directly executing the Python module contained in this
directory. The safest way to do this is to create a virtual environment,
install the common DELPHI tools, and then install the module and its
dependencies. To do this, run the following code from this directory:

```
python -m venv env
source env/bin/activate
pip install ../_delphi_utils_python/.
pip install .
```

*Note*: you may need to install BLAS; on Ubuntu, run:

```
sudo apt-get install libatlas-base-dev gfortran
```

All of the user-changeable parameters are stored in `params.json`. To execute
the module and produce the output datasets (by default, in `receiving`), run
the following:

```
env/bin/python -m delphi_changehc
```

Once you are finished with the code, you can deactivate the virtual environment
and (optionally) remove the environment itself.

```
deactivate
rm -r env
```
## Testing the code

To do a static test of the code style, it is recommended to run **pylint** on
the module. To do this, run the following from the main module directory:

```
env/bin/pylint delphi_changehc
```

The most aggressive checks are turned off; only relatively important issues
should be raised, and they should be manually checked (or, better, fixed).

Unit tests are also included in the module. To execute these, run the following
command from this directory:

```
(cd tests && ../env/bin/pytest --cov=delphi_changehc --cov-report=term-missing)
```

The output will show the number of unit tests that passed and failed, along
with the percentage of code covered by the tests. None of the tests should
fail, and the code lines that are not covered by unit tests should be few and
should not include critical sub-routines.
## Code tour

- `update_sensor.py`: `CHCSensorUpdator`: reads the data, makes transformations, and writes results to file
- `sensor.py`: `CHCSensor`: methods for transforming data, including backfill and smoothing
- `smooth.py`: implements the local linear left Gaussian filter
- `load_data.py`: methods for loading denominator and covid data
- `config.py`: `Config`: constants for reading data and transformations; `Constants`: constants for sanity checks
- `constants.py`: constants for signal names
- `weekday.py`: `Weekday`: adjusts for the weekday effect
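The `smooth.py` entry above refers to a local linear left Gaussian filter. As a rough, self-contained sketch of that idea (this is not the module's actual implementation, and the exact weight formula here is an assumption), each point can be estimated from a Gaussian-weighted linear fit over its own history:

```python
import numpy as np

def left_gaussian_smooth(y, bandwidth=100.0):
    """Estimate each point from a weighted linear fit of itself and its past.

    Weights decay as a Gaussian in the distance from the current point, so
    only the left (past) side of the series influences each estimate.
    """
    y = np.asarray(y, dtype=float)
    out = np.empty(len(y))
    for t in range(len(y)):
        idx = np.arange(t + 1)                      # points 0..t
        w = np.exp(-((t - idx) ** 2) / bandwidth)   # left-sided Gaussian weights
        X = np.column_stack([np.ones(t + 1), idx])  # intercept + time
        sw = np.sqrt(w)
        # weighted least squares via scaled design matrix
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y[idx] * sw, rcond=None)
        out[t] = beta[0] + beta[1] * t              # fitted value at time t
    return out
```

Because the fit only sees past values, the smoother never uses future data, which matters for a signal that must be produced in real time.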
## Code Review (Python)

A code review of this module should include a careful look at the code and the
output. To assist in the process, but certainly not to replace it, please
check the following items.

**Documentation**
- [ ] the README.md file template is filled out and currently accurate; it is
possible to load and test the code using only the instructions given
- [ ] minimal docstrings (one line describing what the function does) are
included for all functions; full docstrings describing the inputs and expected
outputs should be given for non-trivial functions

**Structure**

- [ ] code should use 4 spaces for indentation; other style decisions are
flexible, but be consistent within a module
- [ ] any required metadata files are checked into the repository and placed
within the directory `static`
- [ ] any intermediate files that are created and stored by the module should
be placed in the directory `cache`
- [ ] final expected output files to be uploaded to the API are placed in the
`receiving` directory; output files should not be committed to the repository
- [ ] all options and API keys are passed through the file `params.json`
- [ ] template parameter file (`params.json.template`) is checked into the
code; no personal (i.e., usernames) or private (i.e., API keys) information is
included in this template file

**Testing**

- [ ] module can be installed in a new virtual environment
- [ ] pylint with the default `.pylint` settings run over the module produces
minimal warnings; warnings that do exist have been confirmed as false positives
- [ ] reasonably high level of unit test coverage covering all of the main logic
of the code (e.g., missing coverage for raised errors that do not currently seem
possible to reach is okay; missing coverage for options that will be needed is
not)
- [ ] all unit tests run without errors
# -*- coding: utf-8 -*-
"""Module to pull and clean indicators from the CHC source.

This file defines the functions that are made public by the module. As the
module is intended to be executed through the main method, these are primarily
for testing.
"""

from __future__ import absolute_import

from . import config
from . import load_data
from . import run
from . import sensor
from . import smooth
from . import update_sensor
from . import weekday

__version__ = "0.0.0"
# -*- coding: utf-8 -*-
"""Call the function run_module when executed.

This file indicates that calling the module (`python -m MODULE_NAME`) will
call the function `run_module` found within the run.py file. There should be
no need to change this template.
"""

from .run import run_module  # pragma: no cover

run_module()  # pragma: no cover
"""
This file contains configuration variables used to generate the CHC signal.

Author: Aaron Rumack
Created: 2020-10-14
"""

from datetime import datetime, timedelta


class Config:
    """Static configuration variables."""

    ## dates
    FIRST_DATA_DATE = datetime(2020, 1, 1)

    # number of days training needs to produce estimate
    # (one day needed for smoother to produce values)
    BURN_IN_PERIOD = timedelta(days=1)

    # shift dates forward for labeling purposes
    DAY_SHIFT = timedelta(days=1)

    ## data columns
    COVID_COL = "COVID"
    DENOM_COL = "Denominator"
    COUNT_COLS = [COVID_COL, DENOM_COL]
    DATE_COL = "date"
    GEO_COL = "fips"
    ID_COLS = [DATE_COL, GEO_COL]
    FILT_COLS = ID_COLS + COUNT_COLS
    DENOM_COLS = [GEO_COL, DATE_COL, DENOM_COL]
    COVID_COLS = [GEO_COL, DATE_COL, COVID_COL]
    DENOM_DTYPES = {"date": str, "Denominator": str, "fips": str}
    COVID_DTYPES = {"date": str, "COVID": str, "fips": str}

    SMOOTHER_BANDWIDTH = 100  # bandwidth for the linear left Gaussian filter
    MIN_DEN = 100  # number of total visits needed to produce a sensor
    MAX_BACKFILL_WINDOW = (
        7  # maximum number of days used to average a backfill correction
    )
    MIN_CUM_VISITS = 500  # need to observe at least 500 counts before averaging


class Constants:
    """
    Contains the maximum number of geo units for each geo type.

    Used for sanity checks.
    """
    # number of counties in the USA, including megacounties
    NUM_COUNTIES = 3141 + 52
    NUM_HRRS = 308
    NUM_MSAS = 392 + 52  # MSAs + states
    NUM_STATES = 52  # including DC and PR

    MAX_GEO = {"county": NUM_COUNTIES,
               "hrr": NUM_HRRS,
               "msa": NUM_MSAS,
               "state": NUM_STATES}
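One way these caps could be used in a sanity check is to reject an output frame that reports more geographic units than exist. The helper below is a hypothetical sketch, not part of the module; the `MAX_GEO` values mirror the `Constants` class above:

```python
# Hypothetical sanity-check helper; MAX_GEO mirrors Constants.MAX_GEO above.
MAX_GEO = {
    "county": 3141 + 52,  # counties plus megacounties
    "hrr": 308,
    "msa": 392 + 52,      # MSAs plus states
    "state": 52,          # including DC and PR
}

def check_geo_count(geo_type, n_units):
    """Return True if the observed number of geo units passes the sanity cap."""
    if geo_type not in MAX_GEO:
        raise ValueError("unknown geo type: " + geo_type)
    return 0 < n_units <= MAX_GEO[geo_type]
```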
"""Registry for signal names and geo types."""
SMOOTHED = "smoothed_chc"
SMOOTHED_ADJ = "smoothed_adj_chc"
SIGNALS = [SMOOTHED, SMOOTHED_ADJ]
NA = "NA"
HRR = "hrr"
FIPS = "fips"
"""
Load CHC data.

Author: Aaron Rumack
Created: 2020-10-14
"""

# third party
import pandas as pd

# first party
from .config import Config


def load_denom_data(denom_filepath, dropdate, base_geo):
    """Load in and set up denominator data.

    Args:
        denom_filepath: path to the aggregated denominator data
        dropdate: data drop date (datetime object)
        base_geo: base geographic unit before aggregation ('fips')

    Returns:
        cleaned denominator dataframe
    """
    assert base_geo == "fips", "base unit must be 'fips'"

    denom_suffix = denom_filepath.split("/")[-1].split(".")[0][9:]
    assert denom_suffix == "All_Outpatients_By_County"
    denom_filetype = denom_filepath.split("/")[-1].split(".")[1]
    assert denom_filetype == "dat"

    denom_data = pd.read_csv(
        denom_filepath,
        sep="|",
        header=None,
        names=Config.DENOM_COLS,
        dtype=Config.DENOM_DTYPES,
    )

    denom_data[Config.DATE_COL] = \
        pd.to_datetime(denom_data[Config.DATE_COL], errors="coerce")

    # restrict to start and end date
    denom_data = denom_data[
        (denom_data[Config.DATE_COL] >= Config.FIRST_DATA_DATE) &
        (denom_data[Config.DATE_COL] < dropdate)
    ]

    # counts between 1 and 3 are coded as "3 or less"; we convert to 1
    denom_data.loc[
        denom_data[Config.DENOM_COL] == "3 or less", Config.DENOM_COL
    ] = "1"
    denom_data[Config.DENOM_COL] = denom_data[Config.DENOM_COL].astype(int)

    assert (
        (denom_data[Config.DENOM_COL] >= 0).all()
    ), "Denominator counts must be nonnegative"

    # aggregate age groups (so data is unique by date and base geography)
    denom_data = denom_data.groupby([base_geo, Config.DATE_COL]).sum()
    denom_data.dropna(inplace=True)  # drop rows with any missing entries

    return denom_data


def load_covid_data(covid_filepath, dropdate, base_geo):
    """Load in and set up covid data.

    Args:
        covid_filepath: path to the aggregated covid data
        dropdate: data drop date (datetime object)
        base_geo: base geographic unit before aggregation ('fips')

    Returns:
        cleaned covid dataframe
    """
    assert base_geo == "fips", "base unit must be 'fips'"

    covid_suffix = covid_filepath.split("/")[-1].split(".")[0][9:]
    assert covid_suffix == "Covid_Outpatients_By_County"
    covid_filetype = covid_filepath.split("/")[-1].split(".")[1]
    assert covid_filetype == "dat"

    covid_data = pd.read_csv(
        covid_filepath,
        sep="|",
        header=None,
        names=Config.COVID_COLS,
        dtype=Config.COVID_DTYPES,
    )

    covid_data[Config.DATE_COL] = \
        pd.to_datetime(covid_data[Config.DATE_COL], errors="coerce")

    # restrict to start and end date
    covid_data = covid_data[
        (covid_data[Config.DATE_COL] >= Config.FIRST_DATA_DATE) &
        (covid_data[Config.DATE_COL] < dropdate)
    ]

    # counts between 1 and 3 are coded as "3 or less"; we convert to 1
    covid_data.loc[
        covid_data[Config.COVID_COL] == "3 or less", Config.COVID_COL
    ] = "1"
    covid_data[Config.COVID_COL] = covid_data[Config.COVID_COL].astype(int)

    assert (
        (covid_data[Config.COVID_COL] >= 0).all()
    ), "COVID counts must be nonnegative"

    # aggregate age groups (so data is unique by date and base geography)
    covid_data = covid_data.groupby([base_geo, Config.DATE_COL]).sum()
    covid_data.dropna(inplace=True)  # drop rows with any missing entries

    return covid_data


def load_combined_data(denom_filepath, covid_filepath, dropdate, base_geo):
    """Load in denominator and covid data, and combine them.

    Args:
        denom_filepath: path to the aggregated denominator data
        covid_filepath: path to the aggregated covid data
        dropdate: data drop date (datetime object)
        base_geo: base geographic unit before aggregation ('fips')

    Returns:
        combined multiindexed dataframe; index 0 is geo_base, index 1 is date
    """
    assert base_geo == "fips", "base unit must be 'fips'"

    # load each data stream
    denom_data = load_denom_data(denom_filepath, dropdate, base_geo)
    covid_data = load_covid_data(covid_filepath, dropdate, base_geo)

    # merge data
    data = denom_data.merge(covid_data, how="outer", left_index=True, right_index=True)
    assert data.isna().all(axis=1).sum() == 0, "entire row is NA after merge"

    # calculate combined numerator and denominator
    data.fillna(0, inplace=True)
    data["num"] = data[Config.COVID_COL]
    data["den"] = data[Config.DENOM_COL]
    data = data[["num", "den"]]

    return data
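Both loaders share the same small cleaning pattern: censored counts coded as "3 or less" are mapped to 1 before integer conversion, and rows are then aggregated so the data is unique by geography and date. A toy, standalone illustration of that pattern on made-up data (using `.loc` for the recode, which avoids pandas chained-assignment warnings):

```python
import pandas as pd

# toy frame mimicking the pipe-delimited claims layout after read_csv
df = pd.DataFrame({
    "fips": ["01001", "01001", "01003"],
    "date": ["2020-10-01", "2020-10-01", "2020-10-01"],
    "Denominator": ["5", "3 or less", "7"],
})

# map the censored code to 1, then convert to integers
df.loc[df["Denominator"] == "3 or less", "Denominator"] = "1"
df["Denominator"] = df["Denominator"].astype(int)

# aggregate so the data is unique by (fips, date)
agg = df.groupby(["fips", "date"]).sum()
```

Here `agg` carries a (fips, date) MultiIndex, matching the shape that `load_combined_data` merges on.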