Data Extraction, Validation, and Cleaning
The CODI PPRL process depends on information pulled from a database or translated from a `.csv` file structured to match the CODI Data Model. `extract.py` either connects to a database and extracts information, or reads from a provided `.csv` file, cleaning and validating it to prepare it for the PPRL process. The script will output a `temp-data/pii-TIMESTAMP.csv` file that contains the PII ready for garbling.
To extract from a database, `extract.py` requires a database connection string. Consult the SQLAlchemy documentation to determine the exact string for the database in use.

(To translate from a `.csv` file instead, `extract.py` requires a `.json` configuration file, the path to which must be specified with the `--csv_config` flag. The requirements of this configuration file are described below.)
By default, `extract.py` expects a schema named `cdm` containing the CODI Data Model PRIVATE_DEMOGRAPHIC and PRIVATE_ADDRESS_HISTORY tables. The `--schema_name` option can be used to provide the name of the schema containing these tables in the source system, if other than `cdm`.
Because CODI PPRL currently operates with only a single address per individual, and the PRIVATE_ADDRESS_HISTORY table allows for multiple addresses, the extraction process selects a single one. By default, the logic is to select the most recent address (based on ADDRESS_PERIOD_START) with ADDRESS_PREFERRED='Y'; however, this logic can be particularly slow on large datasets. The `--address_selection` flag can be used to specify alternate approaches:
- `--address_selection full`: The default approach; select the most recent address with ADDRESS_PREFERRED='Y'.
- `--address_selection preferred`: Bypasses the date logic and selects all rows where ADDRESS_PREFERRED='Y'. This is intended for data owners who may have multiple addresses per individual but can guarantee that only one per PATID has ADDRESS_PREFERRED='Y'.
- `--address_selection single`: Bypasses all filters and selects all addresses. This is intended for data owners who are guaranteed to have only a single address per PATID.
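For illustration, the default `full` selection can be sketched in plain Python over hypothetical row dictionaries. This is only a sketch of the selection rule; `extract.py` performs the equivalent selection against the database, not in Python:

```python
from datetime import date

def select_full(address_rows):
    """Per PATID, keep the row with the most recent ADDRESS_PERIOD_START
    among rows where ADDRESS_PREFERRED == 'Y' (the 'full' rule)."""
    best = {}
    for row in address_rows:
        if row["ADDRESS_PREFERRED"] != "Y":
            continue
        patid = row["PATID"]
        current = best.get(patid)
        if current is None or row["ADDRESS_PERIOD_START"] > current["ADDRESS_PERIOD_START"]:
            best[patid] = row
    return best

# Hypothetical PRIVATE_ADDRESS_HISTORY rows for a single individual
rows = [
    {"PATID": "1", "ADDRESS_PREFERRED": "Y", "ADDRESS_PERIOD_START": date(2019, 1, 1)},
    {"PATID": "1", "ADDRESS_PREFERRED": "Y", "ADDRESS_PERIOD_START": date(2021, 6, 1)},
    {"PATID": "1", "ADDRESS_PREFERRED": "N", "ADDRESS_PERIOD_START": date(2022, 1, 1)},
]
print(select_full(rows)["1"]["ADDRESS_PERIOD_START"])  # 2021-06-01
```

Note that the 2022 address is skipped despite being newest, because it is not marked preferred.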
IMPORTANT: If using the `preferred` or `single` approach described above, it is critical to ensure the extracted pii.csv contains exactly one row per PATID. Using the data characterization script described below, review the total number of rows to ensure it matches your expectations (i.e., if it is low, then certain PATIDs may not have a preferred address, or some other selection error may have occurred) and ensure the number of duplicates is low, preferably 0 (if it is high, then there may be multiple addresses per individual).
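A quick spot check for duplicate IDs can also be done directly against the extracted file. The snippet below is an independent sketch (it assumes the ID column is named `record_id`, as in the verbose report), not part of the provided tooling:

```python
import csv
import io
from collections import Counter

def duplicate_ids(csv_file, id_column="record_id"):
    """Return {id: count} for IDs that appear more than once."""
    counts = Counter(row[id_column] for row in csv.DictReader(csv_file))
    return {k: v for k, v in counts.items() if v > 1}

# Example against an in-memory CSV; in practice pass an open file handle,
# e.g. open("temp-data/pii-TIMESTAMP.csv", newline="").
sample = io.StringIO("record_id,given_name\n1,ANN\n2,BOB\n1,ANNE\n")
print(duplicate_ids(sample))  # {'1': 2}
```

An empty result means every row has a unique ID, which is what the `preferred` and `single` approaches require.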
A note on performance: even with the `preferred` or `single` approaches, selecting all records may be a slow process. If possible, adding the following database indices is highly recommended:

- A unique index on PRIVATE_DEMOGRAPHIC (PATID)
- An index on PRIVATE_ADDRESS_HISTORY (PATID) (potentially a unique index if your specific setup guarantees this)
- An index on PRIVATE_ADDRESS_HISTORY (PATID, ADDRESS_PREFERRED) (again, potentially a unique index if your setup guarantees this expectation)
Finally, if you specify the `--verbose` flag, the script will print a report to the terminal documenting various issues it found while extracting the data. An example execution of the script is included below:
```
$ python extract.py postgresql://codi:codi@localhost/codi -v
Total records exported: 5476

record_id
--------------------
given_name
--------------------
Contains Non-ASCII Characters: 226
Contains Non-printable Characters: 3
family_name
--------------------
Contains Non-ASCII Characters: 2
DOB
--------------------
sex
--------------------
phone_number
--------------------
household_street_address
--------------------
Contains Non-ASCII Characters: 1
Contains Non-printable Characters: 1
household_zip
--------------------
NULL Value: 9
```
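The per-field checks in a report like the one above can be approximated as follows. This is an illustrative sketch of the kinds of checks involved, not `extract.py`'s actual implementation:

```python
def field_issues(values):
    """Count per-field data issues similar to those in the verbose report."""
    issues = {
        "Contains Non-ASCII Characters": 0,
        "Contains Non-printable Characters": 0,
        "NULL Value": 0,
    }
    for v in values:
        if v is None or v == "":
            issues["NULL Value"] += 1
            continue
        if not v.isascii():
            issues["Contains Non-ASCII Characters"] += 1
        if not v.isprintable():
            issues["Contains Non-printable Characters"] += 1
    # Report only the issue types that actually occurred
    return {k: n for k, n in issues.items() if n}

print(field_issues(["JOSÉ", "ANN\x00", "", "BOB"]))
# {'Contains Non-ASCII Characters': 1, 'Contains Non-printable Characters': 1, 'NULL Value': 1}
```

Non-ASCII and non-printable characters matter here because PPRL garbling requires both parties to normalize PII identically before hashing.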
The configuration file used to extract and translate data for PPRL from a `.csv` file must be a `.json` file, the path to which is specified with the `--csv_config` flag.

```
python extract.py my_data.csv --csv_config my_config.json
```
The `.json` file must contain the following fields:

- A "date_format" field which specifies the string representation of dates within the `.csv` for DOB extraction. The string must conform to the 1989 C date format standard. See Python's datetime documentation for the most relevant information.
- A "translation_map" field that contains an object mapping the standardized column names used in the PPRL process to the column names within the `.csv` file. The following columns require a mapping or a default value:
  - "given_name"
  - "family_name"
  - "DOB"
  - "sex"
  - "phone"
  - "address"
  - "zip"

Default values are provided as an object within the "translation_map" under the field "default_values". A default value is assigned to an entire column if that column is not present within the `.csv`, or to an individual record that has no value for that field.
The configuration file may also optionally contain the following fields:

- A "value_mapping_rules" field which contains a dictionary mapping values found within a given column to the values expected for the PPRL process (e.g., mapping "Male" and "Female" within the "sex" column to "M" and "F", respectively)
- An "initial_id" field which gives an integer to be used as the ID number for the records if a suitable column is not present in the `.csv` file or if a value is not present for that field in a given record. IDs are generated sequentially, counting up from the provided "initial_id".
See `testing-and-tuning/sample_conf.json` for an example of the configuration file.
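As a rough illustration of the structure described above, a configuration file might look like the following. The column names and values here are hypothetical; `testing-and-tuning/sample_conf.json` is the authoritative example:

```json
{
  "date_format": "%m/%d/%Y",
  "translation_map": {
    "given_name": "first_name",
    "family_name": "last_name",
    "DOB": "birth_date",
    "sex": "gender",
    "phone": "phone_number",
    "address": "street_address",
    "zip": "zip_code",
    "default_values": {
      "phone": "000-000-0000"
    }
  },
  "value_mapping_rules": {
    "sex": { "Male": "M", "Female": "F" }
  },
  "initial_id": 1
}
```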
A data characterization script is provided to assist in identifying data anomalies or quality issues. This script can be run against the `pii-TIMESTAMP.csv` generated by `extract.py`, or directly against the database used by `extract.py`.

It is recommended that `data_analysis.py` at least be run against the generated `pii-TIMESTAMP.csv` file to help ensure that extraction completed successfully.
```
python data_analysis.py --csv temp-data/pii-TIMESTAMP.csv
python data_analysis.py --db postgresql://username:password@host:port/database
```
`data_analysis.py` produces two JSON text files containing summary statistics including: the total number of records, the number of individuals of each gender, zip code and phone number formats, the most common zip codes, and address length at various percentiles, among others. Any aberrant results should be investigated and rectified within the data set before proceeding.
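To show what statistics of this kind involve, a few of them can be computed in a handful of lines. This is a hypothetical sketch over the extracted column names seen in the verbose report, not the actual output format of `data_analysis.py`:

```python
from collections import Counter
from statistics import quantiles

def summarize(records):
    """Compute a few summary statistics of the kind data_analysis.py reports."""
    zips = Counter(r["household_zip"] for r in records)
    lengths = [len(r["household_street_address"]) for r in records]
    return {
        "total_records": len(records),
        "most_common_zips": zips.most_common(3),
        # Quartile cut points of address length
        "address_length_quartiles": quantiles(lengths, n=4),
    }

records = [
    {"household_zip": "30301", "household_street_address": "12 OAK ST"},
    {"household_zip": "30301", "household_street_address": "4 ELM AVENUE"},
    {"household_zip": "02134", "household_street_address": "99 LONG MEADOW DRIVE"},
]
print(summarize(records)["most_common_zips"])  # [('30301', 2), ('02134', 1)]
```

Skewed results, such as one implausible zip code dominating the counts or unusually short addresses at the low percentiles, are the kinds of anomalies worth investigating before garbling.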
Notes:

- The DB-related command-line arguments from `extract.py` listed above, such as specifying a different schema, are also available here. Run `python data_analysis.py -h` to see the full list of available options.
- The `data_analysis.py` script does not currently support CSV translation the same way the `extract.py` script does. If you are using the CSV translation option for extract, then you can only run `data_analysis.py` against the extracted pii.csv file.