Data Extraction, Validation, and Cleaning
The CODI PPRL process depends on information pulled from a database or translated from a `.csv` file structured to match the CODI Data Model. `extract.py` either connects to a database and extracts information, or reads from a provided `.csv` file, cleaning and validating it to prepare it for the PPRL process. The script will output a `temp-data/pii-TIMESTAMP.csv` file that contains the PII ready for garbling.
To extract from a database, `extract.py` requires a database connection string. Consult the SQLAlchemy documentation to determine the exact string for the database in use. By default, `extract.py` expects a schema named `cdm` containing the CODI Data Model DEMOGRAPHIC and PRIVATE_DEMOGRAPHIC tables. The `--schema_name` option can be used to provide the name of the schema containing these tables in the source system, if it is other than `cdm`.
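For example, extracting from the PostgreSQL instance used in the example run below, with the CODI tables in a schema other than `cdm`, might look like the following (the schema name `my_cdm` is a placeholder, not a requirement):

```sh
python extract.py postgresql://codi:codi@localhost/codi --schema_name my_cdm
```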
To translate from a `.csv` file, `extract.py` requires a `.json` configuration file, the path to which must be specified with the `--csv_config` flag. The requirements of this configuration file are described below.
When finished, if you specify the `--verbose` flag, the script will print a report to the terminal, documenting various issues it found when extracting the data. An example execution of the script is included below:
```
$ python extract.py postgresql://codi:codi@localhost/codi -v
Total records exported: 5476
record_id
--------------------
given_name
--------------------
Contains Non-ASCII Characters: 226
Contains Non-printable Characters: 3
family_name
--------------------
Contains Non-ASCII Characters: 2
DOB
--------------------
sex
--------------------
phone_number
--------------------
household_street_address
--------------------
Contains Non-ASCII Characters: 1
Contains Non-printable Characters: 1
household_zip
--------------------
NULL Value: 9
```
The configuration file used to extract and translate data for PPRL from a `.csv` file must be a `.json` file, the path to which is specified with the `--csv_config` flag:

```sh
python extract.py my_data.csv --csv_config my_config.json
```
The `.json` file must contain the following fields:

- A "date_format" field, which specifies the string representation of dates within the `.csv` file for DOB extraction. The string must conform to the 1989 C Date format standard; see Python's datetime documentation for the most relevant information.
- A "translation_map" field, which contains an object mapping the standardized column names used in the PPRL process to the column names within the `.csv` file. The following columns require a mapping or a default value:
  - "given_name"
  - "family_name"
  - "DOB"
  - "sex"
  - "phone"
  - "address"
  - "zip"

  Default values are provided as an object within the "translation_map" under the field "default_values". A default value is used to populate an entire column that is not present within the `.csv` file, or to fill in a given record that has no value for that field.
The configuration file may also optionally contain the following fields:

- A "value_mapping_rules" field, which contains a dictionary mapping values found within a given column to the values expected by the PPRL process (e.g. mapping "Male" and "Female" within the "sex" column to "M" and "F", respectively).
- An "initial_id" field, which gives an integer to be used as the ID number for each record if a suitable column is not present in the `.csv` file or if a value is not present for that field in a given record. When used, IDs are generated sequentially, counting up from the provided "initial_id".
See `testing-and-tuning/sample_conf.json` for an example of the configuration file.
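For illustration only, a minimal configuration combining the fields described above might look like the sketch below. The source column names on the right-hand side, the default phone value, and the exact nesting under "value_mapping_rules" are assumptions for a hypothetical input file, not a copy of `sample_conf.json`:

```json
{
  "date_format": "%m/%d/%Y",
  "translation_map": {
    "given_name": "FIRST_NAME",
    "family_name": "LAST_NAME",
    "DOB": "BIRTH_DATE",
    "sex": "GENDER",
    "phone": "PHONE",
    "address": "STREET_ADDRESS",
    "zip": "ZIP_CODE",
    "default_values": {
      "phone": "000-000-0000"
    }
  },
  "value_mapping_rules": {
    "sex": {
      "Male": "M",
      "Female": "F"
    }
  },
  "initial_id": 1
}
```

Consult `sample_conf.json` for the authoritative layout before relying on this structure.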
A data characterization script is provided to assist in identifying data anomalies or quality issues. This script can be run against the `pii-TIMESTAMP.csv` file generated by `extract.py`, or directly against the database used by `extract.py`.
It is recommended that `data_analysis.py` at least be run against the generated `pii-TIMESTAMP.csv` file to help ensure that extraction completed successfully:
```sh
python data_analysis.py --csv temp-data/pii-TIMESTAMP.csv
python data_analysis.py --db postgresql://username:password@host:port/database
```
`data_analysis.py` produces two JSON text files containing summary statistics, including: the total number of records, the number of individuals of each gender, zip code and phone number formats, the most common zip codes, and address length at various percentiles, among others. Any aberrant results should be investigated and rectified within the data set before proceeding.
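Independent of `data_analysis.py`, a quick spot check of the extracted file can surface the same categories of issues that `extract.py` reports (NULL values, non-ASCII characters, non-printable characters). The sketch below is illustrative only and is not part of the provided tooling; it assumes the column layout of the `pii-TIMESTAMP.csv` file shown in the verbose report above:

```python
import csv

# Illustrative spot check (not part of the CODI tooling): count empty,
# non-ASCII, and non-printable values per column in the extracted PII file.
PII_FILE = "temp-data/pii-TIMESTAMP.csv"  # substitute the actual timestamped file name

counts = {}
with open(PII_FILE, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for column, value in row.items():
            stats = counts.setdefault(column, {"empty": 0, "non_ascii": 0, "non_printable": 0})
            if value is None or value.strip() == "":
                stats["empty"] += 1
                continue
            if not value.isascii():
                stats["non_ascii"] += 1
            if not value.isprintable():
                stats["non_printable"] += 1

for column, stats in counts.items():
    print(column, stats)
```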