Skip to content

Data Extraction, Validation, and Cleaning

Dylan Hall edited this page Dec 21, 2022 · 6 revisions

Extract PII

The CODI PPRL process depends on information pulled from a database or translated from a .csv file structured to match the CODI Data Model. extract.py either connects to a database and extracts information, or reads from a provided .csv file, cleaning and validating it to prepare it for the PPRL process. The script will output a temp-data/pii-TIMESTAMP.csv file that contains the PII ready for garbling.

To extract from a database, extract.py requires a database connection string to connect. Consult the SQLAlchemy documentation to determine the exact string for the database in use.

By default, extract.py expects a schema named cdm containing the CODI Data Model DEMOGRAPHIC and PRIVATE_DEMOGRAPHIC tables. The --schema_name option can be used to provide the name of schema containing these tables in the source system if other than cdm.

To translate from a .csv file, extract.py requires a .json configuration file, the path to which must be specified with the --csv_config flag. The requirements of this configuration file are described below

When finished, if you specify the --verbose flag, the script will print a report to the terminal, documenting various issues it found when extracting the data. An example execution of the script is included below:

$ python extract.py postgresql://codi:codi@localhost/codi -v
Total records exported: 5476

record_id
--------------------

given_name
--------------------
Contains Non-ASCII Characters: 226
Contains Non-printable Characters: 3

family_name
--------------------
Contains Non-ASCII Characters: 2

DOB
--------------------

sex
--------------------

phone_number
--------------------

household_street_address
--------------------
Contains Non-ASCII Characters: 1
Contains Non-printable Characters: 1

household_zip
--------------------
NULL Value: 9

CSV Translation Configuration File

The configuration file used to extract and translate data for PPRL from a .csv file must be a .json file, the path to which is specified with the --csv_config flag.

python extract.py my_data.csv --csv_config my_config.json

The .json file must contain the following fields:

  • A "date_format" field which specifies the string date representation of dates within the .csv for DOB extraction. The string must conform to the 1989 C Date format standard. See Python's datetime documentation for the most relevant information.
  • A "translation_map" field that contains an object mapping standardized column names used in the PPRL process to the column names within the .csv file. The following columns require a mapping or a default value:
    • "given_name"
    • "family_name"
    • "DOB"
    • "sex"
    • "phone"
    • "address"
    • "zip" Default values are provided as an object within the "tranlsation_map" under the field "default_values". It both provides a single default value to assign to an entire column if not present within the .csv or if a given record has no value for that field

The configuration file may also optionally contain the following fields:

  • A "value_mapping_rules" field which contains an dictionary which maps values found within a given column to values expected for the PPRL process (e.g. mapping "Male" and "Female" within the "sex" column to "M" and "F", respectively)
  • An "initial_id" field which gives an integer to be used as the ID number for each of the records if a suitable column is not present in the .csv file or if a value is not present for that field in a given record. IDs are generated sequentially counting up from the provided "initial_id" if used.

See testing-and-tuning/sample_conf.json for an example of the configuration file.

Data Quality and Characterization

A data characterization script is provided to assist in identifying data anomalies or quality issues. This script can be run against the pii-TIMESTAMP.csv generated by extract.py or directly against the database used by extract.py. It is recommended that data_analyis.py at least be run against the generated pii-TIMESTAMP.csv file to help ensure that extraction succeeded successfully.

python data_analysis.py --csv temp-data/pii-TIMESTAMP.csv

python data_analysis.py --db postgresql://username:password@host:port/database

data_analysis.py produces two json text files containing summary statistics including: total number of records, number of individuals of each gender, zip code and phone number formats, most common zip codes, and address length at various percentiles among others. Any aberrant results should be investigated rectified within the data set before proceeding.

Clone this wiki locally