
Garbling PII

Dylan Hall edited this page Apr 19, 2023 · 4 revisions

anonlink garbles personally identifiable information (PII) in a way that allows it to be used for linkage later on. The CODI PPRL process garbles information in a number of different ways. The garble.py script manages executing anonlink multiple times and packages the information for transmission to the linkage agent.

garble.py accepts the following positional inputs:

  1. (optional) The location of a CSV file containing the PII to garble. If not provided, the script will look for the newest pii-TIMESTAMP.csv file in the temp-data directory.
  2. (required) The location of a directory of anonlink linkage schema files
  3. (required) The location of a secret file to use in the garbling process - this should be a text file containing a single hexadecimal string of at least 128 bits (32 characters)
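
The secret file requirement above (a single hexadecimal string of at least 128 bits, i.e. 32 characters) can be met with Python's standard secrets module. The following is a minimal sketch, not part of the repository: the helper names generate_secret and is_valid_secret are hypothetical, and the validity check only mirrors the requirement as stated on this page.

```python
import secrets
import string


def generate_secret(num_bytes: int = 16) -> str:
    """Return a random hexadecimal string; 16 bytes = 128 bits = 32 hex characters."""
    return secrets.token_hex(num_bytes)


def is_valid_secret(text: str, min_bits: int = 128) -> bool:
    """Check that text is a single hex string carrying at least min_bits bits."""
    candidate = text.strip()
    if not candidate:
        return False
    all_hex = all(ch in string.hexdigits for ch in candidate)
    # Each hexadecimal character encodes 4 bits.
    return all_hex and len(candidate) * 4 >= min_bits
```

The string returned by generate_secret() can be saved to a plain text file (for example the deidentification_secret.txt used in the examples on this page) and passed to garble.py as the secret file.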

The anonlink schema files specify the fields that will be used in the hashing process and assign weights to those fields. The example-schema directory contains a set of example schemas that can be used to test the tools.

garble.py, and all other scripts in the repository, will provide usage information with the -h flag:

$ python garble.py -h
usage: garble.py [-h] [-z OUTPUTZIP] [-o OUTPUTDIR] [sourcefile] schemadir secretfile

Tool for garbling PII in for PPRL purposes in the CODI project

positional arguments:
  sourcefile            Source pii-TIMESTAMP.csv file
  schemadir             Directory of linkage schema
  secretfile            Location of de-identification secret file

optional arguments:
  -h, --help            show this help message and exit
  -z OUTPUTZIP, --outputzip OUTPUTZIP
                        Specify an name for the .zip file. Default is garbled.zip
  -o OUTPUTDIR, --outputdir OUTPUTDIR
                        Specify an output directory. Default is output/

By default, garble.py packages the garbled PII files into a zip file called garbled.zip and places it in the output/ folder. You can change the zip file name with the -z/--outputzip flag and the output directory with the -o/--outputdir flag.
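
Before sending the archive to the linkage agent, it can be useful to confirm which files it contains. The following is a minimal sketch using Python's standard zipfile module. The helper name list_archive is hypothetical and not part of the repository, and the default path assumes the default output location and zip name.

```python
import zipfile
from typing import List


def list_archive(zip_path: str = "output/garbled.zip") -> List[str]:
    """Return the sorted names of the files stored in the zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())
```

Calling list_archive() after a garble.py run returns the names of the archived files. The exact set of names depends on the schema directory used.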

Two example executions of garble.py are shown below, first with the PII CSV specified via positional argument:

$ python garble.py temp-data/pii-TIMESTAMP.csv example-schema ../deidentification_secret.txt
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip

And second without the PII CSV specified as a positional argument:

$ python garble.py example-schema ../deidentification_secret.txt
PII Source: temp-data/pii-TIMESTAMP.csv
CLK data written to output/name-sex-dob-phone.json
CLK data written to output/name-sex-dob-zip.json
CLK data written to output/name-sex-dob-addr.json
Zip file created at: output/garbled.zip
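
When no source file is given, the script looks for the newest pii-TIMESTAMP.csv file in temp-data. The sketch below shows one way such a lookup could work. It is an illustration only: the function name newest_pii_file is hypothetical, and it assumes "newest" means the most recent file modification time, which may differ from how garble.py actually decides.

```python
from pathlib import Path
from typing import Optional


def newest_pii_file(directory: str = "temp-data") -> Optional[Path]:
    """Return the most recently modified pii-*.csv in directory, or None."""
    candidates = list(Path(directory).glob("pii-*.csv"))
    if not candidates:
        return None
    # Assumption: "newest" is judged by file modification time,
    # not by parsing the timestamp embedded in the file name.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```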

Household Extract and Garble

You may now run households.py. Its arguments are similar to those of garble.py; the biggest difference is that a specific schema file must be given with the --schemafile keyword rather than positionally. If --schemafile is not used to specify a specific schema, the script defaults to example-schema/household-schema/fn-phone-addr-zip.json (use the -h flag for more information). NOTE: If you want to generate the testing and tuning files for development on a synthetic dataset, you need to specify the -t or --testrun flag.

It is also recommended to use the --debug flag when running this process on large datasets as the process can be very slow and memory-intensive. The --debug flag will print out detailed logs at each step so you can be sure the process is running, and if it runs out of memory the log will provide some notes that can assist in developing a fix.

Finally, the households script has a --split_factor # option which allows for breaking up the pii file into chunks to reduce memory requirements. (In a perfect world the script would calculate the optimal split factor based on size of data and amount of available system memory, but for now the feature requires some manual intervention.) The default if not specified is 4. This flag should only be used if the default value crashes with a memory error. (The log may say simply "Killed" on some operating systems.) Note that the split process does create some overhead in terms of time, so the goal should be to pick a number as small as possible that allows the process to complete.
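
To illustrate why a larger split factor lowers peak memory use, the sketch below divides a list of records into at most split_factor chunks so that each chunk can be processed on its own. This is not the households.py implementation. The function name split_into_chunks is hypothetical and the chunking strategy (contiguous, near-equal slices) is an assumption made for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(rows: Sequence[T], split_factor: int = 4) -> List[List[T]]:
    """Split rows into at most split_factor contiguous chunks of near-equal size."""
    if split_factor < 1:
        raise ValueError("split_factor must be at least 1")
    # Ceiling division: the size of the largest chunk.
    chunk_size = -(-len(rows) // split_factor)
    if chunk_size == 0:
        return []  # no rows, so no chunks
    return [list(rows[i:i + chunk_size]) for i in range(0, len(rows), chunk_size)]
```

With ten records and a split factor of 4, the largest chunk holds three records, so only a fraction of the data has to be in memory at once. Each additional chunk adds some fixed overhead, which is why the smallest split factor that completes is preferable.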

The households script will do the following:

  1. Attempt to group individuals into households and store those records in a CSV file in temp-data
  2. Create a mapping file to be sent to the linkage agent, along with a zip file of household-specific garbled information.

This information must be provided to the linkage agent if you would like to get a household linkages table as well.

Example run with PII CSV specified:

$ python households.py temp-data/pii-TIMESTAMP.csv ../deidentification_secret.txt
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip

and without PII CSV specified:

$ python households.py ../deidentification_secret.txt
PII Source: temp-data/pii-TIMESTAMP.csv
CLK data written to output/households/fn-phone-addr-zip.json
Zip file created at: output/garbled_households.zip