Tools for the Childhood Obesity Data Initiative (CODI) Linkage Agent to use to accept garbled input from data owners / partners, perform matching, and generate network IDs. These can also be thought of as Semi-Trusted Third Party (STTP) tools.
These tools facilitate a Privacy Preserving Record Linkage (PPRL) process. They build on the open source anonlink software package.
The primary dependency of these tools is the anonlink-entity-service. This software package provides a web service for accessing anonlink's matching capabilities, and it must be installed for the Linkage Agent Tools to work. Install instructions can be found on the anonlink-entity-service Deployment page.
After following the install instructions in the anonlink-entity-service documentation, you can confirm that the entity service is up and running correctly by checking that the API call to /api/v1/status responds as described in the example at https://anonlink-entity-service.readthedocs.io/en/stable/local-deployment.html.
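The status check can also be scripted. The following is a minimal sketch, not part of the Linkage Agent Tools; the helper names `status_url` and `fetch_status` are hypothetical. It uses only the Python standard library so that it can be run before the project dependencies are installed, and it assumes the status endpoint returns a JSON body. It does not interpret the response; compare the printed result with the example in the anonlink-entity-service documentation.

```python
import json
from urllib.request import urlopen


def status_url(entity_service_url):
    """Build the status endpoint URL from the service's base API URL."""
    return entity_service_url.rstrip("/") + "/status"


def fetch_status(entity_service_url, opener=urlopen, timeout=10):
    """Return the decoded JSON body of the status endpoint.

    Raises an exception (for example urllib.error.URLError) if the
    service cannot be reached.
    """
    with opener(status_url(entity_service_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires a running entity service; the base URL below is
# the entity_service_url value from this document's example config.json):
# print(fetch_status("http://localhost:8851/api/v1"))
```

The `opener` parameter exists only so the function can be exercised without a live service.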
Linkage Agent Tools uses MongoDB to store results obtained from the anonlink-entity-service. Install MongoDB by downloading the community version.
Linkage Agent Tools is a set of scripts designed to interact with the previously mentioned anonlink-entity-service. They were created and tested on Python 3.7.4. The tools rely on two libraries: Requests and pymongo.
Requests is a library that makes HTTP requests. This is used for the tools to communicate with the web service offered by the anonlink-entity-service.
pymongo is a Python client library for MongoDB.
Linkage Agent Tools contains a test suite, which was created using pytest.
Clone the project locally as a Git repository:
git clone https://github.com/mitre/linkage-agent-tools.git
Or download as a zip file:
- Click this link to download the project as a zip or use the "Clone or download" button on GitHub.
- Unzip the file.
It can be helpful to set up a virtual environment to isolate project dependencies from system dependencies. There are a few libraries that can do this, but this documentation will stick with venv since that is included in the Python Standard Library.
# Navigate to the project folder
cd linkage-agent-tools/
# Create a virtual environment in a `venv/` folder
python -m venv venv/
# Activate the virtual environment
source venv/bin/activate
pip install -r requirements.txt
pip install -e .
The second invocation of pip installs the project in editable mode using setup.py, so that your local Python installation can find the included modules easily.
- Install Anaconda by following the install instructions.
- Depending on user account permissions, Anaconda may not install the latest version or may not be available to all users. If that is the case, try running:
conda update -n base -c defaults conda
- Download the tools as a zip file using the "Clone or download" button on GitHub.
- Unzip the file.
- Open an Anaconda Powershell Prompt
- Go to the unzipped directory
- Run the following commands:
conda create --name codi
conda activate codi
conda install pip
pip install -r requirements.txt
pip install -e .
Linkage Agent Tools is driven by a central configuration file, which is a JSON document saved as config.json. An example is shown below:
{
"systems": ["site_a", "site_b", "site_c", "site_d", "site_e", "site_f"],
"projects": ["name-sex-dob-phone", "name-sex-dob-zip",
"name-sex-dob-parents", "name-sex-dob-addr"],
"schema_folder": "/CODI/data-owner-tools/example-schema",
"inbox_folder": "/CODI/inbox",
"matching_results_folder": "/CODI/results",
"project_results_folder": "/CODI/project_results",
"output_folder": "/CODI/output",
"entity_service_url": "http://localhost:8851/api/v1",
"matching_threshold": 0.75,
"mongo_uri": "localhost:27017",
"blocked": false,
"blocking_schema": "/CODI/data-owner-tools/example-schema/blocking-schema/lambda.json",
"household_match": true,
"household_schema": "/CODI/data-owner-tools/example-schema/household-schema/fn-phone-addr-zip.json"
}
A description of the properties in the file:
- systems - The set of data owners in this matching effort. These are short names for the participants. When data owners send zip files, it is expected that they will have the format of "data owner name".zip.
- projects - The anonlink linkage projects that are going to be used in this matching effort. It assumes that the project names will have a corresponding anonlink schema file in the schema folder.
- schema_folder - A folder containing anonlink schema files. The schema files should be named "project name".json.
- inbox_folder - The folder where zip files received from data owners should be placed.
- matching_results_folder - Folder where the CSV containing the complete mapping of LINK_IDs to all data owners is stored.
- project_results_folder - Folder where results from projects run with anonlink-entity-service are stored.
- output_folder - Folder where CSV files are generated, one per data owner. These files contain LINK_IDs mapped to a single data owner.
- entity_service_url - The RESTful service endpoint for the anonlink-entity-service.
- matching_threshold - The threshold for considering a potential set of records a match when comparing in anonlink. This can either be a single number between 0 and 1 or a list of numbers between 0 and 1.
- mongo_uri - The URI to use when connecting to MongoDB to store or access results. For details on the URI structure, consult the Connection String URI Format documentation.
- blocked - A boolean value indicating whether the CLKs from the data owner in the inbox folder were generated via blocking.
- blocking_schema - The optional path to the file used by data owner tools for blocking.
- household_match - A boolean true / false value for running the household PPRL and matching options. The matching process can only be run in individual or household mode; if this value is true, matching will only be performed on household data provided by the data owners.
- household_schema - The path to the file used during household PPRL.
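Before running any of the scripts, it can be useful to sanity-check config.json against the property descriptions above. The sketch below is illustrative only and is not how the tools themselves read the file; the names `REQUIRED_KEYS`, `config_problems`, and `load_config` are hypothetical. Treating every property other than blocking_schema and household_schema as required is an assumption.

```python
import json

# Assumption: these properties are always needed. blocking_schema is
# described as optional, and household_schema is only checked when
# household_match is true.
REQUIRED_KEYS = [
    "systems", "projects", "schema_folder", "inbox_folder",
    "matching_results_folder", "project_results_folder", "output_folder",
    "entity_service_url", "matching_threshold", "mongo_uri",
    "blocked", "household_match",
]


def threshold_values(matching_threshold):
    """Normalise the setting (single number or list) to a list."""
    if isinstance(matching_threshold, list):
        return matching_threshold
    return [matching_threshold]


def config_problems(config):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing property: {key}")
    for value in threshold_values(config.get("matching_threshold", [])):
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 1:
            problems.append(f"matching_threshold out of range: {value!r}")
    if config.get("household_match") and "household_schema" not in config:
        problems.append("household_match is true but household_schema is not set")
    return problems


def load_config(path="config.json"):
    """Read the configuration file as a dictionary."""
    with open(path) as config_file:
        return json.load(config_file)


# Example call (requires a config.json in the working directory):
# print(config_problems(load_config()))
```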
Once you specify the paths outlined in the configuration section above, you need to put the zip files from each data owner into the inbox_folder specified. There should be one file, for either individuals or households, from each of the systems (aka data owners), for example [site_a, site_b, site_c]. Below is an example for individuals, corresponding to a configuration setting of false for household_match:
inbox/
    site_a.zip
    site_b.zip
    site_c.zip
    ...
Note that these file names exactly match the systems list in the configuration, with .zip at the end. This is required.
And an example for households, with household_match set to true:
inbox/
    site_a_households.zip
    site_b_households.zip
    site_c_households.zip
    ...
Note that the household file names in this example also start with system names from the systems configuration value, and end with _households.zip; this is also required.
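The naming rules above can be expressed as a short check. This is a sketch of the convention only, not the implementation of validate.py; the function names are hypothetical. It covers the individual and household archives described here and nothing else.

```python
from pathlib import Path


def expected_inbox_files(systems, household_match):
    """Return the zip file names the inbox should contain, one per system."""
    suffix = "_households.zip" if household_match else ".zip"
    return [f"{system}{suffix}" for system in systems]


def missing_inbox_files(inbox_folder, systems, household_match):
    """Return the expected file names that are not present in the inbox."""
    inbox = Path(inbox_folder)
    return [
        name
        for name in expected_inbox_files(systems, household_match)
        if not (inbox / name).is_file()
    ]
```

For example, `expected_inbox_files(["site_a", "site_b"], True)` returns `["site_a_households.zip", "site_b_households.zip"]`.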
After running the scripts in the order specified in the repository structure section below, the project will produce the following files in the output_folder specified in the config. The first example would be the output for individuals:
output/
    site_a.zip
    site_b.zip
    site_c.zip
    ...
And the second example, for households:
output/
    site_a_households.zip
    site_b_households.zip
    site_c_households.zip
    ...
This project is a set of Python scripts driven by a central configuration file, config.json. It is expected to operate in the following order:
- Data owners transmit their garbled zip files to the Linkage Agent. These zip files should be placed into the configured inbox folder.
- Update config.json to enable or disable household_match, depending on the type of files received from data owners.
- Run validate.py, which will ensure all of the necessary files are present.
- Run drop.py if you have done a previous matching run, to clear old data in the database; this will drop all data for individuals and households, whether household_match is true or false.
- When all data is present, run projects.py to run the projects with the anonlink-entity-service in preparation for matching. Results will be stored in the project_results_folder.
- Run match.py, which will perform pairwise matching of the garbled information sent by the data owners for either individuals or households, depending on the value of household_match. The matching information will be stored in MongoDB.
- After matching, run link_ids.py, which will take all of the resulting matching information and use it to generate LINK_IDs, which are written to a CSV file in the configured results folder.
- Once all LINK_IDs have been created, run data_owner_ids.py, which will create one ZIP file per data owner. That file will contain a metadata file and a CSV file with only information on their LINK_IDs.
projects.py, match.py, and link_ids.py will also generate JSON metadata files that contain information about the corresponding process.
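The order of operations above can be captured in a small driver. This is a hedged sketch rather than something shipped with the project; `pipeline_scripts` and `run_pipeline` are hypothetical names. It assumes each script is run from the repository root with no command-line arguments and picks up config.json from there. Check each script's own usage before relying on that.

```python
import subprocess
import sys


def pipeline_scripts(clear_previous_run=False):
    """Return the script names in the order they are expected to run."""
    scripts = ["validate.py"]
    if clear_previous_run:
        # drop.py removes individual and household data from a previous run
        scripts.append("drop.py")
    scripts += ["projects.py", "match.py", "link_ids.py", "data_owner_ids.py"]
    return scripts


def run_pipeline(clear_previous_run=False, runner=subprocess.run):
    """Run each script in order with the current Python interpreter."""
    for script in pipeline_scripts(clear_previous_run):
        # check=True stops the sequence at the first script that fails
        runner([sys.executable, script], check=True)


# Example call (assumes the working directory is the repository root):
# run_pipeline(clear_previous_run=True)
```

Stopping at the first failure matters here because each step consumes the results of the one before it.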
The schema_folder in the example below is using the example config paths from above, with household_match set to true. The schemas used by the data owner during garbling of the data need to be the same schemas pointed to in the linkage-agent config.json.
/CODI/
    linkage-agent-tools/
        ...
    inbox/
        site_a_households.zip
        site_a_block.zip
        site_b_households.zip
        site_b_block.zip
    output/
        site_a_households.csv
        site_b_households.csv
    data-owner-tools/
        ...
        example-schema/
            name-dob-ex.json
            name-phone-ex.json
            blocking-schema/
                lambda.json
            household-schema/
                fn-phone-addr-zip.json
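Because each project name is expected to have a schema file named "project name".json in the schema folder, a quick check for missing schema files can catch a mismatch between the two configurations early. The following is a minimal sketch under that naming rule; the function names are hypothetical and the check is not part of the tools.

```python
from pathlib import Path


def expected_schema_paths(schema_folder, projects):
    """Return the schema file path expected for each project name."""
    return [Path(schema_folder) / f"{project}.json" for project in projects]


def missing_schema_files(schema_folder, projects):
    """Return the names of expected schema files that do not exist."""
    return [
        path.name
        for path in expected_schema_paths(schema_folder, projects)
        if not path.is_file()
    ]
```

With the layout shown above, `missing_schema_files("/CODI/data-owner-tools/example-schema", ["name-dob-ex", "name-phone-ex"])` would be expected to return an empty list.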
Linkage Agent Tools contains a unit test suite. Tests can be run with the following command:
python -m pytest
This repository uses black, flake8, and isort to maintain consistent formatting and style. These tools can be run with the following commands:
black .
isort .
flake8 .
The Linkage and Blocking Tuning Tool Jupyter notebook is a work in progress meant for testing and tuning different configurations against the synthetic data set, with the Data Owner Tools and Linkage Agent Tools projects running on the same machine. It currently runs all of the scripts necessary for end-to-end testing of the entire PPRL process, but it is still being improved and will include more documentation when finalized.
Copyright 2020-2022 The MITRE Corporation.
Approved for Public Release; Distribution Unlimited. Case Number 19-2008