
PPRL Overview

Dylan Hall edited this page Dec 21, 2022 · 1 revision

Background

If you are new to CODI, you may find the following resources helpful in understanding the Data Owner Tools, what they do, and how they fit into the CODI data model.

CODI PPRL Implementation Guide (pdf) – Explains how to implement the CODI PPRL process, including roles and responsibilities for the Linkage Agent, Data Coordinating Center, Key Escrow, and Data Partner, in addition to the Data Owner information that is also covered in this Readme. Also provides a comprehensive introduction to PPRL and explains in more detail how the Data Owner Tools work.

CODI Data Model Implementation Guide (pdf) – Explains the tables and data elements used by the CODI Data Model, including the Record Linkage Data Model tables that are necessary to perform PPRL.

CODI@NC Front Door Webpage – Provides basic background information for CODI and triages collaboration requests.

PPRL Introduction

The process of matching records across organizations, in the absence of a shared, unique identifier, often requires those organizations to exchange information with each other or a third party to participate in a matching process. Matching occurs by comparing shared PII to see if there are similarities in demographic attributes such as name, sex, date of birth, or address.

Although this approach to matching works, it has its drawbacks. First, there is always increased risk of privacy breaches when PII is shared outside organizations’ firewalls. Second, this approach does not scale well: while a small number of partners may agree to share information with each other, it is unlikely that large numbers of organizations would be willing to exchange PII nationally, outside of a national mandate. It is similarly unlikely that consolidating PII using a nationwide third-party matcher would be appealing. In order to conduct matching at scale, there must be an approach that does not involve exchanging PII beyond organizational boundaries.

PPRL is an alternative set of techniques to solve the issue of identity matching without exchanging PII directly. The basis for this class of solutions is that the PII is obfuscated, or garbled, prior to transmission beyond an organizational boundary for matching. The garbling of information takes place through a series of prescribed steps that makes it nearly impossible for an outside party to recover the PII, but still allows for the establishment of links across organizations.
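As a rough illustration of garbling, the sketch below normalizes a record and passes it through a keyed hash. This is a stand-in technique chosen for brevity, not a description of the CODI Data Owner Tools: the function names `normalize` and `garble`, the field names, and the use of HMAC-SHA256 are all assumptions. A plain keyed hash like this only links records whose normalized values are identical.

```python
import hashlib
import hmac


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences vanish."""
    return " ".join(value.lower().split())


def garble(record: dict, salt: bytes) -> str:
    """Return a hex digest of the record's fields, keyed with the secret salt."""
    # Sort the keys so every organization hashes the fields in the same order.
    message = "|".join(normalize(record[key]) for key in sorted(record))
    return hmac.new(salt, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical salt and records, for illustration only.
salt = b"example-salt-not-for-real-use"
a = garble({"name": "Jane Doe", "date_of_birth": "2010-04-17"}, salt)
b = garble({"name": "  JANE   doe ", "date_of_birth": "2010-04-17"}, salt)
print(a == b)  # True: same person, same salt, same garbled value
```

The digest can be compared for equality by an outside party, but without the salt that party cannot recompute it from guessed PII.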

PPRL solutions allow for “blind” matching. In this case, the third party is provided access to garbled data, but is unable to view PII. The third party then compares the garbled information to establish linkages. The image below illustrates this process.

pprl_example

The third party conducting the matching assigns an identifier when a linkage is found and communicates the identifier back to the participating organizations for use in establishing longitudinal records.
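The third party's side of blind matching can be sketched as follows. The sketch assumes each organization submits opaque garbled strings keyed by a local row identifier, and that two records link only when their garbled values are equal. The function name `assign_link_ids`, the input shape, and the use of random UUIDs as identifiers are hypothetical.

```python
import uuid


def assign_link_ids(submissions: dict) -> dict:
    """Give every distinct garbled value one identifier, shared across organizations.

    `submissions` has the shape {organization: {row_id: garbled_value}}.
    The result has the shape {organization: {row_id: identifier}}.
    """
    id_for_value = {}
    result = {}
    for org, rows in submissions.items():
        result[org] = {}
        for row_id, garbled in rows.items():
            # The first time a garbled value is seen it gets a fresh identifier;
            # later occurrences, from any organization, reuse that identifier.
            if garbled not in id_for_value:
                id_for_value[garbled] = str(uuid.uuid4())
            result[org][row_id] = id_for_value[garbled]
    return result


# The matcher sees only opaque strings, never names or birth dates.
ids = assign_link_ids({
    "clinic": {1: "9f2c", 2: "41ab"},
    "food_bank": {7: "9f2c"},
})
print(ids["clinic"][1] == ids["food_bank"][7])  # True: the same garbled value links
```

Each organization receives only its own portion of the result, keyed by the row identifiers it submitted.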

With PPRL, the third-party matching organization is not a large warehouse of PII, but instead is working with garbled, de-identified data. CODI uses PPRL to establish linkages across organizations without sharing PII.

CODI Roles

CODI uses the following terminology to denote roles within the PPRL process:

A data owner is an organization that has data to contribute for queries. This could be a clinical care provider, a community organization, or a government benefits provider. A data partner is an organization that participates in the distributed network by hosting data and/or performing the PPRL process on behalf of a data owner.

This Readme explains how to set up and run the software tools that a data owner or data partner uses to perform PPRL.

A linkage agent is an organization that performs linkage on behalf of data owners. The linkage agent receives de-identified PII and produces globally unique identifiers used to construct longitudinal records. Ultimately, longitudinal records will be assembled by an organization in the Data Coordinating Center (DCC) role. The DCC distributes queries to data owners, receives their responses, and conducts any analyses needed to meet researchers’ requests.

A key escrow is an organization responsible for generating an encryption secret, called a “salt,” that is used in the de-identification process. The key escrow provides the salt value to data owners through a secure channel so that the value stays secret.
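A minimal sketch of salt generation, assuming the key escrow uses Python's standard `secrets` module. The function name `generate_salt` and the 32-byte length are illustrative assumptions, not values taken from the CODI tools.

```python
import secrets


def generate_salt(num_bytes: int = 32) -> str:
    """Return a random hex string drawn from the OS cryptographic random source."""
    return secrets.token_hex(num_bytes)


salt = generate_salt()
print(len(salt))  # 64: two hex characters per random byte
```

The `secrets` module is preferred over `random` here because `random` is not designed for security-sensitive values.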

PPRL Process Flow

PPRL is the process of matching individuals and households based on de-identified information. Matched records are assigned a globally unique identifier, which can be used to link those records across organizations.

The matching process typically involves the following steps:

  1. A linkage agent shares configuration information with the data owners. The key escrow provides a secret “salt” value to the data owners. The salt value will be the same for all data owners.

  2. Each data owner creates a de-identified data set of individuals by:

    • Extracting PII from its operational database.

    • Passing the PII and salt value through a hashing process that will garble the information.

    • Sharing the garbled data with the linkage agent.

  3. The linkage agent develops individual LINKIDs by:

    • Determining which de-identified values correspond to the same individual.

    • Establishing a unique LINKID for each individual.

  4. The linkage agent shares the LINKIDs with each data owner.

  5. Steps 2-4 are repeated for households, generating HOUSEHOLDIDs.
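Steps 2 through 4 above form a round trip: garbled data goes out and LINKIDs come back. The steps do not say how the returned LINKIDs are keyed, so the sketch below assumes a local row identifier travels with each garbled record and is echoed back by the linkage agent. The function name `store_link_ids` and the record layout are hypothetical.

```python
def store_link_ids(local_records: dict, returned: dict) -> dict:
    """Attach each returned LINKID to the matching local record.

    local_records: {row_id: record dict kept inside the organization}
    returned:      {row_id: LINKID} as sent back by the linkage agent
    """
    # Assumption: every submitted row receives a LINKID, so a gap is an error.
    missing = set(local_records) - set(returned)
    if missing:
        raise ValueError(f"no LINKID returned for rows: {sorted(missing)}")
    return {
        row_id: {**record, "LINKID": returned[row_id]}
        for row_id, record in local_records.items()
    }


records = {1: {"name": "Jane Doe"}, 2: {"name": "John Roe"}}
linked = store_link_ids(records, {1: "link-a", 2: "link-b"})
print(linked[1]["LINKID"])  # link-a
```

The PII never leaves `local_records`; only the row identifier and the garbled value would be transmitted.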

Each data owner stores the LINKIDs and HOUSEHOLDIDs for future queries. A key aspect of PPRL is the method used to garble the PII, which determines how well the linkage agent can perform matching. The image below lists the CODI PPRL process steps, while the following section describes the matching approach that will be used.

codi_process_overview

The diagram below illustrates the PPRL data and process flow, using the same step numbering as the figure above. The diagram shows individual linkage (i.e., generating LINKIDs); household linkage is analogous but uses separate scripts designed for that process (see Section 4.6 for more details).

codi_pprl_process

Data Owner-Specific Process Flow Steps

The guidance in this section describes the entire PPRL process for data owners who will be hosting a database, performing hashing, transmitting information to the linkage agent, and responding to queries.

In order to mitigate privacy concerns associated with the potential linking of individuals to households, data owners shall not transmit both individual and household information to the linkage agent at the same time. Instead, the process for individual linkage and the process for household linkage will be run separately, one at a time. The basic sequence is as follows: data owners shall transmit de-identified individual information to the linkage agent, receive the LINKIDs and confirmation that the linkage agent has deleted the individual information, and then transmit the de-identified household information. As before, data owners will receive HOUSEHOLDIDs from the linkage agent.
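The ordering rule above can be enforced with a small guard on the data owner side. The sketch below is hypothetical: the class `LinkageSequence` and its method names are not part of the CODI tools, and it only models the sequence described in this section.

```python
class LinkageSequence:
    """Tracks which transmission a data owner may perform next."""

    def __init__(self) -> None:
        self.individuals_sent = False
        self.linkids_received = False
        self.deletion_confirmed = False
        self.households_sent = False

    def send_individuals(self) -> None:
        if self.households_sent:
            raise RuntimeError("household data has already been transmitted")
        self.individuals_sent = True

    def receive_linkids(self) -> None:
        if not self.individuals_sent:
            raise RuntimeError("individual data has not been transmitted yet")
        self.linkids_received = True

    def confirm_deletion(self) -> None:
        if not self.individuals_sent:
            raise RuntimeError("individual data has not been transmitted yet")
        self.deletion_confirmed = True

    def send_households(self) -> None:
        # Household data may go out only after the individual round is fully closed.
        if not (self.linkids_received and self.deletion_confirmed):
            raise RuntimeError("individual linkage must finish before household linkage")
        self.households_sent = True


seq = LinkageSequence()
seq.send_individuals()
seq.receive_linkids()
seq.confirm_deletion()
seq.send_households()  # allowed: LINKIDs received and deletion confirmed
```

Calling `send_households()` before both `receive_linkids()` and `confirm_deletion()` raises `RuntimeError`, which keeps the two data sets from being at the linkage agent at the same time.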

The process of extracting individual information and preparing it for transmission to the linkage agent is illustrated in Figure 41. Data owners execute steps 1, 2, 3, 4, 5, 9, and 10 in the process.

data_owner_process