Scripts for PPRL QA
This page describes the QA scripts that are included in the linkage-agent-tools repository.
- A run result is a single “match” produced by running a single anonlink project. Each run result contains one row number from each system that matched, and linkage-agent-tools aggregates run results together. The example below contains 3 run results.
- A match group is the entire structure: a set of run results plus the aggregate of the row numbers from those run results, grouped by site. One match group conceptually represents one individual and becomes one LINKID. (Not every LINKID has an associated match group: a row from one system that matches no rows from other systems gets its own LINKID but is not referenced in any match group.)
{
  "site_c": [518],
  "site_d": [317],
  "site_e": [499, 732],
  "run_results": [
    {
      "site_c": 518,
      "site_d": 317,
      "project": "name-sex-dob-phone"
    },
    {
      "site_c": 518,
      "site_d": 317,
      "site_e": 499,
      "project": "name-sex-dob-zip"
    },
    {
      "site_d": 317,
      "site_e": 732,
      "project": "name-sex-dob-addr"
    }
  ]
}
Note that this example is completely synthetic, and even if it were real data, the numbers shown here are not unique identifiers. Each number refers to the position of a hash in the list of hashes provided by an organization. For example, "site_c": 518 refers to the hash on line 518 of the file provided by site_c.
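The relationship between the run results and the per-site row lists in a match group can be sketched as below. This is a minimal illustration only: `build_match_group` is a hypothetical helper, not a function from linkage-agent-tools, and it ignores the deconfliction logic the real tools apply.

```python
from collections import defaultdict


def build_match_group(run_results):
    """Collect the row numbers from each run result, grouped by site.

    Hypothetical helper for illustration; it does not reproduce the
    grouping or deconfliction performed by linkage-agent-tools.
    """
    rows_by_site = defaultdict(set)
    for result in run_results:
        for key, value in result.items():
            if key != "project":  # every other key is a site name
                rows_by_site[key].add(value)
    group = {site: sorted(rows) for site, rows in sorted(rows_by_site.items())}
    group["run_results"] = run_results
    return group


# The three synthetic run results from the example above.
run_results = [
    {"site_c": 518, "site_d": 317, "project": "name-sex-dob-phone"},
    {"site_c": 518, "site_d": 317, "site_e": 499, "project": "name-sex-dob-zip"},
    {"site_d": 317, "site_e": 732, "project": "name-sex-dob-addr"},
]
match_group = build_match_group(run_results)
```

Here site_e ends up with two row numbers (499 and 732) because two different run results pointed at different rows from that site.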
db.getCollection('match_groups').aggregate([
  {
    $unwind: "$run_results"
  },
  {
    $group: {
      _id: '$run_results.project',
      total: { $sum: 1 }
    }
  }
])
{ "_id" : "name-sex-dob-parents", "total" : 2434 }
{ "_id" : "name-sex-dob-addr", "total" : 1989 }
{ "_id" : "name-sex-dob-zip", "total" : 2519 }
{ "_id" : "name-sex-dob-phone", "total" : 2016 }
project_matching_freq tells us the total number of run results by project, across all match groups.
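For readers less familiar with the aggregation pipeline, the same count can be sketched in plain Python over match groups that have already been loaded as dictionaries. The function name `count_run_results_by_project` is an assumption made for this example and is not part of the repository.

```python
from collections import Counter


def count_run_results_by_project(match_groups):
    """Count run results per project across all match groups.

    Mirrors the $unwind + $group pipeline: every run result
    contributes 1 to the total for its project.
    """
    counts = Counter()
    for group in match_groups:
        for result in group["run_results"]:
            counts[result["project"]] += 1
    return counts


# One synthetic match group, taken from the example earlier on this page.
sample_groups = [
    {
        "run_results": [
            {"site_c": 518, "site_d": 317, "project": "name-sex-dob-phone"},
            {"site_c": 518, "site_d": 317, "site_e": 499, "project": "name-sex-dob-zip"},
            {"site_d": 317, "site_e": 732, "project": "name-sex-dob-addr"},
        ]
    }
]
project_counts = count_run_results_by_project(sample_groups)
```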
db.getCollection('match_groups').aggregate([
  {
    $group: {
      _id: { $size: '$run_results' },
      total: { $sum: 1 },
    }
  }
])
{ "_id" : 1, "total" : 3927 }
{ "_id" : 2, "total" : 1519 }
{ "_id" : 3, "total" : 1381 }
{ "_id" : 4, "total" : 18086 }
{ "_id" : 5, "total" : 824 }
{ "_id" : 6, "total" : 412 }
{ "_id" : 7, "total" : 213 }
{ "_id" : 8, "total" : 150 }
{ "_id" : 9, "total" : 86 }
{ "_id" : 10, "total" : 52 }
[...]
freqs_of_matching_projs returns the number of match groups that have a given number of run results. This one is a little more interesting. I’ve rearranged the output into a CSV for greater readability (see attached). If the data and the matching process were perfect, every match group would have exactly 4 run results: each of the 4 projects would link the same person across whichever systems they were present in. If a person is in only one system, there are no run results and no match group.
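The grouping by number of run results can also be sketched in Python. This is an illustrative equivalent of the query, assuming the match groups are available as dictionaries; `count_groups_by_result_count` is a hypothetical name.

```python
from collections import Counter


def count_groups_by_result_count(match_groups):
    """Map "number of run results" -> "number of match groups with that many"."""
    return Counter(len(group["run_results"]) for group in match_groups)


# Synthetic input: two groups with 1 run result, one group with 3.
groups = [
    {"run_results": [{"site_a": 1, "site_b": 2, "project": "name-sex-dob-zip"}]},
    {"run_results": [{"site_a": 5, "site_c": 7, "project": "name-sex-dob-addr"}]},
    {
        "run_results": [
            {"site_c": 518, "site_d": 317, "project": "name-sex-dob-phone"},
            {"site_c": 518, "site_d": 317, "site_e": 499, "project": "name-sex-dob-zip"},
            {"site_d": 317, "site_e": 732, "project": "name-sex-dob-addr"},
        ]
    },
]
size_counts = count_groups_by_result_count(groups)
```

With 4 projects, a large count at 4 is what the ideal scenario predicts; counts well above 4 mean some projects produced more than one run result for the same match group.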
This script compares the hashes across all sites for a single project to count the number of exact matches between each pair of sites and across all sites.
COUNTING EXACT MATCHES FOR name-sex-dob-zip
Size of site_a: 819
Size of site_b: 819
Size of site_c: 819
Size of site_d: 819
Size of site_e: 819
Size of site_f: 819
Total exact matches between site_a and site_b: 186
Total exact matches between site_a and site_c: 109
Total exact matches between site_a and site_d: 88
Total exact matches between site_a and site_e: 104
Total exact matches between site_a and site_f: 181
Total exact matches between site_b and site_c: 189
Total exact matches between site_b and site_d: 124
Total exact matches between site_b and site_e: 71
Total exact matches between site_b and site_f: 102
Total exact matches between site_c and site_d: 188
Total exact matches between site_c and site_e: 111
Total exact matches between site_c and site_f: 74
Total exact matches between site_d and site_e: 173
Total exact matches between site_d and site_f: 132
Total exact matches between site_e and site_f: 176
Total exact matches for name-sex-dob-zip: 1303
Size of {site}:
== the number of UNIQUE hashes provided by the data owner. If the number of unique hashes is noticeably smaller than the total number of hashes (lines) in the file, the data may not have been properly deduplicated. In other words, the data owner may have included multiple rows per individual, which the current tool suite does not support.
Total exact matches between {site_a} and {site_b}
== the number of unique hashes that appear in both sites, i.e., the intersection of the two sites' sets of hashes.
Total exact matches for {project}
== the number of exactly matching hashes across all pairs of sites for the given project. This is the union of the pairwise exact matches, so a hash shared by several pairs of sites is counted once.
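The three figures described above map directly onto set operations. The sketch below assumes each site's hashes have already been read into a list of strings; `count_exact_matches` and the sample hash values are hypothetical and only illustrate the arithmetic, not the script's actual implementation.

```python
from itertools import combinations


def count_exact_matches(hashes_by_site):
    """Return (sizes, pairwise, total) for one project.

    sizes:    unique hash count per site
    pairwise: size of the intersection for each pair of sites
    total:    size of the union of all pairwise intersections
    """
    unique = {site: set(hashes) for site, hashes in hashes_by_site.items()}
    sizes = {site: len(hashes) for site, hashes in unique.items()}
    pairwise = {}
    all_matches = set()
    for site_x, site_y in combinations(sorted(unique), 2):
        shared = unique[site_x] & unique[site_y]
        pairwise[(site_x, site_y)] = len(shared)
        all_matches |= shared
    return sizes, pairwise, len(all_matches)


# Synthetic hashes; "h2" appears twice in site_a to show deduplication.
sizes, pairwise, total = count_exact_matches(
    {
        "site_a": ["h1", "h2", "h2", "h3"],
        "site_b": ["h2", "h3", "h4"],
        "site_c": ["h3", "h5"],
    }
)
```

In this toy input the pairwise counts sum to 4, but the total is 2 because "h3" is shared by every pair and is counted once in the union.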
This script attempts to determine the impact of dropping one anonlink project from the set. (For example, CODI 1.0 originally used 4 projects, and this script was used to help determine the impact of dropping the name-sex-dob-parents project.)
Note that because of the complexities and randomness of the deconfliction logic, this tool cannot perfectly predict the impact of dropping a project in all cases. Also note that the results cannot be combined to determine the impact of dropping multiple projects at once.
Total number of matches: 1454
Total number of matched pairs (run_results): 4491
NAME-SEX-DOB-PHONE
1445 results have a match on name-sex-dob-phone
4 results have a match ONLY on name-sex-dob-phone
Total # of changes by removing this project: 25
LINKID 94e94e28-b9b3-11ec-81b7-acde48001122 was created using only name-sex-dob-phone -- sites: ['site_e', 'site_a']
LINKID 94eb169a-b9b3-11ec-81b7-acde48001122 was created using only name-sex-dob-phone -- sites: ['site_c', 'site_d']
LINKID 94eb37f6-b9b3-11ec-81b7-acde48001122 was created using only name-sex-dob-phone -- sites: ['site_f', 'site_e']
LINKID 94eb9606-b9b3-11ec-81b7-acde48001122 was created using only name-sex-dob-phone -- sites: ['site_f', 'site_e']
LINKID 94e8a16c-b9b3-11ec-81b7-acde48001122 links to site_f only by name-sex-dob-phone -- remaining links are ['site_a', 'site_b']
LINKID 94e8b544-b9b3-11ec-81b7-acde48001122 links to site_a only by name-sex-dob-phone -- remaining links are ['site_d', 'site_e', 'site_f']
LINKID 94e8cbe2-b9b3-11ec-81b7-acde48001122 links to site_f only by name-sex-dob-phone -- remaining links are ['site_a', 'site_b', 'site_c']
LINKID 94e93dca-b9b3-11ec-81b7-acde48001122 links to site_f only by name-sex-dob-phone -- remaining links are ['site_a', 'site_e']
LINKID 94e9a0a8-b9b3-11ec-81b7-acde48001122 links to site_e only by name-sex-dob-phone -- remaining links are ['site_b', 'site_f', 'site_a']
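The two kinds of messages in the output above (a LINKID created using only one project, and a LINKID that links to a site only through one project) can be approximated per match group as follows. This is a simplified sketch under stated assumptions: `impact_of_dropping` is a hypothetical function, it only compares which sites are referenced before and after removing the project's run results, and it does not model the deconfliction logic or check whether the remaining run results still connect into one group.

```python
def sites_in(run_results):
    """Set of site names referenced by a list of run results."""
    sites = set()
    for result in run_results:
        sites |= {key for key in result if key != "project"}
    return sites


def impact_of_dropping(match_group, project):
    """Describe what one match group loses if `project` is removed."""
    results = match_group["run_results"]
    remaining = [r for r in results if r["project"] != project]
    before, after = sites_in(results), sites_in(remaining)
    return {
        # True when every run result in the group came from this project.
        "created_only_by_project": bool(results) and not remaining,
        "lost_sites": sorted(before - after),
        "remaining_sites": sorted(after),
    }


# Synthetic group: site_f is reachable only through name-sex-dob-phone.
group = {
    "run_results": [
        {"site_a": 1, "site_b": 2, "project": "name-sex-dob-addr"},
        {"site_a": 1, "site_b": 2, "site_f": 9, "project": "name-sex-dob-phone"},
    ]
}
impact = impact_of_dropping(group, "name-sex-dob-phone")
```

Because each project is evaluated in isolation, this kind of check shares the limitation noted above: results for different projects cannot be combined to predict the effect of dropping several at once.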
This script works in different ways depending on which version of MongoDB is present.
For version 4.4 and newer, this script will use a $function-based approach, producing results like the following:
{ _id:
   { 'name-sex-dob-addr': 1,
     'name-sex-dob-parents': 1,
     'name-sex-dob-phone': 1,
     'name-sex-dob-zip': 1 },
  total: 1281 }
{ _id:
   { 'name-sex-dob-addr': 2,
     'name-sex-dob-parents': 1,
     'name-sex-dob-phone': 1,
     'name-sex-dob-zip': 1 },
  total: 52 }
{ _id:
   { 'name-sex-dob-addr': 1,
     'name-sex-dob-parents': 2,
     'name-sex-dob-phone': 1,
     'name-sex-dob-zip': 1 },
  total: 15 }
For earlier versions, results will look like this:
{ _id: ',addr,parents,phone,zip', count: 1281 }
{ _id: ',addr,addr,parents,phone,zip', count: 52 }
{ _id: ',addr,parents,parents,phone,zip', count: 15 }
{ _id: ',addr,parents,parents,phone,phone,zip', count: 14 }
{ _id: ',addr,addr,parents,parents,phone,phone,zip', count: 14 }
{ _id: ',addr,parents,phone,phone,zip', count: 10 }
This script counts the match groups that have a given "shape" of run results, where "shape" means the number of run results from each project. In a perfect world with perfect data and perfect matching algorithms, every match group would have a shape of exactly one run result from each project.
If a "single-project match" is near the top of the list, for example { "_id": ",zip", "count": 19837 }, this indicates that the project may be contributing more to matches than the other projects. It also means that the project is finding matches between individuals that the other projects are not finding, so, depending on the key field, this may suggest that the matching threshold for that project is too low.
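The idea of a "shape" can be sketched in Python as a per-project count of run results, used as a grouping key. This is an illustration only, assuming match groups loaded as dictionaries; `shape_of` and `shape_frequencies` are hypothetical names and do not reflect how the script itself builds its keys.

```python
from collections import Counter


def shape_of(match_group):
    """Shape = sorted (project, count) pairs for one match group."""
    counts = Counter(result["project"] for result in match_group["run_results"])
    return tuple(sorted(counts.items()))


def shape_frequencies(match_groups):
    """List of (shape, number of match groups), most common first."""
    return Counter(shape_of(group) for group in match_groups).most_common()


# Synthetic input: two groups matched only on zip, one with addr twice plus zip.
groups = [
    {"run_results": [{"site_a": 1, "site_b": 2, "project": "name-sex-dob-zip"}]},
    {"run_results": [{"site_c": 3, "site_d": 4, "project": "name-sex-dob-zip"}]},
    {
        "run_results": [
            {"site_a": 5, "site_b": 6, "project": "name-sex-dob-addr"},
            {"site_a": 5, "site_c": 7, "project": "name-sex-dob-addr"},
            {"site_a": 5, "site_b": 6, "project": "name-sex-dob-zip"},
        ]
    },
]
frequencies = shape_frequencies(groups)
```

In this toy input the single-project shape for name-sex-dob-zip comes out on top, which is the pattern the paragraph above suggests investigating.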