Duplicate Detection Mechanism

Bartha Barna edited this page Jan 9, 2023 · 4 revisions

Introduction

This guide explains the detection of potentially duplicate persons, cases, contacts, and events in SORMAS - how the process works in general and which variables are taken into account.

General Information

Duplicate detection is always based on the jurisdiction of the user that creates or imports persons, cases, contacts, or event participants into the system. Data that the user has no access to is not taken into account, so duplicates can still be created even if every user applies duplicate detection consistently. To resolve such duplicates, users with the respective rights have access to dedicated "Merge Duplicate" views for cases and contacts, accessible via the directories, where they can clean up the system by merging duplicate data.

Please note that there is currently no duplicate detection in place when data is pushed to SORMAS via the REST interface!

Persons

Whenever a user tries to create data that involves persons (cases, contacts, or event participants), the system first searches for similar persons and only then runs the duplicate detection for the actual case, contact, or event participant. An existing person is classified as a potential duplicate if it meets the following requirements. Any variable that is not specified for the created person is ignored in this calculation.

  • The person must be associated with at least one case, contact, or event participant.
    • If the server property duplicatechecks.excludepersonsonlylinkedtoarchivedentries is enabled, the associated case or contact, or the event that the event participant is part of, additionally needs to be active, i.e. not deleted and not archived. By default, this property is disabled.
  • The person must have a similar name.
    • Similar names are detected with PostgreSQL's pg_trgm module, which uses trigrams (groups of three consecutive characters) to calculate the similarity between two strings, in this case the names of two persons. The result is a value between 0 (no shared trigrams) and 1 (identical trigram sets). The default similarity threshold is 0.65 and can be adjusted by changing the value of the server property namesimilaritythreshold. The higher the value, the more similar the names need to be in order to be detected.
  • Both persons must have the same or no sex (if specified for the created person).
    • The sex "Unknown" matches with every other sex, so the system would detect a person with sex "Unknown" as a potential duplicate of a person with sex "Male", "Female", or "Other" as long as all other requirements are met.
  • Both persons must not have a differing year, month, or day of birth.
    • Persons are also detected as potential duplicates if their year, month, or day of birth is empty. They are only excluded if there's an actually different value.
  • Both persons must not have a differing national health ID.
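To make the name comparison more concrete, the following Python sketch approximates the similarity score that pg_trgm calculates: the share of trigrams that two strings have in common. It is an illustration only, because SORMAS runs the comparison inside PostgreSQL. The function names, the way a person's names are combined into a single string, and the assumption that a score equal to the threshold still counts as similar are not taken from SORMAS.

```python
import re

DEFAULT_THRESHOLD = 0.65  # default value of namesimilaritythreshold


def trigrams(text):
    """Return the set of trigrams of a string, roughly as pg_trgm builds it."""
    grams = set()
    # pg_trgm lower-cases the input and treats non-alphanumeric
    # characters as word separators.
    for word in re.findall(r"[^\W_]+", text.lower()):
        # Each word is padded with two leading spaces and one trailing space.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def names_are_similar(a, b, threshold=DEFAULT_THRESHOLD):
    # Assumption: a score equal to the threshold counts as similar.
    return similarity(a, b) >= threshold
```

With these definitions, "John Smith" and "Jon Smith" share 8 of 13 distinct trigrams, which gives a score of about 0.62. That is just below the default threshold, so the pair would only be reported after lowering namesimilaritythreshold, for example to 0.6.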

Important: All of the above requirements are ignored (except for the association requirement) if both persons have the same passport number.

  • The person similarity logic is also used when merging persons from the person directory to show a warning in case the persons are not similar.
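The person requirements, including the passport exception, can be summarized in a short Python sketch. It is a simplified model and not the SORMAS implementation: the Person class, its field names, and the function names are hypothetical, the name similarity score is passed in as a precomputed number, and has_association is assumed to already reflect the duplicatechecks.excludepersonsonlylinkedtoarchivedentries setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    # Hypothetical, simplified stand-in for a SORMAS person record.
    sex: Optional[str] = None  # "Male", "Female", "Other", "Unknown" or None
    birth_year: Optional[int] = None
    birth_month: Optional[int] = None
    birth_day: Optional[int] = None
    national_health_id: Optional[str] = None
    passport_number: Optional[str] = None
    has_association: bool = True  # linked to a case, contact, or event participant


def differs(new_value, existing_value):
    """True only if both values are set and actually different."""
    return (new_value is not None
            and existing_value is not None
            and new_value != existing_value)


def is_potential_duplicate(new, existing, name_similarity, threshold=0.65):
    # The association requirement always applies.
    if not existing.has_association:
        return False
    # An identical passport number overrides all remaining requirements.
    if (new.passport_number is not None
            and new.passport_number == existing.passport_number):
        return True
    # Assumption: a score equal to the threshold counts as similar.
    if name_similarity < threshold:
        return False
    # "Unknown" is compatible with every other sex.
    if "Unknown" not in (new.sex, existing.sex) and differs(new.sex, existing.sex):
        return False
    if (differs(new.birth_year, existing.birth_year)
            or differs(new.birth_month, existing.birth_month)
            or differs(new.birth_day, existing.birth_day)):
        return False
    return not differs(new.national_health_id, existing.national_health_id)
```

Because empty values never count as a difference, a new person with only a name and a birth year can still match an existing person with a complete birth date, as long as the year agrees.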

Cases

When a user tries to create a case and a potentially duplicate person has been identified and selected, the system additionally searches for similar cases of the selected person. A case is classified as a potential duplicate if it meets the following requirements. Any variable that is not specified for the created case is ignored in this calculation. Archived cases are considered; cases that have been marked as deleted are not.

  • Both cases must have the same disease.
  • Both cases must have the same place of stay region or, if the place of stay region is empty, the same responsible region.
  • The report dates of both cases must be within 30 days of each other.
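The case requirements can be sketched in Python as follows. This is a simplified model under stated assumptions, not the SORMAS implementation: the Case class and its field names are hypothetical, both cases are assumed to belong to the already selected person, the region fallback is applied to each case separately, and a gap of exactly 30 days is treated as still within range.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

REPORT_DATE_WINDOW_DAYS = 30


@dataclass
class Case:
    # Hypothetical, simplified stand-in for a SORMAS case.
    disease: str
    report_date: date
    responsible_region: str
    place_of_stay_region: Optional[str] = None
    deleted: bool = False
    archived: bool = False


def comparison_region(case):
    # Place of stay region if set, otherwise the responsible region.
    return case.place_of_stay_region or case.responsible_region


def is_potential_duplicate_case(new, existing):
    # Deleted cases are never considered; archived cases are
    # deliberately not filtered out.
    if existing.deleted:
        return False
    if new.disease != existing.disease:
        return False
    if comparison_region(new) != comparison_region(existing):
        return False
    gap = abs((new.report_date - existing.report_date).days)
    # Assumption: a gap of exactly 30 days is still within range.
    return gap <= REPORT_DATE_WINDOW_DAYS
```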

Contacts

When a user tries to create a contact and a potentially duplicate person has been identified and selected, the system additionally searches for similar contacts of the selected person. A contact is classified as a potential duplicate if it meets the following requirements. Any variable that is not specified for the created contact is ignored in this calculation. Contacts of archived cases are considered; contacts that have been marked as deleted are not.

  • If no source case has been selected, both contacts must have the same disease.
  • If a source case has been selected, both contacts must have the same source case.
  • The report dates of both contacts must be within 30 days of each other.
  • The last contact dates of both contacts must be within 30 days of each other (if specified for the created contact).
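A comparable Python sketch for contacts is shown below. It is again a simplified model and not the SORMAS implementation: the Contact class and its field names are hypothetical, a gap of exactly 30 days is treated as within range, and an existing contact without a last contact date is assumed not to be excluded.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

WINDOW_DAYS = 30


@dataclass
class Contact:
    # Hypothetical, simplified stand-in for a SORMAS contact.
    disease: str
    report_date: date
    source_case_id: Optional[str] = None
    last_contact_date: Optional[date] = None
    deleted: bool = False


def within_window(first, second):
    # Assumption: a gap of exactly 30 days is still within range.
    return abs((first - second).days) <= WINDOW_DAYS


def is_potential_duplicate_contact(new, existing):
    if existing.deleted:
        return False
    if new.source_case_id is None:
        # No source case selected: compare the disease.
        if new.disease != existing.disease:
            return False
    elif new.source_case_id != existing.source_case_id:
        # Source case selected: it must be the same one.
        return False
    if not within_window(new.report_date, existing.report_date):
        return False
    # Only checked if the created contact has a last contact date.
    # Assumption: an existing contact without this date is not excluded.
    if new.last_contact_date is not None and existing.last_contact_date is not None:
        return within_window(new.last_contact_date, existing.last_contact_date)
    return True
```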

Event Participants

When a user tries to create an event participant and a potentially duplicate person has been identified and selected, the system additionally checks whether the selected person already is an event participant in the associated event. If so, an error message informs the user and prevents them from creating a duplicate event participant.
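Unlike cases and contacts, this check blocks the creation instead of offering a choice. A minimal Python sketch of the idea follows, in which the error class, the function name, and the use of a set of (person, event) pairs are hypothetical and only serve as an illustration.

```python
class DuplicateEventParticipantError(Exception):
    """Hypothetical error type for a rejected event participant."""


def add_event_participant(participants, person_id, event_id):
    """Register the person for the event unless they already take part in it."""
    key = (person_id, event_id)
    if key in participants:
        raise DuplicateEventParticipantError(
            "Person %s already participates in event %s" % (person_id, event_id)
        )
    participants.add(key)
```

The same person can still be added to a different event, because the check only looks at the combination of person and event.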