Documentation > GitHub Wiki on SORMAS duplicate detection variables and logic #23

Candice-Louw · 2021-01-20T12:06:37Z

Situation Description

Duplicate detection in SORMAS is a feature that affects many users. There is currently no documentation that describes which variables are taken into account when this process is executed. This makes it difficult for SORMAS end users to decide which fields to make mandatory to capture, during the contact tracing process for example, to ensure more accurate results.

Feature Description

Create a GitHub Wiki entry dedicated to duplicate detection logic, indicating which variables are taken into consideration during the duplicate check for Person entities, i.e.
- Case Persons
- Contact Persons
- Event Participants (Persons)
Please complete this per server configuration i.e. DE, CH and international, as different variables are visible on these different systems and it is not clear which are/are not relevant for which configuration.
Please also provide link(s) to the file(s) in the source code where this is programmed, so developers can access it directly.

@bernardsilenou @kwa20 - please include additional entities if needed.

Possible Alternatives

This doesn't necessarily have to be a Wiki entry - it could be documented elsewhere (public) too, please. The request is simply to be able to share this information (URL) when this sort of request comes our way so that it is self-explanatory enough for anyone (technical and non-technical) to be able to understand and access the most up to date version available.

Candice-Louw · 2021-01-20T12:12:19Z

Related to (directly):
SORMAS-Foundation/SORMAS-Project#3787
SORMAS-Foundation/SORMAS-Project#3704
SORMAS-Foundation/SORMAS-Project#2595
(indirectly)
SORMAS-Foundation/SORMAS-Project#4003
https://github.com/hzi-braunschweig/SORMAS-Switzerland/issues/138

JonasCir · 2021-01-20T12:56:05Z

@Candice-Louw what do you think about using GitHub Pages for that, like we do e.g. in SORMAS-glossary? The good thing is that this would also give us a solution for collecting all the documentation files currently in the repo root.

Candice-Louw · 2021-01-20T13:37:01Z

@JonasCir - great idea, yes! Where would this process start?

JonasCir · 2021-01-20T13:45:22Z

I could provide the same PR as for SORMAS-glossary to this repo, and you go ahead and start writing the documentation in Markdown? 😃 I would still request a sign-off for this approach from someone on the team, though :)

JonasCir · 2021-01-20T13:50:41Z

Everything is taken from SORMAS-glossary, where it is already running in production. The approach uses:
- mkdocs
- a GitHub Actions workflow
- a docs folder that holds all the docs
- GitHub Pages to render them

I have filed this PR 3 times already, to SORMAS-Stats, data-generator and glossary. An advantage would be that all the docs currently cluttering the repo root could go to a dedicated folder under docs/

bernardsilenou · 2021-01-21T02:33:09Z

I also support creating a section in the glossary.

JonasCir · 2021-01-21T08:52:45Z

This is the question: do we put such things into the glossary or the main repo? The tech stack with GitHub Pages is the same.

bernardsilenou · 2021-01-21T21:27:54Z

OK, I got it now. I think we should put it in the glossary, please.

JonasCir · 2021-01-21T21:45:18Z

@bernardsilenou @Candice-Louw let's continue our discussion here in the glossary:)

Candice-Louw · 2021-05-28T10:48:38Z

@Jan-Boehme - would it be possible to upload/share the document/info that you compiled on the current duplicate check algorithm, please?

SORMAS-JanBoehme · 2021-05-28T12:50:55Z

@Candice-Louw

Sure, the github wiki page does not exist yet, right? Or I am too dumb to find it :-)

For everyone involved, here is the info from a discussion currently going on with the German health departments about the duplicate detection for persons:

The departments aren't really happy with the way the duplicate detection currently works, because it requires some fields to be exactly the same or else it will not work at all (more on that later).
Typos happen often when entering data, and sometimes the information they get from the persons themselves is unclear (e.g. is the person called Detlef or Detlev, Mohammed or Muhammed?).

This is why I went ahead and did some digging in the source code of the current implementation and created a concept for a more sophisticated person duplicate detection, which makes use of weighted values attached to fields that could indicate a duplicate. The sum of all these weighted values is then evaluated to decide whether a person is presented to the user as a possible duplicate.
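
As a rough illustration of the weighted idea (not the concept itself, which is not spelled out in this thread), a minimal Python sketch could look like the following. The field names, the weights, the threshold of 60 and the use of `difflib.SequenceMatcher` for name similarity are all assumptions made up for this example.

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold, for illustration only.
# The thread does not specify any concrete values.
WEIGHTS = {"first_name": 30, "last_name": 30, "birthdate": 25, "sex": 15}
THRESHOLD = 60


def field_similarity(field, a, b):
    """Return a value from 0.0 to 1.0 describing how well two field values agree."""
    if a is None or b is None:
        return 0.0
    if field in ("first_name", "last_name"):
        # Fuzzy comparison for names, so that typos still contribute to the score.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    # Exact comparison for all other fields.
    return 1.0 if a == b else 0.0


def duplicate_score(p, q):
    """Sum the weighted similarities over all configured fields."""
    return sum(w * field_similarity(f, p.get(f), q.get(f)) for f, w in WEIGHTS.items())


def is_possible_duplicate(p, q):
    """A pair is shown to the user when its score reaches the threshold."""
    return duplicate_score(p, q) >= THRESHOLD
```

With this kind of scoring no single field has to match exactly; whether a pair is suggested depends only on the total. How the weights and the threshold should be chosen is exactly the open question in such a scheme.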

Current implementation:

The SELECT statement for reading possible duplicates from the database is built according to these criteria (PersonService.buildSimilarityCriteriaFilter):

(FirstName is equal OR LastName is equal)
AND
(sex is equal OR sex is null OR sex is unknown)
AND
(birthdateDD is equal AND birthdateMM is equal AND birthdateYYYY is equal) //Only if a value is provided
AND
(NationalHealthId is equal OR passportNumber is equal) //Only if a value is provided
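
Read literally, the criteria above can be sketched as a Python predicate. This is not the SORMAS code: the `Person` class, its field names and the `"UNKNOWN"` value are hypothetical, and the sketch assumes that "only if a value is provided" refers to the person being entered and that a null or unknown sex on either side passes the check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    # Hypothetical, simplified stand-in for a person record.
    first_name: str
    last_name: str
    sex: Optional[str] = None
    birthdate_dd: Optional[int] = None
    birthdate_mm: Optional[int] = None
    birthdate_yyyy: Optional[int] = None
    national_health_id: Optional[str] = None
    passport_number: Optional[str] = None


def passes_prefilter(stored: Person, entered: Person) -> bool:
    """Return True if the stored person would be pulled as a candidate."""
    # (first name is equal OR last name is equal)
    name_ok = (stored.first_name == entered.first_name
               or stored.last_name == entered.last_name)

    # (sex is equal OR sex is null OR sex is unknown)
    unknown = (None, "UNKNOWN")
    sex_ok = (stored.sex == entered.sex
              or stored.sex in unknown
              or entered.sex in unknown)

    # Every birthdate component that was entered must match exactly.
    birthdate_ok = all(
        wanted is None or wanted == have
        for wanted, have in [
            (entered.birthdate_dd, stored.birthdate_dd),
            (entered.birthdate_mm, stored.birthdate_mm),
            (entered.birthdate_yyyy, stored.birthdate_yyyy),
        ]
    )

    # If an ID was entered, at least one entered ID must match.
    health_id, passport = entered.national_health_id, entered.passport_number
    ids_ok = (
        (health_id is None and passport is None)
        or (health_id is not None and stored.national_health_id == health_id)
        or (passport is not None and stored.passport_number == passport)
    )

    return name_ok and sex_ok and birthdate_ok and ids_ok
```

Because the four groups are joined with AND, a single mismatch in any group removes the stored person from the candidate list before any name similarity is calculated.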

The way of building the statement raises the following problems:

- Either the first name or the last name of the person needs to be exactly the same. If it isn't, the person will never be detected as a duplicate (e.g. Jens Müller and Hens Nüller will never be considered possible duplicates, even if all other known data is exactly the same, because they are never pulled from the database for further inspection).
- If the sex is male for one person and female for the other, they are never considered duplicates, even if everything else is exactly the same. There are unisex names from which the person's sex is not 100% clear, possibly resulting in one user interpreting the name as male and another as female.
- Day, month and year are not evaluated separately from one another; instead, the three equality checks are connected by a logical AND. This means that a person is only considered a duplicate if the birthdate is exactly the same (e.g. when the user makes a typo and enters the birthdate as March 3rd, 1991 or March 2nd, 1919 instead of the correct March 2nd, 1991, the person will not be considered a duplicate).

After pulling all matches from the database, firstName and lastName are joined into one string, and the trigram similarity between this string and the value in question is calculated.
If it is greater than the server config value "namesimilaritythreshold", the person is considered a possible duplicate and is presented to the user for selection.
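
To make the name comparison step more concrete, here is a minimal Python sketch of a trigram similarity with a threshold check. It is an illustration under assumptions, not the code SORMAS runs: the padding convention (two spaces in front, one behind), the lower-casing and the function names are made up for this example, and the real calculation may normalize names differently.

```python
def trigrams(text: str) -> set:
    """Return the set of 3-character substrings of the padded, lower-cased text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def name_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (a value from 0 to 1)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def is_name_match(first_name: str, last_name: str, entered_name: str,
                  threshold: float) -> bool:
    """Join the stored names and compare them against the entered full name."""
    stored_name = first_name + " " + last_name
    return name_similarity(stored_name, entered_name) > threshold
```

A threshold of 1 would only ever be exceeded by nothing at all, and values close to 1 accept only near-identical names, so the configured value directly controls how many typos are tolerated. The exact scores produced by this sketch depend on the assumed padding and should not be expected to match the server's results.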

I will provide the concept for the weighted person duplicate check once it has reached a high enough maturity level, as it could have severe implications for database load, which needs to be tested before even considering a new implementation.
For example, the trigram calculation, or maybe even a phonetic algorithm for fuzzy search, would need to be done in the database when executing the query. In the worst case, this means cross-referencing every single entry in the table, which may be fine for a few hundred entries but not for over 1 million like we have in Nigeria.

bernardsilenou · 2021-05-28T14:11:45Z

@Jan-Boehme

I think there are many duplicate detection methods out there, and we can implement multiple if needed.
A challenge with all weighted methods is how the weights are defined. This differs from person to person, and a wrong assignment of weights may instead lead to false suggestions/detections. This is the only point I think we need to clearly define.
The current implementation should not require any variable to be exact for it to work. If that is the case, then there is surely a bug, or they need to change the value of "namesimilaritythreshold".

A few comments on your last comment:

> Either the first name or the last name of the person needs to be exactly the same. If it isn't, the person will never be detected as a duplicate (e.g. Jens Müller and Hens Nüller will never be considered possible duplicates, even if all other known data is exactly the same, because they are never pulled from the database for further inspection).

If this is what they experienced, then they are clearly using a "namesimilaritythreshold" of 1, which implies an exact match. This would lead to over-conservative results. Names do not need to be exactly the same; even swapping first and last name should not matter.
First and last names are concatenated into one string, whitespace is deleted, the strings are compared using a q-gram algorithm, and the similarity is compared with "namesimilaritythreshold".

> If the sex is male for one person and female for the other, they are never considered duplicates, even if everything else is exactly the same. There are unisex names from which the person's sex is not 100% clear, possibly resulting in one user interpreting the name as male and another as female.

If users are not 100% sure of the sex, then they should use "unknown" or NA as the option. If the name and all other person identifiers are the same, but the sex is male for one and female or other for the other, then they would not be suggested as duplicates.

> Day, month and year are not evaluated separately from one another; instead, the three equality checks are connected by a logical AND. This means that a person is only considered a duplicate if the birthdate is exactly the same (e.g. when the user makes a typo and enters the birthdate as March 3rd, 1991 or March 2nd, 1919 instead of the correct March 2nd, 1991, the person will not be considered a duplicate).

That is right, duplicate detection does not correct for wrong data entry. Adjusting for wrong data entry in the duplicate detection may lead to over-suggestion of possible duplicates, which is just as bad as under-suggestion.

kwa20 · 2021-05-28T14:33:00Z

@bernardsilenou @Jan-Boehme

> Either the first name or the last name of the person needs to be exactly the same. If it isn't, the person will never be detected as a duplicate (e.g. Jens Müller and Hens Nüller will never be considered possible duplicates, even if all other known data is exactly the same, because they are never pulled from the database for further inspection).

> If this is what they experienced, then they are clearly using a "namesimilaritythreshold" of 1, which implies an exact match. This would lead to over-conservative results. Names do not need to be exactly the same; even swapping first and last name should not matter. First and last names are concatenated into one string, whitespace is deleted, the strings are compared using a q-gram algorithm, and the similarity is compared with "namesimilaritythreshold".

This is, however, what happens, even if the namesimilaritythreshold is the default of 0.65. Here are some examples:

| name 1 | name 2 | detected |
| --- | --- | --- |
| Jens Müller | Hens Nüller | no |
| Jens Müller | Hens Müller | yes |
| Jens Müller | Jens Nüller | yes |
| Jens Müller | Thomas Müller | no |
| Hens Müller | Hens Nüller | yes |
| Jens Nüller | Thomas Nüller | no |
| Thomas Muller | Dhomas Müller | no |

SORMAS-JanBoehme · 2021-05-28T16:23:35Z

@bernardsilenou
> The current implementation should not require any variable to be exact for it to work. If that is the case, then there is surely a bug, or they need to change the value of "namesimilaritythreshold".

I checked the source code though, and it is implemented exactly this way. No bug or wrong configuration by the GSA admins.
This is confirmed by the tests @kwa20 ran, which give the same results I am getting, at least for a German test instance.

I get that we should avoid over-suggesting possible duplicates, but at the moment the current implementation hides possibly relevant information from the users on purpose, out of fear of overwhelming them.
Human errors happen and that's okay. SORMAS should compensate for them and help the user find and fix them, instead of basically saying to the user "Well, you should have entered the birthdate correctly, tough luck.".

I can only speak for myself, but when I first used SORMAS I was kind of taken aback when I realized that I have to "create" a case and then just trust the system to provide me with the correct person (who I knew for certain existed in the database), without it telling me how it decides whether I would be "allowed" to link that person.

This is what I would like to achieve with a weighted comparison system: allowing users to make errors and fix them easily, while at the same time getting better and more relevant results from the comparison. It would also make every single parameter of the duplicate detection configurable by the local administrators on-the-fly, enabling them to make informed decisions and tailor the software exactly to their needs while being 100% transparent about what happens.

maxiheyner · 2021-07-21T14:30:21Z

@MateStrysewske The local health departments often ask for more information on how exactly the duplicate detection works, but we do not have official documentation for it yet.
Jan once checked the duplicate detection for persons and documented his findings here:
#23 (comment)
But something seems to have changed in the meantime, as it no longer behaves the same way as it did then (it no longer seems necessary for either first or last name to be exactly the same).

What is the exact way the duplicate detection for different entities is implemented at the moment? We need to document that for e.g. the admin manual. Can you please help here?

MateStrysewske · 2021-07-26T10:26:35Z

Could you please add an issue to the main GitHub repository to create such a guide and prioritise it accordingly, i.e. for the next sprint if it's urgently needed?

JonasCir transferred this issue from SORMAS-Foundation/SORMAS-Project Jan 21, 2021

kwa20 mentioned this issue May 31, 2021

Introduce duplicate detection configurations SORMAS-Foundation/SORMAS-Project#5583

Open

SORMAS-JanBoehme mentioned this issue Jun 10, 2021

Change duplicate person detection SORMAS-Foundation/SORMAS-Project#5758

Open

MateStrysewske added and removed the documentation label Jul 26, 2021
