Documentation > GitHub Wiki on SORMAS duplicate detection variables and logic #23
Comments
@Candice-Louw what do you think about using GitHub Pages for that, as we do e.g. in SORMAS-glossary? The good thing is that this would also give us a solution for collecting all the documentation files currently sitting in the repo root. |
@JonasCir - great idea, yes! Where would this process start? |
I provide the same PR as for SORMAS-glossary to this repo and you go ahead and start to write the documentation in Markdown? 😃 I would still request a signoff for this approach from someone of the team, though :) |
Everything is taken from SORMAS-glossary, where it is already running in production: I have filed this PR three times already, to SORMAS-Stats, data-generator and glossary. An advantage would be that all the docs currently cluttering the repo root could go to a dedicated folder under docs/ |
I also support creating a section in the glossary. |
This is the question: do we put such things into the glossary or the main repo? The tech stack with GitHub Pages is the same. |
OK, I get it now. I think we should put it in the glossary, please. |
@bernardsilenou @Candice-Louw let's continue our discussion here in the glossary:) |
@Jan-Boehme - would it be possible to upload/share the document/info that you compiled on the current duplicate check algorithm, please? |
Sure, the GitHub wiki page does not exist yet, right? Or I am too dumb to find it :-)
For everyone involved, here is the info from a discussion currently going on with the German health departments about the duplicate detection for persons:
The departments aren't really happy with the way the duplicate detection currently works, because it requires some field to be exactly the same or else it will not work at all (more on that later). This is why I went ahead, did some digging in the source code of the current implementation, and created a concept for a more sophisticated person duplicate detection that attaches weighted values to fields which could indicate a duplicate. The sum of all these weighted values is then evaluated to decide whether a person is presented to the user as a possible duplicate.
Current implementation:
The SELECT statement for reading possible duplicates from the database is built according to these criteria (PersonService.buildSimilarityCriteriaFilter):
(FirstName is equal OR LastName is equal)
The way the statement is built raises the following problems:
After pulling all matches from the database, firstName and lastName are joined into one string, and the trigram distance between this string and the value in question is calculated. I will provide the concept for the weighted person duplicate check once it is mature enough: it could have severe implications for database load, which need to be tested before we even consider going ahead with a new implementation. |
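The two steps described above (exact-match prefilter, then a trigram comparison of the joined names) can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the tuple representation of a person, and the Jaccard-style trigram similarity (loosely modelled on PostgreSQL's pg_trgm padding) are all hypothetical and not taken from the SORMAS source.

```python
def trigrams(text):
    """Set of character trigrams per word, padded pg_trgm-style (assumption)."""
    grams = set()
    for word in text.lower().split():
        padded = f"  {word} "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_similar(candidate, persons, threshold=0.65):
    """Hypothetical two-step lookup; persons are (first_name, last_name) tuples."""
    first, last = candidate
    # Step 1: prefilter as described in the comment:
    # first name equal OR last name equal (exact comparison is an assumption).
    prefiltered = [p for p in persons if p[0] == first or p[1] == last]
    # Step 2: join the names and compare against the threshold.
    target = f"{first} {last}"
    return [
        p for p in prefiltered
        if trigram_similarity(f"{p[0]} {p[1]}", target) >= threshold
    ]
```

In this sketch, a record such as ("Ana", "Meyer") is never compared with ("Anna", "Meier") at all, because neither name matches exactly in the prefilter. That is the kind of behaviour the health departments are reported to be unhappy with.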
@Jan-Boehme
A few comments on your last comment:
If this is what they experienced, then they are clearly using a "namesimilaritythreshold" of 1, which implies an exact match. This would lead to over-conservative results. Names need not be exactly the same; even swapping first and last name should not matter.
If users are not 100% sure of the sex, then they should use "unknown" or NA as the option. If the name and all other person identifiers are the same but the sex is male for one record and female or other for the other, then they would not be suggested as duplicates.
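One way to read the sex rule above is as a compatibility check that only blocks a suggestion when both values are known and differ. The following is a minimal sketch under that reading; the value names and the function are hypothetical, not the SORMAS implementation.

```python
# Values treated as "not known" in this sketch (hypothetical names).
UNKNOWN_VALUES = {None, "", "UNKNOWN", "NA"}


def sexes_compatible(sex_a, sex_b):
    """True if two records may still be suggested as duplicates.

    Records are compatible when at least one sex is unknown,
    or when both are known and equal.
    """
    if sex_a in UNKNOWN_VALUES or sex_b in UNKNOWN_VALUES:
        return True
    return sex_a == sex_b
```

Under this reading, entering "unknown" keeps a record eligible for duplicate suggestions, whereas a wrong but definite value excludes it.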
|
@bernardsilenou @Jan-Boehme
This is, however, what happens, even with the default namesimilaritythreshold of 0.65. Here are some examples:
|
@bernardsilenou I checked the source code, though, and it is implemented exactly this way: no bug and no wrong configuration by the GSA admins. I get that we should avoid over-suggesting possible duplicates, but at the moment the current implementation deliberately hides possibly relevant information from users out of fear of overwhelming them. I can only speak for myself, but when I first used SORMAS and realized that I had to "create" a case and then just trust the system to provide me with the correct person (who I knew for certain existed in the database), without being told how it decides whether I would be "allowed" to link the person, I was kind of taken aback. This is what I would like to achieve with a weighted comparison system: allowing users to make errors and fix them easily while getting better and more relevant results from the comparison, and making every single parameter of the duplicate detection configurable by the local administrators on the fly, so that they can make informed decisions and tailor the software exactly to their needs, with full transparency about what happens. |
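To make the weighted idea concrete, here is a minimal sketch of what such a comparison could look like. The actual concept has not been shared in this thread, so everything here is assumed for illustration: the field names, the integer weights, the threshold, and the use of exact equality per field (a real design might use fuzzy matching per field instead).

```python
# Hypothetical, admin-configurable weights per field (integers avoid float rounding).
DEFAULT_WEIGHTS = {
    "first_name": 30,
    "last_name": 30,
    "birth_date": 30,
    "sex": 10,
}
DEFAULT_THRESHOLD = 60  # hypothetical cut-off for showing a suggestion


def duplicate_score(a, b, weights=DEFAULT_WEIGHTS):
    """Sum the weights of all fields that are present and equal in both records."""
    score = 0
    for field, weight in weights.items():
        value_a, value_b = a.get(field), b.get(field)
        if value_a is not None and value_a == value_b:
            score += weight
    return score


def is_possible_duplicate(a, b, weights=DEFAULT_WEIGHTS, threshold=DEFAULT_THRESHOLD):
    """A pair is suggested when its weighted score reaches the threshold."""
    return duplicate_score(a, b, weights) >= threshold
```

With these example weights, a typo in the first name alone does not hide a record, because the last name, birth date and sex still add up to 70. Both the weights and the threshold are plain data, so they could in principle be exposed as configuration.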
@MateStrysewske The local health departments often ask for more information on how exactly the duplicate detection works, but we do not have official documentation for it yet. How exactly is the duplicate detection implemented for the different entities at the moment? We need to document that, e.g. for the admin manual. Can you please help here? |
Could you please add an issue to the main GitHub repository to create such a guide and prioritise it accordingly, i.e. for the next sprint if it's urgently needed? |
Situation Description
Duplicate detection in SORMAS is a feature that affects many users. There is currently no documentation that describes which variables are taken into account when this process is executed. This makes it difficult for SORMAS end users to decide which fields to make mandatory to capture, during the contact tracing process for example, to ensure more accurate results.
Feature Description
Create a GitHub Wiki entry dedicated to duplicate detection logic, indicating which variables are taken into consideration during the duplicate check for Person entities, i.e.
Please complete this per server configuration i.e. DE, CH and international, as different variables are visible on these different systems and it is not clear which are/are not relevant for which configuration.
Please also provide link(s) to the file(s) in the source code where this is implemented, so that developers can access it directly.
@bernardsilenou @kwa20 - please include additional entities if needed.
Possible Alternatives
This doesn't necessarily have to be a Wiki entry; it could also be documented elsewhere (publicly). The request is simply to be able to share this information (URL) when this sort of request comes our way, and for it to be self-explanatory enough that anyone (technical or non-technical) can understand it and access the most up-to-date version available.