New duplicate algorithm to check for similar entries #52
Conversation
Signed-off-by: George Araújo <[email protected]>
Thanks for this contribution, it looks really nice! It makes a lot of sense to me to want to deduplicate using some fuzzy matching. The code looks clean! I was testing out the features and it seems to be working well. Just a few comments, which I'll put below.
When I was testing with
But thanks again, it looks very nice! We'll also need to have a chat about how this relates to asreview 2.0, @J535D165!
What do you think should be done when verbose is true for the current algorithm? I mean, what would be the expected output? Because the pretty print will probably print everything dimmed in most cases.
Great contribution!! I have recently used fuzzy matching for our FORAS project, where we obtained over 10K records to screen. I checked for duplicates within the dataset and between the titles obtained via the database search and the most likely title match in OpenAlex. I saved the matching score and went through the data, checking these scores from low to high, and I found many fuzzy duplicates of the following type:

- Titles containing extra or different punctuation. All such cases are precise duplicates and can be corrected without losing any information.

But I also found cases with different versions of the same work:
You might want to keep both records in these cases, but the labeling decisions will be the same. So, my question is whether it is possible to store the matching score so that a user can manually check the records with lower matching scores. And, hopefully, my comment helps start the discussion on what to do with fuzzy duplicates in ASReview v2.0 :-)
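The score-storing idea above could be sketched as follows. This is a minimal illustration, not the PR's code: the helper name `add_match_scores` and the `match_score` column are my own assumptions. Sorting by the stored score lets a user review candidate duplicates from least to most similar.

```python
import difflib

import pandas as pd


def add_match_scores(df: pd.DataFrame, column: str = "title") -> pd.DataFrame:
    """Attach each record's best fuzzy-match score against all other records.

    Hypothetical helper, not part of asreview-datatools: records whose
    `match_score` is low are unlikely duplicates; high scores deserve a
    manual check.
    """
    texts = df[column].fillna("").str.lower().tolist()
    scores = []
    for i, text in enumerate(texts):
        # best similarity of this record against any other record
        best = max(
            (difflib.SequenceMatcher(None, text, other).ratio()
             for j, other in enumerate(texts) if j != i),
            default=0.0,
        )
        scores.append(best)
    out = df.copy()
    out["match_score"] = scores
    return out
```

A user could then sort by `match_score` and eyeball the borderline records before deciding what to drop.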
Currently you can choose to print the duplicates on the terminal instead of removing them right away by avoiding the
This should be enough to match most of the cases you pointed out here just by playing with the
Matching on subtitles might be a pitfall, since some papers build on top of others, reusing part of the title, while being completely different papers with similar titles. And sometimes authors like to follow a title trend, as in the "x is all you need" case. The project you mentioned would actually be a great test for this code.
True, I'm not sure what I would want to see in verbose mode for the other case. I would leave it as it is for now. We might want to think in the future about whether we want to maintain this feature and how it would look then, also for the other deduplication. I do think it's very nice to have such verbose output when deduplicating, so that you can clearly see which ones get marked as duplicates and why.
I agree with you. I had a quick look at the ASReviewData.drop_duplicates code, and it resets the index by default, meaning that comparing dataframes before and after the deduplication would fail. I changed the code a little bit to allow verbose when not using similar, and also moved part of the code to
I like the changes! I just have a few comments, but other than that I'm all for merging :)
One question: is it really necessary to enforce a maximum line length of 88? I mean, this line length is too small even for someone working on a laptop screen. For instance, line 124 has 88 columns, and it doesn't even reach half of my laptop screen (it has 1920x1080 resolution). I feel like some print statements in this code would be more readable if this line length were increased a little, and it wouldn't hurt the coding experience.
I kinda like it, I can put two windows next to each other on my laptop.
I usually use line wrap for this. Also, I believe coding on half a monitor is more of an exception than the regular use case. And at least in my case I have access to bigger external monitors, so it is kind of a waste of space. I am not saying that there should be no limit, rather that it could be a little bigger (like, idk, 119?).
Even in your setup the lines exceed the 88-column threshold for half a monitor and you need to scroll right to see the rest, so a line wrap is more fitting, I think. But that is just my personal opinion.
@PeterLombaers do you have any dataset with DOIs that I can use to test the new solution? The change is rather annoying if I am to consider PIDs for the similar case; for instance, this loop has to change: `for j, t in s.iloc[i+1:][abs(s.str.len() - len(text)) < 5].items():`
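For context, the loop quoted above can be read as a length-pruned pairwise comparison. The sketch below is illustrative only (the function name `find_similar` and its defaults are my assumptions, not the PR's actual implementation): entries whose lengths differ by 5 or more characters are skipped before the expensive `SequenceMatcher` call.

```python
import difflib

import pandas as pd


def find_similar(texts: pd.Series, threshold: float = 0.95, max_len_diff: int = 5):
    """Return index pairs of entries more similar than `threshold`.

    Illustrative sketch: the length pre-filter prunes candidate pairs
    whose lengths differ by `max_len_diff` or more, since those cannot
    be near-identical texts.
    """
    s = texts.fillna("").str.lower()
    pairs = []
    for pos, (i, text) in enumerate(s.items()):
        # only later entries with a similar length are candidates
        tail = s.iloc[pos + 1:]
        candidates = tail[abs(tail.str.len() - len(text)) < max_len_diff]
        for j, other in candidates.items():
            if difflib.SequenceMatcher(None, text, other).ratio() >= threshold:
                pairs.append((i, j))
    return pairs
```

Extending this to PIDs would mean carrying a second column through the same loop, which is where the annoyance comes from.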
I believe it is finished now.
Sorry, I feel like there was some miscommunication! In your last commit you implement deduplicating based on PID, but
How about the following plan to keep things simple:
This means that the behaviour with
If you do the first five points on that list, I can help by writing tests, adding the help texts and linting the code. Do you agree with this plan?
BTW, when checking out your pull request, I made the following CSV file to test the different settings:
Might be useful for running sanity checks that everything is working as expected.
Sorry, I disagree. I agree with what you previously said (or at least that's what I understood): that it is best to have a streamlined functionality that looks like one single tool, not like two separate tools glued together. Also, one should not need to run once without
I have another proposition: make all output look like the version with
What I am thinking now for the case without
What do you think? Also, help for the tests and linting would be great.
The reason I suggest keeping it split is that in the near future (a few months), ASReview version 2 will be released. The
If we keep the two methods (with and without
That makes sense. So what do you think I should do? Wait for V2?
I leave the choice up to you. We can add it to v1 with the plan I wrote above, or we leave this pull request open for now and then integrate your deduplication algorithm in the v2 deduplication. Then we'll be careful that you get the correct recognition as a contributor when we implement the deduplication in v2.
I can keep them separate for now and get the PR accepted so that some users (including me) might benefit from it. Then, when V2 launches, I can get back to it and port the solution. What do you think?
I added the option to check for duplicate entries based on the similarity of the title and abstract. Sometimes we can have a duplicate entry that is a fixed version of another entry, with a corrected typo or an added comma, for example.
I decided to go with difflib.SequenceMatcher for this similarity, since it is a built-in solution. I added the options to use only the title for this check, to set the similarity score, and to discard stopwords for a stricter check considering only the useful words, and also added pretty diff print support thanks to the rich library:
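The options described above could look roughly like this. A minimal sketch, not the PR's actual code: the function name `is_similar` and the tiny stopword list are my own assumptions (the real implementation presumably uses a fuller list).

```python
import difflib
import re

# Small illustrative stopword list; a real implementation would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


def is_similar(a: str, b: str, threshold: float = 0.95,
               drop_stopwords: bool = False) -> bool:
    """Fuzzy-compare two texts with difflib.SequenceMatcher.

    Hypothetical helper sketching the described options: a tunable
    similarity threshold, and optional stopword removal for a stricter
    check that considers only the meaningful words.
    """
    def normalize(text: str) -> str:
        # lowercase and strip punctuation, so a fixed typo or an added
        # comma does not hide a duplicate
        words = re.findall(r"\w+", text.lower())
        if drop_stopwords:
            words = [w for w in words if w not in STOPWORDS]
        return " ".join(words)

    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold
```

Restricting the check to the title only would simply mean passing the title column instead of a concatenated title-plus-abstract string.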