
New duplicate algorithm to check for similar entries #52

Open · wants to merge 12 commits into main

Conversation

george-gca

I added the option to check for duplicate entries based on the similarity of the title and abstract. Sometimes we can have a duplicate entry that is a fixed version of another entry, with a corrected typo or added comma, for example.

I decided to go with difflib.SequenceMatcher for this similarity check, since it is a built-in solution. I added options to use only the title for the check, to set the similarity threshold, and to discard stopwords for a stricter check that considers only the useful words. I also added pretty diff printing, thanks to the rich library:

[screenshot: dedup_similar — pretty diff output]
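
For readers skimming the thread, here is a minimal sketch of the kind of check described above. It is not the PR's actual implementation; the function name, the threshold value, and the tiny stopword list are illustrative only.

```python
from difflib import SequenceMatcher

# Illustrative stopword list only; a real implementation would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "in", "on", "for"}

def similarity(a: str, b: str, discard_stopwords: bool = False) -> float:
    """Return a 0..1 similarity score between two strings."""
    if discard_stopwords:
        a = " ".join(w for w in a.lower().split() if w not in STOPWORDS)
        b = " ".join(w for w in b.lower().split() if w not in STOPWORDS)
    return SequenceMatcher(None, a, b).ratio()

# Title-only vs. title + abstract is just a matter of what you feed in.
title_a = "This is a title of a paper"
title_b = "This is a title of a paper&"
print(similarity(title_a, title_b) >= 0.95)  # True: flagged as a near-duplicate
```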

@J535D165 added the enhancement (New feature or request) label on Jan 9, 2025
Signed-off-by: George Araújo <[email protected]>
@PeterLombaers
Member

Thanks for this contribution, it looks really nice! It makes a lot of sense to me to want to deduplicate using some fuzzy matching. The code looks clean! I was testing out the features and it seems to be working well. Just a few comments, which I'll put in the comments below.

@PeterLombaers
Member

When I was testing with --verbose, I got confused when I did not see any pretty printing. I think this happened because my duplicates were already dropped by the original dedup method. When I added rows that were dropped only by your method, I indeed saw the nice diff. If I'm using --verbose, I expect to get feedback no matter which method does the deduplicating.

@PeterLombaers
Member

But thanks again, it looks very nice! We'll also need to have a chat about how this relates to asreview2.0 @J535D165 !

@george-gca
Author

george-gca commented Jan 9, 2025

What do you think should be done when verbose is true for the current algorithm? I mean, what would be the expected output? Because the pretty print probably will print everything dimmed in most cases.

Signed-off-by: George Araújo <[email protected]>
@Rensvandeschoot
Member

Great contribution!! I have recently used fuzzy matching for our FORAS project, where we obtained over 10K records to screen. I checked for duplicates within the dataset and between the titles obtained via the database search and the most likely title match in OpenAlex. I saved the matching score, went through the data checking these scores from low to high, and found many fuzzy duplicates of the following types:

• Titles containing extra or different punctuation.
• One title has a spelling mistake corrected in the other.
• Presence of HTML code in one title (e.g., PTSD), but not in the other.
• An abstract number at the beginning of one title (e.g., “T262.”), missing from the other.
• A subtitle in one record versus a single-title format in the other.

All such cases are true duplicates and can be corrected without losing any information.

But I also found cases with different versions of the same work:

  • re-prints of the same paper in a different journal,
  • pre-print + journal version,
  • conference abstract + journal version,
  • book chapter + journal version,
  • dissertation + journal version,
  • version 1 + version 2 of the same paper

You might want to keep both records in these cases, but the labeling decisions will be the same.

So, my question is whether it is possible to store the matching score so that a user can manually check the records with lower matching scores?

And, hopefully, my comment helps start the discussion on what to do with fuzzy duplicates in ASReview v2.0 :-)

@george-gca
Author

george-gca commented Jan 10, 2025

Currently you can choose to print the duplicates to the terminal instead of removing them, by omitting the -o option. It will print the line numbers of the entries and a pretty diff between them, but in order of discovery, not by score. Ordering by score could be added as an option, for instance, or the score could be added as a column to the dataset like you said. Which option would be best? Also, wouldn't adding this column to the dataset somehow affect using the dataset with ASReview, or does it simply ignore extra columns?
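
As a rough illustration of the score-column idea (this is not the PR's code; the column name and the all-pairs loop are for demonstration only, and would be slow on large datasets without a pre-filter):

```python
from difflib import SequenceMatcher
import pandas as pd

df = pd.DataFrame({"title": [
    "this is a title of a paper",
    "this is a title of a paper&",
    "a completely different study",
]})

# Keep, for every row, the best similarity score against any other row.
best = [0.0] * len(df)
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        score = SequenceMatcher(None, df["title"].iat[i], df["title"].iat[j]).ratio()
        best[i] = max(best[i], score)
        best[j] = max(best[j], score)

df["dedup_score"] = best  # users could then review rows with low scores manually
print(df.sort_values("dedup_score"))
```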

This should be enough to match most of the cases you pointed out just by playing with the threshold param, since I do some cleaning before doing the actual match. I've also added an option to remove stopwords before checking, to make the titles even more similar. The cases that could fall short are:

  • presence/absence of HTML or LaTeX code, if it is too long
  • a subtitle added to the title

The added subtitle might be a pitfall, since some papers build on top of others, including in the title, while being completely different papers with similar titles. And sometimes authors like to follow a title trend, like the "x is all you need" case.

The project you mentioned would actually be a great test for this code.
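
To make the "cleaning before the actual match" concrete: a rough sketch of the kind of normalization that handles the punctuation/HTML cases listed by @Rensvandeschoot. The regexes here are assumptions, not the PR's actual code, and long embedded HTML or LaTeX would still lower the SequenceMatcher ratio noticeably.

```python
import re

def normalize(title: str) -> str:
    """Illustrative normalization before fuzzy matching."""
    text = re.sub(r"<[^>]+>", " ", title)   # strip HTML tags such as <h2>...</h2>
    text = re.sub(r"[^\w\s]", " ", text)    # strip punctuation like '&' or '%^'
    return re.sub(r"\s+", " ", text).strip().lower()

print(normalize("<h2>This is a title of a paper<h2>"))  # 'this is a title of a paper'
print(normalize("this is a title%^ of a paper"))        # 'this is a title of a paper'
```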

@PeterLombaers
Member

What do you think should be done when verbose is true for the current algorithm? I mean, what would be the expected output? Because the pretty print probably will print everything dimmed in most cases.

True, I'm not sure what I would want to see in verbose mode for the other case. I would leave it as it is for now. We might need to think in the future about whether we want to maintain this feature and how it would look then, also for the other deduplication. I do think it's very nice to have such verbose output when deduplicating, so that you can clearly see which entries get marked as duplicates and why.

Signed-off-by: George Araújo <[email protected]>
@george-gca
Author

I agree with you. I had a quick look at the ASReviewData.drop_duplicates code, and it resets the index by default, meaning that comparing dataframes before and after the deduplication would fail.

I changed the code a little bit to allow verbose output when not using --similar, and also moved part of the code to dedup.py, since there was starting to be too much code in the entrypoint.
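
A tiny pandas illustration of the index-reset pitfall mentioned above, using plain DataFrame.drop_duplicates as a stand-in for the ASReviewData method:

```python
import pandas as pd

df = pd.DataFrame({"title": ["a", "a", "b"]})
deduped = df.drop_duplicates().reset_index(drop=True)

# Rows 0 and 2 survive, but after the reset they are relabelled 0 and 1, so a
# naive index comparison points at row 2 instead of the dropped duplicate, row 1.
print(df.index.difference(deduped.index))
```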

@PeterLombaers left a comment
Member

I like the changes! I just have a few comments, but other than that I'm all for merging :)

(Three review comments on asreviewcontrib/datatools/dedup.py, since resolved.)
@george-gca
Author

One question: is it really necessary to enforce a maximum line length of 88? This line length feels too small even for someone working on a laptop screen.

[screenshot: editor window showing an 88-column line]

For instance, line 124 has 88 columns and doesn't even reach half of my laptop screen (1920×1080 resolution). I feel like some print statements in this code would be more readable if the line length were increased a little, and it wouldn't hurt the coding experience.

@PeterLombaers
Member

I kinda like it, I can put two windows next to each other on my laptop.

@george-gca
Author

I usually use line wrap for this. Also, I believe coding in half a monitor is more of an exception than the regular use case, and at least in my case I have access to bigger external monitors, so it is kind of a waste of space. I am not saying that there should be no limit, rather that it could be a little bigger (like, idk, 119?).

@PeterLombaers
Member

[screenshot: two editor windows side by side on a laptop screen]
This is my usual setup, which actually fits on my laptop, so that's why I like it. But feel free to file an issue on the asreview repo and see if you can convince @J535D165 😄.

@george-gca
Author

Even in your setup the lines go past the 88-column threshold for half a monitor, and you need to scroll right to see the rest, so a line wrap is more fitting I think. But that is just my personal opinion.

@george-gca
Author

@PeterLombaers do you have any dataset with DOIs that I can use to test the new solution? The change is rather annoying if I am to consider PIDs for the similar case; for example, this loop has to change:

for j, t in s.iloc[i+1:][abs(s.str.len() - len(text)) < 5].items():
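
For context, a sketch of how a pre-filter like that could sit inside the comparison loop (this is an assumed structure, not the actual PR code): the length check cheaply skips candidates that cannot be similar before running SequenceMatcher.

```python
from difflib import SequenceMatcher
import pandas as pd

def find_similar_pairs(s: pd.Series, threshold: float = 0.95, max_len_diff: int = 5):
    """Yield (i, j, score) for pairs of entries that look like near-duplicates (illustrative)."""
    s = s.reset_index(drop=True)
    lengths = s.str.len()
    for i, text in enumerate(s):
        # Cheap pre-filter: only compare against later rows of roughly the same length.
        later = s.iloc[i + 1:]
        candidates = later[(lengths.iloc[i + 1:] - len(text)).abs() < max_len_diff]
        for j, other in candidates.items():
            score = SequenceMatcher(None, text, other).ratio()
            if score >= threshold:
                yield i, j, score

titles = pd.Series(["this is a title of a paper", "this is a title of a paper&", "unrelated"])
print(list(find_similar_pairs(titles)))
```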

@george-gca
Author

I believe now it is finished.

@PeterLombaers
Member

Sorry, I feel like there was some miscommunication! In your last commit you implement deduplication based on PID, but ASReviewData.drop_duplicates already does this, so having a second implementation will only confuse matters.

How about the following plan to keep things simple:

  • Revert the last commit
  • If --similar, we use your method
  • Else, we use ASReviewData.drop_duplicates, i.e. simply how it is in the current version of datatools
  • If you want to use both, you have to call asreview data dedup twice, once with --similar and once without
  • Any extra arguments like --verbose, --title_only, etc. will only apply in case of the --similar flag
  • The help text of each argument says whether it applies only to --similar
  • We add a test to check that the different options give the correct deduplication

This means that the behaviour with --similar and without is not quite the same. For example, if you don't use --similar, you always use both title and abstract for deduplicating, but if you use --similar, you can select --title_only. But I'm fine with that difference for now; let's get your PR into shape to merge, then we can always look later at streamlining the datatools API as a whole.

If you do the first five points on that list, I can help by writing tests, adding the help texts and linting the code.

Do you agree with this plan?

@PeterLombaers
Member

BTW, when checking out your pull request, I made the following CSV file to test the different settings:

abstract, title, doi, publication_year
abstract, <h2>this is a title of a paper<h2>, doi1, 2005
abstract, this is a title of a paper, doi1, 2001
abstract, this is a title of a paper&, doi3,
abstract, this is the title of a paper, doi4, 2003
abstract, this is a title%^ of a paper, doi5, 2004
abstract, this is a title of a pãper, doi6, 2000

Might be useful for running sanity checks that everything is working as expected.

@george-gca
Author

If you do the first five points on that list, I can help by writing tests, adding the help texts and linting the code.

Do you agree with this plan?

Sorry, I disagree. I agree with what you said previously (or at least that's what I understood): that it is best to have streamlined functionality that looks like a single tool, not like two separate tools glued together. Also, one should not need to run once without --similar and once with --similar, since an exact copy is also similar according to the string difference algorithm.

I have another proposal: make all output look like the version with the --similar flag. The difference is that without the flag we look for exact matches, roughly as is currently done, with only a few changes so I can find which entry is the original and which is the duplicate, and we keep the --similar case as it is. I won't suggest always using the same algorithm, because without --similar it is way faster.

What I am thinking now for the case without --similar is (a rough sketch follows the list):

  • remove duplicates and compare the sizes before and after removal
  • if the size changed (meaning some duplicates were removed), check the original data again for duplicates (now with keep=False, so we flag all entries that have copies, not only the duplicates) and create a dataframe with only those copies
  • run the same algorithm used in the --similar case, but only on this subset of the data that have/are copies; this way we can find exactly which row each duplicate is a duplicate of, and running on a potentially much smaller subset is faster
  • use the same printing as in the --similar case
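
A minimal sketch of that flow, assuming pandas DataFrames (the function name and columns are illustrative, and the pretty-diff step from the --similar path is left out for brevity):

```python
import pandas as pd

def exact_dedup_with_report(df: pd.DataFrame, cols=("title", "abstract")) -> pd.DataFrame:
    """Illustrative sketch of the proposed flow; not the PR's implementation."""
    cols = list(cols)
    deduped = df.drop_duplicates(subset=cols)
    if len(deduped) == len(df):
        return deduped  # nothing was removed, nothing to report

    # keep=False flags originals *and* duplicates, so the detailed reporting
    # (or the --similar matching) only needs to run on this smaller subset.
    copies = df[df.duplicated(subset=cols, keep=False)]
    for _, group in copies.groupby(cols):
        original, *duplicates = group.index.tolist()
        for dup in duplicates:
            print(f"row {dup} is an exact duplicate of row {original}")
    return deduped

df = pd.DataFrame({
    "title": ["paper a", "paper a", "paper b"],
    "abstract": ["abstract", "abstract", "other"],
})
exact_dedup_with_report(df)
```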

What do you think? Also, help with the tests and linting would be great.

@PeterLombaers
Member

The reason I suggest keeping it split is that in the near future (a few months), ASReview version 2 will be released. The ASReviewData object will change, and how we deal with duplicates inside ASReview might also change. The current deduplication without --similar makes use of ASReviewData.drop_duplicates, so if we change that now we will have to do the same work again for ASReview v2.

If we keep the two methods (with and without --similar) separate for now, we can unify them once v2 lands. The --similar method can live completely inside asreview-datatools, and I don't mind making a new release of datatools for the updated dedup method. But I don't think we want any more releases for ASReview v1 that are not bugfixes.

@george-gca
Author

That makes sense. So what do you think I should do? Wait for V2?

@PeterLombaers
Member

I'll leave the choice up to you. We can add it to v1 with the plan I wrote above, or we leave this pull request open for now and integrate your deduplication algorithm into the v2 deduplication. Then we'll make sure you get the proper recognition as a contributor when we implement the deduplication in v2.

@george-gca
Author

I can keep them separate for now and get the PR accepted, so some users (including me) can benefit from it. Then, when v2 launches, I can get back to it and port the solution. What do you think?
