Link duplication: should be possible to exclude some websites from the duplicate check #298
Comments
The problem with YouTube links is that they carry the video ID as a query parameter, and query parameters are currently ignored when links are checked for duplicates. I have to check a few URLs and think about a proper solution.
An editable whitelist of sites/domains that rely on query params would make sense (pre-populated with YouTube and others).
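A rough Python sketch of how such a whitelist could work, not taken from the project's code. The names `QUERY_SENSITIVE_DOMAINS` and `dedup_key` and the two example domains are assumptions made for illustration; the real list would be user-editable.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical, user-editable list of domains whose query string
# identifies the page. The entries are illustrative only.
QUERY_SENSITIVE_DOMAINS = {"youtube.com", "news.ycombinator.com"}


def _host_matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def dedup_key(url: str) -> str:
    """Build the string used for duplicate comparison.

    The query string is kept only for whitelisted domains and dropped
    for everything else. The fragment is always dropped.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    keep_query = any(_host_matches(host, d) for d in QUERY_SENSITIVE_DOMAINS)
    query = parts.query if keep_query else ""
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))
```

With this, two YouTube links with different `v` values get different keys, while links on other domains still compare equal regardless of their query string.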
Quite a lot of websites use query parameters for page differentiation, including some of my own but also popular third party ones like Hacker News and, uh, a niche little place called youtube indeed :). I haven't run into this myself yet but am surprised to see it's entirely ignored. Might it make more sense to instead do the duplicate check on the whole URL, except perhaps a pre-populated but configurable list of dummy parameters such as |
I am using this in my fork. I strip out all utm parameters and then compare the full URLs:
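A minimal Python sketch of that approach (this is not the code from the fork; the function names `strip_utm` and `is_duplicate` are hypothetical):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_utm(url: str) -> str:
    """Return the URL with every utm_* query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def is_duplicate(new_url: str, existing_url: str) -> bool:
    """Full comparison of the two URLs once utm_* parameters are gone."""
    return strip_utm(new_url) == strip_utm(existing_url)
```

Note that `urlencode` re-encodes the remaining parameters, so a URL with unusual escaping may not round-trip byte for byte. For a duplicate check that is usually acceptable, because both sides go through the same function.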
Thanks for that @sergiorgiraldo, I was going to incorporate that into my local custom version as well but then noticed that this might not be the optimal solution. You triggered me to look into it further :) I started wondering: why is this function parsing and rebuilding the URL at all? We want the query part to be considered, like the
Personally, I would already strip off any utm_whatevers manually. That's not the only tracking parameter that exists, or the only parameter that can be removed. In fact, I would also remove the
However, we do want to check for small variations that are likely duplicates. Typically, a trailing slash or question mark makes no difference. The path
Essentially, we need to store
This is the patch that I came up with:
While testing if this works, I noticed a bug where you can insert data with leading or trailing whitespace into the database, which is just plain ugly and not a correct URL. There is no unique key in the database (imo that would be a good addition), but the
This function gets called when
Summing up, these patches:
I should probably add tests for this... Another nice-to-have would be detecting when you add the |
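The simple checks described in this comment (trimming whitespace before storing, and treating a trailing slash or question mark as insignificant when comparing) could look roughly like the Python sketch below. It is not the patch from the comment; `clean_for_storage`, `comparison_form` and `looks_like_duplicate` are hypothetical names.

```python
def clean_for_storage(url: str) -> str:
    """Trim leading and trailing whitespace before the URL is saved."""
    return url.strip()


def comparison_form(url: str) -> str:
    """Form of the URL used only for the duplicate check.

    The query string is left intact, so ?v=abc and ?v=def stay distinct.
    Only trailing "/" and "?" characters are dropped, since they
    typically make no difference.
    """
    return clean_for_storage(url).rstrip("/?")


def looks_like_duplicate(new_url: str, existing_url: str) -> bool:
    """Compare two URLs on their comparison form."""
    return comparison_form(new_url) == comparison_form(existing_url)
```

The stored value keeps the URL as the user entered it (minus whitespace); the stripped form is used only when comparing.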
As you can both see, it's a bit more complicated to get this right. 😅 The current solution ignores query params, which is obviously not correct. But I can't think of any better solution at the moment.
What about a permitlist of query params? e.g. |
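One way a permitlist could work, sketched in Python: only parameters on the list take part in the comparison, and everything else is ignored. The parameter names in `PERMITTED_PARAMS` and the function name `permitlist_key` are assumptions for illustration, not something the project defines.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical permitlist of parameters that distinguish pages.
PERMITTED_PARAMS = {"v", "id"}


def permitlist_key(url: str) -> str:
    """Comparison key that keeps only permitted query parameters.

    The kept parameters are sorted so that their order in the URL
    does not affect the result. The fragment is dropped.
    """
    parts = urlsplit(url.strip())
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in PERMITTED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

The trade-off compared with a blocklist of tracking parameters is that any site using a parameter name not on the list would still produce false duplicate warnings until the list is extended.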
Late reply but I don't understand: I'm proposing to just check for things which are almost always identical like a (I'm also happy with any other solution like a permitlist or blocklist, although for my use-case the simple checks are easily sufficient)
When I add multiple videos from YouTube, for example, I get a duplicate warning even though they are different videos.
It should be possible to exclude some websites from the duplicate check with an ignore list.