
A lot of pages have some links "screwed". We need a filter to hot-fix these somehow. #15

Open
Lisias opened this issue Nov 24, 2024 · 0 comments

Lisias commented Nov 24, 2024

I found these two URLs in my "ALL" report this month (they may well have been there before; I only noticed them today):

https://forum.kerbalspaceprogram.com/%7B___base_url___%7D/index.php?/profile/128696-killashley/
https://forum.kerbalspaceprogram.com/%7B___base_url___%7D/index.php?/profile/42312-alexsheff/

Note the %7B___base_url___%7D substring, which decodes to {___base_url___}. This is almost surely caused by a missing $ after the opening curly brace.
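The decoding mentioned above can be checked with Python's standard library. This is only a quick sketch; the variable names are illustrative and not part of any existing tooling.

```python
from urllib.parse import unquote

# One of the two URLs quoted above.
url = (
    "https://forum.kerbalspaceprogram.com/%7B___base_url___%7D/"
    "index.php?/profile/128696-killashley/"
)

# unquote() reverses percent-encoding: %7B -> "{" and %7D -> "}".
decoded = unquote(url)
print(decoded)
```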

Curious about the issue, and knowing that this kind of problem reproduces like rabbits :P, I coded a quick report of all the occurrences in the current (and WIP) WARCs, and boy, I found a lot (note: the file is in CSV format; ignore anything starting with #): Uploading url_weirdities.csv…

The lowest affected thread id is 278, and the highest is 209425.

`grep -Eo 'https://forum\.kerbalspaceprogram\.com/index\.php\?/topic/[0-9]+-' url_weirdities.csv | sed -E 's#^https://forum\.kerbalspaceprogram\.com/index\.php\?/topic/([0-9]+)-$#\1#' | sort -n | uniq`
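For reference, the same extraction can be sketched in Python. The column layout of url_weirdities.csv is not shown here, so this sketch assumes only what is stated above: lines starting with # are ignored, and topic URLs appear somewhere in the remaining lines. The function name `affected_topic_ids` is hypothetical.

```python
import re

# Matches topic URLs and captures the numeric thread id before the first "-".
TOPIC_RE = re.compile(
    r"https://forum\.kerbalspaceprogram\.com/index\.php\?/topic/([0-9]+)-"
)

def affected_topic_ids(lines):
    """Return the sorted, de-duplicated thread ids found in the given lines."""
    ids = set()
    for line in lines:
        if line.startswith("#"):  # comment lines in the report are skipped
            continue
        for match in TOPIC_RE.finditer(line):
            ids.add(int(match.group(1)))
    return sorted(ids)

# Possible usage (assumes the report file is in the current directory):
# with open("url_weirdities.csv", encoding="utf-8") as report:
#     ids = affected_topic_ids(report)
#     print(ids[0], ids[-1])
```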

Fixing the problem in the WARC files is out of the question (they need to stay exactly as I fetched them), so we need to find a way to work around these problems.

A filter on the playback machine that detects and fixes these will do, but we will then need a cache to keep the thing responsive: Python is not exactly the fastest cookie in the jar.
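A minimal sketch of such a filter with a cache, under two assumptions that still need confirming: that the correct URL is obtained by simply dropping the stray placeholder segment, and that the playback machine offers some hook where a rewrite function can be plugged in (none is shown here). The names `fix_url` and `PLACEHOLDER_RE` are hypothetical.

```python
import re
from functools import lru_cache

# The placeholder segment, percent-encoded or raw (assumption: these are the
# only two forms that occur in the archived pages).
PLACEHOLDER_RE = re.compile(r"/(?:%7B|\{)___base_url___(?:%7D|\})", re.IGNORECASE)

@lru_cache(maxsize=65536)  # memoize results so repeated URLs skip the regex
def fix_url(url: str) -> str:
    """Drop the stray placeholder segment; untouched URLs pass through as-is."""
    return PLACEHOLDER_RE.sub("", url, count=1)

broken = (
    "https://forum.kerbalspaceprogram.com/%7B___base_url___%7D/"
    "index.php?/profile/42312-alexsheff/"
)
print(fix_url(broken))
```

The WARC content stays untouched this way; only the links served at playback time would be rewritten, and `fix_url.cache_info()` can be used to check how often the cache actually helps.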

@Lisias Lisias self-assigned this Nov 24, 2024