Periodic CI to find links that have gone bad #1431
Comments
I tried https://www.drlinkcheck.com and got these results. It's a $10 monthly subscription for up to 10,000 links. Overview: four bad links found among the 1500 checked.
Sandbox does this for us periodically, though I think it would not be a bad idea to also institute some regular checking within our repo. I have actions for the bssw-tutorial website that do this on files that change (though this has had some problems lately) and on a schedule for the whole repository. See https://github.com/bssw-tutorial/bssw-tutorial.github.io/blob/main/.github/workflows/check-pr-urls.yml and https://github.com/bssw-tutorial/bssw-tutorial.github.io/blob/main/.github/workflows/check-all-urls.yml.
Thanks for mentioning. Took a very quick look at this run and it looks like it spews all URLs it checked (some timed out... what does that mean?) and then lists those it thinks failed. I clicked on some of the failed links and they worked. Something like this is likely sensitive to intermittent issues (in servers, networking, etc.). We could expand to maintain a list of failed URLs over several successive checks and flag a URL as bad only if it's gone into a consistent failure state. That would take a bit more work because it would require maintaining a list of the failures across CI runs. But I think it's possible.
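To make the "consistent failure state" idea concrete, here is a minimal Python sketch of what tracking failures across runs could look like. The state-file name and the three-run threshold are assumptions for illustration only; no existing checker discussed here works this way.

```python
# Sketch: flag a URL as "bad" only after it has failed in N consecutive
# CI runs. The state file name and threshold are hypothetical choices.
import json
from pathlib import Path

STATE_FILE = Path("url-failures.json")  # hypothetical file persisted across runs
THRESHOLD = 3  # consecutive failing runs before a link counts as bad

def update_failure_counts(failed_urls, state_file=STATE_FILE):
    """Increment counts for URLs that failed this run; URLs that passed drop out."""
    counts = json.loads(state_file.read_text()) if state_file.exists() else {}
    counts = {url: counts.get(url, 0) + 1 for url in failed_urls}
    state_file.write_text(json.dumps(counts, indent=2))
    return counts

def consistently_bad(counts, threshold=THRESHOLD):
    """Return URLs that have failed in `threshold` or more consecutive runs."""
    return sorted(url for url, n in counts.items() if n >= threshold)
```

For this to work in CI, the state file would have to be persisted between runs somehow (e.g., cached by the workflow or committed to a branch).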
Yes, it lists all URLs it checks and then the ones that failed. There are configuration options for the timeout on checks and the number of retries to attempt. There is also an exclusion list of links not to check. I had to tweak those some when I first set it up. It's not perfect -- I occasionally have experiences like you had. DOI links in particular like to fail, even though they are in fact good, and because they're DOIs, there is a commitment behind them that they not disappear. My guess is that may have to do with the redirects (from doi.org to the actual provider). Occasionally other things fail the test too. My strategy for bssw-tutorial is to check the links that fail, and if they work when I check, I just let it go. If I see something failing several times in a row, I'll consider adding it to the exclusion list. This works fine for the tutorial. For something on a larger scale, like bssw.io, I can imagine that this might be more of a problem. There are other URL-checker actions out there. I'm using one by Vanessa Sochat, which was simple to adopt in part because she provided several good examples. I don't recall anything that described saving a list of failing URLs and comparing from run to run. But that doesn't mean such a thing doesn't exist -- or we could write one, of course. I will note that in the tutorial, and in general, there are two different use cases:
- checking the links in files changed by a pull request, and
- periodically checking all of the links in the repository.
They don't necessarily have to use the same tools. It would be nice if they could share a common exclusion list (if necessary).
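As a rough illustration of the knobs described above (per-request timeout, retry count, and an exclusion list), here is a minimal Python sketch. The function names and glob-style exclusion patterns are my own assumptions, not the configuration of any particular checker action.

```python
# Sketch of a checker with a timeout, retries, and an exclusion list.
# Names and pattern syntax are illustrative, not from an existing tool.
import fnmatch
import urllib.request

def is_excluded(url, exclude_patterns):
    """True if the URL matches any glob-style exclusion pattern."""
    return any(fnmatch.fnmatch(url, pat) for pat in exclude_patterns)

def urls_to_check(urls, exclude_patterns):
    """Drop excluded URLs before checking."""
    return [u for u in urls if not is_excluded(u, exclude_patterns)]

def check_url(url, timeout=10, retries=2):
    """Return True if the URL answers OK within `retries + 1` attempts."""
    for _ in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status < 400
        except Exception:
            continue  # retry on timeouts, DNS errors, HTTP errors, ...
    return False
```

An exclusion pattern like `https://doi.org/*` would sidestep the DOI-redirect false positives mentioned above, at the cost of never catching a genuinely dead DOI.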
Or maybe it's time to update or remove that link?
Sorry. 90%+ of the failures are false positives. If it is evident that it is a real failure, of course I'll find an alternative or remove it. The frequent false positives I will add to the exclusion list. |
I have fixed broken links that Sandbox sent us. We have around 70 broken links, while https://www.drlinkcheck.com/ showed us 5. I am not sure what tool the former folks are using.
Just a note on this: There is an available GitHub action for this. The US-RSE uses it for checking spelling and links - https://github.com/USRSE/usrse.github.io/blob/main/.github/workflows/linting.yaml |
@bartlettroscoe : is this action worth implementing for us? |
Every maintained website should be running regular link checks and fixing issues as they come up.
I do this for the bssw-tutorial website. Every PR gets checked, and we periodically check all of the links on the site. https://github.com/bssw-tutorial/bssw-tutorial.github.io/tree/main/.github/workflows has two actions, which we should be able to use with little or no modification on BSSw:
- check-pr-urls.yml, which checks links in files changed by a PR, and
- check-all-urls.yml, which checks all of the site's links on a schedule.
A lot of what we publish has links. I think links are really important. But they also go stale over time as other content hosts change how the content we are linking to gets hosted.
We should have something that runs periodically, maybe once a week, and generates a report of bad links.
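The report step could be as simple as this hypothetical Python sketch, which turns per-URL results into a short Markdown summary a weekly job could post (the input shape, a URL-to-boolean mapping, is an assumption):

```python
# Sketch: render link-check results as a Markdown checklist.
# The results format (url -> True if OK) is a hypothetical choice.
def link_report(results):
    """results: dict mapping URL -> True (ok) / False (failed)."""
    bad = sorted(url for url, ok in results.items() if not ok)
    lines = [f"# Link check: {len(bad)} bad of {len(results)} checked"]
    lines += [f"- [ ] {url}" for url in bad]
    return "\n".join(lines)
```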