
python linkchecker component #1

Open
pvgenuchten opened this issue Apr 18, 2024 · 2 comments
pvgenuchten (Contributor):
The Python linkchecker is a basic checker: it crawls a series of web pages, identifies the links on them, and tries to resolve each one. It can report results in a number of output formats (xml, sitemap, sql, and others).

The checker can run as a CI/CD script in a container:

docker run --rm -it -u $(id -u):$(id -g) ghcr.io/linkchecker/linkchecker:latest --verbose https://www.example.com

Results can be written to a location where a follow-up process picks them up, for example a set of SQL statements to be run against a database.
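As a sketch of that hand-off, the snippet below turns link-check results into SQL INSERT statements that a later pipeline step could run against a database. The table name "linkstatus" and its column layout are assumptions for illustration, not a schema the linkchecker tool prescribes.

```python
def sql_inserts(rows, table="linkstatus"):
    """Build SQL INSERT statements from (url, parent_url, result) tuples.

    The "linkstatus" table and column names are hypothetical; adapt them
    to whatever schema the receiving database actually uses.
    """
    stmts = []
    for url, parent, result in rows:
        # Escape single quotes so each statement stays valid SQL.
        vals = ", ".join(
            "'" + str(v).replace("'", "''") + "'"
            for v in (url, parent, result)
        )
        stmts.append(
            f"INSERT INTO {table} (url, parent_url, result) VALUES ({vals});"
        )
    return "\n".join(stmts)


if __name__ == "__main__":
    print(sql_inserts([
        ("https://www.example.com", "https://example.org", "200 OK"),
    ]))
```

The resulting .sql file could then be stored as a pipeline artifact for the next CI/CD step to apply.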

pvgenuchten changed the title from "introduce linkchecker component" to "python linkchecker component" on Apr 18, 2024.
pvgenuchten (Contributor, Author):
We already identified some challenges with this tool (and other tools):

  • linkchecker cannot send an Accept: text/html header; pycsw currently requires this header to return HTML, so add ?f=html as a workaround
  • the pycsw maxpagesize (currently 20) is too low; better to use 100
  • use --verbose and --check-extern to enable full checking and logging

Example:

linkchecker https://soilwise-he.containers.wurnet.nl/cat/collections/metadata:main/items?f=html --verbose --check-extern
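The ?f=html workaround above can be applied programmatically when building the list of URLs to check. This small helper (a sketch; the function name is an assumption, not part of linkchecker or pycsw) appends f=html while preserving any query parameters already present:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_html_format(url):
    """Return the URL with f=html added, keeping existing query parameters.

    If an f= parameter is already present (e.g. f=json), it is left alone.
    """
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query.setdefault("f", "html")  # don't clobber an explicit f= parameter
    return urlunparse(parts._replace(query=urlencode(query)))
```

For example, with_html_format("https://catalogue.example/items?limit=100") keeps the limit parameter and appends f=html.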

pvgenuchten (Contributor, Author) commented Apr 18, 2024:

Since the SoilWise catalogue is not online yet, the suggestion is to use the EJP SOIL catalogue:

https://catalogue.ejpsoil.eu/collections/metadata:main/items?f=html

Not sure whether linkchecker properly fetches each page of the paginated search result; if not, an option would be to check links per page.

etc...
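The per-page option could be sketched as follows. This assumes the catalogue supports the OGC API limit/offset paging parameters (as pycsw does) and reuses the maxpagesize suggestion of 100; the function name and the idea of invoking linkchecker once per page are assumptions, not an existing feature of the tool.

```python
def page_urls(base, total, limit=100):
    """Yield one items URL per page of a paginated search result.

    base  -- the collection items endpoint, without query parameters
    total -- total number of records (e.g. read from the API's numberMatched)
    limit -- page size; 100 follows the maxpagesize suggestion above
    """
    for offset in range(0, total, limit):
        yield f"{base}?f=html&limit={limit}&offset={offset}"


# Each page could then be handed to linkchecker individually, e.g.
#   subprocess.run(["linkchecker", url, "--verbose", "--check-extern"])
if __name__ == "__main__":
    for url in page_urls(
        "https://catalogue.ejpsoil.eu/collections/metadata:main/items", 250
    ):
        print(url)
```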

The goal is to:

  • set up a CI/CD pipeline (in GitLab) which runs at a weekly interval
  • save the results of the linkchecker as SQL inserts or XML
  • have a next CI/CD step pick up the content and push it to a database
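The goals above could take roughly this shape in a .gitlab-ci.yml. This is only a sketch under assumptions: the job names, the use of linkchecker's sql file output (which by default writes linkchecker-out.sql), the CATALOGUE_URL and DATABASE_URL variables, and the psql loading step are all hypothetical, and the container image's entrypoint may need overriding in a real pipeline. The weekly interval itself is configured as a scheduled pipeline in the GitLab UI.

```yaml
# Hypothetical .gitlab-ci.yml sketch, not an existing pipeline.
linkcheck:
  image: ghcr.io/linkchecker/linkchecker:latest
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"   # triggered by a weekly scheduled pipeline
  script:
    # sql file output; default output filename is linkchecker-out.sql
    - linkchecker "$CATALOGUE_URL" --verbose --check-extern --file-output=sql
  artifacts:
    paths:
      - linkchecker-out.sql

push-to-db:
  needs: [linkcheck]
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    # Next step picks up the artifact and pushes it to a database;
    # psql and DATABASE_URL stand in for whatever client the DB uses.
    - psql "$DATABASE_URL" -f linkchecker-out.sql
```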
