
python linkchecker component #1

Open
pvgenuchten opened this issue Apr 18, 2024 · 2 comments
pvgenuchten (Contributor):
The Python linkchecker is a basic checker: it crawls a series of web pages, identifies the links on them, and tries to resolve each one. It can report results in a number of output formats (xml, sitemap, sql, and others).

The checker can run as a CI/CD script in a container:

docker run --rm -it -u $(id -u):$(id -g) ghcr.io/linkchecker/linkchecker:latest --verbose https://www.example.com

Results can be written to a location where a follow-up process picks them up, for example a set of SQL statements to be run against a database.
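As a sketch of that hand-off, the snippet below turns link-check results into SQL INSERT statements that a later pipeline step could run against a database. The table name "linkstatus" and its column layout are assumptions for illustration, not a schema the linkchecker tool prescribes.

```python
def sql_inserts(rows, table="linkstatus"):
    """Build SQL INSERT statements from (url, parent_url, result) tuples.

    The "linkstatus" table and column names are hypothetical; adapt them
    to whatever schema the receiving database actually uses.
    """
    stmts = []
    for url, parent, result in rows:
        # Escape single quotes so each statement stays valid SQL.
        vals = ", ".join(
            "'" + str(v).replace("'", "''") + "'"
            for v in (url, parent, result)
        )
        stmts.append(
            f"INSERT INTO {table} (url, parent_url, result) VALUES ({vals});"
        )
    return "\n".join(stmts)


if __name__ == "__main__":
    print(sql_inserts([
        ("https://www.example.com", "https://example.org", "200 OK"),
    ]))
```

The resulting .sql file could then be stored as a pipeline artifact for the next CI/CD step to apply.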

pvgenuchten changed the title from "introduce linkchecker component" to "python linkchecker component" on Apr 18, 2024.
pvgenuchten (Contributor, Author):
We already identified some challenges with this tool (and other tools):

  • linkchecker cannot send an Accept: text/html header; pycsw currently requires this header to return HTML, so add ?f=html as a workaround
  • the pycsw maxpagesize (currently 20) is too low; better to use 100
  • use --verbose and --check-extern to enable full checking and logging

Example:

linkchecker https://soilwise-he.containers.wurnet.nl/cat/collections/metadata:main/items?f=html --verbose --check-extern
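The ?f=html workaround above can be applied programmatically when building the list of URLs to check. This small helper (a sketch; the function name is an assumption, not part of linkchecker or pycsw) appends f=html while preserving any query parameters already present:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_html_format(url):
    """Return the URL with f=html added, keeping existing query parameters.

    If an f= parameter is already present (e.g. f=json), it is left alone.
    """
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query.setdefault("f", "html")  # don't clobber an explicit f= parameter
    return urlunparse(parts._replace(query=urlencode(query)))
```

For example, with_html_format("https://catalogue.example/items?limit=100") keeps the limit parameter and appends f=html.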

pvgenuchten (Contributor, Author) commented Apr 18, 2024:

Since the SoilWise catalogue is not online yet, the suggestion is to use the EJP SOIL catalogue:

https://catalogue.ejpsoil.eu/collections/metadata:main/items?f=html

Not sure whether linkchecker properly fetches each page of the paginated search result; if not, an option would be to check links per page.

etc...
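The per-page option could be sketched as follows. This assumes the catalogue supports the OGC API limit/offset paging parameters (as pycsw does) and reuses the maxpagesize suggestion of 100; the function name and the idea of invoking linkchecker once per page are assumptions, not an existing feature of the tool.

```python
def page_urls(base, total, limit=100):
    """Yield one items URL per page of a paginated search result.

    base  -- the collection items endpoint, without query parameters
    total -- total number of records (e.g. read from the API's numberMatched)
    limit -- page size; 100 follows the maxpagesize suggestion above
    """
    for offset in range(0, total, limit):
        yield f"{base}?f=html&limit={limit}&offset={offset}"


# Each page could then be handed to linkchecker individually, e.g.
#   subprocess.run(["linkchecker", url, "--verbose", "--check-extern"])
if __name__ == "__main__":
    for url in page_urls(
        "https://catalogue.ejpsoil.eu/collections/metadata:main/items", 250
    ):
        print(url)
```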

The goal is to:

  • set up a CI/CD pipeline (in GitLab) which runs at a weekly interval
  • save the results of the linkchecker as SQL inserts or XML
  • have a next CI/CD step pick up the content and push it to a database
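The goals above could take roughly this shape in a .gitlab-ci.yml. This is only a sketch under assumptions: the job names, the use of linkchecker's sql file output (which by default writes linkchecker-out.sql), the CATALOGUE_URL and DATABASE_URL variables, and the psql loading step are all hypothetical, and the container image's entrypoint may need overriding in a real pipeline. The weekly interval itself is configured as a scheduled pipeline in the GitLab UI.

```yaml
# Hypothetical .gitlab-ci.yml sketch, not an existing pipeline.
linkcheck:
  image: ghcr.io/linkchecker/linkchecker:latest
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"   # triggered by a weekly scheduled pipeline
  script:
    # sql file output; default output filename is linkchecker-out.sql
    - linkchecker "$CATALOGUE_URL" --verbose --check-extern --file-output=sql
  artifacts:
    paths:
      - linkchecker-out.sql

push-to-db:
  needs: [linkcheck]
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    # Next step picks up the artifact and pushes it to a database;
    # psql and DATABASE_URL stand in for whatever client the DB uses.
    - psql "$DATABASE_URL" -f linkchecker-out.sql
```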
