
Commit

Merge branch 'vgole001/issue11'
# Conflicts:
#	README.md
#	src/linkchecker.py
pvgenuchten committed Aug 21, 2024
2 parents 6fabd51 + 4bfc5f6 commit 1c90f4d
Showing 8 changed files with 184 additions and 86 deletions.
4 changes: 2 additions & 2 deletions Dockerfile
@@ -15,7 +15,7 @@ ENV POSTGRES_HOST=host.docker.internal
ENV POSTGRES_PORT=5432
ENV POSTGRES_DB=postgres
ENV POSTGRES_USER=postgres
ENV POSTGRES_PASSWORD=*****

WORKDIR /home/link-liveliness-assessment

@@ -39,4 +39,4 @@ EXPOSE 8000

USER linky

# ENTRYPOINT [ "python3", "-m", "uvicorn", "api:app", "--reload", "--host", "0.0.0.0", "--port", "8000" ]
129 changes: 85 additions & 44 deletions README.md
@@ -1,15 +1,17 @@
# OGC API - Records; link liveliness assessment tool

### Overview
The linkchecker component is designed to evaluate the validity and accuracy of links within metadata records in the OGC API - Records System.

A component which evaluates, for a set of metadata records (describing either data or knowledge sources), whether:

- the links to external sources are valid
- the links within the repository are valid
- the link metadata accurately represents the resource

The component returns an HTTP status for each link: 200 (OK), 401 (unauthorized), 404 (not found), 500 (server error), ...

The component runs an evaluation for a single resource on request, or runs tests at intervals, providing a history of availability.

A link either points to:

@@ -21,44 +23,77 @@ If the endpoint is an API, some sanity checks can be performed on it:

- Identify whether the API adopts any known API standard
- If a standard is adopted, check whether the API supports the basic operations of that standard

The benefit of the latter is that it provides more information than a simple ping to the index page of the API; typical examples of standardised APIs are SOAP, GraphQL, SPARQL, OpenAPI, WMS, WFS.
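
A minimal sketch of such a sanity check, using only the requests library; the heuristics below are illustrative assumptions, not the tool's actual detection logic:

```
import requests

def detect_api_standard(url, timeout=5):
    # Illustrative heuristic only -- not the tool's actual detection logic
    resp = requests.get(url, timeout=timeout)
    content_type = resp.headers.get("content-type", "")
    if "json" in content_type:
        body = resp.json()
        if isinstance(body, dict):
            if "openapi" in body:   # OpenAPI documents declare a version field
                return "OpenAPI"
            if "links" in body:     # OGC API landing pages expose a links array
                return "OGC API"
    if "WMS_Capabilities" in resp.text:  # GetCapabilities XML for WMS
        return "WMS"
    return "unknown"
```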

## OGC API - Records

OGC is in the process of adopting the [OGC API - Records](https://github.com/opengeospatial/ogcapi-records) specification, a standardised API to interact with catalogues. The specification includes a data model for metadata. This tool assesses the links section of any record in an OGC API - Records.

***Sample response***
```
{
  "id": 25,
  "urlname": "https://demo.pycsw.org/gisdata/collections/metadata:main/queryables",
  "parent_urls": [
    "https://demo.pycsw.org/gisdata/collections?f=html"
  ],
  "status": "200 OK",
  "result": "",
  "info": "True",
  "warning": "",
  "deprecated": null
}
```

OGC services (WMS, WFS, WCS, CSW) often return an HTTP 500 error or a 400 Bad Request when called without the necessary parameters, because these services expect specific parameters to determine which operation to perform.
These URL formats are therefore handled specially: the necessary parameters are detected and appended before the link is checked.
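
For example, a bare WMS endpoint (hypothetical URL) is rewritten to a GetCapabilities request before being checked:

```
before: https://example.com/geoserver/wms
after:  https://example.com/geoserver/wms?request=GetCapabilities&service=WMS
```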

Set the endpoint to be analysed via two environment variables:

```
export OGCAPI_URL=https://soilwise-he.containers.wur.nl/cat/
export OGCAPI_COLLECTION=metadata:main
```

## Source Code Brief Description

Running linkchecker.py uses the Python requests library to fetch the relevant EJP Soil Catalogue records.
Run the command below:
* `python linkchecker.py`

The URLs selected from the responses are passed to LinkChecker with the proper options.
The generated output is written to a PostgreSQL database.
A .env file is required to define the database connection parameters.
More specifically, the following parameters must be specified:

```
POSTGRES_HOST=
POSTGRES_PORT=
POSTGRES_DB=
POSTGRES_USER=
POSTGRES_PASSWORD=
```
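
As a sketch of how these variables are consumed (linkchecker.py imports python-dotenv and psycopg2, as its imports show; the exact connection code may differ):

```
import os
import psycopg2
from dotenv import load_dotenv

load_dotenv()  # read the .env file into the process environment

conn = psycopg2.connect(
    host=os.environ["POSTGRES_HOST"],
    port=os.environ["POSTGRES_PORT"],
    dbname=os.environ["POSTGRES_DB"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)
```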

## API
The api.py file creates a FastAPI application in order to retrieve link statuses.

## API Key Features
1. **Link validation**:
Returns HTTP status codes for each link, along with other important information such as the parent URL, any warnings, and the date and time of the test.
![Fast API link_status](./images/link_status.png)
2. **Broken link categorization**:
Identifies and categorizes broken links based on status codes, including Redirection Errors, Client Errors, and Server Errors.
![Link categorization endpoint](./images/categorization.png)
3. **Deprecated links identification**:
Flags links as deprecated if they have failed for X consecutive tests; in our case X equals 10 (see the sketch after this list).
Deprecated links are excluded from future tests to optimize performance.
![Fast API deprecated endpoint](./images/deprecated.png)
4. **Timeout management**:
Allows the identification of URLs that exceed a timeout threshold, which can be set manually as a parameter in LinkChecker's properties.
![Fast API timeout endpoint](./images/timeouts.png)
5. **Availability monitoring**:
When run periodically, the tool builds a history of availability for each URL, enabling users to view the status of links over time.
![Link validation endpoint](./images/val_history.png)
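
As an illustration of the deprecation sweep in feature 3, a minimal sketch using psycopg2; the table and column names are hypothetical, not the tool's actual schema:

```
DEPRECATION_THRESHOLD = 10  # X consecutive failed tests

def flag_deprecated(conn):
    # Hypothetical schema: a links table tracking consecutive failures
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE links SET deprecated = TRUE "
            "WHERE consecutive_failures >= %s",
            (DEPRECATION_THRESHOLD,),
        )
    conn.commit()
```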

## Container Deployment

Set the environment variables in the Dockerfile to enable the database connection.

The app can be deployed as a container; a docker-compose file is provided.

Run ***docker-compose up*** to start the container.

***Helpful commands***
To run the FastAPI locally, run:
```
python -m uvicorn api:app --reload --host 0.0.0.0 --port 8000
```
The FastAPI service and its interactive docs run on [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs) if ROOTPATH is not set.
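
For a quick smoke test once the service is up, it can be queried like any HTTP API (the endpoint path below is hypothetical; consult /docs for the real routes):

```
import requests

resp = requests.get("http://127.0.0.1:8000/status", timeout=10)  # hypothetical route
print(resp.status_code, resp.json())
```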

# Get current URL Status History
@@ -91,25 +126,31 @@ You can set the `ROOTPATH` env var to run the API at a path (default is at root):

```
export ROOTPATH=/linky
```

## Docker

A Docker instance must be running for the linkchecker command to work.
## CI/CD
A CI/CD configuration file is provided in order to create an automated, scheduled (cron) pipeline.
It is necessary to define the secrets context using GitLab secrets in order to connect to the database.
## Roadmap

### GeoHealthCheck integration
[GeoHealthCheck](https://GeoHealthCheck.org) is a component to monitor the liveliness of typical OGC services (WMS, WFS, WCS, CSW).
It is based on the [owslib](https://owslib.readthedocs.io/en/latest/) library, which provides a Python implementation of clients for various OGC services.
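
A minimal sketch of the kind of probe owslib enables (hypothetical endpoint; GeoHealthCheck's own probes are more elaborate):

```
from owslib.wms import WebMapService

# Connect to a WMS endpoint and list its layers as a basic liveliness probe
wms = WebMapService("https://example.com/geoserver/wms", version="1.3.0")
print(wms.identification.title)
print(list(wms.contents))
```
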
## Technological Stack
1. **Core Language**:
   - Python: Used for the linkchecker, API, and database interactions.
2. **Database**:
   - PostgreSQL: Used for storing and managing link information.
3. **Backend Framework**:
   - FastAPI: Used to create and expose REST API endpoints, with auto-generated components such as Swagger documentation.
4. **Containerization**:
   - Docker: Used to containerize the linkchecker application, ensuring consistent deployment and execution across environments.

## Soilwise-he project
This work has been initiated as part of the [Soilwise-he project](https://soilwise-he.eu/).
The project receives funding from the European Union's HORIZON Innovation Actions 2022 under grant agreement No. 101112838.
Binary file added images/categorization.png
Binary file added images/deprecated.png
Binary file added images/link_status.png
Binary file added images/timeouts.png
Binary file added images/val_history.png
137 changes: 97 additions & 40 deletions src/linkchecker.py
@@ -1,5 +1,6 @@
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from urllib.parse import urlparse, parse_qs, urlencode
import subprocess
import psycopg2
import psycopg2.extras
@@ -119,24 +120,40 @@ def extract_links(url):
        print(f"Error extracting links from {url}: {e}")
        return []

def check_single_url(url):
    # Check one URL without following its links (recursion level 0)
    process = subprocess.Popen([
        "linkchecker",
        "--verbose",
        "--check-extern",
        "--recursion-level=0",
        "--timeout=5",
        "--output=csv",
        url  # processed OGC URLs already carry a query string, so no "?f=html" suffix
    ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    # process.communicate is good for shorter-running processes
    stdout, _ = process.communicate()

    return stdout.decode('utf-8').strip().split('\n')

def run_linkchecker(url):
    # Run LinkChecker on the catalogue page, following links one level deep
    process = subprocess.Popen([
        "linkchecker",
        "--verbose",
        "--check-extern",
        "--recursion-level=1",
        "--timeout=5",
        "--output=csv",
        url + "?f=html"
    ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    # Process the output line by line and yield each line
    # Memory efficient for large outputs
    for line in process.stdout:
        yield line.decode('utf-8').strip()  # Decode bytes to string and strip newline characters
    # Wait for the process to finish
    process.wait()

def insert_or_update_link(conn, urlname, status, result, info, warning, is_valid):

@@ -231,6 +248,35 @@ def get_active_urls(conn):
    else:
        cur.execute("SELECT url FROM validation_history WHERE NOT deprecated")
    return [row[0] for row in cur.fetchall()]

def determine_service_type(url):
    ogc_patterns = ['/wms', '/wfs', '/csw', '/wcs', 'service=']

    if any(pattern in url.lower() for pattern in ogc_patterns):
        parsed_url = urlparse(url)
        query_params = parse_qs(parsed_url.query)

        # Preserve any explicit service parameter before rebuilding the query
        original_service = query_params.pop('service', None)
        query_params.pop('request', None)

        query_params['request'] = ['GetCapabilities']

        # Infer the service from the path, falling back to the original parameter
        if '/wms' in parsed_url.path.lower():
            query_params['service'] = ['WMS']
        elif '/wfs' in parsed_url.path.lower():
            query_params['service'] = ['WFS']
        elif '/csw' in parsed_url.path.lower():
            query_params['service'] = ['CSW']
        elif '/wcs' in parsed_url.path.lower():
            query_params['service'] = ['WCS']
        elif original_service:
            query_params['service'] = original_service

        new_query = urlencode(query_params, doseq=True)
        new_url = parsed_url._replace(query=new_query).geturl()

        return new_url

    return url
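
# Example (hypothetical URL) of the rewriting above:
#   determine_service_type("https://example.com/geoserver/wms")
#   -> "https://example.com/geoserver/wms?request=GetCapabilities&service=WMS"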

def main():
    start_time = time.time()  # Start timing
@@ -255,34 +301,45 @@ def main():
        extracted_links = extract_links(url)
        all_links.update(extracted_links)  # Add new links to the set of all links

    # Define the formats to be removed
    formats_to_remove = [
        'collections/' + collection + '/items?offset',
        '?f=json'
    ]

    # Specify the fields to include in the CSV file
    fields_to_include = ['urlname', 'parentname', 'baseref', 'valid', 'result', 'warning', 'info']

print("Checking Links...")

# Run LinkChecker and process the output
for line in run_linkchecker(all_links):
if re.match(r'^http', line):
# Remove trailing semicolon and split by semicolon
values = line.rstrip(';').split(';')

# Filter and pad values based on fields_to_include
filtered_values = [str(values[i]) if i < len(values) else "" for i in range(len(fields_to_include))]

# Destructure filtered_values
urlname, parentname, baseref, valid, result, warning, info = filtered_values

is_valid = is_valid_status(valid)

link_id = insert_or_update_link(conn, urlname, valid, result, info, warning, is_valid)

# Insert parent information
insert_parent(conn, parentname, baseref, link_id)
urls_to_recheck = set()
print("Initial Link Checking...")
for url in all_links:
for line in run_linkchecker(url):
if re.match(r'^http', line):
values = line.rstrip(';').split(';')
urlname = values[0]

# Parse initial check results
filtered_values = [str(values[i]) if i < len(values) else "" for i in range(len(fields_to_include))]
urlname, parentname, baseref, valid, result, warning, info = filtered_values

# Determine if URL needs to be rechecked
processed_url = determine_service_type(urlname)
if processed_url != urlname:
urls_to_recheck.add(processed_url)
else:
# If URL doesn't need reprocessing, insert results directly
is_valid = is_valid_status(valid)
link_id = insert_or_update_link(conn, urlname, valid, result, info, warning, is_valid)
insert_parent(conn, parentname, baseref, link_id)

print("Rechecking OGC processed URLs...")
for url in urls_to_recheck:
results = check_single_url(url)
for line in results:
if re.match(r'^http', line):
values = line.rstrip(';').split(';')
filtered_values = [str(values[i]) if i < len(values) else "" for i in range(len(fields_to_include))]
urlname, parentname, baseref, valid, result, warning, info = filtered_values
is_valid = is_valid_status(valid)
link_id = insert_or_update_link(conn, urlname, valid, result, info, warning, is_valid)
insert_parent(conn, parentname, baseref, link_id)

    # conn.commit()
    print("LinkChecker output written to PostgreSQL database")
