
Commit

Merge branch 'main-github.vgkole'
pvgenuchten committed May 27, 2024
2 parents 693d4e3 + f2806d7 commit 63ec572
Showing 7 changed files with 216 additions and 128 deletions.
53 changes: 30 additions & 23 deletions .github/workflows/linkhealth_job.yml
@@ -1,40 +1,47 @@
-# name: Periodic Link Checker
+# name: Periodic Link Checker

# on:
-#   schedule:
-#     - cron: '0 0 * * 0' # Run every Sunday midnight (UTC time)
+#   schedule:
+#     - cron: '0 0 * * 0' # Run every Sunday at midnight (UTC time)

# jobs:
#   link_check:
#     runs-on: ubuntu-latest
#     defaults:
#       run:
#         working-directory: ./linkchecker/

#     strategy:
#       matrix:
#         python-version: [3.11]

#     steps:
-#       - uses: actions/checkout@v4
+#       - uses: actions/checkout@v4

-#       - name: Set up Python
-#         uses: actions/setup-python@v5
-#         with:
-#           python-version: ${{ matrix.python-version }}
+#       - name: Set up Python
+#         uses: actions/setup-python@v5
+#         with:
+#           python-version: ${{ matrix.python-version }}

-#       - name: Install dependencies
-#         run: |
-#           python -m pip install --upgrade pip
-#           pip install -r ./requirements.txt
+#       - name: Install dependencies
+#         run: |
+#           python -m pip install --upgrade pip
+#           pip install -r requirements.txt

#       # Add this step to set up Docker Buildx
#       - name: Set up Docker Buildx
#         uses: docker/setup-buildx-action@v1

-#       - name: Run app code
-#         run: python ./linkchecker.py
+#       - name: Run link checker script
+#         env:
+#           POSTGRES_USER: ${{ secrets.POSTGRES_USER }}
+#           POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
+#           POSTGRES_HOST: ${{ secrets.POSTGRES_HOST }}
+#           POSTGRES_PORT: ${{ secrets.POSTGRES_PORT }}
+#           POSTGRES_DB: ${{ secrets.POSTGRES_DB }}
+#         run: python linkchecker.py

-#       - name: Run fast api
-#         run: |
-#           python -m uvicorn api:app --reload --host 0.0.0.0 --port 8000
+#       - name: Run FastAPI server
+#         env:
+#           POSTGRES_USER: ${{ secrets.POSTGRES_USER }}
+#           POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
+#           POSTGRES_HOST: ${{ secrets.POSTGRES_HOST }}
+#           POSTGRES_PORT: ${{ secrets.POSTGRES_PORT }}
+#           POSTGRES_DB: ${{ secrets.POSTGRES_DB }}
+#         run: |
+#           nohup python -m uvicorn api:app --host 0.0.0.0 --port 8000 &
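The linkchecker.py script that this job runs is not part of this diff, so its database code is not visible here. As a rough, hypothetical sketch (names and logic illustrative only), a script consuming the POSTGRES_* variables injected above via the secrets context might connect like this, using the asyncpg driver listed in requirements.txt:

```python
# Hypothetical sketch only: how a script could pick up the POSTGRES_*
# variables the workflow injects via the secrets context. The real
# linkchecker.py is not shown in this commit.
import asyncio
import os

import asyncpg

async def main() -> None:
    conn = await asyncpg.connect(
        user=os.environ["POSTGRES_USER"],
        password=os.environ["POSTGRES_PASSWORD"],
        host=os.environ["POSTGRES_HOST"],
        port=int(os.environ["POSTGRES_PORT"]),
        database=os.environ["POSTGRES_DB"],
    )
    try:
        print(await conn.fetchval("SELECT version()"))  # simple connectivity check
    finally:
        await conn.close()

asyncio.run(main())
```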
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
+__pycache__/
+.env
52 changes: 28 additions & 24 deletions README.md
@@ -28,38 +28,41 @@ The benefit of the latter is that it provides more information than a simple ping to

OGC is in the process of adopting the [OGC API - Records](https://github.com/opengeospatial/ogcapi-records) specification, a standardised API for interacting with catalogues. The specification includes a data model for metadata. This tool assesses the linkage section of any record in an OGC API - Records catalogue.


## Source Code Brief Description

-The source code leverages the [linkchecker](https://linkchecker.github.io/linkchecker/index.html) tool in order to check whether a link
-from the [EJP Soil Catalogue](https://catalogue.ejpsoil.eu/collections/metadata:main/items?offset=0) is valid.
-The [JSON](https://catalogue.ejpsoil.eu/collections/metadata:main/items?f=json) representation is used to retrieve details about the pagination.
-A string is created for each page. For every page URL, the Python [requests](https://pypi.org/project/requests/) library is used to retrieve all URLs on that page.
-Linkchecker command:
-* subprocess.Popen(["docker", "run", "--rm", "-i", "-u", "1000:1000", "ghcr.io/linkchecker/linkchecker:latest", "--verbose", "--check-extern", "--recursion-level=1", "--output=csv", url + "?f=html"])
-runs a container with the LinkChecker tool and instructs it to check the links in verbose mode, follow external links up to one level deep, and output the results in CSV format.
-
-A FastAPI is created to provide endpoints based on the statuses of links, including those with status codes in the 3xx, 4xx, and 5xx ranges, as well as those containing warnings.
-Command to run the FastAPI:
+Running linkchecker.py uses the Python [requests](https://pypi.org/project/requests/) library to fetch the relevant EJP Soil Catalogue source.
+Run the command below:
+* python linkchecker.py
+The URLs selected from the requests are passed to linkchecker with the proper options.
+The generated output is written to a PostgreSQL database.
+A .env file is required to define the database connection parameters.
+More specifically, the following parameters must be specified:

+```
+POSTGRES_HOST=
+POSTGRES_PORT=
+POSTGRES_DB=
+POSTGRES_USER=
+POSTGRES_PASSWORD=
+```
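For reference, src/api.py in this same commit assembles exactly these parameters into a connection URL for the databases library; a minimal sketch of that pattern:

```python
# Sketch of the pattern used in src/api.py: load the .env values and
# build the PostgreSQL connection URL for the `databases` library.
import os

from databases import Database
from dotenv import load_dotenv

load_dotenv()  # reads the POSTGRES_* values from .env into the environment

DATABASE_URL = (
    f"postgresql://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
    f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}"
    f"/{os.environ['POSTGRES_DB']}"
)
database = Database(DATABASE_URL)
```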

+## API
+The api.py file creates a FastAPI application to retrieve link statuses.
+Run the command below:
+* python -m uvicorn api:app --reload --host 0.0.0.0 --port 8000

-To view the running FastAPI, navigate to: [http://127.0.0.1:8000/docs]
-
-## CI/CD
-This workflow is designed to run as a cron job at midnight every Sunday.
-The execution takes about 80 minutes to complete, and more than 12,000 URLs are checked.
-Currently the workflow is commented out.
+To view the FastAPI service, navigate to: [http://127.0.0.1:8000/docs]

+## Known issues
+Attempting to write LinkChecker's output directly to a PostgreSQL database causes crashes, due to invalid characters and missing values in the data.

+## Docker
+A Docker instance must be running for the linkchecker command to work.

+## CI/CD
+A workflow is provided to run the link checker as a cron job once per week, every Sunday at midnight.
+(However, it is currently commented out to save running minutes, since it takes about 80 minutes to complete.)
+It is necessary to use the **secrets** context in GitHub in order to connect to the database.
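A plausible workaround for the crash noted under Known issues is to sanitise each CSV row before inserting it. The sketch below is illustrative only: the linkchecker_output table and its columns match what src/api.py queries, but the CSV field names, the ';' delimiter, and the cleaning rules are assumptions about LinkChecker's output rather than code from this commit:

```python
# Illustrative sketch: clean LinkChecker CSV rows (NUL bytes, missing
# values) before writing them to PostgreSQL via the `databases` library.
import csv

def clean_row(row: dict) -> dict:
    # Drop NUL characters and turn missing values into empty strings.
    return {k: (v or "").replace("\x00", "").strip() for k, v in row.items()}

async def ingest(database, csv_path: str) -> None:
    query = (
        "INSERT INTO linkchecker_output (urlname, parentname, valid, warning) "
        "VALUES (:urlname, :parentname, :valid, :warning)"
    )
    with open(csv_path, newline="") as f:
        for raw in csv.DictReader(f, delimiter=";"):  # delimiter is an assumption
            row = clean_row(raw)
            await database.execute(query=query, values={
                "urlname": row.get("urlname", ""),
                "parentname": row.get("parentname", ""),
                "valid": row.get("valid", ""),
                "warning": row.get("warning", ""),
            })
```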

## Roadmap

### Report about results

Stats are currently saved as CSV. They should be ingested into a format that can be used to create reports in a platform like Apache Superset.

### GeoHealthCheck integration

[GeoHealthCheck](https://GeoHealthCheck.org) is a component to monitor the liveliness of typical OGC services (WMS, WFS, WCS, CSW). It is based on the [owslib](https://owslib.readthedocs.io/en/latest/) library, which provides a Python implementation of clients for various OGC services.
@@ -68,3 +71,4 @@ Stats are currently saved as CSV. Stats should be ingested into a format which c

This work has been initiated as part of the [Soilwise-he project](https://soilwise-he.eu/).
The project receives funding from the European Union’s HORIZON Innovation Actions 2022 under grant agreement No. 101112838.

4 changes: 3 additions & 1 deletion requirements.txt
@@ -3,4 +3,6 @@ beautifulsoup4
fastapi
pydantic
uvicorn
-pandas
+pandas
+asyncpg
+databases
Binary file modified src/__pycache__/api.cpython-311.pyc
Binary file not shown.
137 changes: 75 additions & 62 deletions src/api.py
@@ -1,23 +1,43 @@
-from fastapi import FastAPI, APIRouter, Query
-import pandas as pd
+from fastapi import FastAPI, HTTPException
+from dotenv import load_dotenv
+from databases import Database
+from typing import List
+from pydantic import BaseModel
+import asyncpg
+import os

-app = FastAPI(
-    title="SOILWISE SERVICE PROJECT",
-    description="API that retrieves EJP Soil Catalogue URL statuses",
-)
+# Load environment variables from .env file
+load_dotenv()

-# Define a route group
-urls_router = APIRouter(tags=["Retrieve URLs Info"])
+# Database connection setup
+# Load environment variables securely (replace with your actual variable names)
+DATABASE_URL = "postgresql://" + os.environ.get("POSTGRES_USER") + ":" + \
+    os.environ.get("POSTGRES_PASSWORD") + "@" + os.environ.get("POSTGRES_HOST") + ":" + \
+    os.environ.get("POSTGRES_PORT") + "/" + os.environ.get("POSTGRES_DB")

-redirection_statuses = [
+database = Database(DATABASE_URL)
+
+# FastAPI app instance
+app = FastAPI()
+
+# Define the response model
+class StatusResponse(BaseModel):
+    id: int  # Example column; adjust based on your actual table schema
+    urlname: str
+    parentname: str
+    valid: str
+    warning: str
+
+# Define status lists
+REDIRECTION_STATUSES = [
    "301 Moved Permanently",
    "302 Found (Moved Temporarily)",
    "304 Not Modified",
    "307 Temporary Redirect",
    "308 Permanent Redirect"
]

-client_error_statuses = [
+CLIENT_ERROR_STATUSES = [
    "400 Bad Request",
    "401 Unauthorized",
    "403 Forbidden",
@@ -26,67 +46,60 @@
    "409 Conflict"
]

-server_error_statuses = [
+SERVER_ERROR_STATUSES = [
    "500 Internal Server Error",
    "501 Not Implemented",
    "503 Service Unavailable",
    "504 Gateway Timeout"
]

-data = pd.read_csv("soil_catalogue_link.csv")
-data = data.fillna('')
-
-def paginate_data(data_frame: pd.DataFrame, skip: int = 0, limit: int = 10):
-    """
-    Paginates the result from the DataFrame.
-    Args:
-        data_frame: The DataFrame to paginate.
-        skip: The number of records to skip (default: 0).
-        limit: The maximum number of records to return per page (default: 10).
-    """
-    return data_frame.iloc[skip: skip + limit]
+# Helper function to execute a SQL query and fetch results
+async def fetch_data(query: str, values: dict = {}):
+    try:
+        return await database.fetch_all(query=query, values=values)
+    except asyncpg.exceptions.UndefinedTableError:
+        raise HTTPException(status_code=500, detail="The specified table does not exist")
+    except Exception as e:
+        raise HTTPException(status_code=500, detail="Database query failed") from e

-def get_urls_by_category(category_statuses, column_to_check="valid"):
-    """
-    Filters URLs from the DataFrame based on the provided status code list.
-    column_to_check: the column containing the values to check (default: "valid").
-    """
-    filtered_data = data[data[column_to_check].isin(category_statuses)]
-    filtered_rows = filtered_data.to_dict(orient='records')
-    return filtered_rows
+# Endpoint to retrieve data with redirection statuses
+@app.get('/Redirection_URLs/3xx', response_model=List[StatusResponse])
+async def get_redirection_statuses():
+    query = "SELECT DISTINCT * FROM linkchecker_output WHERE valid = ANY(:statuses)"
+    data = await fetch_data(query=query, values={'statuses': REDIRECTION_STATUSES})
+    return data

-@urls_router.get("/Redirection_URLs/3xx", name="Get Redirection URLs 3xx")
-async def get_redirection_urls():
-    """
-    Retrieves URLs from the CSV classified as status code 3xx.
-    """
-    urls = get_urls_by_category(redirection_statuses)
-    return {"category": "3xx Redirection", "urls": urls}
+# Endpoint to retrieve data with client error statuses
+@app.get('/Client_Error_URLs/4xx', response_model=List[StatusResponse])
+async def get_client_error_statuses():
+    query = "SELECT DISTINCT * FROM linkchecker_output WHERE valid = ANY(:statuses)"
+    data = await fetch_data(query=query, values={'statuses': CLIENT_ERROR_STATUSES})
+    return data

-@urls_router.get("/Client_Error_URLs/4xx", name="Get Client Error URLs 4xx")
-async def get_client_error_urls():
-    """
-    Retrieves URLs from the CSV classified as status code 4xx.
-    """
-    urls = get_urls_by_category(client_error_statuses)
-    return {"category": "4xx Client Error", "urls": urls}
+# Endpoint to retrieve data with server error statuses
+@app.get('/Server_Errors_URLs/5xx', response_model=List[StatusResponse])
+async def get_server_error_statuses():
+    query = "SELECT DISTINCT * FROM linkchecker_output WHERE valid = ANY(:statuses)"
+    data = await fetch_data(query=query, values={'statuses': SERVER_ERROR_STATUSES})
+    return data

-@urls_router.get("/Server_Error_URLs/5xx", name="Get Server Error URLs 5xx")
-async def get_server_error_urls():
-    """
-    Retrieves URLs from the CSV classified as status code 5xx.
-    """
-    urls = get_urls_by_category(server_error_statuses)
-    return {"category": "5xx Server Error", "urls": urls}
+# Endpoint to retrieve data where the warning column is not empty
+@app.get('/URLs_Which_Have_Warnings', response_model=List[StatusResponse])
+async def get_non_empty_warnings():
+    query = "SELECT DISTINCT * FROM linkchecker_output WHERE warning != ''"
+    data = await fetch_data(query=query)
+    return data

-@urls_router.get("/URLs_Which_Have_Warnings", name="Get URLs that contain warnings")
-async def get_warning_urls(skip: int = Query(0, ge=0), limit: int = Query(10, ge=1)):
-    """
-    Retrieves URLs from the CSV that contain warnings.
-    """
-    filtered_data = data[data['warning'] != '']
-    paginated_data = paginate_data(filtered_data, skip=skip, limit=limit)
-    return {"category": "Has Warnings", "urls": paginated_data.to_dict(orient='records')}
+# Connect to the database on application startup
+@app.on_event('startup')
+async def startup():
+    try:
+        await database.connect()
+    except Exception as e:
+        raise HTTPException(status_code=500, detail="Database connection failed") from e

-# Include the router in the main app
-app.include_router(urls_router)
+# Disconnect from the database on application shutdown
+@app.on_event('shutdown')
+async def shutdown():
+    try:
+        await database.disconnect()
+    except Exception as e:
+        raise HTTPException(status_code=500, detail="Database disconnection failed") from e
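With the database populated and uvicorn serving on the default port from the README, the endpoints above can be smoke-tested directly; requests is already in requirements.txt:

```python
# Smoke-test one of the endpoints defined in src/api.py.
# Assumes `python -m uvicorn api:app --host 0.0.0.0 --port 8000` is running.
import requests

resp = requests.get("http://127.0.0.1:8000/Redirection_URLs/3xx", timeout=30)
resp.raise_for_status()
for item in resp.json()[:5]:  # show the first few redirected links
    print(item["urlname"], "->", item["valid"])
```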
