
Commit

Merge branch 'main-github.vgkole'
pvgenuchten committed May 27, 2024
2 parents 693d4e3 + f2806d7 commit 63ec572
Showing 7 changed files with 216 additions and 128 deletions.
53 changes: 30 additions & 23 deletions .github/workflows/linkhealth_job.yml
@@ -1,40 +1,47 @@
-# name: Periodic Link Checker
+# name: Periodic Link Checker

# on:
-#   schedule:
-#     - cron: '0 0 * * 0' # Run every Sunday midnight (UTC time)
+#   schedule:
+#     - cron: '0 0 * * 0' # Run every Sunday at midnight (UTC time)

# jobs:
#   link_check:
#     runs-on: ubuntu-latest
#     defaults:
#       run:
#         working-directory: ./linkchecker/

#     strategy:
#       matrix:
#         python-version: [3.11]

#     steps:
-#       - uses: actions/checkout@v4
+#       - uses: actions/checkout@v4

-#       - name: Set up Python
-#         uses: actions/setup-python@v5
-#         with:
-#           python-version: ${{ matrix.python-version }}
+#       - name: Set up Python
+#         uses: actions/setup-python@v5
+#         with:
+#           python-version: ${{ matrix.python-version }}

-#       - name: Install dependencies
-#         run: |
-#           python -m pip install --upgrade pip
-#           pip install -r ./requirements.txt
+#       - name: Install dependencies
+#         run: |
+#           python -m pip install --upgrade pip
+#           pip install -r requirements.txt

#       # Add this step to set up Docker Buildx
#       - name: Set up Docker Buildx
#         uses: docker/setup-buildx-action@v1

-#       - name: Run app code
-#         run: python ./linkchecker.py
+#       - name: Run link checker script
+#         env:
+#           POSTGRES_USER: ${{ secrets.POSTGRES_USER }}
+#           POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
+#           POSTGRES_HOST: ${{ secrets.POSTGRES_HOST }}
+#           POSTGRES_PORT: ${{ secrets.POSTGRES_PORT }}
+#           POSTGRES_DB: ${{ secrets.POSTGRES_DB }}
+#         run: python linkchecker.py

-#       - name: Run fast api
-#         run: |
-#           python -m uvicorn api:app --reload --host 0.0.0.0 --port 8000
+#       - name: Run FastAPI server
+#         env:
+#           POSTGRES_USER: ${{ secrets.POSTGRES_USER }}
+#           POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
+#           POSTGRES_HOST: ${{ secrets.POSTGRES_HOST }}
+#           POSTGRES_PORT: ${{ secrets.POSTGRES_PORT }}
+#           POSTGRES_DB: ${{ secrets.POSTGRES_DB }}
+#         run: |
+#           nohup python -m uvicorn api:app --host 0.0.0.0 --port 8000 &
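The linkchecker.py script that this job runs is not part of this diff, so its database code is not visible here. As a rough, hypothetical sketch (names and logic illustrative only), a script consuming the POSTGRES_* variables injected above via the secrets context might connect like this, using the asyncpg driver listed in requirements.txt:

```python
# Hypothetical sketch only: how a script could pick up the POSTGRES_*
# variables the workflow injects via the secrets context. The real
# linkchecker.py is not shown in this commit.
import asyncio
import os

import asyncpg

async def main() -> None:
    conn = await asyncpg.connect(
        user=os.environ["POSTGRES_USER"],
        password=os.environ["POSTGRES_PASSWORD"],
        host=os.environ["POSTGRES_HOST"],
        port=int(os.environ["POSTGRES_PORT"]),
        database=os.environ["POSTGRES_DB"],
    )
    try:
        print(await conn.fetchval("SELECT version()"))  # simple connectivity check
    finally:
        await conn.close()

asyncio.run(main())
```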
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
+__pycache__/
+.env
52 changes: 28 additions & 24 deletions README.md
@@ -28,38 +28,41 @@ The benefit of the latter is that it provides more information than a simple ping to

OGC is in the process of adopting the [OGC API - Records](https://github.com/opengeospatial/ogcapi-records) specification, a standardised API for interacting with catalogues. The specification includes a data model for metadata. This tool assesses the linkage section of any record in an OGC API - Records catalogue.


## Source Code Brief Description

-The source code leverages the [linkchecker](https://linkchecker.github.io/linkchecker/index.html) tool in order to check whether a link
-from the [EJP Soil Catalogue](https://catalogue.ejpsoil.eu/collections/metadata:main/items?offset=0) is valid.
-The [JSON](https://catalogue.ejpsoil.eu/collections/metadata:main/items?f=json) representation is used to retrieve details about the pagination.
-A string is created for each page. For every page URL, the Python [requests](https://pypi.org/project/requests/) library is used to retrieve all URLs on that page.
-Linkchecker command:
-* subprocess.Popen(["docker", "run", "--rm", "-i", "-u", "1000:1000", "ghcr.io/linkchecker/linkchecker:latest", "--verbose", "--check-extern", "--recursion-level=1", "--output=csv", url + "?f=html"])
-runs a container with the LinkChecker tool and instructs it to check the links in verbose mode, follow external links up to one level deep, and output the results in CSV format.
-
-A FastAPI is created to provide endpoints based on the statuses of links, including those with status codes in the 3xx, 4xx, and 5xx ranges, as well as those containing warnings.
-Command to run the FastAPI:
+Running linkchecker.py uses the Python [requests](https://pypi.org/project/requests/) library to fetch the relevant EJP Soil Catalogue source.
+Run the command below:
+* python linkchecker.py
+The URLs selected from the requests are passed to linkchecker with the proper options.
+The generated output is written to a PostgreSQL database.
+A .env file is required to define the database connection parameters.
+More specifically, the following parameters must be specified:

+```
+POSTGRES_HOST=
+POSTGRES_PORT=
+POSTGRES_DB=
+POSTGRES_USER=
+POSTGRES_PASSWORD=
+```
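For reference, src/api.py in this same commit assembles exactly these parameters into a connection URL for the databases library; a minimal sketch of that pattern:

```python
# Sketch of the pattern used in src/api.py: load the .env values and
# build the PostgreSQL connection URL for the `databases` library.
import os

from databases import Database
from dotenv import load_dotenv

load_dotenv()  # reads the POSTGRES_* values from .env into the environment

DATABASE_URL = (
    f"postgresql://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
    f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}"
    f"/{os.environ['POSTGRES_DB']}"
)
database = Database(DATABASE_URL)
```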

+## API
+The api.py file creates a FastAPI application to retrieve link statuses.
+Run the command below:
+* python -m uvicorn api:app --reload --host 0.0.0.0 --port 8000

-To view the running FastAPI, navigate to: [http://127.0.0.1:8000/docs]
-
-## CI/CD
-This workflow is designed to run as a cron job at midnight every Sunday.
-The execution takes about 80 minutes to complete, and more than 12,000 URLs are checked.
-Currently the workflow is commented out.
+To view the FastAPI service, navigate to: [http://127.0.0.1:8000/docs]

+## Known issues
+Attempting to write LinkChecker's output directly to a PostgreSQL database causes crashes, due to invalid characters and missing values in the data.

+## Docker
+A Docker instance must be running for the linkchecker command to work.

+## CI/CD
+A workflow is provided to run the link checker as a cron job once per week, every Sunday at midnight.
+(However, it is currently commented out to save running minutes, since it takes about 80 minutes to complete.)
+It is necessary to use the **secrets** context in GitHub in order to connect to the database.
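A plausible workaround for the crash noted under Known issues is to sanitise each CSV row before inserting it. The sketch below is illustrative only: the linkchecker_output table and its columns match what src/api.py queries, but the CSV field names, the ';' delimiter, and the cleaning rules are assumptions about LinkChecker's output rather than code from this commit:

```python
# Illustrative sketch: clean LinkChecker CSV rows (NUL bytes, missing
# values) before writing them to PostgreSQL via the `databases` library.
import csv

def clean_row(row: dict) -> dict:
    # Drop NUL characters and turn missing values into empty strings.
    return {k: (v or "").replace("\x00", "").strip() for k, v in row.items()}

async def ingest(database, csv_path: str) -> None:
    query = (
        "INSERT INTO linkchecker_output (urlname, parentname, valid, warning) "
        "VALUES (:urlname, :parentname, :valid, :warning)"
    )
    with open(csv_path, newline="") as f:
        for raw in csv.DictReader(f, delimiter=";"):  # delimiter is an assumption
            row = clean_row(raw)
            await database.execute(query=query, values={
                "urlname": row.get("urlname", ""),
                "parentname": row.get("parentname", ""),
                "valid": row.get("valid", ""),
                "warning": row.get("warning", ""),
            })
```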

## Roadmap

### Report about results

Stats are currently saved as CSV. They should be ingested into a format that can be used to create reports in a platform like Apache Superset.

### GeoHealthCheck integration

[GeoHealthCheck](https://GeoHealthCheck.org) is a component to monitor the liveliness of typical OGC services (WMS, WFS, WCS, CSW). It is based on the [owslib](https://owslib.readthedocs.io/en/latest/) library, which provides a Python implementation of clients for various OGC services.
@@ -68,3 +71,4 @@ Stats are currently saved as CSV. Stats should be ingested into a format which c

This work has been initiated as part of the [Soilwise-he project](https://soilwise-he.eu/).
The project receives funding from the European Union’s HORIZON Innovation Actions 2022 under grant agreement No. 101112838.

4 changes: 3 additions & 1 deletion requirements.txt
@@ -3,4 +3,6 @@ beautifulsoup4
fastapi
pydantic
uvicorn
-pandas
+pandas
+asyncpg
+databases
Binary file modified src/__pycache__/api.cpython-311.pyc
Binary file not shown.
137 changes: 75 additions & 62 deletions src/api.py
@@ -1,23 +1,43 @@
-from fastapi import FastAPI, APIRouter, Query
-import pandas as pd
+from fastapi import FastAPI, HTTPException
+from dotenv import load_dotenv
+from databases import Database
+from typing import List
+from pydantic import BaseModel
+import asyncpg
+import os

-app = FastAPI(
-    title="SOILWISE SERVICE PROJECT",
-    description="API that retrieves EJP Soil Catalogue URL statuses",
-)
+# Load environment variables from .env file
+load_dotenv()

-# Define a route group
-urls_router = APIRouter(tags=["Retrieve URLs Info"])
+# Database connection setup
+# Load environment variables securely (replace with your actual variable names)
+DATABASE_URL = "postgresql://" + os.environ.get("POSTGRES_USER") + ":" + \
+    os.environ.get("POSTGRES_PASSWORD") + "@" + os.environ.get("POSTGRES_HOST") + ":" + \
+    os.environ.get("POSTGRES_PORT") + "/" + os.environ.get("POSTGRES_DB")

-redirection_statuses = [
+database = Database(DATABASE_URL)
+
+# FastAPI app instance
+app = FastAPI()
+
+# Define the response model
+class StatusResponse(BaseModel):
+    id: int  # Example column; adjust based on your actual table schema
+    urlname: str
+    parentname: str
+    valid: str
+    warning: str
+
+# Define status lists
+REDIRECTION_STATUSES = [
    "301 Moved Permanently",
    "302 Found (Moved Temporarily)",
    "304 Not Modified",
    "307 Temporary Redirect",
    "308 Permanent Redirect"
]

-client_error_statuses = [
+CLIENT_ERROR_STATUSES = [
    "400 Bad Request",
    "401 Unauthorized",
    "403 Forbidden",
@@ -26,67 +46,60 @@
    "409 Conflict"
]

-server_error_statuses = [
+SERVER_ERROR_STATUSES = [
    "500 Internal Server Error",
    "501 Not Implemented",
    "503 Service Unavailable",
    "504 Gateway Timeout"
]

-data = pd.read_csv("soil_catalogue_link.csv")
-data = data.fillna('')
-
-def paginate_data(data_frame: pd.DataFrame, skip: int = 0, limit: int = 10):
-    """
-    Paginates the result from the DataFrame.
-    Args:
-        data_frame: The DataFrame to paginate.
-        skip: The number of records to skip (default: 0).
-        limit: The maximum number of records to return per page (default: 10).
-    """
-    return data_frame.iloc[skip: skip + limit]
+# Helper function to execute a SQL query and fetch results
+async def fetch_data(query: str, values: dict = {}):
+    try:
+        return await database.fetch_all(query=query, values=values)
+    except asyncpg.exceptions.UndefinedTableError:
+        raise HTTPException(status_code=500, detail="The specified table does not exist")
+    except Exception as e:
+        raise HTTPException(status_code=500, detail="Database query failed") from e

-def get_urls_by_category(category_statuses, column_to_check="valid"):
-    """
-    Filters URLs from the DataFrame based on the provided status code list.
-    column_to_check: the column containing the values to check (default: "valid").
-    """
-    filtered_data = data[data[column_to_check].isin(category_statuses)]
-    filtered_rows = filtered_data.to_dict(orient='records')
-    return filtered_rows
+# Endpoint to retrieve data with redirection statuses
+@app.get('/Redirection_URLs/3xx', response_model=List[StatusResponse])
+async def get_redirection_statuses():
+    query = "SELECT DISTINCT * FROM linkchecker_output WHERE valid = ANY(:statuses)"
+    data = await fetch_data(query=query, values={'statuses': REDIRECTION_STATUSES})
+    return data

-@urls_router.get("/Redirection_URLs/3xx", name="Get Redirection URLs 3xx")
-async def get_redirection_urls():
-    """
-    Retrieves URLs from the CSV classified as status code 3xx.
-    """
-    urls = get_urls_by_category(redirection_statuses)
-    return {"category": "3xx Redirection", "urls": urls}
+# Endpoint to retrieve data with client error statuses
+@app.get('/Client_Error_URLs/4xx', response_model=List[StatusResponse])
+async def get_client_error_statuses():
+    query = "SELECT DISTINCT * FROM linkchecker_output WHERE valid = ANY(:statuses)"
+    data = await fetch_data(query=query, values={'statuses': CLIENT_ERROR_STATUSES})
+    return data

-@urls_router.get("/Client_Error_URLs/4xx", name="Get Client Error URLs 4xx")
-async def get_client_error_urls():
-    """
-    Retrieves URLs from the CSV classified as status code 4xx.
-    """
-    urls = get_urls_by_category(client_error_statuses)
-    return {"category": "4xx Client Error", "urls": urls}
+# Endpoint to retrieve data with server error statuses
+@app.get('/Server_Errors_URLs/5xx', response_model=List[StatusResponse])
+async def get_server_error_statuses():
+    query = "SELECT DISTINCT * FROM linkchecker_output WHERE valid = ANY(:statuses)"
+    data = await fetch_data(query=query, values={'statuses': SERVER_ERROR_STATUSES})
+    return data

-@urls_router.get("/Server_Error_URLs/5xx", name="Get Server Error URLs 5xx")
-async def get_server_error_urls():
-    """
-    Retrieves URLs from the CSV classified as status code 5xx.
-    """
-    urls = get_urls_by_category(server_error_statuses)
-    return {"category": "5xx Server Error", "urls": urls}
+# Endpoint to retrieve data where the warning column is not empty
+@app.get('/URLs_Which_Have_Warnings', response_model=List[StatusResponse])
+async def get_non_empty_warnings():
+    query = "SELECT DISTINCT * FROM linkchecker_output WHERE warning != ''"
+    data = await fetch_data(query=query)
+    return data

-@urls_router.get("/URLs_Which_Have_Warnings", name="Get URLs that contain warnings")
-async def get_warning_urls(skip: int = Query(0, ge=0), limit: int = Query(10, ge=1)):
-    """
-    Retrieves URLs from the CSV that contain warnings.
-    """
-    filtered_data = data[data['warning'] != '']
-    paginated_data = paginate_data(filtered_data, skip=skip, limit=limit)
-    return {"category": "Has Warnings", "urls": paginated_data.to_dict(orient='records')}
+# Connect to the database on application startup
+@app.on_event('startup')
+async def startup():
+    try:
+        await database.connect()
+    except Exception as e:
+        raise HTTPException(status_code=500, detail="Database connection failed") from e

-# Include the router in the main app
-app.include_router(urls_router)
+# Disconnect from the database on application shutdown
+@app.on_event('shutdown')
+async def shutdown():
+    try:
+        await database.disconnect()
+    except Exception as e:
+        raise HTTPException(status_code=500, detail="Database disconnection failed") from e
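With the database populated and uvicorn serving on the default port from the README, the endpoints above can be smoke-tested directly; requests is already in requirements.txt:

```python
# Smoke-test one of the endpoints defined in src/api.py.
# Assumes `python -m uvicorn api:app --host 0.0.0.0 --port 8000` is running.
import requests

resp = requests.get("http://127.0.0.1:8000/Redirection_URLs/3xx", timeout=30)
resp.raise_for_status()
for item in resp.json()[:5]:  # show the first few redirected links
    print(item["urlname"], "->", item["valid"])
```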
