
Commit

Merge branch 'vgole001/issue11'
# Conflicts:
#	README.md
#	src/linkchecker.py
pvgenuchten committed Aug 21, 2024
2 parents 6fabd51 + 4bfc5f6 commit 1c90f4d
Showing 8 changed files with 184 additions and 86 deletions.
4 changes: 2 additions & 2 deletions Dockerfile
@@ -15,7 +15,7 @@ ENV POSTGRES_HOST=host.docker.internal
ENV POSTGRES_PORT=5432
ENV POSTGRES_DB=postgres
ENV POSTGRES_USER=postgres
ENV POSTGRES_PASSWORD=*****

WORKDIR /home/link-liveliness-assessment

@@ -39,4 +39,4 @@ EXPOSE 8000

USER linky

# ENTRYPOINT [ "python3", "-m", "uvicorn", "api:app", "--reload", "--host", "0.0.0.0", "--port", "8000" ]
129 changes: 85 additions & 44 deletions README.md
@@ -1,15 +1,17 @@
# OGC API - Records; link liveliness assessment tool

### Overview
The linkchecker component is designed to evaluate the validity and accuracy of links within metadata records in the OGC API - Records System.

A component which evaluates, for a set of metadata records (describing either data or knowledge sources), whether:

- the links to external sources are valid
- the links within the repository are valid
- the link metadata accurately represents the resource

The component returns an HTTP status for each link: 200 (OK), 401 (unauthorized), 404 (not found), 500 (server error), ...

The component runs an evaluation for a single resource on request, or runs tests at intervals, providing a history of availability.

A link either points to:

@@ -21,44 +23,77 @@ If the endpoint is an API, some sanity checks can be performed on it:

- Identify whether the API adopts any known API standard
- If a standard is adopted, check whether the API supports the basic operations of that standard

The benefit of the latter is that it provides more information than a simple ping to the index page of the API; typical examples of standardised APIs are SOAP, GraphQL, SPARQL, OpenAPI, WMS, WFS.
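
A minimal sketch of such a sanity check, using only the requests library; the heuristics below are illustrative assumptions, not the tool's actual detection logic:

```
import requests

def detect_api_standard(url, timeout=5):
    # Illustrative heuristic only -- not the tool's actual detection logic
    resp = requests.get(url, timeout=timeout)
    content_type = resp.headers.get("content-type", "")
    if "json" in content_type:
        body = resp.json()
        if isinstance(body, dict):
            if "openapi" in body:   # OpenAPI documents declare a version field
                return "OpenAPI"
            if "links" in body:     # OGC API landing pages expose a links array
                return "OGC API"
    if "WMS_Capabilities" in resp.text:  # GetCapabilities XML for WMS
        return "WMS"
    return "unknown"
```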

## OGC API - Records

OGC is in the process of adopting the [OGC API - Records](https://github.com/opengeospatial/ogcapi-records) specification, a standardised API to interact with catalogues. The specification includes a data model for metadata. This tool assesses the links section of any record in an OGC API - Records.

***Sample response***
```
{
  "id": 25,
  "urlname": "https://demo.pycsw.org/gisdata/collections/metadata:main/queryables",
  "parent_urls": [
    "https://demo.pycsw.org/gisdata/collections?f=html"
  ],
  "status": "200 OK",
  "result": "",
  "info": "True",
  "warning": "",
  "deprecated": null
}
```

OGC services (WMS, WFS, WCS, CSW) often return an HTTP 500 error or a 400 Bad Request when called without the necessary parameters, because these services expect specific parameters to determine which operation to perform.
These URL formats are therefore handled specially: the necessary parameters are detected and appended before the link is checked.
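
For example, a bare WMS endpoint (hypothetical URL) is rewritten to a GetCapabilities request before being checked:

```
before: https://example.com/geoserver/wms
after:  https://example.com/geoserver/wms?request=GetCapabilities&service=WMS
```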

Set the endpoint to be analysed via two environment variables:

```
export OGCAPI_URL=https://soilwise-he.containers.wur.nl/cat/
export OGCAPI_COLLECTION=metadata:main
```

## Source Code Brief Description

Running linkchecker.py uses the Python requests library to fetch the relevant EJP Soil Catalogue records.
Run the command below:
* `python linkchecker.py`

The URLs selected from the responses are passed to LinkChecker with the proper options.
The generated output is written to a PostgreSQL database.
A .env file is required to define the database connection parameters.
More specifically, the following parameters must be specified:

```
POSTGRES_HOST=
POSTGRES_PORT=
POSTGRES_DB=
POSTGRES_USER=
POSTGRES_PASSWORD=
```
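
As a sketch of how these variables are consumed (linkchecker.py imports python-dotenv and psycopg2, as its imports show; the exact connection code may differ):

```
import os
import psycopg2
from dotenv import load_dotenv

load_dotenv()  # read the .env file into the process environment

conn = psycopg2.connect(
    host=os.environ["POSTGRES_HOST"],
    port=os.environ["POSTGRES_PORT"],
    dbname=os.environ["POSTGRES_DB"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
)
```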

## API
The api.py file creates a FastAPI application in order to retrieve link statuses.

## API Key Features
1. **Link validation**:
Returns HTTP status codes for each link, along with other important information such as the parent URL, any warnings, and the date and time of the test.
![Fast API link_status](./images/link_status.png)
2. **Broken link categorization**:
Identifies and categorizes broken links based on status codes, including Redirection Errors, Client Errors, and Server Errors.
![Link categorization endpoint](./images/categorization.png)
3. **Deprecated links identification**:
Flags links as deprecated if they have failed for X consecutive tests; in our case X equals 10 (see the sketch after this list).
Deprecated links are excluded from future tests to optimize performance.
![Fast API deprecated endpoint](./images/deprecated.png)
4. **Timeout management**:
Allows the identification of URLs that exceed a timeout threshold, which can be set manually as a parameter in LinkChecker's properties.
![Fast API timeout endpoint](./images/timeouts.png)
5. **Availability monitoring**:
When run periodically, the tool builds a history of availability for each URL, enabling users to view the status of links over time.
![Link validation endpoint](./images/val_history.png)
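
As an illustration of the deprecation sweep in feature 3, a minimal sketch using psycopg2; the table and column names are hypothetical, not the tool's actual schema:

```
DEPRECATION_THRESHOLD = 10  # X consecutive failed tests

def flag_deprecated(conn):
    # Hypothetical schema: a links table tracking consecutive failures
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE links SET deprecated = TRUE "
            "WHERE consecutive_failures >= %s",
            (DEPRECATION_THRESHOLD,),
        )
    conn.commit()
```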

## Container Deployment

Set the environment variables in the Dockerfile to enable the database connection.

The app can be deployed as a container; a docker-compose file is provided.

Run ***docker-compose up*** to start the container.

***Helpful commands***
To run the FastAPI locally, run:
```
python -m uvicorn api:app --reload --host 0.0.0.0 --port 8000
```
The FastAPI service and its interactive docs run on [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs) if ROOTPATH is not set.
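
For a quick smoke test once the service is up, it can be queried like any HTTP API (the endpoint path below is hypothetical; consult /docs for the real routes):

```
import requests

resp = requests.get("http://127.0.0.1:8000/status", timeout=10)  # hypothetical route
print(resp.status_code, resp.json())
```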

# Get current URL Status History
@@ -91,25 +126,31 @@ You can set the `ROOTPATH` env var to run the API at a path (default is at root):

```
export ROOTPATH=/linky
```

## Docker

A Docker instance must be running for the linkchecker command to work.
## CI/CD
A CI/CD configuration file is provided in order to create an automated, scheduled (cron) pipeline.
It is necessary to define the secrets context using GitLab secrets in order to connect to the database.
## Roadmap

### GeoHealthCheck integration
[GeoHealthCheck](https://GeoHealthCheck.org) is a component to monitor the liveliness of typical OGC services (WMS, WFS, WCS, CSW).
It is based on the [owslib](https://owslib.readthedocs.io/en/latest/) library, which provides a Python implementation of clients for various OGC services.
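
A minimal sketch of the kind of probe owslib enables (hypothetical endpoint; GeoHealthCheck's own probes are more elaborate):

```
from owslib.wms import WebMapService

# Connect to a WMS endpoint and list its layers as a basic liveliness probe
wms = WebMapService("https://example.com/geoserver/wms", version="1.3.0")
print(wms.identification.title)
print(list(wms.contents))
```
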
## Technological Stack
1. **Core Language**:
   - Python: Used for the linkchecker, API, and database interactions.
2. **Database**:
   - PostgreSQL: Used for storing and managing link information.
3. **Backend Framework**:
   - FastAPI: Used to create and expose REST API endpoints, with auto-generated components such as Swagger documentation.
4. **Containerization**:
   - Docker: Used to containerize the linkchecker application, ensuring consistent deployment and execution across environments.

## Soilwise-he project
This work has been initiated as part of the [Soilwise-he project](https://soilwise-he.eu/).
The project receives funding from the European Union's HORIZON Innovation Actions 2022 under grant agreement No. 101112838.
Binary file added images/categorization.png
Binary file added images/deprecated.png
Binary file added images/link_status.png
Binary file added images/timeouts.png
Binary file added images/val_history.png
137 changes: 97 additions & 40 deletions src/linkchecker.py
@@ -1,5 +1,6 @@
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from urllib.parse import urlparse, parse_qs, urlencode
import subprocess
import psycopg2
import psycopg2.extras
@@ -119,24 +120,40 @@ def extract_links(url):
        print(f"Error extracting links from {url}: {e}")
        return []

def check_single_url(url):
    # Check one URL without following its links (recursion level 0)
    process = subprocess.Popen([
        "linkchecker",
        "--verbose",
        "--check-extern",
        "--recursion-level=0",
        "--timeout=5",
        "--output=csv",
        url  # processed OGC URLs already carry a query string, so no "?f=html" suffix
    ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    # process.communicate is good for shorter-running processes
    stdout, _ = process.communicate()

    return stdout.decode('utf-8').strip().split('\n')

def run_linkchecker(url):
    # Run LinkChecker on the catalogue page, following links one level deep
    process = subprocess.Popen([
        "linkchecker",
        "--verbose",
        "--check-extern",
        "--recursion-level=1",
        "--timeout=5",
        "--output=csv",
        url + "?f=html"
    ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    # Process the output line by line and yield each line
    # Memory efficient for large outputs
    for line in process.stdout:
        yield line.decode('utf-8').strip()  # Decode bytes to string and strip newline characters
    # Wait for the process to finish
    process.wait()

def insert_or_update_link(conn, urlname, status, result, info, warning, is_valid):

@@ -231,6 +248,35 @@ def get_active_urls(conn):
    else:
        cur.execute("SELECT url FROM validation_history WHERE NOT deprecated")
    return [row[0] for row in cur.fetchall()]

def determine_service_type(url):
    ogc_patterns = ['/wms', '/wfs', '/csw', '/wcs', 'service=']

    if any(pattern in url.lower() for pattern in ogc_patterns):
        parsed_url = urlparse(url)
        query_params = parse_qs(parsed_url.query)

        # Preserve any explicit service parameter before rebuilding the query
        original_service = query_params.pop('service', None)
        query_params.pop('request', None)

        query_params['request'] = ['GetCapabilities']

        # Infer the service from the path, falling back to the original parameter
        if '/wms' in parsed_url.path.lower():
            query_params['service'] = ['WMS']
        elif '/wfs' in parsed_url.path.lower():
            query_params['service'] = ['WFS']
        elif '/csw' in parsed_url.path.lower():
            query_params['service'] = ['CSW']
        elif '/wcs' in parsed_url.path.lower():
            query_params['service'] = ['WCS']
        elif original_service:
            query_params['service'] = original_service

        new_query = urlencode(query_params, doseq=True)
        new_url = parsed_url._replace(query=new_query).geturl()

        return new_url

    return url
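
# Example (hypothetical URL) of the rewriting above:
#   determine_service_type("https://example.com/geoserver/wms")
#   -> "https://example.com/geoserver/wms?request=GetCapabilities&service=WMS"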

def main():
    start_time = time.time()  # Start timing
@@ -255,34 +301,45 @@ def main():
        extracted_links = extract_links(url)
        all_links.update(extracted_links)  # Add new links to the set of all links

    # Define the formats to be removed
    formats_to_remove = [
        'collections/' + collection + '/items?offset',
        '?f=json'
    ]

    # Specify the fields to include in the CSV file
    fields_to_include = ['urlname', 'parentname', 'baseref', 'valid', 'result', 'warning', 'info']

print("Checking Links...")

# Run LinkChecker and process the output
for line in run_linkchecker(all_links):
if re.match(r'^http', line):
# Remove trailing semicolon and split by semicolon
values = line.rstrip(';').split(';')

# Filter and pad values based on fields_to_include
filtered_values = [str(values[i]) if i < len(values) else "" for i in range(len(fields_to_include))]

# Destructure filtered_values
urlname, parentname, baseref, valid, result, warning, info = filtered_values

is_valid = is_valid_status(valid)

link_id = insert_or_update_link(conn, urlname, valid, result, info, warning, is_valid)

# Insert parent information
insert_parent(conn, parentname, baseref, link_id)
urls_to_recheck = set()
print("Initial Link Checking...")
for url in all_links:
for line in run_linkchecker(url):
if re.match(r'^http', line):
values = line.rstrip(';').split(';')
urlname = values[0]

# Parse initial check results
filtered_values = [str(values[i]) if i < len(values) else "" for i in range(len(fields_to_include))]
urlname, parentname, baseref, valid, result, warning, info = filtered_values

# Determine if URL needs to be rechecked
processed_url = determine_service_type(urlname)
if processed_url != urlname:
urls_to_recheck.add(processed_url)
else:
# If URL doesn't need reprocessing, insert results directly
is_valid = is_valid_status(valid)
link_id = insert_or_update_link(conn, urlname, valid, result, info, warning, is_valid)
insert_parent(conn, parentname, baseref, link_id)

print("Rechecking OGC processed URLs...")
for url in urls_to_recheck:
results = check_single_url(url)
for line in results:
if re.match(r'^http', line):
values = line.rstrip(';').split(';')
filtered_values = [str(values[i]) if i < len(values) else "" for i in range(len(fields_to_include))]
urlname, parentname, baseref, valid, result, warning, info = filtered_values
is_valid = is_valid_status(valid)
link_id = insert_or_update_link(conn, urlname, valid, result, info, warning, is_valid)
insert_parent(conn, parentname, baseref, link_id)

    # conn.commit()
    print("LinkChecker output written to PostgreSQL database")
