Merge pull request #121 from openeduhub/upgrade_to_scrapy_212
Upgrade to Python 3.13 and Scrapy v2.12 / feat: robots.txt parsing for "ccm:ai_allow_usage"
Criamos authored Dec 6, 2024
2 parents ef85217 + e733041 commit 4e60db7
Showing 14 changed files with 838 additions and 320 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python.yaml
@@ -15,7 +15,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.12"]
python-version: ["3.13"]

steps:
- uses: actions/checkout@v4
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,4 +1,4 @@
FROM python:3.12.7-slim-bookworm
FROM python:3.13-slim-bookworm

# ENV CRAWLER wirlernenonline_spider

15 changes: 9 additions & 6 deletions Readme.md
@@ -1,9 +1,9 @@
# Open Edu Hub Search ETL

## Step 1: Project Setup - Python 3.12 (manual approach)
## Step 1: Project Setup Python 3.13 (manual approach)

- make sure you have python3 installed (<https://docs.python-guide.org/starting/installation/>)
- (Python 3.12 or newer is required)
- (Python 3.13 is required)
- go to project root
- Run the following commands:

@@ -22,7 +22,7 @@ python3 -m venv .venv

## Step 1 (alternative): Project Setup - Python (automated, via `poetry`)

- Step 1: Make sure that you have [Poetry](https://python-poetry.org) v1.5.0+ installed
- Step 1: Make sure that you have [Poetry](https://python-poetry.org) [v1.8.4](https://github.com/python-poetry/poetry/releases/tag/1.8.4)+ installed
- for detailed instructions, please consult the [Poetry Installation Guide](https://python-poetry.org/docs/#installation)
- Step 2: Open your terminal **in the project root directory**:
- Step 2.1: If you want to have your `.venv` to be created inside the project root directory:
@@ -31,6 +31,7 @@ python3 -m venv .venv
- Step 3: **Install dependencies** (according to `pyproject.toml`) by running: `poetry install`

## Step 2: Project Setup - required Docker Containers

If you have Docker installed, use `docker-compose up` to start up the multi-container for `Splash` and `Playwright`-integration.

As a last step, set up your config variables by copying the `.env.example`-file and modifying it if necessary:
@@ -40,7 +41,7 @@ As a last step, set up your config variables by copying the `.env.example`-file
# Running crawlers

- A crawler can be run with `scrapy crawl <spider-name>`.
- (It assumes that you have an edu-sharing v6.0+ instance in your `.env` settings configured which can accept the data.)
- (It assumes that you have an edu-sharing v8.1+ instance in your `.env` settings configured which can accept the data.)
- If a crawler has [Scrapy Spider Contracts](https://docs.scrapy.org/en/latest/topics/contracts.html#spiders-contracts) implemented, you can test those by running `scrapy check <spider-name>`
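
For illustration, here is the programmatic equivalent of the `scrapy crawl <spider-name>` call above — a minimal sketch, assuming it is run from the project root (next to `scrapy.cfg`) and that a spider named `sample_spider` exists (the name is a placeholder):

```python
# Minimal sketch: programmatic equivalent of `scrapy crawl sample_spider`.
# Assumes the script runs from the project root (where scrapy.cfg lives);
# "sample_spider" is a placeholder spider name.
from scrapy.cmdline import execute

execute(["scrapy", "crawl", "sample_spider"])
```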


@@ -60,8 +61,10 @@ docker compose up

- We use Scrapy as a framework. Please check out the guides for Scrapy spider (https://docs.scrapy.org/en/latest/intro/tutorial.html)
- To create a new spider, create a file inside `converter/spiders/<myname>_spider.py`
- We recommend inheriting the `LomBase` class in order to get out-of-the-box support for our metadata model
- You may also Inherit a Base Class for crawling data, if your site provides LRMI metadata, the `LrmiBase` is a good start. If your system provides an OAI interface, you may use the `OAIBase`
- We recommend inheriting the `LomBase` class to get out-of-the-box support for our metadata model
- You may also inherit a base class (see: `converter/spiders/base_classes/`) for crawling data.
- If your site provides LRMI metadata, the `LrmiBase` is a good start.
- If your system provides an OAI interface, you may use the `OAIBase`
- As a sample/template, please take a look at the `sample_spider.py` or `sample_spider_alternative.py`
- To learn more about the LOM standard we're using, you'll find useful information at https://en.wikipedia.org/wiki/Learning_object_metadata
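
A hedged skeleton of such a spider — the class and method names are assumed from the existing sample spiders, so treat `sample_spider_alternative.py` and `converter/spiders/base_classes/lom_base.py` as the authoritative reference:

```python
# Sketch of a new spider skeleton; names and overrides assumed from the sample spiders.
import scrapy

from converter.spiders.base_classes import LomBase


class MySiteSpider(scrapy.Spider, LomBase):
    name = "my_site_spider"      # used by `scrapy crawl my_site_spider`
    friendlyName = "My Site"
    version = "0.0.1"            # bump to force re-crawls when the mapping changes
    start_urls = ["https://example.org/courses"]  # placeholder URL

    def getId(self, response=None) -> str:
        # stable, unique identifier of the crawled item
        return response.url

    def getHash(self, response=None) -> str:
        # change-detection value, typically derived from a date + self.version
        return f"{response.url}v{self.version}"

    async def parse(self, response: scrapy.http.Response, **kwargs):
        # build and yield a BaseItemLoader here,
        # e.g. as demonstrated in sample_spider_alternative.py
        ...
```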

15 changes: 15 additions & 0 deletions converter/es_connector.py
@@ -583,6 +583,21 @@ def transform_item(self, uuid, spider, item):
spaces["ccm:educationaltypicalagerange_from"] = tar["fromRange"]
if "toRange" in tar:
spaces["ccm:educationaltypicalagerange_to"] = tar["toRange"]
if "typicalLearningTime" in item["lom"]["educational"]:
tlt: int | str | None = item["lom"]["educational"]["typicalLearningTime"]
if (
tlt and isinstance(tlt,str) and tlt.isnumeric()
or tlt and isinstance(tlt, int)
):
tlt_in_ms: int = int(tlt) * 1000
spaces["cclom:typicallearningtime"] = tlt_in_ms

if "ai_allow_usage" in item:
# this property is automatically filled by the RobotsTxtPipeline
if isinstance(item["ai_allow_usage"], bool):
_ai_allow_usage: bool = item["ai_allow_usage"]
# the edu-sharing API client expects the value to be of type string
spaces["ccm:ai_allow_usage"] = str(_ai_allow_usage)

if "course" in item:
if "course_availability_from" in item["course"]:
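
For readability, the two conversions added in `transform_item` above, rewritten as standalone helpers (a hedged sketch for illustration, not code from the repository):

```python
# Hedged sketch of the two conversions the es_connector diff adds.

def typical_learning_time_to_ms(tlt: int | str | None) -> int | None:
    """Convert a typicalLearningTime value (seconds, as int or numeric string)
    to the millisecond value stored in 'cclom:typicallearningtime'."""
    if tlt and isinstance(tlt, int):
        return tlt * 1000
    if tlt and isinstance(tlt, str) and tlt.isnumeric():
        return int(tlt) * 1000
    return None  # unusable values are skipped


def ai_allow_usage_to_spaces_value(ai_allowed: bool) -> str:
    """The edu-sharing API client expects 'ccm:ai_allow_usage' as a string."""
    return str(ai_allowed)


assert typical_learning_time_to_ms("90") == 90_000
assert ai_allow_usage_to_spaces_value(False) == "False"
```
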
4 changes: 4 additions & 0 deletions converter/items.py
@@ -337,6 +337,7 @@ class CourseItem(Item):
"""
BIRD-specific metadata properties intended only for courses.
"""

course_availability_from = Field()
"""Corresponding edu-sharing property: ``ccm:oeh_event_begin`` (expects ISO datetime string)"""
course_availability_until = Field()
@@ -380,6 +381,9 @@ class BaseItem(Item):
- ``ValuespaceItem``
"""

ai_allow_usage = Field()
"""Stores a ``bool``-value to keep track of items that are allowed to be used in AI training.
Corresponding edu-sharing property: ``ccm:ai_allow_usage``"""
binary = Field()
"""Binary data which should be uploaded to edu-sharing (= raw data, e.g. ".pdf"-files)."""
collection = Field(output_processor=JoinMultivalues())
39 changes: 38 additions & 1 deletion converter/pipelines.py
@@ -42,6 +42,7 @@
from converter.items import BaseItem
from converter.util.edu_sharing_source_template_helper import EduSharingSourceTemplateHelper
from converter.util.language_mapper import LanguageMapper
from converter.util.robots_txt import is_ai_usage_allowed
from converter.web_tools import WebTools, WebEngine
from valuespace_converter.app.valuespaces import Valuespaces

@@ -346,7 +347,6 @@ def process_item(self, raw_item, spider):
tll_duration_in_seconds = determine_duration_and_convert_to_seconds(
time_raw=tll_raw, item_field_name="LomEducationalItem.typicalLearningTime"
)
# ToDo: update es_connector and connect this property with the backend
item["lom"]["educational"]["typicalLearningTime"] = tll_duration_in_seconds

if "technical" in item["lom"]:
@@ -1107,6 +1107,43 @@ def process_item(self, raw_item, spider):
# raise DropItem()
return raw_item

class RobotsTxtPipeline(BasicPipeline):
"""
Analyze the ``robots.txt``-file of an item
to look for indicators if said item is allowed to be used for AI training.
"""
def process_item(self, item: scrapy.Item, spider: scrapy.Spider) -> Optional[scrapy.Item]:
item_adapter = ItemAdapter(item)
if "ai_allow_usage" in item_adapter:
# if the scrapy Field is already filled before hitting this pipeline,
# we can assume that a crawler-specific implementation already filled this field and do a type-validation
_ai_allowed: bool = item_adapter["ai_allow_usage"]
if isinstance(_ai_allowed, bool):
return item
else:
log.warning(f"Wrong type for BaseItem.ai_allow_usage detected: "
f"Expected a 'bool'-value, but received type {type(_ai_allowed)} .")
else:
# default behavior: the pipeline should fill up the "ai_allow_usage"-field for every item
_item_url: str | None = None
try:
_response_url: str | None = item_adapter["response"]["url"]
_lom_technical_location: list[str] | None = item_adapter["lom"]["technical"]["location"]
if _response_url and isinstance(_response_url, str):
_item_url = _response_url
elif _lom_technical_location and isinstance(_lom_technical_location, list):
# LOM Technical location might contain several URLs, we'll try to grab the first one
if len(_lom_technical_location) >= 1:
_item_url = _lom_technical_location[0]
except KeyError:
# Not all items have URLs in the scrapy fields we're looking into.
# Binary files might have neither ``BaseItem.response.url`` nor ``BaseItem.lom.technical.location``
pass
if _item_url:
# only try to fetch a robots.txt file if we successfully grabbed a URL from the item
_ai_allowed: bool = is_ai_usage_allowed(url=_item_url)
item_adapter["ai_allow_usage"] = _ai_allowed
return item
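
A hedged sketch of how the new pipeline could be exercised in isolation — it assumes `BasicPipeline` needs no constructor arguments, uses placeholder URLs, and performs a real robots.txt lookup over the network:

```python
# Hedged sketch: running RobotsTxtPipeline outside of a crawl (placeholder URLs).
from converter.items import BaseItem
from converter.pipelines import RobotsTxtPipeline

item = BaseItem()
item["response"] = {"url": "https://example.org/course/42"}
item["lom"] = {"technical": {"location": ["https://example.org/course/42"]}}

pipeline = RobotsTxtPipeline()
processed = pipeline.process_item(item, spider=None)  # the spider argument is not used here
print(processed.get("ai_allow_usage"))  # True, unless the robots.txt blocks a known AI user agent
```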

class EduSharingTypeValidationPipeline(BasicPipeline):
"""
4 changes: 1 addition & 3 deletions converter/settings.py
@@ -32,9 +32,6 @@
})

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
# fixes Scrapy DeprecationWarning on startup (Scrapy v2.10+)
# (see: https://docs.scrapy.org/en/latest/topics/request-response.html#request-fingerprinter-implementation):

# Default behaviour for regular crawlers of non-license-controlled content
# When set True, every item will have GROUP_EVERYONE attached in edu-sharing
@@ -131,6 +128,7 @@
"converter.pipelines.NormLanguagePipeline": 150,
"converter.pipelines.ConvertTimePipeline": 200,
"converter.pipelines.ProcessValuespacePipeline": 250,
"converter.pipelines.RobotsTxtPipeline": 255,
"converter.pipelines.CourseItemPipeline": 275,
"converter.pipelines.ProcessThumbnailPipeline": 300,
"converter.pipelines.EduSharingTypeValidationPipeline": 325,
2 changes: 2 additions & 0 deletions converter/spiders/sample_spider_alternative.py
@@ -70,6 +70,8 @@ async def parse(self, response: scrapy.http.Response, **kwargs) -> BaseItemLoade
# human readable string of text) store its within the 'fulltext' field.)
# If no 'fulltext' value was provided, the pipelines will try to fetch
# 'full text' content from "ResponseItem.text" and save it here.
# - ai_allow_usage optional (filled automatically by the ``RobotsTxtPipeline`` and expects a boolean)
# indicates if an item is allowed to be used in AI training.
base.add_value('sourceId', response.url)
# if the source doesn't have a "datePublished" or "lastModified"-value in its header or JSON_LD,
# you might have to help yourself with a unique string consisting of the datetime of the crawl + self.version
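
If a crawler already knows whether AI usage is permitted (e.g. from a site's terms of use), it can pre-fill the field itself and the `RobotsTxtPipeline` will only type-check it — a hedged sketch:

```python
# Hedged sketch: pre-filling 'ai_allow_usage' in a crawler's parse() method.
# A pre-filled bool makes the RobotsTxtPipeline skip its robots.txt lookup for this item.
from converter.items import BaseItemLoader

base = BaseItemLoader()
base.add_value("sourceId", "https://example.org/course/42")  # placeholder
base.add_value("ai_allow_usage", False)  # e.g. when the site's ToS forbids AI training
```
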
151 changes: 151 additions & 0 deletions converter/util/robots_txt.py
@@ -0,0 +1,151 @@
import re
from functools import lru_cache

import requests
import tldextract
from loguru import logger
from protego import Protego
from tldextract.tldextract import ExtractResult

AI_USER_AGENTS: list[str] = [
"anthropic-ai",
"Claude-Web",
"Applebot-Extended",
"Bytespider",
"CCBot",
"ChatGPT-User",
"cohere-ai",
"Diffbot",
"FacebookBot",
"GoogleOther",
"Google-Extended",
"GPTBot",
"ImagesiftBot",
"PerplexityBot",
"OmigiliBot",
"Omigili",
]
# this non-exhaustive list of known (AI) web crawlers is used to check if the robots.txt file explicitly allows or forbids AI usage
# for reference: https://www.foundationwebdev.com/2023/11/which-web-crawlers-are-associated-with-ai-crawlers/
# ToDo: the list of known AI user agents could be refactored into a SkoHub Vocab


@lru_cache(maxsize=512)
def fetch_robots_txt(url: str) -> str | None:
"""
Fetch the robots.txt file from the given URL.
:param url: URL string pointing towards a ``robots.txt``-file.
:return: The file content of the ``robots.txt``-file as a ``str``, otherwise returns ``None`` if the HTTP ``GET``-request failed.
"""
response: requests.Response = requests.get(url=url)
if response.status_code != 200:
logger.warning(
f"Could not fetch robots.txt from {url} . "
f"Response code: {response.status_code} "
f"Reason: {response.reason}"
)
return None
else:
# happy-case: the content of the robots.txt file should be available in response.text
return response.text


def _remove_wildcard_user_agent_from_robots_txt(robots_txt: str) -> str:
"""
Remove the wildcard user agent part of a string from the given ``robots.txt``-string.
:param robots_txt: text content of a ``robots.txt``-file
:return: ``robots.txt``-file content without the wildcard user agent. If no wildcard user agent was found, return the original string without alterations.
"""
# the user agent directive can appear in different forms and spellings
# (e.g. "user agent:", "useragent:", "user-agent:", "User-agent:" etc.)
# and is followed by a newline with "disallow: /"
_wildcard_pattern: re.Pattern = re.compile(
r"(?P<user_agent_directive>[u|U]ser[\s|-]?[a|A]gent:\s*[*]\s*)"
r"(?P<disallow_directive>[d|D]isallow:\s*/\s+)"
)
_wildcard_agent_match: re.Match | None = _wildcard_pattern.search(robots_txt)
if _wildcard_agent_match:
# remove the wildcard user agent from the parsed robots.txt string
robots_txt_without_wildcard_agent: str = robots_txt.replace(_wildcard_agent_match.group(), "")
return robots_txt_without_wildcard_agent
else:
# if no wildcard user agent was detected, do nothing.
return robots_txt


def _parse_robots_txt_with_protego(robots_txt: str) -> Protego | None:
"""
Parse a ``robots.txt``-string with ``Protego``.
:param robots_txt: text content of a ``robots.txt``-file
:return: returns a ``Protego``-object if the string could be parsed successfully, otherwise returns ``None``
"""
if robots_txt and isinstance(robots_txt, str):
robots_txt = _remove_wildcard_user_agent_from_robots_txt(robots_txt)
protego_object: Protego = Protego.parse(robots_txt)
return protego_object
else:
return None


def _check_protego_object_against_list_of_known_ai_user_agents(protego_object: Protego, url: str) -> bool:
"""
Check if the given ``url`` is allowed to be scraped by AI user agents.
:param protego_object: ``Protego``-object holding ``robots.txt``-information
:param url: URL to be checked against a list of known AI user agents
:return: Returns ``True`` if the given ``url`` is allowed to be scraped by AI user agents. If the ``robots.txt``-file forbids AI scrapers, returns ``False``.
"""
if url is None:
raise ValueError(f"url cannot be None. (Please provide a valid URL string!)")
if protego_object is None:
raise ValueError(f"This method requires a valid protego object.")
else:
ai_usage_allowed: bool = True # assumption: if not explicitly disallowed by the robots.txt, AI usage is allowed
for user_agent in AI_USER_AGENTS:
_allowed_for_current_user_agent: bool = protego_object.can_fetch(
url=url,
user_agent=user_agent,
)
if _allowed_for_current_user_agent is False:
ai_usage_allowed = False
# as soon as one AI user agent is disallowed, we assume that AI usage is generally disallowed,
# therefore, we can skip the rest of the iterations.
break
return ai_usage_allowed


def is_ai_usage_allowed(url: str, robots_txt: str = None) -> bool:
"""
Check if the given ``url`` is allowed to be scraped by AI user agents.
:param url: URL to be checked against a list of known AI user agents
:param robots_txt: string value of a ``robots.txt`` file. If no ``robots.txt``-string is provided, fallback to HTTP Request: ``https://<fully_qualified_domain_name>/robots.txt``
:return: Returns ``True`` if the given ``url`` is allowed to be scraped by AI user agents. If the URL target forbids any of the known AI scrapers, returns ``False``.
"""
if robots_txt is None:
# Fallback:
# if no robots_txt string was provided, fetch the file from "<fully qualified domain name>/robots.txt"
_extracted: ExtractResult = tldextract.extract(url=url)
# using tldextract instead of python's built-in ``urllib.parse.urlparse()``-method was a conscious decision!
# tldextract is more forgiving/reliable when it comes to incomplete urls (and provides a neat "fqdn"-attribute)
if _extracted.fqdn:
# use the fully qualified domain name to build the robots.txt url
_most_probable_robots_url_path: str = f"https://{_extracted.fqdn}/robots.txt"
robots_txt: str = fetch_robots_txt(url=_most_probable_robots_url_path)
if robots_txt is None:
# if the website provides no robots.txt, assume that everything is allowed
return True
else:
# this covers edge-cases for completely malformed URLs like "https://fakeurl"
raise ValueError(f"The URL {url} does not exist (and therefore contains no robots.txt).")
if robots_txt:
# happy case: check the current url against the provided robots.txt ruleset
po = _parse_robots_txt_with_protego(robots_txt=robots_txt)
_ai_usage_allowed: bool = _check_protego_object_against_list_of_known_ai_user_agents(
protego_object=po,
url=url,
)
return _ai_usage_allowed
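
A usage sketch for the module above; passing a `robots_txt` string directly avoids the HTTP fallback (the ruleset below is made up for illustration):

```python
# Usage sketch for is_ai_usage_allowed(); the robots.txt content below is made up.
from converter.util.robots_txt import is_ai_usage_allowed

_robots_txt_example = """
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

print(is_ai_usage_allowed(url="https://example.org/course/42", robots_txt=_robots_txt_example))
# -> False, because at least one known AI user agent (GPTBot) is disallowed
```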