Merge pull request #121 from openeduhub/upgrade_to_scrapy_212
Upgrade to Python 3.13 and Scrapy v2.12 / feat: robots.txt parsing for "ccm:ai_allow_usage"
Criamos authored Dec 6, 2024
2 parents ef85217 + e733041 commit 4e60db7
Showing 14 changed files with 838 additions and 320 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python.yaml
@@ -15,7 +15,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.12"]
python-version: ["3.13"]

steps:
- uses: actions/checkout@v4
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,4 +1,4 @@
FROM python:3.12.7-slim-bookworm
FROM python:3.13-slim-bookworm

# ENV CRAWLER wirlernenonline_spider

15 changes: 9 additions & 6 deletions Readme.md
@@ -1,9 +1,9 @@
# Open Edu Hub Search ETL

## Step 1: Project Setup - Python 3.12 (manual approach)
## Step 1: Project Setup Python 3.13 (manual approach)

- make sure you have python3 installed (<https://docs.python-guide.org/starting/installation/>)
- (Python 3.12 or newer is required)
- (Python 3.13 is required)
- go to project root
- Run the following commands:

@@ -22,7 +22,7 @@ python3 -m venv .venv

## Step 1 (alternative): Project Setup - Python (automated, via `poetry`)

- Step 1: Make sure that you have [Poetry](https://python-poetry.org) v1.5.0+ installed
- Step 1: Make sure that you have [Poetry](https://python-poetry.org) [v1.8.4](https://github.com/python-poetry/poetry/releases/tag/1.8.4)+ installed
- for detailed instructions, please consult the [Poetry Installation Guide](https://python-poetry.org/docs/#installation)
- Step 2: Open your terminal **in the project root directory**:
- Step 2.1: If you want to have your `.venv` to be created inside the project root directory:
@@ -31,6 +31,7 @@ python3 -m venv .venv
- Step 3: **Install dependencies** (according to `pyproject.toml`) by running: `poetry install`

## Step 2: Project Setup - required Docker Containers

If you have Docker installed, use `docker-compose up` to start up the multi-container for `Splash` and `Playwright`-integration.

As a last step, set up your config variables by copying the `.env.example`-file and modifying it if necessary:
@@ -40,7 +41,7 @@ As a last step, set up your config variables by copying the `.env.example`-file
# Running crawlers

- A crawler can be run with `scrapy crawl <spider-name>`.
- (It assumes that you have an edu-sharing v6.0+ instance in your `.env` settings configured which can accept the data.)
- (It assumes that you have an edu-sharing v8.1+ instance in your `.env` settings configured which can accept the data.)
- If a crawler has [Scrapy Spider Contracts](https://docs.scrapy.org/en/latest/topics/contracts.html#spiders-contracts) implemented, you can test those by running `scrapy check <spider-name>`
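
For illustration, here is the programmatic equivalent of the `scrapy crawl <spider-name>` call above — a minimal sketch, assuming it is run from the project root (next to `scrapy.cfg`) and that a spider named `sample_spider` exists (the name is a placeholder):

```python
# Minimal sketch: programmatic equivalent of `scrapy crawl sample_spider`.
# Assumes the script runs from the project root (where scrapy.cfg lives);
# "sample_spider" is a placeholder spider name.
from scrapy.cmdline import execute

execute(["scrapy", "crawl", "sample_spider"])
```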


@@ -60,8 +61,10 @@ docker compose up

- We use Scrapy as a framework. Please check out the guides for Scrapy spider (https://docs.scrapy.org/en/latest/intro/tutorial.html)
- To create a new spider, create a file inside `converter/spiders/<myname>_spider.py`
- We recommend inheriting the `LomBase` class in order to get out-of-the-box support for our metadata model
- You may also Inherit a Base Class for crawling data, if your site provides LRMI metadata, the `LrmiBase` is a good start. If your system provides an OAI interface, you may use the `OAIBase`
- We recommend inheriting the `LomBase` class to get out-of-the-box support for our metadata model
- You may also inherit a base class (see: `converter/spiders/base_classes/`) for crawling data.
- If your site provides LRMI metadata, the `LrmiBase` is a good start.
- If your system provides an OAI interface, you may use the `OAIBase`
- As a sample/template, please take a look at the `sample_spider.py` or `sample_spider_alternative.py`
- To learn more about the LOM standard we're using, you'll find useful information at https://en.wikipedia.org/wiki/Learning_object_metadata
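
A hedged skeleton of such a spider — the class and method names are assumed from the existing sample spiders, so treat `sample_spider_alternative.py` and `converter/spiders/base_classes/lom_base.py` as the authoritative reference:

```python
# Sketch of a new spider skeleton; names and overrides assumed from the sample spiders.
import scrapy

from converter.spiders.base_classes import LomBase


class MySiteSpider(scrapy.Spider, LomBase):
    name = "my_site_spider"      # used by `scrapy crawl my_site_spider`
    friendlyName = "My Site"
    version = "0.0.1"            # bump to force re-crawls when the mapping changes
    start_urls = ["https://example.org/courses"]  # placeholder URL

    def getId(self, response=None) -> str:
        # stable, unique identifier of the crawled item
        return response.url

    def getHash(self, response=None) -> str:
        # change-detection value, typically derived from a date + self.version
        return f"{response.url}v{self.version}"

    async def parse(self, response: scrapy.http.Response, **kwargs):
        # build and yield a BaseItemLoader here,
        # e.g. as demonstrated in sample_spider_alternative.py
        ...
```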

15 changes: 15 additions & 0 deletions converter/es_connector.py
@@ -583,6 +583,21 @@ def transform_item(self, uuid, spider, item):
spaces["ccm:educationaltypicalagerange_from"] = tar["fromRange"]
if "toRange" in tar:
spaces["ccm:educationaltypicalagerange_to"] = tar["toRange"]
if "typicalLearningTime" in item["lom"]["educational"]:
tlt: int | str | None = item["lom"]["educational"]["typicalLearningTime"]
if (
tlt and isinstance(tlt,str) and tlt.isnumeric()
or tlt and isinstance(tlt, int)
):
tlt_in_ms: int = int(tlt) * 1000
spaces["cclom:typicallearningtime"] = tlt_in_ms

if "ai_allow_usage" in item:
# this property is automatically filled by the RobotsTxtPipeline
if isinstance(item["ai_allow_usage"], bool):
_ai_allow_usage: bool = item["ai_allow_usage"]
# the edu-sharing API client expects the value to be of type string
spaces["ccm:ai_allow_usage"] = str(_ai_allow_usage)

if "course" in item:
if "course_availability_from" in item["course"]:
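
For readability, the two conversions added in `transform_item` above, rewritten as standalone helpers (a hedged sketch for illustration, not code from the repository):

```python
# Hedged sketch of the two conversions the es_connector diff adds.

def typical_learning_time_to_ms(tlt: int | str | None) -> int | None:
    """Convert a typicalLearningTime value (seconds, as int or numeric string)
    to the millisecond value stored in 'cclom:typicallearningtime'."""
    if tlt and isinstance(tlt, int):
        return tlt * 1000
    if tlt and isinstance(tlt, str) and tlt.isnumeric():
        return int(tlt) * 1000
    return None  # unusable values are skipped


def ai_allow_usage_to_spaces_value(ai_allowed: bool) -> str:
    """The edu-sharing API client expects 'ccm:ai_allow_usage' as a string."""
    return str(ai_allowed)


assert typical_learning_time_to_ms("90") == 90_000
assert ai_allow_usage_to_spaces_value(False) == "False"
```
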
4 changes: 4 additions & 0 deletions converter/items.py
@@ -337,6 +337,7 @@ class CourseItem(Item):
"""
BIRD-specific metadata properties intended only for courses.
"""

course_availability_from = Field()
"""Corresponding edu-sharing property: ``ccm:oeh_event_begin`` (expects ISO datetime string)"""
course_availability_until = Field()
@@ -380,6 +381,9 @@ class BaseItem(Item):
- ``ValuespaceItem``
"""

ai_allow_usage = Field()
"""Stores a ``bool``-value to keep track of items that are allowed to be used in AI training.
Corresponding edu-sharing property: ``ccm:ai_allow_usage``"""
binary = Field()
"""Binary data which should be uploaded to edu-sharing (= raw data, e.g. ".pdf"-files)."""
collection = Field(output_processor=JoinMultivalues())
39 changes: 38 additions & 1 deletion converter/pipelines.py
@@ -42,6 +42,7 @@
from converter.items import BaseItem
from converter.util.edu_sharing_source_template_helper import EduSharingSourceTemplateHelper
from converter.util.language_mapper import LanguageMapper
from converter.util.robots_txt import is_ai_usage_allowed
from converter.web_tools import WebTools, WebEngine
from valuespace_converter.app.valuespaces import Valuespaces

@@ -346,7 +347,6 @@ def process_item(self, raw_item, spider):
tll_duration_in_seconds = determine_duration_and_convert_to_seconds(
time_raw=tll_raw, item_field_name="LomEducationalItem.typicalLearningTime"
)
# ToDo: update es_connector and connect this property with the backend
item["lom"]["educational"]["typicalLearningTime"] = tll_duration_in_seconds

if "technical" in item["lom"]:
@@ -1107,6 +1107,43 @@ def process_item(self, raw_item, spider):
# raise DropItem()
return raw_item

class RobotsTxtPipeline(BasicPipeline):
"""
Analyze the ``robots.txt``-file of an item
to look for indicators if said item is allowed to be used for AI training.
"""
def process_item(self, item: scrapy.Item, spider: scrapy.Spider) -> Optional[scrapy.Item]:
item_adapter = ItemAdapter(item)
if "ai_allow_usage" in item_adapter:
# if the scrapy Field is already filled before hitting this pipeline,
# we can assume that a crawler-specific implementation already filled this field and do a type-validation
_ai_allowed: bool = item_adapter["ai_allow_usage"]
if isinstance(_ai_allowed, bool):
return item
else:
log.warning(f"Wrong type for BaseItem.ai_allow_usage detected: "
f"Expected a 'bool'-value, but received type {type(_ai_allowed)} .")
else:
# default behavior: the pipeline should fill up the "ai_allow_usage"-field for every item
_item_url: str | None = None
try:
_response_url: str | None = item_adapter["response"]["url"]
_lom_technical_location: list[str] | None = item_adapter["lom"]["technical"]["location"]
if _response_url and isinstance(_response_url, str):
_item_url = _response_url
elif _lom_technical_location and isinstance(_lom_technical_location, list):
# LOM Technical location might contain several URLs, we'll try to grab the first one
if len(_lom_technical_location) >= 1:
_item_url = _lom_technical_location[0]
except KeyError:
# Not all items have URLs in the scrapy fields we're looking into.
# Binary files might have neither ``BaseItem.response.url`` nor ``BaseItem.lom.technical.location``
pass
if _item_url:
# only try to fetch a robots.txt file if we successfully grabbed a URL from the item
_ai_allowed: bool = is_ai_usage_allowed(url=_item_url)
item_adapter["ai_allow_usage"] = _ai_allowed
return item
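
A hedged sketch of how the new pipeline could be exercised in isolation — it assumes `BasicPipeline` needs no constructor arguments, uses placeholder URLs, and performs a real robots.txt lookup over the network:

```python
# Hedged sketch: running RobotsTxtPipeline outside of a crawl (placeholder URLs).
from converter.items import BaseItem
from converter.pipelines import RobotsTxtPipeline

item = BaseItem()
item["response"] = {"url": "https://example.org/course/42"}
item["lom"] = {"technical": {"location": ["https://example.org/course/42"]}}

pipeline = RobotsTxtPipeline()
processed = pipeline.process_item(item, spider=None)  # the spider argument is not used here
print(processed.get("ai_allow_usage"))  # True, unless the robots.txt blocks a known AI user agent
```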

class EduSharingTypeValidationPipeline(BasicPipeline):
"""
4 changes: 1 addition & 3 deletions converter/settings.py
@@ -32,9 +32,6 @@
})

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
# fixes Scrapy DeprecationWarning on startup (Scrapy v2.10+)
# (see: https://docs.scrapy.org/en/latest/topics/request-response.html#request-fingerprinter-implementation):

# Default behaviour for regular crawlers of non-license-controlled content
# When set True, every item will have GROUP_EVERYONE attached in edu-sharing
@@ -131,6 +128,7 @@
"converter.pipelines.NormLanguagePipeline": 150,
"converter.pipelines.ConvertTimePipeline": 200,
"converter.pipelines.ProcessValuespacePipeline": 250,
"converter.pipelines.RobotsTxtPipeline": 255,
"converter.pipelines.CourseItemPipeline": 275,
"converter.pipelines.ProcessThumbnailPipeline": 300,
"converter.pipelines.EduSharingTypeValidationPipeline": 325,
2 changes: 2 additions & 0 deletions converter/spiders/sample_spider_alternative.py
@@ -70,6 +70,8 @@ async def parse(self, response: scrapy.http.Response, **kwargs) -> BaseItemLoade
# human readable string of text) store its within the 'fulltext' field.)
# If no 'fulltext' value was provided, the pipelines will try to fetch
# 'full text' content from "ResponseItem.text" and save it here.
# - ai_allow_usage optional (filled automatically by the ``RobotsTxtPipeline`` and expects a boolean)
# indicates if an item is allowed to be used in AI training.
base.add_value('sourceId', response.url)
# if the source doesn't have a "datePublished" or "lastModified"-value in its header or JSON_LD,
# you might have to help yourself with a unique string consisting of the datetime of the crawl + self.version
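
If a crawler already knows whether AI usage is permitted (e.g. from a site's terms of use), it can pre-fill the field itself and the `RobotsTxtPipeline` will only type-check it — a hedged sketch:

```python
# Hedged sketch: pre-filling 'ai_allow_usage' in a crawler's parse() method.
# A pre-filled bool makes the RobotsTxtPipeline skip its robots.txt lookup for this item.
from converter.items import BaseItemLoader

base = BaseItemLoader()
base.add_value("sourceId", "https://example.org/course/42")  # placeholder
base.add_value("ai_allow_usage", False)  # e.g. when the site's ToS forbids AI training
```
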
151 changes: 151 additions & 0 deletions converter/util/robots_txt.py
@@ -0,0 +1,151 @@
import re
from functools import lru_cache

import requests
import tldextract
from loguru import logger
from protego import Protego
from tldextract.tldextract import ExtractResult

AI_USER_AGENTS: list[str] = [
"anthropic-ai",
"Claude-Web",
"Applebot-Extended",
"Bytespider",
"CCBot",
"ChatGPT-User",
"cohere-ai",
"Diffbot",
"FacebookBot",
"GoogleOther",
"Google-Extended",
"GPTBot",
"ImagesiftBot",
"PerplexityBot",
"OmigiliBot",
"Omigili",
]
# this non-exhaustive list of known (AI) web crawlers is used to check if the robots.txt file explicitly allows or forbids AI usage
# for reference: https://www.foundationwebdev.com/2023/11/which-web-crawlers-are-associated-with-ai-crawlers/
# ToDo: the list of known AI user agents could be refactored into a SkoHub Vocab


@lru_cache(maxsize=512)
def fetch_robots_txt(url: str) -> str | None:
"""
Fetch the robots.txt file from the given URL.
:param url: URL string pointing towards a ``robots.txt``-file.
:return: The file content of the ``robots.txt``-file as a ``str``, otherwise returns ``None`` if the HTTP ``GET``-request failed.
"""
response: requests.Response = requests.get(url=url)
if response.status_code != 200:
logger.warning(
f"Could not fetch robots.txt from {url} . "
f"Response code: {response.status_code} "
f"Reason: {response.reason}"
)
return None
else:
# happy-case: the content of the robots.txt file should be available in response.text
return response.text


def _remove_wildcard_user_agent_from_robots_txt(robots_txt: str) -> str:
"""
Remove the wildcard user agent part of a string from the given ``robots.txt``-string.
:param robots_txt: text content of a ``robots.txt``-file
:return: ``robots.txt``-file content without the wildcard user agent. If no wildcard user agent was found, return the original string without alterations.
"""
# the user agent directive can appear in different forms and spellings
# (e.g. "user agent:", "useragent:", "user-agent:", "User-agent:" etc.)
# and is followed by a newline with "disallow: /"
_wildcard_pattern: re.Pattern = re.compile(
r"(?P<user_agent_directive>[u|U]ser[\s|-]?[a|A]gent:\s*[*]\s*)"
r"(?P<disallow_directive>[d|D]isallow:\s*/\s+)"
)
_wildcard_agent_match: re.Match | None = _wildcard_pattern.search(robots_txt)
if _wildcard_agent_match:
# remove the wildcard user agent from the parsed robots.txt string
robots_txt_without_wildcard_agent: str = robots_txt.replace(_wildcard_agent_match.group(), "")
return robots_txt_without_wildcard_agent
else:
# if no wildcard user agent was detected, do nothing.
return robots_txt


def _parse_robots_txt_with_protego(robots_txt: str) -> Protego | None:
"""
Parse a ``robots.txt``-string with ``Protego``.
:param robots_txt: text content of a ``robots.txt``-file
:return: returns a ``Protego``-object if the string could be parsed successfully, otherwise returns ``None``
"""
if robots_txt and isinstance(robots_txt, str):
robots_txt = _remove_wildcard_user_agent_from_robots_txt(robots_txt)
protego_object: Protego = Protego.parse(robots_txt)
return protego_object
else:
return None


def _check_protego_object_against_list_of_known_ai_user_agents(protego_object: Protego, url: str) -> bool:
"""
Check if the given ``url`` is allowed to be scraped by AI user agents.
:param protego_object: ``Protego``-object holding ``robots.txt``-information
:param url: URL to be checked against a list of known AI user agents
:return: Returns ``True`` if the given ``url`` is allowed to be scraped by AI user agents. If the ``robots.txt``-file forbids AI scrapers, returns ``False``.
"""
if url is None:
raise ValueError(f"url cannot be None. (Please provide a valid URL string!)")
if protego_object is None:
raise ValueError(f"This method requires a valid protego object.")
else:
ai_usage_allowed: bool = True # assumption: if not explicitly disallowed by the robots.txt, AI usage is allowed
for user_agent in AI_USER_AGENTS:
_allowed_for_current_user_agent: bool = protego_object.can_fetch(
url=url,
user_agent=user_agent,
)
if _allowed_for_current_user_agent is False:
ai_usage_allowed = False
# as soon as one AI user agent is disallowed, we assume that AI usage is generally disallowed,
# therefore, we can skip the rest of the iterations.
break
return ai_usage_allowed


def is_ai_usage_allowed(url: str, robots_txt: str = None) -> bool:
"""
Check if the given ``url`` is allowed to be scraped by AI user agents.
:param url: URL to be checked against a list of known AI user agents
:param robots_txt: string value of a ``robots.txt`` file. If no ``robots.txt``-string is provided, fallback to HTTP Request: ``https://<fully_qualified_domain_name>/robots.txt``
:return: Returns ``True`` if the given ``url`` is allowed to be scraped by AI user agents. If the URL target forbids any of the known AI scrapers, returns ``False``.
"""
if robots_txt is None:
# Fallback:
# if no robots_txt string was provided, fetch the file from "<fully qualified domain name>/robots.txt"
_extracted: ExtractResult = tldextract.extract(url=url)
# using tldextract instead of python's built-in ``urllib.parse.urlparse()``-method was a conscious decision!
# tldextract is more forgiving/reliable when it comes to incomplete urls (and provides a neat "fqdn"-attribute)
if _extracted.fqdn:
# use the fully qualified domain name to build the robots.txt url
_most_probable_robots_url_path: str = f"https://{_extracted.fqdn}/robots.txt"
robots_txt: str = fetch_robots_txt(url=_most_probable_robots_url_path)
if robots_txt is None:
# if the website provides no robots.txt, assume that everything is allowed
return True
else:
# this covers edge-cases for completely malformed URLs like "https://fakeurl"
raise ValueError(f"The URL {url} does not exist (and therefore contains no robots.txt).")
if robots_txt:
# happy case: check the current url against the provided robots.txt ruleset
po = _parse_robots_txt_with_protego(robots_txt=robots_txt)
_ai_usage_allowed: bool = _check_protego_object_against_list_of_known_ai_user_agents(
protego_object=po,
url=url,
)
return _ai_usage_allowed
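
A usage sketch for the module above; passing a `robots_txt` string directly avoids the HTTP fallback (the ruleset below is made up for illustration):

```python
# Usage sketch for is_ai_usage_allowed(); the robots.txt content below is made up.
from converter.util.robots_txt import is_ai_usage_allowed

_robots_txt_example = """
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

print(is_ai_usage_allowed(url="https://example.org/course/42", robots_txt=_robots_txt_example))
# -> False, because at least one known AI user agent (GPTBot) is disallowed
```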