feat: parse robots.txt for AI usage indicators ("ccm:ai_allow_usage") #120

Merged
19 commits merged into upgrade_to_scrapy_212 on Dec 6, 2024

Conversation

@Criamos Criamos commented Dec 6, 2024

This PR includes the following changes:

  • feat: implemented a new Scrapy Field in BaseItem: ai_allow_usage (see the sketch below)
    • this metadata field stores a bool value that indicates whether a crawled item is allowed (or disallowed) to be used for AI training
  • feat: implemented a RobotsTxtPipeline to automatically fill in the BaseItem.ai_allow_usage field during a crawl
  • feat: connected LomEducationalItem.typicalLearningTime with the edu-sharing backend (cclom:typicallearningtime)
  • deps: added loguru, protego and tldextract to the list of dependencies
    • loguru: makes debugging (and logging handlers) a bit more pleasant to work with
    • protego: parses robots.txt files more efficiently than Python's built-in RobotFileParser
      • this package is used by Scrapy itself (and maintained by the Scrapy devs)
    • tldextract: helps us disassemble URLs into their structural parts
      • (this package is already part of the Scrapy dependencies)

(This PR closes https://edu-sharing.atlassian.net/browse/KDATAPORT-23)
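
For orientation, here is a minimal sketch of how such a field is declared on a Scrapy item. The surrounding fields and the exact placement inside BaseItem are assumptions, not the literal contents of items.py:

```python
import scrapy


class BaseItem(scrapy.Item):
    # ... existing metadata fields ...
    ai_allow_usage = scrapy.Field()
    # holds a bool: True if no known AI crawler is explicitly disallowed by robots.txt,
    # False as soon as at least one known AI user agent is disallowed for the item's URL
```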


Implementation details

The basic idea behind the ai_allow_usage field and its bool value:
While there is currently no strict standard that answers the question "is this <dataset / item> allowed to be used by AI?", the robots.txt file of a website might at least offer some indicators that could become relevant in the future.

If a webmaster explicitly forbids some AI scrapers from interacting with their website (by declaring a robots.txt directive for known AI user agents), we assume that all AI usage is generally disallowed for those items. Here's an excerpt of disallowed AI user agents (this example is from the "Department of Earth, Ocean and Atmospheric Sciences" at the University of British Columbia):

```text
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: GPTBot
Disallow: /
```

When an item goes through our RobotsTxtPipeline, the pipeline compares the item's URL with the rules/directives of the parsed robots.txt file and checks them against a list of known AI user agents (for further reading: "Which web crawlers are associated with AI crawlers?"). As soon as a single AI scraper is disallowed, the pipeline will flag the item with ai_allow_usage: False (see the snippet below).
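
As an illustration of how protego evaluates such a rule set (the URL below is a placeholder, not an actual crawled item):

```python
# Illustration only: evaluating one of the rules from the excerpt above with protego.
from protego import Protego

robots_txt = """
User-agent: GPTBot
Disallow: /
"""

parser = Protego.parse(robots_txt)
# the placeholder URL stands in for any item URL on that domain
print(parser.can_fetch("https://www.example.org/some/item", "GPTBot"))  # -> False
# a single disallowed AI user agent is enough to flag the item with ai_allow_usage: False
```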

If no (known) AI user agents are mentioned in the robots.txt file, we have to assume that the webmaster has not yet made a conscious decision about AI usage and therefore treat AI usage as allowed at the time of the crawl. This use-case includes restrictive robots.txt files that only contain a general disallow directive for all robots (see: wildcard user agent *):

```text
User-agent: *
Disallow: /
```

In this case, the pipeline disregards the directive altogether and flags the item as ai_allow_usage: True, as long as no other AI user agents are mentioned in the robots.txt (a sketch of this decision logic follows below).
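
A hedged sketch of that decision logic. The function name, the user-agent list and the substring check are illustrative assumptions, not the literal implementation in robots_txt.py:

```python
from protego import Protego

# non-exhaustive, illustrative list of known AI crawler user agents
AI_USER_AGENTS = ["anthropic-ai", "Claude-Web", "GPTBot", "CCBot", "Google-Extended"]


def is_ai_usage_allowed(robots_txt: str, url: str) -> bool:
    """Return False only if a known AI user agent is explicitly named and disallowed for the URL."""
    parser = Protego.parse(robots_txt)
    robots_txt_lower = robots_txt.lower()
    for user_agent in AI_USER_AGENTS:
        # wildcard-only rules ("User-agent: *") are deliberately disregarded, so a user agent
        # must be explicitly mentioned in the robots.txt before its directives count here
        if user_agent.lower() in robots_txt_lower and not parser.can_fetch(url, user_agent):
            return False
    return True
```

With this check, the wildcard-only example above yields True, while the UBC excerpt (which explicitly disallows GPTBot and others) yields False.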

RobotsTxtPipeline: Cache / Performance

Crawling a big website or dataset (e.g., OERSI / SODIX) often involves thousands of URLs (and therefore thousands of items to be crawled). To minimize the performance impact of firing an additional HTTP GET request towards each individual item's https://<fully_qualified_domain_name>/robots.txt URL, some precautions were taken with regard to caching:
To guarantee that only a single HTTP request per fully qualified domain name is made, the robots_txt.py utility uses an LRU cache of size 512. Once a robots.txt file has been fetched and parsed, the cached str value is readily available for subsequent lookups and the performance hit should be minuscule (see the sketch below).
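
A minimal sketch of this caching approach, assuming requests is used for the fetch; the helper names are illustrative and may differ from what robots_txt.py actually does:

```python
from functools import lru_cache

import requests
import tldextract
from protego import Protego


@lru_cache(maxsize=512)
def fetch_robots_txt(fqdn: str) -> str:
    """Fetch robots.txt once per fully qualified domain name; repeated calls hit the cache."""
    response = requests.get(f"https://{fqdn}/robots.txt", timeout=30)
    return response.text if response.status_code == 200 else ""


def robots_txt_parser_for(url: str) -> Protego:
    """Disassemble the item URL with tldextract and return a parser for its robots.txt."""
    fqdn = tldextract.extract(url).fqdn  # e.g. "www.example.org"
    return Protego.parse(fetch_robots_txt(fqdn))
```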

- to make debugging easier, I collected a few URLs of robots.txt files which either generally allow or disallow AI scraping
- since these tests fire real HTTP requests, they are skipped by default and should not be enabled in CI/CD pipelines
  - if you want to test the function calls, comment out the "pytest.mark.skip" decorator of the test and run it within PyCharm / your IDE of choice (see the sketch after these notes)
….txt with protego

- for our use-case we assume that wildcard user agents which disallow all robots from parsing a site need to be disregarded before trying to determine whether AI usage is allowed (or not)
- change: removed the "ids" parameter from the pytest.mark.parametrize decorator because it made the pytest log less readable than initially expected
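
A hypothetical sketch of that opt-in test pattern; the test name, URLs and expected values are placeholders, not the actual test suite:

```python
import pytest
import requests
from protego import Protego


@pytest.mark.skip(reason="fires real HTTP requests; comment this decorator out to run the test locally")
@pytest.mark.parametrize(
    "robots_txt_url, expected_ai_allow_usage",
    [
        ("https://ai-friendly.example.org/robots.txt", True),
        ("https://ai-blocking.example.net/robots.txt", False),
    ],
)
def test_ai_allow_usage(robots_txt_url: str, expected_ai_allow_usage: bool):
    robots_txt = requests.get(robots_txt_url, timeout=30).text
    parser = Protego.parse(robots_txt)
    # simplified check against a single known AI user agent
    assert parser.can_fetch(robots_txt_url, "GPTBot") == expected_ai_allow_usage
```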
@Criamos Criamos self-assigned this Dec 6, 2024
@Criamos Criamos added the "enhancement" (New feature or request) and "dependencies" (Pull requests that update a dependency file) labels Dec 6, 2024
@Criamos Criamos merged commit e733041 into upgrade_to_scrapy_212 Dec 6, 2024
2 checks passed
@Criamos Criamos deleted the feat_allow_ai_usage branch December 6, 2024 12:14