feat: parse robots.txt for AI usage indicators ("ccm:ai_allow_usage") #120

Merged
19 commits merged into upgrade_to_scrapy_212 on Dec 6, 2024

Conversation

@Criamos Criamos commented Dec 6, 2024

This PR includes the following changes:

  • feat: implemented a new Scrapy Field in BaseItem: ai_allow_usage (see the sketch below)
    • this metadata field stores a bool value that indicates whether a crawled item is allowed (or disallowed) to be used for AI training
  • feat: implemented a RobotsTxtPipeline to automatically fill in the BaseItem.ai_allow_usage field during a crawl
  • feat: connected LomEducationalItem.typicalLearningTime with the edu-sharing backend (cclom:typicallearningtime)
  • deps: added loguru, protego and tldextract to the list of dependencies
    • loguru: makes debugging (and logging handlers) a bit more pleasant to work with
    • protego: parses robots.txt files more efficiently than Python's built-in RobotFileParser
      • this package is used by Scrapy itself (and maintained by the Scrapy devs)
    • tldextract: helps us disassemble URLs into their structural parts
      • (this package is already part of the Scrapy dependencies)

(This PR closes https://edu-sharing.atlassian.net/browse/KDATAPORT-23)
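
For orientation, here is a minimal sketch of how such a field is declared on a Scrapy item. The surrounding fields and the exact placement inside BaseItem are assumptions, not the literal contents of items.py:

```python
import scrapy


class BaseItem(scrapy.Item):
    # ... existing metadata fields ...
    ai_allow_usage = scrapy.Field()
    # holds a bool: True if no known AI crawler is explicitly disallowed by robots.txt,
    # False as soon as at least one known AI user agent is disallowed for the item's URL
```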


Implementation details

The basic idea behind the ai_allow_usage field and its bool value:
While there is currently no strict standard that answers the question "is this <dataset / item> allowed to be used by AI?", the robots.txt file of a website might at least offer some indicators that could become relevant in the future.

If a webmaster explicitly forbids some AI scrapers from interacting with their website (by declaring a robots.txt directive for known AI user agents), we assume that all AI usage is generally disallowed for those items. Here's an excerpt of disallowed AI user agents (this example is from the "Department of Earth, Ocean and Atmospheric Sciences" at the University of British Columbia):

```text
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: GPTBot
Disallow: /
```

When an item goes through our RobotsTxtPipeline, the pipeline compares the item's URL with the rules/directives of the parsed robots.txt file and checks them against a list of known AI user agents (for further reading: "Which web crawlers are associated with AI crawlers?"). As soon as a single AI scraper is disallowed, the pipeline will flag the item with ai_allow_usage: False (see the snippet below).
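
As an illustration of how protego evaluates such a rule set (the URL below is a placeholder, not an actual crawled item):

```python
# Illustration only: evaluating one of the rules from the excerpt above with protego.
from protego import Protego

robots_txt = """
User-agent: GPTBot
Disallow: /
"""

parser = Protego.parse(robots_txt)
# the placeholder URL stands in for any item URL on that domain
print(parser.can_fetch("https://www.example.org/some/item", "GPTBot"))  # -> False
# a single disallowed AI user agent is enough to flag the item with ai_allow_usage: False
```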

If no (known) AI user agents are mentioned in the robots.txt file, we have to assume that the webmaster has not yet made a conscious decision about AI usage and therefore treat AI usage as allowed at the time of the crawl. This use-case includes restrictive robots.txt files that only contain a general disallow directive for all robots (see: wildcard user agent *):

```text
User-agent: *
Disallow: /
```

In this case, the pipeline disregards the directive altogether and flags the item as ai_allow_usage: True, as long as no other AI user agents are mentioned in the robots.txt (a sketch of this decision logic follows below).
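
A hedged sketch of that decision logic. The function name, the user-agent list and the substring check are illustrative assumptions, not the literal implementation in robots_txt.py:

```python
from protego import Protego

# non-exhaustive, illustrative list of known AI crawler user agents
AI_USER_AGENTS = ["anthropic-ai", "Claude-Web", "GPTBot", "CCBot", "Google-Extended"]


def is_ai_usage_allowed(robots_txt: str, url: str) -> bool:
    """Return False only if a known AI user agent is explicitly named and disallowed for the URL."""
    parser = Protego.parse(robots_txt)
    robots_txt_lower = robots_txt.lower()
    for user_agent in AI_USER_AGENTS:
        # wildcard-only rules ("User-agent: *") are deliberately disregarded, so a user agent
        # must be explicitly mentioned in the robots.txt before its directives count here
        if user_agent.lower() in robots_txt_lower and not parser.can_fetch(url, user_agent):
            return False
    return True
```

With this check, the wildcard-only example above yields True, while the UBC excerpt (which explicitly disallows GPTBot and others) yields False.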

RobotsTxtPipeline: Cache / Performance

Crawling a big website or dataset (e.g., OERSI / SODIX) often involves thousands of URLs (and therefore thousands of items to be crawled). To minimize the performance impact of firing an additional HTTP GET request towards each individual item's https://<fully_qualified_domain_name>/robots.txt URL, some precautions were taken with regard to caching:
To guarantee that only a single HTTP request per fully qualified domain name is made, the robots_txt.py utility uses an LRU cache of size 512. Once a robots.txt file has been fetched and parsed, the cached str value is readily available for subsequent lookups and the performance hit should be minuscule (see the sketch below).
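
A minimal sketch of this caching approach, assuming requests is used for the fetch; the helper names are illustrative and may differ from what robots_txt.py actually does:

```python
from functools import lru_cache

import requests
import tldextract
from protego import Protego


@lru_cache(maxsize=512)
def fetch_robots_txt(fqdn: str) -> str:
    """Fetch robots.txt once per fully qualified domain name; repeated calls hit the cache."""
    response = requests.get(f"https://{fqdn}/robots.txt", timeout=30)
    return response.text if response.status_code == 200 else ""


def robots_txt_parser_for(url: str) -> Protego:
    """Disassemble the item URL with tldextract and return a parser for its robots.txt."""
    fqdn = tldextract.extract(url).fqdn  # e.g. "www.example.org"
    return Protego.parse(fetch_robots_txt(fqdn))
```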

- to make debugging easier, I collected a few URLs of robots.txt files which either generally allow or disallow AI scraping
- since these tests fire real HTTP requests, they are skipped by default and should not be enabled in CI/CD pipelines
  - if you want to test the function calls, comment out the "pytest.mark.skip" decorator of the test and run it within PyCharm / your IDE of choice (see the sketch after these notes)
….txt with protego

- for our use-case we assume that wildcard user agents which disallow all robots from parsing a site need to be disregarded before trying to determine whether AI usage is allowed (or not)
- change: removed the "ids" parameter from the pytest.mark.parametrize decorator because it made the pytest log less readable than initially expected
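
A hypothetical sketch of that opt-in test pattern; the test name, URLs and expected values are placeholders, not the actual test suite:

```python
import pytest
import requests
from protego import Protego


@pytest.mark.skip(reason="fires real HTTP requests; comment this decorator out to run the test locally")
@pytest.mark.parametrize(
    "robots_txt_url, expected_ai_allow_usage",
    [
        ("https://ai-friendly.example.org/robots.txt", True),
        ("https://ai-blocking.example.net/robots.txt", False),
    ],
)
def test_ai_allow_usage(robots_txt_url: str, expected_ai_allow_usage: bool):
    robots_txt = requests.get(robots_txt_url, timeout=30).text
    parser = Protego.parse(robots_txt)
    # simplified check against a single known AI user agent
    assert parser.can_fetch(robots_txt_url, "GPTBot") == expected_ai_allow_usage
```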
@Criamos Criamos self-assigned this Dec 6, 2024
@Criamos Criamos added the "enhancement" (New feature or request) and "dependencies" (Pull requests that update a dependency file) labels Dec 6, 2024
@Criamos Criamos merged commit e733041 into upgrade_to_scrapy_212 Dec 6, 2024
2 checks passed
@Criamos Criamos deleted the feat_allow_ai_usage branch December 6, 2024 12:14