feat: parse robots.txt for AI usage indicators ("ccm:ai_allow_usage") #120
Merged
Conversation
- to make debugging easier, I collected a few URLs to `robots.txt` files which either generally allow or disallow AI scraping
- since these tests fire HTTP requests, they are skipped by default and should not be enabled in CI/CD pipelines (see the sketch after this list)
- if you want to test the function calls, comment out the `pytest.mark.skip` decorator of the test and run it within PyCharm / your IDE of choice
- …e used in AI training
- …g backend (-> "ccm:ai_allow_usage")
- …rawlers like OERSI or SODIX)
- ….txt with protego - for our use case, we assume that wildcard user agents which disallow all robots from parsing a site need to be disregarded before trying to determine if AI usage is allowed (or not)
- change: removed the "id" parameter from the `pytest.mark.parametrize` decorator because it made the pytest log less readable than initially expected
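As a hedged illustration of the testing approach described above (the file layout, URL list, and test name are assumptions, not the repository's actual code), a skipped, parametrized test could look like this:

```python
# Illustrative sketch only: a network-dependent test that is skipped by default
# so it never runs in CI/CD; comment out the skip marker to run it locally.
import pytest
import requests  # assumption: any HTTP client would do

# hypothetical sample URLs; the real test collects its own list of robots.txt files
ROBOTS_TXT_URLS = [
    "https://www.example.org/robots.txt",
    "https://www.example.com/robots.txt",
]

@pytest.mark.skip(reason="fires live HTTP requests; do not enable in CI/CD pipelines")
@pytest.mark.parametrize("robots_url", ROBOTS_TXT_URLS)
def test_fetch_robots_txt(robots_url):
    response = requests.get(robots_url, timeout=10)
    assert response.status_code == 200
    assert "user-agent" in response.text.lower()
```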
Criamos added the enhancement and dependencies labels on Dec 6, 2024
This PR includes the following changes:

- added `BaseItem.ai_allow_usage`: a `bool`-value that indicates if a crawled item is allowed (or disallowed) for usage in AI training (see the sketch below this list)
- added a `RobotsTxtPipeline` to automatically fill up the `BaseItem.ai_allow_usage`-field during a crawl
- `LomEducationalItem.typicalLearningTime` is now synchronized with the edu-sharing backend (`cclom:typicallearningtime`)
- added `loguru`, `protego` and `tldextract` to the list of dependencies; `protego` parses `robots.txt`-files more efficiently than the Python built-in `RobotFileParser`
- (This PR closes https://edu-sharing.atlassian.net/browse/KDATAPORT-23)
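As a rough sketch of the first bullet point (only the field name `ai_allow_usage` is taken from this PR; the class layout and the other field are assumptions), the new item field might be declared like this:

```python
# Sketch of a Scrapy item carrying the new flag; not the project's actual BaseItem.
import scrapy

class BaseItem(scrapy.Item):
    sourceId = scrapy.Field()        # assumed pre-existing field, for illustration only
    ai_allow_usage = scrapy.Field()  # bool: may this item be used for AI training?
```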
Implementation details
The basic idea of the `ai_allow_usage`-field and its `bool`-value: while there is currently no strict standard that handles the question "is this <dataset / item> allowed to be used by AI?", the `robots.txt`-file of a website might at least offer some indicators that could become relevant in the future.
If a webmaster explicitly forbids some AI scrapers from interacting with their website (by declaring a `robots.txt`-directive for known AI user agents), we assume that all AI usage is generally disallowed for said items. Here's an excerpt of disallowed AI user agents (this example is from the "Department of Earth, Ocean and Atmospheric Sciences" faculty of the University of British Columbia):
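As an illustrative stand-in (the user-agent names are well-known AI crawlers, but the file and URLs below are made up, not the UBC excerpt), such directives can be parsed with `protego` like this:

```python
# Illustrative robots.txt that explicitly disallows known AI user agents,
# parsed with Protego.
from protego import Protego

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

rp = Protego.parse(robots_txt)
print(rp.can_fetch("https://example.org/any/page", "GPTBot"))        # False -> AI crawler is blocked
print(rp.can_fetch("https://example.org/any/page", "SomeOtherBot"))  # True  -> no rule applies to it
```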
When an item goes through our `RobotsTxtPipeline`, the pipeline compares the item's URL with the rules/directives of the parsed `robots.txt`-file and checks them against a list of known AI user agents (for further reading: "Which web crawlers are associated with AI crawlers?"). As soon as one AI scraper is forbidden, the pipeline will flag the item with `ai_usage_allowed: False`.
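One plausible way to express that decision (a sketch under assumptions; the actual pipeline code and its list of AI user agents may differ) is:

```python
# Sketch of the flagging logic: only user agents that are explicitly named in the
# robots.txt count; the wildcard "*" record is deliberately ignored.
from protego import Protego

# assumed, shortened subset of "known AI user agents"
KNOWN_AI_USER_AGENTS = ["GPTBot", "ChatGPT-User", "CCBot", "Google-Extended", "anthropic-ai"]

def ai_usage_allowed(robots_txt: str, item_url: str) -> bool:
    """Return False as soon as one explicitly named AI crawler is disallowed for item_url."""
    rp = Protego.parse(robots_txt)
    robots_txt_lower = robots_txt.lower()
    for user_agent in KNOWN_AI_USER_AGENTS:
        explicitly_mentioned = user_agent.lower() in robots_txt_lower
        if explicitly_mentioned and not rp.can_fetch(item_url, user_agent):
            return False
    # no known AI user agent is explicitly disallowed -> assume AI usage is allowed
    return True
```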
If there are no (known) AI user agents mentioned in the `robots.txt`-file, we need to assume that the webmaster has not made a conscious decision yet and therefore assume that AI usage was allowed at the time of the crawl. This use case includes restrictive `robots.txt`-files like a general `disallow`-directive for robots (see: wildcard user agent `*`):
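A minimal stand-in for such a file (assumed, not taken from the PR) shows why the wildcard rule has to be disregarded:

```python
# Wildcard-only robots.txt: every robot is disallowed, but no AI user agent is
# explicitly named. Protego alone would therefore report the URL as blocked.
from protego import Protego

wildcard_only = """\
User-agent: *
Disallow: /
"""

rp = Protego.parse(wildcard_only)
print(rp.can_fetch("https://example.org/some/page", "GPTBot"))  # False, but only via the "*" rule
# The pipeline disregards this wildcard rule and keeps the item flagged as allowed,
# because no known AI user agent is explicitly mentioned in the file.
```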
In this case, the pipeline would disregard the directive altogether and flag the item as `ai_usage_allowed: True`, as long as no other AI user agents are mentioned in the `robots.txt`.

`RobotsTxtPipeline`: Cache / Performance

Crawling a big website or dataset (e.g., OERSI / SODIX) often consists of thousands of URLs (and therefore items to be crawled). To minimize the performance impact of having to fire an additional HTTP `GET`-request towards each individual item's `https://<fully_qualified_domain_name>/robots.txt` URL, some precautions were taken with regard to caching:

To guarantee that only a single HTTP request per fully qualified domain name is made, the `robots_txt.py` utility uses an LRU cache of size 512. Once a `robots.txt`-file has been parsed, the cached `str`-value is readily available for subsequent requests, so the performance hit should be minuscule.
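A rough sketch of that caching idea (the helper names, HTTP client, and URL scheme are assumptions; the real `robots_txt.py` may differ):

```python
# One HTTP GET per fully qualified domain name, memoized with an LRU cache of size 512.
import functools

import requests   # assumption: any HTTP client would do
import tldextract

def fqdn_from_url(url: str) -> str:
    """Reduce an item URL to its fully qualified domain name."""
    parts = tldextract.extract(url)
    return ".".join(part for part in (parts.subdomain, parts.domain, parts.suffix) if part)

@functools.lru_cache(maxsize=512)
def fetch_robots_txt(fqdn: str) -> str:
    """Fetch (and cache) the robots.txt string for one fully qualified domain name."""
    response = requests.get(f"https://{fqdn}/robots.txt", timeout=30)
    return response.text if response.status_code == 200 else ""
```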