Merge pull request #17 from corbindavenport/main

More bot information and improved README
ai-robots-txt · Jun 22, 2024 · fd4ade5
2 parents 4163ca9 + fe98e41
commit fd4ade5

Showing 2 changed files with 22 additions and 18 deletions.
diff --git a/README.md b/README.md
@@ -2,18 +2,22 @@
 
 <img src="/assets/images/noai-logo.png" width="100" />
 
-**[Subscribe to updates via RSS/Atom by clicking on this link.](https://github.com/ai-robots-txt/ai.robots.txt/releases.atom)**
-
-_(Or paste the link into your preferred feed reader.)_
-
----
-
 This is an open list of web crawlers associated with AI companies and the training of LLMs to block. We encourage you to contribute to and implement this list on your own site.
 
 A number of these crawlers have been sourced from [Dark Visitors](https://darkvisitors.com) and we appreciate the ongoing effort they put in to track these crawlers. 
 
 If you'd like to add information about a crawler to the list, please make a pull request with the bot name added to `robots.txt`, `ai.txt`, and any relevant details in `table-of-bot-metrics.md` to help people understand what's crawling.
 
+## Subscribe to updates
+
+You can subscribe to list updates via RSS/Atom with the releases feed:
+
+```
+https://github.com/ai-robots-txt/ai.robots.txt/releases.atom
+```
+
+You can subscribe with [Feedly](https://feedly.com/i/subscription/feed/https://github.com/ai-robots-txt/ai.robots.txt/releases.atom), [Inoreader](https://www.inoreader.com/?add_feed=https://github.com/ai-robots-txt/ai.robots.txt/releases.atom), [The Old Reader](https://theoldreader.com/feeds/subscribe?url=https://github.com/ai-robots-txt/ai.robots.txt/releases.atom), [Feedbin](https://feedbin.me/?subscribe=https://github.com/ai-robots-txt/ai.robots.txt/releases.atom), or any other reader app.
+
 ## Additional resources
 
 - [Blocking Bots with Nginx](https://rknight.me/blog/blocking-bots-with-nginx/) by Robb Knight

diff --git a/table-of-bot-metrics.md b/table-of-bot-metrics.md
@@ -3,22 +3,22 @@
 | AdsBot-Google   | Google  | Yes (Exceptions for Dynamic Search Ads) | Analyzes website content for ad relevancy, improves ad serving for Google Ads. Data anonymized according to [Google's Privacy Policy](https://policies.google.com/privacy). Unclear on data retention or use by other products. | Varies depending on campaign activity and website updates. Crawls optimized to minimize impact, specific frequency not public. | Web crawler by Google Ads to analyze websites for ad effectiveness and ensure ad relevancy to webpage content. |
 |Amazonbot      | Amazon | Yes | Service improvement and enabling answers for Alexa users. | No information provided. | Includes references to crawled website when surfacing answers via Alexa; does not clearly outline other uses. |
 |anthropic-ai  | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
-|Applebot-Extended   |    [Apple](https://support.apple.com/en-us/119829#datausage)    |          Yes           |          |                  |      Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.       |
+|Applebot-Extended   | [Apple](https://support.apple.com/en-us/119829#datausage) | Yes           | Powers features in Siri, Spotlight, Safari, Apple Intelligence, and others. | Unclear at this time. | Apple has a secondary user agent, Applebot-Extended ... [that is] used to train Apple's foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.       |
 |Bytespider    | ByteDance | No | LLM training. | Unclear at this time. | Downloads data to train LLMS, including ChatGPT competitors. |
 |CCBot         | [Common Crawl](https://commoncrawl.org) | [Yes](https://commoncrawl.org/ccbot) | Provides crawl data for an open source repository that has been used to train LLMs. | Unclear at this time. | Sources data that is made openly available and is used to train AI models. |
 |ChatGPT-User   | [OpenAI](https://openai.com) | Yes | Takes action based on user prompts. | Only when prompted by a user. | Used by plugins in ChatGPT to answer queries based on user input. |
 |ClaudeBot      | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 |Claude-Web | [Anthropic](https://www.anthropic.com) | Unclear at this time. | Scrapes data to train Anthropic's AI products. | No information provided. | Scrapes data to train LLMs and AI products offered by Anthropic. |
 |cohere-ai | [Cohere](https://cohere.com) | Unclear at this time. | Retrieves data to provide responses to user-initiated prompts. | Takes action based on user prompts. | Retrieves data based on user prompts. |
 |Diffbot | [Diffbot](https://www.diffbot.com/) | At the discretion of Diffbot users. | Aggregates structured web data for monitoring and AI model training. | Unclear at this time. | Diffbot is an application used to parse web pages into structured data; this data is used for monitoring or AI model training. |
-|FacebookBot    |         |                       |          |                  |             |
-|Google-Extended|         |                       |          |                  |             |
-|GoogleOther    |         |                       |          |                  |             |
-|GPTBot        | [OpenAI](https://openai.com) | Yes | Scrapes data to train OpenAI's products. | No information provided. | Data is used to train current and future models, removed paywalled data, PII and data that violates the company's policies. |
-| img2dataset |         |                       |          |                  |             |
-|omgili        |         |                       |          |                  |             |
-|omgilibot     |         |                       |          |                  |             |
-|peer39_crawler|         |                       |          |                  |             |
-|peer39_crawler/1.0|         |                       |          |                  |             |
-|PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [Yes](https://docs.perplexity.ai/docs/perplexitybot) | Used to answer queries at the request of users. | Takes action based on user prompts.  | Operated by Perplexity to obtain results in response to user queries. |
-|YouBot        |         |                       |          |                  |             |
+|FacebookBot    | Meta/Facebook | [Yes](https://developers.facebook.com/docs/sharing/bot/) | Training language models | Up to 1 page per second | Officially used for training Meta "speech recognition technology," unknown if used to train Meta AI specifically. |
+|Google-Extended| Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | LLM training. | No information | Used to train Gemini and Vertex AI generative APIs. Does not impact a site's inclusion or ranking in Google Search. |
+|GoogleOther    | Google | [Yes](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) | Scrapes data. | No information | "Used by various product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development." |
+|GPTBot        | [OpenAI](https://openai.com) | Yes | Scrapes data to train OpenAI's products. | No information | Data is used to train current and future models, removed paywalled data, PII and data that violates the company's policies. |
+| img2dataset | [img2dataset](https://github.com/rom1504/img2dataset) | At the discretion of img2dataset users. | Scrapes images for use in LLMs. | At the discretion of img2dataset users. | Downloads large sets of images into datasets for LLM training or other purposes. |
+|omgili        | [Webz.io](https://webz.io/) | [Yes](https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/) | Data is sold. | No information | Crawls sites for APIs used by Hootsuite, Sprinklr, NetBase, and other companies. Data also sold for research purposes or LLM training. |
+|omgilibot     | [Webz.io](https://webz.io/) | [Yes](https://web.archive.org/web/20170704003301/http://omgili.com/Crawler.html) | Data is sold. | No information | Legacy user agent initially used for Omgili search engine. Unknown if still used, `omgili` agent still used by Webz.io. |
+|peer39_crawler| [Peer39](https://www.peer39.com/) | [Yes](https://www.peer39.com/crawler-notice) | Targeted advertising. | No information | Web crawler used to "enhance the visibility of your site to advertisers who value and seek out such quality content." |
+|peer39_crawler| [Peer39](https://www.peer39.com/) | [Yes](https://www.peer39.com/crawler-notice) | Targeted advertising. | No information | Web crawler used to "enhance the visibility of your site to advertisers who value and seek out such quality content." |
+|PerplexityBot | [Perplexity](https://www.perplexity.ai/) | [No](https://www.macstories.net/stories/wired-confirms-perplexity-is-bypassing-efforts-by-websites-to-block-its-web-crawler/) | Used to answer queries at the request of users. | Takes action based on user prompts.  | Operated by Perplexity to obtain results in response to user queries. |
+|YouBot        | [You](https://about.you.com/youchat/) | [Yes](https://about.you.com/youbot/) | Scrapes data for search engine and LLMs. | No information | Retrieves data used for You.com web search engine and LLMs. |
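
The README text in this diff asks contributors to add each bot name to `robots.txt`. As a rough illustration of how a site owner might turn the user agents touched by this change into a blocking rule, here is a minimal Python sketch. The helper `build_robots_txt` and the `AI_USER_AGENTS` list are hypothetical and are not part of the repository; the agent names are copied from the table rows above. It assumes one group with several `User-agent` lines followed by a single `Disallow: /`, which only affects crawlers that honor `robots.txt` (the table itself marks some, such as Bytespider and PerplexityBot, as not respecting it).

```python
# Hypothetical helper for illustration only; not part of ai.robots.txt.
# Agent names are taken from the rows changed in table-of-bot-metrics.md.
AI_USER_AGENTS = [
    "Applebot-Extended",
    "FacebookBot",
    "Google-Extended",
    "GoogleOther",
    "GPTBot",
    "img2dataset",
    "omgili",
    "omgilibot",
    "peer39_crawler",
    "PerplexityBot",
    "YouBot",
]


def build_robots_txt(user_agents):
    """Return one robots.txt group that disallows all paths for the given agents."""
    seen = set()
    lines = []
    for agent in user_agents:
        # Skip repeated names so each agent appears once in the group.
        if agent in seen:
            continue
        seen.add(agent)
        lines.append(f"User-agent: {agent}")
    # A single rule applies to every User-agent line listed above it.
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(AI_USER_AGENTS), end="")
```

Grouping the agents under one rule keeps the output short; emitting a separate group per agent would be equally valid if per-bot rules are needed later.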