# robots.txt
# https://owly.fans/robots.txt
# Page is under FOPL-ZERO
# https://owly.fans/license/fopl-zero
# Want to know when this page is updated? Follow the RSS!
# https://owly.fans/rss/robots.xml
# This page is also on git:
# Please feel free to suggest a change to this file
# https://github.com/DynTylluan/owly.fans/blob/main/robots.txt (main)
# https://notabug.org/DynTylluan/owly.fans/src/main/robots.txt (mirror)
# https://tildegit.org/cass/owly.fans/src/branch/main/robots.txt (mirror)
# AdIdxBot (Bing/Microsoft)
# Used by Bing Ads for «quality control» for, you guessed it, ads!
# https://bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0
User-Agent: AdIdxBot
Disallow: /
# Amazon
# Used for Alexa - does not cite sources.
# https://developer.amazon.com/support/amazonbot
User-Agent: Amazonbot
Disallow: /
# Anthropic
# «AI research and products that put safety at the
# frontier» - no thanks!
# https://anthropic.com
User-agent: anthropic-ai
Disallow: /
# Apple's bot
# Used for Siri and Spotlight Suggestions. Does not cite its sources.
# https://support.apple.com/en-us/HT204683
User-Agent: Applebot
User-Agent: AppleNewsBot
Disallow: /
# Claude
# Used to help AI.
# https://claude.ai
User-agent: Claude-Web
Disallow: /
# Cohere
# Stupid bot used to help artificial intelligence.
# https://cohere.com
User-agent: cohere-ai
Disallow: /
# Common Crawl bot
# From what I've learnt from reading online, «[t]he majority of ChatGPT's
# training data comes from the Common Crawl crawler bot».
# Quote: https://datadome.co/threat-research/how-chatgpt-openai-might-use-your-content-now-in-the-future
# https://commoncrawl.org/faq
User-agent: CCBot
Disallow: /
# EmailSiphon
# Email harvesting bot, get outta here!
User-agent: EmailSiphon
Disallow: /
# FacebookBot
# Used to «improve language models for our speech recognition technology»,
# so more AI rubbish that I don't want from a company that I don't like.
User-Agent: FacebookBot
User-Agent: meta-externalagent
Disallow: /
# Google's AdSense/StoreBot bots
# Go away - I don't want you here.
# https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
User-Agent: AdsBot-Google-Mobile
User-Agent: Storebot-Google
User-Agent: AdsBot-Google
User-agent: Mediapartners-Google
Disallow: /
# Google's AI training bot
# I don't want to have my content used without giving me credit.
# https://blog.google/technology/ai/an-update-on-web-publisher-controls
User-agent: Google-Extended
Disallow: /
# GPTBot is OpenAI's web crawler
# I do not want it to use my content as it does not cite its sources.
# https://platform.openai.com/docs/gptbot
User-agent: GPTBot
User-agent: ChatGPT-User
Disallow: /
# ia_archiver
# In the past, the Internet Archive (https://archive.org) had a bot
# called ia_archiver that you could allow (or disallow) to index your
# website. Sometime around 2017, this crawler stopped obeying robots.txt,
# so listing ia_archiver here and setting it to «Disallow: /» will not
# work. If you really want your website to not be seen on this archive,
# you will need to E-Mail [email protected].
# https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives
# https://web.archive.org/web/20150322111536/http://archive.org/about/exclude.php
# https://blog.reputationx.com/block-wayback-machine
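# For reference, a block for it would have looked like the sketch below.
# It is left commented out here because, as noted above, the Internet
# Archive no longer honours it anyway.
# User-agent: ia_archiver
# Disallow: /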
# Magpie Crawler
# Used to download pages so that they can be «indexed and analysed by our
# system». I don't want to use your service!
User-agent: magpie-crawler
Disallow: /
# MindSpider
# According to cdc.gov, this bot «appears to be ill-behaved», so it is
# being blocked here.
# https://www.cdc.gov/robots.txt
User-agent: MindSpider
Disallow: /
# Pinterest crawler
# I don't want my website to be used to «collect rich metadata like the
# price, description and availability of your products» or however they
# want to word that rubbish.
# https://help.pinterest.com/en/business/article/pinterest-crawler
User-agent: Pinterestbot
Disallow: /
# This is where I copy Bytemoth
# I am taking it in good faith that Bytemoth knows what he's doing and that
# the following bots are for marketers - so I am blocking them.
# Taken from the following URL: http://bytemoth.nfshost.com/robots.txt
User-agent: adsbot
User-agent: AhrefsBot
User-agent: BLEXBot
User-agent: dotbot
User-agent: Pandalytics
User-agent: SemrushBot
User-agent: SemrushBot-BA
User-agent: SemrushBot-BM
User-agent: SemrushBot-CT
User-agent: SemrushBot-SA
User-agent: SemrushBot-SI
User-agent: SemrushBot-SWA
User-agent: serpstatbot
User-agent: MTRobot
User-agent: PageThing
Disallow: /
# Bots that you might like to block
# There are a few bots that I don't really mind indexing my website, but
# not everyone might be okay with them, so if you copy this file to your
# own website and want to block one of them, simply remove the «#» before
# its «User-agent» and «Disallow» lines.
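# For example, an inactive pair like «# User-agent: ExampleBot» and
# «# Disallow: /» becomes active once the leading «#» is removed, i.e.
# «User-agent: ExampleBot» followed by «Disallow: /». (ExampleBot is just
# a made-up name for illustration.)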
# DuckDuckGo
# The search engine uses the following bot to index sites.
# https://duckduckgo.com/duckduckbot
# https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot
# https://duckduckgo.com/duckduckgo-help-pages/results/sources
# User-agent: DuckDuckBot
# Disallow: /
# Other Google bots
# Don't like Google? Unmark these.
# https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
# User-agent: Googlebot
# User-agent: Googlebot-News
# User-agent: Googlebot-Image
# User-agent: Googlebot-Video
# User-agent: Storebot-Google
# User-agent: Google-InspectionTool
# User-agent: GoogleOther
# User-agent: Google-Extended
# User-agent: APIs-Google
# User-agent: AdsBot-Google-Mobile
# User-agent: AdsBot-Google
# User-agent: Mediapartners-Google
# User-agent: FeedFetcher-Google
# User-agent: GoogleProducer
# User-agent: GoogleProducer; (+http://goo.gl/7y4SX)
# User-agent: google-speakr
# Disallow: /
# Don't allow any bots
# If you don't want *any* good bots browsing your website, then unmark
# this. Please note that there are still some bots that don't obey
# robots.txt at all, like ia_archiver.
# User-agent: *
# Disallow: /
# Allow all bots
# If you do, however, want to allow all bots, you should unmark the
# following. You can also just *not* have a robots.txt and it'll
# work all the same.
# User-agent: *
# Allow: /