User Agent Configuration
The configuration of the user agent in StormCrawler serves two purposes:
- Identification of the crawler for webmasters
- Selection of rules from robots.txt
The politeness of a web crawler is not just a matter of how frequently it fetches pages from a site, but also of how it identifies itself to the sites it crawls. This is done by setting the HTTP `User-Agent` header, just like your web browser does.
The full user agent string is built by concatenating the following configuration elements:
- `http.agent.name`: name of your crawler
- `http.agent.version`: version of your crawler
- `http.agent.description`: description of what it does
- `http.agent.url`: URL webmasters can visit to learn about it
- `http.agent.email`: an email address so that they can get in touch with you
StormCrawler used to provide default values for these, but since version 2.11 it no longer does, and you are now required to provide them yourself.
You can also specify the full user agent string verbatim with the `http.agent` configuration, but you will still need to provide an `http.agent.name` for parsing robots.txt files.
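As a minimal sketch, these values go in the crawler's configuration file (e.g. `crawler-conf.yaml`); the agent name, URL and email below are placeholders, not defaults:

```yaml
config:
  # Identification of the crawler: these values are combined into the
  # User-Agent header sent with every request.
  http.agent.name: "my-crawler"
  http.agent.version: "1.0"
  http.agent.description: "research crawler for example.org"
  http.agent.url: "https://www.example.org/crawler"
  http.agent.email: "crawler@example.org"

  # Alternatively, set the whole string verbatim; http.agent.name is
  # still required for robots.txt matching. The format shown here is
  # only an illustrative placeholder.
  # http.agent: "my-crawler/1.0 (https://www.example.org/crawler; crawler@example.org)"
```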
The second purpose relates to the Robots Exclusion Protocol, also known as the robots.txt protocol, which is formalised in RFC 9309. Part of what the robots directives do is define rules specifying which parts of a website (if any) are allowed to be crawled. The rules are organised by `User-Agent`, with `*` matching any agent not otherwise specified explicitly, e.g.
```
User-Agent: *
Disallow: *.gif$
Disallow: /example/
Allow: /publications/
```
In the example above, the rules allow access to URLs with the /publications/ path prefix, and restrict access to URLs with the /example/ path prefix and to all URLs with a .gif suffix. The `*` character designates zero or more instances of any character, including the otherwise-required forward slash, and `$` designates the end of the match pattern.
The value of `http.agent.name` is what StormCrawler looks for in the robots.txt. It MUST contain only uppercase and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-").
Unless you are running a well-known web crawler, it is unlikely that its agent name will be listed explicitly in robots.txt files (if it is, well, congratulations!). While you want the agent name to reflect who your crawler is, you might also want to follow rules set for better-known crawlers. For instance, if you were a responsible AI company crawling the web to build a dataset to train LLMs, you would want to follow the rules set for Google-Extended (see the list of Google crawlers) if any were found.
This is what the `http.robots.agents` configuration allows you to do. It takes a comma-separated string but can also be given as a list of values. By setting it alongside `http.agent.name` (which should also be the first value it contains), you can broaden the rule matching based on the purpose of your crawler as well as its identity.
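As a sketch, reusing the hypothetical agent name `my-crawler` from the earlier example, the configuration could look like this:

```yaml
config:
  http.agent.name: "my-crawler"
  # Comma-separated list of agent names to match in robots.txt:
  # the crawler's own name first, then broader tokens such as Google-Extended.
  http.robots.agents: "my-crawler,Google-Extended"
```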