[DOCS] Document how to deal with bots on live sites #2286

Open
joshdentremont opened this issue Feb 20, 2024 · 7 comments
Labels
Type: documentation provides documentation or asks for documentation.

Comments

@joshdentremont
Contributor

We should write up some docs about how to block bots from crawling a live site that is set up with Docker. This has come up a few times in Slack and it would be good to have something to explain how to deal with it.

One option some of us have been using is to edit drupal.defaults.conf to return a 403 based on user agent. I have done this by adding the following to my Dockerfile, but you could also mount the conf file and edit it manually:

# block bots in nginx
RUN echo -e '\n\
if ($http_user_agent ~ (Bytespider|Sogou|SemrushBot|AcademicBotRTU|PetalBot|GPTBot|DataForSeoBot|test-bot) ) { \n\
    return 403; \n\
}'\
>> /etc/nginx/shared/drupal.defaults.conf

It would also be nice to document how to block by IP address using Docker.
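
For blocking by IP, one possible approach is nginx's `deny` directive, appended to the same conf file the same way as the user agent rule above. The sketch below only prints the directives. `build_deny_rules` is a hypothetical helper name and the addresses are documentation-range placeholders, not real offenders. It also assumes nginx sees the real client address; if the container sits behind a reverse proxy, nginx may only see the proxy's address, and the rules would need real-IP configuration first.

```shell
# Sketch: print one nginx "deny" directive per address or CIDR range.
# build_deny_rules is a hypothetical helper, not part of any existing image.
build_deny_rules() {
    for addr in "$@"; do
        printf 'deny %s;\n' "$addr"
    done
}

# Placeholder addresses from the documentation ranges (RFC 5737).
build_deny_rules 203.0.113.0/24 198.51.100.7
```

The output could be redirected into a mounted conf file or appended during the image build, in the same way as the user agent block.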

Related, but possibly a separate issue, is that bots are getting stuck looping over facets. I'm seeing this on my site with legit bots as well, like bingbot. If there is a way to prevent this we should document that as well.

@mjordan
Contributor

mjordan commented Feb 20, 2024

bots are getting stuck looping over facets

We've experienced this as well and it's brought our site to its knees.

@ajstanley
Contributor

Same. TikTok ignores robots.txt. We have one site that was getting several hits per second before we put a user agent filter in place.

@Natkeeran
Contributor

@joshdentremont
Contributor Author

Suggestions from tech call below:

Blocking bots by user agent:

  • add user agents as an env variable
  • mount drupal.defaults.conf as a volume so changes persist
    • can you add a secondary file to drupal.defaults.conf and mount that instead?
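
The env variable idea above could be sketched roughly as follows: a small shell function turns a comma-separated list into the same `if` block shown earlier in this issue. `BLOCKED_USER_AGENTS` and `build_ua_block` are hypothetical names, not existing settings, and the sketch assumes the agent names contain no spaces or regex metacharacters.

```shell
# Sketch: build the nginx user agent rule from a comma-separated list.
# BLOCKED_USER_AGENTS and build_ua_block are hypothetical names.
build_ua_block() {
    # Turn "Bytespider,Sogou" into the regex alternation "Bytespider|Sogou".
    pattern=$(printf '%s' "$1" | tr ',' '|')
    printf 'if ($http_user_agent ~ (%s)) {\n    return 403;\n}\n' "$pattern"
}

# Use the env variable if set, otherwise fall back to a short example list.
build_ua_block "${BLOCKED_USER_AGENTS:-Bytespider,Sogou,SemrushBot}"
```

Something like this could run at container start-up, with its output written to the mounted conf file, so the list can change without rebuilding the image.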

Stopping legit bots from crawling facets:

  • ignore query params in robots.txt
  • block collections and search pages in robots.txt, and instead use a sitemap (Simple Sitemap was a suggested Drupal module)
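
A rough sketch of what those robots.txt suggestions might look like, printed by a shell function so it can be redirected into a file. `print_robots_rules` is a hypothetical helper, `/search` is a placeholder for whatever search and collection paths a site actually uses, and the sitemap location is an assumption. The `*` wildcard is honoured by the major crawlers, but bots that ignore robots.txt will not be affected by any of this.

```shell
# Sketch: print robots.txt rules that keep well-behaved crawlers off facet URLs.
# print_robots_rules is a hypothetical helper; $1 is the site's base URL.
print_robots_rules() {
    cat <<EOF
User-agent: *
# Skip any URL that carries a query string (facets, sorts, pagers).
Disallow: /*?
# Placeholder path: replace with the site's own search and collection routes.
Disallow: /search
# Assumed sitemap location; check where the sitemap module publishes it.
Sitemap: $1/sitemap.xml
EOF
}

print_robots_rules "https://example.org"
```

Blocking every query string is blunt, since it also hides paged listings, which is why a sitemap is suggested as the way for crawlers to reach the objects themselves.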

Remaining questions:

  • How do we block by IP in Docker?
  • How do we update robots.txt? Should we supply a default or just document how to change it?

@ajstanley
Contributor

Nginx allows for multiple conf files. We could add an include in nginx.conf to point to a file in /var/www/drupal which would eliminate the need for a separate mount.
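
As a sketch of the include idea, the function below appends an `include` line to a conf file only if it is not already there, so it is safe to run on every container start. `add_include` and the `bot-rules.conf` file name are hypothetical. The included file has to be pulled in from a context where its directives are valid (an `if` on `$http_user_agent` belongs inside a server block), so simply appending to the end of nginx.conf would likely not work; the demo uses a temporary file instead of a real nginx config.

```shell
# Sketch: add an nginx include line to a conf file exactly once.
# add_include and bot-rules.conf are hypothetical names.
add_include() {
    conf=$1
    snippet=$2
    line="include $snippet;"
    # Append only when the exact line is missing, to avoid duplicates.
    grep -qxF "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
}

# Demo against a temporary file rather than a real nginx config.
demo_conf=$(mktemp)
add_include "$demo_conf" /var/www/drupal/bot-rules.conf
add_include "$demo_conf" /var/www/drupal/bot-rules.conf
cat "$demo_conf"   # the include line appears once
rm -f "$demo_conf"
```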

@ysuarez ysuarez added the Type: documentation provides documentation or asks for documentation. label Mar 13, 2024
@kayakr
Contributor

kayakr commented Apr 10, 2024

FWIW, I've found the patch for facets at https://www.drupal.org/node/2937191 useful. It renders the facets as actual checkboxes, instead of the default behaviour of rendering them as links (followable by bots) that JS then converts to checkboxes.

@joshdentremont
Contributor Author

@kayakr that would be awesome if we could get that patched into the facets module. I really like the checkboxes for facets but am having the same issue with bots.


6 participants