RSS - scrapers and bots issue #163

Open
lunaraurora opened this issue Oct 15, 2021 · 3 comments

@lunaraurora

My logs show a lot of requests like these:
HTTP/1.1" 200 26701 "http://www.google.co.uk/url?sa=t&source=web&cd=1" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0"

GET /rss.xml HTTP/1.1" 200 375883 "http://www.google.co.uk/url?sa=t&source=web&cd=1" "PHP/7.4"

They crawl every post on my site every day, producing fake stats, consuming resources, and ruining SEO.
I believe these requests come from some sort of scraping program. My XML RSS feed is the most abused part of the site, as scrapers and bots try to crawl its contents each day. I found that a program called 'Full-Text RSS' from fivefilters[.]org can do exactly this:

As I said, XML RSS feeds are the part most abused by scrapers, bots, and fake bots.
I have Ultimate Bad Bot Blocker, CSF firewall, and fail2ban configured. Each day I try to harden the configuration, but I always get these bloodsuckers in my logs. Please help me stop them.
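
To see which clients are behind requests like the ones quoted above, a rough Python sketch along these lines could tally user agents that fetch the feed with that Google referrer. It assumes the server writes the common "combined" log format. The sample IP addresses, timestamps, and the `ExampleReader/1.0` agent are made up for illustration, and `suspicious_agents` is a hypothetical helper, not part of any blocker.

```python
import re
from collections import Counter

# Tail of a "combined" log line: request, status, size, referer, user agent.
LOG_TAIL = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The referer string that appears in the log excerpts in this issue.
SUSPECT_REFERER = "http://www.google.co.uk/url?sa=t&source=web&cd=1"


def suspicious_agents(lines, feed_path="/rss.xml"):
    """Count user agents that request the feed with the suspect referer."""
    counts = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        if match["path"] == feed_path and match["referer"] == SUSPECT_REFERER:
            counts[match["agent"]] += 1
    return counts


# Hypothetical sample lines (documentation-range IPs, invented timestamps).
sample = [
    '203.0.113.5 - - [15/Oct/2021:10:00:00 +0000] "GET /rss.xml HTTP/1.1" '
    '200 375883 "http://www.google.co.uk/url?sa=t&source=web&cd=1" "PHP/7.4"',
    '203.0.113.9 - - [15/Oct/2021:10:01:00 +0000] "GET /rss.xml HTTP/1.1" '
    '200 375883 "-" "ExampleReader/1.0"',
    '203.0.113.5 - - [15/Oct/2021:10:02:00 +0000] "GET /rss.xml HTTP/1.1" '
    '200 375883 "http://www.google.co.uk/url?sa=t&source=web&cd=1" "PHP/7.4"',
]

for agent, hits in suspicious_agents(sample).most_common():
    print(f"{hits:5d}  {agent}")
```

The resulting list of agents is only a starting point for deciding what to block, since a scraper can change its user agent and referrer at will.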

@mitchellkrogza
Owner

For a start, the referrer shows as Google. Can you search Google for the full link to your RSS file to see whether it got indexed? If so, request its removal and block indexing of it, then also deny access to it via robots.txt, or disable the RSS feed completely, as feeds are heavily abused.
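
As a rough illustration of the robots.txt suggestion, the Python sketch below checks a candidate rule set with the standard library's `urllib.robotparser` before it is deployed. The `/rss.xml` path comes from the log excerpt in this issue. The `example.com` domain and the `is_allowed` helper are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt that asks all crawlers to stay away from the feed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /rss.xml
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain.
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/rss.xml"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/some-post/"))
```

This only confirms what a well-behaved crawler would do with the rules. robots.txt is advisory, so clients that ignore it still have to be blocked at the server or firewall.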

@lunaraurora
Author

I believe I will disable RSS completely, as many bots do not respect robots.txt directives.
Thanks for your response.

@mitchellkrogza
Owner

> I believe I will disable RSS completely, as many bots do not respect robots.txt directives. Thanks for your response.

I did too, as it's heavily abused by worthless content-scraping sites and bots that fill pages with the scraped content and serve ads on those pages.
