Is there a way to stop the spider's duplicate check with Redis? #242
Comments
@Germey Any ideas?
@milkeasd
The way I see it, let developers customize their communication rules and add a disable option for the duplicate check.
@milkeasd
@milkeasd could you please provide your code or some sample code?
@LuckyPigeon it doesn't work. Setting ...
Maybe there should be a custom dupefilter option here: scrapy-redis/src/scrapy_redis/dupefilter.py, line 128 at commit 48a7a89.
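For context on that reference, the Redis-backed filter's `request_seen` is, roughly, a fingerprint plus a single Redis `SADD` per request. This is a paraphrased sketch of `scrapy_redis.dupefilter.RFPDupeFilter.request_seen`, not an exact copy of the pinned revision; it shows why every scheduled request costs one round trip to the Redis master:

```python
# Paraphrased sketch of scrapy_redis.dupefilter.RFPDupeFilter.request_seen;
# see the pinned file/line above for the exact code at that revision.
def request_seen(self, request):
    fp = self.request_fingerprint(request)
    # SADD returns how many members were added: 0 means the fingerprint
    # was already present, i.e. the request was seen before. Each call
    # is one network round trip to the Redis master.
    added = self.server.sadd(self.key, fp)
    return added == 0
```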
From Scrapy's docs: https://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class
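Since DUPEFILTER_CLASS accepts any importable class, one way to disable the check without breaking the scrapy-redis scheduler (which constructs its dupefilter with Redis-specific arguments that Scrapy's default filter does not accept) is to subclass the scrapy-redis filter and short-circuit `request_seen`. A minimal, untested sketch; the module path `myproject.dupefilter` is a placeholder:

```python
# myproject/dupefilter.py  (hypothetical module path)
from scrapy_redis.dupefilter import RFPDupeFilter


class NoopDupeFilter(RFPDupeFilter):
    """Keeps the scrapy-redis dupefilter interface so the scheduler can
    still instantiate it, but never reports a request as seen and never
    touches Redis for the check."""

    def request_seen(self, request):
        # Skip the fingerprint lookup and the Redis SADD round trip.
        return False
```

Then point the setting at it in settings.py: `DUPEFILTER_CLASS = "myproject.dupefilter.NoopDupeFilter"`.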
Hi, everyone! I've made a little change in ...
My spider was extremely slow when run with scrapy-redis, because there is a big delay between the slave and the master. I want to reduce the communication to just fetching the start_urls periodically, or only when all start_urls are done. Is there any way to do so?
Moreover, I want to stop the duplicate check to reduce the number of connections.
But I can't change DUPEFILTER_CLASS to Scrapy's default one; it raises an error.
Is there any other way to stop the duplicate check?
Or are there any ideas that could help speed up the process?
Thanks
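On the latency point above: scrapy-redis pops start URLs from Redis in batches when the spider idles, so raising the batch size makes each round trip to the master deliver more work. A hedged settings sketch, assuming a scrapy-redis version that reads REDIS_START_URLS_BATCH_SIZE (older versions fall back to CONCURRENT_REQUESTS); the Redis URL is a placeholder:

```python
# settings.py - sketch; assumes a scrapy-redis version that reads
# REDIS_START_URLS_BATCH_SIZE (otherwise it falls back to CONCURRENT_REQUESTS).
REDIS_URL = "redis://master-host:6379"  # placeholder master address

# Pop more start URLs per idle cycle so each round trip to the Redis
# master delivers a bigger batch of work, instead of a network hop
# for every few URLs.
REDIS_START_URLS_BATCH_SIZE = 256
CONCURRENT_REQUESTS = 64
```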