Distributed Web Scraping Tool Powered by Spidey and Redis

asad-haider/spidey-redis

Redis Spidey - Distributed Web Scraping Solution Powered by Redis

RedisSpidey combines Spidey and Redis for distributed crawling and web scraping. Multiple RedisSpidey instances can run in parallel, all listening to the same Redis queue for URLs. Scraped data is pushed back to Redis queues, so post-processing can be distributed as well.

Features

  • Distributed Crawling: Multiple crawler instances can run at the same time, all listening to the same queue.
  • RedisPipeline: Pushes crawled data back to Redis queues for distributed post-processing.
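
The shared-queue pattern described above can be illustrated without a Redis server. The following is a minimal sketch, not the spidey-redis implementation: plain arrays stand in for the Redis input and output queues, and `crawl`, `runWorker`, and `maxIdlePolls` are hypothetical names introduced only for this illustration. A real crawler would keep listening for new URLs; the idle limit exists only so the demo terminates.

```javascript
// In-memory model of several workers sharing one input queue and one output queue.
// Arrays stand in for Redis queues; this is NOT how spidey-redis is implemented.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical worker: takes a URL, "scrapes" it, and pushes the result to the output queue.
async function runWorker(name, urlsQueue, dataQueue, { sleepDelay, maxIdlePolls }) {
  let idlePolls = 0;
  while (idlePolls < maxIdlePolls) {
    const url = urlsQueue.shift(); // synchronous, so each URL goes to exactly one worker
    if (url === undefined) {
      idlePolls += 1;
      await sleep(sleepDelay); // queue is empty: wait before checking again
      continue;
    }
    idlePolls = 0;
    await sleep(1); // placeholder for the HTTP request and parsing
    dataQueue.push({ url, scrapedBy: name });
  }
}

// Hypothetical driver: starts workerCount workers against the same queues.
async function crawl(urls, workerCount, { sleepDelay = 10, maxIdlePolls = 2 } = {}) {
  const urlsQueue = [...urls];
  const dataQueue = [];
  const workers = Array.from({ length: workerCount }, (_, i) =>
    runWorker(`worker-${i + 1}`, urlsQueue, dataQueue, { sleepDelay, maxIdlePolls })
  );
  await Promise.all(workers);
  return dataQueue;
}

crawl(['https://example.com/a', 'https://example.com/b'], 2)
  .then((data) => console.log(data));
```

Because every worker takes items from the same queue, adding workers increases throughput without any coordination between them beyond the queue itself.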

Installation

npm install spidey-redis

Options

RedisSpidey supports all Spidey options in addition to the following specific options.

Configuration | Type   | Description                                        | Default | Required
redisUrl      | string | Redis URL, such as redis://localhost:6379          | null    | Yes
urlsKey       | string | Redis input queue name, such as urls:queue         | null    | Yes
dataKey       | string | Redis output queue name, such as data:queue        | null    | Yes, if using RedisPipeline
sleepDelay    | number | Time to wait for new items when the queue is empty | 5000 ms | No
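
The required and default values in the table above can be checked before a crawler is constructed. This is a minimal sketch under stated assumptions: `resolveRedisOptions` and its `usesRedisPipeline` flag are hypothetical helpers written for this example and are not part of spidey-redis, which may validate its options differently.

```javascript
// Hypothetical helper, NOT part of spidey-redis: applies the default and
// required rules from the options table to a plain options object.
function resolveRedisOptions(options = {}, { usesRedisPipeline = false } = {}) {
  const resolved = {
    ...options,
    sleepDelay: options.sleepDelay ?? 5000, // table default: 5000 ms
  };

  // redisUrl and urlsKey are always required.
  const missing = ['redisUrl', 'urlsKey'].filter((key) => !resolved[key]);

  // dataKey is only required when RedisPipeline is in use.
  if (usesRedisPipeline && !resolved.dataKey) {
    missing.push('dataKey');
  }

  if (missing.length > 0) {
    throw new Error(`Missing required option(s): ${missing.join(', ')}`);
  }
  return resolved;
}

const options = resolveRedisOptions(
  { redisUrl: 'redis://localhost:6379', urlsKey: 'urls:queue', dataKey: 'data:queue' },
  { usesRedisPipeline: true }
);
console.log(options.sleepDelay);
```

Failing early on a missing key gives a clearer error than a crawler that starts and then cannot find its queue.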

Usage

import { RedisSpidey, RedisPipeline } from 'spidey-redis';

class AmazonSpidey extends RedisSpidey {
  constructor() {
    super({
      // spidey options ...
      redisUrl: 'redis://localhost:6379',

      // Input queue
      urlsKey: 'amazon:urls',

      // Output queue
      dataKey: 'amazon:data',

      // Redis pipeline to push crawled data to data queue 
      pipelines: [RedisPipeline],
    });
  }
}

Conclusion

RedisSpidey is built for large-scale web scraping and crawling. Crawler instances scale out by listening to the same Redis input queue, and scraped data is pushed to a Redis output queue for distributed post-processing. Because it builds on Spidey, all Spidey options remain available.

License

Spidey is licensed under the MIT License.