theyorubayesian/otelemuye

Ọ̀tẹlẹ̀múyẹ́

Ọ̀tẹlẹ̀múyẹ́ means detective in Yoruba.

This project, Ọ̀tẹlẹ̀múyẹ́, provides an extensible framework for scraping websites. It is built on Scrapy and includes a Selenium middleware for handling dynamic content.

🎬 Installation

  • Create a conda environment
conda create -n otelemuye python=3.9
conda activate otelemuye
  • Run the following command to install this project
pip install .
  • If you would like a development installation instead, use the following command
pip install -e ".[dev]"

Setup 🛠️

  • You can find a list of existing spiders here.

  • See example.ipynb for notebook examples of how to create your own Spider and start crawling.

  • To use this tool from the command line, you need a development installation. See Installation.

  • You can create a new spider using the following command:

otelemuye create-spider --template template/sitemap --spider-name <YourSpiderName> --language <Language>
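
If you would rather script spider creation, for example to scaffold several spiders in one go, the command above can be assembled and run from Python. The sketch below is not part of this project: `build_create_spider_command` and `create_spider` are hypothetical helpers, `ExampleSpider` and `yoruba` are placeholder values, and only the flags shown above are used. It assumes the `otelemuye` executable is on your PATH.

```python
import shlex
import subprocess
from typing import List


def build_create_spider_command(
    spider_name: str,
    language: str,
    template: str = "template/sitemap",
) -> List[str]:
    """Build the argv list for `otelemuye create-spider` (hypothetical helper)."""
    # Only flags that appear in this README are used here.
    return [
        "otelemuye",
        "create-spider",
        "--template", template,
        "--spider-name", spider_name,
        "--language", language,
    ]


def create_spider(spider_name: str, language: str) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_create_spider_command(spider_name, language), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    # "ExampleSpider" and "yoruba" are placeholders, not values from the project.
    print(shlex.join(build_create_spider_command("ExampleSpider", "yoruba")))
```

Passing the arguments as a list avoids shell quoting problems if a spider name or template path ever contains spaces.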

Contribution

  • You need a development installation in order to contribute a Spider to this repository. See Installation.

  • To contribute new crawlers, extend otelemuye.SitemapSpider or otelemuye.Spider and provide concrete implementations of the abstract methods.

  • You will also need to provide a template config file in config/. The filename should be based on the name of the spider class you created, e.g. legitng.yaml is the config file for LegitNGSpider.

  • See LegitNGSpider for guidance if your crawler requires Selenium to load dynamic content.

  • You can start crawling by running a command similar to:

otelemuye run-till-complete --spider-class LegitNGSpider --check-interval 300

Note that --check-interval is only used when the Selenium middleware is in use.

  • To see other commands, configuration options and functionality, run:
otelemuye --help
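
The contribution steps above can be sketched in plain Python. This README does not list the abstract methods of otelemuye.Spider, so the base class below is only a stand-in, and `BaseSpider`, `parse_article`, `DemoSpider` and `config_filename` are hypothetical names used for illustration. The filename rule (lowercased class name without the Spider suffix) is inferred from the single legitng.yaml example and may not match how the project actually locates config files.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseSpider(ABC):
    """Stand-in for otelemuye.Spider; the real abstract methods may differ."""

    @abstractmethod
    def parse_article(self, html: str) -> dict:
        """Hypothetical abstract method: extract fields from one page."""


class DemoSpider(BaseSpider):
    """A concrete spider: every abstract method gets an implementation."""

    def parse_article(self, html: str) -> dict:
        # Toy extraction: treat the first non-empty line as the title.
        lines = [line.strip() for line in html.splitlines() if line.strip()]
        return {"title": lines[0] if lines else ""}


def config_filename(spider_class: type, config_dir: str = "config") -> Path:
    """Assumed convention: LegitNGSpider -> config/legitng.yaml."""
    name = spider_class.__name__
    if name.endswith("Spider"):
        name = name[: -len("Spider")]
    return Path(config_dir) / f"{name.lower()}.yaml"
```

With these definitions, `config_filename(DemoSpider)` points at `config/demo.yaml`, and trying to instantiate `BaseSpider` directly raises `TypeError`, which is how Python signals that an abstract method still needs a concrete implementation.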
