# Ọ̀tẹlẹ̀múyẹ́

*Ọ̀tẹlẹ̀múyẹ́* means "detective" in Yoruba.

This project provides an extensible framework for scraping websites. It relies on Scrapy and provides a Selenium middleware to handle dynamic content.
## Installation

- Create a conda environment:

  ```shell
  conda create -n otelemuye python=3.9
  conda activate otelemuye
  ```

- Run the following command to install this project:

  ```shell
  pip install .
  ```

- If you would like a development installation instead, use the following command:

  ```shell
  pip install -e ".[dev]"
  ```
## Usage

- You can find a list of existing spiders here.
- See `example.ipynb` for notebook examples of how you can create your own spider and start crawling.
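The workflow of creating your own spider can be sketched roughly as follows. This README does not spell out the actual abstract methods of `otelemuye.Spider`, so the base class and the `parse_article` hook below are hypothetical stand-ins that only illustrate the subclass-and-implement pattern; see `example.ipynb` for the real API.

```python
from abc import ABC, abstractmethod


class SpiderStandIn(ABC):
    """Hypothetical stand-in for otelemuye.Spider: the real base class
    declares abstract methods that concrete crawlers must implement."""

    @abstractmethod
    def parse_article(self, page_text: str) -> dict:
        """Turn one fetched page into a structured record (assumed hook)."""


class MyNewsSpider(SpiderStandIn):
    """A concrete spider: fills in every abstract hook of the base class."""

    name = "my_news"  # hypothetical spider name

    def parse_article(self, page_text: str) -> dict:
        # Toy extraction: first line is the headline, the rest is the body.
        headline, _, body = page_text.partition("\n")
        return {"headline": headline.strip(), "body": body.strip()}


spider = MyNewsSpider()
print(spider.parse_article("Big Story\nSomething happened today."))
```

Until every abstract hook is implemented, Python refuses to instantiate the subclass, which is how an abstract base class enforces this kind of contract.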
- To use this tool via the command line, you will require a development installation; see Installation.
- You can create a new spider using the following command:

  ```shell
  otelemuye create-spider --template template/sitemap --spider-name <YourSpiderName> --language <Language>
  ```
## Contributing

- You will require a development installation in order to contribute a spider to this repository; see Installation.
- To contribute new crawlers, extend `otelemuye.SitemapSpider` or `otelemuye.Spider` and provide concrete implementations of the abstract methods.
- You will also need to provide a template config file in `config/`. The filename should match the name of the spider class you created, e.g. `legitng.yaml` is the config file for `LegitNGSpider`.
- See `LegitNGSpider` for guidance if your crawler requires Selenium to load dynamic content.
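The config naming convention above (class `LegitNGSpider` → file `legitng.yaml`) appears to be: drop the `Spider` suffix and lowercase what remains. A small helper that captures this, inferred from that single example and therefore an assumption rather than a documented rule:

```python
def config_filename(spider_class_name: str) -> str:
    """Derive the config filename for a spider class.

    Assumed convention (inferred from the LegitNGSpider example):
    strip the trailing "Spider" and lowercase the rest.
    """
    stem = spider_class_name.removesuffix("Spider").lower()
    return f"{stem}.yaml"


print(config_filename("LegitNGSpider"))  # legitng.yaml
```

`str.removesuffix` requires Python 3.9+, which matches the environment created in Installation.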
## Running a Spider

- You can start crawling by running a command similar to:

  ```shell
  otelemuye run-till-complete --spider-class LegitNGSpider --check-interval 300
  ```

  Note that `--check-interval` is only used when the Selenium middleware is in use.

- To see other commands, configurations, and functionality, run:

  ```shell
  otelemuye --help
  ```