scrapy crawl mySpider -a url=<enter_complete_url> -a domain=<enter_allowed_domains_separated_by_&>
scrapy crawl mySpider -a url=https://en.wikipedia.org/wiki/Jimmy_Wales -a domain=en.wikipedia.org
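Scrapy passes each `-a` option to the spider's constructor as a keyword argument. Below is a minimal sketch of how mySpider might consume them; the class name and attribute handling are assumptions for illustration, not the project's exact code.

```python
import scrapy


class MySpider(scrapy.Spider):
    name = "mySpider"

    def __init__(self, url=None, domain=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a url=... becomes the starting point of the crawl
        self.start_urls = [url] if url else []
        # -a domain=... restricts recursion; multiple domains are separated by &
        self.allowed_domains = domain.split("&") if domain else []
```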
- Scrapy 2.3 and above
pip install scrapy
An almost generic web crawler built with Scrapy and Python 3.7 to recursively crawl entire websites. Developing a single fully generic crawler is difficult because different websites require different XPath expressions to retrieve content. This multisite crawler extracts the text of paragraph tags and outputs a JSON file in the following format:
{
    "pages": [
        {
            "page": "....",
            "content": "...."
        },
        {
            "page": "...",
            "content": "..."
        }
    ]
}
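A minimal sketch of the extraction and recursion step, assuming the spider yields one record per page with the keys shown above; the selector and link-following details are illustrative, not the project's exact code.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "mySpider"

    def parse(self, response):
        # Collect all paragraph text on the page and emit one record
        # matching the "page"/"content" keys shown above.
        paragraphs = response.xpath("//p//text()").getall()
        yield {
            "page": response.url,
            "content": " ".join(text.strip() for text in paragraphs),
        }

        # Follow links recursively; allowed_domains filters out-of-scope URLs.
        for link in LinkExtractor().extract_links(response):
            yield response.follow(link, callback=self.parse)
```

Exporting with a JSON feed (e.g. `-o output.json` or a `FEEDS` entry in settings.py) yields an array of such records; the top-level "pages" wrapper shown above presumably comes from the project's own pipeline or exporter.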
If required, it can be modified to read text from other tags by adding explicit XPath expressions, and the settings.py file can be customised with any key-value pairs supported by Scrapy, as sketched below.
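For instance, broadening the selector to include heading tags might look like this (the exact XPath used by the crawler is an assumption):

```python
# Capture heading text in addition to paragraph text
texts = response.xpath("//p//text() | //h1//text() | //h2//text()").getall()
```

Similarly, settings.py can hold any standard Scrapy option, for example:

```python
# settings.py -- illustrative values only
DOWNLOAD_DELAY = 0.5    # pause between requests
DEPTH_LIMIT = 3         # cap how deep the recursion goes
ROBOTSTXT_OBEY = True   # respect robots.txt
```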