This is a tool to scrape and clean the page content from any Fandom.com wiki (e.g. https://harrypotter.fandom.com).
The tool is a modified version of ScrapeFandom, which itself relies heavily upon WikiExtractor. Those tools are much more full-featured, robust, and set up to be used as Python packages. I encourage you to check them out.
My goal was to produce a smaller, simpler version that could be controlled entirely from the command line.
I aimed for ease of installation and a quick path from cloning the repo to having cleaned files.
- Clone the repo to a location of your choosing (a sketch follows this list).
- Run the run-me.sh script with the appropriate arguments.
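For example, cloning might look like this (the URL below is a placeholder, not the project's actual address):

git clone https://github.com/<user>/<repo>.git   # hypothetical URL; substitute the real one
cd <repo>

Then invoke the script: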
./run-me.sh all ~/Documents/my_project/ harrypotter
The above line will run 'all' steps. The raw, processed, and cleaned file directories will be placed into ~/Documents/my_project/. The fandom that will be scraped is https://harrypotter.fandom.com.
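As a sketch of what to expect, a full run should leave three directories under the output path; only the <fandom>_processed naming is confirmed by the step 4 example below, so the _raw and _cleaned suffixes are assumptions:

ls ~/Documents/my_project/
# harrypotter_raw/  harrypotter_processed/  harrypotter_cleaned/   (_raw and _cleaned names assumed)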
./run-me.sh 1 ~/scraped_data/ matrix
The above line will run step 1 only, which is scraping. The raw file directories will be placed into ~/scraped_data/. The fandom that will be scraped is https://matrix.fandom.com.
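To sanity-check the scrape, you might list the raw output; the directory name here is an assumption based on the <fandom>_processed pattern used by step 4:

ls ~/scraped_data/matrix_raw/   # hypothetical name; verify against the actual output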
./run-me.sh 4 ~/scraped_data/ matrix
The above line will run step 4 only, which is cleaning. The cleaned file directories will be placed into ~/scraped_data/. Nothing is scraped in this case; the "matrix" argument serves only to complete the path to the files that need to be cleaned: ~/scraped_data/matrix_processed/.
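As a minimal pre-flight sketch (the matrix_processed directory name comes from the path above):

[ -d ~/scraped_data/matrix_processed ] || echo "matrix_processed not found; run the earlier steps first"
./run-me.sh 4 ~/scraped_data/ matrix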
The code is made available under the GNU Affero General Public License v3.0.