Goodreads Scraper

Python script to scrape Goodreads book shelves.

Installation
Usage

Installation

mkvirtualenv --python=`which python3` goodreads-scraper
pip install -r requirements.txt

Usage

Step 1: Select shelves

Go to https://www.goodreads.com/shelf and choose the shelves you want to scrape.

Say you want fantasy, adventure and thriller books. Go to the shelves.txt file and fill it with one shelf name per line.

This is what shelves.txt would look like:

fantasy
adventure
thriller
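
As a minimal sketch of how a file in this format can be read, the helper below loads one shelf name per line. The function name read_shelves is hypothetical and is not taken from books_scraper.py; skipping blank lines is an assumption made so that a trailing newline does no harm.

```python
from pathlib import Path


def read_shelves(path="shelves.txt"):
    """Return the shelf names listed in the file, one per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip whitespace and skip blank lines so a trailing newline is harmless.
    return [line.strip() for line in lines if line.strip()]
```

Calling read_shelves() on the file above would give a list with the three shelf names in the order they were written.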

Step 2: Get your cookies

To retrieve every page you ask for, you need to log in to Goodreads and copy the value of your _session_id2 cookie, which your browser sets automatically once you make a request while logged in. Set the constant COOKIE in books_scraper.py to that value (or set it in your .env file).

If you skip this step, every request you make to get a shelf will return the first page, even if you ask for the second one.
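
To illustrate how the cookie travels with a request, here is a rough sketch using only the standard library. It builds a request without sending it. The function name build_shelf_request, the shelf URL pattern, and reading COOKIE from an environment variable are all assumptions for illustration; books_scraper.py may do this differently.

```python
import os
import urllib.request
from urllib.parse import quote

# Assumption: the cookie value has been exported as COOKIE in the environment.
COOKIE = os.environ.get("COOKIE", "")


def build_shelf_request(shelf, page, cookie=COOKIE):
    """Build (but do not send) a request for one page of a shelf."""
    # Assumed URL pattern; compare it with what your browser shows for a shelf.
    url = f"https://www.goodreads.com/shelf/show/{quote(shelf)}?page={page}"
    request = urllib.request.Request(url)
    # Attach the session cookie copied from the logged-in browser.
    request.add_header("Cookie", f"_session_id2={cookie}")
    return request
```

Passing the result to urllib.request.urlopen would perform the actual download.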

Step 3: Run books scraper

In this step you will scrape books from the selected shelves by running the following command:

python books_scraper.py

You can set how many pages to scrape from each shelf by changing the value of the constant PAGES_PER_SHELF.

By the end of this step you will have one JSON file per page per shelf inside the shelves_pages folder. Something like this:

fantasy_1.json
fantasy_2.json
fantasy_3.json
adventure_1.json
adventure_2.json
adventure_3.json
thriller_1.json
thriller_2.json
thriller_3.json

So adventure_2.json corresponds to page number 2 of the adventure books shelf.
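
The naming scheme can be sketched as a small helper that lists the files you should expect for a given set of shelves and page count. The function name expected_page_files is hypothetical; only the shelf_page.json pattern comes from the listing above.

```python
def expected_page_files(shelves, pages_per_shelf):
    """List the file names expected for each shelf and page number."""
    return [
        f"{shelf}_{page}.json"
        for shelf in shelves
        # Page numbers start at 1, matching fantasy_1.json above.
        for page in range(1, pages_per_shelf + 1)
    ]
```

A list like this is handy for checking that no page is missing from shelves_pages before moving on to the merge.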

Step 4: Run shelves merger

But we just want one big books.json file...

Just run the following command to merge all generated files into one big clean books.json file:

python shelves_merger.py

This script will collect all books, remove duplicates, clean the attributes of the books and clean all reviews.
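
The collect-and-deduplicate part can be sketched roughly as follows. This is not the code of shelves_merger.py: it assumes each page file holds a JSON list of book objects and that a field named "url" identifies a book uniquely, and both of those are guesses made for illustration. Attribute and review cleaning are left out.

```python
import json
from pathlib import Path


def merge_shelf_pages(folder="shelves_pages", key="url"):
    """Merge every page file in the folder into one de-duplicated list."""
    merged = {}
    for page_file in sorted(Path(folder).glob("*.json")):
        books = json.loads(page_file.read_text(encoding="utf-8"))
        for book in books:
            # Keep the first copy of each book; later duplicates are ignored.
            merged.setdefault(book[key], book)
    return list(merged.values())
```

Writing the returned list with json.dump to books.json would give a single merged file.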

Step 5: Retrieve all unique authors and genres

The final step is to generate JSON files containing all author names and genres by running the following command:

python get_data.py

This produces two JSON files: authors.json, containing a list of all unique authors, and genres.json, containing a list of all unique genres.
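
As an illustration of the idea, the sketch below pulls distinct authors and genres out of a list of books. The field names "author" and "genres" are hypothetical, as is the function name; the real layout of books.json may differ.

```python
def unique_authors_and_genres(books):
    """Return sorted lists of the distinct authors and genres in books."""
    authors = set()
    genres = set()
    for book in books:
        # "author" and "genres" are assumed field names, not confirmed ones.
        if book.get("author"):
            authors.add(book["author"])
        genres.update(book.get("genres", []))
    return sorted(authors), sorted(genres)
```

Each of the two returned lists could then be written to its own JSON file.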

javierlopeza/goodreads-scraper
