Let's say you want to read some sort of fiction. You're a fan of it, perhaps. But mobile websites are kind of non-ideal, so you'd like a proper ebook made from whatever you're reading.
You need Python 3.7+ and poetry.
My recommended setup process is:
$ pip install poetry
$ poetry install
$ poetry shell
...adjust as needed. Just make sure the dependencies from pyproject.toml get installed somehow.
Basic
$ python3 leech.py [[URL]]
A new file will appear named Title of the Story.epub.
This is equivalent to the slightly longer
$ python3 leech.py download [[URL]]
Flushing the cache
$ python3 leech.py flush
If you want to put it on a Kindle you'll have to convert it. I'd recommend Calibre, though you could also try using kindlegen directly.
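As a rough sketch of the Calibre route, the snippet below builds a command line for ebook-convert, the converter that ships with Calibre. The helper name build_convert_command and the default azw3 target format are assumptions made for illustration; they are not part of leech.

```python
import subprocess
from pathlib import Path


def build_convert_command(epub_path, fmt="azw3"):
    """Build an ebook-convert command that writes the converted book
    next to the source EPUB (hypothetical helper, not part of leech)."""
    epub = Path(epub_path)
    target = epub.with_suffix("." + fmt)
    return ["ebook-convert", str(epub), str(target)]


command = build_convert_command("Title of the Story.epub")
# To actually convert (requires Calibre to be installed and on PATH):
# subprocess.run(command, check=True)
```

Keeping the command construction separate from the subprocess call makes it easy to print and inspect the command before running it.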
- Fanfiction.net
- FictionPress
- ArchiveOfOurOwn
  - Yes, it has its own built-in EPUB export, but the formatting is horrible
- Various XenForo-based sites: SpaceBattles and SufficientVelocity, most notably
- RoyalRoad
- Fiction.live (Anonkun)
- DeviantArt galleries/collections
- Sta.sh
- Completely arbitrary sites, with a bit more work (see below)
A very small amount of configuration is possible by creating a file called leech.json in the project directory. Currently you can define login information for sites that support it, and some options for book covers.
Example:
{
    "logins": {
        "QuestionableQuesting": ["username", "password"]
    },
    "cover": {
        "fontname": "Comic Sans MS",
        "fontsize": 30,
        "bgcolor": [20, 120, 20],
        "textcolor": [180, 20, 180],
        "cover_url": "https://website.com/image.png"
    },
    "output_dir": "/tmp/ebooks",
    "site_options": {
        "RoyalRoad": {
            "output_dir": "/tmp/litrpg_isekai_trash"
        }
    }
}
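To illustrate how the example above might be read, here is a minimal sketch that loads leech.json and picks an output directory for a given site. It is not leech's actual implementation: the function names are hypothetical, and the precedence (a site_options entry first, then the top-level output_dir, then the current directory) is an assumption inferred from the shape of the example.

```python
import json
from pathlib import Path


def load_config(path="leech.json"):
    """Return the parsed config, or an empty dict if the file is absent."""
    config_file = Path(path)
    if not config_file.exists():
        return {}
    with config_file.open(encoding="utf-8") as handle:
        return json.load(handle)


def resolve_output_dir(config, site_name, default="."):
    """Pick an output directory for a site.

    Assumed precedence: site_options[site_name].output_dir,
    then the top-level output_dir, then `default`.
    """
    site_options = config.get("site_options", {}).get(site_name, {})
    return site_options.get("output_dir", config.get("output_dir", default))


example = {
    "output_dir": "/tmp/ebooks",
    "site_options": {"RoyalRoad": {"output_dir": "/tmp/litrpg_isekai_trash"}},
}
royalroad_dir = resolve_output_dir(example, "RoyalRoad")
other_dir = resolve_output_dir(example, "ArchiveOfOurOwn")
```

Under these assumptions, RoyalRoad stories would land in the site-specific directory while everything else falls back to the top-level output_dir.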
If you want to just download a one-off story from a site, you can create a definition file to describe it. This requires investigation and understanding of things like CSS selectors, which may take some trial and error.
Example practical.json:
{
    "url": "https://practicalguidetoevil.wordpress.com/table-of-contents/",
    "title": "A Practical Guide To Evil: Book 1",
    "author": "erraticerrata",
    "chapter_selector": "#main .entry-content > ul:nth-of-type(1) > li > a",
    "content_selector": "#main .entry-content",
    "filter_selector": ".sharedaddy, .wpcnt, style",
    "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
}
Run as:
$ ./leech.py practical.json
This tells leech to load url, follow the links described by chapter_selector, extract the content from those pages as described by content_selector, and remove anything within that content which matches filter_selector. Optionally, cover_url will replace the default cover with the image of your choice.
If chapter_selector isn't given, it'll create a single-chapter book by applying content_selector to url.
This is a fairly viable way to extract a story from, say, a random WordPress installation with a convenient table of contents. It's relatively likely to get you at least most of the way to the ebook you want, with maybe some manual editing needed.
A more advanced example with JSON would be:
{
    "url": "https://practicalguidetoevil.wordpress.com/2015/03/25/prologue/",
    "title": "A Practical Guide To Evil: Book 1",
    "author": "erraticerrata",
    "content_selector": "#main .entry-wrapper",
    "content_title_selector": "h1.entry-title",
    "content_text_selector": ".entry-content",
    "filter_selector": ".sharedaddy, .wpcnt, style",
    "next_selector": "a[rel=\"next\"]:not([href*=\"prologue\"])",
    "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
}
Because there's no chapter_selector here, leech will keep on looking for a link which it can find with next_selector and following that link. We also see more advanced metadata acquisition here, with content_title_selector and content_text_selector being used to find specific elements from within the content.
If multiple matches for content_selector are found, leech will assume multiple chapters are present on one page, and will handle that. If you find a story that you want on a site which has all the chapters in the right order and next-page links, this is a notably efficient way to download it. See examples/dungeonkeeperami.json for this being used.
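The three ways a definition file can drive a download (a table of contents via chapter_selector, a chain of next-page links via next_selector, or a single page) can be summarized in a small sketch. This is an illustration of the behavior described above, not leech's own code: the function name crawl_mode and the mode labels are made up, and treating url and content_selector as required keys is an assumption based on the two examples.

```python
def crawl_mode(definition):
    """Classify how a definition file would be crawled.

    Hypothetical helper: the mode names are illustrative labels only.
    """
    # Assumption: both examples above supply these two keys.
    for key in ("url", "content_selector"):
        if key not in definition:
            raise ValueError(f"definition is missing required key: {key}")

    if "chapter_selector" in definition:
        # Follow every link matched on the table-of-contents page.
        return "table-of-contents"
    if "next_selector" in definition:
        # No chapter list: keep following the next-page link.
        return "next-link"
    # Neither selector: apply content_selector to url itself.
    return "single-page"


toc_definition = {
    "url": "https://practicalguidetoevil.wordpress.com/table-of-contents/",
    "chapter_selector": "#main .entry-content > ul:nth-of-type(1) > li > a",
    "content_selector": "#main .entry-content",
}
next_definition = {
    "url": "https://practicalguidetoevil.wordpress.com/2015/03/25/prologue/",
    "content_selector": "#main .entry-wrapper",
    "next_selector": 'a[rel="next"]:not([href*="prologue"])',
}
toc_mode = crawl_mode(toc_definition)
next_mode = crawl_mode(next_definition)
```

A check like this can be handy for catching a missing key in a hand-written definition before starting a long download.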
If you need more advanced behavior, consider looking at...
To add support for a new site, create a file in the sites directory that implements the Site interface. Take a look at ao3.py for a minimal example of what you have to do.
You can build the project's Docker container like this:
docker build . -t kemayo/leech:snapshot
The container's entrypoint runs leech directly and sets the current working directory to /work, so you can mount any directory there:
docker run -it --rm -v ${DIR}:/work kemayo/leech:snapshot download [[URL]]
If you submit a pull request to add support for another reasonably-general-purpose site, I will nigh-certainly accept it.
Run EpubCheck on epubs you generate to make sure they're not breaking.
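Before reaching for EpubCheck, a quick structural sanity check can catch grossly broken files. The sketch below is not a substitute for EpubCheck and is not part of leech; quick_epub_check is a hypothetical helper that only verifies the container basics: the file is a ZIP archive, its first entry is mimetype containing application/epub+zip, META-INF/container.xml exists, and no entry is corrupt.

```python
import zipfile


def quick_epub_check(path):
    """Return a list of container-level problems (empty if none found).

    Hypothetical helper: it covers only a few basics and says nothing
    about the validity of the book's content.
    """
    if not zipfile.is_zipfile(path):
        return ["not a ZIP archive"]

    problems = []
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        if not names or names[0] != "mimetype":
            problems.append("'mimetype' must be the first entry")
        elif archive.read("mimetype") != b"application/epub+zip":
            problems.append("unexpected mimetype contents")
        if "META-INF/container.xml" not in names:
            problems.append("missing META-INF/container.xml")
        corrupt = archive.testzip()  # name of the first bad entry, or None
        if corrupt is not None:
            problems.append(f"corrupt entry: {corrupt}")
    return problems
```

Usage would look like quick_epub_check("Title of the Story.epub"); an empty list means none of these basic problems were found, and EpubCheck is still the tool to run for real validation.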