Merge pull request #10 from apankowski/feature/automated-download-process

Automated download process
apankowski authored Oct 29, 2023
2 parents c3c8139 + c285d7c commit c664289
Showing 3 changed files with 27 additions and 85 deletions.
50 changes: 11 additions & 39 deletions README.md
@@ -4,68 +4,40 @@ This is a POC downloader of documents from [doc88.com](https://doc88.com). It sa

## Instructions

The download procedure is a bit of a PITA, but hey… it's a POC.

1. Navigate to the desired document in your browser.
2. Make sure the browser's zoom level is set to 100%; in some tests, zoom levels below 100% resulted in lower-quality captured pages.
3. Open Developer Tools (e.g. press <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>I</kbd>).
4. Switch to JavaScript Console.
5. Paste [this JavaScript](downloadPages.js) in Console and confirm with <kbd>Enter</kbd>.
6. Preload all pages. Type:
6. Download the pages. Type:
```javascript
preloadAllPages()
downloadPages()
```
in Console and hit <kbd>Enter</kbd>. Wait until the process ends, printing `Finished preloading pages` in the Console.
7. Download pages in batches. Type:
```javascript
downloadPages(1, 10)
```
in Console and hit <kbd>Enter</kbd> to download pages 1 through 10.
* ℹ It is advised to download 10 pages at a time. After saving a batch of pages simply enter `downloadPages(11, 20)` to download pages 11 through 20, and so on.
* ℹ In case of Chrome, the first time you download a batch of pages you may see a popup stating that "This site is attempting to download multiple files". You have to allow it as each PDF page is downloaded as a separate file.
* See [options](#options) section below for options.
in Console and hit <kbd>Enter</kbd>. This will download all the pages. Pages will be automatically preloaded and saved one by one, as they are loaded.
See [options](#options) section below for options.
7. Wait until the process ends, printing `Finished downloading pages` in the Console.
8. Make sure all desired pages were downloaded correctly.

### Options

The `downloadPages` function takes an optional options object as its only argument, e.g.:

```javascript
downloadPages(1, 10, {quality: 0.8, imageNamePrefix: 'temp_'})
downloadPages({fromPage: 2, toPage: 10, quality: 0.8, imageNamePrefix: 'temp_'})
```

Possible options:

1. `format` – downloaded image format; string; either `'jpg'` or `'png'`; default is `'jpg'`
2. `quality` – quality of images; applicable when `format` is `'jpg'`; number between `0` and `1`; default is `0.9`
3. `imageNamePrefix` – prefix for names of downloaded images; string; default is `'page'` (resulting in downloaded file names e.g.: `page001.jpg`, `page002.jpg`, etc. assuming `format` is `'jpg'`)

## Bulk download

You can bulk download all the pages without the browser block to this behavior by running [this JavaScript](batchDownloadAll.js) in Console and confirming with <kbd>Enter</kbd>. It downloads all the images in 10 page blocks waiting for a timeout so the Browser does not block this operation.

Download pages in batches. Type:

```javascript
const numPages = 100; // Your number of pages
batchDownload(numPages);
```

You can also specify the images format and the desired wait interval between downloads:

```javascript
const numPages = 100; // Your number of pages
const format = 'jpeg'; // JPEG image format
const interval = 1000; // 1000 milliseconds
batchDownload(numPages, format, interval);
```
1. `fromPage` – first page in range to be downloaded; number; default is `1`
2. `toPage` – last page in range to be downloaded; number; default is total number of pages in the document
3. `format` – downloaded image format; string; either `'jpg'` or `'png'`; default is `'jpg'`
4. `quality` – quality of images; applicable when `format` is `'jpg'`; number between `0` and `1`; default is `0.9`
5. `imageNamePrefix` – prefix for names of downloaded images; string; default is `'page'` (resulting in downloaded file names e.g.: `page001.jpg`, `page002.jpg`, etc. assuming `format` is `'jpg'`)
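
The defaults listed above can be resolved with destructuring. The following is a minimal sketch, not the repository's code: `resolveOptions` and `imageNameForSketch` are hypothetical names, `totalPages` stands in for the page count the script reads from the document, and the three-digit zero padding is inferred from the `page001.jpg` example.

```javascript
// Hypothetical helper: merge user options with the documented defaults.
// `totalPages` is an assumed input; the real script derives it from the page.
function resolveOptions(options = {}, totalPages = 1) {
  const {
    fromPage = 1,
    toPage = totalPages,
    format = 'jpg',
    quality = 0.9,
    imageNamePrefix = 'page',
  } = options
  return { fromPage, toPage, format, quality, imageNamePrefix }
}

// Hypothetical helper: build a file name such as `page001.jpg`.
function imageNameForSketch(pageNo, { imageNamePrefix, format }) {
  return imageNamePrefix + String(pageNo).padStart(3, '0') + '.' + format
}

const resolved = resolveOptions(
  { fromPage: 2, toPage: 10, quality: 0.8, imageNamePrefix: 'temp_' },
  42
)
console.log(resolved.fromPage, resolved.toPage, resolved.format) // 2 10 jpg
console.log(imageNameForSketch(2, resolved)) // temp_002.jpg
```

Any option left out of the object falls back to its default, so `resolveOptions({}, 42)` would cover pages 1 through 42 as JPEGs at quality 0.9.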

## Converting downloaded images back to a PDF

Under Linux you can easily convert downloaded images back to a PDF.

To do that:

1. Install ImageMagick package:
```shell
sudo apt-get install imagemagick
39 changes: 0 additions & 39 deletions batchDownloadAll.js

This file was deleted.

23 changes: 16 additions & 7 deletions downloadPages.js
@@ -35,18 +35,19 @@ function waitUntilPageIsLoaded(pageNo, pageCanvas, resolve){
}
}

async function preloadPage(pageNo) {
async function preloadPage(pageNo, pageCanvas) {
console.log("Preloading page #" + pageNo)
const pageCanvas = getPageCanvas(pageNo)
pageCanvas.scrollIntoView()
return new Promise((resolve) => waitUntilPageIsLoaded(pageNo, pageCanvas, resolve))
}

// Keep for debugging purposes
async function preloadAllPages() {
revealAllPagePlaceholders()
const pageCount = getPageCount()
for (let pageNo = 1; pageNo <= pageCount; pageNo++) {
await preloadPage(pageNo)
const pageCanvas = getPageCanvas(pageNo)
await preloadPage(pageNo, pageCanvas)
}
console.log("Finished preloading pages")
}
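
`preloadPage` wraps the callback-style `waitUntilPageIsLoaded` in a promise so that callers can `await` it. The body of `waitUntilPageIsLoaded` is not shown in this hunk, so the following is only a sketch of the general pattern under assumed names: `waitUntil` and `whenReady` are hypothetical, and a plain boolean flag stands in for the "page has rendered" check.

```javascript
// Hypothetical polling helper: calls `resolve` once `isReady()` returns true,
// otherwise re-schedules itself after `intervalMs` milliseconds.
function waitUntil(isReady, resolve, intervalMs = 50) {
  if (isReady()) {
    resolve()
  } else {
    setTimeout(() => waitUntil(isReady, resolve, intervalMs), intervalMs)
  }
}

// Wrap the callback-style helper in a promise so callers can `await` it.
function whenReady(isReady, intervalMs) {
  return new Promise((resolve) => waitUntil(isReady, resolve, intervalMs))
}

// Usage: the flag flips after about 120 ms, standing in for a page finishing rendering.
let loaded = false
setTimeout(() => { loaded = true }, 120)
whenReady(() => loaded, 20).then(() => console.log('ready'))
```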
@@ -89,13 +90,21 @@ function downloadCanvasAsImage(canvas, imageName, imageFormat) {
)
}

function downloadPages(from, to, options = {}) {
async function downloadPages(options = {}) {
revealAllPagePlaceholders()

const imageFormat = imageFormatFor(options)
const { fromPage = 1, toPage = getPageCount() } = options

for (let pageNo = from; pageNo <= to; pageNo++) {
for (let pageNo = fromPage; pageNo <= toPage; pageNo++) {
const pageCanvas = getPageCanvas(pageNo)
if (pageCanvas === null) break
if (!pageCanvas) break; // Exit early if page number is out of range

const imageName = imageNameFor(pageNo, options)
downloadCanvasAsImage(pageCanvas, imageName, imageFormat)
await preloadPage(pageNo, pageCanvas).then(() => {
downloadCanvasAsImage(pageCanvas, imageName, imageFormat)
console.log("Downloaded page #" + pageNo)
})
}
console.log("Finished downloading pages " + fromPage + "-" + toPage)
}
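
The new loop awaits each page's preload before saving it, so pages are processed strictly in order. A minimal sketch of that pattern outside the browser follows; `fakePreload` and `fakeDownload` are hypothetical stubs, not the script's DOM-based helpers.

```javascript
// Stand-in for preloadPage: resolves after a short delay, as if a page had rendered.
function fakePreload(pageNo) {
  return new Promise((resolve) => setTimeout(() => resolve(pageNo), 10))
}

// Stand-in for downloadCanvasAsImage: just records which page was "saved".
function fakeDownload(pageNo, log) {
  log.push(pageNo)
}

async function downloadSequentially({ fromPage = 1, toPage = 3 } = {}) {
  const saved = []
  for (let pageNo = fromPage; pageNo <= toPage; pageNo++) {
    await fakePreload(pageNo) // the next iteration starts only after this resolves
    fakeDownload(pageNo, saved)
  }
  return saved
}

downloadSequentially({ fromPage: 2, toPage: 4 }).then((saved) => console.log(saved)) // [ 2, 3, 4 ]
```

Because each iteration waits for the previous one, only one page is being loaded at any moment.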
