Merge pull request #10 from apankowski/feature/automated-download-process

Automated download process
apankowski · Oct 29, 2023 · c664289
2 parents c3c8139 + c285d7c
commit c664289
Showing 3 changed files with 27 additions and 85 deletions.
diff --git a/README.md b/README.md
@@ -4,68 +4,40 @@ This is a POC downloader of documents from [doc88.com](https://doc88.com). It sa
 
 ## Instructions
 
-The download procedure is a bit of a PITA, but hey… it's a POC.
-
 1. Navigate to the desired document in your browser.
 2. Make sure browser's zoom level is set to 100% — based on some tests it seems that zoom levels lower than 100% can result in lower quality of captured pages.
 3. Open Developer Tools (e.g. press <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>I</kbd>).
 4. Switch to JavaScript Console.
 5. Paste [this JavaScript](downloadPages.js) in Console and confirm with <kbd>Enter</kbd>.
-6. Preload all pages. Type:
+6. Download the pages. Type:
     ```javascript
-    preloadAllPages()
+    downloadPages()
     ```
-   in Console and hit <kbd>Enter</kbd>. Wait until the process ends, printing `Finished preloading pages` in the Console.
-7. Download pages in batches. Type:
-   ```javascript
-   downloadPages(1, 10)
-   ```
-   in Console and hit <kbd>Enter</kbd> to download pages 1 through 10.
-    * ℹ It is advised to download 10 pages at a time. After saving a batch of pages simply enter `downloadPages(11, 20)` to download pages 11 through 20, and so on.
-    * ℹ In case of Chrome, the first time you download a batch of pages you may see a popup stating that "This site is attempting to download multiple files". You have to allow it as each PDF page is downloaded as a separate file.
-    * See [options](#options) section below for options.
+   in Console and hit <kbd>Enter</kbd>. This will download all the pages. Pages will be automatically preloaded and saved one by one, as they are loaded.  
+   See [options](#options) section below for options.
+7. Wait until the process ends, printing `Finished downloading pages` in the Console.
 8. Make sure all desired pages were downloaded correctly.
 
 ### Options
 
 `downloadPages` function takes options object as the 3rd optional argument, e.g.:
 
 ```javascript
-downloadPages(1, 10, {quality: 0.8, imageNamePrefix: 'temp_'})
+downloadPages({fromPage: 2, toPage: 10, quality: 0.8, imageNamePrefix: 'temp_'})
 ```
 
 Possible options:
 
-1. `format` – downloaded image format; string; either `'jpg'` or `'png'`; default is `'jpg'`
-2. `quality` – quality of images; applicable when `format` is `'jpg'`; number between `0` and `1`; default is `0.9`
-3. `imageNamePrefix` – prefix for names of downloaded images; string; default is `'page'` (resulting in downloaded file names e.g.: `page001.jpg`, `page002.jpg`, etc. assuming `format` is `'jpg'`)
-
-## Bulk download
-
-You can bulk download all the pages without the browser block to this behavior by running [this JavaScript](batchDownloadAll.js) in Console and confirming with <kbd>Enter</kbd>. It downloads all the images in 10 page blocks waiting for a timeout so the Browser does not block this operation.
-
-Download pages in batches. Type:
-
-```javascript
-const numPages = 100; // Your number of pages
-batchDownload(numPages);
-```
-
-You can also specify the images format and the desired wait interval between downloads:
-
-```javascript
-const numPages = 100; // Your number of pages
-const format = 'jpeg'; // JPEG image format
-const interval = 1000; // 1000 milliseconds
-batchDownload(numPages, format, interval);
-```
+1. `fromPage` – first page in range to be downloaded; number; default is `1`
+2. `toPage` – last page in range to be downloaded; number; default is total number of pages in the document
+3. `format` – downloaded image format; string; either `'jpg'` or `'png'`; default is `'jpg'`
+4. `quality` – quality of images; applicable when `format` is `'jpg'`; number between `0` and `1`; default is `0.9`
+5. `imageNamePrefix` – prefix for names of downloaded images; string; default is `'page'` (resulting in downloaded file names e.g.: `page001.jpg`, `page002.jpg`, etc. assuming `format` is `'jpg'`)
 
 ## Converting downloaded images back to a PDF
 
 Under Linux you can easily convert downloaded images back to a PDF.
 
-To do that:
-
 1. Install ImageMagick package:
     ```shell
     sudo apt-get install imagemagick

diff --git a/batchDownloadAll.js b/batchDownloadAll.js
diff --git a/downloadPages.js b/downloadPages.js
@@ -35,18 +35,19 @@ function waitUntilPageIsLoaded(pageNo, pageCanvas, resolve){
   }
 }
 
-async function preloadPage(pageNo) {
+async function preloadPage(pageNo, pageCanvas) {
   console.log("Preloading page #" + pageNo)
-  const pageCanvas = getPageCanvas(pageNo)
   pageCanvas.scrollIntoView()
   return new Promise((resolve) => waitUntilPageIsLoaded(pageNo, pageCanvas, resolve))
 }
 
+// Keep for debugging purposes
 async function preloadAllPages() {
   revealAllPagePlaceholders()
   const pageCount = getPageCount()
   for (let pageNo = 1; pageNo <= pageCount; pageNo++) {
-    await preloadPage(pageNo)
+    const pageCanvas = getPageCanvas(pageNo)
+    await preloadPage(pageNo, pageCanvas)
   }
   console.log("Finished preloading pages")
 }
@@ -89,13 +90,21 @@ function downloadCanvasAsImage(canvas, imageName, imageFormat) {
   )
 }
 
-function downloadPages(from, to, options = {}) {
+async function downloadPages(options = {}) {
+  revealAllPagePlaceholders()
+
   const imageFormat = imageFormatFor(options)
+  const { fromPage = 1, toPage = getPageCount() } = options
 
-  for (let pageNo = from; pageNo <= to; pageNo++) {
+  for (let pageNo = fromPage; pageNo <= toPage; pageNo++) {
     const pageCanvas = getPageCanvas(pageNo)
-    if (pageCanvas === null) break
+    if (!pageCanvas) break; // Exit early if page number is out of range
+
     const imageName = imageNameFor(pageNo, options)
-    downloadCanvasAsImage(pageCanvas, imageName, imageFormat)
+    await preloadPage(pageNo, pageCanvas).then(() => {
+      downloadCanvasAsImage(pageCanvas, imageName, imageFormat)
+      console.log("Downloaded page #" + pageNo)
+    })
   }
+  console.log("Finished downloading pages " + fromPage + "-" + toPage)
 }
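
The control flow introduced by the new `downloadPages` (option defaults taken by destructuring, then a strictly sequential preload-and-save loop) can be sketched outside the browser as follows. This is a minimal illustration, not the script itself: `downloadRange`, `TOTAL_PAGES` and the `save` callback are hypothetical names, and `getPageCount`, `getPageCanvas` and `preloadPage` are replaced here by simple stand-ins, since the real ones depend on the doc88.com page.

```javascript
// Hypothetical stand-ins for the script's DOM helpers, so the sketch runs in Node.
const TOTAL_PAGES = 5
const getPageCount = () => TOTAL_PAGES
const getPageCanvas = (pageNo) =>
  pageNo >= 1 && pageNo <= TOTAL_PAGES ? { pageNo } : null
// Simulates waiting for a page to render; the real function scrolls the canvas into view.
const preloadPage = (pageNo, pageCanvas) =>
  new Promise((resolve) => setTimeout(resolve, 10))

// Hypothetical counterpart of downloadPages: returns the page numbers it saved.
async function downloadRange(options = {}, save = () => {}) {
  // getPageCount() is only evaluated when toPage is not supplied.
  const { fromPage = 1, toPage = getPageCount() } = options
  const saved = []

  for (let pageNo = fromPage; pageNo <= toPage; pageNo++) {
    const pageCanvas = getPageCanvas(pageNo)
    if (!pageCanvas) break // Page number is out of range

    // Awaiting inside the loop keeps pages strictly sequential:
    // page N is saved before page N+1 starts preloading.
    await preloadPage(pageNo, pageCanvas)
    save(pageCanvas)
    saved.push(pageNo)
  }
  return saved
}

// Usage: resolves with the page numbers 2, 3 and 4.
downloadRange({ fromPage: 2, toPage: 4 }).then((pages) => console.log(pages))
```

Because each page is awaited before the next one starts, a `toPage` beyond the end of the document simply stops the loop early at the first missing canvas.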