-
Notifications
You must be signed in to change notification settings - Fork 3
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
2 changed files
with
104 additions
and
5 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Colab Edition\n", | ||
"\n", | ||
"<div align=\"center\">\n", | ||
"\n", | ||
"[![logo](https://ipitio.github.io/ocr-pdf/public/wide.webp)](https://github.com/ipitio/ocr-pdf)\n", | ||
"\n", | ||
"<h1><a href=\"https://github.com/ipitio/ocr-pdf\" target=\"_blank\" rel=\"noopener noreferrer\">\n", | ||
" ocr2pdf\n", | ||
"</a></h1>\n", | ||
"\n", | ||
"**Convert images and scans to searchable PDFs!**\n", | ||
"\n", | ||
"---\n", | ||
"\n", | ||
"[![downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fipitio.github.io%2Fbackage%2Fipitio%2Focr-pdf%2Focr-pdf.json&query=%24.downloads&logo=github&logoColor=959da5&labelColor=333a41&label=pulls)](https://github.com/arevindh/pihole-speedtest/pkgs/container/pihole-speedtest) [![build](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml/badge.svg)](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml)\n", | ||
"\n", | ||
"</div>\n", | ||
"\n", | ||
"This notebook is meant to be run on [Colab](https://colab.research.google.com/github/ipitio/ocr-pdf/blob/master/colab.ipynb). It will convert your files and can optionally save them to [Drive](https://drive.google.com/drive/my-drive) `/ocr-pdf`. Open the link above for more information.\n", | ||
"\n", | ||
"## Steps\n", | ||
"\n", | ||
"1. Make two new folders, one inside the other\n", | ||
" - The outer one can be named anything, say `pdf`\n", | ||
" - The inner one must be named `todo`\n", | ||
"2. Place your files in the `todo` folder\n", | ||
" - Those by themselves will just be converted\n", | ||
" - Those inside subfolders will also be merged in alphabetical order\n", | ||
"3. Share the outer `pdf` folder with this notebook\n", | ||
" - Zip the folder\n", | ||
" - Open this notebook in [Colab](https://colab.research.google.com/github/ipitio/ocr-pdf/blob/master/colab.ipynb)\n", | ||
" - Run the cell below to be prompted to connect Drive and upload the zip\n", | ||
"\n", | ||
"You'll be offered a zip of the converted (and merged) files to download locally, whether or not Drive was connected\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"\n", | ||
"# Connect to Drive\n", | ||
"try:\n", | ||
" from google.colab import files, drive\n", | ||
" drive.mount(\"/content/drive\", force_remount=True)\n", | ||
" drive = True\n", | ||
"except:\n", | ||
" drive = False\n", | ||
"\n", | ||
"# Extract your PDFs\n", | ||
"files.upload()\n", | ||
"\n", | ||
"# Get the name of the zip file\n", | ||
"pdfs = [pdf for pdf in os.listdir() if pdf.endswith(\".zip\")]\n", | ||
"if len(pdfs) == 0:\n", | ||
" raise Exception(\"No ZIP file found\")\n", | ||
"\n", | ||
"# Transform them\n", | ||
"%pip install udocker\n", | ||
"!udocker --allow-root install\n", | ||
"\n", | ||
"for pdf in pdfs:\n", | ||
" !unzip -o \"$pdf\"\n", | ||
" !rm -f \"$pdf\"\n", | ||
" !udocker --allow-root run -v /content/\"$pdf\":/app/pdf ghcr.io/ipitio/ocr-pdf bash predict.sh pdf\n", | ||
" converted = os.listdir(\"$pdf/done\")\n", | ||
"\n", | ||
" # And load\n", | ||
" if drive and len(converted) > 0:\n", | ||
" ![ -d \"drive/MyDrive/ocr-pdf\" ] || mkdir \"drive/MyDrive/ocr-pdf\"\n", | ||
" !\\cp -r \"$pdf/done/\"* \"drive/MyDrive/ocr-pdf/\"\n", | ||
"\n", | ||
" if len(converted) == 1 and os.path.isfile(\"$pdf/done/\" + converted[0]):\n", | ||
" files.download(\"$pdf/done/\" + converted[0])\n", | ||
" elif len(converted) > 0:\n", | ||
" !zip -r \"$pdf.zip\" \"$pdf/done\"\n", | ||
" files.download(\"$pdf.zip\")\n", | ||
" else:\n", | ||
" print(\"No PDFs found\")" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"language_info": { | ||
"name": "python" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |