PyMuPDF
Pros
- Recent GitHub activity: commits, closed issues, and closed pull requests
- GitHub popularity: 2.6k stars and 325 forks
- Integrates the Google Tesseract engine for Optical Character Recognition (OCR)
- File conversion between PDF and other formats
- Wide range of support for working with text, images, drawings, shape objects, and forms in PDF files
- Multiprocessing
Cons
- Limited support for extracting data from tables in PDF files
Unknowns
- Batch processing of files
- Async operations

PyTesseract
Pros
- Recent GitHub activity: commits, closed issues, and closed pull requests
- GitHub popularity: 4.9k stars and 659 forks
- Wraps the Google Tesseract engine for Optical Character Recognition (OCR)
- Output file conversion between PDF and other formats
- Configurable OCR language
Cons
- ??
Unknowns
- Batch processing of files
- Multiprocessing
- Async operations

Textract
Pros
- Wide range of file support
- GitHub popularity: 3.5k stars and 528 forks
- Uses the Google Tesseract engine for Optical Character Recognition (OCR)
- Works with video, audio, and doc files
- Configurable OCR language
Cons
- Minimal recent GitHub activity: commits, closed issues, and closed pull requests
Unknowns
- Batch processing of files
- Multiprocessing
- Async operations

PdfMiner.Six
Pros
- Available as a command-line tool
- GitHub popularity: 4.6k stars and 834 forks
- Wide range of support for working with text, shape objects, and images in PDF files
- Output file generation between PDF and other formats
Cons
- Minimal recent GitHub activity: commits, closed issues, and closed pull requests
- Limited support for extracting data from tables in PDF files
Unknowns
- Batch processing of files
- Multiprocessing
- Async operations

PdfPlumber
Pros
- Recent GitHub activity: commits, closed issues, and closed pull requests
- GitHub popularity: 4k stars and 504 forks
- Wide range of support for working with text, lines, shape objects, images, tables, and forms in PDF files
- Visual debugging built on ImageMagick
- Conversion of a whole PDF file or a single page to an image
Cons
- Works with PDF files only
- Does not support Optical Character Recognition (OCR)
- Cannot generate a PDF file from another format
Unknowns
- Batch processing of files
- Multiprocessing
- Async operations

PyPdf
Pros
- Recent GitHub activity: commits, closed issues, and closed pull requests
- GitHub popularity: 5.8k stars and 1.2k forks
- Wide range of support for working with text and metadata in PDF files
Cons
- Works with PDF files only
- Limited support for extracting images, tables, and shape objects from PDF files
- Does not support Optical Character Recognition (OCR)
Unknowns
- Batch processing of files
- Multiprocessing
- Async operations
# Install Tesseract OCR engine
sudo apt install -y tesseract-ocr
sudo apt install -y libtesseract-dev
# Set up the virtual environment
python3 -m venv .venv
# Activate the virtual environment
. ./.venv/bin/activate
# Upgrade pip to the latest version
pip install --upgrade pip
# Install the python packages
pip install -r requirements.txt
# Running the app
python3 app.py
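
Before running the app, it can help to confirm that the Tesseract binary installed above is actually reachable. The snippet below is a minimal sketch, assuming the apt package exposes an executable named `tesseract` on the PATH. The helper name `tesseract_version` is hypothetical and not part of the app.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(binary_name: str = "tesseract") -> Optional[str]:
    """Return the first line printed by `<binary> --version`, or None if the binary is not on PATH."""
    binary = shutil.which(binary_name)
    if binary is None:
        return None
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases print the version to stderr, others to stdout.
    lines = (result.stdout + result.stderr).strip().splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("tesseract not found on PATH; OCR will not work")
    else:
        print(f"Found: {version}")
```

If this reports that the binary is missing, rerunning the `apt install` steps is probably the first thing to try.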
- Flask-Uploads does not work correctly and raises an import error from werkzeug
ImportError: cannot import name 'secure_filename' from 'werkzeug'
(newer Werkzeug releases only expose this function as werkzeug.utils.secure_filename)
Solution: Decided to check uploads by file type extension instead of trying another module like Flask-Reuploaded
- OCR conversion
Could not get PyMuPDF to use the Tesseract OCR engine, so it fell short on documents that required OCR
Solution: Imported pytesseract, which wraps the Tesseract engine, to handle OCR file processing
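
The two workarounds above could look roughly like the sketch below. It is not the app's actual code. The allow-list `ALLOWED_EXTENSIONS` and the helper names `allowed_file` and `ocr_pdf` are hypothetical. Rendering pages with PyMuPDF before handing them to pytesseract is also an assumption about how the two libraries are combined.

```python
from pathlib import Path

# Hypothetical allow-list; the notes above do not say which extensions the app accepts.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def allowed_file(filename: str) -> bool:
    """Accept a file only if its final extension (case-insensitive) is in the allow-list."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


def ocr_pdf(path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Render each PDF page to an image with PyMuPDF and run pytesseract on it."""
    # Imported lazily so the extension check works without the OCR stack installed.
    import io

    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            pages.append(pytesseract.image_to_string(image, lang=lang))
    return "\n".join(pages)
```

With this allow-list, `allowed_file("report.PDF")` is accepted while `allowed_file("notes.txt")` and `allowed_file("archive.pdf.exe")` are rejected. An extension check only looks at the file name, so it does not confirm what the file actually contains.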