llm-pdf-ocr-api is a Flask-based web service that performs Optical Character Recognition (OCR) on PDF files using machine vision and AI models. Built on PyTorch and Transformers and optimized for NVIDIA CUDA, the API provides two endpoints: one for OCR processing and one for listing available models. The API is packaged in a Docker container.
When a user submits a file to the /ocr endpoint, the following steps are executed:
- Receive the Request:
  - The server accepts a POST request containing the PDF file and optional parameters for OCR settings.
- Extract and Open the PDF:
  - The PDF file is extracted from the form data and opened to access its content.
- Configure OCR Parameters:
  - Parameters for the OCR process, such as the model and image processing settings, are set with defaults applied where not specified.
    - Optional parameters are read from the form data, such as model, threshold_value, kernel_width, kernel_height, and min_area.
    - Defaults are used for any parameters not provided.
- Process Each Page:
  - Each page of the PDF is processed sequentially. The steps include:
    - Rendering the page as an image.
    - Converting the image to grayscale and applying binary thresholding.
    - Performing morphological operations to enhance image clarity.
    - Extracting lines using contour detection and filtering by area.
- Extract Text:
  - Text is extracted from each line of the image using the TrOCR model. The text from all lines is compiled into a single output.
- Return the Response:
  - The extracted text is sent back in a JSON response.
- Handle Errors:
  - Errors during processing are caught and returned as a detailed error message.
- Python: The script runs in a Python 3 environment.
- Flask: Serves as the backbone of the web application, facilitating the creation of endpoints and handling HTTP requests.
- google-protobuf: Utilized for data serialization and deserialization, important for model loading and configuration.
- gunicorn: A Python WSGI HTTP server for UNIX, used to serve the Flask application.
- numpy: Supports high-performance operations on large multi-dimensional arrays and matrices, used extensively in image manipulation.
- OpenCV (opencv-python-headless): Used to segment larger bodies of text into individual lines.
- Pillow (PIL): A maintained fork of the Python Imaging Library, used for image processing tasks.
- PyMuPDF (fitz): Utilized for PDF parsing with Python bindings for the MuPDF library.
- sentencepiece: Helps with unsupervised text tokenization and detokenization.
- torch: Utilized for machine learning tasks in computer vision and natural language processing.
- transformers: Hugging Face library of pretrained models for PyTorch and TensorFlow; supplies the TrOCR model used for text recognition.
To install llm-pdf-ocr-api, follow these steps:
Begin by cloning the llm-pdf-ocr-api repository to your local machine.
git clone https://github.com/samestrin/llm-pdf-ocr-api/
Navigate to the project directory:
cd llm-pdf-ocr-api
Install the required dependencies using pip:
pip install -r src/requirements.txt
Endpoint: /ocr
Method: POST
Process a PDF file and return the extracted text.
- file: PDF file
- model (optional): Specifies the OCR model to be used for text extraction. Defaults to microsoft/trocr-base-printed if not provided.
- threshold_value (optional): Determines the threshold value for binary thresholding of images. The default value is 150.
- kernel_width (optional): Defines the width of the kernel used in morphological operations to clean up the image. It defaults to 20.
- kernel_height (optional): Specifies the height of the kernel used in morphological operations. The default is 1.
- min_area (optional): Sets the minimum area of contours that are considered as valid lines of text. The default minimum area is 50.
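A request to /ocr could be assembled as in the sketch below, which uses only the Python standard library. The base URL http://localhost:5000 and the helper names build_multipart and ocr_pdf are assumptions made for illustration. Because the exact fields of the JSON response are not documented here, the parsed response is returned as-is.

```python
import json
import uuid
import urllib.request


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one PDF file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    body = bytearray()
    for name, value in fields.items():
        body += f"--{boundary}\r\n".encode()
        body += f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
        body += f"{value}\r\n".encode()
    body += f"--{boundary}\r\n".encode()
    body += (f'Content-Disposition: form-data; name="{file_field}"; '
             f'filename="{filename}"\r\n').encode()
    body += b"Content-Type: application/pdf\r\n\r\n"
    body += file_bytes + b"\r\n"
    body += f"--{boundary}--\r\n".encode()
    return bytes(body), f"multipart/form-data; boundary={boundary}"


def ocr_pdf(pdf_path, base_url="http://localhost:5000", **options):
    """POST a PDF to /ocr and return the decoded JSON response.

    `base_url` is an assumed address; `options` may carry any of the optional
    form fields (model, threshold_value, kernel_width, kernel_height, min_area).
    """
    with open(pdf_path, "rb") as handle:
        body, content_type = build_multipart(
            options, "file", "document.pdf", handle.read()
        )
    request = urllib.request.Request(
        f"{base_url}/ocr",
        data=body,
        method="POST",
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, calling ocr_pdf("scan.pdf", threshold_value=150, min_area=50) would send a hypothetical scan.pdf with two optional settings; any parameters left out fall back to the server defaults listed above.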
Endpoint: /models
Method: GET
List all available AI models.
The API handles errors gracefully and returns appropriate error responses:
- 400 Bad Request: Invalid request parameters.
- 500 Internal Server Error: Unexpected server error.
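As an illustration of how the two status codes might come about, the sketch below applies the documented defaults to the optional form parameters and maps an unparsable value to a 400 response, with anything unexpected becoming a 500. It is a framework-free approximation: the names parse_ocr_params and to_response, the integer-only validation rule, and the {"error": ...} body shape are all assumptions, not the project's actual implementation.

```python
# Documented defaults for the optional /ocr form parameters.
DEFAULTS = {
    "model": "microsoft/trocr-base-printed",
    "threshold_value": 150,
    "kernel_width": 20,
    "kernel_height": 1,
    "min_area": 50,
}


def parse_ocr_params(form):
    """Merge submitted form values over the defaults, converting numeric fields."""
    params = dict(DEFAULTS)
    for name, default in DEFAULTS.items():
        raw = form.get(name)
        if raw is None or raw == "":
            continue  # parameter not provided: keep the default
        if isinstance(default, int):
            try:
                params[name] = int(raw)
            except (TypeError, ValueError):
                raise ValueError(f"{name} must be an integer, got {raw!r}") from None
        else:
            params[name] = raw
    return params


def to_response(form):
    """Return (status_code, json_body); the body shape is hypothetical."""
    try:
        return 200, {"params": parse_ocr_params(form)}
    except ValueError as exc:
        # Invalid request parameters
        return 400, {"error": str(exc)}
    except Exception as exc:
        # Anything unexpected
        return 500, {"error": str(exc)}


print(to_response({"kernel_width": "wide"}))
```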
Contributions to this project are welcome. Please fork the repository and submit a pull request with your changes or improvements.
This project is licensed under the MIT License - see the LICENSE file for details.