Docling PDF Processor w/ Streamlit

A simple UI wrapper around Docling for document processing. I built this to make document analysis more accessible and thought others might find it useful.

Inspired by Docling and its integration with LlamaIndex.

What This Does

  • Processes PDFs using Docling's document analysis
  • Extracts text and tables, and performs OCR
  • Presents results in a clean Streamlit interface
  • Handles multi-page documents and complex tables
  • Makes document processing accessible to non-technical users

Setup

git clone https://github.com/lesteroliver911/docling-pdf-processor.git
cd docling-pdf-processor
pip install -r requirements.txt
streamlit run app.py

How It Works

The app combines three libraries:

  • Docling: parses PDFs, including layout analysis, table structure, and OCR
  • LlamaIndex: structures and indexes the extracted document data
  • Streamlit: provides the web interface

Key functions:

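# Setting up the document processor
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter

def initialize_converter():
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True              # run OCR on the document
    pipeline_options.do_table_structure = True  # recover table rows and columns
    return DocumentConverter(...)  # arguments elided here; see app.py

# Processing PDFs
def process_pdf(uploaded_file, doc_converter):
    # Handles conversion and extraction
    # Returns markdown and multimodal content
    ...  # body elided here; see app.py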
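
The two functions above are abbreviated, so here is a minimal sketch of how the same flow could look end to end. It is not the code from app.py: the names `save_upload_to_temp`, `build_converter`, and `pdf_to_markdown` are hypothetical, the uploaded file is assumed to expose `getvalue()` (as Streamlit's uploaded files do), and the sketch returns only Markdown, not the multimodal content.

```python
import tempfile
from pathlib import Path


def save_upload_to_temp(uploaded_file) -> Path:
    """Write the uploaded bytes to a temporary .pdf file and return its path.

    Assumption: `uploaded_file` has a getvalue() method returning bytes.
    """
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(uploaded_file.getvalue())
    return Path(tmp.name)


def build_converter():
    """Build a Docling converter with OCR and table structure enabled."""
    # Imported lazily so the helpers in this file load without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = True
    options.do_table_structure = True
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


def pdf_to_markdown(uploaded_file, doc_converter) -> str:
    """Convert an uploaded PDF to Markdown and remove the temp file afterwards."""
    pdf_path = save_upload_to_temp(uploaded_file)
    try:
        result = doc_converter.convert(pdf_path)
        return result.document.export_to_markdown()
    finally:
        pdf_path.unlink(missing_ok=True)
```

In a Streamlit script this would be called roughly as `pdf_to_markdown(st.file_uploader("PDF", type="pdf"), build_converter())` once a file has been uploaded. Writing to a temporary file keeps the conversion step independent of how the bytes arrived.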

Configuration

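Two settings can be adjusted in the code:

  • OMP_NUM_THREADS: number of CPU threads used for processing (default: 4)
  • IMAGE_RESOLUTION_SCALE: scale factor for extracted images; higher values give sharper images but use more memory (default: 2.0)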
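
As an illustration of how these two settings could be handled, the sketch below reads them with the defaults listed above and validates them. Reading `IMAGE_RESOLUTION_SCALE` from the environment is an assumption made for this example (the app sets it in code), and `load_settings` and `apply_thread_limit` are hypothetical helper names.

```python
import os

# Defaults as listed in the Configuration section.
DEFAULT_OMP_NUM_THREADS = 4
DEFAULT_IMAGE_RESOLUTION_SCALE = 2.0


def load_settings(env=None):
    """Return (threads, image_scale), falling back to the defaults.

    `env` is any mapping of strings; it defaults to os.environ.
    Assumption: IMAGE_RESOLUTION_SCALE may be supplied as an environment variable.
    """
    env = os.environ if env is None else env
    threads = int(env.get("OMP_NUM_THREADS", DEFAULT_OMP_NUM_THREADS))
    scale = float(env.get("IMAGE_RESOLUTION_SCALE", DEFAULT_IMAGE_RESOLUTION_SCALE))
    if threads < 1:
        raise ValueError("OMP_NUM_THREADS must be at least 1")
    if scale <= 0:
        raise ValueError("IMAGE_RESOLUTION_SCALE must be positive")
    return threads, scale


def apply_thread_limit(threads):
    """Export the thread count so OpenMP-backed libraries can pick it up."""
    os.environ["OMP_NUM_THREADS"] = str(threads)
```

OpenMP-backed libraries typically read `OMP_NUM_THREADS` when they are loaded, so `apply_thread_limit` is best called before importing Docling or its dependencies.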

Requirements

docling
llama-index
streamlit
pandas
python-dotenv
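
If the app fails to start, a quick check like the one below can show which of the packages above are not installed. This is a small sketch using only the standard library; `missing_packages` is a hypothetical helper and is not part of the repository.

```python
from importlib import metadata

# Distribution names as listed in the Requirements section.
REQUIRED = ["docling", "llama-index", "streamlit", "pandas", "python-dotenv"]


def missing_packages(names=REQUIRED):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All requirements are installed.")
```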

Using the App

  1. Upload a PDF
  2. Check out the three tabs:
    • AI Preview: Quick look at the content
    • Extracted Content: Full text and structure
    • Document Analysis: Page-by-page breakdown
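
For readers who want to build a similar layout, the three tabs can be sketched with Streamlit's `st.tabs`. This is an assumption about the structure, not the code from app.py: `render_results` and `PREVIEW_CHARS` are hypothetical, the preview here is a plain truncation of the Markdown, and `pages` is assumed to be a list with one text string per page.

```python
PREVIEW_CHARS = 1000  # hypothetical preview length, not taken from the app


def render_results(st, markdown_text, pages):
    """Lay out conversion results in three tabs.

    `st` is the streamlit module (passed in so the function is easy to test).
    `pages` is a list of strings, one per page of the document.
    """
    preview_tab, content_tab, analysis_tab = st.tabs(
        ["AI Preview", "Extracted Content", "Document Analysis"]
    )
    with preview_tab:
        # Quick look: only the first PREVIEW_CHARS characters.
        st.markdown(markdown_text[:PREVIEW_CHARS])
    with content_tab:
        # Full extracted text and structure.
        st.markdown(markdown_text)
    with analysis_tab:
        # Page-by-page breakdown: a heading and a word count per page.
        for number, text in enumerate(pages, start=1):
            st.subheader(f"Page {number}")
            st.write(f"Words: {len(text.split())}")
```

In the app script this would be called as `render_results(st, markdown_text, pages)` after `import streamlit as st`.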

Notes

  • Works best with clearly formatted PDFs
  • Table extraction might need tweaking for complex layouts
  • OCR can be slow on large documents
  • Docling offers more than this app exposes; check its documentation for other features
  • The LlamaIndex integration handles document structuring; see the Docling reader page in the LlamaIndex docs

Feel free to use this code, modify it, or suggest improvements. You can find me on LinkedIn if you want to discuss Python, AI, or document processing.