docs: enhance project documentation for improved clarity and usability #208

Open · wants to merge 5 commits into `main`
186 changes: 186 additions & 0 deletions ARCHITECTURE.md
@@ -0,0 +1,186 @@
# MegaParse Architecture

This document provides a comprehensive overview of the MegaParse system architecture, including component relationships, data flow, and core implementation details.

## System Components

### 1. Core Parser Library (megaparse)

The core library provides the fundamental parsing capabilities:

```
libs/megaparse/
├── src/megaparse/
│   ├── parser/                    # Parser implementations
│   │   ├── base.py                # Abstract base parser
│   │   ├── unstructured_parser.py
│   │   ├── megaparse_vision.py
│   │   ├── llama.py
│   │   └── doctr_parser.py
│   ├── api/                       # FastAPI application
│   │   └── app.py                 # API endpoints
│   └── checker/                   # Format utilities
```

### 2. Client SDK (megaparse_sdk)

The SDK provides a high-level interface for API interaction:

```
libs/megaparse_sdk/
├── src/megaparse_sdk/
│   ├── client/    # API client implementation
│   └── schema/    # Data models and configurations
```
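
For illustration, a call through the SDK might look roughly like the sketch below. The client class name, constructor arguments, and `file.upload` method are assumptions made for this sketch, not the SDK's documented API.

```python
# Hypothetical sketch of SDK usage; the client class and method names below
# are assumptions, not the SDK's confirmed public interface.
import asyncio

from megaparse_sdk import MegaParseSDK  # assumed entry point


async def main() -> None:
    client = MegaParseSDK(api_key="your-api-key")  # assumed constructor
    result = await client.file.upload(file_path="./document.pdf")  # assumed method
    print(result)


asyncio.run(main())
```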

### 3. FastAPI Interface

The API layer exposes parsing capabilities as HTTP endpoints:

- `/v1/file`: File upload and parsing
- `/v1/url`: URL content extraction and parsing
- `/healthz`: Health check endpoint
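
As an illustration, a direct HTTP upload to the file endpoint could look like the following sketch using `requests`; the multipart field name and the shape of the JSON response are assumptions, not confirmed from the API source.

```python
# Sketch of a direct API call; the "file" field name and the JSON response
# structure are assumptions, not confirmed from the API source.
import requests

with open("./document.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/file",
        files={"file": ("document.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
print(resp.json())
```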

## Data Flow

1. **Document Input**
```
Client → SDK → API → Parser Library
```
- Client submits document through SDK
- SDK validates and sends to API
- API routes to appropriate parser
- Parser processes and returns results

2. **Parser Selection**
```
Input → Strategy Selection → Parser Assignment → Processing
```
- Input type determines available strategies
- Strategy influences parser selection
- Parser processes according to strategy

## Core Classes and Flow

### MegaParse Class

The central orchestrator managing the parsing workflow:

```python
class MegaParse:
    def __init__(self, parser: BaseParser):
        self.parser = parser

    def load(self, file_path: str, strategy: StrategyEnum = StrategyEnum.AUTO) -> str:
        # 1. Validate input
        # 2. Select strategy
        # 3. Process document
        # 4. Format output
        ...
```

### Parser Hierarchy

```
BaseParser (Abstract)
├── UnstructuredParser
│   └── Basic document parsing
├── MegaParseVision
│   └── AI-powered parsing (GPT-4V)
├── LlamaParser
│   └── Enhanced PDF parsing
└── DoctrParser
    └── OCR-based parsing
```

### Strategy Selection

The `StrategyEnum` determines parsing behavior:

- `AUTO`: Automatic strategy selection based on input
- `FAST`: Optimized for speed (simple documents)
- `HI_RES`: Maximum accuracy (complex documents)
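
For example, a caller can pin a strategy explicitly (the `StrategyEnum` import path below mirrors the one used in the README examples):

```python
from megaparse import MegaParse
from megaparse.parser.strategy import StrategyEnum  # import path as used in the README examples
from megaparse.parser.unstructured_parser import UnstructuredParser

megaparse = MegaParse(UnstructuredParser())

# Force the high-accuracy path for a complex, image-heavy document
response = megaparse.load("./scanned_report.pdf", strategy=StrategyEnum.HI_RES)
```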

## Implementation Details

### Parser Selection Logic

1. **Input Analysis**
- File type detection
- Content complexity assessment
- Available parser evaluation

2. **Strategy Application**
- AUTO: Selects optimal parser
- FAST: Prioritizes UnstructuredParser
- HI_RES: Prefers MegaParseVision/LlamaParser
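
A simplified sketch of the dispatch this implies is shown below; it is illustrative only and does not reproduce the library's actual selection code.

```python
# Illustrative only — not MegaParse's actual selection logic.
from megaparse.parser.base import BaseParser        # path per the component tree above
from megaparse.parser.strategy import StrategyEnum  # import path assumed from the README examples


def pick_parser(strategy: StrategyEnum, parsers: dict[str, BaseParser]) -> BaseParser:
    if strategy == StrategyEnum.FAST:
        # Speed first: plain structural parsing
        return parsers["unstructured"]
    if strategy == StrategyEnum.HI_RES:
        # Accuracy first: prefer vision/LLM-backed parsers when configured
        return parsers.get("vision") or parsers.get("llama") or parsers["unstructured"]
    # AUTO: fall back to a heuristic on the input (file type, complexity, ...)
    return parsers["unstructured"]
```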

### Error Handling

The system implements multiple error handling layers:

1. **SDK Level**
- Input validation
- Connection error handling
- Rate limiting management

2. **API Level**
- Request validation
- Authentication
- Resource management

3. **Parser Level**
- Format-specific error handling
- Processing error recovery
- Output validation
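
At the application level, this typically reduces to wrapping a load call; the sketch below uses broad exception handling because the specific exception types raised by MegaParse are not enumerated in this document.

```python
# Minimal sketch: the exact exception classes raised by MegaParse are not
# listed here, so a broad handler is shown.
from megaparse import MegaParse
from megaparse.parser.unstructured_parser import UnstructuredParser

megaparse = MegaParse(UnstructuredParser())
try:
    text = megaparse.load("./document.pdf")
except FileNotFoundError:
    print("Input file not found")
except Exception as exc:  # parser/API-level failures surface here
    print(f"Parsing failed: {exc}")
```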

## Deployment Architecture

### Docker Support

Two deployment options:

1. **Standard Image**
```bash
# Basic parsing capabilities
docker compose up
```

2. **GPU-Enabled Image**
```bash
# Enhanced processing with GPU support
docker compose -f docker-compose.gpu.yml up
```

### API Server

- FastAPI application
- Uvicorn ASGI server
- Interactive documentation at `/docs`
- Health monitoring at `/healthz`
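
For reference, the application can also be launched programmatically with Uvicorn, assuming the `megaparse.api.app:app` module path used elsewhere in this repository is importable in your environment:

```python
# Launch the FastAPI app with Uvicorn; the module path follows the
# repository layout described above.
import uvicorn

if __name__ == "__main__":
    uvicorn.run("megaparse.api.app:app", host="0.0.0.0", port=8000)
```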

## Extension Points

### Custom Parser Implementation

Extend `BaseParser` for custom parsing logic:

```python
from megaparse.parser.base import BaseParser        # abstract base, per the component tree above
from megaparse.parser.strategy import StrategyEnum  # import path as used in the README examples


class CustomParser(BaseParser):
    def convert(self, file_path: str, strategy: StrategyEnum) -> str:
        # Custom synchronous implementation
        ...

    async def aconvert(self, file_path: str, strategy: StrategyEnum) -> str:
        # Custom asynchronous implementation
        ...
```

### Strategy Customization

Custom strategies can be added by defining new strategy values. Note that a Python `Enum` with members cannot be subclassed, so in practice this means adding the value to `StrategyEnum` itself or defining a separate enum:

```python
from enum import Enum


class CustomStrategy(str, Enum):
    # Defined as its own enum because Enum classes with members cannot be
    # subclassed in Python; the behavior for this value is implemented in
    # the parser that consumes it.
    CUSTOM = "custom"
```
129 changes: 97 additions & 32 deletions README.md
@@ -6,6 +6,46 @@

MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. The focus is on having no information loss during parsing.

## Quick Start Guide 🚀

1. **Prerequisites**
- Python >= 3.11
- Poppler (for PDF support)
- Tesseract (for OCR support)
- libmagic (for file type detection)

2. **Installation**
```bash
# Install system dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y poppler-utils tesseract-ocr libmagic1

# Install system dependencies (macOS)
brew install poppler tesseract libmagic

# Install MegaParse
pip install megaparse
```

3. **Environment Setup**
```bash
# Create a .env file with your API keys
OPENAI_API_KEY=your_openai_key # Required for MegaParseVision
LLAMA_CLOUD_API_KEY=your_llama_key # Optional, for LlamaParser
```
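
Once the keys are in place they can be read from the environment. The snippet below is only an explicit illustration using `python-dotenv`; MegaParse may load the `.env` file on its own, so this step is not necessarily required.

```python
# Optional illustration: load the .env file explicitly with python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
openai_key = os.getenv("OPENAI_API_KEY")
llama_key = os.getenv("LLAMA_CLOUD_API_KEY")
```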

## Project Architecture 🏗️

MegaParse is organized into two main components:

- **megaparse**: Core parsing library with multiple parsing strategies
  - UnstructuredParser: Basic document parsing
  - MegaParseVision: Advanced parsing with GPT-4V
  - LlamaParser: Enhanced PDF parsing using LlamaIndex
  - DoctrParser: OCR-based parsing

- **megaparse_sdk**: Client SDK for interacting with the MegaParse API

## Key Features 🎯

- **Versatile Parser**: MegaParse is a powerful and versatile parser that can handle various types of documents with ease.
@@ -23,62 +63,87 @@

https://github.com/QuivrHQ/MegaParse/assets/19614572/1b4cdb73-8dc2-44ef-b8b4-a7509bc8d4f3

## Usage Examples 💡

### Basic Usage with UnstructuredParser
The UnstructuredParser is the default parser that works with most document types without requiring additional API keys:

```python
from megaparse import MegaParse
from megaparse.parser.unstructured_parser import UnstructuredParser

# Initialize the parser
parser = UnstructuredParser()
megaparse = MegaParse(parser)

# Parse a document
response = megaparse.load("./document.pdf")
print(response)

# Save the parsed content as markdown
megaparse.save("./output.md")
```

### Advanced Usage with MegaParseVision
MegaParseVision uses advanced AI models for improved parsing accuracy:

```python
import os

from langchain_openai import ChatOpenAI
from megaparse import MegaParse
from megaparse.parser.megaparse_vision import MegaParseVision

# Initialize with a multimodal model (e.g. GPT-4o)
model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
parser = MegaParseVision(model=model)
megaparse = MegaParse(parser)

# Parse with advanced features
response = megaparse.load("./complex_document.pdf")
print(response)
megaparse.save("./output.md")
```

**Supported Models**: MegaParseVision works with multimodal models:
- OpenAI: GPT-4o, GPT-4V
- Anthropic: Claude 3 Opus, Claude 3 Sonnet, Claude 3.5 Sonnet (see the sketch below)
- Custom models (implement the `BaseModel` interface)
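
For example, to pair MegaParseVision with an Anthropic model instead of OpenAI (assuming the `langchain_anthropic` package is installed; this pairing is shown as an illustration rather than a tested configuration):

```python
import os

from langchain_anthropic import ChatAnthropic
from megaparse import MegaParse
from megaparse.parser.megaparse_vision import MegaParseVision

# Illustration only: swap in an Anthropic multimodal model.
model = ChatAnthropic(model="claude-3-opus-20240229", api_key=os.getenv("ANTHROPIC_API_KEY"))
parser = MegaParseVision(model=model)
megaparse = MegaParse(parser)
```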

### Parsing Strategies
MegaParse supports different parsing strategies to balance speed and accuracy:

- **AUTO**: Automatically selects the best strategy based on document type
- **FAST**: Optimized for speed, best for simple documents
- **HI_RES**: Maximum accuracy, recommended for complex documents

```python
from megaparse.parser.strategy import StrategyEnum

# Use high-resolution parsing
response = megaparse.load("./complex_document.pdf", strategy=StrategyEnum.HI_RES)
```

## Running the API Server 🌐

### Using Docker (Recommended)
```bash
# Build and start the API server
docker compose build
docker compose up

# For GPU support
docker compose -f docker-compose.gpu.yml up
```

### Manual Setup
```bash
# Install dependencies using UV (recommended)
UV_INDEX_STRATEGY=unsafe-first-match uv pip sync

# Start the API server
uv run uvicorn megaparse.api.app:app
```

Alternatively, run `make dev` at the root of the project to start the development server via the provided Makefile.

The API will be available at http://localhost:8000, with interactive documentation at http://localhost:8000/docs.

## Benchmark
