A clean, modular tool for converting PDFs to text and creating RAG-ready chunks with embeddings. Perfect for building knowledge bases from document collections.
- PDF to Text Conversion: Support for both standard and OCR-based extraction
- Text Chunking: Intelligent chunking with configurable size and overlap
- Vector Generation: Automatic dense embeddings via Azure OpenAI and BM25 sparse vector generation
- Multi-language Support: Built-in translation capabilities (Arabic to English)
- Batch Processing: Process entire directories with progress tracking
- Metadata Extraction: Rich metadata for better search and retrieval
- Clean Architecture: Modular design for easy customization and extension
git clone https://github.com/aAceAtDev/RAG-chunker.git
cd RAG-chunker/pdf2text2chunks
pip install -r requirements.txt
pip install -e .
# For Azure services
pip install -e ".[azure]"
# For Google Cloud services
pip install -e ".[google]"
# For OCR capabilities
pip install -e ".[ocr]"
# For development
pip install -e ".[dev]"
Create a configuration file or set environment variables:
# Create sample config
python -c "from src.config import Config; Config().create_sample_config()"
Edit `config.sample.json` with your credentials:
{
"azure": {
"api_key": "your-azure-openai-api-key",
"endpoint": "https://your-resource.openai.azure.com/",
"api_version": "2024-05-01-preview",
"embedding_model": "text-embedding-3-small"
},
"google": {
"credentials_path": "path/to/your/google-credentials.json",
"project_id": "your-google-project-id",
"bucket_name": "your-bucket-name",
"location": "us-central1"
},
"vectors": {
"enable_embeddings": true,
"embedding_retries": 3,
"embedding_delay": 0.06,
"enable_bm25": true,
"bm25_max_dim": 1000,
"bm25_k1": 1.2,
"bm25_b": 0.75
},
"processing": {
"chunk_size": 512,
"chunk_overlap": 50,
"max_workers": 4,
"default_language": "en",
"supported_languages": [
"en",
"ar"
]
}
}
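To illustrate what `chunk_size` and `chunk_overlap` control, here is a minimal word-based sketch; the tool's actual splitter may count tokens and respect sentence boundaries:

# Minimal illustration of size/overlap chunking (not the tool's exact algorithm).
def split_words(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = split_words("some long document text " * 250)
print(len(chunks), "chunks; consecutive chunks share 50 words")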
# Convert all PDFs in a directory
python main.py convert --input pdfs/ --output text/
# Create chunks with both embeddings and BM25 (default)
python main.py chunk --input text/ --output chunks/ --azure-key YOUR_API_KEY
# Create chunks with embeddings only
python main.py chunk --input text/ --output chunks/ --azure-key YOUR_API_KEY --embeddings-only
# Create chunks with BM25 only
python main.py chunk --input text/ --output chunks/ --bm25-only
# Custom chunk settings
python main.py chunk --input text/ --output chunks/ \
--azure-key YOUR_API_KEY \
--chunk-size 1024 \
--chunk-overlap 100
# Translate text files from Arabic to English using Azure
python main.py translate --input text_ar/ --output text_en/ --translator azure --key YOUR_API_KEY
# Run complete pipeline: PDF → Text → Chunks
python main.py pipeline --input pdfs/ --azure-key YOUR_API_KEY
# With translation
python main.py pipeline --input pdfs/ --azure-key YOUR_API_KEY --translate
from src.pdf_converter import PDFConverter
converter = PDFConverter(input_dir="pdfs", output_dir="text")
results = converter.process_directory()
print(f"Converted {results['successful']} files")
from src.text_chunker import TextChunker
chunker = TextChunker(
input_dir="text",
output_dir="chunks",
azure_api_key="your-key",
chunk_size=512,
chunk_overlap=50,
enable_embeddings=True,
enable_bm25=True,
bm25_max_dim=1000
)
results = chunker.process_directory()
print(f"Created {results['total_chunks']} chunks")
from src.translator import Translator
translator = Translator(
service="azure",
api_key="your-key",
source_lang="ar",
target_lang="en",
input_dir="text_ar",
output_dir="text_en"
)
results = translator.process_directory()
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export CHUNK_SIZE=512
export CHUNK_OVERLAP=50
export MAX_WORKERS=4
export ENABLE_EMBEDDINGS=true
export ENABLE_BM25=true
export BM25_MAX_DIM=1000
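The real `Config` class may resolve settings differently; as a rough, hypothetical sketch of file-first, environment-variable-fallback loading using the variable names above:

import json
import os

# Hypothetical sketch: read from the JSON config first, fall back to the environment.
def load_setting(config, section, key, env_var, default=None):
    value = config.get(section, {}).get(key)
    if value is None:
        value = os.environ.get(env_var, default)
    return value

with open("config.json") as f:
    cfg = json.load(f)

chunk_size = int(load_setting(cfg, "processing", "chunk_size", "CHUNK_SIZE", 512))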
The tool looks for `config.json` in the current directory. You can also specify a custom config file:
from src.config import Config
config = Config("my-config.json")
Standard UTF-8 text files with cleaned and normalized content.
JSON files containing structured chunk data:
{
"id": "document_chunk_0001",
"content": "chunk text content...",
"embedding": [0.1, 0.2, ...],
"metadata": {
"file": {
"filename": "document.pdf",
"title": "Document Title",
"author": "Author Name"
},
"chunk": {
"chunk_index": 0,
"total_chunks": 25,
"word_count": 487,
"character_count": 2451
},
"content": {
"categories": ["..."],
"entities": ["..."],
"key_phrases": ["..."]
}
}
}
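For example, to read a chunk back for indexing (assuming one JSON object per file, as shown above; the path is hypothetical):

import json

# Load a chunk produced by the chunker and inspect its fields.
with open("chunks/document_chunk_0001.json") as f:  # hypothetical path
    chunk = json.load(f)

print(chunk["id"])
print(chunk["metadata"]["chunk"]["word_count"])
print(len(chunk["embedding"]), "embedding dimensions")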
from typing import Any, Dict, List

from src.text_chunker import TextChunker

class CustomChunker(TextChunker):
    def create_chunks(self, text: str, filename: str) -> List[Dict[str, Any]]:
        # Your custom chunking logic goes here.
        # Remember to call self._get_embeddings and self._generate_bm25_vectors
        # and include the results in the returned chunk data if enabled.
        custom_chunks_data: List[Dict[str, Any]] = []
        # ... your logic ...
        return custom_chunks_data
chunker = CustomChunker(...)
def custom_metadata_extractor(text: str) -> Dict[str, Any]:
    # Extract domain-specific metadata from the chunk text.
    metadata: Dict[str, Any] = {}  # populate with your own fields
    return metadata

chunker.extract_metadata_from_text = custom_metadata_extractor
from src.ocr_converter import OCRConverter
ocr_converter = OCRConverter(
output_dir="text",
bucket_name="your-gcs-bucket", # For Google Vision API
credentials_path="path/to/credentials.json"
)
results = ocr_converter.process_bucket()
- Standard PDFs: ~10-50 pages/minute
- OCR PDFs: ~5-15 pages/minute
- Chunking: ~1000 chunks/minute
- Embedding: ~500 chunks/minute (depends on API limits)
- Standard processing: ~100-500MB
- Large PDFs: Up to 2GB (automatic cleanup implemented)
- Batch processing: Scales with worker count
1. spaCy model not found
python -m spacy download en_core_web_sm
2. Azure API errors
- Check your API key and endpoint
- Verify rate limits
- Ensure model deployment name is correct
3. Memory issues with large PDFs
- Reduce `max_workers`
- Process files individually
- Increase system memory
4. OCR dependencies
# For Tesseract
sudo apt-get install tesseract-ocr
# For OpenCV
pip install opencv-python
python main.py convert --input pdfs/ --output text/ --verbose
from src.utils import print_system_summary
print_system_summary()
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
git clone https://github.com/yourusername/pdf-to-rag-converter.git
cd pdf-to-rag-converter
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black .
# Type checking
mypy src/
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with PyPDF2, OpenAI, spaCy, and other fantastic open-source libraries
- Inspired by the need for better document processing in RAG systems
- Special thanks to the Arabic NLP community for guidance on multilingual processing
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
Made with ❤️ for the RAG community