The OCR and Voice Recognition Module extracts and processes text from PDF documents, images, and audio files. It combines multiple OCR engines with voice recognition, and includes error correction using large language models (LLMs), math formula processing, and document structure identification. It is configurable, supports GPU acceleration, and targets applications ranging from document digitization to voice-controlled systems.
- Installation
- Usage
- Features
- Contributing
- License
- Acknowledgements
- FAQs
- Contact
- Roadmap
- Changelog
- Demo
- Python 3.8 or higher
- Git
- Virtual environment tool (e.g., venv or virtualenv)
- Tesseract OCR installed on your system
- Clone the Repository
    git clone https://github.com/PStarH/ocr-voice-recognition-module.git
    cd ocr-voice-recognition-module
- Create a Virtual Environment
    python3 -m venv venv
    source venv/bin/activate
- Install Dependencies
    pip install -r requirements.txt
- Install Tesseract OCR
  - Ubuntu
      sudo apt-get update
      sudo apt-get install tesseract-ocr
  - macOS
      brew install tesseract
  - Windows
    - Download the installer from Tesseract OCR and follow the installation instructions.
- Download Additional Models
  Ensure that the required models for EAST, CRAFT, and LLMs are downloaded and placed in the appropriate directories as specified in the configuration.
- Configure Environment Variables
  Create a .env file in the root directory with the following structure:
    USE_LOCAL_LLM=True
    API_PROVIDER=OLLAMA
    OLLAMA_API_URL=http://localhost:11434
    OLLAMA_MODEL_NAME=ggml-gpt4all-j-v1.3-groovy
    CLAUDE_MODEL_STRING=claude-3-haiku-20240307
    MATH_OCR_API_KEY=your_math_ocr_api_key
    MATH_OCR_ENDPOINT=your_math_ocr_endpoint
    LLM_ERROR_CORRECTION_MODEL=Llama-3.1-8B-Lexi-Uncensored_Q5_fixedrope.gguf
    LLM_LAYOUT_MODEL=Llama-3.1-8B-Lexi-Uncensored_Q5_fixedrope.gguf
    PREPROCESSING_ENABLED=True
    PROGRESS_TRACKING_ENABLED=True
    OCR_ENGINE=pytesseract
    PADDLEOCR_ENABLED=True
    PADDLEOCR_LANGUAGE=en
    PADDLEOCR_USE_GPU=False
    TEXT_DETECTION_MODEL=EAST
    TEXT_DETECTION_THRESHOLD=0.5
  Note: Replace placeholder values with your actual configuration details.
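The values in the .env file are plain strings, so flags such as USE_LOCAL_LLM and numbers such as TEXT_DETECTION_THRESHOLD need converting before use. The following is a minimal sketch of one way to do that, assuming the variables are already present in the process environment (for example, exported by a .env loader). The Settings class, the load_settings and parse_bool helpers, and the default values are hypothetical and are not taken from the module's code.

```python
import os
from dataclasses import dataclass

# Strings treated as "true"; an assumption, the module may accept other spellings.
TRUE_VALUES = {"1", "true", "yes", "on"}


def parse_bool(raw: str) -> bool:
    """Interpret a .env string such as 'True' or 'false' as a boolean."""
    return raw.strip().lower() in TRUE_VALUES


@dataclass
class Settings:
    """Hypothetical typed view over a few of the .env variables."""
    use_local_llm: bool
    ocr_engine: str
    paddleocr_use_gpu: bool
    text_detection_model: str
    text_detection_threshold: float


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ).

    The fallback defaults mirror the example .env above and are assumptions.
    """
    env = os.environ if env is None else env
    return Settings(
        use_local_llm=parse_bool(env.get("USE_LOCAL_LLM", "False")),
        ocr_engine=env.get("OCR_ENGINE", "pytesseract").strip().lower(),
        paddleocr_use_gpu=parse_bool(env.get("PADDLEOCR_USE_GPU", "False")),
        text_detection_model=env.get("TEXT_DETECTION_MODEL", "EAST"),
        text_detection_threshold=float(env.get("TEXT_DETECTION_THRESHOLD", "0.5")),
    )
```

Passing a plain dictionary instead of os.environ makes the helper easy to exercise without touching the real environment.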
python OCR.py
- Input PDF File: Specify the path to the PDF file you want to process by updating the input_pdf_file_path variable in the main function.
- Reformat as Markdown: Set reformat_as_markdown to True to convert the extracted text into Markdown format.
- Suppress Headers and Page Numbers: Set suppress_headers_and_page_numbers to True to remove headers and page numbers from the final output.
input_pdf_file_path = 'path/to/your/document.pdf'
max_test_pages = 0 # Set to 0 to process all pages
skip_first_n_pages = 0 # Set to skip initial pages if needed
reformat_as_markdown = True
suppress_headers_and_page_numbers = True
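The interaction between skip_first_n_pages and max_test_pages can be illustrated with a small helper. This is a sketch under stated assumptions: the select_pages function is hypothetical, page indices are zero-based, and the page limit is assumed to apply after the skipped pages. The source only states that 0 means all pages, so the actual behavior in OCR.py may differ.

```python
from typing import List


def select_pages(total_pages: int,
                 skip_first_n_pages: int = 0,
                 max_test_pages: int = 0) -> List[int]:
    """Return the zero-based page indices to process.

    max_test_pages == 0 means "process every remaining page".
    """
    # Clamp the starting index into the valid range.
    start = min(max(skip_first_n_pages, 0), total_pages)
    if max_test_pages <= 0:
        end = total_pages
    else:
        end = min(start + max_test_pages, total_pages)
    return list(range(start, end))
```

For a ten-page document, skipping two pages with a limit of three would select indices 2, 3, and 4.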
python Voice-Recognition.py
Configure the input audio file path and other settings in the main function as needed.
- PDF to Image Conversion: Converts PDF files to images for OCR processing.
- Multiple OCR Engines: Supports pytesseract, EasyOCR, and PaddleOCR as primary and backup OCR engines.
- Text Detection Models: Utilizes advanced text detection models like EAST and CRAFT for accurate region identification.
- Error Correction: Integrates with LLMs to correct OCR and voice recognition errors, enhancing text quality.
- Math Formula Processing: Detects and processes mathematical formulas using specialized OCR tools.
- Document Structure Identification: Analyzes and formats the extracted text into structured Markdown.
- Voice Recognition: Implements advanced voice recognition with multiple ASR engines and validation mechanisms.
- GPU Acceleration: Supports GPU usage for faster processing with compatible models.
- Asynchronous Processing: Implements asynchronous operations for efficient handling of large documents and audio files.
- Progress Tracking: Provides progress indicators during OCR and processing tasks.
- Language Support: Configurable to support multiple languages for OCR and voice recognition.
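The primary and backup engine arrangement mentioned in the list above can be sketched as a simple fallback loop. This is an illustration only: run_with_fallback and the OcrEngine type are hypothetical names, each engine is modeled as a plain callable that takes image bytes and returns text, and treating empty output as a failure is an assumption rather than the module's documented behavior.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical engine interface: image bytes in, recognized text out.
OcrEngine = Callable[[bytes], str]


def run_with_fallback(image: bytes,
                      engines: Sequence[Tuple[str, OcrEngine]]) -> Tuple[str, str]:
    """Try each (name, engine) pair in order; return the first non-empty result.

    Returns a (engine_name, text) tuple. Raises RuntimeError if every engine
    raises or returns only whitespace.
    """
    last_error: Optional[Exception] = None
    for name, engine in engines:
        try:
            text = engine(image)
        except Exception as exc:  # one engine failing should not stop the rest
            last_error = exc
            continue
        if text.strip():
            return name, text
    raise RuntimeError("all OCR engines failed or returned empty text") from last_error
```

In practice the callables would wrap the real pytesseract, EasyOCR, or PaddleOCR calls; keeping them behind one signature lets the order be changed from configuration.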
Contributions are welcome! Please follow these steps:
- Fork the Repository
- Create a Feature Branch
git checkout -b feature/YourFeature
- Commit Your Changes
git commit -m "Add your feature"
- Push to the Branch
git push origin feature/YourFeature
- Open a Pull Request
Please ensure that your code follows the project's coding standards and includes appropriate documentation.
Please read and follow our Code of Conduct to ensure a welcoming and respectful environment for all contributors.
This project is licensed under the GPL-3.0 License.
- Tesseract OCR
- EasyOCR
- PaddleOCR
- EAST Text Detector
- CRAFT Text Detector
- Transformers by Hugging Face
- librosa
- Pyannote.audio
- DeepSpeech
- Ollama
- PStarH
Update the OCR_ENGINE variable in your .env file to pytesseract, easyocr, or paddleocr based on your preference.
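A misspelled OCR_ENGINE value is easy to miss, so it can help to validate it at startup. The snippet below is a minimal sketch, assuming the variable is read from the process environment; the resolve_ocr_engine helper and the pytesseract default are hypothetical and not part of the module.

```python
import os

# The three values named in the answer above.
SUPPORTED_ENGINES = ("pytesseract", "easyocr", "paddleocr")


def resolve_ocr_engine(env=None) -> str:
    """Return the normalized OCR_ENGINE value or raise ValueError if unsupported."""
    env = os.environ if env is None else env
    # Defaulting to pytesseract is an assumption based on the example .env.
    name = env.get("OCR_ENGINE", "pytesseract").strip().lower()
    if name not in SUPPORTED_ENGINES:
        raise ValueError(
            f"Unsupported OCR_ENGINE {name!r}; expected one of {SUPPORTED_ENGINES}"
        )
    return name
```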
Yes, the module is fully functional on CPU. However, GPU acceleration is available and recommended for faster processing if your system supports it.
Ensure that the required language packs are installed for your chosen OCR engines and update the SUPPORTED_LANGUAGES configuration in the .env file.
Check the error logs for specific issues, ensure all prerequisites are met, and verify that all dependencies are correctly installed. Feel free to open an issue on the repository for further assistance.
Yes, the module includes a feedback mechanism. Refer to the collect_user_feedback function in the code for details on how to provide feedback.
For support or inquiries, please reach out via GitHub Issues.
- Voice Recognition Integration: Enhance voice recognition features for improved accessibility and additional input methods.
- Enhanced Error Handling: Expand error handling mechanisms to cover more edge cases and provide detailed logging.
- Support for Additional OCR Engines: Integrate more OCR engines to increase flexibility and accuracy.
- Web Interface: Develop a web-based interface for easier interaction and processing of documents.
- Real-Time Processing: Enable real-time OCR and voice recognition processing for live document feeds and audio streams.
- Multilingual Support: Expand OCR and voice recognition capabilities to support multiple languages beyond English.
- User Authentication: Add authentication mechanisms for secure access to OCR and voice recognition functionalities in shared environments.
- Cloud Deployment: Adapt the module for deployment on cloud platforms to leverage scalable resources.
- API Development: Create a RESTful API to allow other applications to interact with the OCR and Voice Recognition module programmatically.
- Performance Optimization: Continuously optimize the module for faster processing times and reduced resource consumption.
- Initial release with OCR and Voice Recognition capabilities.
- Supported OCR engines: pytesseract, EasyOCR, PaddleOCR.
- Integrated text detection models: EAST and CRAFT.
- Implemented error correction using LLMs.
- Added math formula processing.
- Configured GPU acceleration support.
- Enhanced error handling and logging mechanisms.
- Added support for additional languages.
- Improved performance optimizations for faster processing.
- Integrated new OCR engines and updated existing ones.
- Added real-time processing features.
- Expanded Contributing and Acknowledgements sections.