This branch contains an implementation of OCR using only Tesseract and image manipulation libraries such as OpenCV.
To run the code, Tesseract must be installed and added to the PATH environment variable. See https://github.com/tesseract-ocr/tessdoc/blob/main/Downloads.md for download instructions.
Then move the .traineddata file into the tessdata folder of your Tesseract installation. The file is taken from https://github.com/DoubangoTelecom/tesseractMRZ and is used according to the following license.
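The two setup steps above (Tesseract on PATH, traineddata file in tessdata) can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and because the tessdata location differs between platforms and installations, it is passed in explicitly rather than guessed. The sketch copies the file instead of moving it so that the original stays in place.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable found on PATH, or None."""
    return shutil.which("tesseract")


def install_traineddata(traineddata: Path, tessdata_dir: Path) -> Path:
    """Copy a .traineddata file into the given tessdata folder and return the new path."""
    if traineddata.suffix != ".traineddata":
        raise ValueError(f"not a .traineddata file: {traineddata}")
    if not tessdata_dir.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {tessdata_dir}")
    target = tessdata_dir / traineddata.name
    shutil.copy2(traineddata, target)
    return target
```

If find_tesseract() returns None, Tesseract is either not installed or not on PATH. Both arguments of install_traineddata are placeholders for the paths on your own machine.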
Create virtual environment (first install)
python -m venv venv
Activate virtual environment (every time)
source venv/bin/activate # Unix-like
venv\Scripts\activate # Windows
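To confirm that activation worked, one option is to ask the interpreter itself. The helper below is a small sketch, not part of this repository, and the name in_virtualenv is hypothetical. It relies on the fact that inside an environment created by python -m venv, sys.prefix differs from sys.base_prefix.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```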
Set-up (first install)
pip install -e .
Install dependencies
pip install -r requirements.txt # After remote dependency changes
Save your dependency changes
pip freeze > requirements.txt # After local dependency changes
Run main method of a specific file (example):
python -m passport_mrz_reader.pure_tesseract.tesseract_predict
Enable Jupyter widgets
jupyter nbextension enable --py widgetsnbextension --sys-prefix
Open Jupyter notebook (VSCode should also work)
jupyter notebook
Deactivate virtual environment (if desired)
deactivate
Tests can be run using the following command:
python -m unittest discover -s passport_mrz_reader/tests
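By default, unittest discovery collects files whose names match test*.py in the start directory given with -s. A minimal sketch of such a file is shown here. The helper strip_filler and the test names are hypothetical and only illustrate the layout; they are not taken from this repository.

```python
import unittest


def strip_filler(mrz_field: str) -> str:
    """Hypothetical helper: turn '<' filler characters into spaces and trim the result."""
    return mrz_field.replace("<", " ").strip()


class StripFillerTest(unittest.TestCase):
    def test_removes_trailing_filler(self):
        self.assertEqual(strip_filler("ERIKSSON<<<<"), "ERIKSSON")

    def test_keeps_inner_separator_as_space(self):
        self.assertEqual(strip_filler("ANNA<MARIA<<<"), "ANNA MARIA")


if __name__ == "__main__":
    unittest.main()
```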
data:
- Contains the labeled dataset and MRZ images
- Contains the different trained models, one folder for every approach
  - pure tesseract: Contains a .traineddata file that should be moved to the tessdata folder of your Tesseract installation.
src:
- common: Contains functionality that is shared across all implementations
- pure tesseract: Contains an implementation that uses only Tesseract and image processing libraries such as OpenCV to perform OCR on the passport MRZ.
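To give a feel for what OCR with only Tesseract involves, here is a minimal sketch that calls the Tesseract command-line tool from Python. It is not the implementation in this repository. The language name mrz (which assumes the traineddata file is called mrz.traineddata), the page segmentation mode 6 and both function names are assumptions made for illustration, and the sketch leaves out any OpenCV preprocessing.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_command(image_path: str, lang: str = "mrz", psm: int = 6) -> List[str]:
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def read_mrz_text(image_path: str, lang: str = "mrz") -> str:
    """Run tesseract on an already cropped MRZ image and return the raw text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    completed = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return completed.stdout.strip()
```

Keeping the command construction in its own function makes it possible to inspect or test the exact arguments without having Tesseract installed.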