Train Tesseract LSTM with make on Windows
tesstrain-win is derived from tesseract-ocr/tesstrain. To make it run on Windows, the makefile and the overall file structure have been changed.
The ocrd-train (OCR-D/ocrd-train) included in tesstrain-win is the predecessor of tesseract-ocr/tesstrain. It can help in understanding the makefile of tesseract-ocr/tesstrain.
The file structure in tesstrain-win:
You will need a recent version (>= 4.0.0beta1) of tesseract built with the training tools and matching leptonica bindings.
Build instructions and more can be found in the Tesseract project wiki.
You need a recent version of Python 3.x. For image processing the Python library Pillow is used.
To run the makefile on Windows, you need Cygwin.
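Before going further, it can help to confirm that the prerequisites are visible from the shell in which you will run make. The following is a minimal sketch and not part of tesstrain-win. The tool list is an assumption based on the usual Tesseract training tools, so adjust it to whatever your makefile actually calls.

```python
import importlib.util
import shutil

# Assumed tool list: make (from Cygwin), tesseract, and common Tesseract
# training tools. Adjust it to match what your makefile invokes.
REQUIRED_TOOLS = ["make", "tesseract", "lstmtraining", "combine_tessdata",
                  "unicharset_extractor"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def pillow_installed():
    """Return True if the Pillow package (imported as PIL) is importable."""
    return importlib.util.find_spec("PIL") is not None


if __name__ == "__main__":
    for tool in missing_tools(REQUIRED_TOOLS):
        print("not found on PATH:", tool)
    if not pillow_installed():
        print("Pillow is not installed; try: python -m pip install Pillow")
```

If the script prints nothing, every listed tool was found and Pillow is importable. It does not check the Tesseract version.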
Before training on your own data, it is recommended to train with ocrd-testset.zip first.
If training with ocrd-testset.zip completes normally, the training environment on this computer is set up correctly.
- Extract ./data/foo-ground-truth/ocrd-testset.zip to ./data/foo-ground-truth.
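If you prefer to script the extraction instead of using an archive tool, a minimal sketch with Python's standard zipfile module could look like this. The helper name extract_testset is made up for illustration; the paths are the ones given above and are relative to the tesstrain-win directory.

```python
import zipfile
from pathlib import Path


def extract_testset(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return their names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


if __name__ == "__main__":
    archive_path = Path("data/foo-ground-truth/ocrd-testset.zip")
    if archive_path.exists():
        names = extract_testset(archive_path, "data/foo-ground-truth")
        print("extracted", len(names), "files")
    else:
        print("archive not found:", archive_path)
```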
- Run the command prompt as an administrator, then go to the tesstrain-win directory, e.g.:
cd %USERPROFILE%/tesstrain-win
- Run make training:
make training
- Give your model a name
You can set the name by changing line 11 of the makefile:
MODEL_NAME = New_Name
Or you can pass the name when you run make training:
make training MODEL_NAME=New_Name
- Prepare the base traineddata
If you train from scratch, skip this step. If you fine-tune, download the base traineddata from tessdata_best and place it in ./data/tessdata.
- Update the foo.numbers, foo.punc and foo.wordlist files in the data folder
The three files should be consistent with the base traineddata or with the language you are training.
For example, if your base traineddata is eng, you can download them from langdata_lstm/eng. After downloading, rename them to New_Name.numbers, New_Name.punc and New_Name.wordlist respectively.
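The renaming can be scripted. The sketch below copies the three downloaded files into the data folder under the new model name. The function name copy_langdata and the download location used in the usage note are hypothetical; only the file suffixes come from the text above.

```python
import shutil
from pathlib import Path

# The three file types named in the text above.
SUFFIXES = ("numbers", "punc", "wordlist")


def copy_langdata(src_dir, dst_dir, base_lang, model_name):
    """Copy base_lang.{numbers,punc,wordlist} into dst_dir under model_name.

    Returns the list of files written. Source files that do not exist
    are skipped rather than treated as an error.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for suffix in SUFFIXES:
        source = Path(src_dir) / f"{base_lang}.{suffix}"
        if source.is_file():
            target = dst / f"{model_name}.{suffix}"
            shutil.copyfile(source, target)
            written.append(target)
    return written
```

For example, copy_langdata("downloads/eng", "data", "eng", "New_Name") would write data/New_Name.numbers and so on, assuming the files were saved to a downloads/eng folder. Copying rather than renaming keeps the downloaded originals intact.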
- Prepare the ground truth
Place ground truth consisting of line images and transcriptions in the folder data/MODEL_NAME-ground-truth. This list of files will be split into training and evaluation data; the ratio is defined by the RATIO_TRAIN variable.
Images must be TIFF and have the extension .tif or PNG and have the extension .png, .bin.png or .nrm.png.
Transcriptions must be single-line plain text and have the same name as the line image but with the image extension replaced by .gt.txt.
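The naming rules above can be checked before starting a long training run. The following sketch pairs each line image with its .gt.txt file and reports missing or multi-line transcriptions. It also includes a small split_by_ratio helper that only illustrates what a training ratio means. The makefile performs its own split, its exact ordering and rounding may differ, and the value 0.9 in the usage note is an example, not a default taken from this document. All function names here are made up for illustration.

```python
from pathlib import Path

# Image extensions named in the text. Longer ones come first so that
# "x.bin.png" is not mistaken for a plain ".png" with stem "x.bin".
IMAGE_EXTENSIONS = (".bin.png", ".nrm.png", ".png", ".tif")


def line_stem(image_name):
    """Return the file name with its image extension removed, or None."""
    for ext in IMAGE_EXTENSIONS:
        if image_name.endswith(ext):
            return image_name[: -len(ext)]
    return None


def check_ground_truth(folder):
    """Return (stems of valid pairs, list of problem descriptions)."""
    folder = Path(folder)
    valid, problems = [], []
    for image in sorted(folder.iterdir()):
        stem = line_stem(image.name)
        if stem is None:
            continue  # not a line image (for example a .gt.txt file)
        transcription = folder / (stem + ".gt.txt")
        if not transcription.is_file():
            problems.append(f"{image.name}: missing {transcription.name}")
        elif len(transcription.read_text(encoding="utf-8").splitlines()) != 1:
            problems.append(f"{transcription.name}: expected exactly one line")
        else:
            valid.append(stem)
    return valid, problems


def split_by_ratio(items, ratio_train):
    """Split items into (training, evaluation); ratio_train goes to training."""
    cut = int(len(items) * ratio_train)
    return items[:cut], items[cut:]
```

A typical use would be to call check_ground_truth("data/New_Name-ground-truth"), fix anything listed in the returned problems, and then, purely to get a feel for the sizes involved, call split_by_ratio(valid, 0.9).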
- Run the command prompt as an administrator, then go to the tesstrain-win directory, e.g.:
cd %USERPROFILE%/tesstrain-win
- Run make training:
make training