Install Tesseract v.3.05 and ImageMagick
brew install tesseract
brew install imagemagick
To package a traineddata file from a folder containing box/tif training pairs, using a word list for the training dictionary:
python training-packer.py path-to-folder word-list-file.txt
Place the resulting traineddata file (located in the same folder as the box/tif pairs) in Tesseract's /usr/local/share/tessdata/ folder.
To preprocess the folder of .tif directory pages:
python im-processor.py path-to-ImageMagick-textcleaner-script path-to-images/ path-to-output-folder/
If the folder contains .jpeg or .jpg files rather than .tif files, im-processor.py uses ImageMagick to convert them to .tif before preprocessing them.
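The conversion step described above can be sketched as follows. This is not the code of im-processor.py; the function name conversion_commands and the sample file names are hypothetical, and the sketch only builds the ImageMagick commands without running them. It assumes the ImageMagick 6 "convert" entry point (ImageMagick 7 installs it as "magick").

```python
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def conversion_commands(image_names, output_dir):
    """Build one ImageMagick command per JPEG input; .tif inputs are skipped."""
    commands = []
    for name in sorted(image_names):
        path = Path(name)
        if path.suffix.lower() in JPEG_SUFFIXES:
            target = Path(output_dir) / (path.stem + ".tif")
            # Assumption: "convert" is on the PATH (ImageMagick 6 naming).
            commands.append(["convert", str(path), str(target)])
    return commands
```

Each returned list can be passed to subprocess.run once the commands look right for the folder in question.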
To run Tesseract 3.05 on the processed images:
python directory-tess3.py path-to-images/ path-to-output-folder/ trainingdata-iso-filename
The trainingdata-iso-filename argument is the name of the training data language file (e.g. a variation on "eng") that has been placed in the tessdata folder (see below).
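As an illustration of how the language name reaches Tesseract, here is a minimal sketch of the command that would be built for a single page. It is an assumption about what a wrapper like directory-tess3.py might do, not its actual code; ocr_command and the sample paths are hypothetical. The -l option is Tesseract's standard way of selecting a traineddata file by name.

```python
from pathlib import Path


def ocr_command(image_path, output_dir, lang):
    """Build the tesseract call for one image using the given language name."""
    image = Path(image_path)
    # Tesseract adds the ".txt" extension to the output base itself.
    output_base = Path(output_dir) / image.stem
    return ["tesseract", str(image), str(output_base), "-l", lang]
```

For example, ocr_command("processed/p001.tif", "text", "eng2") selects the eng2 traineddata file, which must already be in the tessdata folder.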
The process below uses a combination of Tesseract's standard English training data and additional fonts extracted from city directories.
Building the training files can be done by following the tutorial here. Start by making a few box/tif pairs, selecting useful pages from the directories that contain characters that prove problematic (especially H, h, and ½; pages with italics are also helpful). For directories, it is useful to build separate standard and italic font training files. Create a language name derived from 'eng', the ISO 639-2 code for English, to use as a file prefix (e.g. eng2). Do not use 'eng' itself, so that the new training stays distinct from Tesseract's built-in eng training data.
To make box/tif pairs for a new English-language font for the 1849 directory, non-italic fonts only:
tesseract eng2.dir1849.exp0.tif eng2.dir1849.exp0 batch.nochop makebox
And for a planned italics-only example (add an i to 1849):
tesseract eng2.dir1849i.exp0.tif eng2.dir1849i.exp0 batch.nochop makebox
Run this for every page to be used as training data, changing the exp integer for each separate pair, e.g.:
tesseract eng2.dir1849.exp2.tif eng2.dir1849.exp2 batch.nochop makebox
tesseract eng2.dir1849i.exp1.tif eng2.dir1849i.exp1 batch.nochop makebox
...
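The naming pattern used in the commands above is language.font.expN. A minimal sketch of generating it, so that the exp integer is not mistyped across many pages (training_base and makebox_command are hypothetical helper names; the command arguments are copied from the examples above):

```python
def training_base(lang, font, exp):
    """Return the shared base name of a box/tif pair, e.g. eng2.dir1849.exp0."""
    return f"{lang}.{font}.exp{exp}"


def makebox_command(base):
    """Build the makebox call for one page, mirroring the commands above."""
    return ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"]


# One command per training page for the non-italic 1849 font.
commands = [makebox_command(training_base("eng2", "dir1849", n)) for n in range(3)]
```

Joining each list with spaces reproduces the command lines shown above for exp0, exp1, and exp2.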
Next, correct the generated .box files using this Python utility script. Delete any lines in the italics training files that are not italic characters.
Then run the following for every box/tif pair to generate the .tr files:
tesseract eng2.dir1849.exp1.tif eng2.dir1849.exp1 nobatch box.train
tesseract eng2.dir1849i.exp0.tif eng2.dir1849i.exp0 nobatch box.train
...
Extract the unicharset file, listing every .box file in a single command:
unicharset_extractor eng2.dir1849.exp0.box eng2.dir1849.exp1.box eng2.dir1849.exp2.box eng2.dir1849i.exp0.box eng2.dir1849i.exp1.box
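Because the unicharset_extractor command has to name every .box file, it is easy to leave one out or to include a box file whose .tif is missing. The sketch below, with hypothetical helper names paired_bases and unicharset_command, builds the command from a folder listing and keeps only complete box/tif pairs. Dropping unpaired files is a choice made for this example, not something the Tesseract tools require.

```python
from pathlib import Path


def paired_bases(file_names):
    """Return the sorted base names that have both a .box and a .tif file."""
    stems = {".box": set(), ".tif": set()}
    for name in file_names:
        path = Path(name)
        if path.suffix in stems:
            stems[path.suffix].add(path.stem)
    return sorted(stems[".box"] & stems[".tif"])


def unicharset_command(file_names):
    """Build a single unicharset_extractor call covering every complete pair."""
    boxes = [f"{base}.box" for base in paired_bases(file_names)]
    return ["unicharset_extractor"] + boxes
```

The file names can come from os.listdir on the training folder.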
Create the font_properties file as per the guidelines here with, for example, two lines: one for the standard font and one for the italic font. Make sure the font names listed in the file (dir1849, dir1849i, etc.) match the font names in the box/tif file names. Enter the appropriate 1/0 for each font type.
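In Tesseract 3.x each font_properties line has the form "fontname italic bold fixed serif fraktur", with each flag written as 0 or 1. A minimal sketch of generating the two lines for the fonts used here (the helper names are hypothetical, and marking both fonts as serif is an assumption about the directory typeface, not something stated above):

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Format one line: fontname followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


def font_properties_text(fonts):
    """Join one line per (font name, flag dict) pair, each ending in a newline."""
    return "".join(font_properties_line(name, **flags) + "\n" for name, flags in fonts)


# Assumption: both directory fonts are serif faces; adjust the flags as needed.
text = font_properties_text([
    ("dir1849", {"serif": True}),
    ("dir1849i", {"italic": True, "serif": True}),
])
```

With these flags, text holds the lines "dir1849 0 0 0 1 0" and "dir1849i 1 0 0 1 0", ready to be written to a file named font_properties.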
Perform the shapeclustering, mftraining, and cntraining steps, passing all the .tr files to each command on a single line (the example lists only the exp0 files):
shapeclustering -F font_properties -U unicharset eng2.dir1849.exp0.tr eng2.dir1849i.exp0.tr
mftraining -F font_properties -U unicharset -O eng2.unicharset eng2.dir1849.exp0.tr eng2.dir1849i.exp0.tr
cntraining eng2.dir1849.exp0.tr eng2.dir1849i.exp0.tr
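Since all three commands take the same list of .tr files, building them from one list helps keep them in step. The sketch below uses a hypothetical function, training_commands, and reproduces the argument order of the commands above; it only constructs the argument lists and does not run anything.

```python
def training_commands(tr_files, lang, font_properties="font_properties",
                      unicharset="unicharset"):
    """Build the shapeclustering, mftraining, and cntraining calls in order."""
    tr_files = sorted(tr_files)
    common = ["-F", font_properties, "-U", unicharset]
    return [
        ["shapeclustering"] + common + tr_files,
        # -O names the output unicharset that later steps expect.
        ["mftraining"] + common + ["-O", f"{lang}.unicharset"] + tr_files,
        ["cntraining"] + tr_files,
    ]
```

Passing the result of glob.glob("eng2.*.tr") as tr_files would pick up every .tr file produced in the box.train step.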
At this stage a word-dawg list (i.e. a dictionary of useful words) can be made. Prepare the word list (and, similarly, a frequent-word list if one is wanted) as a .txt file with one word per line and \n line endings. Run, as per the Tesseract tutorial:
wordlist2dawg words_list eng2.word-dawg eng2.unicharset
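Word lists gathered from different sources often carry Windows line endings, blank lines, or stray spaces. A minimal sketch of cleaning a list into the one-word-per-line form described above (normalize_word_list is a hypothetical helper, and removing duplicates is an extra choice made for this example rather than a stated requirement):

```python
def normalize_word_list(raw_text):
    """Return the words one per line with LF endings, skipping blanks and repeats."""
    seen = set()
    words = []
    # splitlines() handles LF, CRLF, and CR line endings alike.
    for line in raw_text.splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return "".join(word + "\n" for word in words)
```

When writing the result to disk, opening the file with open(path, "w", newline="\n") keeps Python from translating the line endings on Windows.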
Make sure this dawg file is in the same directory as the .tr and unicharset files. Prefix inttemp, normproto, pffmtable, and shapetable with the language name:
mv inttemp eng2.inttemp
mv normproto eng2.normproto
mv pffmtable eng2.pffmtable
mv shapetable eng2.shapetable
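The same renaming can be done in one pass for any language name. This is a sketch only: prefix_training_outputs is a hypothetical helper, and skipping files that are absent is a choice made for this example.

```python
from pathlib import Path

TRAINING_OUTPUTS = ("inttemp", "normproto", "pffmtable", "shapetable")


def prefix_training_outputs(folder, lang):
    """Rename each training output found in folder to lang.name; return new names."""
    renamed = []
    for name in TRAINING_OUTPUTS:
        source = Path(folder) / name
        if source.exists():
            target = source.with_name(f"{lang}.{name}")
            source.rename(target)
            renamed.append(target.name)
    return renamed
```

Comparing the returned list against the four expected names is a quick way to spot a training step that did not produce its output.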
Lastly, package everything up into a traineddata file:
combine_tessdata eng2.