hocrmod

This project addresses an edge case in Tesseract where small regions are skipped during recognition. It runs OCR on the missed regions and uses Tesseract's hocr support to merge the results back in. Page numbers seem to be the most common casualty, as seen for page 14 in the example below:

There is an open issue on this sort of scenario, and it may eventually be sorted out within Tesseract itself. The main program here, hocrmod.py, is a simple Python script that uses some OpenCV tricks to detect the missed regions. It has several options:

usage: hocrmod.py [-h] [-f FILE] [-b BORDER] [-a ARGUMENTS] [-d] [-c CONF]
                  [-l LANG]

optional arguments:
  -h, --help            show this help message and exit

named arguments:
  -f FILE, --file FILE  input image, for example: imgs/my_image.tif
  -b BORDER, --border BORDER
                        adjust border value for extracted regions
  -a ARGUMENTS, --arguments ARGUMENTS
                        arguments for tesseract on missing regions
  -d, --debug           create debug files
  -c CONF, --conf CONF  set confidence number threshold for missed regions
  -l LANG, --lang LANG  language for OCR

The easiest way to see what's happening with this approach is to run the script with the -d option. For example:

python hocrmod.py -f mj0029.jpg -d

The script will look for a corresponding hocr file with the same path as the image. If one is not found, then Tesseract will be run on the image. To use a slightly more ambitious image from the kind folks at the Internet Archive, consider this:

Tesseract does an amazing job on most of this image. With the -d option, we can look at the regions image to inspect what's left afterwards. The script uses the base hocr file, which provides coordinates, to blank out the regions that Tesseract has already identified, leaving the following:
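The blanking step can be sketched roughly as follows. This is a minimal illustration, not the code in hocrmod.py: the function names `word_boxes` and `blank_boxes` are hypothetical, the regular expression assumes Tesseract-style `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute, and the image is a plain list of rows so the example has no dependencies. With an OpenCV/NumPy image, the inner loops would normally be a single slice assignment.

```python
import re

# Matches the bbox in the title attribute of Tesseract-style word spans, e.g.
# <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>
WORD_BBOX = re.compile(
    r"""class=['"]ocrx_word['"][^>]*?\bbbox (\d+) (\d+) (\d+) (\d+)"""
)


def word_boxes(hocr_text):
    """Return (x0, y0, x1, y1) tuples for every word span in the hocr text."""
    return [tuple(int(v) for v in m.groups()) for m in WORD_BBOX.finditer(hocr_text)]


def blank_boxes(image, boxes, fill=255):
    """Overwrite each box with the background value, in place.

    `image` is a list of rows of grayscale values (hypothetical stand-in
    for an OpenCV array); boxes are clipped to the image bounds.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                image[y][x] = fill
    return image
```

Whatever ink remains after this pass is, by construction, content that the base hocr file does not account for.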

Again, Tesseract does a lot of good things here, and there really isn't much left. In this case, though, the small regions cover some important semantic content, particularly the page number. Tesseract also, rightfully, ignores the separator lines, which are usually stylistic and not appropriate for OCR. However, the script tries to use OpenCV to identify these lines and blank them out, rather than skipping regions with separators altogether. This is because of situations like the following:

Here the line overlaps with a textual area. Line identification is not infallible, and there will often be questionable regions in the mix, but the associated contours image shows which regions will be subject to OCR with this approach:
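To give a feel for why blanking a separator is preferable to discarding its whole region, here is a dependency-free stand-in for the idea. It is not the OpenCV technique the script uses: it simply treats any sufficiently long horizontal run of ink as a separator and paints it out, leaving shorter runs (text strokes) alone. The names `long_horizontal_runs` and `blank_runs`, the `min_len` parameter, and the convention that ink is 0 and background is 255 are all assumptions made for this sketch.

```python
def long_horizontal_runs(rows, min_len, ink=0):
    """Return (y, x_start, x_end) for runs of ink pixels at least min_len wide."""
    runs = []
    for y, row in enumerate(rows):
        start = None
        # The trailing None acts as a sentinel so a run touching the
        # right edge of the image is still closed and measured.
        for x, value in enumerate(list(row) + [None]):
            if value == ink:
                if start is None:
                    start = x
            elif start is not None:
                if x - start >= min_len:
                    runs.append((y, start, x))
                start = None
    return runs


def blank_runs(rows, runs, fill=255):
    """Paint each detected run with the background value, in place."""
    for y, x0, x1 in runs:
        for x in range(x0, x1):
            rows[y][x] = fill
    return rows
```

A simple rule like this also erases any text pixels that sit directly on the line, which is one reason questionable regions can still show up afterwards.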

The script uses pytesseract, and the psm number and other Tesseract arguments can be overridden with the -a option. There is also a check against a confidence threshold, since bogus regions are common, when what is most often desired is the following, i.e., the elusive page number:
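A confidence check of this kind might look like the sketch below. It assumes the hocr for a region carries Tesseract's per-word `x_wconf` values and that a region is accepted when the mean of those values meets the threshold. Whether hocrmod.py averages, takes a minimum, or tests each word separately is not stated here, so treat the rule and the names `mean_confidence` and `keep_region` as illustrative only.

```python
import re

# Tesseract writes word confidence into the title attribute as "x_wconf NN".
WORD_CONF = re.compile(r"x_wconf (\d+)")


def mean_confidence(hocr_text):
    """Average the x_wconf values found in the hocr text (0.0 if none)."""
    confs = [int(c) for c in WORD_CONF.findall(hocr_text)]
    return sum(confs) / len(confs) if confs else 0.0


def keep_region(hocr_text, threshold):
    """Accept a region only if its mean word confidence meets the threshold."""
    return mean_confidence(hocr_text) >= threshold
```

A region that yields no words at all scores 0.0 and is rejected for any positive threshold.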

With the -d option, the resulting debug files are created with the coordinates of each region in the file name, e.g. sim5_coords_00169_02971_00334_03082.png. The original hocr file is renamed with a .bak extension, unless the script produces no text, in which case the original hocr file is left untouched.
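The file name pattern in that example can be built and read back with a few lines of Python. The sketch assumes the four numbers are zero-padded to five digits, as in the example name, and guesses that they are ordered as x0, y0, x1, y1. The coordinate order and the helper names `region_filename` and `parse_region_filename` are assumptions, not something taken from the script.

```python
import re
from pathlib import Path


def region_filename(image_path, box, suffix=".png"):
    """Build a name like sim5_coords_00169_02971_00334_03082.png."""
    stem = Path(image_path).stem
    coords = "_".join(f"{v:05d}" for v in box)
    return f"{stem}_coords_{coords}{suffix}"


COORDS = re.compile(r"_coords_(\d+)_(\d+)_(\d+)_(\d+)\.")


def parse_region_filename(name):
    """Recover the four coordinates from a debug file name, or None."""
    m = COORDS.search(name)
    return tuple(int(v) for v in m.groups()) if m else None
```

Reading the coordinates back this way makes it easy to match a debug file to its location on the page.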

This project also includes the cleanhocr.py script, which we use to filter hocr files based on a confidence level. This is sometimes useful with the sparse text psm options, where the results can include content not captured by other settings. None of these options seem to capture the page number in the scenario here, but the sparse text options can often make a difference for other types of missed text. The parameters are as follows:

usage: cleanhocr.py [-h] [-f FILE] [-c CONF] [-l LANG] [-n] [-t TITLE]

options:
  -h, --help            show this help message and exit

named arguments:
  -f FILE, --file FILE  input file, for example: page.hocr
  -c CONF, --conf CONF  set confidence number threshold for ocr words
  -l LANG, --lang LANG  language for OCR
  -n, --number          flag to bypass confidence value for words with number(s)
  -t TITLE, --title TITLE
                        title to set for HOCR file
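The filtering that the -c and -n options describe could be approximated as below. This is a regex-based sketch rather than the approach cleanhocr.py takes: it assumes Tesseract-style `ocrx_word` spans with an `x_wconf` value and no nested spans, and it simply deletes words under the threshold unless `keep_numbers` is set and the word contains a digit. The function name `drop_low_confidence` and its parameters are hypothetical.

```python
import re

# One Tesseract-style word span: captures the confidence and the inner text.
WORD_SPAN = re.compile(
    r"""<span class=['"]ocrx_word['"][^>]*?x_wconf (\d+)[^>]*>(.*?)</span>""",
    re.DOTALL,
)


def drop_low_confidence(hocr_text, threshold, keep_numbers=False):
    """Remove word spans below the threshold, optionally sparing numbers."""

    def replace(match):
        conf, text = int(match.group(1)), match.group(2)
        if conf >= threshold:
            return match.group(0)
        if keep_numbers and re.search(r"\d", text):
            # Mirrors the idea of -n: words with digits bypass the check.
            return match.group(0)
        return ""

    return WORD_SPAN.sub(replace, hocr_text)
```

Sparing words that contain digits matters for the page-number case, where a short numeric token can easily receive a low confidence score.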

Thanks, as always, to the Internet Archive for all of the great work they do, and to my colleagues at OurDigitalWorld as well as the Centre for Digital Scholarship for supporting and encouraging these kinds of projects to help digitize analogue collections.

art rhyno ourdigitalworld/cdigs