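Using Tesseract-ocr on Ubuntu

Installation

Training with jTessBoxEditor

Prepare the sample images and merge them into test.tif.

Generate the BOX file: open a terminal in the directory containing test.tif and run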

tesseract test.tif test makebox
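Each line of the generated test.box describes one character as `<symbol> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. As a rough sketch for inspecting the file before correcting it in jTessBoxEditor, the following parses those lines. The names `Box`, `parse_box_line`, and `read_box_file`, and the sample line, are illustrative assumptions and not part of Tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of a Tesseract box file."""
    # The symbol is the first field; the last five fields are integers.
    symbol, left, bottom, right, top, page = line.rsplit(maxsplit=5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def read_box_file(path: str) -> List[Box]:
    """Read every non-empty line of a box file such as test.box."""
    with open(path, encoding="utf-8") as handle:
        return [parse_box_line(line) for line in handle if line.strip()]


# Hypothetical sample line: the character "A" on page 0.
sample = parse_box_line("A 12 30 40 68 0")
print(sample.symbol, sample.right - sample.left, sample.top - sample.bottom)
```

Calling `read_box_file("test.box")` after the makebox step returns one `Box` per recognized character, which makes it easy to spot boxes with implausible widths or heights.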

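Run on the command line: echo font 0 0 0 0 0 >font_properties. Note that font is the name given to the font being created; it is used again below when merging the training files. The command produces the font_properties file.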
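The five numbers after the font name are 0/1 flags for italic, bold, fixed-pitch, serif, and fraktur, in that order. A minimal sketch that writes the same file from Python follows; the function names are assumptions made for illustration.

```python
def font_properties_line(name, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one entry: <name> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([name] + [str(int(flag)) for flag in flags])


def write_font_properties(path, name, **flags):
    """Write a single-font font_properties file, as the echo command does."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(font_properties_line(name, **flags) + "\n")


print(font_properties_line("font"))  # font 0 0 0 0 0
```

`write_font_properties("font_properties", "font")` produces the same one-line file as the echo command.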

Generate the .tr training file. Run on the command line:

tesseract test.tif test -l eng -psm 7 nobatch box.train

Generate the character set file

Run on the command line:

unicharset_extractor test.box

Generate the shape file

Run on the command line:

shapeclustering -F font_properties -U unicharset -O test.unicharset test.tr

Generate the clustered character feature files

Run on the command line:

mftraining -F font_properties -U unicharset -O test.unicharset test.tr

Generate the character normalization feature file. Run on the command line:

cntraining test.tr
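The steps from box.train through cntraining can be collected into a script. The following is only a sketch: it mirrors the commands shown in this section as argument lists and runs them with subprocess. The function names and the `dry_run` switch are assumptions, and it presumes the Tesseract training tools are on PATH and that test.tif, test.box, and font_properties already exist.

```python
import subprocess


def training_commands(base="test", lang="eng", psm="7"):
    """Return the training commands from this section as argument lists."""
    tif, box, tr = base + ".tif", base + ".box", base + ".tr"
    out_unicharset = base + ".unicharset"
    return [
        ["tesseract", tif, base, "-l", lang, "-psm", psm, "nobatch", "box.train"],
        ["unicharset_extractor", box],
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset",
         "-O", out_unicharset, tr],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", out_unicharset, tr],
        ["cntraining", tr],
    ]


def run_training(base="test", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in training_commands(base):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


run_training()  # dry run: only prints the commands
```

Running with `dry_run=False` executes the steps in order and raises `subprocess.CalledProcessError` as soon as one of them fails, instead of continuing with incomplete files.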

Rename the generated files with the mv command

mv normproto test.normproto  
mv inttemp test.inttemp
mv pffmtable test.pffmtable  
mv unicharset test.unicharset  
mv shapetable test.shapetable
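The same renaming can be done from Python. This sketch, with an assumed helper name `prefix_outputs`, moves each of the five output files to the `test.` prefix and reports which ones it did not find, which helps spot a training step that produced nothing.

```python
import os
import shutil

OUTPUT_NAMES = ["normproto", "inttemp", "pffmtable", "unicharset", "shapetable"]


def prefix_outputs(directory=".", lang="test"):
    """Rename training outputs to <lang>.<name>; return (renamed, missing)."""
    renamed, missing = [], []
    for name in OUTPUT_NAMES:
        source = os.path.join(directory, name)
        if os.path.exists(source):
            shutil.move(source, os.path.join(directory, lang + "." + name))
            renamed.append(name)
        else:
            missing.append(name)
    return renamed, missing


# Example call (not executed here): renamed, missing = prefix_outputs(".", "test")
```

If `missing` is not empty after the call, the corresponding training step is worth rerunning before merging.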

Merge the training files. Run on the command line:

combine_tessdata test.

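Copy the generated .traineddata file (test.traineddata here, renamed to test1.traineddata) into the tessdata language-data folder inside the Tesseract-OCR directory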

mv test.traineddata test1.traineddata 
sudo cp test1.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
sudo cp test1.traineddata ../jTessBoxEditor/tesseract-ocr/tessdata/

Converting the image to grayscale and binarizing it

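from PIL import Image  # Pillow

img = Image.open("test.png")  # placeholder path; substitute your own image
img_grey = img.convert("L")
# Convert the color image to grayscale

img_two = img_grey.point(lambda x: 255 if x > 100 else 0)
# x is a pixel's gray value; 100 is the threshold separating black from white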
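For an 8-bit grayscale image, Pillow's point() calls the function once for each of the 256 possible gray levels and applies the result as a lookup table, so the threshold can also be written as an explicit table. A small sketch follows; the helper names are illustrative, and the default threshold of 100 is simply the value used in the snippet above.

```python
def threshold_table(threshold=100):
    """Lookup table: gray levels above the threshold map to 255, others to 0."""
    return [255 if level > threshold else 0 for level in range(256)]


def binarize(grey_image, threshold=100):
    """Binarize a Pillow image in mode "L" using the lookup table."""
    return grey_image.point(threshold_table(threshold))


table = threshold_table(100)
print(table[100], table[101])  # 0 255
```

Building the table explicitly makes it easy to try several thresholds on the same image and keep the one that separates the characters from the background most cleanly.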

Usage



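-psm pagesegmode is also an optional parameter, with a default value of 3. Its value describes the layout of the image to be recognized, and choosing a matching value improves recognition accuracy. The values have the following meanings:

0 = Orientation and script detection (OSD) only

1 = Automatic page segmentation with OSD

2 = Automatic page segmentation, but no OSD, or OCR

3 = Fully automatic page segmentation, but no OSD (default)

4 = Assume the image is a single column of text

5 = Assume the image is a single uniform block of vertically aligned text

6 = Assume the image is a single uniform block of text

7 = Treat the image as a single text line

8 = Treat the image as a single word

9 = Treat the image as a single word in a circle

10 = Treat the image as a single character

11 = Sparse text: find as much text as possible in no particular order

12 = Sparse text with OSD

13 = Raw line: treat the image as a single text line, bypassing hacks that are Tesseract-specific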
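As a usage sketch, the trained data copied into tessdata (test1) can be selected with -l and combined with a page segmentation mode. The helper below only builds and runs the command line; its names and the sample file names are assumptions. It writes the flag as -psm, as this document does; newer Tesseract releases spell it --psm, so the flag is left as a parameter.

```python
import subprocess


def ocr_command(image, output_base, lang="test1", psm=7, psm_flag="-psm"):
    """Build a tesseract command line for one image."""
    if not 0 <= psm <= 13:
        raise ValueError("page segmentation mode must be between 0 and 13")
    return ["tesseract", image, output_base, "-l", lang, psm_flag, str(psm)]


def run_ocr(image, output_base, **options):
    """Run tesseract; the recognized text is written to <output_base>.txt."""
    subprocess.run(ocr_command(image, output_base, **options), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


# Hypothetical file names: a single line of text in captcha.png.
print(" ".join(ocr_command("captcha.png", "result", psm=7)))
```

Mode 7 suits images that contain one line of text; for a single character, `psm=10` would be the corresponding choice from the list above.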
