feat: Improved tesseract OCR performance #5
Recognizing text in screenshots is, I think, a fundamentally different problem from OCR, unless the text is huge. Unfortunately it usually isn't in screenshots (#29). Tesseract also seems very sensitive to small changes in the input (#19, #30). This is likely a result of the way it performs binarization. Doing this in a smarter way, one that is optimized for screenshots rather than scans and photographs, would probably give a bigger improvement than training on the Ubuntu font.
I agree, it's not optimal. So far though, it seems the main issue is text with a drop-shadow on a coloured background; tesseract has better success with title bars and text within windows, where the background is a uniform colour. As a test, I tried a few things, including the following, to see if tesseract had a better time.
Here's a script I threw together to test on the problematic "Install Xubuntu" example @ali1234 gave in #29:

```bash
#!/bin/bash
# Takes two parameters, one is an image file and the other is a text string in quotes
# We have a bunch of nested loops which iterate over the various tesseract ocr options
# running each in turn, and producing a text file with the output of each run
# We grep for the text string in the resulting text that tesseract produces
# The first parameter is the image file to be processed
image=$1
imagename=$(basename "$image")
# The second parameter is the text string to search for
text=$2
# The number of tiles to split the image into both horizontally and vertically
tiles=4
# Create a temporary directory to store the output of the tesseract runs which is datestamped
workdir=$(date +%Y%m%d-%H%M%S)
mkdir $workdir
for x in $(seq 0 $((tiles-1))); do
for y in $(seq 0 $((tiles-1))); do
for scale in 100 200 300 400 500; do
scaledimage="$workdir"/scaled-"$scale"-"$imagename"
convert "$image" -resize "$scale"% "$scaledimage"
width=$(identify -format "%w" "$scaledimage")
height=$(identify -format "%h" "$scaledimage")
tilewidth=$((width/tiles))
tileheight=$((height/tiles))
xoffset=$((x*tilewidth))
yoffset=$((y*tileheight))
tileimage="$workdir"/tile-"$x"-"$y"-"$scale"-"$imagename"
convert "$scaledimage" -crop "$tilewidth"x"$tileheight"+"$xoffset"+"$yoffset" "$tileimage"
for psm in 4 5 6 7 8 9 10 11 12 13; do
for oem in 3; do
echo "Running tesseract with scale $scale%, psm $psm, oem $oem on tile $x $y"
if ! tesseract --loglevel ALL -c tessedit_write_images=true --psm "$psm" --oem "$oem" "$tileimage" $workdir/out-tile-"$x"-"$y"-"$scale"-"$psm"-"$oem"; then
echo "tesseract failed"
exit 1
fi
grep -q "$text" $workdir/out-tile-$x-$y-$scale-$psm-$oem.txt
if [ $? -eq 0 ]; then
echo "Found $text in $workdir/out-tile-$x-$y-$scale-$psm-$oem.txt"
fi
done
done
done
done
done
```

What seemed to work best was chopping the image up. Now, this won't work every time, especially if the text we're looking for is cut in half horizontally or vertically on the split.
Scaling the image wasn't needed at all; just cropping out all the other stuff and letting tesseract focus on one thing was enough. This is the successful crop it used (tile 0,1: first across, second down of four). Note I had to convert it from ppm to png before uploading, so some compression may have occurred which changes the image. It might be valuable to allow tesseract a few goes at each screenshot, and only as an option where we know things are problematic (such as poorly dithered wallpapers, bad anti-aliasing, drop-shadows and the like), because I imagine that once we get past this first screen, most of the rest of the test would run just fine - unless Xubuntu has some wacky installer with plasma backgrounds :D
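To make the "few goes" idea concrete, here is a minimal sketch of a fallback flow: run a normal OCR pass first, and only fall back to cropping when it fails. `ocr_contains`, `screenshot.png` and the expected text are placeholders I made up for illustration, not the project's real API.

```bash
# Hypothetical fallback flow; ocr_contains, screenshot.png and the
# expected text are placeholders, not the project's real API.
ocr_contains() {
    tesseract "$1" stdout 2>/dev/null | grep -q "$2"
}

image=screenshot.png
text="Install Xubuntu"

if ocr_contains "$image" "$text"; then
    echo "found on the first, normal pass"
else
    # Problematic screenshot: retry on each quadrant of a 2x2 split.
    w=$(( $(identify -format "%w" "$image") / 2 ))
    h=$(( $(identify -format "%h" "$image") / 2 ))
    for x in 0 1; do
        for y in 0 1; do
            convert "$image" -crop "${w}x${h}+$((x*w))+$((y*h))" crop.png
            if ocr_contains crop.png "$text"; then
                echo "found in tile $x,$y"
            fi
        done
    done
fi
```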
The problem as I see it is that tesseract's binarization is too smart. It is designed for scans, which often have the problem that brightness varies over the image: the background in one part of the image can be darker than the text in another part, like in this example: https://tesseract-ocr.github.io/tessdoc/images/binarisation.png

So it has to use some kind of adaptive thresholding, and that only works well at high resolution because it introduces fringing. For screenshots we can use simpler methods - take a single channel, or take an absolute threshold across the entire image - because the text we are looking for is unlikely to vary much, if at all. The background might vary, but it will never be simultaneously brighter and darker than the text.
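As a minimal sketch of those simpler methods using ImageMagick (the file names and the 50% threshold are assumptions and would need tuning):

```bash
# Global, non-adaptive binarization of a screenshot before OCR.
# screenshot.png and the 50% threshold are placeholder assumptions.

# Take a single channel (green usually carries most of the luminance).
convert screenshot.png -channel G -separate green.png

# Or apply one absolute threshold across the entire image: every pixel
# brighter than 50% becomes white, everything else black.
convert screenshot.png -colorspace Gray -threshold 50% binarized.png

# Run tesseract on the pre-binarized image; its own adaptive
# thresholding then has very little left to get wrong.
tesseract binarized.png stdout --psm 11
```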
The other problem is that tesseract expects the image to be mostly filled with text, rather than mostly empty space with just a few words scattered around. This is likely why dividing up the image helps. This could be automated by looking at the image gradient to find regions that are not "flat" empty space, then partitioning into rectangles of roughly equal gradient.
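A rough sketch of how that could look, assuming a fixed 4x4 grid rather than a true partition into rectangles of equal gradient, and an arbitrary 0.01 cut-off for what counts as "flat":

```bash
#!/bin/bash
# Sketch: only OCR tiles that contain some image gradient.
# The 4x4 grid and the 0.01 flatness cut-off are assumptions.
image=$1
tiles=4
width=$(identify -format "%w" "$image")
height=$(identify -format "%h" "$image")
tilewidth=$((width/tiles))
tileheight=$((height/tiles))
for x in $(seq 0 $((tiles-1))); do
    for y in $(seq 0 $((tiles-1))); do
        tile="tile-$x-$y.png"
        convert "$image" -crop "${tilewidth}x${tileheight}+$((x*tilewidth))+$((y*tileheight))" "$tile"
        # The mean of an edge-filtered tile is close to 0 for flat, empty regions.
        gradient=$(convert "$tile" -colorspace Gray -edge 1 -format "%[fx:mean]" info:)
        if (( $(echo "$gradient > 0.01" | bc -l) )); then
            tesseract "$tile" "out-$x-$y" --psm 11
        fi
    done
done
```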
For reference, the 8x ESRGAN upscale takes about 2 minutes on an i7 6700, or about 4 seconds on an RTX 2070.
For reference, this is what other projects have done. The NixOS testing suite has a similar function to test graphical applications: it uses imagemagick to transform the image to a tiff plus a few other things I'm not familiar with, and there is a second option that returns different interpretations of the text. I don't have any first-hand experience of how good it is; with a little searching I did see one comment about it being unreliable for small fonts, but could not find any other complaints about its performance.
What would you like to be added:
Improved text recognition within screenshots.
Why is this needed:
Tesseract is pretty great, but sometimes doesn't recognise text on screenshots. We already scale the screenshot up 3x before running tesseract-ocr on it. That improved text recognition tremendously. But I think there's more we could do.
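For reference, the existing behaviour is roughly equivalent to the following (illustrative only; the file names are assumptions and the real code does this internally):

```bash
# Roughly what we do today: scale the screenshot up 3x, then OCR it.
convert screenshot.png -resize 300% screenshot-3x.png
tesseract screenshot-3x.png stdout
```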
Additional context:
It's possible to create our own dataset and train tesseract on it. Is that worthwhile? Is it worth training tesseract on the Ubuntu font, for example?
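If we did try it, one possible starting point would be generating synthetic training images rendered in the Ubuntu font with tesseract's text2image tool, which the standard training workflow consumes. This is only a sketch; training_text.txt, the font name and the font directory are assumptions that depend on the system.

```bash
# Sketch only: render training text in the Ubuntu font for tesseract
# training. Paths, font name and output base are assumptions.
text2image --text=training_text.txt \
           --outputbase=ubuntu.Ubuntu.exp0 \
           --font='Ubuntu' \
           --fonts_dir=/usr/share/fonts
```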