[Bug]: real text replaced by � � (visually unchanged, only by copying) #1297

JoKalliauer · 2024-04-24T13:45:48Z

Describe the bug

If I run ocrmypdf with --skip-text on some PDF files that contain "real text", then the existing text gets replaced by ��

The reason I run it on a file with real text is that I have a document with sensitive information, similar to this one, that also contains screenshots.

Steps to reproduce

wget https://www.fcp.at/sites/default/files/2019-08/abstract_diplomarbeit_moschen.pdf
ocrmypdf -j 1 --optimize 01 -l deu+eng abstract_diplomarbeit_moschen.pdf output.pdf --skip-text -v1
Open output.pdf
Copy text into any text-application (notepad++/editor/writer/libre office/...)
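
The manual copy-and-paste check above could also be scripted. The following is only a rough sketch, not part of OCRmyPDF: it assumes poppler's `pdftotext` command is installed, and the helper names `replacement_ratio` and `extract_text` are hypothetical. It reports what share of the extracted, non-whitespace characters are U+FFFD (the � replacement character) so the input and output files can be compared.

```python
import subprocess
import sys

REPLACEMENT_CHAR = "\ufffd"  # the character shown as � when text cannot be decoded


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def extract_text(pdf_path: str) -> str:
    """Extract the text layer with poppler's pdftotext (assumed to be on PATH)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Example: python check_text.py abstract_diplomarbeit_moschen.pdf output.pdf
    for path in sys.argv[1:]:
        print(f"{path}: {replacement_ratio(extract_text(path)):.1%} replacement characters")
```

If the bug is present, the ratio for output.pdf should be clearly higher than for the input file. Different extractors may map unmappable glyphs differently, so this is a rough indicator rather than a definitive check.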

Files

abstract_diplomarbeit_moschen.pdf

output.pdf

How did you download and install the software?

Linux package manager (apt, dnf, etc.)

OCRmyPDF version

16.1.1

Relevant log output


ocrmypdf 16.1.1                                                                                                                                                               __main__.py:59
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Found tesseract 5.3.4.post44                                                                                                                                                 __init__.py:342
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Found gs 10.2.1                                                                                                                                                              __init__.py:342
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Running: ['tesseract', '--list-langs']                                                                                                                                       __init__.py:133
stdout/stderr = List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):                                                                                    __init__.py:73
deu
eng
osd

pikepdf mmap enabled                                                                                                                                                          helpers.py:326
os.symlink(abstract_diplomarbeit_moschen_ink.pdf, /tmp/ocrmypdf.io.esbdwxy5/origin)                                                                                           helpers.py:179
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/origin, /tmp/ocrmypdf.io.esbdwxy5/origin.pdf)                                                                                            helpers.py:179
Gathering info with 1 thread workers                                                                                                                                             info.py:772
pikepdf mmap enabled                                                                                                                                                          helpers.py:326
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Using Tesseract OpenMP thread limit 1                                                                                                                                   tesseract_ocr.py:183
pikepdf mmap enabled                                                                                                                                                          helpers.py:326
    1 skipping all processing on this page                                                                                                                                  _pipeline.py:319
    2 skipping all processing on this page                                                                                                                                  _pipeline.py:319
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                         _graft.py:140
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                     _graft.py:165
    2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                         _graft.py:140
    2 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                     _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Postprocessing...                                                                                                                                                                 ocr.py:146
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/graft_layers.pdf, /tmp/ocrmypdf.io.esbdwxy5/fix_docinfo.pdf)                                                                             helpers.py:179
Running: ['gs', '--version']                                                                                                                                                 __init__.py:133
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None',                                               __init__.py:133
'-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2',
'-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.esbdwxy5/fix_docinfo.pdf', '/tmp/ocrmypdf.io.esbdwxy5/pdfa.ps']
GPL Ghostscript 10.02.1 (2023-11-01)                                                                                                                                         __init__.py:108
Copyright (C) 2023 Artifex Software, Inc.  All rights reserved.                                                                                                              __init__.py:108
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:                                                                                                   __init__.py:108
see the file COPYING for details.                                                                                                                                            __init__.py:108
Processing pages 1 through 2.                                                                                                                                                __init__.py:108
Page 1                                                                                                                                                                       __init__.py:108
Page 2                                                                                                                                                                       __init__.py:108
Running: ['tesseract', '--version']                                                                                                                                          __init__.py:133
Optimizable images: JPEGs: 0 PNGs: 0                                                                                                                                         optimize.py:349
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Optimizable images: JBIG2 groups: 0                                                                                                                                          optimize.py:360
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/optimize.opt.pdf, /tmp/ocrmypdf.io.esbdwxy5/optimize.pdf)                                                                                helpers.py:179
Running: ['jbig2', '--version']                                                                                                                                              __init__.py:133
Running: ['pngquant', '--version']                                                                                                                                           __init__.py:133
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                _pipeline.py:976
Total file size ratio: 0.73 savings: -37.0%


JoKalliauer added the bug label Apr 24, 2024

JoKalliauer assigned jbarlow83 Apr 24, 2024

JoKalliauer changed the title ~~[Bug]: real text replaced by � �~~ [Bug]: real text replaced by � � (visually unchanged, only by copying) Apr 29, 2024
