Describe the bug
If I run ocrmypdf with --skip-text on some PDF files that contain "real text", then the existing text gets replaced by ��
The reason I run it on a PDF with real text is that I have a document with sensitive information, similar to this one, that also contains screenshots.
Steps to reproduce
wget https://www.fcp.at/sites/default/files/2019-08/abstract_diplomarbeit_moschen.pdf
ocrmypdf -j 1 --optimize 01 -l deu+eng abstract_diplomarbeit_moschen.pdf output.pdf --skip-text -v1
Files
abstract_diplomarbeit_moschen.pdf
output.pdf
How did you download and install the software?
Linux package manager (apt, dnf, etc.)
OCRmyPDF version
16.1.1
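One way to confirm the symptom from the steps above might be to extract the text layer of the input and output files and count U+FFFD replacement characters in each. The following is only a minimal sketch, not part of the original report: it assumes `pdftotext` from poppler-utils is installed, and it assumes the damaged text surfaces as U+FFFD in the extracted output, which may not hold for every PDF. The function names `count_replacement_chars` and `extract_text` are hypothetical helpers introduced for illustration.

```python
import shutil
import subprocess
import sys

# U+FFFD is the Unicode replacement character, rendered as "�".
REPLACEMENT_CHAR = "\ufffd"


def count_replacement_chars(text: str) -> int:
    """Return how many U+FFFD characters appear in the given text."""
    return text.count(REPLACEMENT_CHAR)


def extract_text(pdf_path: str) -> str:
    """Extract the text layer with pdftotext (assumed to be installed).

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) was not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    # Undecodable bytes are also mapped to U+FFFD, so they get counted too.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Print one line per PDF given on the command line: path and count.
    for path in sys.argv[1:]:
        print(path, count_replacement_chars(extract_text(path)))
```

Saved under a name of your choice, it could be run with both files as arguments, for example `abstract_diplomarbeit_moschen.pdf output.pdf`. A count that is zero for the input but non-zero for the output would be consistent with the behaviour described here.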
Relevant log output
ocrmypdf 16.1.1    __main__.py:59
Running: ['tesseract', '--version']    __init__.py:133
Found tesseract 5.3.4.post44    __init__.py:342
Running: ['tesseract', '--version']    __init__.py:133
Running: ['gs', '--version']    __init__.py:133
Found gs 10.2.1    __init__.py:342
Running: ['gs', '--version']    __init__.py:133
Running: ['tesseract', '--list-langs']    __init__.py:133
stdout/stderr = List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):    __init__.py:73
deu
eng
osd
pikepdf mmap enabled    helpers.py:326
os.symlink(abstract_diplomarbeit_moschen_ink.pdf, /tmp/ocrmypdf.io.esbdwxy5/origin)    helpers.py:179
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/origin, /tmp/ocrmypdf.io.esbdwxy5/origin.pdf)    helpers.py:179
Gathering info with 1 thread workers    info.py:772
pikepdf mmap enabled    helpers.py:326
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Using Tesseract OpenMP thread limit 1    tesseract_ocr.py:183
pikepdf mmap enabled    helpers.py:326
1 skipping all processing on this page    _pipeline.py:319
2 skipping all processing on this page    _pipeline.py:319
1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0    _graft.py:140
1 Page rotation: (content, auto) -> page = (0, 0) -> 0    _graft.py:165
2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0    _graft.py:140
2 Page rotation: (content, auto) -> page = (0, 0) -> 0    _graft.py:165
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Postprocessing...    ocr.py:146
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/graft_layers.pdf, /tmp/ocrmypdf.io.esbdwxy5/fix_docinfo.pdf)    helpers.py:179
Running: ['gs', '--version']    __init__.py:133
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.esbdwxy5/fix_docinfo.pdf', '/tmp/ocrmypdf.io.esbdwxy5/pdfa.ps']    __init__.py:133
GPL Ghostscript 10.02.1 (2023-11-01)    __init__.py:108
Copyright (C) 2023 Artifex Software, Inc. All rights reserved.    __init__.py:108
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:    __init__.py:108
see the file COPYING for details.    __init__.py:108
Processing pages 1 through 2.    __init__.py:108
Page 1    __init__.py:108
Page 2    __init__.py:108
Running: ['tesseract', '--version']    __init__.py:133
Optimizable images: JPEGs: 0 PNGs: 0    optimize.py:349
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Optimizable images: JBIG2 groups: 0    optimize.py:360
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.esbdwxy5/optimize.opt.pdf, /tmp/ocrmypdf.io.esbdwxy5/optimize.pdf)    helpers.py:179
Running: ['jbig2', '--version']    __init__.py:133
Running: ['pngquant', '--version']    __init__.py:133
Image optimization ratio: 1.00 savings: 0.0%    _pipeline.py:976
Total file size ratio: 0.73 savings: -37.0%