compress pdf file #28

Hi, my input file to pdfocr is ~9 MB and my output file is about 390 MB. I tried to compress it with pdftk, but the size reduction is less than 0.1%. So I wonder whether it is possible to compress the PDF file. Thank you.
This problem is being caused by pdftoppm. I worked around it by bypassing this utility.

So is it possible to convert the ppm to jpg or something else and make the pdf file smaller?

Yes, though you can often skip the ppm step entirely. PDF allows an image to be encoded in other formats, including JPEG. The poppler utility pdfimages extracts each PDF-embedded image in its native format, resolution, and color depth. When the image format isn't supported by your OCR software, you can fall back to conversion; when it is supported, there's no loss of PDF compression efficiency.
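A minimal sketch of that workflow, assuming poppler-utils is installed and pdfimages is new enough to support -all (file names are placeholders):

```ruby
#!/usr/bin/env ruby
# Inspect, then extract, the images embedded in a PDF via pdfimages.
input = ARGV.fetch(0, "input.pdf")

# List each embedded image's format, resolution, and color depth.
system("pdfimages", "-list", input) or abort "pdfimages -list failed"

# Write every image out in its native encoding (JPEG stays JPEG),
# instead of round-tripping each page through ppm.
system("pdfimages", "-all", input, "img") or abort "pdfimages -all failed"
```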
Thank you. You said to skip the ppm step. Do you mean I should change some lines of the ruby code, or should I add some command line parameters instead of pdfocr -i input.pdf -o output.pdf?
You would need to change the ruby code to bypass the pdftoppm utility. The pdfimages utility is the more ideal route, but the imagemagick convert utility could also work. The bottom line is that the pdftoppm utility is nearly worthless for monochrome scanned PDF documents.
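To make that concrete, here is a hypothetical sketch of such a change; the helper below is an illustration under assumed names, not pdfocr's actual code:

```ruby
# Hypothetical stand-in for a pdftoppm-based page rasterizer.
def extract_page_image(pdf, page, prefix)
  # Preferred route: pull the page's embedded image out natively.
  ok = system("pdfimages", "-all",
              "-f", page.to_s, "-l", page.to_s, pdf, prefix)
  native = Dir.glob("#{prefix}-*").first
  return native if ok && native

  # Fallback: rasterize with ImageMagick instead of pdftoppm.
  # ImageMagick's [] page index is zero-based.
  out = "#{prefix}.png"
  system("convert", "-density", "300", "#{pdf}[#{page - 1}]", out) && out
end

# e.g. extract_page_image("input.pdf", 1, "page1")
```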
OK, thank you. I don't know Ruby, but I will give it a try.
This Perl script extracts all page images in their native format, if they're JPG, PNG, or TIFF, using the pdfimages utility. You could use it as the basis for a more capable converter.
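As a rough Ruby sketch of the same idea (not the Perl script itself; the column positions in the -list output are an assumption about current poppler):

```ruby
# Classify each embedded image by its PDF encoding to decide between
# native extraction and conversion.
require "open3"

input = ARGV.fetch(0, "input.pdf")
listing, status = Open3.capture2("pdfimages", "-list", input)
abort "pdfimages -list failed" unless status.success?

listing.each_line.drop(2).each do |line| # first two lines are headers
  cols = line.split
  next if cols.size < 9
  page, num, enc = cols[0], cols[1], cols[8]
  # jpeg streams extract natively as .jpg; flate ("image") comes out
  # as png with -all; ccitt/jbig2 usually need converting for OCR.
  note = %w[jpeg image].include?(enc) ? "native extract" : "convert first"
  puts "page #{page}, image #{num}: #{enc} -> #{note}"
end
```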
Thank you, I will give it a try. :)
The syntax for the Perl script is:
A PDF I am trying this on actually has multiple images making up each page. I don't know what was used to scan it, but most of these images are tiny PBMs corresponding to various small marks on the page; the text is stored in one or a few PBMs or JPEGs per page. For PDFs like this, I'm not sure whether it would be best to run the OCR engine on each image individually, or to convert each page to a complete image (as is done currently) before running it through the OCR engine. Either way, I would prefer that the text be incorporated into a copy of the original PDF, rather than building the output from the exported images plus the recognized text. I don't know how feasible that would be, but if possible, it seems like a good way to do it.