compress pdf file #28

cbjcbj · 2016-11-24T01:54:34Z

Hi, my input file for pdfocr is ~9 MB and the output file is about 390 MB. I tried to use pdftk to compress it, but the size reduction is less than 0.1%. Is it possible to compress the PDF file? Thank you.

wilsotc · 2017-01-03T16:07:35Z

This problem is being caused by pdftoppm. I worked around it by bypassing this utility.

cbjcbj · 2017-01-04T00:57:05Z

So is it possible to convert the PPM to JPG or something else to make the PDF file smaller?

wilsotc · 2017-01-04T02:42:01Z

Yes. You can often skip the ppm format step though. PDF allows you to encode an image using other image formats including jpeg. The poppler utility pdfimages extracts the PDF encoded image in its native format, resolution, and color depth. When the image format isn't supported by your OCR software, you can fall back to conversion. When it is supported, there's no loss of PDF compression efficiency.
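As a rough illustration of the extraction step described above, here is a minimal Python sketch (not the pdfocr code, which is Ruby). It assumes poppler's `pdfimages` is on the PATH and supports the `-all` option; the helper names, the `img` output prefix, and the set of formats the OCR engine accepts are all hypothetical.

```python
import subprocess
from pathlib import Path

# Assumption: the formats your OCR engine reads directly. Adjust as needed.
OCR_FORMATS = {".jpg", ".png", ".tif"}


def build_pdfimages_cmd(pdf_path, out_prefix):
    """Return the argv that extracts embedded images in their stored format."""
    # -all writes JPEG, JPEG 2000, JBIG2 and CCITT streams as they are stored
    # in the PDF and saves other images as PNG (or TIFF for CMYK).
    return ["pdfimages", "-all", str(pdf_path), str(out_prefix)]


def needs_conversion(image_path, supported=OCR_FORMATS):
    """True if the extracted file must be converted before OCR."""
    return Path(image_path).suffix.lower() not in supported


def extract_images(pdf_path, out_dir):
    """Run pdfimages and return the files it wrote (requires poppler-utils)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdfimages_cmd(pdf_path, out_dir / "img"), check=True)
    return sorted(out_dir.glob("img-*"))
```

Files for which `needs_conversion` returns `True` would go through the fallback conversion; the rest can be handed to the OCR engine unchanged.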

cbjcbj · 2017-01-04T03:00:23Z

Thank you. You mentioned skipping the ppm step. Do you mean I should change some lines of the Ruby code, or should I add some command-line parameters to pdfocr -i input.pdf -o output.pdf?

wilsotc · 2017-01-04T03:09:53Z

You would need to change the Ruby code to bypass the pdftoppm utility. The pdfimages utility is the better route, but the ImageMagick convert utility could also be an improvement. The bottom line is that the pdftoppm utility is nearly worthless for monochrome scanned PDF documents.
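A back-of-the-envelope sketch of why rasterizing a monochrome scan to colour PPM is costly. The page size (US Letter) and the 300 dpi resolution are assumptions for illustration only, and the few bytes of file header are ignored.

```python
import math


def raw_ppm_bytes(width_px, height_px):
    """Pixel data size of a binary PPM (P6) with 8-bit samples."""
    # Three uncompressed bytes (R, G, B) per pixel.
    return width_px * height_px * 3


def raw_pbm_bytes(width_px, height_px):
    """Pixel data size of a binary PBM (P4), i.e. 1 bit per pixel."""
    # Eight pixels per byte, each row padded up to a whole byte.
    return math.ceil(width_px / 8) * height_px


dpi = 300                                        # assumed scan resolution
width_px, height_px = round(8.5 * dpi), 11 * dpi  # assumed US Letter page

ppm = raw_ppm_bytes(width_px, height_px)
pbm = raw_pbm_bytes(width_px, height_px)
print(f"PPM: {ppm} bytes, PBM: {pbm} bytes, ratio: {ppm / pbm:.1f}x")
```

Under these assumptions the colour raster is roughly 24 times larger than the 1-bit data per page, and that is before the bilevel compression a scanned PDF normally uses for monochrome images.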

cbjcbj · 2017-01-04T07:19:46Z

OK, thank you. I don't know Ruby, but I will give it a try.

wilsotc · 2017-01-04T13:10:36Z

fix.txt

wilsotc · 2017-01-04T13:17:50Z

This Perl script uses the pdfimages utility to extract all page images in their native format if they're JPG, PNG, or TIFF. You could use it as the basis for a more capable converter.
test.zip

cbjcbj · 2017-01-04T13:42:44Z

Thank you, I will have a try:)

wilsotc · 2017-01-04T13:45:20Z

The syntax for the Perl script is:
test2.pl -i

wodin · 2017-01-17T15:41:28Z

A PDF I am trying this on actually has multiple images making up each page of the PDF. I don't know what was used to scan the PDF, but most of these images are actually tiny PBMs corresponding to various small marks on the page. The text is stored in one or a few PBMs or JPEGs per page.

For PDFs like this, I'm not sure whether it would be best to run the OCR engine on each image individually or to convert each page to a complete image as is done currently before running it through the OCR engine, but either way, I would prefer it if the text could be incorporated into a copy of the original PDF instead of using the exported images and the text to create the output.

I don't know how feasible this would be, but if possible, that seems like it would be a good way to do it.

This was referenced Jan 17, 2017

Resulting PDFs are not searchable in OS X Preview.app #27

Open

Preserve image data (/filesize) from original PDF #15

Open

TDavLinguist mentioned this issue Jun 16, 2017

Need to decrease pdf size #30

Open
