compress pdf file #28

Open
cbjcbj opened this issue Nov 24, 2016 · 11 comments

Comments

@cbjcbj

cbjcbj commented Nov 24, 2016

Hi, my input file for pdfocr is ~9 MB and my output file is about 390 MB. I tried to use pdftk to compress it, but the size reduction is less than 0.1%. So I wonder whether it is possible to compress the PDF file. Thank you.
pic

@wilsotc

wilsotc commented Jan 3, 2017

This problem is being caused by pdftoppm. I worked around it by bypassing this utility.
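
As a rough illustration of why a PPM intermediate can inflate the output so much: PPM stores raw, uncompressed pixels. The sketch below estimates the pixel data for one page. The page size (US Letter), the 300 DPI resolution, and the 24-bit RGB depth are assumptions chosen for the example; the resolution pdfocr actually passes to pdftoppm is not stated in this thread, and `raw_ppm_bytes` is a hypothetical helper name.

```python
def raw_ppm_bytes(width_in, height_in, dpi, channels=3):
    """Approximate pixel-data size of one uncompressed 8-bit-per-channel page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # One byte per channel per pixel; the small PPM header is ignored.
    return width_px * height_px * channels


# Assumed values: US Letter page, 300 DPI, RGB.
per_page = raw_ppm_bytes(8.5, 11, 300)
print(f"{per_page / 1e6:.1f} MB of raw pixels per page")
```

Under these assumptions a single page is roughly 25 MB before any recompression, so a document of a few dozen pages can plausibly reach hundreds of megabytes if the rasters are embedded with little or no compression.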

@cbjcbj
Author

cbjcbj commented Jan 4, 2017

So is it possible to convert ppm to jpg or something else and make the pdf file smaller?

@wilsotc

wilsotc commented Jan 4, 2017 via email

@cbjcbj
Author

cbjcbj commented Jan 4, 2017

Thank you. You mentioned skipping the ppm step. Do you mean I should change some lines of the Ruby code, or should I add some command-line parameters instead of running pdfocr -i input.pdf -o output.pdf?

@wilsotc

wilsotc commented Jan 4, 2017 via email

@cbjcbj
Author

cbjcbj commented Jan 4, 2017

OK, thank you. I don't know Ruby, but I will give it a try.

@wilsotc

wilsotc commented Jan 4, 2017

fix.txt

@wilsotc

wilsotc commented Jan 4, 2017

This Perl script uses the pdfimages utility to extract all page images in their native format, provided they are JPG, PNG, or TIFF. You could use it as the basis for a more capable converter.
test.zip
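
For readers who cannot open the attachment, the general idea of pulling embedded images out with pdfimages can be sketched in Python. This is not the attached Perl script, only a minimal sketch that assumes a Poppler build of pdfimages that supports the `-all` option is on the PATH. The function names and the `page` file prefix are made up for the example.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Build the pdfimages call; -all keeps JPEG and similar streams as-is
    and writes other images as PNG (or TIFF for CMYK)."""
    return ["pdfimages", "-all", str(pdf_path), str(prefix)]


def extract_native_images(pdf_path, out_dir):
    """Run pdfimages and return the extracted files (requires Poppler)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdfimages exits with an error.
    subprocess.run(pdfimages_command(pdf_path, out_dir / "page"), check=True)
    return sorted(out_dir.iterdir())
```

Calling `extract_native_images("input.pdf", "images")` would leave files named like `page-000.jpg` in the `images` directory, without the decompress-to-PPM step that inflates the size.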

@cbjcbj
Author

cbjcbj commented Jan 4, 2017

Thank you, I will have a try:)

@wilsotc

wilsotc commented Jan 4, 2017

The syntax for the Perl script is:
test2.pl -i

@wodin

wodin commented Jan 17, 2017

A PDF I am trying this on actually has multiple images making up each page of the PDF. I don't know what was used to scan the PDF, but most of these images are actually tiny PBMs corresponding to various small marks on the page. The text is stored in one or a few PBMs or JPEGs per page.

For PDFs like this, I'm not sure whether it would be best to run the OCR engine on each image individually or, as is done currently, to convert each page to a complete image before running it through the OCR engine. Either way, I would prefer it if the text could be incorporated into a copy of the original PDF instead of using the exported images and the text to create the output.

I don't know how feasible this would be, but if possible, that seems like it would be a good way to do it.
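
One way to check whether a PDF is built from many small images per page is to count the rows printed by `pdfimages -list`. The sketch below parses that listing. The `SAMPLE` text is invented for illustration (it is not output from the PDF described above), and the exact column layout may differ between Poppler versions; the parser only relies on each data row starting with a page number.

```python
from collections import Counter

# Made-up listing in the general shape of `pdfimages -list` output.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID
---------------------------------------------------------------------
   1     0 image    2480  3508  gray    1   8  jpeg   no         8  0
   1     1 stencil    14    11  -       1   1  ccitt  no         9  0
   1     2 stencil     9     7  -       1   1  ccitt  no        10  0
   2     0 image    2480  3508  gray    1   8  jpeg   no        14  0
"""


def images_per_page(list_output):
    """Count embedded images per page from a pdfimages -list style listing."""
    counts = Counter()
    for line in list_output.splitlines():
        fields = line.split()
        # Data rows begin with a numeric page; header and rule lines do not.
        if fields and fields[0].isdigit():
            counts[int(fields[0])] += 1
    return dict(counts)


print(images_per_page(SAMPLE))
```

A page with many entries, most of them tiny, would match the layout described here, and that count could help decide between OCR on individual images and OCR on a full-page render.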

3 participants