Use PAVÉS instead of pdfminer #1272

Open
wants to merge 4 commits into
base: develop

dhdaines (Contributor) commented Feb 7, 2025

This supersedes #1226 (alas, all was in vain). PAVÉS is a library that (among other things) uses PLAYA-PDF to provide a mostly drop-in replacement for pdfminer.six, minus a number of bugs and limitations.

It is somewhat faster and can also use multiple CPUs. This PR doesn't do that, as it isn't totally clear how to fit parallelism into pdfplumber, but I will take a look at it when I get a minute.

In contrast to #1226, this means that you can still use custom LAParams, for instance, while still getting marked content sections, color spaces that make sense, etc.

dhdaines (Contributor, Author) commented Feb 7, 2025

This is unfortunately a bit slower than pdfminer.six, in part because of the overhead of making a zillion useless LTChar and other objects before creating the final pdfplumber objects, but also because it adds some extra information that pdfminer.six was incapable of supplying.

Running `time pdfplumber ../PDF32000_2008.pdf >/dev/null` (that's the 756-page PDF 1.7 standard) on a fairly slow computer (Core i7-860 circa 2012), I get these results:

Using pdfminer.six (current develop branch):

real    5m5.912s
user    4m59.665s
sys     0m6.212s

Using PLAYA (branch in #1226):

real    4m32.255s
user    4m26.192s
sys     0m6.023s

Using PAVÉS (this branch):

real    5m20.015s
user    5m13.360s
sys     0m6.607s

We could definitely optimize this by using the code from #1226 in the case where there are no custom LAParams to worry about (since we have PLAYA already anyway).
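
As a rough illustration of that fast path, here is a minimal sketch of the dispatch. It assumes that `laparams is None` means "no custom layout parameters"; `iter_page_objects`, `direct_extractor` and `layout_extractor` are hypothetical names for this example, not functions from pdfplumber, PLAYA or PAVÉS.

```python
from typing import Any, Callable, Iterator, Optional


def iter_page_objects(
    page: Any,
    laparams: Optional[Any],
    direct_extractor: Callable[[Any], Iterator[dict]],
    layout_extractor: Callable[[Any, Any], Iterator[dict]],
) -> Iterator[dict]:
    """Pick the cheaper extraction path when no layout analysis is requested.

    Both extractors are hypothetical stand-ins: one for the direct code
    path from #1226, one for the layout-analysis path used in this branch.
    """
    if laparams is None:
        # No custom LAParams: skip building intermediate layout objects.
        yield from direct_extractor(page)
    else:
        # Custom LAParams: go through the layout-analysis path.
        yield from layout_extractor(page, laparams)
```

The two extractors are passed in as arguments only to keep the sketch self-contained; in real code they would be whatever functions implement each path.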

Sadly, I checked, and there is no easy way to support the parallelism of PLAYA and PAVÉS with the current pdfplumber interface; otherwise it could be 2-3x faster.

dhdaines (Contributor, Author) commented Feb 7, 2025

> This is unfortunately a bit slower than pdfminer.six ...
> We could definitely optimize this by using the code from #1226 in the case where there are no custom LAParams to worry about (since we have PLAYA already anyway).

Okay, I did that and now it is faster again:

real    4m34.727s
user    4m28.572s
sys     0m6.123s
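
To put the four wall-clock measurements in this thread side by side, here is a small sketch that converts the `real` values reported by `time` into seconds and expresses each as a ratio of the pdfminer.six baseline. The helper names (`parse_time`, `relative_to_baseline`) and the dictionary labels are made up for this example; the timings are copied from the comments above.

```python
import re
from typing import Dict


def parse_time(value: str) -> float:
    """Convert a `time`-style duration such as '5m5.912s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# "real" times reported in this thread for PDF32000_2008.pdf.
REAL_TIMES: Dict[str, str] = {
    "pdfminer.six (develop)": "5m5.912s",
    "PLAYA (#1226)": "4m32.255s",
    "PAVÉS (this branch, first run)": "5m20.015s",
    "PAVÉS (with the #1226 fast path)": "4m34.727s",
}


def relative_to_baseline(times: Dict[str, str], baseline: str) -> Dict[str, float]:
    """Return each run's wall-clock time as a fraction of the baseline run."""
    base = parse_time(times[baseline])
    return {name: parse_time(value) / base for name, value in times.items()}


if __name__ == "__main__":
    ratios = relative_to_baseline(REAL_TIMES, "pdfminer.six (develop)")
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}x the baseline wall-clock time")
```

On these numbers, the first PAVÉS run is roughly 5% slower than pdfminer.six, and the run with the fast path is roughly 10% faster.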

I should however mention that running the pdfplumber CLI on a big document like that is horrendously memory-inefficient since it processes all the pages at once before printing any results. I might make another PR for that (which could, potentially, use the parallelism in PLAYA).
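
The general shape of such a change could look like the sketch below: convert one page at a time and write its output before touching the next page, so that only one page's results are held in memory. This is a generic pattern, not the pdfplumber CLI's actual code; `convert_lazily`, `write_results` and the `convert` callback are hypothetical names for this example.

```python
import sys
from typing import Any, Callable, Iterable, Iterator, TextIO


def convert_lazily(
    pages: Iterable[Any], convert: Callable[[Any], str]
) -> Iterator[str]:
    """Yield each page's output as soon as that page has been converted."""
    for page in pages:
        # Nothing is accumulated here: the caller consumes one result
        # before the next page is processed.
        yield convert(page)


def write_results(results: Iterable[str], out: TextIO = sys.stdout) -> None:
    """Write results one line at a time as they become available."""
    for line in results:
        out.write(line + "\n")
```

A caller would combine the two as `write_results(convert_lazily(pages, convert))`, in contrast to building a full list of results first and printing it at the end.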

jsvine (Owner) commented Feb 11, 2025

Impressive! Do you think there's an approach to implementing this where pdfminer.six and paves could be interchangeable? I.e., the user could select which engine to use?

dhdaines (Contributor, Author) commented

> Impressive! Do you think there's an approach to implementing this where pdfminer.six and paves could be interchangeable? I.e., the user could select which engine to use?

Hmm, this wouldn't be terribly complicated, given that the API is nearly identical. If you want to guarantee backward compatibility, this would be the right way to go for the moment.
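
One possible shape for a selectable engine is a small registry keyed by name, with pdfminer.six as the default for backward compatibility. Everything below is a hypothetical sketch: `register_engine`, `open_pages` and the `engine` argument are not part of pdfplumber's API, and the two backends are placeholders that only echo their inputs.

```python
from typing import Callable, Dict, Iterator

# A page source takes a file path and yields per-page dictionaries.
PageSource = Callable[[str], Iterator[dict]]

_ENGINES: Dict[str, PageSource] = {}


def register_engine(name: str) -> Callable[[PageSource], PageSource]:
    """Decorator that registers a backend under the given engine name."""
    def decorator(func: PageSource) -> PageSource:
        _ENGINES[name] = func
        return func
    return decorator


@register_engine("pdfminer")
def _pdfminer_pages(path: str) -> Iterator[dict]:
    # Placeholder: a real backend would parse the file with pdfminer.six.
    yield {"engine": "pdfminer", "path": path}


@register_engine("paves")
def _paves_pages(path: str) -> Iterator[dict]:
    # Placeholder: a real backend would parse the file with PAVÉS.
    yield {"engine": "paves", "path": path}


def open_pages(path: str, engine: str = "pdfminer") -> Iterator[dict]:
    """Return pages from the selected backend; pdfminer stays the default."""
    try:
        source = _ENGINES[engine]
    except KeyError:
        choices = ", ".join(sorted(_ENGINES))
        raise ValueError(f"unknown engine {engine!r}; choose from: {choices}") from None
    return source(path)
```

Because both backends expose the same call signature, the rest of the code would not need to know which engine produced the pages.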

Ultimately, I think PAVÉS could support the same subset of the pdfminer.six API using something else (something faster) under the hood, such as pypdfium2. I think this might be possible with the get_objects method: https://pypdfium2.readthedocs.io/en/v4/python_api.html#pypdfium2._helpers.page.PdfPage.get_objects
