Use PAVÉS instead of pdfminer #1272
This is unfortunately a bit slower than Running Using pdfminer.six (current develop branch):
Using PLAYA (branch in #1226):
Using PAVÉS (this branch):
We could definitely optimize this by using the code from #1226 in the case where there are no custom Sadly I checked and there is no easy way to support the parallelism of PLAYA and PAVÉS with the current pdfplumber interface, otherwise it could be 2-3x faster. |
Okay, I did that and now it is faster again:
I should however mention that running the |
Impressive! Do you think there's an approach to implementing this where |
Hmm, this wouldn't be terribly complicated given that the API is nearly identical - if you want to guarantee backward compatibility this would be the right way to go for the moment. Ultimately I think PAVÉS could support the same subset of the pdfminer.six API using something else (something faster) under the hood such as pypdfium2 - I think this might be possible with the |
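Since the API is described above as nearly identical, one way a backward-compatible setup might look is an import-time fallback: prefer the faster backend when it is installed, otherwise fall back to pdfminer.six. This is only a rough sketch and not what this PR does; the helper `pick_backend` is hypothetical, and the module names in the usage note are assumptions rather than something confirmed in this thread.

```python
import importlib
from types import ModuleType
from typing import Sequence


def pick_backend(candidates: Sequence[str]) -> ModuleType:
    """Return the first module in `candidates` that can be imported.

    Hypothetical helper: candidates are tried in order, so the preferred
    backend goes first and the fallback last.
    """
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise ImportError("no backend available (" + "; ".join(errors) + ")")
```

With this, something like `miner = pick_backend(["paves.miner", "pdfminer.high_level"])` would pick PAVÉS when present and pdfminer.six otherwise, assuming both modules expose the names the caller needs. The catch is that "nearly identical" is not "identical", so any difference in behavior would still have to be handled by the caller.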
This supersedes #1226 (alas, all was in vain). PAVÉS is a library that (among other things) uses PLAYA-PDF to provide a mostly drop-in replacement for pdfminer.six, minus a number of bugs and limitations.
It is somewhat faster and can also use multiple CPUs. This PR doesn't do that, as it isn't totally clear how to fit it into pdfplumber, but I will take a look at it when I get a minute.
By contrast to #1226, this means that you can still use custom LAParams, for instance. But you still get marked content sections, color spaces that make sense, etc.