Use PAVÉS instead of pdfminer #1272

dhdaines · 2025-02-07T18:52:00Z

This supersedes #1226 (alas, all was in vain). PAVÉS is a library that (among other things) uses PLAYA-PDF to provide a mostly drop-in replacement for pdfminer.six, minus a number of bugs and limitations.

It is somewhat faster and can also use multiple CPUs. This PR doesn't do that, as it isn't totally clear how to fit that into pdfplumber, but I will take a look at it when I get a minute.

Unlike #1226, this means that you can still use custom LAParams, for instance, while still getting marked content sections, color spaces that make sense, etc.

dhdaines · 2025-02-07T20:35:40Z

This is unfortunately a bit slower than pdfminer.six, in part because of the overhead of making a zillion useless LTChar and other objects before creating the final pdfplumber objects, but also because it adds some extra information that pdfminer.six was incapable of supplying.

Running `time pdfplumber ../PDF32000_2008.pdf >/dev/null` (that's the 756-page PDF 1.7 standard) on a fairly slow computer (Core i7-860 circa 2012), I get these results.

Using pdfminer.six (current develop branch):

real    5m5.912s
user    4m59.665s
sys     0m6.212s

Using PLAYA (branch in #1226):

real    4m32.255s
user    4m26.192s
sys     0m6.023s

Using PAVÉS (this branch):

real    5m20.015s
user    5m13.360s
sys     0m6.607s

We could definitely optimize this by using the code from #1226 in the case where there are no custom LAParams to worry about (since we have PLAYA already anyway).

Sadly I checked and there is no easy way to support the parallelism of PLAYA and PAVÉS with the current pdfplumber interface, otherwise it could be 2-3x faster.

dhdaines · 2025-02-07T21:02:21Z

> This is unfortunately a bit slower than pdfminer.six ...
> We could definitely optimize this by using the code from #1226 in the case where there are no custom LAParams to worry about (since we have PLAYA already anyway).

Okay, I did that and now it is faster again:

real    4m34.727s
user    4m28.572s
sys     0m6.123s
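To put the four wall-clock figures side by side, here is a small sketch that converts the `real` values reported in this thread to seconds and compares each against the pdfminer.six run. The labels in `REAL_TIMES` are informal names chosen for this example, and each figure comes from a single run, so the percentages are rough.

```python
import re


def parse_real(value: str) -> float:
    """Convert a `time`-style duration such as '5m5.912s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# "real" times copied from the runs reported in this thread.
REAL_TIMES = {
    "pdfminer.six": "5m5.912s",
    "PLAYA (#1226)": "4m32.255s",
    "PAVÉS": "5m20.015s",
    "PAVÉS + PLAYA fast path": "4m34.727s",
}


def percent_change(baseline: str, candidate: str) -> float:
    """Relative change of candidate versus baseline; negative means faster."""
    base = parse_real(baseline)
    return (parse_real(candidate) - base) / base * 100.0


if __name__ == "__main__":
    baseline = REAL_TIMES["pdfminer.six"]
    for name, value in REAL_TIMES.items():
        change = percent_change(baseline, value)
        print(f"{name}: {parse_real(value):.3f}s ({change:+.1f}%)")
```

By this measure the PLAYA branch is about 11% faster than pdfminer.six, PAVÉS alone is about 5% slower, and PAVÉS with the direct PLAYA path is about 10% faster.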

I should however mention that running the pdfplumber CLI on a big document like that is horrendously memory-inefficient since it processes all the pages at once before printing any results. I might make another PR for that (which could, potentially, use the parallelism in PLAYA).
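The difference can be sketched without any PDF library. In this illustration `render_page` is a hypothetical stand-in for whatever per-page work the CLI does, and pages are plain integers; it is not pdfplumber's actual CLI code. The eager version holds every page's output in memory before writing anything, while the lazy version holds one page at a time.

```python
import io
from typing import Iterable, Iterator, TextIO


def render_page(page_number: int) -> str:
    """Hypothetical stand-in for extracting one page and formatting it."""
    return f"page {page_number}\n"


def eager_dump(pages: Iterable[int], out: TextIO) -> None:
    """Process every page first, then write: memory grows with page count."""
    results = [render_page(p) for p in pages]
    for chunk in results:
        out.write(chunk)


def lazy_results(pages: Iterable[int]) -> Iterator[str]:
    """Yield one page's output at a time so earlier pages can be freed."""
    for p in pages:
        yield render_page(p)


def lazy_dump(pages: Iterable[int], out: TextIO) -> None:
    """Write each page as soon as it is ready."""
    for chunk in lazy_results(pages):
        out.write(chunk)


if __name__ == "__main__":
    eager, lazy = io.StringIO(), io.StringIO()
    eager_dump(range(1, 4), eager)
    lazy_dump(range(1, 4), lazy)
    print(eager.getvalue() == lazy.getvalue())
```

Both functions produce the same output; only the peak memory use differs. A generator like `lazy_results` is also a natural place to swap in a worker pool later, since the consumer only ever asks for the next result.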

jsvine · 2025-02-11T03:50:18Z

Impressive! Do you think there's an approach to implementing this where pdfminer.six and paves could be interchangeable? I.e., the user could select which engine to use?

dhdaines · 2025-02-13T14:35:04Z

> Impressive! Do you think there's an approach to implementing this where pdfminer.six and paves could be interchangeable? I.e., the user could select which engine to use?

Hmm, this wouldn't be terribly complicated given that the API is nearly identical - if you want to guarantee backward compatibility this would be the right way to go for the moment.
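A minimal sketch of what a selectable engine could look like, assuming both backends can be wrapped behind one callable signature. The `ENGINES` registry, the `engine` argument, and the two placeholder functions are hypothetical; none of them exist in pdfplumber, pdfminer.six, or PAVÉS.

```python
from typing import Callable, Dict, List

PageObjects = List[dict]
Engine = Callable[[str], PageObjects]

# Hypothetical registry mapping an engine name to its implementation.
ENGINES: Dict[str, Engine] = {}


def register_engine(name: str) -> Callable[[Engine], Engine]:
    """Decorator that records a backend under the given name."""
    def decorator(func: Engine) -> Engine:
        ENGINES[name] = func
        return func
    return decorator


@register_engine("pdfminer")
def _pdfminer_engine(path: str) -> PageObjects:
    # Placeholder: a real backend would call into pdfminer.six here.
    return [{"engine": "pdfminer", "path": path}]


@register_engine("paves")
def _paves_engine(path: str) -> PageObjects:
    # Placeholder: a real backend would call into PAVÉS here.
    return [{"engine": "paves", "path": path}]


def extract(path: str, engine: str = "pdfminer") -> PageObjects:
    """Dispatch to the chosen backend; the default keeps current behavior."""
    try:
        backend = ENGINES[engine]
    except KeyError:
        choices = ", ".join(sorted(ENGINES))
        raise ValueError(f"unknown engine {engine!r}; choose from: {choices}") from None
    return backend(path)


if __name__ == "__main__":
    print(extract("example.pdf"))
    print(extract("example.pdf", engine="paves"))
```

Defaulting to the existing backend is what would preserve backward compatibility: callers who never pass `engine` see no change.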

Ultimately I think PAVÉS could support the same subset of the pdfminer.six API using something else (something faster) under the hood such as pypdfium2 - I think this might be possible with the get_objects method: https://pypdfium2.readthedocs.io/en/v4/python_api.html#pypdfium2._helpers.page.PdfPage.get_objects

feat: use PAVÈS instead of pdfminer

6cebebc

dhdaines mentioned this pull request Feb 7, 2025

Update version of pdfminer-six to 20240706 #1166

Open

fix(deps): make sure to get the crypto add-on (should fix in paves...)

e9f314f

dhdaines mentioned this pull request Feb 7, 2025

Use PLAYA instead of pdfminer #1226

Draft

feat: use PLAYA directly if custom LAParams not needed

6df7e04

fix: tweak some fields and tests

1b26195
