[documents] Benchmark PDF document reading + numpy conversion options #23
PDF reader benchmark:
As a conclusion:
|
Let's stick with PyMuPDF for now then and hope that we somehow manage to get around #113 later on! |
Just a short heads-up (I see that this issue got closed a while ago; people might still find it via Google):
|
Hi @MartinThoma 👋 Thanks for letting us know! For the sake of documentation, in #486, we considered another recent option: pypdfium. We should run a full performance benchmark, but the license is compatible with all OSS projects and the support is great so far :) |
PyPDF2 stays with BSD (3-Clause). Nice, I didn't know pypdfium. If you let me know how it extracts text from a PDF, I'll add it to the benchmark :-) |
Thanks for the compliment ;). FYI, I'm currently working on a full-scale API rewrite to fix some annoyances, so probably it would make sense to await this being merged before you implement something new with pypdfium2. |
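Something like this?

    import pypdfium2 as pdfium

    def get_pdfium_text(filepath: str) -> str:
        """Concatenate the extracted text of every page in the PDF."""
        text = ""
        doc = pdfium.PdfDocument(filepath)
        # len(doc) is the page count; iterate over the page indices
        for page_num in range(len(doc)):
            textpage = doc.get_textpage(page_num)
            text += textpage.get_text()
        return text
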
Do you have a sample PDF? I'm always interested in extending PyPDF2 test cases / the benchmarks 😄 |
For the old API, yes. Perhaps you'll yet want to insert a newline character after each page, and call
Sure, a sample document is attached here (generated by pypdfium2's test suite). |
I've added PDFium to the text extraction benchmarks: https://github.com/py-pdf/benchmarks The gist of it:
The quality is so good that I'm now going over the differences and see if I need to adjust the ground truth. The scores might change a bit today (in favor of PDFium). What I notice so far:
|
Thanks for the benchmark! I'm happy that someone is looking into the text extraction feature more thoroughly, because I don't personally use it in my projects yet. For the problems you mentioned, can you please point me at the files in question? Perhaps we can ask upstream about it. |
Looking at the quality results, I see pypdfium2 is almost equal to pymupdf for most documents, except for sample 13, where it only has 64% coverage. After downloading the document and running
|
I was able to fix the issue by adding |
Yes, that is very likely! I think that was also the document PyPDF2 struggled with. I still want to have it in the benchmark as invalid PDF documents are sadly pretty common. |
About that part: I actually like the behavior of PDFium better than Tika. It's about hyperlinks in the document. Tika adds them to the bottom of the extract, PDFium skips them. I think they should be skipped. I'm adjusting the ground truth. |
FYI, I just made a bugfix release containing the |
Very nice! Well done! The latest benchmark results show that PDFium is now a little bit better than PyMuPDF at extracting text from English/German documents (changes of the extracted text). It is still behind Tika, but not by much. The main part that changed: PyPDF2 has also improved, but is still noticeably behind Tika/PyMuPDF/PDFium. We will get closer again with the 2.1.0 release (expected end of June) :-) |
I'm sorry for the doctr folks that we hijacked this issue 😅 I was actually wondering whether a meta-package would be useful. Similar to how matplotlib lets you choose different backends, you could do the same for processing PDF documents. PyPDF2 could be a reasonable fallback if using Java / C++ or some of the licenses is not acceptable, but if it is acceptable, you could use a faster backend like PyMuPDF (I'm actually not sure what PDFium uses under the hood ... I guess C++? I also don't know which licenses PDFium and its dependencies have) |
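For illustration, the backend-selection idea could look roughly like the sketch below. This is only a minimal sketch under assumptions: `DEFAULT_BACKENDS` and `pick_backend` are hypothetical names that do not belong to any of the libraries discussed here, and the sketch only checks which backend module is importable, without wrapping any extraction API.

```python
import importlib.util
from typing import Dict, Optional

# Hypothetical registry: backend label -> name of the module that must be importable.
# The order expresses the default preference (faster backends first, PyPDF2 as fallback).
DEFAULT_BACKENDS: Dict[str, str] = {
    "pymupdf": "fitz",
    "pypdfium2": "pypdfium2",
    "pypdf2": "PyPDF2",
}


def pick_backend(
    preferred: Optional[str] = None,
    candidates: Optional[Dict[str, str]] = None,
) -> str:
    """Return the label of the first installed backend, trying `preferred` first."""
    candidates = DEFAULT_BACKENDS if candidates is None else candidates
    order = list(candidates)
    if preferred is not None:
        if preferred not in candidates:
            raise ValueError(f"unknown backend: {preferred!r}")
        # Move the preferred backend to the front of the search order.
        order.remove(preferred)
        order.insert(0, preferred)
    for name in order:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(candidates[name]) is not None:
            return name
    raise ImportError("no PDF backend is installed")
```

A caller could then run `pick_backend(preferred="pypdfium2")` and dispatch to whichever backend comes back, so a license-sensitive project can simply leave the other packages uninstalled.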
Oh please don't @MartinThoma :) That's exactly the reason why we document this type of comparison on issues like this one! |
Is the BSD 3-clause license not compatible with Apache? Personally, I don't care too much about the license. For most stuff I set MIT because it's easiest to read. For my public stuff I have the attitude: "Do whatever you want with it, but (1) don't sue me if things break (2) don't claim that I endorse your project if I didn't (3) if my software solves a core problem of your software, I would appreciate if you give credit - not strictly required, but appreciated / seems fair". However, as PyPDF2 is already a bit older and over 100 people contributed to it, I'm uncertain what it would mean to change the license / add a new license. I don't know whom I would need to ask for permission. In the worst case, all contributors... which would be infeasible, as I will not be able to reach all of them. |
Under the hood it's C++17 indeed. The public headers, however, are C only (luckily, otherwise it wouldn't be possible to use
I'm not a lawyer, but as far as I'm aware, they are perfectly compatible.
I think changing the licensing of pypdf2 is neither necessary nor feasible, as you would indeed need the written agreement of all contributors. In any case, there's nothing wrong with BSD-3-Clause, is there? |
@mara004 Congratulations! PDFium is now in first place! https://github.com/py-pdf/benchmarks - I've decided that text extraction should NOT add the target of a link - only the text of the link. That changed the order a bit (but I need to check if Tika has ways to customize its extraction format). Either way, PDFium does a great job |
It actually is. It's lesser known, I guess, because it's a bit outside of the MIT/Apache/GPL trio, but they are compatible :) Glad to read about the good perf on your benchmark :) |
Currently, the core reading of PDF documents is done with PyMuPDF. This needs to be benchmarked against alternatives to ensure we use the optimal backend here.
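A minimal timing harness for such a comparison might look like the sketch below. It is an illustration only: `benchmark_readers` is a hypothetical helper, and each reader is assumed to be a plain callable that takes a file path, so it says nothing about how any specific library should be called.

```python
import time
from statistics import mean
from typing import Callable, Dict


def benchmark_readers(
    readers: Dict[str, Callable[[str], object]],
    path: str,
    repeats: int = 5,
) -> Dict[str, float]:
    """Return the mean wall-clock time in seconds each reader needs for `path`."""
    results: Dict[str, float] = {}
    for name, read in readers.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            read(path)  # the return value is discarded; only the duration matters
            timings.append(time.perf_counter() - start)
        results[name] = mean(timings)
    return results
```

Each entry in `readers` would wrap one candidate backend (for example a function that opens the file and converts every page to a numpy array), so all candidates are timed on the same document under the same conditions.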