Replace PyPDF2 with pypdfium2 #38

Open: wants to merge 3 commits into base: main


@yiwei-ang yiwei-ang commented Aug 23, 2023

Thanks, @alejandro-ao, for the great video demonstrating how OpenAI, PDF readers, and Streamlit fit together!

I've tried the tool on several PDFs and found a problem with text extraction quality when using PyPDF2: the contents of a PDF are not always extracted completely.

After looking into https://github.com/py-pdf/benchmarks, it seems we can switch to pypdfium2, which offers similar functionality with better text extraction quality and faster processing (verified on my end!).
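
For anyone who wants to see what the swap looks like, here is a minimal sketch of page-by-page text extraction with pypdfium2. It is not necessarily the exact code in this PR. The function name `get_pdf_text` and the `open_pdf` parameter are assumptions made for illustration (`open_pdf` exists only so the function can be exercised without a real PDF), and it assumes pypdfium2 4.x, where `PdfTextPage.get_text_range()` returns the page text.

```python
def get_pdf_text(pdf_docs, open_pdf=None):
    """Concatenate the text of every page of every PDF in pdf_docs.

    pdf_docs: iterable of file paths, bytes, or file-like objects
              (for example, files from Streamlit's file uploader).
    open_pdf: hypothetical hook for swapping the PDF opener; defaults to
              pypdfium2.PdfDocument.
    """
    if open_pdf is None:
        # Third-party dependency: pip install pypdfium2
        import pypdfium2 as pdfium
        open_pdf = pdfium.PdfDocument

    chunks = []
    for doc in pdf_docs:
        # Read file-like objects into bytes; pass paths and bytes through.
        data = doc.read() if hasattr(doc, "read") else doc
        pdf = open_pdf(data)
        try:
            for index in range(len(pdf)):
                page = pdf[index]
                textpage = page.get_textpage()
                # Assumes pypdfium2 4.x: returns the full text of the page.
                chunks.append(textpage.get_text_range())
        finally:
            # Release the underlying PDFium document handle.
            pdf.close()
    return "\n".join(chunks)
```

In the app, the result would feed into the same text-splitting step that previously received the PyPDF2 output, so the rest of the pipeline should not need to change.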

@yiwei-ang yiwei-ang changed the title from "Replace pypdfium2 with" to "Replace PyPDF2 with pypdfium2" on Aug 23, 2023

IlianP commented Sep 8, 2023

As a side note, LangChain also supports pypdfium2 as a document loader:
https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf#using-pypdfium2


costabm commented Nov 2, 2023

I have added this important feature to my larger pull request (my first one ever). I gave you credit there, but I'm not sure this is the right way to do it.

3 participants