Created a download argument for running the client #16

Open
wants to merge 1 commit into
base: master
from

amensiko

Hello!

My colleagues and I at NASA Jet Propulsion Laboratory have used the Grobid Python Client for an internal project and found it extremely useful for parsing scientific papers and extracting information from them. Grobid is certainly one of the most incredible parsing tools out there and it has helped us tremendously, so thank you so much for all your work!

One thing we really wanted from the client was the ability to parse PDFs without saving the output XML files locally. I didn't see an option or argument for this, so I added one. In short, passing the --download flag as False saves the output in a cache: a list of tuples in which each tuple represents one file and contains the filename, the path, and the XML output as a string. The cache (client.cache) can later be used for further parsing if need be (see the example in test-cache.py). Passing the --download flag as True saves the XML files locally, as the client did before my modifications.
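As a rough illustration of the cache shape described above, here is a minimal sketch. It does not reproduce the code in this PR: `fake_process_pdf`, `process_batch`, and the `.tei.xml` suffix are hypothetical names chosen for the example, the Grobid service call is replaced by a stub, and "the path" is assumed to mean the path of the input PDF.

```python
import os
from typing import List, Tuple

# One cache entry per processed file: (filename, path, XML string).
CacheEntry = Tuple[str, str, str]


def fake_process_pdf(pdf_path: str) -> str:
    """Hypothetical stand-in for the request to the Grobid service."""
    return f"<TEI><title>{os.path.basename(pdf_path)}</title></TEI>"


def process_batch(
    pdf_paths: List[str],
    output_dir: str,
    download: bool,
    cache: List[CacheEntry],
) -> None:
    """Write each result to disk when download is True, else append it to cache."""
    for pdf_path in pdf_paths:
        xml = fake_process_pdf(pdf_path)
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        filename = stem + ".tei.xml"  # assumed naming, for illustration only
        if download:
            os.makedirs(output_dir, exist_ok=True)
            out_path = os.path.join(output_dir, filename)
            with open(out_path, "w", encoding="utf-8") as handle:
                handle.write(xml)
        else:
            # Keep the result in memory instead of on disk.
            cache.append((filename, pdf_path, xml))


cache: List[CacheEntry] = []
process_batch(["/tmp/a.pdf", "/tmp/b.pdf"], "out", download=False, cache=cache)
for filename, path, xml in cache:
    print(filename, path, len(xml))
```

Every entry stays in memory until the caller clears the list, so the cache grows with the number of PDFs processed.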

I wanted to share my modifications in case they could be of use to others. Please let me know if you have any questions or concerns!

Anastasija

@kermitt2
Owner

Hi @amensiko!

Thanks a lot for the kind words about Grobid and for the PR!

If I understand correctly, the download option you introduce is actually a "write" option. The XML result is always downloaded, but you would like it not to be written to a file (as it is by default in process_pdf()) but kept in a str variable, with all these XML strings accumulated in an array in the client itself.

The issue with a cache maintained in the client itself is that it will blow up memory as soon as many PDFs are processed, and processing many PDFs is the purpose of this client (otherwise we would need an on-disk DB for the cache).

If I understand your use case correctly, maybe you would like the XML written to a stream passed to the client instead of to the file system? I guess we could use the StringIO class from Python's io standard library to pass a stream to the client, as an alternative to the default file system. When I wrote this client, it was more an example of how to use the Grobid API concurrently, to be adapted depending on the use case (writing to a DB, to a stream, etc.), but this would be an opportunity to think about a more generic/complete/packaged client.
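The stream idea above could look roughly like the following sketch. It is only an illustration under assumptions, not the client's actual API: `process_to_stream` and `fake_process_pdf` are hypothetical names, and the Grobid request is replaced by a stub. The only real library pieces used are `io.StringIO` and the `typing.TextIO` annotation.

```python
import io
from typing import TextIO


def fake_process_pdf(pdf_path: str) -> str:
    """Hypothetical stand-in for the request to the Grobid service."""
    return f"<TEI><title>{pdf_path}</title></TEI>"


def process_to_stream(pdf_path: str, sink: TextIO) -> None:
    """Write the XML result to any text stream supplied by the caller."""
    sink.write(fake_process_pdf(pdf_path))


# In-memory target: nothing is written to the file system.
buffer = io.StringIO()
process_to_stream("paper.pdf", buffer)
xml = buffer.getvalue()
print(len(xml))

# The same function also accepts a regular file opened in text mode, e.g.:
#     with open("paper.tei.xml", "w", encoding="utf-8") as handle:
#         process_to_stream("paper.pdf", handle)
```

With this shape the caller decides where results go and how long they are kept, so the client itself does not need to hold an ever-growing cache.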
