Support for Extracting PDF Content as XML #35

Closed
coroluca opened this issue Nov 23, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@coroluca

Hi, I’d like to use Extractous for my document processing tasks. I often need to extract PDF content as XML to retain structural information, such as page boundaries. This is a feature supported by Apache Tika, but it seems that currently, Extractous only provides plain text extraction.
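To illustrate why structured output matters here: Tika's XHTML output for PDFs wraps each page in a `<div class="page">` element, so page boundaries can be recovered by walking those elements. The sketch below uses only Python's standard library. The `SAMPLE` string is hand-written for illustration and is not real extractor output, and `split_pages` is a hypothetical helper name, not part of Tika or Extractous.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hand-written sample imitating Tika-style XHTML (assumption: not real output).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="page"><p>First page text.</p></div>
<div class="page"><p>Second page,</p><p>two paragraphs.</p></div>
</body>
</html>"""


def split_pages(xhtml: str) -> list[str]:
    """Return the text of each <div class="page"> element, in document order."""
    root = ET.fromstring(xhtml)
    pages = []
    for div in root.iter(f"{XHTML_NS}div"):
        if div.get("class") == "page":
            # itertext() yields every nested text node inside the page div
            text = " ".join(t.strip() for t in div.itertext() if t.strip())
            pages.append(text)
    return pages


pages = split_pages(SAMPLE)
for number, text in enumerate(pages, start=1):
    print(f"page {number}: {text}")
```

With plain text extraction, the same content would arrive as one undivided string, and the split between the two pages would be lost.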

Would it be possible to add support for XML extraction, similar to Tika’s functionality? This feature would be incredibly useful for preserving document structure.

Thank you for considering this request!

@nmammeri nmammeri added the enhancement New feature or request label Nov 25, 2024
@nmammeri
Contributor

Thanks for reporting this. We are working on it and will update this issue when we have a working implementation.

@davidmezzetti

Thanks for mentioning this project to me over on Reddit. I'll definitely consider integrating it into txtai as another text extraction engine once this change is in.

I've long thought Tika is a good solution but the Java piece trips a lot of people up.

@nmammeri
Contributor

nmammeri commented Dec 4, 2024

Thanks @davidmezzetti, we can definitely assist with the integration. Integrations with other frameworks such as txtai were always part of our plan. At the moment we are focusing on supporting the most commonly expected Tika features (including XML output). Then we can move on to integrations.
We'll get in touch once this is in.

@davidmezzetti

Sounds good. I should be able to add an integration fairly easily, ~20-30 lines with txtai once you have this change. I'll keep an eye on this!

@coroluca
Author

Hi @nmammeri,
do you have any updates on the progress of XML extraction support?
Thanks again for your work on this!

@nmammeri
Contributor

Hi @davidmezzetti and @coroluca, I'm glad to announce that we finally got the XML output feature in. Please check version 0.3.0 🎉. Thanks for your patience.

@davidmezzetti

Great work! I'll take a look.

@coroluca coroluca closed this as completed Jan 7, 2025

3 participants