No support for ToXMLContentHandler #45

Closed
aleksandrskrivickis opened this issue May 17, 2024 · 2 comments

Comments

@aleksandrskrivickis

aleksandrskrivickis commented May 17, 2024

The current handler returns plain text. Tika can produce more structured output in the form of XML by using ToXMLContentHandler.

I propose introducing an optional parameter that enables XML output when more structured data is needed.
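To illustrate the kind of switch I have in mind, here is a rough sketch using only the Python standard library rather than Tika itself. Tika parsers emit XHTML SAX events to whatever content handler they are given, so the idea is the same: one handler keeps only character data, the other re-serializes the events as XML. The names `extract`, `xml_output`, `TextHandler`, and `SAMPLE` are hypothetical and are not part of this project's API.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator


class TextHandler(ContentHandler):
    """Collects character data only, discarding all markup (plain-text mode)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        self.parts.append(content)

    def text(self):
        return "".join(self.parts)


def extract(xhtml: str, xml_output: bool = False) -> str:
    """Hypothetical entry point: plain text by default, XML when requested."""
    data = xhtml.encode("utf-8")
    if xml_output:
        buffer = io.StringIO()
        # XMLGenerator writes every SAX event back out, so the tags survive.
        xml.sax.parseString(data, XMLGenerator(buffer, encoding="utf-8"))
        return buffer.getvalue()
    handler = TextHandler()
    xml.sax.parseString(data, handler)
    return handler.text()


# Stand-in for the XHTML event stream a parser would produce (assumed sample).
SAMPLE = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>"

if __name__ == "__main__":
    print(extract(SAMPLE))                   # text only, structure lost
    print(extract(SAMPLE, xml_output=True))  # element structure preserved
```

Keeping the default as plain text would leave existing callers unaffected, and the XML branch would only run when the flag is set.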

@arcaputo3
Contributor

Feel free to give this a shot: https://github.com/TJC-LP/tika-ocr/tree/TJC-LP/enable-xml-output

I'm going to test it in our Databricks workspace in the next few days, but locally it seems to work as expected.

@aleksandrskrivickis
Author

aleksandrskrivickis commented Jul 9, 2024

Thank you very much. I'm going to test the proposed changes now.
