I would like to scan custom extensions as well. I work a lot with structured documents like .csv, .xml, .json etc. These could be scanned like normal text files.
Ah, a good requirement! But what about document metadata? I don't think authors can be extracted from these files; the only viable information would be the last-modified date and the language of the extracted content. The new NLP features might find some named entities, but I don't think there are more options here. What do you think?
One can extend Tika to extract metadata if those XML, JSON, etc. files have a known structure and contain the necessary information.
Since there is always going to be someone who says "I miss extension X", I wonder if it would make sense to use configurable patterns to decide which files to scan?
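To make the pattern idea concrete, here is a minimal sketch of what I have in mind, in Python purely for illustration. The `SCAN_PATTERNS` list and the `should_scan` helper are hypothetical names, not existing settings or functions of this project; the matching itself uses glob-style patterns from the standard library.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath

# Hypothetical user-configurable glob patterns (an assumption, not an
# existing option of the tool).
SCAN_PATTERNS = ["*.csv", "*.xml", "*.json"]


def should_scan(path, patterns=SCAN_PATTERNS):
    """Return True if the file name matches any of the glob patterns.

    Both the name and the patterns are lowercased, so matching is
    case-insensitive on every platform.
    """
    name = PurePath(path).name.lower()
    return any(fnmatchcase(name, pattern.lower()) for pattern in patterns)


if __name__ == "__main__":
    for candidate in ["data/report.CSV", "config.json", "photo.png"]:
        print(candidate, should_scan(candidate))
```

With something like this, supporting a new extension would be a configuration change (adding `*.log`, for example) rather than a code change.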