It should be straightforward to add an option to ignore the content column, but Tika still requires the entire binary to be in memory to do OCR, so IMO memory is the bottleneck, not storage.
We've had success using small partitions and high-memory clusters for larger workflows. You can set spark.sql.files.maxPartitionBytes to 4194304 (4 MB) instead of Spark's 128 MB default, which reduces how much file data each task reads at once, for example:
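A minimal sketch of that setup; the "tika" format name comes from this thread, while the path and app name are placeholders for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tika-small-partitions").getOrCreate()

// Target ~4 MB input partitions instead of Spark's 128 MB default,
// so each task holds fewer binaries in memory at a time.
spark.conf.set("spark.sql.files.maxPartitionBytes", 4194304L)

// Assumed: the data source is registered under the name "tika".
val docs = spark.read.format("tika").load("/mnt/raw/documents")
```

Note this only shrinks how much input each task scans; any single document that Tika must OCR still has to fit in memory in one piece.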
Dear @aamend @alexott @nfx,
I appreciate your work on making the tika file format possible. After reviewing the serialiser code, I noticed that you store the binary file as one of the columns.
Such a construct does not allow a stable workflow at a scale of more than roughly 1,000 large documents.
It could be prudent to store the binary files outside of the result DataFrame, along the lines of the sketch below.
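As a hedged illustration of the idea (not the library's current behaviour): keep only the extracted text/metadata plus a pointer back to the source file, and drop the raw bytes before persisting. The "tika" format name and the "content" column come from this thread; everything else is assumed for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().appName("tika-no-binary").getOrCreate()

val parsed = spark.read.format("tika").load("/mnt/raw/documents")
  .withColumn("source_path", input_file_name()) // keep a reference to the original binary
  .drop("content")                              // drop the raw bytes from the result

// The persisted result stays small even for many large documents;
// the binaries remain only in their original storage location.
parsed.write.mode("overwrite").parquet("/mnt/curated/documents_text")
```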
Let me know your thoughts.