
Scalability issues when storing binary file in pyspark column #44

Open
aleksandrskrivickis opened this issue May 17, 2024 · 1 comment

Comments

@aleksandrskrivickis

aleksandrskrivickis commented May 17, 2024

Dear @aamend @alexott @nfx,
I appreciate your work on making the Tika file format possible.

After reviewing the serialiser code, I noticed that you store the binary file as one of the columns.

Such a construct does not allow a stable flow at a scale of more than 1,000 large documents.

It could be prudent to store binary files outside of the result DataFrame.

Let me know your thoughts.

@arcaputo3
Contributor

It should be straightforward to add an option to ignore the content column, but Tika still requires the entire binary in memory to do OCR, so IMO memory is the bottleneck, not storage.

We've had success using small partitions and high-memory clusters for larger workflows. You can set spark.conf.set("spark.sql.files.maxPartitionBytes", 4194304), which asks Spark to reduce the input partition size to 4 MB instead of the default 128 MB.
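For reference, a minimal sketch of that partition-size tuning; the format name and input path below are assumptions for illustration, not part of this thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Target ~4 MB input partitions instead of Spark's 128 MB default,
# so each task holds fewer large binaries in memory at once.
spark.conf.set("spark.sql.files.maxPartitionBytes", 4194304)

# Hypothetical read: format name and path are placeholders for illustration.
df = (
    spark.read
    .format("tika")          # assumed format name for the Tika file format
    .load("/mnt/docs/")      # assumed input path
)

df.printSchema()
```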
