
Scalability issues when storing binary file in pyspark column #44

Open
aleksandrskrivickis opened this issue May 17, 2024 · 1 comment

Comments

@aleksandrskrivickis

aleksandrskrivickis commented May 17, 2024

Dear @aamend @alexott @nfx,
I appreciate your work on making the Tika file format possible.

After reviewing the serialiser code, I noticed that you store the binary file as one of the columns.

Such a construct does not allow a stable flow at a scale of more than 1,000 large documents.

It could be prudent to store binary files outside of the result DataFrame.

Let me know your thoughts.

@arcaputo3
Contributor

It should be straightforward to add an option to ignore the content column, but Tika still requires the entire binary in memory to do OCR, so IMO memory is the bottleneck, not storage.

We've had success using small partitions and high-memory clusters for larger workflows. You can set spark.conf.set("spark.sql.files.maxPartitionBytes", 4194304), which asks Spark to reduce the input partition size to 4 MB instead of the default 128 MB.
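For reference, a minimal sketch of that partition-size tuning; the format name and input path below are assumptions for illustration, not part of this thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Target ~4 MB input partitions instead of Spark's 128 MB default,
# so each task holds fewer large binaries in memory at once.
spark.conf.set("spark.sql.files.maxPartitionBytes", 4194304)

# Hypothetical read: format name and path are placeholders for illustration.
df = (
    spark.read
    .format("tika")          # assumed format name for the Tika file format
    .load("/mnt/docs/")      # assumed input path
)

df.printSchema()
```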
