Question: Can this library be used with streaming? #43

Open
MarioP-Dev opened this issue Jul 26, 2023 · 1 comment
Comments

@MarioP-Dev

I am using AWS Glue to process data coming from Kinesis.
We currently use the library for batch jobs, that is, jobs that have an end, at which point the library deletes the shuffle files. Streaming jobs, however, never end, so the library never cleans up their shuffle files, which now occupy 1 TB of S3 storage.

Is there any way to force shuffle file cleanup at a given point in the script, or to clean up the shuffle files automatically after each batch?

@pspoerri
Contributor

I have no experience with Spark Streaming; however, this plugin cleans up shuffle data automatically. It relies on the unregister shuffle API to clean up shuffle files, so I believe it should work in your case.

There's no way to force cleanup since this shuffle plugin does not have knowledge of the Spark DAG.
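
For context: in stock Spark, the unregister shuffle call is normally triggered by the driver's context cleaner once a shuffle dependency has been garbage collected, so a long-running streaming driver may keep shuffle files around until a driver GC happens. The following is a minimal, untested sketch of shortening that interval. `cleaner_settings` and `build_session` are hypothetical helper names; the two `spark.cleaner.*` keys are standard Spark settings. It assumes the settings can be applied before the Spark context is created, which may need a different mechanism on Glue, and whether this bounds S3 usage with this plugin in a streaming job is not confirmed in this thread.

```python
def cleaner_settings(gc_interval="5min"):
    """Return Spark settings that make the driver run GC more often.

    Hypothetical helper. The default of "5min" is an arbitrary
    illustration, not a recommendation from the plugin authors.
    """
    return {
        # Keep reference tracking on so the context cleaner runs at all
        # (this is already Spark's default).
        "spark.cleaner.referenceTracking": "true",
        # How often the driver triggers a GC (Spark's default is 30min).
        "spark.cleaner.periodicGC.interval": gc_interval,
    }


def build_session(app_name, settings):
    """Create a SparkSession with the given settings applied.

    Hypothetical helper. PySpark is imported lazily so this module can
    be loaded without Spark installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Print the settings that would be applied; no Spark needed here.
    for key, value in cleaner_settings().items():
        print(f"{key}={value}")
```

The settings only take effect if they are in place when the Spark context starts, so `build_session` would have to run before any other code obtains a context.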

Let me know if this solution works for you!
