Question: Can this library be used with streaming? #43

MarioP-Dev · 2023-07-26T13:38:33Z

I am using AWS Glue for processing data coming from Kinesis.
We currently use the library for batch jobs, which have a defined end at which the library deletes the shuffle files. Streaming jobs, however, never end, so the library never cleans up their shuffle files, which have grown to occupy 1 TB of S3 storage.

Is there any way to force shuffle file cleanup at a given point in the script, or to clean up the shuffle files automatically after each batch?

pspoerri · 2023-08-17T06:23:22Z

I have no experience with Spark streaming; however, this plugin cleans up shuffle data automatically. It relies on the unregister shuffle API to clean up shuffle files, so I believe it should work in your case.

There's no way to force cleanup, since this shuffle plugin has no knowledge of the Spark DAG.

Let me know if this solution works for you!
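
A note on the unregister-based cleanup mentioned above: in Spark, shuffles are unregistered by the driver-side context cleaner, which only acts once the driver JVM has garbage-collected the references to a shuffle. In a long-running job with little memory pressure on the driver that can happen rarely, which may be why files accumulate in a streaming job. The following is a minimal sketch, not something confirmed in this thread: it only builds the two standard Spark settings involved (`spark.cleaner.referenceTracking` and `spark.cleaner.periodicGC.interval`, default `30min`) as `--conf` arguments. The helper names `cleaner_conf` and `to_submit_args` and the `5min` value are assumptions chosen for illustration, and whether a shorter interval actually helps on AWS Glue is untested here.

```python
from typing import Dict, List


def cleaner_conf(gc_interval: str = "5min") -> Dict[str, str]:
    """Return Spark settings that affect when shuffles get unregistered.

    The default of "5min" is an illustrative assumption, not a recommendation.
    """
    return {
        # Keep the context cleaner enabled (this is already Spark's default).
        "spark.cleaner.referenceTracking": "true",
        # Ask the driver to trigger a GC more often than the 30min default,
        # so unreferenced shuffles are noticed and unregistered sooner.
        "spark.cleaner.periodicGC.interval": gc_interval,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a settings dict into spark-submit style `--conf key=value` pairs."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(cleaner_conf())))
```

The resulting pairs can be passed wherever the job accepts Spark `--conf` settings. If the shuffle files still keep growing after that, the cause likely lies elsewhere.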
