-
Notifications
You must be signed in to change notification settings - Fork 946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[spark] Support distributed orphan file clean for spark #4200
Conversation
cleanSnapshotDir(branches, p -> deletedInLocal.incrementAndGet()); | ||
|
||
// branch and manifest file | ||
CollectionAccumulator<Pair<String, String>> pairsAccumulator = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need a CollectionAccumulator
? Why not just use RDD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JingsongLi I consider 2 points:
- Want to reuse OrphanFilesClean##collectWithoutDataFile method and Spark API seems not contain side output stream like Flink
- If use rdd maybe need split another rdd. CollectionAccumulator work as side output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here CollectionAccumulator change to Rdd is better, had changed it and do some change in parent method for more common calling. @JingsongLi
Pair.of(branch, manifest.fileName()); | ||
collect.add(pair0); | ||
}; | ||
collectWithoutDataFile(branch, snapshot, null, manifestConsumer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to execute this once.
olderThanMillis, | ||
fileCleaner, | ||
parallelism) | ||
.doOrphanClean(ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to return a RDD or Dataset, in that way, all tables can be executed together to allow for greater utilization of all resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK,would changed it latter.
#4207 dataset code style is more graceful than rdd style, so want to use 4207 version to support the issue @ulysses-you @JingsongLi |
Purpose
Linked issue: close #4184
Tests
RemoveOrphanFilesProcedureTest
API and Format
Documentation