[spark] Support distributed orphan file clean for spark #4200

xuzifu666 · 2024-09-17T16:48:37Z

Purpose

Linked issue: close #4184

Tests

RemoveOrphanFilesProcedureTest

API and Format

Documentation

JingsongLi · 2024-09-18T03:34:03Z

.../paimon-spark-common/src/main/java/org/apache/paimon/spark/orphan/SparkOrphanFilesClean.java

+        cleanSnapshotDir(branches, p -> deletedInLocal.incrementAndGet());
+
+        // branch and manifest file
+        CollectionAccumulator<Pair<String, String>> pairsAccumulator =


Why do we need a CollectionAccumulator? Why not just use an RDD?

Thanks @JingsongLi, I considered two points:

I want to reuse the OrphanFilesClean#collectWithoutDataFile method, and the Spark API does not seem to have a side output stream like Flink's.

Using an RDD might require splitting off another RDD, so the CollectionAccumulator works as a side output.

Changing the CollectionAccumulator to an RDD here is better. I have made that change and also adjusted the parent method so it can be called more generally. @JingsongLi
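For readers following the accumulator versus RDD question, here is a minimal sketch of the two styles in plain Java, using parallel streams as a stand-in for Spark. It is not the code from this PR: the class `SideOutputSketch`, both method names, and the `manifest-N` file names are hypothetical. The first method pushes (branch, manifest) pairs into a shared collection as a side effect, the way an accumulator would; the second returns the pairs as the result of the transformation, the way an RDD would.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical illustration only; plain Java streams stand in for Spark.
class SideOutputSketch {

    // Accumulator style: pairs leave the parallel job through a shared,
    // thread-safe collection that is filled as a side effect.
    static List<String> viaAccumulator(List<String> branches, int manifestsPerBranch) {
        Queue<String> accumulator = new ConcurrentLinkedQueue<>();
        branches.parallelStream().forEach(branch -> {
            for (int i = 0; i < manifestsPerBranch; i++) {
                accumulator.add(branch + ":manifest-" + i);
            }
        });
        return accumulator.stream().sorted().collect(Collectors.toList());
    }

    // RDD/Dataset style: pairs are the return value of the transformation,
    // so later stages can keep processing them in parallel.
    static List<String> viaResult(List<String> branches, int manifestsPerBranch) {
        return branches.parallelStream()
                .flatMap(branch -> IntStream.range(0, manifestsPerBranch)
                        .mapToObj(i -> branch + ":manifest-" + i))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> branches = List.of("main", "dev");
        System.out.println(viaAccumulator(branches, 2));
        System.out.println(viaResult(branches, 2));
    }
}
```

Both methods yield the same pairs here. The practical difference in a real cluster is that accumulated values are gathered in one place before they can be used again, while a returned collection stays distributed for the next stage.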

JingsongLi · 2024-09-19T07:48:03Z

.../paimon-spark-common/src/main/java/org/apache/paimon/spark/orphan/SparkOrphanFilesClean.java

+                                                    Pair.of(branch, manifest.fileName());
+                                            collect.add(pair0);
+                                        };
+                                collectWithoutDataFile(branch, snapshot, null, manifestConsumer);


It is better to execute this once.

JingsongLi · 2024-09-19T07:53:15Z

.../paimon-spark-common/src/main/java/org/apache/paimon/spark/orphan/SparkOrphanFilesClean.java

+                                    olderThanMillis,
+                                    fileCleaner,
+                                    parallelism)
+                            .doOrphanClean(ctx);


It is better to return an RDD or Dataset; that way, all tables can be executed together, making fuller use of all resources.

OK, I will change it later.
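The suggestion in this thread can be sketched in plain Java, with lazy streams standing in for an RDD or Dataset. This is an assumption-laden illustration rather than the PR's implementation: `OrphanPlanSketch`, `Table`, `orphanCandidates`, and `cleanAll` are hypothetical names. Each table contributes a lazy plan instead of running its own job, and the caller combines the plans and executes them once.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; lazy streams stand in for RDD/Dataset.
class OrphanPlanSketch {

    // Stand-in for a table: every file on disk plus the files still in use.
    static final class Table {
        final String name;
        final List<String> allFiles;
        final Set<String> usedFiles;

        Table(String name, List<String> allFiles, Set<String> usedFiles) {
            this.name = name;
            this.allFiles = allFiles;
            this.usedFiles = usedFiles;
        }
    }

    // Returns a lazy plan; nothing is evaluated until a terminal operation runs.
    static Stream<String> orphanCandidates(Table t) {
        return t.allFiles.stream()
                .filter(f -> !t.usedFiles.contains(f))
                .map(f -> t.name + "/" + f);
    }

    // Builds one plan per table, then runs all of them as a single parallel job.
    static List<String> cleanAll(List<Table> tables) {
        List<Stream<String>> plans = tables.stream()
                .map(OrphanPlanSketch::orphanCandidates)
                .collect(Collectors.toList());
        return plans.stream()
                .flatMap(s -> s)
                .parallel()
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Table t1 = new Table("t1", List.of("a.orc", "b.orc"), Set.of("a.orc"));
        Table t2 = new Table("t2", List.of("c.orc", "d.orc"), Set.of("c.orc"));
        System.out.println(cleanAll(List.of(t1, t2)));
    }
}
```

The design point is that returning the plan, rather than calling an eager method per table, lets a small table and a large table share the same pool of workers instead of running one after another.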

xuzifu666 · 2024-09-19T08:14:20Z

The Dataset code style in #4207 is more graceful than the RDD style, so I want to use the #4207 version to resolve this issue. @ulysses-you @JingsongLi

xuzifu666 added 2 commits September 17, 2024 23:51

[spark] Support distributed orphan file clean for spark

7c1461b

fix imports

cbd2b02

This was referenced Sep 17, 2024

[Feature] Implement distributed orphan file clean for spark #4184

Closed

[spark] Support distributed orphan file clean for spark #4199

Closed

JingsongLi reviewed Sep 18, 2024

xuzifu666 added 3 commits September 18, 2024 17:24

accrator 2 rdd

5b510d7

fix

d2bdd33

clean

97743f2

JingsongLi reviewed Sep 19, 2024

xuzifu666 closed this Sep 19, 2024
