[spark] Support distributed orphan file clean for spark #4200

Closed

xuzifu666
Contributor

Purpose

Linked issue: close #4184

Tests

RemoveOrphanFilesProcedureTest

API and Format

Documentation

cleanSnapshotDir(branches, p -> deletedInLocal.incrementAndGet());

// branch and manifest file
CollectionAccumulator<Pair<String, String>> pairsAccumulator =
Contributor

Why do we need a CollectionAccumulator? Why not just use an RDD?

Contributor Author

Thanks @JingsongLi, I had two considerations:

  1. I want to reuse the OrphanFilesClean#collectWithoutDataFile method, and the Spark API does not seem to offer a side-output stream like Flink's.
  2. With an RDD, we would probably need to split out a second RDD. The CollectionAccumulator works as the side output.
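
The trade-off in the two points above can be sketched without Spark. This is a minimal plain-Java illustration, not the PR's code: an in-memory list stands in for an RDD, a shared list stands in for a `CollectionAccumulator`, and the names `SideOutputSketch`, `Tagged`, `scan`, `viaAccumulator`, and `select` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SideOutputSketch {
    /** One record from a single pass: either a used file or a "branch:manifest" pair. */
    static final class Tagged {
        final boolean manifestPair;
        final String value;

        Tagged(boolean manifestPair, String value) {
            this.manifestPair = manifestPair;
            this.value = value;
        }
    }

    /** Hypothetical scan: every manifest counts as a used file and also yields a pair. */
    static List<Tagged> scan(String branch, List<String> manifests) {
        List<Tagged> out = new ArrayList<>();
        for (String manifest : manifests) {
            out.add(new Tagged(false, manifest));
            out.add(new Tagged(true, branch + ":" + manifest));
        }
        return out;
    }

    /** Accumulator style: return the main output, push pairs into a shared sink. */
    static List<String> viaAccumulator(String branch, List<String> manifests, List<String> sink) {
        List<String> used = new ArrayList<>();
        for (Tagged t : scan(branch, manifests)) {
            if (t.manifestPair) {
                sink.add(t.value);
            } else {
                used.add(t.value);
            }
        }
        return used;
    }

    /** Split style: keep one tagged collection and filter it once per output. */
    static List<String> select(List<Tagged> all, boolean manifestPair) {
        List<String> out = new ArrayList<>();
        for (Tagged t : all) {
            if (t.manifestPair == manifestPair) {
                out.add(t.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        System.out.println(viaAccumulator("main", Arrays.asList("m1", "m2"), sink));
        System.out.println(sink);
        List<Tagged> all = scan("main", Arrays.asList("m1", "m2"));
        System.out.println(select(all, false) + " " + select(all, true));
    }
}
```

Both styles produce the same two outputs. The split style reads the tagged collection once per output, which in Spark would usually mean caching the RDD or recomputing it; the accumulator style avoids that but relies on a side effect outside the returned data.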

Contributor Author

Switching from CollectionAccumulator to an RDD is better here. I have made that change and also adjusted the parent method so that it can be called more generally. @JingsongLi

Pair.of(branch, manifest.fileName());
collect.add(pair0);
};
collectWithoutDataFile(branch, snapshot, null, manifestConsumer);
Contributor

It is better to execute this once.

olderThanMillis,
fileCleaner,
parallelism)
.doOrphanClean(ctx);
Contributor

It is better to return an RDD or Dataset. That way, all tables can be executed together, which makes better use of all resources.
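
The idea of running all tables together can be illustrated with a small plain-Java sketch. It is an assumption-laden stand-in, not Paimon or Spark code: a parallel `java.util.stream` pipeline plays the role of a combined Dataset, and `CombinedCleanSketch`, `orphans`, `candidatesByTable`, and `usedByTable` are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class CombinedCleanSketch {
    /**
     * Returns, per table, the candidate files that are not in that table's used set.
     * Tables without orphans do not appear in the result.
     */
    static Map<String, List<String>> orphans(
            Map<String, List<String>> candidatesByTable,
            Map<String, Set<String>> usedByTable) {
        return candidatesByTable.entrySet().stream()
                // Flatten every table into one (table, file) collection so that
                // a single job covers all tables instead of one job per table.
                .flatMap(e -> e.getValue().stream()
                        .map(f -> new SimpleEntry<String, String>(e.getKey(), f)))
                .parallel()
                // Keep only files that the owning table does not reference.
                .filter(p -> !usedByTable
                        .getOrDefault(p.getKey(), Collections.<String>emptySet())
                        .contains(p.getValue()))
                // Regroup by table so the caller can report or delete per table.
                .collect(Collectors.groupingBy(
                        p -> p.getKey(),
                        TreeMap::new,
                        Collectors.mapping(p -> p.getValue(), Collectors.toList())));
    }
}
```

With one job per table, a small table leaves most workers idle while it runs. Flattening first lets the scheduler spread the files of every table across all workers, which is the resource-utilization point made in the comment above.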

Contributor Author

OK, I will change it later.

@xuzifu666
Contributor Author

The Dataset code style in #4207 is more graceful than the RDD style here, so I would like to use the #4207 version to resolve the issue. @ulysses-you @JingsongLi

xuzifu666 closed this on Sep 19, 2024
Development

Successfully merging this pull request may close these issues.

[Feature] Implement distributed orphan file clean for spark
2 participants