Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Support clean expired log #3212

Merged
merged 15 commits into from
Sep 18, 2024

Conversation

horizonzy
Copy link
Contributor

@horizonzy horizonzy commented Jun 4, 2024

Add support for metadata cleanup as part of the checkpointing. Metadata cleanup removes expired Delta table log files (delta + checkpoints) according to the table log retention configuration. Any removed delta log files must not cause the table state to be inconstructible.

@vkorukanti
Copy link
Collaborator

vkorukanti commented Aug 16, 2024

Hi @horizonzy, thank you for the PR! Could you please rebase this PR? Once rebased we can review and get it merged. There has been a formatting change since the PR posted. Refer here on how to reformat the changes.

@horizonzy
Copy link
Contributor Author

I fixed the conflict, but the code style does not suit the rules, I have no idea how to fix it. The kernel-checkstyle.xml is not suitable for the IntelliJ idea settings.

@vkorukanti
Copy link
Collaborator

I fixed the conflict, but the code style does not suit the rules, I have no idea how to fix it. The kernel-checkstyle.xml is not suitable for the IntelliJ idea settings.

can you try the instructions here?

@horizonzy
Copy link
Contributor Author

I fixed the conflict, but the code style does not suit the rules, I have no idea how to fix it. The kernel-checkstyle.xml is not suitable for the IntelliJ idea settings.

can you try the instructions here?

Thanks!

Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. One comment to address keeping the last checkpoint.

logger.info(
"{}: Starting the deletion of log files older than {}", tablePath, fileCutOffTime);
int numDeleted = 0;
try (CloseableIterator<FileStatus> files =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should check there atleast one checkpoint to after the latest deleted file to make sure the table is still time travelable.

008.json
009.json
010.json
010.checkpoint.parquet
011.json
012.json
013.json
014.json
...
020.json
020.checkpoint.parquet

If the retention selection 8 to 13.json files, we should only delete up to the 10.json.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fixed in the latest commit

logger.info("{}: Deleted {} log files older than {}", tablePath, numDeleted, fileCutOffTime);
}

private static CloseableIterator<FileStatus> listExpiredDeltaLogs(Engine engine, Path tablePath)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method doesn't do anything to ensure that the results are indeed expired.

listDeltaLogs is better -> but do we already have a nicely-named-util that does that somewhere?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now, it is not filtering out the expired because it is not updating the timestamps to be monotonically increasing. I changed it to listDeltaLogs. There is no utility method as it depends on the prefix and the actually API call isn't that complicated.

lastSeenCheckpointFiles.add(nextFile.getPath());
lastSeenCheckpointVersion = newLastSeenCheckpointVersion;
}
// TODO: do we need to delete unknown file types?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delta-Spark ignores unknown files.

Removing the TODO.

"{}: Deleting log files (start = {}, end = {}) because a checkpoint at "
+ "version {} indicates that these log files are no longer needed.",
tablePath,
getFirst(potentialLogFilesToDelete),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should remove these getFirst and getLast methods. Or, if you want them to stay, can you explain the benefit?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It helps track what file ranges of files deleted without printing the complete list. This is mainly for debugging any unwanted behavior. If the logging is noisy, we can always remove it later.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I just meant can we just do potentialLogFilesToDelete.get(0) instead of creating a new getFirst method.

If you think getFirst is cleaner, SGTM.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This will be a staple in the later JDK versions, and it is clean rather than trying to say (list.length - 1). It will be useful in many other places.

Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@vkorukanti vkorukanti merged commit d467f52 into delta-io:master Sep 18, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants