-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel] Support clean expired log #3212
[Kernel] Support clean expired log #3212
Conversation
Hi @horizonzy, thank you for the PR! Could you please rebase this PR? Once rebased we can review and get it merged. There has been a formatting change since the PR posted. Refer here on how to reformat the changes. |
46f2f8e
to
cdbd5b2
Compare
I fixed the conflict, but the code style does not suit the rules, I have no idea how to fix it. The |
can you try the instructions here? |
Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR. One comment to address keeping the last checkpoint.
logger.info( | ||
"{}: Starting the deletion of log files older than {}", tablePath, fileCutOffTime); | ||
int numDeleted = 0; | ||
try (CloseableIterator<FileStatus> files = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should check there atleast one checkpoint to after the latest deleted file to make sure the table is still time travelable.
008.json
009.json
010.json
010.checkpoint.parquet
011.json
012.json
013.json
014.json
...
020.json
020.checkpoint.parquet
If the retention selection 8 to 13.json files, we should only delete up to the 10.json.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is fixed in the latest commit
ac54876
to
a061e5a
Compare
kernel/kernel-api/src/main/java/io/delta/kernel/internal/TableConfig.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/snapshot/MetadataCleanup.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/snapshot/MetadataCleanup.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/snapshot/MetadataCleanup.java
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/snapshot/MetadataCleanup.java
Outdated
Show resolved
Hide resolved
logger.info("{}: Deleted {} log files older than {}", tablePath, numDeleted, fileCutOffTime); | ||
} | ||
|
||
private static CloseableIterator<FileStatus> listExpiredDeltaLogs(Engine engine, Path tablePath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method doesn't do anything to ensure that the results are indeed expired
.
listDeltaLogs
is better -> but do we already have a nicely-named-util that does that somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of now, it is not filtering out the expired because it is not updating the timestamps to be monotonically increasing. I changed it to listDeltaLogs
. There is no utility method as it depends on the prefix and the actually API call isn't that complicated.
kernel/kernel-api/src/main/java/io/delta/kernel/internal/snapshot/MetadataCleanup.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/snapshot/MetadataCleanup.java
Outdated
Show resolved
Hide resolved
lastSeenCheckpointFiles.add(nextFile.getPath()); | ||
lastSeenCheckpointVersion = newLastSeenCheckpointVersion; | ||
} | ||
// TODO: do we need to delete unknown file types? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delta-Spark ignores unknown files.
Removing the TODO.
kernel/kernel-api/src/main/java/io/delta/kernel/internal/snapshot/MetadataCleanup.java
Outdated
Show resolved
Hide resolved
"{}: Deleting log files (start = {}, end = {}) because a checkpoint at " | ||
+ "version {} indicates that these log files are no longer needed.", | ||
tablePath, | ||
getFirst(potentialLogFilesToDelete), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should remove these getFirst
and getLast
methods. Or, if you want them to stay, can you explain the benefit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It helps track what file ranges of files deleted without printing the complete list. This is mainly for debugging any unwanted behavior. If the logging is noisy, we can always remove it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I just meant can we just do potentialLogFilesToDelete.get(0)
instead of creating a new getFirst
method.
If you think getFirst
is cleaner, SGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. This will be a staple in the later JDK versions, and it is clean rather than trying to say (list.length - 1)
. It will be useful in many other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Add support for metadata cleanup as part of the checkpointing. Metadata cleanup removes expired Delta table log files (delta + checkpoints) according to the table log retention configuration. Any removed delta log files must not cause the table state to be inconstructible.