-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add visitor SidecarVisitor
and Sidecar
action struct
#673
feat: add visitor SidecarVisitor
and Sidecar
action struct
#673
Conversation
kernel/src/actions/mod.rs
Outdated
#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))] | ||
struct Sidecar { | ||
/// A path to the sidecar file. Because sidecar files must always reside in the table's own | ||
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is quoted from the protocol linked here.
I am assuming this means we will need to handle both cases where the path is:
- just the file name of the sidecar files
- the absolute path (including parent directories)
Or can we assume that implementations are not only encouraged, but DO simply store the file name of the sidecar files as the path
?
Delta spark assumes that path
is just the file name:
https://github.com/delta-io/delta/blob/990c8e8aedb69a1660e7780ab1d1a19f5c95615f/kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/ActionsIterator.java#L274
Can I get a 👍 to do the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In think we have to stick to the protocol here and assume it may also be a URL. That said, not sure how many writers out there in the wild already write v2 checkpoints. Maybe we can still update the protocol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the input @roeap. instead of blocking on updating the protocol, I'm going to go ahead with implementing support for only filenames in this PR along with error handling when a file path is passed. Supporting full-path parsing should be supported like you said so I've opened up an issue for future work #675
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the route we decide to go down, make sure that's clear in the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update:
As @scovich pointed out, since we plan to do Url::join
with the sidecar directory as the base and the sidecar.path
as the input before reading sidecar files anyways, we will actually already support full file paths since the join
functionality handles parent path replacements:
An absolute URL (with a scheme) as input replaces the whole base URL (even the scheme).
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #673 +/- ##
==========================================
- Coverage 84.08% 84.05% -0.03%
==========================================
Files 77 77
Lines 17761 17823 +62
Branches 17761 17823 +62
==========================================
+ Hits 14934 14982 +48
- Misses 2116 2122 +6
- Partials 711 719 +8 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good. just some minor comments.
kernel/src/actions/mod.rs
Outdated
#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))] | ||
struct Sidecar { | ||
/// A path to the sidecar file. Because sidecar files must always reside in the table's own | ||
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In think we have to stick to the protocol here and assume it may also be a URL. That said, not sure how many writers out there in the wild already write v2 checkpoints. Maybe we can still update the protocol?
kernel/src/actions/visitors.rs
Outdated
r#"{"sidecar":{"path":"016ae953-37a9-438e-8683-9a9a4a79a395.parquet","sizeInBytes":9268,"modificationTime":1714496113961,"tags": null}}"#, | ||
r#"{"sidecar":{"path":"3a0d65cd-4056-49b8-937b-95f9e3ee90e5.parquet","sizeInBytes":9268,"modificationTime":1714496113962,"tags": null}}"#, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should just merge this test with test_parse_action_batch_without_sidecar_actions
by moving the sidecar actions here into the action_batch
. That we we test both that:
- We correctly parse sidecar actions
- We correctly ignore irrelevant actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ similar to protocol and cdc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also make one of the sidecar files have a real tags field with a string-string map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, I'm not certain "tags": null
is something that occurs in tables in the wild. Would tags
just be omitted? In any case, we should probably test that we can take a sidecar
action with no tags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced with Oussama on tags field, which can be:
not defined (omitted) / null (which does not get omitted from the action & shows up as null) / string: string map
Will be merging the tests and I'll add a sidecar action into the action_batch
to make sure it is ignored during parsing for other actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any update on these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it a bit awkward to make a new test with sidecar actions with just differing tags. I've found test tables that have tags be "null" that are being appropriately parsed
delta-kernel-rs/kernel/tests/data/app-txn-checkpoint/_delta_log/00000000000000000001.json
Line 2 in 6a82a57
{"add":{"path":"modified=2021-02-01/part-00001-3b6e7f26-8140-4067-8504-47540a363758-c000.snappy.parquet","partitionValues":{"modified":"2021-02-01"},"size":810,"modificationTime":1713400874285,"dataChange":true,"stats":"{\"numRecords\":8,\"minValues\":{\"value\":4,\"id\":\"A\"},\"maxValues\":{\"id\":\"B\",\"value\":11},\"nullCount\":{\"id\":0,\"value\":0}}","tags":null,"deletionVector":null,"baseRowId":null,"defaultRowCommitVersion":null,"clusteringProvider":null}} |
test_parse_sidecar_with_differing_tags
@@ -544,6 +584,36 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: None of these tests check the error case where a required field is missing 🤔 Maybe we should make an issue to add those. Personally I'd also like to see a VisitorError
type so that we can make stronger assertions on the failure cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea! #674
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't we also add a trivial case just for these additions to see it fail if sidecar.path is missing? (and keep the issue to extend to other cases?)
kernel/src/actions/visitors.rs
Outdated
// Since path column is required, use it to detect presence of a sidecar action | ||
if let Some(path) = getters[0].get_opt(i, "sidecar.path")? { | ||
if self.seen_paths.contains(path) { | ||
warn!("Duplicate sidecar path {} found during visiting", path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a batch of actions includes two sidecar actions referencing the same file, we issue a warning. We will ignore the duplicate file path to prevent unnecessary re-reading of the file and not throw an error as we can still construct table state.
This is done as a best-effort. Ideally we could ensure no sidecar files are read more than once by tracking sidecar file paths across all batches that makeup a checkpoint file, but this introduces an unbounded memory dependency on the # of sidecar-paths encountered (as we would need to store them all somehow). I do not think the added performance in cases where sidecar files are referenced more than once (unlikely) warrants more than this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's worth doing this extra work just to issue a warning. The vast majority of cases, we won't encounter this. If we do encounter it, AddRemoveDedupVisitor will handle it. I think of it like this: Expensive checks should only be done when it's a matter of correctness. All other checks/warnings should be cheap to do.
I would consider this an expensive check because we use double the memory to maintain both the Vec and the HashSet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm missing something -- sidecar actions are only allowed in checkpoint files:
This action is only allowed in checkpoints following V2 spec
... and checkpoints are not allowed to contain duplicate actions of any kind:
A checkpoint contains the complete replay of all actions, up to and including the checkpointed table version, with invalid actions removed. Invalid actions are those that have been canceled out by subsequent ones (for example removing a file that has been added), using the rules for reconciliation.
So IMO we shouldn't be checking for duplicate sidecars unless we're also planning to do expensive checks for duplicate file actions in the checkpoint (which I don't think is worth the trouble).
kernel/src/actions/mod.rs
Outdated
#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))] | ||
struct Sidecar { | ||
/// A path to the sidecar file. Because sidecar files must always reside in the table's own | ||
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the route we decide to go down, make sure that's clear in the docs.
kernel/src/actions/visitors.rs
Outdated
// Since path column is required, use it to detect presence of a sidecar action | ||
if let Some(path) = getters[0].get_opt(i, "sidecar.path")? { | ||
if self.seen_paths.contains(path) { | ||
warn!("Duplicate sidecar path {} found during visiting", path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's worth doing this extra work just to issue a warning. The vast majority of cases, we won't encounter this. If we do encounter it, AddRemoveDedupVisitor will handle it. I think of it like this: Expensive checks should only be done when it's a matter of correctness. All other checks/warnings should be cheap to do.
I would consider this an expensive check because we use double the memory to maintain both the Vec and the HashSet.
kernel/src/actions/visitors.rs
Outdated
|
||
// The visitor should error out when it encounters a full file path | ||
let res = visitor.visit_rows_of(batch.as_ref()).unwrap_err(); | ||
assert!(matches!(res, Error::VisitorError(_))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sry I should've made myself clear. VisitorError should definitely be in a separate PR/issue. Rust errors are best represented as enums where every error case is its own enum variant. That's a non-trivial amount of code, so it's best left for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh all right, I can go ahead and revert this
d910837
to
ffb4d64
Compare
@@ -501,7 +542,8 @@ mod tests { | |||
r#"{"commitInfo":{"timestamp":1677811178585,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputRows":"10","numOutputBytes":"635"},"engineInfo":"Databricks-Runtime/<unknown>","txnId":"a6a94671-55ef-450e-9546-b8465b9147de"}}"#, | |||
r#"{"protocol":{"minReaderVersion":3,"minWriterVersion":7,"readerFeatures":["deletionVectors"],"writerFeatures":["deletionVectors"]}}"#, | |||
r#"{"metaData":{"id":"testId","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{"delta.enableDeletionVectors":"true","delta.columnMapping.mode":"none", "delta.enableChangeDataFeed":"true"},"createdTime":1677811175819}}"#, | |||
r#"{"cdc":{"path":"_change_data/age=21/cdc-00000-93f7fceb-281a-446a-b221-07b88132d203.c000.snappy.parquet","partitionValues":{"age":"21"},"size":1033,"dataChange":false}}"# | |||
r#"{"cdc":{"path":"_change_data/age=21/cdc-00000-93f7fceb-281a-446a-b221-07b88132d203.c000.snappy.parquet","partitionValues":{"age":"21"},"size":1033,"dataChange":false}}"#, | |||
r#"{"sidecar":{"path":"016ae953-37a9-438e-8683-9a9a4a79a395.parquet","sizeInBytes":9268,"modificationTime":1714496113961,"tags":{"tag_foo":"tag_bar"}}}"#, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq: would a relative path look like this or would it be _sidecar/016ae953-37a9-438e-8683-9a9a4a79a395.parquet
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A relative path would just be the filename.
kernel/src/actions/visitors.rs
Outdated
r#"{"sidecar":{"path":"016ae953-37a9-438e-8683-9a9a4a79a395.parquet","sizeInBytes":9268,"modificationTime":1714496113961,"tags": null}}"#, | ||
r#"{"sidecar":{"path":"3a0d65cd-4056-49b8-937b-95f9e3ee90e5.parquet","sizeInBytes":9268,"modificationTime":1714496113962,"tags": null}}"#, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any update on these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GJ!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (left a couple nits)
#[allow(unused)] //TODO: Remove once we implement V2 checkpoint file processing | ||
#[derive(Schema, Debug, PartialEq)] | ||
#[cfg_attr(feature = "developer-visibility", visibility::make(pub))] | ||
pub(crate) struct Sidecar { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could add some docs and link to sidecar action in protocol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea 👍
@@ -544,6 +584,36 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't we also add a trivial case just for these additions to see it fail if sidecar.path is missing? (and keep the issue to extend to other cases?)
kernel/src/actions/visitors.rs
Outdated
@@ -4,6 +4,7 @@ | |||
use std::collections::HashMap; | |||
use std::sync::LazyLock; | |||
|
|||
use crate::actions::SIDECAR_NAME; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move below with other imports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dang good eyes 👍
@zachschuermann I'd rather keep it in a separate PR since I think we only need to test a single required field of an action to verify that the visitor extraction fails appropriately on a batch that is missing the column. Also, not immediately sure how to create an Arrow engine batch with a missing column in the first place since we have a lot of safeguards. |
What changes are proposed in this pull request?
This PR introduces the
Sidecar
action and its associatedSidecarVisitor
.This action and visitor is used for the V2 checkpoints Reader/Writer table feature.
Edge cases:
Error if the sidecarpath
field is a full file-path instead of just the filename. Ref to issue: Support full paths instead of only file-names in sidecar actions'spath
field #675resolves #668
Note: PR also simplifies some conditional flags detailed in issue: #672
How was this change tested?
Sidecar
worksSidecar
actions- Ensure duplicatepaths
in sidecar actions are ignored- EnsureVisitorError
is returned when full file path is passed inpath
field