Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add visitor SidecarVisitor and Sidecar action struct #673

Merged
merged 8 commits into from
Feb 5, 2025

Conversation

sebastiantia
Copy link
Collaborator

@sebastiantia sebastiantia commented Feb 1, 2025

What changes are proposed in this pull request?

This PR introduces the Sidecar action and its associated SidecarVisitor.

This action and visitor is used for the V2 checkpoints Reader/Writer table feature.

Edge cases:

resolves #668

Note: PR also simplifies some conditional flags detailed in issue: #672

How was this change tested?

  • Ensure schema projection to Sidecar works
  • Ensure that the visitor correctly extracts Sidecar actions
    - Ensure duplicate paths in sidecar actions are ignored
    - Ensure VisitorError is returned when full file path is passed in path field

#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))]
struct Sidecar {
/// A path to the sidecar file. Because sidecar files must always reside in the table's own
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name.
Copy link
Collaborator Author

@sebastiantia sebastiantia Feb 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is quoted from the protocol linked here.

I am assuming this means we will need to handle both cases where the path is:

  1. just the file name of the sidecar files
  2. the absolute path (including parent directories)

Or can we assume that implementations are not only encouraged, but DO simply store the file name of the sidecar files as the path?

Delta spark assumes that path is just the file name:
https://github.com/delta-io/delta/blob/990c8e8aedb69a1660e7780ab1d1a19f5c95615f/kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/ActionsIterator.java#L274

Can I get a 👍 to do the same?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In think we have to stick to the protocol here and assume it may also be a URL. That said, not sure how many writers out there in the wild already write v2 checkpoints. Maybe we can still update the protocol?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the input @roeap. instead of blocking on updating the protocol, I'm going to go ahead with implementing support for only filenames in this PR along with error handling when a file path is passed. Supporting full-path parsing should be supported like you said so I've opened up an issue for future work #675

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the route we decide to go down, make sure that's clear in the docs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update:

As @scovich pointed out, since we plan to do Url::join with the sidecar directory as the base and the sidecar.path as the input before reading sidecar files anyways, we will actually already support full file paths since the join functionality handles parent path replacements:

An absolute URL (with a scheme) as input replaces the whole base URL (even the scheme).

Copy link

codecov bot commented Feb 1, 2025

Codecov Report

Attention: Patch coverage is 85.29412% with 10 lines in your changes missing coverage. Please review.

Project coverage is 84.05%. Comparing base (379f5e5) to head (2111b8f).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/actions/visitors.rs 80.39% 4 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
- Coverage   84.08%   84.05%   -0.03%     
==========================================
  Files          77       77              
  Lines       17761    17823      +62     
  Branches    17761    17823      +62     
==========================================
+ Hits        14934    14982      +48     
- Misses       2116     2122       +6     
- Partials      711      719       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sebastiantia sebastiantia marked this pull request as ready for review February 1, 2025 01:04
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good. just some minor comments.

kernel/src/actions/mod.rs Outdated Show resolved Hide resolved
#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))]
struct Sidecar {
/// A path to the sidecar file. Because sidecar files must always reside in the table's own
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In think we have to stick to the protocol here and assume it may also be a URL. That said, not sure how many writers out there in the wild already write v2 checkpoints. Maybe we can still update the protocol?

kernel/src/actions/mod.rs Outdated Show resolved Hide resolved
kernel/src/actions/mod.rs Outdated Show resolved Hide resolved
kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
Comment on lines 592 to 593
r#"{"sidecar":{"path":"016ae953-37a9-438e-8683-9a9a4a79a395.parquet","sizeInBytes":9268,"modificationTime":1714496113961,"tags": null}}"#,
r#"{"sidecar":{"path":"3a0d65cd-4056-49b8-937b-95f9e3ee90e5.parquet","sizeInBytes":9268,"modificationTime":1714496113962,"tags": null}}"#,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just merge this test with test_parse_action_batch_without_sidecar_actions by moving the sidecar actions here into the action_batch. That we we test both that:

  1. We correctly parse sidecar actions
  2. We correctly ignore irrelevant actions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ similar to protocol and cdc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also make one of the sidecar files have a real tags field with a string-string map.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, I'm not certain "tags": null is something that occurs in tables in the wild. Would tags just be omitted? In any case, we should probably test that we can take a sidecar action with no tags.

Copy link
Collaborator Author

@sebastiantia sebastiantia Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced with Oussama on tags field, which can be:
not defined (omitted) / null (which does not get omitted from the action & shows up as null) / string: string map

Will be merging the tests and I'll add a sidecar action into the action_batch to make sure it is ignored during parsing for other actions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any update on these tests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it a bit awkward to make a new test with sidecar actions with just differing tags. I've found test tables that have tags be "null" that are being appropriately parsed

{"add":{"path":"modified=2021-02-01/part-00001-3b6e7f26-8140-4067-8504-47540a363758-c000.snappy.parquet","partitionValues":{"modified":"2021-02-01"},"size":810,"modificationTime":1713400874285,"dataChange":true,"stats":"{\"numRecords\":8,\"minValues\":{\"value\":4,\"id\":\"A\"},\"maxValues\":{\"id\":\"B\",\"value\":11},\"nullCount\":{\"id\":0,\"value\":0}}","tags":null,"deletionVector":null,"baseRowId":null,"defaultRowCommitVersion":null,"clusteringProvider":null}}
But if you think it's still good to include I can create a new test test_parse_sidecar_with_differing_tags

kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
@@ -544,6 +584,36 @@ mod tests {
Ok(())
}

#[test]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: None of these tests check the error case where a required field is missing 🤔 Maybe we should make an issue to add those. Personally I'd also like to see a VisitorError type so that we can make stronger assertions on the failure cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea! #674

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't we also add a trivial case just for these additions to see it fail if sidecar.path is missing? (and keep the issue to extend to other cases?)

// Since path column is required, use it to detect presence of a sidecar action
if let Some(path) = getters[0].get_opt(i, "sidecar.path")? {
if self.seen_paths.contains(path) {
warn!("Duplicate sidecar path {} found during visiting", path);
Copy link
Collaborator Author

@sebastiantia sebastiantia Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a batch of actions includes two sidecar actions referencing the same file, we issue a warning. We will ignore the duplicate file path to prevent unnecessary re-reading of the file and not throw an error as we can still construct table state.

This is done as a best-effort. Ideally we could ensure no sidecar files are read more than once by tracking sidecar file paths across all batches that makeup a checkpoint file, but this introduces an unbounded memory dependency on the # of sidecar-paths encountered (as we would need to store them all somehow). I do not think the added performance in cases where sidecar files are referenced more than once (unlikely) warrants more than this change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's worth doing this extra work just to issue a warning. The vast majority of cases, we won't encounter this. If we do encounter it, AddRemoveDedupVisitor will handle it. I think of it like this: Expensive checks should only be done when it's a matter of correctness. All other checks/warnings should be cheap to do.

I would consider this an expensive check because we use double the memory to maintain both the Vec and the HashSet.

Copy link
Collaborator

@scovich scovich Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing something -- sidecar actions are only allowed in checkpoint files:

This action is only allowed in checkpoints following V2 spec

... and checkpoints are not allowed to contain duplicate actions of any kind:

A checkpoint contains the complete replay of all actions, up to and including the checkpointed table version, with invalid actions removed. Invalid actions are those that have been canceled out by subsequent ones (for example removing a file that has been added), using the rules for reconciliation.

So IMO we shouldn't be checking for duplicate sidecars unless we're also planning to do expensive checks for duplicate file actions in the checkpoint (which I don't think is worth the trouble).

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 3, 2025
#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))]
struct Sidecar {
/// A path to the sidecar file. Because sidecar files must always reside in the table's own
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the route we decide to go down, make sure that's clear in the docs.

// Since path column is required, use it to detect presence of a sidecar action
if let Some(path) = getters[0].get_opt(i, "sidecar.path")? {
if self.seen_paths.contains(path) {
warn!("Duplicate sidecar path {} found during visiting", path);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's worth doing this extra work just to issue a warning. The vast majority of cases, we won't encounter this. If we do encounter it, AddRemoveDedupVisitor will handle it. I think of it like this: Expensive checks should only be done when it's a matter of correctness. All other checks/warnings should be cheap to do.

I would consider this an expensive check because we use double the memory to maintain both the Vec and the HashSet.


// The visitor should error out when it encounters a full file path
let res = visitor.visit_rows_of(batch.as_ref()).unwrap_err();
assert!(matches!(res, Error::VisitorError(_)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sry I should've made myself clear. VisitorError should definitely be in a separate PR/issue. Rust errors are best represented as enums where every error case is its own enum variant. That's a non-trivial amount of code, so it's best left for later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh all right, I can go ahead and revert this

kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
@sebastiantia sebastiantia force-pushed the introduce_sidecar_visitor branch from d910837 to ffb4d64 Compare February 4, 2025 18:59
@sebastiantia sebastiantia removed the breaking-change Change that will require a version bump label Feb 4, 2025
@@ -501,7 +542,8 @@ mod tests {
r#"{"commitInfo":{"timestamp":1677811178585,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputRows":"10","numOutputBytes":"635"},"engineInfo":"Databricks-Runtime/<unknown>","txnId":"a6a94671-55ef-450e-9546-b8465b9147de"}}"#,
r#"{"protocol":{"minReaderVersion":3,"minWriterVersion":7,"readerFeatures":["deletionVectors"],"writerFeatures":["deletionVectors"]}}"#,
r#"{"metaData":{"id":"testId","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{"delta.enableDeletionVectors":"true","delta.columnMapping.mode":"none", "delta.enableChangeDataFeed":"true"},"createdTime":1677811175819}}"#,
r#"{"cdc":{"path":"_change_data/age=21/cdc-00000-93f7fceb-281a-446a-b221-07b88132d203.c000.snappy.parquet","partitionValues":{"age":"21"},"size":1033,"dataChange":false}}"#
r#"{"cdc":{"path":"_change_data/age=21/cdc-00000-93f7fceb-281a-446a-b221-07b88132d203.c000.snappy.parquet","partitionValues":{"age":"21"},"size":1033,"dataChange":false}}"#,
r#"{"sidecar":{"path":"016ae953-37a9-438e-8683-9a9a4a79a395.parquet","sizeInBytes":9268,"modificationTime":1714496113961,"tags":{"tag_foo":"tag_bar"}}}"#,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: would a relative path look like this or would it be _sidecar/016ae953-37a9-438e-8683-9a9a4a79a395.parquet?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A relative path would just be the filename.

Comment on lines 592 to 593
r#"{"sidecar":{"path":"016ae953-37a9-438e-8683-9a9a4a79a395.parquet","sizeInBytes":9268,"modificationTime":1714496113961,"tags": null}}"#,
r#"{"sidecar":{"path":"3a0d65cd-4056-49b8-937b-95f9e3ee90e5.parquet","sizeInBytes":9268,"modificationTime":1714496113962,"tags": null}}"#,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any update on these tests?

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@OussamaSaoudi OussamaSaoudi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GJ!

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (left a couple nits)

#[allow(unused)] //TODO: Remove once we implement V2 checkpoint file processing
#[derive(Schema, Debug, PartialEq)]
#[cfg_attr(feature = "developer-visibility", visibility::make(pub))]
pub(crate) struct Sidecar {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could add some docs and link to sidecar action in protocol?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea 👍

@@ -544,6 +584,36 @@ mod tests {
Ok(())
}

#[test]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't we also add a trivial case just for these additions to see it fail if sidecar.path is missing? (and keep the issue to extend to other cases?)

@@ -4,6 +4,7 @@
use std::collections::HashMap;
use std::sync::LazyLock;

use crate::actions::SIDECAR_NAME;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move below with other imports?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dang good eyes 👍

@sebastiantia
Copy link
Collaborator Author

couldn't we also add a trivial case just for these additions to see it fail if sidecar.path is missing? (and keep the issue to extend to other cases?)

@zachschuermann I'd rather keep it in a separate PR since I think we only need to test a single required field of an action to verify that the visitor extraction fails appropriately on a batch that is missing the column.

Also, not immediately sure how to create an Arrow engine batch with a missing column in the first place since we have a lot of safeguards.

@sebastiantia sebastiantia merged commit 2240154 into delta-io:main Feb 5, 2025
20 of 21 checks passed
@sebastiantia sebastiantia deleted the introduce_sidecar_visitor branch February 6, 2025 00:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New visitor SidecarVisitor and Sidecar action struct
5 participants