Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel] Added Domain Metadata support to Delta Kernel #3835
base: master
Are you sure you want to change the base?
[Kernel] Added Domain Metadata support to Delta Kernel #3835
Changes from 18 commits
94260ef
7612613
2ac3985
807c896
84d4ae6
509bc27
83f3b1e
7d7032e
dadc5a6
0c188c7
7948fdf
6195e7f
fd06f6d
3e30a41
cec85cf
8edcc9a
baca1c5
7e0a172
127de8c
c5f3672
b2d6546
cd7ddeb
68c77d8
5afec36
e358878
4f56a2f
189ec75
a4e3104
b0e4a65
f89199c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this error is thrown when a commit (already on the filesystem contains duplicate entries for the same metadata). In this case the table is corrupted and we should use
InvalidTableException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error is thrown in two cases:
Transaction.commit()
(write path).As discussed in another thread, I think we can possibly remove the duplicate check in DM conflict resolving, and in this case this error is only thrown in
Transaction.commit()
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add docs (see getLatestTransactionVersion for an example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is used just in tests? Do we see a need for it when writing to the table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a doc for this.
Currently, it is only used in tests. I imagine we will need this when writing to the table in the future.
For example, in Row Tracking, we’ll need to get its domain metadata from the snapshot to retrieve the previous
HighWatermark
. We need this to assign fresh Row IDs to anyAddFile
actions withindataActions
that don’t yet have row IDs before committing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far this is just used in tests. Is it the common practise to add multiple domain metadata as part of the txn?
Also we could move the duplicate check here itself than validating just before the commit.
One more thing should we automatically upgrade the protocol version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also -- this should be in the TransactionBuilderImpl class instead. is there a specific reason you want to
build
a Transaction and then mutate and append domain metadatas to it afterwards?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a txn can contains multiple domain metadata - it seems to be the case in Delta-Spark.
One reason I do the duplication check just before committing is that domain metadata might be mutated in other code paths, such as during conflict resolution. Validating them right before the commit helps ensure that we capture all changes.
I'm not sure about other domain metadata usages. For Row Tracking, its domain metadata depends on all
AddFile
actions in thedataActions
passed intoTransaction.commit()
. We can only append this domain metadata after building the transaction and seeing allAddFile
actions. Also, conflict resolution can mutate existing domain metadatas.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like in practice we might not use an API like this? The domain metadata might just be generated as a step during
commit
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like I'm having a bit of a hard time mentally modeling this without an actual feature using domain metadatas. Are we adding support for row tracking or anything else after this?
If these are all private APIs we can revisit whether they make sense when adding support for a feature that uses them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is used only for tests, can you please add
Visible for testing
to the javadoc.Also -- @allisonport-db and @vkorukanti what do you think about adding a
@VisibleForTesting
annotation that we can add to methods to make tthis even clearer?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to documenting it's for testing (at least for now as long as no feature is using it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Annotation SGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is only for testing now I think there's also some validity to moving it to
TransactionBuilderImpl
as @scottsand-db suggested to avoid mutating a transaction. But since it's just for testing maybe it's not that important ¯_(ツ)_/¯There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think avoiding mutating a transaction makes sense. I've moved it to
TransactionBuilderImpl
and mentionedVisible for testing
in its javadoc.