Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Added Domain Metadata support to Delta Kernel #3835

Open
wants to merge 30 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 18 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
94260ef
Added Domain Metadata support to Delta Kernel
qiyuandong-db Oct 31, 2024
7612613
Lazily load domain metadata during log replay.
qiyuandong-db Oct 31, 2024
2ac3985
Use an iterator to wrap the data action iterator for DM duplicate det…
qiyuandong-db Oct 31, 2024
807c896
Update kernel/kernel-api/src/main/java/io/delta/kernel/internal/Delta…
qiyuandong-db Oct 31, 2024
84d4ae6
Improve error message
qiyuandong-db Oct 31, 2024
509bc27
Rename a unit test
qiyuandong-db Oct 31, 2024
83f3b1e
Don't allow duplicate DMs when reading actions from the winning txn d…
qiyuandong-db Nov 1, 2024
7d7032e
Update error messages in the tests
qiyuandong-db Nov 1, 2024
dadc5a6
Fix typos in comments
qiyuandong-db Nov 3, 2024
0c188c7
Add an integration test with spark
qiyuandong-db Nov 4, 2024
7948fdf
Fix the JavadocGenerationFailed error in the Delta Kernel CI job.
qiyuandong-db Nov 4, 2024
6195e7f
Resolve PR comments.
qiyuandong-db Nov 5, 2024
fd06f6d
Update util method extractDomainMetadataMap.
qiyuandong-db Nov 5, 2024
3e30a41
Remove blank lines.
qiyuandong-db Nov 5, 2024
cec85cf
Address PR comments
qiyuandong-db Nov 6, 2024
8edcc9a
Address PR comments
qiyuandong-db Nov 7, 2024
baca1c5
Move domain metadata actions out of dataActions
qiyuandong-db Nov 7, 2024
7e0a172
Fix javafmt
qiyuandong-db Nov 7, 2024
127de8c
Use a set to check for unsupported writer features
qiyuandong-db Nov 10, 2024
c5f3672
Resolve PR comments
qiyuandong-db Nov 11, 2024
b2d6546
Move golden table to kernel tests.
qiyuandong-db Nov 11, 2024
cd7ddeb
Use getTestResourceFilePath in test to get golden table path.
qiyuandong-db Nov 11, 2024
68c77d8
Resolve git comments
qiyuandong-db Nov 12, 2024
5afec36
Move resolveDomainMetadataConflict into handleDomainMetadata
qiyuandong-db Nov 12, 2024
e358878
Move addDomainMetadata from TransactionImpl to TransactionBuilderImpl
qiyuandong-db Nov 12, 2024
4f56a2f
Use SUPPORTED_WRITER_FEATURES in validateWriteSupportedTable
qiyuandong-db Nov 12, 2024
189ec75
Remove the duplicate check when reading winning txn's DM. Change extr…
qiyuandong-db Nov 13, 2024
a4e3104
Fix nit
qiyuandong-db Nov 13, 2024
b0e4a65
Rename populateDomainMetadataMap
qiyuandong-db Nov 13, 2024
f89199c
Remove unused imports
qiyuandong-db Nov 13, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
{"commitInfo":{"timestamp":1730671956424,"operation":"CREATE TABLE","operationParameters":{"partitionBy":"[]","clusterBy":"[]","description":null,"isManaged":"false","properties":"{\"delta.checkpointInterval\":\"3\"}"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.5.3 Delta-Lake/3.3.0-SNAPSHOT","txnId":"90158bef-da23-4500-aafc-d2932e80f8cb"}}
{"metaData":{"id":"04e4bf27-b577-4f7d-b002-08b3bbc00ce5","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{"delta.checkpointInterval":"3"},"createdTime":1730671956256}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
{"commitInfo":{"timestamp":1730671958509,"operation":"SET TBLPROPERTIES","operationParameters":{"properties":"{\"delta.feature.domainmetadata\":\"enabled\"}"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.5.3 Delta-Lake/3.3.0-SNAPSHOT","txnId":"4b1dc72c-432c-4753-b9d3-68ab89f3cb91"}}
{"metaData":{"id":"04e4bf27-b577-4f7d-b002-08b3bbc00ce5","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{"delta.checkpointInterval":"3"},"createdTime":1730671956256}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":7,"writerFeatures":["domainMetadata","appendOnly","invariants"]}}
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
{"commitInfo":{"timestamp":1730671958797,"operation":"Manual Update","operationParameters":{},"readVersion":1,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.5.3 Delta-Lake/3.3.0-SNAPSHOT","txnId":"9a6324de-800e-4c8f-9ce3-7766d0462474"}}
{"domainMetadata":{"domain":"testDomain1","configuration":"{\"key1\":\"1\"}","removed":false}}
{"domainMetadata":{"domain":"testDomain2","configuration":"","removed":false}}
{"domainMetadata":{"domain":"testDomain3","configuration":"","removed":false}}
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
{"commitInfo":{"timestamp":1730671959801,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":2,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"2","numOutputRows":"2","numOutputBytes":"956"},"engineInfo":"Apache-Spark/3.5.3 Delta-Lake/3.3.0-SNAPSHOT","txnId":"54aecdaa-e039-40c8-ac9b-bcd7f30183fa"}}
{"add":{"path":"test%25file%25prefix-part-00000-48cf7913-43ae-45bf-ab2c-94eb2fe77358-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1730671959767,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":0},\"nullCount\":{\"id\":0}}"}}
{"add":{"path":"test%25file%25prefix-part-00001-071539c0-ef9e-478c-b550-035c6b5a31c2-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1730671959767,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":1},\"maxValues\":{\"id\":1},\"nullCount\":{\"id\":0}}"}}
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
{"commitInfo":{"timestamp":1730671962209,"operation":"Manual Update","operationParameters":{},"readVersion":3,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.5.3 Delta-Lake/3.3.0-SNAPSHOT","txnId":"063e4cd4-4091-4a30-a089-4e8a0fc5c0ac"}}
{"domainMetadata":{"domain":"testDomain1","configuration":"{\"key1\":\"10\"}","removed":false}}
{"domainMetadata":{"domain":"testDomain2","configuration":"","removed":true}}
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{"version":3,"size":7,"sizeInBytes":16337,"numOfAddFiles":2,"checkpointSchema":{"type":"struct","fields":[{"name":"txn","type":{"type":"struct","fields":[{"name":"appId","type":"string","nullable":true,"metadata":{}},{"name":"version","type":"long","nullable":true,"metadata":{}},{"name":"lastUpdated","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"add","type":{"type":"struct","fields":[{"name":"path","type":"string","nullable":true,"metadata":{}},{"name":"partitionValues","type":{"type":"map","keyType":"string","valueType":"string","valueContainsNull":true},"nullable":true,"metadata":{}},{"name":"size","type":"long","nullable":true,"metadata":{}},{"name":"modificationTime","type":"long","nullable":true,"metadata":{}},{"name":"dataChange","type":"boolean","nullable":true,"metadata":{}},{"name":"tags","type":{"type":"map","keyType":"string","valueType":"string","valueContainsNull":true},"nullable":true,"metadata":{}},{"name":"deletionVector","type":{"type":"struct","fields":[{"name":"storageType","type":"string","nullable":true,"metadata":{}},{"name":"pathOrInlineDv","type":"string","nullable":true,"metadata":{}},{"name":"offset","type":"integer","nullable":true,"metadata":{}},{"name":"sizeInBytes","type":"integer","nullable":true,"metadata":{}},{"name":"cardinality","type":"long","nullable":true,"metadata":{}},{"name":"maxRowIndex","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"baseRowId","type":"long","nullable":true,"metadata":{}},{"name":"defaultRowCommitVersion","type":"long","nullable":true,"metadata":{}},{"name":"clusteringProvider","type":"string","nullable":true,"metadata":{}},{"name":"stats","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"remove","type":{"type":"struct","fields":[{"name":"path","type":"string","nullable":true,"metadata":{}},{"name":"deletionTimestamp","type":"long","nullable":true,"metadata":{}},{"name":"dataChange","type":"boolean","nullable":true,"metadata":{}},{"name":"extendedFileMetadata","type":"boolean","nullable":true,"metadata":{}},{"name":"partitionValues","type":{"type":"map","keyType":"string","valueType":"string","valueContainsNull":true},"nullable":true,"metadata":{}},{"name":"size","type":"long","nullable":true,"metadata":{}},{"name":"deletionVector","type":{"type":"struct","fields":[{"name":"storageType","type":"string","nullable":true,"metadata":{}},{"name":"pathOrInlineDv","type":"string","nullable":true,"metadata":{}},{"name":"offset","type":"integer","nullable":true,"metadata":{}},{"name":"sizeInBytes","type":"integer","nullable":true,"metadata":{}},{"name":"cardinality","type":"long","nullable":true,"metadata":{}},{"name":"maxRowIndex","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"baseRowId","type":"long","nullable":true,"metadata":{}},{"name":"defaultRowCommitVersion","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"metaData","type":{"type":"struct","fields":[{"name":"id","type":"string","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"description","type":"string","nullable":true,"metadata":{}},{"name":"format","type":{"type":"struct","fields":[{"name":"provider","type":"string","nullable":true,"metadata":{}},{"name":"options","type":{"type":"map","keyType":"string","valueType":"string","valueContainsNull":true},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"schemaString","type":"string","nullable":true,"metadata":{}},{"name":"partitionColumns","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}},{"name":"configuration","type":{"type":"map","keyType":"string","valueType":"string","valueContainsNull":true},"nullable":true,"metadata":{}},{"name":"createdTime","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"protocol","type":{"type":"struct","fields":[{"name":"minReaderVersion","type":"integer","nullable":true,"metadata":{}},{"name":"minWriterVersion","type":"integer","nullable":true,"metadata":{}},{"name":"readerFeatures","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}},{"name":"writerFeatures","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"domainMetadata","type":{"type":"struct","fields":[{"name":"domain","type":"string","nullable":true,"metadata":{}},{"name":"configuration","type":"string","nullable":true,"metadata":{}},{"name":"removed","type":"boolean","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"checksum":"7a97cc187ffb0604cc50e51bddb3cbfa"}
Binary file not shown.
Binary file not shown.
Original file line number Diff line number Diff line change
Expand Up @@ -1663,6 +1663,46 @@ class GoldenTables extends QueryTest with SharedSparkSession {
Row(0, 0) :: Nil,
schema)
}

generateGoldenTable("kernel-domain-metadata") { tablePath =>
qiyuandong-db marked this conversation as resolved.
Show resolved Hide resolved
withSQLConf(
("spark.databricks.delta.properties.defaults.checkpointInterval", "3")
) {
val tbl = "tbl"

sql(s"CREATE TABLE $tbl (id LONG) USING delta LOCATION '$tablePath'")
sql(s"ALTER TABLE $tbl SET TBLPROPERTIES('delta.feature.domainMetadata' = 'enabled')")

val deltaLog = DeltaLog.forTable(spark, new Path(tablePath))

deltaLog
.startTransaction()
.commitManually(
List(
DomainMetadata("testDomain1", "{\"key1\":\"1\"}", removed = false),
DomainMetadata("testDomain2", "", removed = false),
DomainMetadata("testDomain3", "", removed = false)
): _*
)

spark.range(0, 2).write.format("delta").mode("append").save(tablePath) // Checkpoint created

deltaLog
.startTransaction()
.commitManually(
List(
DomainMetadata("testDomain1", "{\"key1\":\"10\"}".stripMargin, removed = false),
DomainMetadata("testDomain2", "", removed = true)
): _*
)

// In the end, we need to read 1 checkpoint file and 1 log file to replay the golden table
// The state of the domain metadata should be:
// testDomain1: "\"key1\":\"10\"", removed = false
// testDomain2: "", removed = true
// testDomain3: "", removed = false
}
}
}

case class TestStruct(f1: String, f2: Long)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
import static java.lang.String.format;

import io.delta.kernel.exceptions.*;
import io.delta.kernel.internal.actions.DomainMetadata;
import io.delta.kernel.types.DataType;
import io.delta.kernel.types.StructType;
import io.delta.kernel.utils.DataFileStatus;
Expand Down Expand Up @@ -274,6 +275,34 @@ public static KernelException invalidConfigurationValueException(
return new InvalidConfigurationValueException(key, value, helpMessage);
}

public static KernelException domainMetadataUnsupported() {
String message =
"Found DomainMetadata action(s) but table feature 'domainMetadata' "
+ "is not supported on this table.";
return new KernelException(message);
}

public static KernelException duplicateDomainMetadataAction(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this error is thrown when a commit (already on the filesystem contains duplicate entries for the same metadata). In this case the table is corrupted and we should use InvalidTableException

Copy link
Author

@qiyuandong-db qiyuandong-db Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is thrown in two cases:

  • When we detect duplicate domain metadata entries in a winning commit file during conflict resolution, which corresponds to what you mentioned (read path).
  • When validating domain metadata actions to be committed in Transaction.commit() (write path).

As discussed in another thread, I think we can possibly remove the duplicate check in DM conflict resolving, and in this case this error is only thrown in Transaction.commit().

String domain, String action1, String action2) {
qiyuandong-db marked this conversation as resolved.
Show resolved Hide resolved
qiyuandong-db marked this conversation as resolved.
Show resolved Hide resolved
String message =
String.format(
"Multiple actions detected for domain '%s' in single transaction: '%s' and '%s'. "
+ "Only one action per domain is allowed.",
domain, action1, action2);
return new KernelException(message);
}

public static ConcurrentWriteException concurrentDomainMetadataAction(
DomainMetadata domainMetadataAttempt, DomainMetadata winningDomainMetadata) {
String message =
String.format(
"A concurrent writer added a domainMetadata action for the same domain: %s. "
+ "No domain-specific conflict resolution available for this domain. "
qiyuandong-db marked this conversation as resolved.
Show resolved Hide resolved
+ "Attempted domainMetadata: %s. Winning domainMetadata: %s",
domainMetadataAttempt.getDomain(), domainMetadataAttempt, winningDomainMetadata);
return new ConcurrentWriteException(message);
}

/* ------------------------ HELPER METHODS ----------------------------- */
private static String formatTimestamp(long millisSinceEpochUTC) {
return new Timestamp(millisSinceEpochUTC).toInstant().toString();
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
import io.delta.kernel.engine.CommitCoordinatorClientHandler;
import io.delta.kernel.engine.Engine;
import io.delta.kernel.internal.actions.CommitInfo;
import io.delta.kernel.internal.actions.DomainMetadata;
import io.delta.kernel.internal.actions.Metadata;
import io.delta.kernel.internal.actions.Protocol;
import io.delta.kernel.internal.fs.Path;
Expand All @@ -31,6 +32,7 @@
import io.delta.kernel.internal.snapshot.LogSegment;
import io.delta.kernel.internal.snapshot.TableCommitCoordinatorClientHandler;
import io.delta.kernel.types.StructType;
import java.util.Map;
import java.util.Optional;

/** Implementation of {@link Snapshot}. */
Expand Down Expand Up @@ -83,6 +85,10 @@ public Protocol getProtocol() {
return protocol;
}

public Map<String, DomainMetadata> getDomainMetadataMap() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add docs (see getLatestTransactionVersion for an example)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is used just in tests? Do we see a need for it when writing to the table?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a doc for this.

Currently, it is only used in tests. I imagine we will need this when writing to the table in the future.

For example, in Row Tracking, we’ll need to get its domain metadata from the snapshot to retrieve the previous HighWatermark. We need this to assign fresh Row IDs to any AddFile actions within dataActions that don’t yet have row IDs before committing.

return logReplay.getDomainMetadataMap();
}

public CreateCheckpointIterator getCreateCheckpointIterator(Engine engine) {
long minFileRetentionTimestampMillis =
System.currentTimeMillis() - TOMBSTONE_RETENTION.fromMetadata(metadata);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ public class TableFeatures {
add("columnMapping");
add("typeWidening-preview");
add("typeWidening");
add("domainMetadata");
}
});

Expand Down Expand Up @@ -93,7 +94,7 @@ public static void validateReadSupportedTable(
* <li>protocol writer version 1.
* <li>protocol writer version 2 only with appendOnly feature enabled.
* <li>protocol writer version 7 with {@code appendOnly}, {@code inCommitTimestamp}, {@code
* columnMapping}, {@code typeWidening} feature enabled.
* columnMapping}, {@code typeWidening}, {@code domainMetadata} feature enabled.
* </ul>
*
* @param protocol Table protocol
Expand Down Expand Up @@ -137,6 +138,8 @@ public static void validateWriteSupportedTable(
break;
case "typeWidening":
break;
case "domainMetadata":
qiyuandong-db marked this conversation as resolved.
Show resolved Hide resolved
break;
default:
throw unsupportedWriterFeature(tablePath, writerFeature);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -32,11 +32,7 @@
import io.delta.kernel.internal.fs.Path;
import io.delta.kernel.internal.replay.ConflictChecker;
import io.delta.kernel.internal.replay.ConflictChecker.TransactionRebaseState;
import io.delta.kernel.internal.util.Clock;
import io.delta.kernel.internal.util.ColumnMapping;
import io.delta.kernel.internal.util.FileNames;
import io.delta.kernel.internal.util.InCommitTimestampUtils;
import io.delta.kernel.internal.util.VectorUtils;
import io.delta.kernel.internal.util.*;
import io.delta.kernel.types.StructType;
import io.delta.kernel.utils.CloseableIterable;
import io.delta.kernel.utils.CloseableIterator;
Expand Down Expand Up @@ -75,6 +71,7 @@ public class TransactionImpl implements Transaction {
private final Clock clock;
private Metadata metadata;
private boolean shouldUpdateMetadata;
private final List<DomainMetadata> domainMetadatas = new ArrayList<>();
qiyuandong-db marked this conversation as resolved.
Show resolved Hide resolved

private boolean closed; // To avoid trying to commit the same transaction again.

Expand Down Expand Up @@ -221,11 +218,22 @@ private TransactionCommitResult doCommit(
}
setTxnOpt.ifPresent(setTxn -> metadataActions.add(createTxnSingleAction(setTxn.toRow())));

// Check for duplicate domain metadata and if the protocol supports
DomainMetadataUtils.validateDomainMetadatas(domainMetadatas, protocol);

// Create domain metadata action rows
List<Row> domainMetadataActions = new ArrayList<>();
for (DomainMetadata domainMetadata : domainMetadatas) {
domainMetadataActions.add(createDomainMetadataSingleAction(domainMetadata.toRow()));
qiyuandong-db marked this conversation as resolved.
Show resolved Hide resolved
}

try (CloseableIterator<Row> stageDataIter = dataActions.iterator()) {
// Create a new CloseableIterator that will return the metadata actions followed by the
// data actions.
CloseableIterator<Row> dataAndMetadataActions =
toCloseableIterator(metadataActions.iterator()).combine(stageDataIter);
toCloseableIterator(metadataActions.iterator())
.combine(toCloseableIterator(domainMetadataActions.iterator()))
.combine(stageDataIter);

if (commitAsVersion == 0) {
// New table, create a delta log directory
Expand Down Expand Up @@ -269,6 +277,19 @@ public Optional<SetTransaction> getSetTxnOpt() {
return setTxnOpt;
}

/**
* Add domain metadata to the transaction to be committed.
*
* @param domainMetadatas List of domain metadata to be added.
*/
public void addDomainMetadata(List<DomainMetadata> domainMetadatas) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far this is just used in tests. Is it the common practise to add multiple domain metadata as part of the txn?

Also we could move the duplicate check here itself than validating just before the commit.

One more thing should we automatically upgrade the protocol version?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also -- this should be in the TransactionBuilderImpl class instead. is there a specific reason you want to build a Transaction and then mutate and append domain metadatas to it afterwards?

Copy link
Author

@qiyuandong-db qiyuandong-db Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a txn can contains multiple domain metadata - it seems to be the case in Delta-Spark.

One reason I do the duplication check just before committing is that domain metadata might be mutated in other code paths, such as during conflict resolution. Validating them right before the commit helps ensure that we capture all changes.

Is there a specific reason you want to build a Transaction and then mutate and append domain metadatas to it afterwards?

I'm not sure about other domain metadata usages. For Row Tracking, its domain metadata depends on all AddFile actions in the dataActions passed into Transaction.commit(). We can only append this domain metadata after building the transaction and seeing all AddFile actions. Also, conflict resolution can mutate existing domain metadatas.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like in practice we might not use an API like this? The domain metadata might just be generated as a step during commit?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I'm having a bit of a hard time mentally modeling this without an actual feature using domain metadatas. Are we adding support for row tracking or anything else after this?

If these are all private APIs we can revisit whether they make sense when adding support for a feature that uses them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is used only for tests, can you please add Visible for testing to the javadoc.

Also -- @allisonport-db and @vkorukanti what do you think about adding a @VisibleForTesting annotation that we can add to methods to make tthis even clearer?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to documenting it's for testing (at least for now as long as no feature is using it)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annotation SGTM

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is only for testing now I think there's also some validity to moving it to TransactionBuilderImpl as @scottsand-db suggested to avoid mutating a transaction. But since it's just for testing maybe it's not that important ¯_(ツ)_/¯

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think avoiding mutating a transaction makes sense. I've moved it to TransactionBuilderImpl and mentioned Visible for testing in its javadoc.

this.domainMetadatas.addAll(domainMetadatas);
}

public List<DomainMetadata> getDomainMetadatas() {
return domainMetadatas;
}

/**
* Generates a timestamp which is greater than the commit timestamp of the readSnapshot. This can
* result in an additional file read and that this will only happen if ICT is enabled.
Expand Down
Loading
Loading