
[Kernel] [CC Refactor #2] Add TableDescriptor and CommitCoordinatorClient API #3797

Merged
13 commits merged on Nov 1, 2024

Conversation


@scottsand-db commented Oct 23, 2024

This is a stacked PR. Please view this PR's diff here:

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Adds the new TableDescriptor and CommitCoordinatorClient APIs, and adds a new getCommitCoordinatorClient API to the Engine interface (with a default implementation that throws an UnsupportedOperationException).

How was this patch tested?

N/A trivial.

Does this PR introduce any user-facing changes?

Yes. See the above.

@allisonport-db left a comment

Mostly clarifying questions due to my lack of knowledge of CC.

Comment on lines +51 to +59
* Register the table represented by the given {@code logPath} at the provided {@code
* currentVersion} with the commit coordinator this commit coordinator client represents.
*
* <p>This API is called when the table is being converted from an existing file system table to a
* coordinated-commit table.
*
* <p>When a new coordinated-commit table is being created, the {@code currentVersion} will be -1
* and the upgrade commit needs to be a file system commit which will write the backfilled file
* directly.
Collaborator

For my understanding: you (the client, i.e. Spark/Kernel, etc.) call this first for some version N. Then, when commit is called with version N, the CCC recognizes that this is the same version and thus the commit needs to be immediately backfilled/written to the file system?

Contributor

Commit N would add the CC configuration to the table, so it'll be available in version N+1. It is not in version N, so the commit does not go through the newly added commit coordinator client but rather just through the file system, i.e. backfilling is not necessary.
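
To make that sequencing concrete, here is a hedged sketch of the upgrade flow; the helper itself and the exact registerTable parameter list (beyond what is quoted later in this thread) are assumptions, not the final Kernel API.

```java
// Hypothetical upgrade of an existing file-system table at version N to
// coordinated commits. Parameter lists here are illustrative assumptions.
void upgradeToCoordinatedCommits(
    Engine engine,
    CommitCoordinatorClient ccc,
    String logPath,
    TableIdentifier tableId,
    long currentVersion /* = N; -1 when creating a brand-new CC table */) {
  // 1. Register the table with the commit coordinator at version N.
  Map<String, String> ccTableConf =
      ccc.registerTable(engine, logPath, tableId, currentVersion);

  // 2. Commit version N itself as a plain file-system commit. That commit adds
  //    the CC configuration (ccTableConf) to the table, so the coordinator only
  //    takes over from version N + 1; no backfill is needed for version N.
}
```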

* @param tableDescriptor The descriptor for the table.
* @param commitVersion The version of the commit that is being committed.
* @param actions The set of actions to be committed
* @param updatedActions Additional information for the commit, including:
Collaborator

Sorry, a bunch of questions about CC writes, not necessarily specific to this PR. What are the updatedActions for, and why do they need to be separated from the other actions?

Collaborator Author

What are the updatedActions for and why do they need to be separated from the other actions?

Let's ask the feature owners: cc @dhruvarya-db and @sumeet-db and @prakharjain09

Contributor

The updated actions are the CommitInfo and the previous and current Metadata/Protocol. They are also included in the actions (Protocol and Metadata only if they changed), but we want to pass them separately for convenience, in case commit coordinator client implementations want to do something with them (for example, check if the Metadata of a table has changed).

Collaborator

Got it, thanks!

Collaborator

Also, actions is an iterator, i.e. it can only be traversed once, so the commit coordinator can't scan it to pull out these important updates (e.g. a schema or protocol change). That's why the API passes such updates explicitly.
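
As a hedged illustration of that point (the UpdatedActions accessor names and the actions iterator type below are assumptions based on this thread, not the finalized Kernel API):

```java
// Inside a hypothetical CommitCoordinatorClient implementation.
@Override
public CommitResponse commit(
    Engine engine,
    TableDescriptor tableDescriptor,
    long commitVersion,
    CloseableIterator<Row> actions,  // single-pass iterator of commit actions
    UpdatedActions updatedActions) { // CommitInfo + old/new Metadata/Protocol
  // The actions iterator can only be traversed once, so metadata/protocol
  // changes are read from updatedActions rather than by re-scanning actions.
  if (!updatedActions.getNewMetadata().equals(updatedActions.getOldMetadata())) {
    // e.g. invalidate any coordinator-side cache of the table's schema
  }
  // writeCommitFile is a hypothetical helper that writes the (possibly
  // unbackfilled) commit file and wraps its FileStatus in a CommitResponse.
  return writeCommitFile(engine, tableDescriptor, commitVersion, actions);
}
```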

* <li>Protocol changes
* </ul>
*
* @return {@link CommitResponse} containing the file status of the committed file. Note: If the
Collaborator

This is the unbackfilled file, right?

Collaborator

Not necessarily. It's acceptable for the CC client to return the backfilled file too.

*/
default CommitCoordinatorClient getCommitCoordinatorClient(
    String commitCoordinatorName, Map<String, String> commitCoordinatorConf) {
  throw new UnsupportedOperationException("Not implemented");
Collaborator

Is this temporarily with a default implementation or will we be keeping it like this?

What will be the expected behavior if an engine interface hasn't implemented this method but some user tries to read a CC table? Will it throw this exception? Or do we want to force all engine impls to override this?

Collaborator Author

temporarily with a default implementation

Yup! This is so we can merge this without having to go and update all implementations of Engine within this PR.

Collaborator

Is it right to assume that implementations will still be able to read other dynamic configurations when building the coordinator? E.g. the Delta-Spark getCCC interface also takes in a sparkSession, allowing for dynamic configuration of the client. Implementations of this method will still be able to read from some other configuration source (even though it is not explicitly being passed), right?

Collaborator Author

Is it right to assume that implementations will still be able to read other dynamic configurations when building the coordinator?

Absolutely. We leave it up to the engine to create the CCC. If the engine is aware of any DynamoDB configurations, it can use them!

Implementations of this method will still be able to read some other configuration source (even though it is not explicitly being passed) right?

Yes. I'd encourage you to look at the tracking issue #3817 and its future PRs, where you can see this being done.
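
For reference, a hedged sketch of what such an override could look like; the coordinator name, the client class, and reading extra settings from an engine-owned Hadoop Configuration are all illustrative assumptions.

```java
// Hypothetical Engine implementation overriding the new default method.
public class MyEngine implements Engine {
  private final Configuration hadoopConf; // engine-owned dynamic configuration

  public MyEngine(Configuration hadoopConf) {
    this.hadoopConf = hadoopConf;
  }

  @Override
  public CommitCoordinatorClient getCommitCoordinatorClient(
      String commitCoordinatorName, Map<String, String> commitCoordinatorConf) {
    if ("dynamodb".equals(commitCoordinatorName)) {
      // The engine is free to consult configuration that is not passed in
      // explicitly, e.g. an endpoint from its own Hadoop configuration.
      String endpoint = hadoopConf.get("my.dynamodb.endpoint"); // hypothetical key
      return new MyDynamoDBCommitCoordinatorClient(commitCoordinatorConf, endpoint);
    }
    throw new UnsupportedOperationException(
        "Unknown commit coordinator: " + commitCoordinatorName);
  }

  // Other Engine methods (getFileSystemClient, getJsonHandler, ...) omitted.
}
```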

@allisonport-db left a comment

Changes themselves LGTM

* and the upgrade commit needs to be a file system commit which will write the backfilled file
* directly.
*
* @param engine The {@link Engine} instance to use.
Contributor

Can we add some information on what the Engine would/should be used for during table registration?

Collaborator Author

@LukasRupprecht -- we don't want to prescribe what the engine should be used for. It can be used for any JSON reading or file system operations needed. Who knows how they implement their Commit Coordinator Client.

Collaborator Author

It's also just convention to pass in the engine, so that the client may use any future engine interfaces (think logging or metrics) without having to save a reference to the engine.
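
For example (illustrative only, assuming Kernel's existing getFileSystemClient()/listFrom() APIs), a commit coordinator client might list the delta log through the passed-in engine rather than holding its own file-system handle:

```java
// Illustrative only: use the Engine for whatever I/O the implementation needs,
// e.g. listing commit files starting from version 0 of the table. Assumes the
// enclosing method may throw/handle IOException.
try (CloseableIterator<FileStatus> deltaFiles =
    engine.getFileSystemClient().listFrom(logPath + "/00000000000000000000.json")) {
  // inspect existing commit files, if the implementation needs to
}
```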

/**
* Commit the given set of actions to the table represented by {@code tableDescriptor}.
*
* @param engine The {@link Engine} instance to use.
Contributor

Again, add some info on what we need the engine for (here it'd mainly be for writing the commit file). Same for the other APIs below.

Map<String, String> registerTable(
    Engine engine,
    String logPath,
    @Nullable TableIdentifier tableIdentifier,
Contributor

Why don't we use Optional like in TableDescriptor?

Collaborator Author

I've been flip-flopping on this myself.

Using Optional is not preferred in a Java public API ... but I've been considering refactoring and using Optional, actually.

Collaborator Author

@LukasRupprecht -- I'll use Optional 👍
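
A hedged sketch of the resulting signature (trailing parameters elided just as in the quoted diff; currentVersion is inferred from the registerTable javadoc earlier in this thread):

```java
Map<String, String> registerTable(
    Engine engine,
    String logPath,
    Optional<TableIdentifier> tableIdentifier, // was @Nullable TableIdentifier
    long currentVersion /* remaining parameters unchanged */);
```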

Comment on lines 27 to 28
* The complete descriptor of a Coordinated Commits (CC) Delta table, including its logPath, table
* identifier, and table CC table configuration.
Contributor

Suggested change
- * The complete descriptor of a Coordinated Commits (CC) Delta table, including its logPath, table
- * identifier, and table CC table configuration.
+ * The complete descriptor of a Coordinated Commits (CC) Delta table, including its logPath, table
+ * identifier (if access is not path-based), and table CC table configuration.

Collaborator Author

Hm... that's not quite correct, right? It's not like you pass either one of the path OR the identifier, but not BOTH, to the CC. We are passing both ...

@LukasRupprecht commented Nov 1, 2024

I meant that if the access is path-based, there won't be a table identifier. This would explain why it is optional (because in some cases, there won't be one).

Collaborator Author

I'll include this change in PR 3 #3798
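
Summarizing this thread, a hedged sketch of what TableDescriptor roughly holds per the quoted javadoc; the field and accessor names are assumptions:

```java
// Hypothetical shape of TableDescriptor: a log path, an optional table
// identifier (absent for path-based access), and the CC table configuration.
public class TableDescriptor {
  private final String logPath;
  private final Optional<TableIdentifier> tableIdentifier;
  private final Map<String, String> tableConf;

  public TableDescriptor(
      String logPath,
      Optional<TableIdentifier> tableIdentifier,
      Map<String, String> tableConf) {
    this.logPath = logPath;
    this.tableIdentifier = tableIdentifier;
    this.tableConf = tableConf;
  }

  public String getLogPath() { return logPath; }
  public Optional<TableIdentifier> getTableIdentifier() { return tableIdentifier; }
  public Map<String, String> getTableConf() { return tableConf; }
}
```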

@scottsand-db merged commit 6ae4b62 into delta-io:master on Nov 1, 2024
19 checks passed