
[Umbrella Feature Request] Delta Kernel APIs to simplify building connectors for reading Delta tables #1783

Open
1 of 3 tasks
vkorukanti opened this issue May 23, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@vkorukanti
Collaborator

vkorukanti commented May 23, 2023

Feature request

This is an umbrella issue for designing and implementing the Delta Kernel: a set of APIs that unify and simplify how connectors read Delta Lake tables. The current focus is on reading; support for writing will be added later.

Motivation

The Delta connector ecosystem is currently fragmented, with too many independent protocol implementations: Delta Spark, Delta Standalone, Trino, delta-rs, Delta Sharing, etc. This leads to the following problems:

  1. High variability in performance and bugs in connectors - Each implementation tries to implement the spec in different ways causing suboptimal performance and data correctness bugs.

  2. Sluggish adoption of new protocol features - Whenever there is a protocol update, every implementation needs to be updated separately. Furthermore, even when multiple connectors share the log replay implementation, each connector currently requires a deep understanding of the protocol details to implement the actual data operations (i.e., reads, writes, upserts) correctly. For example, Delta Standalone hides the details of log file formats, but ultimately exposes raw actions via programmatic APIs. Connectors using Standalone must understand all the nitty-gritty details of the column stats in the AddFiles to use them correctly for skipping. Such friction both discourages the creation of new connectors and slows down adoption of new protocol features in existing connectors.
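To make the stats point concrete, here is a minimal sketch of the kind of skipping logic a connector has to get right on its own today. It assumes a simplified, hypothetical `FileStats` holder for one long column; it is not the Delta Standalone or Delta Kernel API, and real per-file stats carry more fields than a single max value. The detail that is easy to miss is that a file with no recorded stats can never be skipped.

```java
import java.util.ArrayList;
import java.util.List;

final class StatsSkippingSketch {

    /** Hypothetical, simplified view of one file's stats for a single long column. */
    static final class FileStats {
        final String path;
        final Long maxValue; // null when no max was recorded for this column

        FileStats(String path, Long maxValue) {
            this.path = path;
            this.maxValue = maxValue;
        }
    }

    /** Returns the files that might contain rows where column > threshold. */
    static List<String> filesToRead(List<FileStats> files, long threshold) {
        List<String> kept = new ArrayList<>();
        for (FileStats f : files) {
            // Missing stats prove nothing, so the file must still be read.
            if (f.maxValue == null || f.maxValue > threshold) {
                kept.add(f.path);
            }
        }
        return kept;
    }
}
```

With files whose max values are 5, 20, and unknown, and a threshold of 10, only the first file can be skipped safely.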

To reduce fragmentation and speed up the rate of innovation, we have to simplify and unify the connector ecosystem. 

  • Simplify the programmatic APIs for building connectors - We want to build a "kernel" library (or a small set of them in different languages) that hides the protocol details of all operations behind simple library APIs. Connectors will just use those APIs to get scan file data that they can forward to the engine without any interpretation of the underlying raw actions. The engine will then pass the scan file data back to the Kernel APIs to read the table data. For example, for reads,

    • core will generate a list of scan files (as generic records)
    • connector + engine will blindly distribute these scan file records to the workers and call Kernel API ScanFile.read(scan file record) to get rows without having to understand what file action the data is coming from.
  • Unify the ecosystem - With these simplified APIs, we will be able to encourage new connectors to be built on them, and we can slowly convince the community to transition existing connectors to them too. 
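The read flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed API: `Kernel`, `planScan`, `read`, `ScanFileRecord`, and `InMemoryKernel` are hypothetical names, and an in-memory map stands in for log replay and Parquet reads. What it shows is the division of labor: the connector treats each scan file record as opaque and only hands it back to the kernel.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class KernelScanSketch {

    /** Opaque scan file record: the connector ships it to workers without reading it. */
    static final class ScanFileRecord {
        private final String path; // only the kernel looks inside

        ScanFileRecord(String path) {
            this.path = path;
        }
    }

    /** Hypothetical kernel surface; these are NOT the real Delta Kernel signatures. */
    interface Kernel {
        List<ScanFileRecord> planScan();

        List<String> read(ScanFileRecord scanFile);
    }

    /** Toy kernel backed by an in-memory map instead of a Delta log and Parquet files. */
    static final class InMemoryKernel implements Kernel {
        private final Map<String, List<String>> rowsByFile;

        InMemoryKernel(Map<String, List<String>> rowsByFile) {
            this.rowsByFile = new LinkedHashMap<>(rowsByFile);
        }

        @Override
        public List<ScanFileRecord> planScan() {
            List<ScanFileRecord> scanFiles = new ArrayList<>();
            for (String path : rowsByFile.keySet()) {
                scanFiles.add(new ScanFileRecord(path));
            }
            return scanFiles;
        }

        @Override
        public List<String> read(ScanFileRecord scanFile) {
            return rowsByFile.get(scanFile.path);
        }
    }

    /** Connector side: plan once, then hand each record back to the kernel to read. */
    static List<String> scanTable(Kernel kernel) {
        List<String> rows = new ArrayList<>();
        for (ScanFileRecord scanFile : kernel.planScan()) {
            // In a real engine this call would run on a worker.
            rows.addAll(kernel.read(scanFile));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("part-0.parquet", List.of("a", "b"));
        files.put("part-1.parquet", List.of("c"));
        System.out.println(scanTable(new InMemoryKernel(files)));
    }
}
```

Note that `scanTable` never touches a file path or an action type; swapping the toy kernel for a real one would not change the connector code.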

Further details

See the design doc for details.
See the presentation for high level details.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.

Project Plan

Delta 3.0

| Task description | Category | PR/Issue | Status | Author |
| --- | --- | --- | --- | --- |
| Decimal support (across expressions; ColumnarBatch/Row interface; default Parquet reader) | Protocol Support | #1951 | DONE | @allisonport-db |
| Timestamp support (across expressions; ColumnarBatch/Row interface; default Parquet reader) | Protocol Support | #1920 | DONE | @allisonport-db |
| Improve Log Replay Code: use ColumnarBatch method & improve test coverage/fix found bugs | Protocol Support | #1939, #2069 | DONE | @scottsand-db/@vkorukanti |
| Log Segment Loading: Support multi-part checkpoint reads | Protocol Support | #1984 | DONE | @allisonport-db |
| Additional expressions: Comparison (<, <=, >, >=) | Performance, API | #1997 | DONE | @vkorukanti |
| Partition pruning support | Performance | #2071 | DONE | @vkorukanti |
| Improve the complex type value access from the ColumnVector interface | API | #2087 | DONE | @allisonport-db |
| Usage doc | Misc. | #1927 | DONE | @vkorukanti |
| Example programs using Java Kernel | Misc. | #1926 | DONE | @vkorukanti |
| Various Parquet reader bug fixes and code cleanup | Default TableClient | #1974, #1980 | DONE | @vkorukanti |
| Code checkstyle/build setup | Misc. | #1901, #1962, #1970, #1977, #2085, #1954 | DONE | @allisonport-db |
| Misc. clean up of APIs | API | #2041, #2058, #2064 | DONE | @vkorukanti |

Delta 3.1

| Task description | Category | PR/Issue | Status | Author |
| --- | --- | --- | --- | --- |
| Support for file skipping using file stats in Delta Log | Performance | #2229 | IN PROGRESS | @allisonport-db |
| More unit tests based on golden tables | Testing | | DONE | @allisonport-db |
| Logging framework | Misc. | #2230 | DONE | @allisonport-db |
| Support id column mapping mode | Protocol Support | #2374 | DONE | @vkorukanti |

Laundry List

| Task description | Category | PR/Issue | Status | Author | Proposed Release |
| --- | --- | --- | --- | --- | --- |
| Exceptions framework | Misc. | #2231 | | | |
| Additional expressions (IS_NULL/IS_NOT_NULL; IN List Support) | Performance, API | | | | |
| Timestamp partition column support | Protocol Support | | | | |
| Test reading large tables - a large state with multiple different types of actions | Testing | | | | |
| Performance eval of reading large state tables | Performance | | | | |
| TimestampNTZ Support (across expressions; ColumnarBatch/Row interface; default Parquet reader) | Protocol Support | | | | |
| Utility methods to (de)serialize Row/ColumnarBatch - speeds up connector development | Misc. | | | | |
| Add checkpoint v2 support | Protocol | #2232 | | | |
| Get snapshot by version | Protocol | | | | |
| Get snapshot by timestamp | Protocol | | | | |
| table_changes | Protocol | | | | |
| streaming support | Protocol | | | | |

@vkorukanti vkorukanti added the enhancement New feature or request label May 23, 2023
allisonport-db added a commit to allisonport-db/delta that referenced this issue May 30, 2023
See delta-io#1783 for details.

This PR just sets up the `kernel/` subdirectory and sbt for future development.

delta-io#1786 updates the github actions and is based off of this PR

Closes #9
allisonport-db added a commit that referenced this issue May 30, 2023
See #1783 for details.

This PR just sets up the `kernel/` subdirectory and sbt for future development.

#1786 updates the github actions and is based off of this PR

Closes #1785
@vkorukanti
Collaborator Author

Attached the design doc.

@tdas tdas pinned this issue Jun 6, 2023
@tdas
Contributor

tdas commented Jun 6, 2023

Thank you @vkorukanti

allisonport-db added a commit that referenced this issue Jun 6, 2023
Adds the initial java interfaces for Delta Kernel #1783.

Also adds javastyle checks and some javadoc settings.

N/A. Only adds interfaces.

Closes #1808
@csimplestring

This is awesome! I have already developed a connector for Go: https://github.com/csimplestring/delta-go
I will definitely refactor it to follow this design!

@felipepessoto
Contributor

This looks great!
In the initial release, the plan is to implement the Kernel APIs in which languages?

@tdas
Contributor

tdas commented Jun 20, 2023

We are starting with Java, and we are definitely interested in Rust. Beyond that, I want to discuss more languages with the community. For example, @csimplestring, I wonder: if there is a Delta Kernel implemented in Rust, can the Go implementation just call into it? I am not familiar with the Rust ecosystem, and maintainers of delta-io/delta-rs like @rtyler @wjones127 have centuries more experience in such matters. But I hope we can just build a Rust Kernel and have other close-to-native languages use it.

@csimplestring

Hi @felipepessoto @tdas, I am keeping a close eye on the Java Kernel API development. Yes, I definitely plan to support this in Go.

This repo, https://github.com/csimplestring/delta-go, is a Go implementation of the Scala Standalone connector. It supports all features except the S3 multi-cluster log store.

For the Delta Kernel, I created a new repo that closely follows the development of the Java version, which is adding the API interfaces first. I think it will not be difficult to reuse the code.

vkorukanti added a commit that referenced this issue Jun 21, 2023
…zation

This PR is part of #1783. It adds additional data types supported by Delta Lake protocol that were missing from the interfaces PR #1808.

It also adds serialization and deserialization of table schema represented as `StructType`.

UTs

Closes #1842
vkorukanti added a commit that referenced this issue Jun 22, 2023
This PR is part of #1783.

It implements Parquet reader based on `parquet-mr` and generates the output as columnar batches of `ColumnVector` and `ColumnarBatch` interface implementations.

UTs

Closes #1846
vkorukanti added a commit that referenced this issue Jun 23, 2023
This PR is part of #1783.

Following client implementations for the default module are added:

     * `JsonHandler`
     * `ExpressionHandler`
     * `FileSystemClient`

    and the supporting classes.

UTs

Closes #1843
@felipepessoto
Contributor

Thanks @tdas. What about the Spark Scala version: are changes expected there, such as refactoring it to work on top of a Scala Kernel API?

vkorukanti added a commit that referenced this issue Jun 27, 2023
This PR is part of #1783.

It adds the Delta table state reconstruction and end-2-end API implementation.

Integration tests with different types of Delta tables.

Closes #1857
vkorukanti added a commit that referenced this issue Jun 27, 2023
(Cherry-pick of 27111ee to branch-2.5)

This PR is part of #1783.

It adds the Delta table state reconstruction and end-2-end API implementation.

Integration tests with different types of Delta tables.

Closes #1857
@vkorukanti vkorukanti added this to the 3.0.0 milestone Jul 19, 2023
@tdas
Contributor

tdas commented Aug 4, 2023

@felipepessoto We are still very far from having a concrete plan for the Spark Scala version, which is already heavily optimized for the Spark platform. We have to design and build the Delta Kernel write support first.

@watfordkcf
Contributor

I'm interested and happy to help where I can. I'm putting together a data access layer over top of our delta lake, and it is mostly Rust right now. Consistency would be key to helping get this out the door (jumping between Scala, Python, and Rust in the world of Delta/Spark is a bit of an interesting trip).

@MrPowers MrPowers unpinned this issue Aug 21, 2023
@vkorukanti vkorukanti removed this from the 3.0.0 milestone Aug 22, 2023
@vkorukanti vkorukanti modified the milestones: 2.2.1, 3.0.0 Aug 22, 2023
@vkorukanti vkorukanti removed this from the 3.0.0 milestone Oct 24, 2023
@tdas tdas moved this from Todo to In Progress in Linux Foundation Delta Lake Roadmap Nov 16, 2023
@tdas tdas changed the title [Feature Request] Delta Kernel APIs to simplify building connectors for reading Delta tables [Umbrella Feature Request] Delta Kernel APIs to simplify building connectors for reading Delta tables Mar 25, 2024
Projects
Status: In Progress

5 participants