# Research on using Iceberg instead of ClickHouse/SQLite (#932)
From today's call:

**Iceberg Integration Discussion:** @skshetry investigated Iceberg integration, concluding that direct integration is challenging because Iceberg is a metastore rather than a SQL database. ClickHouse and DuckDB support read-only access, while Trino and Spark offer write capabilities, though the writing method remains unclear. He considered import/export support as a starting point, but acknowledged its limitations compared to native Iceberg support. The team discussed the ideal use case of writing and reading Iceberg tables directly in object storage, but recognized that many tools currently lack write access.

**Iceberg vs. ClickHouse and Data Processing:** The discussion explored the benefits of Iceberg compared to ClickHouse. @shcheklein highlighted Iceberg's decoupling of persistent storage and compute. @skshetry contrasted Iceberg's lakehouse approach for potentially unstructured data with ClickHouse's data-warehouse approach for structured data. @dmpetrov emphasized Iceberg's optimized query capabilities, which stem from its data layout and its storage in object stores like S3, contrasting it with less efficient methods such as querying files directly from S3. @shcheklein further explained that Iceberg allows the use of various databases on top of the same data, making it more flexible.

**Iceberg Investigation and Next Steps:** The investigation revealed challenges in integrating Iceberg into datachain. Import support was suggested as a potential initial step, with the acknowledgment that this would be similar to the existing Parquet support and would not leverage Iceberg's native capabilities. We also discussed the various Iceberg catalogs and their implications for integration.
Apache Iceberg is an open table format designed for huge analytic datasets. Iceberg enables various compute engines (like Spark, Trino, Flink, and Hive) to safely work with the same tables at the same time.

Briefly, let me discuss the architecture of Iceberg. Iceberg mainly has 3 layers:

1. **Catalog layer**: maps a table name to the location of its current metadata file.
2. **Metadata layer**: metadata files, manifest lists, and manifest files that track snapshots, schemas, partitioning, and per-file statistics.
3. **Data layer**: the actual data files (typically Parquet) in object storage.
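As a rough illustration (hypothetical paths), the metadata and data layers live together in object storage, while the catalog sits outside and points at the current metadata file:

```text
s3://bucket/warehouse/db/events/
├── metadata/
│   ├── v2.metadata.json     # table metadata: schema, partition spec, snapshots
│   ├── snap-123-....avro    # manifest list (one per snapshot)
│   └── ...-m0.avro          # manifest files: track data files + column stats
└── data/
    └── 00000-0-....parquet  # data files (Parquet/ORC/Avro)
```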
Additional References: …

## Compute Engines

More than the catalog, we care about compute engines. Let's discuss them briefly.

### Spark

Spark is the most feature-rich compute engine for Iceberg operations. It supports both reads and writes.

### ClickHouse

ClickHouse has two kinds of support:

1. the `Iceberg` table engine, and
2. the `iceberg` table function (a query sketch using it follows below).
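For illustration, a read-only query through the `iceberg` table function via the `clickhouse-connect` Python client; the host, credentials, and S3 path are placeholders:

```python
# Read-only sketch: querying an Iceberg table through ClickHouse's
# `iceberg` table function with the clickhouse-connect client.
# Host, credentials, and the S3 path below are placeholders.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

result = client.query(
    "SELECT count() FROM iceberg("
    "'s3://bucket/warehouse/db/events/', 'AWS_KEY_ID', 'AWS_SECRET')"
)
print(result.result_rows)
```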
Both of these are read-only; ClickHouse recommends using the latter (the table function).

### DuckDB

DuckDB has read-only support for reading/querying Iceberg tables, including from S3. Unfortunately there is no write support, though there is an open issue: duckdb/duckdb-iceberg#37. See https://duckdb.org/docs/stable/extensions/iceberg.html and https://duckdb.org/docs/stable/guides/network_cloud_storage/s3_iceberg_import.html#loading-iceberg-tables-from-s3.

The DuckDB integration is an experimental project, more of a proof of concept. It only started supporting the REST catalog last month, and that support is not yet released. There are some things it does not support: hidden partitioning and predicate pushdown for table scans. Also no write support. :(

### Daft

Daft provides both read and write support for Iceberg. It also supports push-down filtering. Looking at their code, write support does not look trivial. :(

### Snowflake

Snowflake has catalog support as well as query-engine (SQL) support.

### Trino

Trino supports Iceberg and has SQL support, including write operations. https://trino.io/docs/current/connector/iceberg.html#sql-support

### PyIceberg

PyIceberg is an official Apache Iceberg project that provides a Python implementation for accessing Iceberg tables without the need for a JVM: https://github.com/apache/iceberg-python

It has read/write support for Iceberg; write support is recent. pyiceberg does not have … (see apache/iceberg-python#736).
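A quick sketch of PyIceberg's read and (recent) write paths; the catalog name `default` and the table `db.events` are assumptions, configured e.g. via `~/.pyiceberg.yaml`:

```python
# PyIceberg sketch, assuming a catalog named "default" is configured
# (e.g. in ~/.pyiceberg.yaml) and a table "db.events" already exists.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
tbl = catalog.load_table("db.events")

# Read: row filters can be passed as strings (or expression objects).
arrow = tbl.scan(row_filter="id > 100", selected_fields=("id", "name")).to_arrow()

# Write (recent addition): append a PyArrow table with a matching schema.
tbl.append(pa.table({"id": [1], "name": ["x"]}))
```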
It also has support for expressions for filtering, and supports SQL, REST, Hive, Glue, and DynamoDB catalogs. (TBD: need to add more information.)

### Polars

Polars supports reading from an Apache Iceberg table and has push-down filtering.
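For example, a filtered read with Polars; the path to the table's metadata file is a placeholder:

```python
# Polars read sketch; the metadata-file path is a placeholder.
import polars as pl

lf = pl.scan_iceberg("s3://bucket/warehouse/db/events/metadata/v2.metadata.json")

# The filter is pushed down into the Iceberg scan rather than applied after load.
df = lf.filter(pl.col("id") > 100).collect()
```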
It also uses PyIceberg under the hood.

### Ibis / Dbt / Kedro / Dask
### Delta Lake / DataFusion

(haven't checked these yet!)

## Notes about my PR #937

My PR uses … Although that's an import and a blocking call, so we don't have any way to do computes before it gets saved as a datachain dataset. And we don't have a way to apply filters beforehand, so unlike Polars/Daft, we'll save the entire table.

So far, the only way to directly read from Iceberg is …

There is also a way to query …

Note that …

If …

(I'll add more information as I do research; so far there is no satisfying answer.)
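To make the limitation concrete, here is a hypothetical sketch (not the PR's actual code; catalog and table names are made up) of the blocking, whole-table import flow described above:

```python
# Hypothetical illustration of the import-style flow described above,
# NOT the actual code from PR #937. Catalog/table names are made up.
from pyiceberg.catalog import load_catalog

tbl = load_catalog("default").load_table("db.events")

# Blocking call that materializes the ENTIRE table up front:
# no filters or computes can run before the result is saved as a
# datachain dataset (unlike Polars/Daft push-down scans).
arrow_table = tbl.scan().to_arrow()
```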
## Current Implementation

### SQLite (Local/CLI Usage)

The project uses SQLite for local operations through two main classes:

#### `SQLiteDatabaseEngine`

`src/datachain/data_storage/sqlite.py`, line 98 (at `ed973c8`)

#### `SQLiteWarehouse`

https://github.com/iterative/datachain/blob/ed973c82f4ab4674a5a24816eaf689498117aef6/src/datachain/data_storage/sqlite.py#L406C7-L406C22
### ClickHouse implementation

Currently used in the SaaS version.
## Why Consider Iceberg?

### Current Pain Points