Skip to content

Research on using Iceberg instead of ClickHouse/SQLite #932

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
amritghimire opened this issue Feb 24, 2025 · 2 comments
Open

Research on using Iceberg instead of ClickHouse/SQLite #932

amritghimire opened this issue Feb 24, 2025 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

@amritghimire
Copy link
Contributor

amritghimire commented Feb 24, 2025

Current Implementation

SQLite (Local/CLI Usage)

The project uses SQLite for local operations through two main classes:

  1. SQLiteDatabaseEngine

class SQLiteDatabaseEngine(DatabaseEngine):

  1. SQLiteWarehouse

https://github.com/iterative/datachain/blob/ed973c82f4ab4674a5a24816eaf689498117aef6/src/datachain/data_storage/sqlite.py#L406C7-L406C22

Clickhouse implementation

Currently used in the SaaS version

Why Consider Iceberg?

Current Pain Points

  1. Dual Implementation Overhead:
    • Maintaining separate SQLite and ClickHouse implementations
    • Different transaction and concurrency models
    • Separate optimization strategies
  2. Performance
@amritghimire amritghimire added the enhancement New feature or request label Feb 24, 2025
@amritghimire
Copy link
Contributor Author

From today's call:

Iceberg Integration Discussion:

@skshetry investigated iceberg integration, concluding that direct integration is challenging due to iceberg's nature as a metastore rather than a SQL database. ClickHouse and DuckDB support read-only access, while Trino and Spark offer write capabilities, though the writing method remains unclear. He considered import/export support as a starting point, but acknowledged its limitations compared to native iceberg support. The team discussed the ideal use case of writing and reading iceberg tables directly in object storage, but recognized the current limitations of write access for many tools.

Iceberg vs. ClickHouse and Data Processing:

The discussion explored the benefits of iceberg compared to ClickHouse. @shcheklein highlighted iceberg's decoupling of persistent storage and compute. @skshetry contrasted iceberg's lakehouse approach for potentially unstructured data with ClickHouse's data warehouse approach for structured data. @dmpetrov emphasized iceberg's optimized query capabilities stemming from its data structure and storage in object storage like S3, contrasting it with less efficient methods like querying files directly from S3. @shcheklein further explained that iceberg allows the use of various databases on top of the same data, making it more flexible.

Iceberg Investigation and Next Steps:

Investigation into iceberg revealed challenges in integrating it into the data chain. It suggested import support as a potential initial step, acknowledging that this would be similar to existing Parquet support and would not leverage iceberg's native capabilities. We also discussed the various iceberg catalogs and their implications for integration.
The team concluded that deeper research into the capabilities and limitations of iceberg is needed before proceeding with full integration.

@skshetry
Copy link
Member

skshetry commented Mar 5, 2025

Apache Iceberg is an open table format designed for huge analytic datasets. Iceberg enables various compute engines (like trino, spark ) to work with large scale datasets using a standardized table format.

Briefly, let me discuss the architecture of Iceberg. Iceberg mainly has 3 layers:

Apache Iceberg - from dremio

  1. Catalog: Iceberg uses a catalog (e.g., Hive Metastore, AWS Glue, SQL catalog, REST catalog) to track table locations and metadata changes to ensure consistency. There are multiple flavours of catalog (as said in eg.).
  2. Metadata Storage: Iceberg maintains table metadata (e.g., snapshots, manifests) in object storage alongside the actual data files. This metadata enables schema evolution, time travel, and efficient queries.
  3. Data File Storage: Iceberg stores data as columnar Parquet (or ORC/Avro) files in object storage. (PyIceberg - the official Apache Iceberg project for Python supports multiple storages: s3, gs, hdfs, and abfs ).

Additional References:

  1. Introduction from the original creators of iceberg

Compute Engines

More than catalog, we care more about compute engines. Let's discuss them briefly.

image

Spark

Spark is the most feature-rich compute engine for Iceberg operations. Supports read/write.

Clickhouse:

Clickhouse has two kinds of supports:

  1. Iceberg Table Engline
  2. and via a function Iceberg Table Function

Both of these are read-only. Clickhouse recommends to use the latter.
There is an open FR for write support: ClickHouse/ClickHouse#49973.

Duckdb:

Duckdb has read-only support for reading/querying from iceberg tables, including from s3. No write support unfortunately, however there's an open issue: duckdb/duckdb-iceberg#37).

See https://duckdb.org/docs/stable/extensions/iceberg.html, and https://duckdb.org/docs/stable/guides/network_cloud_storage/s3_iceberg_import.html#loading-iceberg-tables-from-s3.

The duckdb integration is an experimental project, more of a proof-of-concept. It only started supporting REST catalog last month and is not yet released. There are some things that it does not support: Hidden Partitioning and predicate pushdown for table scanning.

Also no write support. :(

Daft

Daft provides both reading and writing support for iceberg. It also supports push-down filtering.
It uses pyiceberg internally.

Looking at their code, write support does not look trivial. :(

https://github.com/Eventual-Inc/Daft/blob/fb89b2a557b867742591e6bb1327e8695c2f6423/daft/dataframe/dataframe.py#L820

Snowflake

Snowflake has a catalog support as well as query engine support (SQL).

Trino

Trino supports iceberg, and has sql support, including write operations.

https://trino.io/docs/current/connector/iceberg.html#sql-support

PyIceberg

PyIceberg is an official Apache Iceberg project, that provides a Python implementation for accessing Iceberg tables, without the need of a JVM.

https://github.com/apache/iceberg-python

This has read/write support for Iceberg. write support is recent. pyiceberg does not have
all the features that Java implementation has.

See apache/iceberg-python#736.

Also, has support for expressions for filtering. Supports SQL, Rest, Hive, Glue and DyanamoDB catalogs.

(TBD: need to add more information)

Polars

Polars supports reading from an Apache Iceberg table, and has push-down filtering.

It also uses pyiceberg.

Ibis

Ibis has not implemented iceberg support yet. See:

Dbt

Kedro

Dask

Delta Lake/DataFusion

(haven't checked yet!)

Notes about my PR #937

My PR uses to_arrow_batch_reader to read in batches to reduce memory usage. (apache/iceberg-python#786)

Although that's an import and a blocking call, so we don't have any way to do computes before it gets saved as a datachain dataset. And we don't have a way to apply filters beforehand, so unlike polars/daft, we'll save entire table.

So far, the only way to directly read from iceberg is duckdb locally, and use an engine on Studio.

There is also a way to query pyarrow.Table with duckdb, but that requires keeping an in-memory results.

Note that pyiceberg also has support for expressions, but that has to be applied by users through some ways (or, converted by us from SQL expressions to pyiceberg expressions which may not be feasible or is difficult due to our architecture).

If pyiceberg implemented Expose PyIceberg table as PyArrow Dataset, it'd be easy to mix duckdb and pyiceberg together.

(I'll add more information as I do research, so far there is not a satisfying answer).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants