[feat] Ability to read table using version-hint.txt #763

Closed
kevinjqliu opened this issue May 23, 2024 · 10 comments · Fixed by #1887

Comments

@kevinjqliu
Contributor

Feature Request / Improvement

Although it is not part of the official spec, version-hint.txt can be useful for reading an Iceberg table without a catalog.

This is useful when treating an Iceberg table as a collection of files (metadata and data files) in a "directory" (an S3 path). It can also be useful when ingesting Iceberg tables without a catalog. An Iceberg table can thus be "packaged" as a directory.

Example Use Case

  • An Iceberg table is created by a service (with a catalog) at the path s3://blah/warehouse/foo/bar/
  • Another service reads the Iceberg table given only the path s3://blah/warehouse/foo/bar/

When reading, version-hint.txt identifies the current metadata JSON file, which is normally obtained by querying the catalog.
When writing, version-hint.txt is updated as part of the atomic commit to the catalog.

Additionally, StaticTable could use version-hint.txt to create an Iceberg table from a path.
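To make the read path concrete, here is a minimal sketch of how a reader might resolve the current metadata file from the version hint. It is only an illustration under stated assumptions, not PyIceberg's implementation: the helper name `resolve_metadata_location`, the `metadata/` subdirectory, the default hint file name `version-hint.text`, and the `v<N>.metadata.json` naming follow the Hadoop-style table layout and are not guaranteed by the Iceberg spec. It uses local paths for simplicity; an S3 path would need an object-store client instead of `pathlib`.

```python
from pathlib import Path


def resolve_metadata_location(table_root: str, hint_name: str = "version-hint.text") -> Path:
    """Return the metadata JSON path that the version hint points to.

    Assumed layout (hypothetical, not part of the Iceberg spec):
      <table_root>/metadata/<hint_name>       contains e.g. "3"
      <table_root>/metadata/v3.metadata.json  is the current table metadata
    """
    metadata_dir = Path(table_root) / "metadata"
    hint = (metadata_dir / hint_name).read_text().strip()

    if hint.endswith(".metadata.json"):
        # The hint already names the metadata file.
        file_name = hint
    elif hint.isdigit():
        # Hadoop-style hint: a bare version number.
        file_name = f"v{hint}.metadata.json"
    else:
        # Otherwise treat the hint as the file name without its suffix.
        file_name = f"{hint}.metadata.json"

    location = metadata_dir / file_name
    if not location.is_file():
        raise FileNotFoundError(f"version hint points to a missing file: {location}")
    return location
```

The resolved location could then be handed to `StaticTable.from_metadata(...)`, which already accepts a metadata file location. Note that nothing here guards against a writer updating the hint between the two reads; the hint is exactly that, a hint.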

Relevant Issues:

cc @djouallah

@kevinjqliu
Contributor Author

We discussed this issue in the monthly sync and agreed that this is a useful feature. We'll first implement the read side in PyIceberg. The write side is complicated because it has to support multiple concurrent writers and atomic updates in blob stores such as S3.

I will raise this issue with the Java Iceberg implementation and see whether there is also support for including this as part of the Iceberg spec.

@lamb-russell

DuckDB appears to depend on the version-hint.text file when scanning Iceberg tables.

@kevinjqliu
Contributor Author

@lamb-russell duckdb_iceberg can read the "metadata json file" directly.

See steven-luabase/duckdb-iceberg-demo#1 (comment)

@kevinjqliu
Contributor Author

It would be great if duckdb_iceberg could support reading directly from the catalog.

@djouallah

djouallah commented Jul 24, 2024

It is quite ironic: it seems the only Iceberg vendor that generates hint.text is Snowflake!!! Go figure.

Edit: no more :( Snowflake stopped producing hint.text :(

@Fokko
Contributor

Fokko commented Jul 24, 2024

I think it is fine to add support for reading the version-hint.txt, but we should not produce it.

@kevinjqliu changed the title from "[feat] Ability to read/write table using version-hint.txt" to "[feat] Ability to read table using version-hint.txt" on Oct 24, 2024

@srilman
Contributor

srilman commented Mar 11, 2025

@Fokko is this issue still open to work on? For context, we had to build a PyIceberg-based Hadoop Catalog with a subset of features for backwards compatibility when moving Bodo from Iceberg-Java to PyIceberg. See https://github.com/bodo-ai/Bodo/blob/main/bodo/io/iceberg/catalog/dir.py. It would be nice to move at least the read parts to the main repo.

@djouallah

FWIW, I just gave up and I am using DuckDB to read Iceberg tables; PyIceberg is clearly not interested in this scenario.

@Fokko
Contributor

Fokko commented Mar 12, 2025

@srilman Yes, I still think it would be valuable

@arnaudbriche
Contributor

I submitted a small PR to allow using version-hint.text: #1887

