Support dynamic overwrite #1287
Comments
Good one @kevinjqliu. There is also PR #931, which is waiting on me (sorry for that)!
Ah, the PR does contain quite a bit of similar functionality. It seems that the PR does a delete+append. If I understand correctly, this could lead to reading incomplete data: if a read references the delete snapshot, the data for the overwritten partition(s) is missing.
Thanks @Fokko, I didn't see that one. I'll close this issue when that PR is merged.
@koenvo I think the delete+append are done in the context of a transaction. When the transaction is committed, both are either registered together at the same time or rejected at the same time.
I also believe those two snapshots are added in a single transaction. What I mean is that it’s possible to time-travel to the delete snapshot. In that case you are looking at data where the delete is already applied but the append is not.
Yes, that's right. This will create 2 snapshots, and if you time travel to the first one, you will only see the table with the data deleted.
Ah, good question. In our normal process the Iceberg tables are only queried using our own application. The application will always (for now at least) use the latest snapshot. That works fine when those two snapshots are committed in a single transaction. But we also expose the tables in AWS Athena for ad-hoc queries. This makes it possible for people to choose any snapshot, including the delete one.
I see, thanks for the explanation. When writing in fast append mode (DELETE+APPEND), it's possible to accidentally time travel to the DELETE snapshot and see the table with the data deleted but the replacement data not yet appended.
It would be nice to support multiple partition overwrites in a single transaction, so that they produce a single snapshot.
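The concern raised in this thread can be illustrated with a toy model of a snapshot log. This is only a sketch: `ToyTable` and its methods are hypothetical names invented for illustration and are not part of PyIceberg. It assumes each snapshot simply stores the full set of visible rows.

```python
from dataclasses import dataclass, field


# Toy model of a snapshot log. This is NOT the PyIceberg API; every name here
# is hypothetical and exists only to illustrate the discussion.
@dataclass
class ToyTable:
    # Each snapshot holds the (partition, value) rows visible at that point.
    snapshots: list = field(default_factory=lambda: [[]])

    def current(self):
        return self.snapshots[-1]

    def commit(self, rows):
        self.snapshots.append(list(rows))

    def delete_then_append(self, partition, new_rows):
        # Produces two snapshots: one after the delete, one after the append.
        kept = [r for r in self.current() if r[0] != partition]
        self.commit(kept)
        self.commit(kept + new_rows)

    def overwrite_single_snapshot(self, partition, new_rows):
        # Produces one snapshot: the delete is never visible on its own.
        kept = [r for r in self.current() if r[0] != partition]
        self.commit(kept + new_rows)


two_step = ToyTable()
two_step.commit([("2024-01", 1), ("2024-02", 2)])
two_step.delete_then_append("2024-01", [("2024-01", 10)])

one_step = ToyTable()
one_step.commit([("2024-01", 1), ("2024-02", 2)])
one_step.overwrite_single_snapshot("2024-01", [("2024-01", 10)])

# "Time travel" to the intermediate delete snapshot of the two-step variant.
delete_snapshot = two_step.snapshots[-2]
```

In this model, `delete_snapshot` has no rows for partition `2024-01`, which is the situation an ad-hoc query pinned to that snapshot would observe. The single-snapshot variant ends in the same final state but never exposes the intermediate one.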
Feature Request / Improvement
Currently, overwrite consists of a delete + append operation:
iceberg-python/pyiceberg/table/__init__.py, lines 462 to 471 in e771190
As an optimization, we can support dynamic overwrite for when an entire partition is replaced.
Here's an example from @koenvo:
https://gist.github.com/koenvo/e23bfab32c7e7810eb52f82c6304fc22
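Separately from the gist above, the intended semantics of a dynamic overwrite can be sketched in plain Python. This is an illustrative assumption about the behavior (every partition present in the incoming data is replaced as a whole, all other partitions are left untouched), not the implementation in the gist or in PyIceberg. The function name and row layout are hypothetical.

```python
def dynamic_overwrite(existing, incoming, partition_key):
    """Replace every partition that appears in `incoming`; keep all others.

    Hypothetical helper for illustration only; rows are plain dicts.
    """
    # Partitions touched by the incoming data are replaced as a whole.
    touched = {row[partition_key] for row in incoming}
    kept = [row for row in existing if row[partition_key] not in touched]
    return kept + list(incoming)


existing = [
    {"day": "2024-01-01", "value": 1},
    {"day": "2024-01-01", "value": 2},
    {"day": "2024-01-02", "value": 3},
]
incoming = [{"day": "2024-01-01", "value": 99}]

result = dynamic_overwrite(existing, incoming, "day")
```

Here both existing rows for `2024-01-01` are dropped in favor of the single incoming row, while the `2024-01-02` partition is unchanged. The caller does not have to spell out a delete filter, since the set of partitions to replace is derived from the incoming data.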