Support dynamic overwrite #1287

Closed
kevinjqliu opened this issue Nov 4, 2024 · 8 comments · Fixed by #931
Comments

@kevinjqliu
Contributor

kevinjqliu commented Nov 4, 2024

Feature Request / Improvement

Currently overwrite consists of a delete + append operation.

    self.delete(delete_filter=overwrite_filter, snapshot_properties=snapshot_properties)

    with self.update_snapshot(snapshot_properties=snapshot_properties).fast_append() as update_snapshot:
        # skip writing data files if the dataframe is empty
        if df.shape[0] > 0:
            data_files = _dataframe_to_data_files(
                table_metadata=self.table_metadata, write_uuid=update_snapshot.commit_uuid, df=df, io=self._table.io
            )
            for data_file in data_files:
                update_snapshot.append_data_file(data_file)

As an optimization, we can support dynamic overwrite for when an entire partition is replaced.

Here's an example from @koenvo:
https://gist.github.com/koenvo/e23bfab32c7e7810eb52f82c6304fc22
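For readers unfamiliar with the term, here is a minimal sketch of what "dynamic overwrite" means at the row level. It is a toy model in plain Python, not PyIceberg's implementation and not the code from the gist; the function name `dynamic_partition_overwrite_rows` and the `day` partition column are made up for illustration.

```python
def dynamic_partition_overwrite_rows(existing_rows, new_rows, partition_key):
    """Toy model: replace every partition that appears in new_rows, keep the rest.

    Rows are plain dicts; partition_key names the (hypothetical) partition column.
    """
    # The partitions to replace are derived from the incoming data itself,
    # which is why no explicit overwrite filter is needed.
    touched = {row[partition_key] for row in new_rows}
    kept = [row for row in existing_rows if row[partition_key] not in touched]
    return kept + list(new_rows)


existing = [
    {"day": "2024-11-01", "value": 1},
    {"day": "2024-11-02", "value": 2},
    {"day": "2024-11-02", "value": 3},
]
incoming = [{"day": "2024-11-02", "value": 20}]

result = dynamic_partition_overwrite_rows(existing, incoming, "day")
```

Only the `2024-11-02` partition is replaced here; the `2024-11-01` rows are left untouched.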

@Fokko
Contributor

Fokko commented Nov 4, 2024

Good one @kevinjqliu. There is also a PR #931 that's waiting for me (sorry for that)!

@koenvo
Contributor

koenvo commented Nov 4, 2024

Ah, the PR does contain quite a bit of similar functionality.

It seems that the PR does a delete+append. If I understand correctly, this could lead to reading incomplete data: in case the delete snapshot is referenced during the read, the data for the overwritten partition(s) is missing.

@kevinjqliu
Contributor Author

Thanks @Fokko, I didn't see that one. I'll close this issue when that PR is merged.

> It seems that the PR does a delete+append. If I understand correctly, this could lead to reading incomplete data: in case the delete snapshot is referenced during the read, the data for the overwritten partition(s) is missing.

@koenvo I think the delete and the append are done within a single transaction. When the transaction is committed, both are either applied together or rejected together.

@koenvo
Contributor

koenvo commented Nov 6, 2024

I also believe those two snapshots are added in a single transaction. What I mean is that it’s possible to time-travel to the delete snapshot. In that case you are looking at data where the delete is already applied but the append is not.

@kevinjqliu
Contributor Author

Yes, that's right. This will create 2 snapshots, and if you time travel to the first one, you will only see the table with the data deleted.
What is your use case for time traveling to only one of the snapshots?
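The point under discussion can be sketched with a toy snapshot log. This is an illustrative model only: `ToyTable`, `commit`, and `scan` are hypothetical names and do not correspond to the PyIceberg API. It assumes, as stated above, that the delete and the append are committed together but recorded as two separate snapshots.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    # Toy model: each snapshot stores the full list of rows visible at that point.
    snapshots: list = field(default_factory=list)

    def commit(self, staged):
        # All staged snapshots become visible in one step (models the transaction).
        self.snapshots.extend(staged)

    def scan(self, snapshot_id=None):
        # Default to the latest snapshot; an explicit id models time travel.
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]


def overwrite_as_delete_then_append(table, partition, new_rows):
    current = table.scan()
    after_delete = [row for row in current if row["day"] != partition]
    after_append = after_delete + new_rows
    # One commit, but two snapshots end up in the history.
    table.commit([after_delete, after_append])


table = ToyTable()
table.commit([[{"day": "d1", "v": 1}, {"day": "d2", "v": 2}]])
overwrite_as_delete_then_append(table, "d2", [{"day": "d2", "v": 20}])

latest = table.scan()                       # complete: d1 plus the new d2 rows
delete_snapshot = table.scan(snapshot_id=1)  # time travel: partition d2 is missing
```

A reader that always uses the latest snapshot never observes the intermediate state, while a reader that picks snapshot 1 explicitly sees the table without partition `d2`.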

@koenvo
Contributor

koenvo commented Nov 6, 2024

Ah, good question. In our normal process, the Iceberg tables are only queried by our own application. The application will always (for now at least) use the latest snapshot. That works fine when those two snapshots are committed in a single transaction. But we also expose the tables in AWS Athena for ad-hoc queries. This makes it possible for people to choose any snapshot, including the delete one.

@kevinjqliu
Contributor Author

I see, thanks for the explanation. When overwriting as DELETE + fast append, it's possible to accidentally time travel to the DELETE snapshot and see the table with the old data removed but the new data not yet appended.
Let's get the dynamic overwrite PR in; I think that should be written as one snapshot.

@sundaresanr

It would be nice to support multiple partition overwrites in a single transaction, so that they produce a single snapshot.
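As a rough sketch of the behavior requested here, the toy model below replaces the data files of several partitions and records the result as exactly one new snapshot. The function `overwrite_partitions`, the `day=N` partition labels, and the file names are all hypothetical; this is not how PyIceberg represents snapshots internally.

```python
def overwrite_partitions(history, new_files):
    """Toy model: return a new history with ONE extra snapshot.

    history:   list of snapshots, each a dict mapping partition -> list of file names
    new_files: dict mapping partition -> replacement list of file names
    """
    current = history[-1] if history else {}
    # Partitions present in new_files are swapped out; all others are carried over.
    replaced = {**current, **new_files}
    return history + [replaced]


history = [
    {"day=1": ["a.parquet"], "day=2": ["b.parquet"], "day=3": ["c.parquet"]},
]
history = overwrite_partitions(
    history,
    {"day=2": ["b2.parquet"], "day=3": ["c2.parquet"]},
)
```

Because both partitions change in the same snapshot, there is no intermediate state to time travel to.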
