After checkpoint statistics are not available in add_actions_table if stats written as struct #3375

Open
alexwilcoxson-rel opened this issue Apr 10, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@alexwilcoxson-rel
Contributor

Environment

Delta-rs version:
0.25.5

Binding:
Rust, Python

Environment:

  • Cloud provider: Azure
  • OS: macOS, Linux
  • Other:

Bug

What happened:
Our tables write checkpoints with statistics stored as structs: delta.checkpoint.writeStatsAsStruct = true and delta.checkpoint.writeStatsAsJson = false.

After a checkpoint, calling add_actions_table returns no statistics:

  1. It only checks for the existence of stats on Add actions and does not consider stats_parsed: https://github.com/delta-io/delta-rs/blob/python-v0.25.5/crates/core/src/table/state_arrow.rs#L98
  2. This is probably because the files iterator used internally relies on read_adds, which does not set stats_parsed.

What you expected to happen:
I expect add_actions_table to return statistics regardless of which checkpoint is the latest and how the stats were written to it.
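To illustrate the expected behavior, here is a minimal sketch in plain Python. It is not the delta-rs implementation: add actions are modeled as dictionaries, and resolve_stats is a hypothetical helper. The only things taken from the Delta protocol are the stats (JSON string) and stats_parsed (struct) fields and the numRecords / minValues / maxValues keys. The point is that a reader should fall back to stats_parsed when the JSON stats are absent.

```python
import json
from typing import Any, Optional


def resolve_stats(add: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return file statistics for an add action, whichever way they were stored.

    Hypothetical helper: `add` is a plain dict standing in for an Add action.
    """
    raw = add.get("stats")
    if raw is not None:
        # Stats written as JSON (delta.checkpoint.writeStatsAsJson = true).
        return json.loads(raw)
    parsed = add.get("stats_parsed")
    if parsed is not None:
        # Stats written as a struct (delta.checkpoint.writeStatsAsStruct = true).
        return dict(parsed)
    return None


# Illustrative add actions, one per representation (made-up values).
json_add = {
    "path": "part-0.parquet",
    "stats": '{"numRecords": 3, "minValues": {"id": 1}, "maxValues": {"id": 3}}',
}
struct_add = {
    "path": "part-1.parquet",
    "stats": None,
    "stats_parsed": {"numRecords": 2, "minValues": {"id": 4}, "maxValues": {"id": 5}},
}

for add in (json_add, struct_add):
    print(add["path"], resolve_stats(add))
```

With a check that looks only at stats, the second file would report no statistics, which matches what is observed after the checkpoint.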

How to reproduce it:

  1. Configure table with delta.checkpoint.writeStatsAsStruct = true and delta.checkpoint.writeStatsAsJson = false
  2. Write data
  3. Checkpoint
  4. Call add_actions_table
  5. Observe that no stats are present

More details:
The log_data method is probably usable here for add_actions_table, since it already has the data in Arrow format and it hydrates stats regardless of how (or whether) they are represented in checkpoints.

It would just need a method on FileStatsAccessor to build a record batch out of its internal columns.

As a workaround, I can probably enable JSON stats in addition to struct stats in checkpoints, at little overhead.

Our use case: we make the add_actions_table output queryable with DataFusion to provide a SQL function for exploring Delta table stats.

alexwilcoxson-rel added the bug label Apr 10, 2025

@ion-elgreco
Collaborator

@roeap does the kernel-rs log replay handle this?

@roeap
Collaborator

roeap commented Apr 11, 2025

@ion-elgreco, Currently working on that - will make sure we cover this with tests ...

@ion-elgreco
Collaborator

@roeap great! :)

@alexwilcoxson-rel
Contributor Author

@roeap @ion-elgreco sounds great, anywhere I can follow that work?

@roeap
Collaborator

roeap commented Apr 11, 2025

@alexwilcoxson-rel - Yes, there is a tracking issue for kernel work and one bigger PR I am currently working on to get the ball rolling.

I should be pushing an update later today which hopefully simplifies things quite a bit.

In case you have any feedback on the implementation - always welcome :) !
