Skip to content

Remove dict_id from schema and make it an IPC concern only #7467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

brancz
Copy link
Contributor

@brancz brancz commented May 5, 2025

Which issue does this PR close?

Closes #6356 #5981 #1206

Rationale for this change

See the above issues. And this is a follow up to #6711 and #6873.

What changes are included in this PR?

Removal of dict_id from the canonical Schema's Field and the resulting repercussions of that.

Are there any user-facing changes?

Yes, the previously declared deprecations and a few more resulting changes that we didn't foresee previously.

The ones we did not previously foresee involve those places where we now need to pipe through an IPC schema where we previously only had the regular Schema (which used to be enough since that carried the dict_id):

  • flight_data_to_arrow_batch (arrow-flight)
  • arrow_data_from_flight_data (arrow-flight; sql)
  • read_record_batch (arrow-ipc)
  • read_dictionary (arrow-ipc)
  • FileDecoder::new (arrow-ipc)

I think these don't have stability guarantees but for the sake of completeness:

  • field_to_json (arrow-integration-test)
  • record_batch_from_json (arrow-integration-test)
  • array_from_json (arrow-integration-test)
  • dictionary_array_from_json (arrow-integration-test)

@tustvold @alamb @thinkharderdev

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels May 5, 2025
@brancz brancz force-pushed the ipc-dict-id-2 branch 5 times, most recently from ac99e01 to f690fc8 Compare May 5, 2025 11:58
@tustvold
Copy link
Contributor

tustvold commented May 5, 2025

It seems unfortunate to have methods requiring multiple different types of schema passed to them.

I'm afraid I don't have the capacity at the moment to disentangle what this PR is actually changing, the flight changes in particular I think will lead to a breaking change that hasn't been communicated via a prior deprecation.

I'm going to have to leave this one to other maintainers to look into, I no longer have the necessary context to follow what the IPC APIs are doing anymore.

@brancz
Copy link
Contributor Author

brancz commented May 5, 2025

In all cases we have the option to remove the “regular” schema, but it would have to mean that we construct it wherever we need the arrow-schema schema, which we can do but it’s a schema ref almost everywhere so it seemed like that was intentional for performance or lifetime reasons. I can definitely make that change if that’s the path we want to try there’s a danger though that it is perf sensitive and then we need to end up breaking the api twice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Relax dict_id equality in field merging
2 participants