Add coerce_types flag to parquet ArrowWriter #1938

Open
tustvold opened this issue Jun 24, 2022 · 7 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted

Comments

@tustvold
Contributor

tustvold commented Jun 24, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

As discussed in #1666, not all types can be represented within a parquet schema.

Describe the solution you'd like

The consensus appears to be to:

  • By default faithfully round-trip the source data, performing no potentially lossy type conversion
  • Add a coerce_types flag that will use the arrow cast kernels to coerce incompatible types prior to writing them

In particular

Date64

If not coerce_types, write as Int64 and embed the logical type in the arrow schema only. Otherwise, cast to Date32.
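
To make the Date64 case concrete, here is a minimal std-only sketch of what the coercion does to each value: Date64 stores milliseconds since the UNIX epoch, while Date32 stores days since the epoch. The helper names `date64_to_date32` and `is_lossless` are hypothetical and only for illustration; the real change would go through the arrow cast kernels, and their exact rounding of pre-1970 values is not verified here (floor division is an assumption).

```rust
use std::convert::TryFrom;

/// Milliseconds in one day. Date64 counts milliseconds since the UNIX epoch.
const MILLIS_PER_DAY: i64 = 86_400_000;

/// Hypothetical helper: convert a Date64 value (ms since epoch) into a
/// Date32 value (days since epoch). Returns None if the day count does not
/// fit in an i32.
fn date64_to_date32(millis: i64) -> Option<i32> {
    // Assumption: floor division, so pre-1970 values map to the earlier day.
    i32::try_from(millis.div_euclid(MILLIS_PER_DAY)).ok()
}

/// Hypothetical helper: true when the value sits exactly on a day boundary,
/// meaning the coercion to Date32 discards nothing.
fn is_lossless(millis: i64) -> bool {
    millis.rem_euclid(MILLIS_PER_DAY) == 0
}
```

Any Date64 value that is not a whole multiple of a day loses its time-of-day component under this conversion, which is why the coercion is opt-in rather than the default.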

Timestamp

If not coerce_types, write as is, setting LogicalType / ConvertedType only where appropriate.

If coerce_types, cast to a UTC timestamp with the closest supported time unit, likely needing #1936.
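
As a rough illustration of "closest supported time unit": the Parquet TIMESTAMP logical type has millisecond, microsecond and nanosecond units but no second unit, so only second-resolution values need rescaling. The sketch below covers the unit part only and leaves timezone handling out entirely. The `TimeUnit` enum here is a local stand-in, and `closest_parquet_unit` and `coerce_timestamp` are hypothetical names, not the arrow or parquet API.

```rust
/// Local stand-in for the four Arrow timestamp units (not the arrow crate's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Hypothetical helper: pick the closest unit Parquet can express, together
/// with the factor needed to rescale values into it.
fn closest_parquet_unit(unit: TimeUnit) -> (TimeUnit, i64) {
    match unit {
        // Parquet has no second-resolution timestamp, so widen to milliseconds.
        TimeUnit::Second => (TimeUnit::Millisecond, 1_000),
        // The remaining units map one-to-one.
        other => (other, 1),
    }
}

/// Hypothetical helper: rescale one timestamp value, returning None on overflow.
fn coerce_timestamp(value: i64, unit: TimeUnit) -> Option<(i64, TimeUnit)> {
    let (target, factor) = closest_parquet_unit(unit);
    value.checked_mul(factor).map(|v| (v, target))
}
```

Widening seconds to milliseconds keeps every value exact but can overflow at the extremes of the i64 range, hence the checked multiplication.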

Interval

If not coerce_types, write as FixedSizeBinaryArray matching the arrow representation and store logical type in arrow schema.

If coerce_types, convert to the relevant parquet representation.
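
For reference, the Parquet INTERVAL representation is a 12-byte fixed-length value holding three little-endian unsigned 32-bit integers: months, days and milliseconds. The following is a sketch of what coercing an Arrow month/day/nanosecond interval into that layout could look like. The function name `month_day_nano_to_parquet` is hypothetical, and rejecting negative components instead of wrapping them is an assumption made here, not a decision from this issue.

```rust
use std::convert::TryFrom;

const NANOS_PER_MILLI: i64 = 1_000_000;

/// Hypothetical helper: encode (months, days, nanos) as a Parquet INTERVAL.
/// Returns None for negative or out-of-range components, since the Parquet
/// fields are unsigned. Sub-millisecond precision is truncated (lossy).
fn month_day_nano_to_parquet(months: i32, days: i32, nanos: i64) -> Option<[u8; 12]> {
    let months = u32::try_from(months).ok()?;
    let days = u32::try_from(days).ok()?;
    // Check the sign first: integer division would turn small negatives into 0.
    if nanos < 0 {
        return None;
    }
    let millis = u32::try_from(nanos / NANOS_PER_MILLI).ok()?;

    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&months.to_le_bytes());
    out[4..8].copy_from_slice(&days.to_le_bytes());
    out[8..12].copy_from_slice(&millis.to_le_bytes());
    Some(out)
}
```

The truncation of nanoseconds and the unsigned fields are the two places where this conversion can lose information, which is what makes it a coercion rather than a faithful round trip.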

Describe alternatives you've considered

See #1666

@dsgibbons
Contributor

take

@alamb
Contributor

alamb commented May 5, 2025

I think this issue was done in #6840

If that is not correct, please reopen / let me know what else needs to be done

@alamb alamb closed this as completed May 5, 2025
@dsgibbons
Contributor

@alamb my original PR only handled Date64. I'm not sure if Interval and Timestamp are still outstanding.

@alamb alamb reopened this May 9, 2025
@alamb
Contributor

alamb commented May 9, 2025

Reopening per @dsgibbons's comments

@alamb
Contributor

alamb commented May 9, 2025

@dsgibbons would you be willing to make a PR to complete Interval and Timestamp so we can close this issue?

@dsgibbons
Contributor

dsgibbons commented May 10, 2025

@alamb yes, but will probably take a while (1-2 months). If someone else wants to finish this off in the meantime then go ahead. I'll comment once I start working on this.

@alamb
Contributor

alamb commented May 10, 2025

Thank you @dsgibbons !


3 participants