-
Notifications
You must be signed in to change notification settings - Fork 918
fix duration conversion error #5626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Could we get a test of this please |
I agree -- a test is important to make sure we don't break this fix in some future refactor |
sorry for the noob question, where should I add a test for it? |
I think we could add a round trip test here perhaps: arrow-rs/parquet/src/arrow/arrow_writer/mod.rs Lines 1996 to 2007 in 91f0b17
|
write duration to parquet get a panic... arrow-rs/parquet/src/arrow/schema/mod.rs Line 445 in a999fb8
|
Yes I think we would need #1938. Technically polars is doing something a little funky here |
or put another way there is no way to write a duration to a parquet file with arrow-rs yet (but polars can do so). So the ideal solution is support writing duration to parquet too (but I suspect there might be work to sort out what the desired behavior is) Perhaps for this PR we can just check in a small parquet file that was written by polars and ensure we can read it back correctly 🤔 |
Closing as this PR has been inactive for a while |
I have the same problem with type pl.Time. When I write a column with pl.Time to a parquet file, it reads the column as int64. |
Thanks @EddyXorb -- any chance you can make a PR with a test? It seems that that is all that is missing to be able to get this fixed |
I would love to help, but at the moment I am very very busy. Once I have time (mid december), I can try to provide a test. |
Thank you! |
Hi has this been fixed? I am running into the same issue with not being able to correctly roundtrip Duration types. |
I think we are waiting on a test |
I think the original author has abandoned their interest in trying to contribute. Given Duration is an officially supported type in arrow(and not an extension type) should the official maintainers just supply the patch? |
I personally don't have time to write it myself unfortunately. I am not sure about the other maintainers. I would be happy to review a new PR with tests, etc |
Ok, what is happening here is as follows: arrow-rs and arrow-cpp (and potentially polars) add a special file metadata field called "ARROW:schema" that records the desired Arrow schema. This is described in more detail here: In order for the arrow-rs parquet reader to read the data as a duration it needs to interpret the contents of that metadata, which is what I think the code in this PR does. It would be really nice if the arrow-rs parquet writer also wrote the correct metadata so duration data in parquet that was written could be read correctly by arrow and parquet So therefore I think we should add a test like "round trip" test that writes a RecordBatch with a Duration to a parquet file and reads it back, verifying that the read data is the same. An example of such a test is here |
The issue is that parquet doesn't have a duration type - see here, and the parquet schema must remain authoritative #1663. To get around this arrow writers embed an encoded version of the arrow schema that can be used as a hint for type inference by compatible readers - see #1666. To support writing types not supported by parquet we added a coerce_types option (see #1938). This will need to be extended to also coerce durations. |
Let's reopen this PR as we are actively working on it now |
@alamb I fix the writer part and add a "round trip" test. Please reopen this pr and take a look. |
For some reason GitHub will not let me reopen this PR. Can you please make a new PR and @ mention me? Thank you for your help! |
Of course. Here is the new pr |
Which issue does this PR close?
Closes #5625
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?