Skip to content

fix duration conversion error #5626

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

Liyixin95
Copy link
Contributor

Which issue does this PR close?

Closes #5625

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 11, 2024
@tustvold
Copy link
Contributor

Could we get a test of this please

@alamb
Copy link
Contributor

alamb commented Apr 12, 2024

I agree -- a test is important to make sure we don't break this fix in some future refactor

@Liyixin95
Copy link
Contributor Author

Could we get a test of this please

sorry for the noob question, where should I add a test for it?

@alamb
Copy link
Contributor

alamb commented Apr 13, 2024

Could we get a test of this please

sorry for the noob question, where should I add a test for it?

I think we could add a round trip test here perhaps:

#[test]
fn date32_single_column() {
required_and_optional::<Date32Array, _>(0..SMALL_SIZE as i32);
}
#[test]
fn date64_single_column() {
// Date64 must be a multiple of 86400000, see ARROW-10925
required_and_optional::<Date64Array, _>(
(0..(SMALL_SIZE as i64 * 86400000)).step_by(86400000),
);
}

@Liyixin95
Copy link
Contributor Author

Could we get a test of this please

sorry for the noob question, where should I add a test for it?

I think we could add a round trip test here perhaps:

#[test]
fn date32_single_column() {
required_and_optional::<Date32Array, _>(0..SMALL_SIZE as i32);
}
#[test]
fn date64_single_column() {
// Date64 must be a multiple of 86400000, see ARROW-10925
required_and_optional::<Date64Array, _>(
(0..(SMALL_SIZE as i64 * 86400000)).step_by(86400000),
);
}

write duration to parquet get a panic...

DataType::Duration(_) => Err(arrow_err!("Converting Duration to parquet not supported",)),

@tustvold
Copy link
Contributor

tustvold commented Apr 15, 2024

write duration to parquet get a panic...

Yes I think we would need #1938. Technically polars is doing something a little funky here

@alamb
Copy link
Contributor

alamb commented Apr 16, 2024

or put another way there is no way to write a duration to a parquet file with arrow-rs yet (but polars can do so).

So the ideal solution is support writing duration to parquet too (but I suspect there might be work to sort out what the desired behavior is)

Perhaps for this PR we can just check in a small parquet file that was written by polars and ensure we can read it back correctly 🤔

@tustvold tustvold marked this pull request as draft May 8, 2024 06:04
@tustvold
Copy link
Contributor

tustvold commented Oct 8, 2024

Closing as this PR has been inactive for a while

@tustvold tustvold closed this Oct 8, 2024
@EddyXorb
Copy link

I have the same problem with type pl.Time. When I write a column with pl.Time to a parquet file, it reads the column as int64.

@alamb
Copy link
Contributor

alamb commented Nov 22, 2024

Thanks @EddyXorb -- any chance you can make a PR with a test? It seems that that is all that is missing to be able to get this fixed

@EddyXorb
Copy link

I would love to help, but at the moment I am very very busy. Once I have time (mid december), I can try to provide a test.

@alamb
Copy link
Contributor

alamb commented Nov 22, 2024

Thank you!

@sirpresto
Copy link

Hi has this been fixed? I am running into the same issue with not being able to correctly roundtrip Duration types.

@alamb
Copy link
Contributor

alamb commented May 2, 2025

Hi has this been fixed? I am running into the same issue with not being able to correctly roundtrip Duration types.

I think we are waiting on a test

@sirpresto
Copy link

I think the original author has abandoned their interest in trying to contribute. Given Duration is an officially supported type in arrow(and not an extension type) should the official maintainers just supply the patch?

@alamb
Copy link
Contributor

alamb commented May 5, 2025

I think the original author has abandoned their interest in trying to contribute. Given Duration is an officially supported type in arrow(and not an extension type) should the official maintainers just supply the patch?

I personally don't have time to write it myself unfortunately. I am not sure about the other maintainers.

I would be happy to review a new PR with tests, etc

@Liyixin95
Copy link
Contributor Author

@alamb I provide a test file here. I will add a test case in this pr after the test file get merged.

@alamb
Copy link
Contributor

alamb commented May 7, 2025

Ok, what is happening here is as follows: arrow-rs and arrow-cpp (and potentially polars) add a special file metadata field called "ARROW:schema" that records the desired Arrow schema. This is described in more detail here:

In order for the arrow-rs parquet reader to read the data as a duration it needs to interpret the contents of that metadata, which is what I think the code in this PR does.

It would be really nice if the arrow-rs parquet writer also wrote the correct metadata so duration data in parquet that was written could be read correctly by arrow and parquet

So therefore I think we should add a test like "round trip" test that writes a RecordBatch with a Duration to a parquet file and reads it back, verifying that the read data is the same. An example of such a test is here

@tustvold
Copy link
Contributor

tustvold commented May 7, 2025

The issue is that parquet doesn't have a duration type - see here, and the parquet schema must remain authoritative #1663.

To get around this arrow writers embed an encoded version of the arrow schema that can be used as a hint for type inference by compatible readers - see #1666.

To support writing types not supported by parquet we added a coerce_types option (see #1938). This will need to be extended to also coerce durations.

@alamb
Copy link
Contributor

alamb commented May 7, 2025

Let's reopen this PR as we are actively working on it now

@Liyixin95
Copy link
Contributor Author

@alamb I fix the writer part and add a "round trip" test. Please reopen this pr and take a look.

@alamb
Copy link
Contributor

alamb commented May 8, 2025

@alamb I fix the writer part and add a "round trip" test. Please reopen this pr and take a look.

For some reason GitHub will not let me reopen this PR. Can you please make a new PR and @ mention me?

Screenshot 2025-05-08 at 6 50 07 AM

Screenshot 2025-05-08 at 6 50 25 AM

Thank you for your help!

@Liyixin95
Copy link
Contributor Author

For some reason GitHub will not let me reopen this PR. Can you please make a new PR and @ mention me?

Of course. Here is the new pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

write duration to parquet but read as int64
5 participants