
feat: Support Arrow type Dictionary(_, FixedSizeBinary(_)) when writing Parquet #7446


Status: Open. Wants to merge 9 commits into base branch main.
Conversation


@albertlockett commented Apr 27, 2025

Which issue does this PR close?

Rationale for this change

Support writing parquet from arrow when the Dictionary type is used and the values in the dictionary are FixedSizeBinary.

What changes are included in this PR?

The type is now supported. We treat the type as a byte array, similar to how we handle the Arrow type Dictionary(_, Utf8).

Are there any user-facing changes?

If the user tries to write a Parquet file from their Arrow record batch, and they use the type Dictionary(_, FixedSizeBinary(_)), the write will no longer fail with the error message:

called `Result::unwrap()` on an `Err` value: NYI("Attempting to write an Arrow type that is not yet implemented")

@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 27, 2025
@albertlockett albertlockett changed the title feat: Support Arrow type Dictionary(_, FixedSizeList(_)) when writing Parquet feat: Support Arrow type Dictionary(_, FixedSizeBinary(_)) when writing Parquet Apr 27, 2025
@albertlockett (Author)

Hey @alamb I fixed the linter failures from this run you triggered:
https://github.com/apache/arrow-rs/actions/runs/14695421193

Would you mind triggering the workflow again?

@westonpace (Member) left a comment


Just reviewing the changes (no knowledge of the context) this looks good to me. If I understand correctly you are converting from FSB into Binary for the write path and then casting on the read path?

Are there other Parquet implementations that support this type? Is there maybe a test file we could add to an integration test somewhere (e.g. a file of Dictionary(_, FixedSizeBinary(_)) written by parquet-cpp)?

I'm not super familiar with the expectations for tests in parquet. I suspect @alamb or @tustvold might know more.

Comment on lines 1921 to 1924
DataType::Dictionary(
    Box::new(DataType::UInt8),
    Box::new(DataType::FixedSizeBinary(4)),
),
@westonpace (Member):

Just to be thorough can we iterate through the various key types to ensure we got the match statement in byte_array_dictionary correct?

@albertlockett (Author):

Sure, sounds good. Thanks @westonpace, I'll make this change.

@albertlockett (Author):

@westonpace I've made this change.

@alamb (Contributor) commented May 2, 2025

I think the idea of this function is to allow round tripping of Dictionary arrow types via parquet.

I am not sure we have documented the behavior of metadata anywhere -- we probably should

Since the Arrow type system and the Parquet type system are different, arrow-rs adds metadata to the Parquet file so the original Arrow types can be recovered when the data is read back. The metadata helps choose which Arrow type to use when several could map to the same Parquet type (e.g. Parquet Binary could become either BinaryView or Binary).

@albertlockett (Author)

> I think the idea of this function is to allow round tripping of Dictionary arrow types via parquet.
>
> I am not sure we have documented the behavior of metadata anywhere -- we probably should
>
> Since the Arrow type system and the Parquet type system are different, arrow-rs adds metadata to the Parquet file so the original Arrow types can be recovered when the data is read back. The metadata helps choose which Arrow type to use when several could map to the same Parquet type (e.g. Parquet Binary could become either BinaryView or Binary).

@alamb yes, that's the idea with this change. Thanks for the explainer on the metadata

@alamb (Contributor) commented May 7, 2025

> @alamb yes, that's the idea with this change. Thanks for the explainer on the metadata

No problem -- I also made a PR to update the docs to explain it better.

@alamb (Contributor) commented May 7, 2025

Thank you @albertlockett and @westonpace

alamb previously approved these changes May 7, 2025

@alamb (Contributor) left a comment


@albertlockett can you please add one "end to end" roundtrip test, like this:

#[test]
fn test_float16_roundtrip() -> Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("float16", ArrowDataType::Float16, false),
        Field::new("float16-nullable", ArrowDataType::Float16, true),
    ]));
    let mut buf = Vec::with_capacity(1024);
    let mut writer = ArrowWriter::try_new(&mut buf, schema.clone(), None)?;
    let original = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Float16Array::from_iter_values([
                f16::EPSILON,
                f16::MIN,
                f16::MAX,
                f16::NAN,
                f16::INFINITY,
                f16::NEG_INFINITY,
                f16::ONE,
                f16::NEG_ONE,
                f16::ZERO,
                f16::NEG_ZERO,
                f16::E,
                f16::PI,
                f16::FRAC_1_PI,
            ])),
            Arc::new(Float16Array::from(vec![
                None,
                None,
                None,
                Some(f16::NAN),
                Some(f16::INFINITY),
                Some(f16::NEG_INFINITY),
                None,
                None,
                None,
                None,
                None,
                None,
                Some(f16::FRAC_1_PI),
            ])),
        ],
    )?;
    writer.write(&original)?;
    writer.close()?;
    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buf), 1024)?;
    let ret = reader.next().unwrap()?;
    assert_eq!(ret, original);
    // Ensure can be downcast to the correct type
    ret.column(0).as_primitive::<Float16Type>();
    ret.column(1).as_primitive::<Float16Type>();
    Ok(())
}

That is, a test that writes and reads data through an actual Parquet file.

@alamb dismissed their stale review May 7, 2025 11:01

Clicked wrong button, needs a round trip test

Successfully merging this pull request may close these issues.

Support Arrow type Dictionary with value FixedSizeBinary in Parquet