feat: Support Arrow type Dictionary(_, FixedSizeBinary(_)) when writing Parquet #7446
Conversation
(Title edited: "Dictionary(_, FixedSizeList(_)) when writing Parquet" changed to "Dictionary(_, FixedSizeBinary(_)) when writing Parquet".)
Hey @alamb, I fixed the linter failures from the run you triggered. Would you mind triggering the workflow again?
Just reviewing the changes (no knowledge of the context), this looks good to me. If I understand correctly, you are converting from FSB into Binary for the write path and then casting on the read path?
Are there other Parquet implementations that support this type? Is there maybe a test file we could add to an integration test somewhere (e.g. a file of Dictionary(_, FixedSizeBinary(_)) written by parquet-cpp)?
I'm not super familiar with the expectations for tests in parquet. I suspect @alamb or @tustvold might know more.
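For context, here is a minimal sketch of the two casts being described, using the arrow cast kernel. The array values are illustrative, and this assumes the cast kernel in your arrow-rs version supports Binary to FixedSizeBinary (not necessarily how the PR implements it internally):

use arrow::array::{Array, FixedSizeBinaryArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() {
    // Write path: widen FixedSizeBinary(4) into variable-width Binary.
    let fsb = FixedSizeBinaryArray::try_from_iter(
        vec![vec![1u8, 2, 3, 4], vec![5u8, 6, 7, 8]].into_iter(),
    )
    .unwrap();
    let binary = cast(&fsb, &DataType::Binary).unwrap();

    // Read path: cast back to the fixed-size type recorded in the schema.
    let roundtrip = cast(binary.as_ref(), &DataType::FixedSizeBinary(4)).unwrap();
    assert_eq!(roundtrip.as_ref(), &fsb as &dyn Array);
}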
DataType::Dictionary(
    Box::new(DataType::UInt8),
    Box::new(DataType::FixedSizeBinary(4)),
),
Just to be thorough, can we iterate through the various key types to ensure we got the match statement in byte_array_dictionary correct?
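A sketch of what that iteration could look like (illustrative only; the actual test lives in the PR, and a real test would round-trip data rather than print):

use arrow::datatypes::DataType;

fn main() {
    // Cover every integer key type so each arm of the match statement in
    // byte_array_dictionary is exercised.
    for key in [
        DataType::Int8,
        DataType::Int16,
        DataType::Int32,
        DataType::Int64,
        DataType::UInt8,
        DataType::UInt16,
        DataType::UInt32,
        DataType::UInt64,
    ] {
        let dict_type = DataType::Dictionary(
            Box::new(key),
            Box::new(DataType::FixedSizeBinary(4)),
        );
        // A real test would build a batch of this type, write it with
        // ArrowWriter, and read it back.
        println!("would round-trip {dict_type}");
    }
}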
Sure, sounds good. Thanks @westonpace, I'll make this change.
@westonpace made this change
I think the idea of this function is to allow round-tripping Dictionary arrow types via parquet. I am not sure we have documented the behavior of the metadata anywhere -- we probably should. Since the arrow type system and the parquet type system are different, arrow-rs adds metadata into the parquet file to ensure it can recover the original arrow types: the metadata helps choose which arrow type to use when there are several potential arrow types for a parquet type (e.g. parquet Binary can go to either BinaryView or Binary).
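A small sketch of where that metadata lives. The writer embeds the serialized Arrow schema in the footer key/value metadata under the ARROW:schema key; the column used here is purely illustrative:

use std::sync::Arc;
use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let batch =
        RecordBatch::try_new(schema.clone(), vec![Arc::new(Int32Array::from(vec![1, 2, 3]))])?;

    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // The serialized Arrow schema is stored in the footer key/value metadata;
    // readers consult it to pick among candidate Arrow types on the read path.
    let reader = SerializedFileReader::new(Bytes::from(buf))?;
    let kv = reader.metadata().file_metadata().key_value_metadata();
    assert!(kv.map_or(false, |kvs| kvs.iter().any(|kv| kv.key == "ARROW:schema")));
    Ok(())
}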
@alamb yes, that's the idea with this change. Thanks for the explainer on the metadata.
No problem -- I also made a PR to update the docs to explain it better.
Thank you @albertlockett and @westonpace!
@albertlockett can you please add one "end to end" roundtrip test, like this:
arrow-rs/parquet/src/arrow/arrow_reader/mod.rs
Lines 1261 to 1319 in 11c99a3
#[test]
fn test_float16_roundtrip() -> Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("float16", ArrowDataType::Float16, false),
        Field::new("float16-nullable", ArrowDataType::Float16, true),
    ]));
    let mut buf = Vec::with_capacity(1024);
    let mut writer = ArrowWriter::try_new(&mut buf, schema.clone(), None)?;
    let original = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Float16Array::from_iter_values([
                f16::EPSILON,
                f16::MIN,
                f16::MAX,
                f16::NAN,
                f16::INFINITY,
                f16::NEG_INFINITY,
                f16::ONE,
                f16::NEG_ONE,
                f16::ZERO,
                f16::NEG_ZERO,
                f16::E,
                f16::PI,
                f16::FRAC_1_PI,
            ])),
            Arc::new(Float16Array::from(vec![
                None,
                None,
                None,
                Some(f16::NAN),
                Some(f16::INFINITY),
                Some(f16::NEG_INFINITY),
                None,
                None,
                None,
                None,
                None,
                None,
                Some(f16::FRAC_1_PI),
            ])),
        ],
    )?;
    writer.write(&original)?;
    writer.close()?;
    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buf), 1024)?;
    let ret = reader.next().unwrap()?;
    assert_eq!(ret, original);
    // Ensure can be downcast to the correct type
    ret.column(0).as_primitive::<Float16Type>();
    ret.column(1).as_primitive::<Float16Type>();
    Ok(())
}
That writes/reads data from an actual parquet file?
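A sketch of what such a test might look like for this PR's type, modeled on the float16 test above. The test name, sample values, and comparison strategy are assumptions (the reader may re-encode the dictionary, so this compares unpacked values rather than the exact key/value layout):

use std::sync::Arc;
use arrow::array::{ArrayRef, DictionaryArray, FixedSizeBinaryArray, UInt8Array};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReader;
use parquet::arrow::ArrowWriter;

#[test]
fn test_fixed_size_binary_dictionary_roundtrip() -> Result<(), Box<dyn std::error::Error>> {
    let dict_type = DataType::Dictionary(
        Box::new(DataType::UInt8),
        Box::new(DataType::FixedSizeBinary(4)),
    );
    let schema = Arc::new(Schema::new(vec![Field::new("fsb_dict", dict_type, true)]));

    // Two distinct 4-byte values, referenced (with a null) by the keys.
    let values = FixedSizeBinaryArray::try_from_iter(
        vec![vec![0u8, 1, 2, 3], vec![4u8, 5, 6, 7]].into_iter(),
    )?;
    let keys = UInt8Array::from(vec![Some(0u8), Some(1), None, Some(0)]);
    let dict = DictionaryArray::new(keys, Arc::new(values) as ArrayRef);
    let original = RecordBatch::try_new(schema.clone(), vec![Arc::new(dict) as ArrayRef])?;

    // Write to an in-memory parquet file and read it back.
    let mut buf = Vec::with_capacity(1024);
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None)?;
    writer.write(&original)?;
    writer.close()?;

    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buf), 1024)?;
    let ret = reader.next().unwrap()?;

    // Compare logical values after unpacking each dictionary.
    let expected = cast(original.column(0), &DataType::FixedSizeBinary(4))?;
    let actual = cast(ret.column(0), &DataType::FixedSizeBinary(4))?;
    assert_eq!(expected.as_ref(), actual.as_ref());
    Ok(())
}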
Which issue does this PR close?
Rationale for this change
Support writing parquet from arrow when the Dictionary type is used and the values in the dictionary are FixedSizeBinary.
What changes are included in this PR?
The type is now supported. We treat the type as a byte array (similar to what we would do for the arrow type Dictionary(_, Utf8)).
Are there any user-facing changes?
If the user tries to write a parquet file from their arrow record batch, and they use the type Dictionary(_, FixedSizeBinary(_)), the write will no longer fail with the error message: