feat: Support Arrow type Dictionary(_, FixedSizeBinary(_)) when writing Parquet #7446
Conversation
(Title edited: "Dictionary(_, FixedSizeList(_)) when writing Parquet" changed to "Dictionary(_, FixedSizeBinary(_)) when writing Parquet".)
Hey @alamb, I fixed the linter failures from the run you triggered. Would you mind triggering the workflow again?
Just reviewing the changes (no knowledge of the context), this looks good to me. If I understand correctly, you are converting from FSB into Binary for the write path and then casting on the read path?
Are there other Parquet implementations that support this type? Is there maybe a test file we could add to an integration test somewhere (e.g. a file of Dictionary(_, FixedSizeBinary(_)) written by parquet-cpp)?
I'm not super familiar with the expectations for tests in parquet. I suspect @alamb or @tustvold might know more.
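For context, here is a minimal sketch of the two casts being described, using the arrow cast kernel. The array values are illustrative, and this assumes the cast kernel in your arrow-rs version supports Binary to FixedSizeBinary (not necessarily how the PR implements it internally):

use arrow::array::{Array, FixedSizeBinaryArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() {
    // Write path: widen FixedSizeBinary(4) into variable-width Binary.
    let fsb = FixedSizeBinaryArray::try_from_iter(
        vec![vec![1u8, 2, 3, 4], vec![5u8, 6, 7, 8]].into_iter(),
    )
    .unwrap();
    let binary = cast(&fsb, &DataType::Binary).unwrap();

    // Read path: cast back to the fixed-size type recorded in the schema.
    let roundtrip = cast(binary.as_ref(), &DataType::FixedSizeBinary(4)).unwrap();
    assert_eq!(roundtrip.as_ref(), &fsb as &dyn Array);
}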
DataType::Dictionary(
    Box::new(DataType::UInt8),
    Box::new(DataType::FixedSizeBinary(4)),
),
Just to be thorough, can we iterate through the various key types to ensure we got the match statement in byte_array_dictionary correct?
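A sketch of what that iteration could look like (illustrative only; the actual test lives in the PR, and a real test would round-trip data rather than print):

use arrow::datatypes::DataType;

fn main() {
    // Cover every integer key type so each arm of the match statement in
    // byte_array_dictionary is exercised.
    for key in [
        DataType::Int8,
        DataType::Int16,
        DataType::Int32,
        DataType::Int64,
        DataType::UInt8,
        DataType::UInt16,
        DataType::UInt32,
        DataType::UInt64,
    ] {
        let dict_type = DataType::Dictionary(
            Box::new(key),
            Box::new(DataType::FixedSizeBinary(4)),
        );
        // A real test would build a batch of this type, write it with
        // ArrowWriter, and read it back.
        println!("would round-trip {dict_type}");
    }
}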
Sure, sounds good. Thanks @westonpace, I'll make this change.
@westonpace made this change
I think the idea of this function is to allow round-tripping Dictionary arrow types via parquet. I am not sure we have documented the behavior of the metadata anywhere -- we probably should. Since the arrow type system and the parquet type system are different, arrow-rs adds metadata into the parquet file to ensure it can recover the original arrow types: the metadata helps choose which arrow type to use when there are several potential arrow types for a parquet type (e.g. parquet Binary can go to either BinaryView or Binary).
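A small sketch of where that metadata lives. The writer embeds the serialized Arrow schema in the footer key/value metadata under the ARROW:schema key; the column used here is purely illustrative:

use std::sync::Arc;
use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let batch =
        RecordBatch::try_new(schema.clone(), vec![Arc::new(Int32Array::from(vec![1, 2, 3]))])?;

    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // The serialized Arrow schema is stored in the footer key/value metadata;
    // readers consult it to pick among candidate Arrow types on the read path.
    let reader = SerializedFileReader::new(Bytes::from(buf))?;
    let kv = reader.metadata().file_metadata().key_value_metadata();
    assert!(kv.map_or(false, |kvs| kvs.iter().any(|kv| kv.key == "ARROW:schema")));
    Ok(())
}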
@alamb yes, that's the idea with this change. Thanks for the explainer on the metadata.
No problem -- I also made a PR to update the docs to explain it better.
Thank you @albertlockett and @westonpace!
@albertlockett can you please add one "end to end" roundtrip test, like this:
arrow-rs/parquet/src/arrow/arrow_reader/mod.rs
Lines 1261 to 1319 in 11c99a3
#[test]
fn test_float16_roundtrip() -> Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("float16", ArrowDataType::Float16, false),
        Field::new("float16-nullable", ArrowDataType::Float16, true),
    ]));
    let mut buf = Vec::with_capacity(1024);
    let mut writer = ArrowWriter::try_new(&mut buf, schema.clone(), None)?;
    let original = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Float16Array::from_iter_values([
                f16::EPSILON,
                f16::MIN,
                f16::MAX,
                f16::NAN,
                f16::INFINITY,
                f16::NEG_INFINITY,
                f16::ONE,
                f16::NEG_ONE,
                f16::ZERO,
                f16::NEG_ZERO,
                f16::E,
                f16::PI,
                f16::FRAC_1_PI,
            ])),
            Arc::new(Float16Array::from(vec![
                None,
                None,
                None,
                Some(f16::NAN),
                Some(f16::INFINITY),
                Some(f16::NEG_INFINITY),
                None,
                None,
                None,
                None,
                None,
                None,
                Some(f16::FRAC_1_PI),
            ])),
        ],
    )?;
    writer.write(&original)?;
    writer.close()?;
    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buf), 1024)?;
    let ret = reader.next().unwrap()?;
    assert_eq!(ret, original);
    // Ensure can be downcast to the correct type
    ret.column(0).as_primitive::<Float16Type>();
    ret.column(1).as_primitive::<Float16Type>();
    Ok(())
}
That writes/reads data from an actual parquet file?
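A sketch of what such a test might look like for this PR's type, modeled on the float16 test above. The test name, sample values, and comparison strategy are assumptions (the reader may re-encode the dictionary, so this compares unpacked values rather than the exact key/value layout):

use std::sync::Arc;
use arrow::array::{ArrayRef, DictionaryArray, FixedSizeBinaryArray, UInt8Array};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReader;
use parquet::arrow::ArrowWriter;

#[test]
fn test_fixed_size_binary_dictionary_roundtrip() -> Result<(), Box<dyn std::error::Error>> {
    let dict_type = DataType::Dictionary(
        Box::new(DataType::UInt8),
        Box::new(DataType::FixedSizeBinary(4)),
    );
    let schema = Arc::new(Schema::new(vec![Field::new("fsb_dict", dict_type, true)]));

    // Two distinct 4-byte values, referenced (with a null) by the keys.
    let values = FixedSizeBinaryArray::try_from_iter(
        vec![vec![0u8, 1, 2, 3], vec![4u8, 5, 6, 7]].into_iter(),
    )?;
    let keys = UInt8Array::from(vec![Some(0u8), Some(1), None, Some(0)]);
    let dict = DictionaryArray::new(keys, Arc::new(values) as ArrayRef);
    let original = RecordBatch::try_new(schema.clone(), vec![Arc::new(dict) as ArrayRef])?;

    // Write to an in-memory parquet file and read it back.
    let mut buf = Vec::with_capacity(1024);
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None)?;
    writer.write(&original)?;
    writer.close()?;

    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buf), 1024)?;
    let ret = reader.next().unwrap()?;

    // Compare logical values after unpacking each dictionary.
    let expected = cast(original.column(0), &DataType::FixedSizeBinary(4))?;
    let actual = cast(ret.column(0), &DataType::FixedSizeBinary(4))?;
    assert_eq!(expected.as_ref(), actual.as_ref());
    Ok(())
}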
Which issue does this PR close?
Rationale for this change
Support writing parquet from arrow when the Dictionary type is used and the values in the dictionary are FixedSizeBinary.
What changes are included in this PR?
The type is now supported. We treat the type as a byte array (similar to what we would do for the arrow type Dictionary(_, Utf8)).
Are there any user-facing changes?
If the user tries to write a parquet file from their arrow record batch, and they use the type Dictionary(_, FixedSizeBinary(_)), the write will no longer fail with the error message: