Skip to content

Support Utf8View for Avro #7434

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kumarlokesh
Copy link
Contributor

@kumarlokesh kumarlokesh commented Apr 22, 2025

Which issue does this PR close?

Closes #7262.

Rationale for this change

What changes are included in this PR?

This PR adds support for Utf8View to the arrow-avro reader, enabling the creation of StringViewArray instead of regular StringArray for string data.

  • Added Utf8View variant to Codec enum and with_utf8view method for codec conversion
  • Updated Decoder enum with a StringView variant to handle StringViewArray creation
  • Added a configuration option use_utf8view to ReadOptions struct
  • Added benchmarks
  • Created an example application demonstrating how to use the feature

Benchmark output:

array_creation/string_array_10_chars                                                                            
                        time:   [9.5921 µs 9.6520 µs 9.7296 µs]
                        change: [-7.4725% -5.8906% -4.2024%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe
array_creation/string_view_10_chars                                                                            
                        time:   [10.053 µs 10.128 µs 10.207 µs]
                        change: [-3.1158% -0.9311% +1.0989%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
array_creation/string_array_100_chars                                                                            
                        time:   [16.242 µs 16.372 µs 16.576 µs]
                        change: [-1.3374% +0.4015% +2.4089%] (p = 0.71 > 0.05)
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe
array_creation/string_view_100_chars                                                                            
                        time:   [13.027 µs 13.290 µs 13.592 µs]
                        change: [-16.763% -12.893% -9.4265%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe
array_creation/string_array_1000_chars                                                                           
                        time:   [76.218 µs 77.540 µs 78.691 µs]
                        change: [-8.4455% -6.3045% -4.1053%] (p = 0.00 < 0.05)
                        Performance has improved.
array_creation/string_view_1000_chars                                                                            
                        time:   [45.964 µs 46.399 µs 46.770 µs]
                        change: [-8.7387% -3.9085% -0.5912%] (p = 0.07 > 0.05)
                        No change in performance detected.

string_operations/string_array_value_10_chars                                                                            
                        time:   [546.50 ns 565.27 ns 585.60 ns]
                        change: [-16.303% -10.503% -5.2608%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
string_operations/string_view_value_10_chars                                                                            
                        time:   [973.27 ns 992.09 ns 1.0101 µs]
                        change: [-14.527% -5.2942% +1.9214%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  1 (5.00%) low mild
  2 (10.00%) high mild
string_operations/string_array_value_100_chars                                                                            
                        time:   [548.03 ns 559.51 ns 573.23 ns]
                        change: [+1.1381% +3.0271% +5.0339%] (p = 0.01 < 0.05)
                        Performance has regressed.
string_operations/string_view_value_100_chars                                                                            
                        time:   [987.22 ns 1.0068 µs 1.0248 µs]
                        change: [-8.6431% +0.4588% +6.7220%] (p = 0.93 > 0.05)
                        No change in performance detected.
string_operations/string_array_value_1000_chars                                                                            
                        time:   [545.70 ns 556.69 ns 571.33 ns]
                        change: [-13.942% -8.4282% -3.7589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 20 measurements (15.00%)
  1 (5.00%) high mild
  2 (10.00%) high severe
string_operations/string_view_value_1000_chars                                                                            
                        time:   [947.71 ns 960.19 ns 974.77 ns]
                        change: [+0.9469% +4.6877% +8.4201%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild

avro_reader/string_array_10_chars                                                                           
                        time:   [149.00 µs 152.79 µs 155.90 µs]
                        change: [+2.5473% +5.4637% +8.5654%] (p = 0.00 < 0.05)
                        Performance has regressed.
avro_reader/string_view_10_chars                                                                           
                        time:   [141.75 µs 142.96 µs 144.64 µs]
                        change: [-14.307% -5.2067% +1.3228%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
avro_reader/string_array_100_chars                                                                           
                        time:   [173.20 µs 177.79 µs 183.34 µs]
                        change: [-4.6247% +1.1333% +5.8015%] (p = 0.72 > 0.05)
                        No change in performance detected.
avro_reader/string_view_100_chars                                                                           
                        time:   [174.64 µs 178.95 µs 183.48 µs]
                        change: [-10.183% -0.2004% +7.0467%] (p = 0.97 > 0.05)
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
avro_reader/string_array_1000_chars                                                                           
                        time:   [404.78 µs 415.51 µs 428.94 µs]
                        change: [-12.609% -2.9688% +4.7088%] (p = 0.64 > 0.05)
                        No change in performance detected.
avro_reader/string_view_1000_chars                                                                           
                        time:   [392.38 µs 403.15 µs 416.30 µs]
                        change: [+4.5005% +11.040% +18.788%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe

From the above benchmark results, we can draw following conclusions:

1. Array Creation Performance
1.1 Small Strings (10 chars):
StringArray: 9.65 µs
StringViewArray: 10.13 µs
StringArray is about 5% faster for small strings

1.2 Medium Strings (100 chars):
StringArray: 16.37 µs
StringViewArray: 13.29 µs
StringViewArray is about 19% faster

1.3 Large Strings (1000 chars):
StringArray: 77.54 µs
StringViewArray: 46.40 µs
StringViewArray is about 40% faster

2. String Value Access Performance
Value Access (all string sizes):
StringArray: ~560 ns
StringViewArray: ~980 ns
StringArray is consistently about 43% faster for value access

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 22, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kumarlokesh -- this looks really nice

To make reviewing this PR easier, is there any way you can break it into multiple smaller PRs? Specifically,

  1. Improve the comments
  2. Support for Map
  3. Support for Utf8View?

Also it is not clear to me how much of a conflict this PR would have with:

pub fn codec(&self) -> &Codec {
&self.codec
}

/// Returns the nullability status of this data type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice -- thank you

ComplexType::Map(m) => Err(ArrowError::NotYetImplemented(format!(
"Map of {m:?} not currently supported"
))),
ComplexType::Map(m) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be a new feature (not just string view, but also map support)

Have you had a chance to review this PR?

@@ -23,9 +23,16 @@ use std::io::Read;
pub const CODEC_METADATA_KEY: &str = "avro.codec";

#[derive(Debug, Copy, Clone, Eq, PartialEq)]
/// Supported compression codecs for Avro data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these comments

@kumarlokesh kumarlokesh marked this pull request as draft April 28, 2025 20:17
@kumarlokesh
Copy link
Contributor Author

@alamb moved improvements to comments to another PR #7449

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Utf8View for avro
2 participants