Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return null for overflow when casting string to integer under safe option enabled #5398

Merged
merged 6 commits into from
Feb 15, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented Feb 13, 2024

Which issue does this PR close?

Closes #5397.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 13, 2024
@@ -438,7 +438,7 @@ macro_rules! parser_primitive {
($t:ty) => {
impl Parser for $t {
fn parse(string: &str) -> Option<Self::Native> {
lexical_core::parse::<Self::Native>(string.as_bytes()).ok()
string.parse::<Self::Native>().ok()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only touched integer parser.

Copy link
Member Author

@viirya viirya Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if floating parser also has same issue. I'm not sure about if there is floating type overflow behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can wait for a fix at the upstream crate. But the ticket is open for more than 6 months, and no progress so far. I think we may need to fix here directly.

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense in terms of correctness

Looks like this was added by #4050 which boasted a non-trivial speedup, so I guess this correctness fix will cause a performance regression 🤔

I wouldn't mind trying to take a look at the core issue in lexical_core but I can't promise anything 😅

Edit: seems there was even an issue for that raised ~1.5 years ago too: Alexhuszagh/rust-lexical#91

Given the maintainer doesn't seem active anymore either, I guess even if an upstream fix is suggested it might not get merged... unless we rely on a fork 🤔

Edit2: for reference, polars removed dependency on lexical: pola-rs/polars#12512

@viirya
Copy link
Member Author

viirya commented Feb 14, 2024

Hmm, polars uses atoi_simd::parse, maybe it is a good choice for both correctness and performance.

@viirya
Copy link
Member Author

viirya commented Feb 14, 2024

I switched to use atoi_simd. I will run bench to check performance diff.

@viirya
Copy link
Member Author

viirya commented Feb 14, 2024

The benchmark is mixed with improvement and a little regression.

Improvement:

4096 u64_small(0) - 128 time:   [70.724 µs 70.777 µs 70.867 µs]
                        change: [-12.211% -12.012% -11.792%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 u64_small(0) - 1024
                        time:   [55.353 µs 55.382 µs 55.417 µs]
                        change: [-12.536% -12.217% -11.812%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

4096 u64_small(0) - 4096
                        time:   [54.867 µs 55.030 µs 55.169 µs]
                        change: [-11.483% -11.257% -10.999%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
4096 i64(0) - 128       time:   [114.86 µs 114.93 µs 115.04 µs]
                        change: [-23.963% -23.561% -23.122%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

4096 i64(0) - 1024      time:   [99.625 µs 100.80 µs 101.90 µs]
                        change: [-27.344% -26.984% -26.517%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  3 (3.00%) high mild
  14 (14.00%) high severe

4096 i64(0) - 4096      time:   [108.51 µs 108.94 µs 109.50 µs]
                        change: [-18.432% -17.622% -16.805%] (p = 0.00 < 0.05)
                        Performance has improved.

Regression:

4096 u64(0) - 1024      time:   [115.03 µs 115.18 µs 115.41 µs]
                        change: [+4.2654% +4.4768% +4.6911%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 u64(0) - 4096      time:   [121.25 µs 121.51 µs 121.87 µs]
                        change: [+5.9620% +6.6001% +7.2883%] (p = 0.00 < 0.05)
                        Performance has regressed.

@@ -49,6 +49,7 @@ chrono = { workspace = true }
half = { version = "2.1", default-features = false }
num = { version = "0.4", default-features = false, features = ["std"] }
lexical-core = { version = "^0.8", default-features = false, features = ["write-integers", "write-floats", "parse-integers", "parse-floats"] }
atoi_simd = "0.15.6"
Copy link
Contributor

@tustvold tustvold Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little concerned that this crate does not seem to have a very large community around it.

How does performance compare to atoi?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will run benchmark with atoi to compare it.

As this is a parser for integers, I guess we can easily switch to other similar crate (e.g., atoi) if we want. Community size seems not a big concern to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Community size seems not a big concern to me

Given the motivating factor for switching is a bug not getting attention, I am concerned. I'd be willing to sacrifice performance, for a better long-term maintenance story.

@viirya
Copy link
Member Author

viirya commented Feb 14, 2024 via email

@viirya
Copy link
Member Author

viirya commented Feb 14, 2024

atoi encounters more regression as expected, although looks like it is not as worse as str.parse:

4096 u64(0) - 128       time:   [202.56 µs 203.03 µs 203.43 µs]                                                    
                        change: [+15.998% +16.532% +17.063%] (p = 0.00 < 0.05)                                     
                        Performance has regressed.
                                                                                                                   
4096 u64(0) - 1024      time:   [181.58 µs 181.95 µs 182.37 µs]                                                    
                        change: [+16.615% +17.517% +18.295%] (p = 0.00 < 0.05)
                        Performance has regressed.                                                                 
Found 9 outliers among 100 measurements (9.00%)       
  5 (5.00%) high mild                            
  4 (4.00%) high severe  
                                                         
4096 u64(0) - 4096      time:   [197.34 µs 198.98 µs 201.23 µs]                                                    
                        change: [+25.579% +28.243% +31.146%] (p = 0.00 < 0.05)                                     
                        Performance has regressed.                                                                 
Found 21 outliers among 100 measurements (21.00%)                                                                  
  4 (4.00%) low severe                                
  2 (2.00%) low mild                           
  3 (3.00%) high mild
  12 (12.00%) high severe                                
                                                                                                                   
4096 i64_small(0) - 128 time:   [120.70 µs 121.20 µs 121.76 µs]               
                        change: [+3.6602% +4.1249% +4.5580%] (p = 0.00 < 0.05)                                     
                        Performance has regressed.                                                                 
                                                         
4096 i64_small(0) - 1024                                 
                        time:   [103.90 µs 105.02 µs 106.38 µs]
                        change: [+9.7972% +12.351% +15.643%] (p = 0.00 < 0.05)
                        Performance has regressed.                                                                 
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe                                                                                            

@viirya
Copy link
Member Author

viirya commented Feb 14, 2024

Due to atoi not work as expected, and the concern around atoi_simd, I changed it back to str.parse.

@tustvold
Copy link
Contributor

By default atoi allows for trailing content, the following ensures there is none.

macro_rules! parser_primitive {
    ($t:ty) => {
        impl Parser for $t {
            fn parse(string: &str) -> Option<Self::Native> {
                match atoi::FromRadix10SignedChecked::from_radix_10_signed_checked(string.as_bytes()) {
                    (Some(n), x) if x == string.len() => Some(n),
                    _ => None,
                }
            }
        }
    };
}

@tustvold tustvold merged commit eb4be68 into apache:master Feb 15, 2024
26 checks passed
@viirya
Copy link
Member Author

viirya commented Feb 15, 2024

Thank you @tustvold @Jefffrey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cast kernel doesn't return null for string to integral cases when overflowing under safe option enabled
3 participants