Return null for overflow when casting string to integer under safe option enabled #5398

viirya · 2024-02-13T19:28:03Z

Which issue does this PR close?

Closes #5397.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

viirya · 2024-02-13T19:28:54Z

arrow-cast/src/parse.rs

@@ -438,7 +438,7 @@ macro_rules! parser_primitive {
    ($t:ty) => {
        impl Parser for $t {
            fn parse(string: &str) -> Option<Self::Native> {
-                lexical_core::parse::<Self::Native>(string.as_bytes()).ok()
+                string.parse::<Self::Native>().ok()


I only touched the integer parser.

I'm not sure whether the floating-point parser has the same issue, or whether floating-point types have any overflow behavior at all.

We could wait for a fix in the upstream crate, but the ticket has been open for more than 6 months with no progress so far. I think we may need to fix it here directly.
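
For reference, a minimal sketch of the behavior this change relies on, using only the standard library: `str::parse` reports integer overflow as an error, so `.ok()` turns it into `None`, which a safe cast can then surface as null. The helper names `parse_i32`, `cast_strings_safe` and `parse_f64` are hypothetical and not part of arrow-cast. On the floating-point question, the last helper only shows what std does with an out-of-range literal; it says nothing about how lexical_core behaves.

```rust
/// Hypothetical helper: parse a decimal string as i32, mapping every
/// failure (invalid digits, overflow) to None.
fn parse_i32(s: &str) -> Option<i32> {
    s.parse::<i32>().ok()
}

/// Hypothetical stand-in for a "safe" cast: unparsable or overflowing
/// inputs become None (null) instead of failing the whole cast.
fn cast_strings_safe(values: &[&str]) -> Vec<Option<i32>> {
    values.iter().map(|v| parse_i32(v)).collect()
}

/// Hypothetical helper: std float parsing does not error on out-of-range
/// magnitudes; a literal such as "1e400" parses to infinity.
fn parse_f64(s: &str) -> Option<f64> {
    s.parse::<f64>().ok()
}
```

With these helpers, `parse_i32("2147483648")` yields `None` because the value exceeds `i32::MAX`, while `parse_i32("2147483647")` yields `Some(i32::MAX)`.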

Jefffrey

This makes sense in terms of correctness

Looks like this was added by #4050 which boasted a non-trivial speedup, so I guess this correctness fix will cause a performance regression 🤔

I wouldn't mind trying to take a look at the core issue in lexical_core but I can't promise anything 😅

Edit: seems there was even an issue for that raised ~1.5 years ago too: Alexhuszagh/rust-lexical#91

Given the maintainer doesn't seem active anymore either, I guess even if an upstream fix is suggested it might not get merged... unless we rely on a fork 🤔

Edit2: for reference, polars removed its dependency on lexical: pola-rs/polars#12512

viirya · 2024-02-14T18:25:53Z

Hmm, polars uses atoi_simd::parse; maybe it is a good choice for both correctness and performance.

viirya · 2024-02-14T18:29:32Z

I switched to atoi_simd. I will run the benchmarks to check the performance difference.

viirya · 2024-02-14T19:00:54Z

The benchmark results are mixed: some improvements and a small regression.

Improvement:

4096 u64_small(0) - 128 time:   [70.724 µs 70.777 µs 70.867 µs]
                        change: [-12.211% -12.012% -11.792%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 u64_small(0) - 1024
                        time:   [55.353 µs 55.382 µs 55.417 µs]
                        change: [-12.536% -12.217% -11.812%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

4096 u64_small(0) - 4096
                        time:   [54.867 µs 55.030 µs 55.169 µs]
                        change: [-11.483% -11.257% -10.999%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 i64(0) - 128       time:   [114.86 µs 114.93 µs 115.04 µs]
                        change: [-23.963% -23.561% -23.122%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

4096 i64(0) - 1024      time:   [99.625 µs 100.80 µs 101.90 µs]
                        change: [-27.344% -26.984% -26.517%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  3 (3.00%) high mild
  14 (14.00%) high severe

4096 i64(0) - 4096      time:   [108.51 µs 108.94 µs 109.50 µs]
                        change: [-18.432% -17.622% -16.805%] (p = 0.00 < 0.05)
                        Performance has improved.

Regression:

4096 u64(0) - 1024      time:   [115.03 µs 115.18 µs 115.41 µs]
                        change: [+4.2654% +4.4768% +4.6911%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 u64(0) - 4096      time:   [121.25 µs 121.51 µs 121.87 µs]
                        change: [+5.9620% +6.6001% +7.2883%] (p = 0.00 < 0.05)
                        Performance has regressed.

tustvold · 2024-02-14T19:26:15Z

arrow-cast/Cargo.toml

@@ -49,6 +49,7 @@ chrono = { workspace = true }
 half = { version = "2.1", default-features = false }
 num = { version = "0.4", default-features = false, features = ["std"] }
 lexical-core = { version = "^0.8", default-features = false, features = ["write-integers", "write-floats", "parse-integers", "parse-floats"] }
+atoi_simd = "0.15.6"


I'm a little concerned that this crate does not seem to have a very large community around it.

How does performance compare to atoi?

I will run the benchmark with atoi to compare.

As this is a parser for integers, I guess we can easily switch to another similar crate (e.g., atoi) if we want. Community size seems not a big concern to me.

> Community size seems not a big concern to me

Given the motivating factor for switching is a bug not getting attention, I am concerned. I'd be willing to sacrifice performance, for a better long-term maintenance story.

viirya · 2024-02-14T19:45:22Z

Hmm, that's a good point, although I also think maintenance is not highly related to community size. Anyway, I will try atoi once I return to my laptop later.

viirya · 2024-02-14T20:53:33Z

As expected, atoi shows more regression, although it does not look as bad as str.parse:

4096 u64(0) - 128       time:   [202.56 µs 203.03 µs 203.43 µs]
                        change: [+15.998% +16.532% +17.063%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 u64(0) - 1024      time:   [181.58 µs 181.95 µs 182.37 µs]
                        change: [+16.615% +17.517% +18.295%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

4096 u64(0) - 4096      time:   [197.34 µs 198.98 µs 201.23 µs]
                        change: [+25.579% +28.243% +31.146%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  4 (4.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  12 (12.00%) high severe

4096 i64_small(0) - 128 time:   [120.70 µs 121.20 µs 121.76 µs]
                        change: [+3.6602% +4.1249% +4.5580%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 i64_small(0) - 1024
                        time:   [103.90 µs 105.02 µs 106.38 µs]
                        change: [+9.7972% +12.351% +15.643%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

viirya · 2024-02-14T22:05:44Z

Because atoi does not work as expected, and given the concern around atoi_simd, I changed it back to str.parse.

tustvold · 2024-02-14T22:44:30Z

By default atoi allows trailing content; the following ensures there is none.

macro_rules! parser_primitive {
    ($t:ty) => {
        impl Parser for $t {
            fn parse(string: &str) -> Option<Self::Native> {
                match atoi::FromRadix10SignedChecked::from_radix_10_signed_checked(string.as_bytes()) {
                    (Some(n), x) if x == string.len() => Some(n),
                    _ => None,
                }
            }
        }
    };
}
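
The check above can be illustrated without the atoi dependency. The sketch below uses a hypothetical prefix parser, `parse_prefix_i64`, that mimics the (value, bytes consumed) return shape used in the snippet; it is a stand-in for illustration, not atoi's implementation. `parse_strict_i64` (also hypothetical) then accepts the value only when the whole input was consumed. As an assumption of this sketch, the stand-in reports no value when it finds no digits.

```rust
/// Hypothetical stand-in for a prefix-style integer parser: it reads an
/// optional sign plus as many leading decimal digits as it can, and returns
/// the parsed value (None on overflow or when there are no digits) together
/// with the number of bytes consumed.
fn parse_prefix_i64(bytes: &[u8]) -> (Option<i64>, usize) {
    let mut idx = 0;
    let negative = match bytes.first() {
        Some(b'-') => {
            idx = 1;
            true
        }
        Some(b'+') => {
            idx = 1;
            false
        }
        _ => false,
    };
    let digits_start = idx;

    // Accumulate as a negative number so that i64::MIN is representable;
    // checked arithmetic turns any overflow into None.
    let mut acc: Option<i64> = Some(0);
    while idx < bytes.len() && bytes[idx].is_ascii_digit() {
        let digit = i64::from(bytes[idx] - b'0');
        acc = acc
            .and_then(|a| a.checked_mul(10))
            .and_then(|a| a.checked_sub(digit));
        idx += 1;
    }

    if idx == digits_start {
        // No digits at all (empty input or a lone sign).
        return (None, 0);
    }

    let value = if negative {
        acc
    } else {
        acc.and_then(|a| a.checked_neg())
    };
    (value, idx)
}

/// Accept the parsed value only if the parser consumed the entire input,
/// which rejects trailing content such as "12abc".
fn parse_strict_i64(s: &str) -> Option<i64> {
    match parse_prefix_i64(s.as_bytes()) {
        (Some(n), consumed) if consumed == s.len() => Some(n),
        _ => None,
    }
}
```

Here `parse_prefix_i64(b"12abc")` stops after two bytes, so `parse_strict_i64("12abc")` is rejected, and an out-of-range input such as `"9223372036854775808"` is rejected because the checked arithmetic reports overflow.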

viirya · 2024-02-15T16:53:41Z

Thank you @tustvold @Jefffrey

Return null for overflow when casting string to integer

a36bb77

github-actions bot added the arrow Changes to the arrow crate label Feb 13, 2024

viirya commented Feb 13, 2024

Jefffrey approved these changes Feb 14, 2024

Use atoi_simd

ef5e05d

tustvold reviewed Feb 14, 2024

Use atoi

9914866

viirya commented Feb 14, 2024

Return to str.parse.

53dd047

viirya added 2 commits February 14, 2024 14:55

Revert "Return to str.parse."

c22e5cd

This reverts commit 53dd047.

Check trailing string

ebe2b58

tustvold merged commit eb4be68 into apache:master Feb 15, 2024
26 checks passed

This was referenced Feb 18, 2024

Refactor integer type inference logic to fit smallest type #5406

Closed

Remove usages of lexical_core for parsing integers #5422

Closed

tustvold mentioned this pull request Mar 1, 2024

Cast kernel doesn't return null for string to integral cases when overflowing under safe option enabled #5397

Closed

tustvold mentioned this pull request Mar 14, 2024

Empty String Parses as Zero in Unreleased Arrow #5504

Closed

Jefffrey mentioned this pull request Sep 15, 2024

Bump lexical-core to 1.0 #6397

Closed
