-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More rigorous treatment of floats in tests #13590
Conversation
44fec95
to
9942c60
Compare
Thank you @leoyvens -- this looks epic. I will review this PR but I may not have a chance to do so for a day or two. It looks awesome |
9942c60
to
70e7e62
Compare
Make sense to me👍 |
4ca9592
to
bb9dce5
Compare
bb9dce5
to
a456db3
Compare
There was the following test failure on amd64 and win64:
This is surfacing non-determinism across target platforms. This is expected behaviour for Rust std. For
So I went looking to see if there was a performant but portable float math library we could use. libm seems to be it, it's what rustc uses when targeting WASM. To gain confidence that we'd not be risking any significant performance regression, I analysed some benchmark results taken from the CI of the metallic-rs project (credit to @jdh8). It benchmarks only f32, not f64. From this data, I made a chart comparing std and libm results: libm and std seem to have similar performance. If we value portability, I'd propose that we switch to |
3 1956035476 9.590891361237 | ||
4 16155718643 9.531112968922 | ||
5 6449337880 7.074412226677 | ||
1 -438598674 12.15325379371643 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
16 digits is overspecified. double arithmetics is inherently imprecise and so we should compare with epsilon
truncating digits is not enough, given 2 ~= 1.9999999999999999
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we need the results of sqllogicaltest complete mode to be the same across different platforms.
So the old scheme of rounding half to a certain precision seems to be a good solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we need the results of sqllogicaltest complete mode to be the same across different platforms. So the old scheme of rounding half to a certain precision seems to be a good solution.
I agree
why would we want to use it? |
I think portability is not necessary, and PostgreSQL doesn't guarantee it either. |
There are myths and truths to floating-point reproducibility across platforms. Some facts I've gathered while working on this:
For many real-world DataFusion use cases, floating-point operations are reproducible. In my use case, I care about reproducibility so it's not just a tests problem, it's a product problem that I'd like the tests to cover. Epsilon comparison should be used for any test that surfaces a concrete reproducibility problem due to point 4, non-associativity. So far, the only such situation surfaced by local and CI testing was the tests for the On the |
These two points imply that a database's Floating-point arithmetic results are not supposed to be reproducible. Let's consider an example CREATE OR REPLACE TABLE doubles(a double);
INSERT INTO doubles VALUES (1e30);
INSERT INTO doubles VALUES (-1e30);
INSERT INTO doubles VALUES (1);
INSERT INTO doubles VALUES (-1);
SELECT sum(a) FROM doubles; If we now run SELECT sum(a) FROM doubles; database can return 0, 1 or -1. |
I agree with @jonahgao and @findepi -- ensuring exact floating point reproducibility is not something most database systems do, I think due to the (performance) cost of doing so
I think this is the core challenge of why getting reproduceable results on a multi-threaded processing system like DataFusion will be hard without major changes. To ensure the same floating point results requires ensuring the same order of calculation. It implies, for example, that the order processing intermediate aggregate rows must be the same, even if one core is done before another. So TLDR is
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for this contribution @leoyvens -- it is great to see you tacking and working on improving the tests. 🙏
@@ -92,7 +92,6 @@ arrow-ipc = { version = "53.3.0", default-features = false, features = [ | |||
arrow-ord = { version = "53.3.0", default-features = false } | |||
arrow-schema = { version = "53.3.0", default-features = false } | |||
async-trait = "0.1.73" | |||
bigdecimal = "0.4.6" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is great to remove bigdecimal
3 1956035476 9.590891361237 | ||
4 16155718643 9.531112968922 | ||
5 6449337880 7.074412226677 | ||
1 -438598674 12.15325379371643 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we need the results of sqllogicaltest complete mode to be the same across different platforms. So the old scheme of rounding half to a certain precision seems to be a good solution.
I agree
Of course libm doesn't solve point 4, non-associativity, but it solves point 5, math functions. But ok I understand that using std is prudent and that reproducibility for math functions is considered a niche requirement, as Postgres doesn't provide it. So I'll revert libm. On the rounding, to confirm, you don't want to round only tests actually affected by floating point determinism issues (points 4 and 5 in my above comment), but rather you want to go back to rounding across the board "just in case"? If so, I need to change a bunch of tests again. Is there a tool to automatically update the sqllogictests with the new output? I did it by hand the first time and it's tedious. |
there is:
(but mind #12752) |
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look |
@alamb I need some feedback on desired direction for use of rounding. I see three ways this could go :
Which is the desired approach? |
That would be my almost preferred way. the code goes something like fn f64_to_str(value: f64) -> String {
// TODO compare float values with epsilon. As a workaround the rendering precision is reduced.
float_to_str(value, 10)
}
fn float_to_str(value: f64, max_significant_digits: u8) -> String {
if value.is_nan() {
// The sign of NaN can be different depending on platform.
// So the string representation of NaN ignores the sign.
"NaN".to_string()
} else if value == f64::INFINITY {
"Infinity".to_string()
} else if value == f64::NEG_INFINITY {
"-Infinity".to_string()
} else {
let mut big_decimal = BigDecimal::from_f64(value)
.unwrap()
// Truncate trailing decimal zeros
.normalized();
let precision = big_decimal.digits();
if precision > max_significant_digits as u64 {
let scale = big_decimal.as_bigint_and_exponent().1;
big_decimal = big_decimal
.with_scale_round(
scale + (max_significant_digits as i64 - precision as i64),
RoundingMode::HalfUp,
)
// Truncate trailing decimal zeros
.normalized();
}
big_decimal.to_string()
}
} A better way would be to interpret the values with an epsilon, but that will be hard to do in SLT which relies on the rows being stringified before comparison. |
I agree -- this is how I have seen such comparisons done before |
In that case the potential valuable changes would be:
I don't see how to do 1 while maintaining the Postgres-like rounding behaviour that is currently used in tests. Point 2 is a contradiction because a DataFusion user would not see any rounding, nor is there any UDF to emulate Postgres rounding. Point 3 is possibly nice but not enough to motivate me to pursue it. So I don't see anything to be done that would be sufficiently valuable to both my use case and upstream DataFusion at this point, so closing the PR. Thanks all for the reviews. |
Thank you for the effort @leoyvens -- I am sorry we were not able to find an improvement to work on |
Which issue does this PR close?
My motivation was to improve DF testing of float outputs, even at the least-significant digits.
The situation in #13569 seemed a bit uncomfortable, and generally the SLTs were formatting floats and decimal in complicated ways. So I took a shot at tackling it "the hard way", which involved changing virtually all SLTs that print out floats.
Rationale for this change
The rationale is that this makes the SLT test output closer to the output a DataFusion user would typically see, in
datafusion-cli
, when writing float outputs to CSV or when usingarrow_cast
to convert a float to string.Tests do become more sensitive to minor changes in the handling of floats by DataFusion. But if the user will have to deal with it, then the tests should also have to deal with it.
The interaction with pg_compat tests is an interesting one. The Postgres
avg
over integers returns anumeric
, while the DataFusionavg
over integers returns aFloat64
. The SLTs previously dealt with this by rounding everything to 12 decimal digits. This PR deals with it by testing the value within an epsilon, with the justification of allowing us to entirely remove the dependency on thebigdecimal
crate.The
regr_*
family of UDAFs is also interesting. If the input is split across multiple batches, the output is not deterministic, so this PR also uses an epsilon there.What changes are included in this PR?
This PR does code changes only to
sqllogictest/src/engines/conversion.rs
. But the fallout to.slt
files makes the diff large.ryu
which is what arrow-rs uses, improving consistency with user-visible outputs.avg
of integers within an epsilon, rather than relying on implicit truncation by the SLT engine.bigdecimal
is removed (this was a test-only dependency).Are these changes tested?
The changes are to the tests. Who tests the tests? 😄
Are there any user-facing changes?
No.