-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve reciprocal estimate and tests #4363
base: main
Are you sure you want to change the base?
Conversation
Reciprocal estimations did not have enough accuracy. Tests were enabled to check for accurate values of reciprocals.
c651402
to
df9bbc3
Compare
I need some more documentation for what this is trying to solve. As far as I'm aware the FEAT_RPRES extension gives us the exact amount of precision required for SSE reciprocals, but doesn't match 3DNow! reciprocals (12-bit versus 14-bit). The non-RPRES implementation using FDIV and giving more precision than required just being an artifact and is allowed by spec, but can't use reciprocal estimate alone their because ARM's non-RPRES path is only 8-bit and needs a single step to improve precision. So I had just chose fdiv instead of rcpe+rcps there since fdiv alone is technically faster on hardware we cared about. |
This doesn't seem to be the case. I cannot find ARM docs on the exact precision provided by frsqrte, but all my measurements point out that frsqrte(x) is less precise than 1/sqrt(x) by more than the intel error margin for SSE of 1.5*2^-12. Tested it with https://gist.github.com/pmatos/c86455f1b4139186602a461e3b071275 Similarly for frecpe.
Running my test on 100M tests results in the following (on a Lenovo T14s):
What I did was to enable all the tests requiring precision checks. If you prefer can I split the PR into several small PRs. |
The most simple documentation comes from the FEAT_RPRES text.
Similar for RecipSqrtEstimate but annoyingly split between two pages. I think a thing to note is that subnormal/inf behaviour will end up resulting in different behaviour between an fdiv implementation and a reciprocal implementation and an easy way to check that (since this is only a 32-bit search space) is to just brute force all the values. #include <cstdio>
#include <cstdint>
#include <xmmintrin.h>
#include <emmintrin.h>
constexpr uint32_t MASK_12BIT = 0x007ff800U;
constexpr uint32_t ONE_ULP_12BIT = 0x00000800U;
constexpr uint32_t EXP_MASK = 0x7f800000;
constexpr uint32_t INF = 0x7f800000;
constexpr uint32_t SUBNORMAL = 0;
[[maybe_unused]]
static uint32_t recip(uint32_t in) {
__m128 val = _mm_cvtsi32_si128(in);
val = _mm_rcp_ss(val);
return _mm_cvtsi128_si32(val);
}
[[maybe_unused]]
static uint32_t recip_div(uint32_t in) {
__m128 one = _mm_cvtsi32_si128(0x3f800000);
__m128 val = _mm_cvtsi32_si128(in);
val = _mm_div_ss(one, val);
return _mm_cvtsi128_si32(val);
}
[[maybe_unused]]
static uint32_t recip_sqrt(uint32_t in) {
__m128 val = _mm_cvtsi32_si128(in);
val = _mm_rsqrt_ss(val);
return _mm_cvtsi128_si32(val);
}
[[maybe_unused]]
static uint32_t recip_sqrt_div(uint32_t in) {
__m128 one = _mm_cvtsi32_si128(0x3f800000);
__m128 val = _mm_cvtsi32_si128(in);
val = _mm_sqrt_ss(val);
val = _mm_div_ss(one, val);
return _mm_cvtsi128_si32(val);
}
#define RECIP
#ifdef RECIP
auto Test1 = recip;
auto Test2 = recip_div;
#else
auto Test1 = recip_sqrt;
auto Test2 = recip_sqrt_div;
#endif
int main(int argc, char **argv) {
size_t inf_results{};
size_t subnormal_results{};
for (uint64_t val = 0; val < 0x1'0000'0000ULL; ++val) {
auto recip_val = Test1(val);
auto div_val = Test2(val);
auto recip_val_masked = recip_val & MASK_12BIT;
auto div_val_masked = div_val & MASK_12BIT;
// Ignore results that generate INF and Subnormals
if ((recip_val & EXP_MASK) == INF) {
++inf_results;
continue;
}
if ((recip_val & EXP_MASK) == SUBNORMAL) {
++subnormal_results;
continue;
}
if (!((recip_val_masked + ONE_ULP_12BIT) >= div_val_masked ||
(recip_val_masked - ONE_ULP_12BIT) <= div_val_masked)) {
fprintf(stderr, "OUTSIDE OF RANGE: 0x%08x turned in to 0x%08x and 0x%08x\n", (uint32_t)val, recip_val, div_val);
}
}
fprintf(stderr, "Ignored %zd inf results, and %zd subnormals\n", inf_results, subnormal_results);
return 0;
} Comparing Zen 3 versus Oryon.
For the reciprocal sqrt, Same story
|
I must say I am not sure what to make of the message. I think I understand what you're saying, while at the same time it doesn't invalidate the fact that the reciprocal estimate on an AFP/RPRES system (frecpe or fsqrtpe) does not seem to be within the error margin required by the intel sse/avx instructions of 1.5*2^-12 relative error against the real reciprocal. At the same time, the only values I am checking are related to reciprocal of 0.0, -0.0, and reciprocal sqrt of 0.0, -0.0 and negative numbers (which should be inf, -inf and -nan respectively) - ie. not dealing or looking at subnormals. |
Reciprocal estimations did not have enough accuracy. Tests were enabled to check for accurate values of reciprocals.