Improve reciprocal estimate and tests #4363

pmatos · 2025-02-17T17:06:52Z

Reciprocal estimations did not have enough accuracy. Tests were enabled to check for accurate values of reciprocals.

Sonicadvance1 · 2025-02-17T17:55:58Z

I need some more documentation for what this is trying to solve. As far as I'm aware the FEAT_RPRES extension gives us the exact amount of precision required for SSE reciprocals, but doesn't match 3DNow! reciprocals (12-bit versus 14-bit).

The non-RPRES implementation using FDIV and giving more precision than required just being an artifact and is allowed by spec, but can't use reciprocal estimate alone their because ARM's non-RPRES path is only 8-bit and needs a single step to improve precision. So I had just chose fdiv instead of rcpe+rcps there since fdiv alone is technically faster on hardware we cared about.

pmatos · 2025-02-18T14:25:56Z

I need some more documentation for what this is trying to solve. As far as I'm aware the FEAT_RPRES extension gives us the exact amount of precision required for SSE reciprocals, but doesn't match 3DNow! reciprocals (12-bit versus 14-bit).

This doesn't seem to be the case. I cannot find ARM docs on the exact precision provided by frsqrte, but all my measurements point out that frsqrte(x) is less precise than 1/sqrt(x) by more than the intel error margin for SSE of 1.5*2^-12. Tested it with https://gist.github.com/pmatos/c86455f1b4139186602a461e3b071275

Similarly for frecpe.

The non-RPRES implementation using FDIV and giving more precision than required just being an artifact and is allowed by spec, but can't use reciprocal estimate alone their because ARM's non-RPRES path is only 8-bit and needs a single step to improve precision. So I had just chose fdiv instead of rcpe+rcps there since fdiv alone is technically faster on hardware we cared about.

Running my test on 100M tests results in the following (on a Lenovo T14s):

76.8% of frsqrte results are outside the 1.5*10^-12 error margin.
0% of frsqrte+frsqrts results are outside the error margin.
frsqrte+frsqrts seems to still be faster than fdiv+fsqrt.

What I did was to enable all the tests requiring precision checks.
I modified the tests to check for the necessary precision. 3Dnow has sometimes 14 or 15 bits of precision requirements while others have 1.5*2^-12 relative error requirement. I also added checks for +inf and -inf results and for negative reciprocal square roots that are allowed on 3dnow instructions.

If you prefer can I split the PR into several small PRs.

Sonicadvance1 · 2025-02-18T21:38:00Z

The most simple documentation comes from the FEAT_RPRES text.

FEAT_RPRES allows an increase in the precision of the single-precision floating-point reciprocal
estimate and reciprocal square root estimate from an 8-bit mantissa to a 12-bit mantissa.
<...>
For more information, see RecipEstimate() and RecipSqrtEstimate().

For RecipEstimate

Similar for RecipSqrtEstimate but annoyingly split between two pages.

I think a thing to note is that subnormal/inf behaviour will end up resulting in different behaviour between an fdiv implementation and a reciprocal implementation and an easy way to check that (since this is only a 32-bit search space) is to just brute force all the values.

#include <cstdio>
#include <cstdint>
#include <xmmintrin.h>
#include <emmintrin.h>

constexpr uint32_t MASK_12BIT = 0x007ff800U;
constexpr uint32_t ONE_ULP_12BIT = 0x00000800U;

constexpr uint32_t EXP_MASK = 0x7f800000;
constexpr uint32_t INF = 0x7f800000;
constexpr uint32_t SUBNORMAL = 0;

[[maybe_unused]]
static uint32_t recip(uint32_t in) {
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_rcp_ss(val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_div(uint32_t in) {
  __m128 one = _mm_cvtsi32_si128(0x3f800000);
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_div_ss(one, val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_sqrt(uint32_t in) {
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_rsqrt_ss(val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_sqrt_div(uint32_t in) {
  __m128 one = _mm_cvtsi32_si128(0x3f800000);
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_sqrt_ss(val);
  val = _mm_div_ss(one, val);
  return _mm_cvtsi128_si32(val);
}

#define RECIP
#ifdef RECIP
auto Test1 = recip;
auto Test2 = recip_div;
#else
auto Test1 = recip_sqrt;
auto Test2 = recip_sqrt_div;
#endif

int main(int argc, char **argv) {
  size_t inf_results{};
  size_t subnormal_results{};
  for (uint64_t val = 0; val < 0x1'0000'0000ULL; ++val) {
    auto recip_val = Test1(val);
    auto div_val = Test2(val);
    auto recip_val_masked = recip_val & MASK_12BIT;
    auto div_val_masked = div_val & MASK_12BIT;

    // Ignore results that generate INF and Subnormals
    if ((recip_val & EXP_MASK) == INF) {
      ++inf_results;
      continue;
    }
    if ((recip_val & EXP_MASK) == SUBNORMAL) {
      ++subnormal_results;
      continue;
    }

    if (!((recip_val_masked + ONE_ULP_12BIT) >= div_val_masked ||
          (recip_val_masked - ONE_ULP_12BIT) <= div_val_masked)) {
      fprintf(stderr, "OUTSIDE OF RANGE: 0x%08x turned in to 0x%08x and 0x%08x\n", (uint32_t)val, recip_val, div_val);
    }
  }
  fprintf(stderr, "Ignored %zd inf results, and %zd subnormals\n", inf_results, subnormal_results);
  return 0;
}

Comparing Zen 3 versus Oryon.
For the reciprocal test, Oryon produced the same number of infs and subnormals.

Ignored 33554430 inf results, and 33554434 subnormals

For the reciprocal sqrt, Same story

Ignored 2164260863 inf results, and 1 subnormals

… not RPRES support

pmatos · 2025-02-20T17:53:46Z

The most simple documentation comes from the FEAT_RPRES text.

FEAT_RPRES allows an increase in the precision of the single-precision floating-point reciprocal
estimate and reciprocal square root estimate from an 8-bit mantissa to a 12-bit mantissa.
<...>
For more information, see RecipEstimate() and RecipSqrtEstimate().

For RecipEstimate

Similar for RecipSqrtEstimate but annoyingly split between two pages.

I think a thing to note is that subnormal/inf behaviour will end up resulting in different behaviour between an fdiv implementation and a reciprocal implementation and an easy way to check that (since this is only a 32-bit search space) is to just brute force all the values.

#include <cstdio>
#include <cstdint>
#include <xmmintrin.h>
#include <emmintrin.h>

constexpr uint32_t MASK_12BIT = 0x007ff800U;
constexpr uint32_t ONE_ULP_12BIT = 0x00000800U;

constexpr uint32_t EXP_MASK = 0x7f800000;
constexpr uint32_t INF = 0x7f800000;
constexpr uint32_t SUBNORMAL = 0;

[[maybe_unused]]
static uint32_t recip(uint32_t in) {
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_rcp_ss(val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_div(uint32_t in) {
  __m128 one = _mm_cvtsi32_si128(0x3f800000);
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_div_ss(one, val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_sqrt(uint32_t in) {
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_rsqrt_ss(val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_sqrt_div(uint32_t in) {
  __m128 one = _mm_cvtsi32_si128(0x3f800000);
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_sqrt_ss(val);
  val = _mm_div_ss(one, val);
  return _mm_cvtsi128_si32(val);
}

#define RECIP
#ifdef RECIP
auto Test1 = recip;
auto Test2 = recip_div;
#else
auto Test1 = recip_sqrt;
auto Test2 = recip_sqrt_div;
#endif

int main(int argc, char **argv) {
  size_t inf_results{};
  size_t subnormal_results{};
  for (uint64_t val = 0; val < 0x1'0000'0000ULL; ++val) {
    auto recip_val = Test1(val);
    auto div_val = Test2(val);
    auto recip_val_masked = recip_val & MASK_12BIT;
    auto div_val_masked = div_val & MASK_12BIT;

    // Ignore results that generate INF and Subnormals
    if ((recip_val & EXP_MASK) == INF) {
      ++inf_results;
      continue;
    }
    if ((recip_val & EXP_MASK) == SUBNORMAL) {
      ++subnormal_results;
      continue;
    }

    if (!((recip_val_masked + ONE_ULP_12BIT) >= div_val_masked ||
          (recip_val_masked - ONE_ULP_12BIT) <= div_val_masked)) {
      fprintf(stderr, "OUTSIDE OF RANGE: 0x%08x turned in to 0x%08x and 0x%08x\n", (uint32_t)val, recip_val, div_val);
    }
  }
  fprintf(stderr, "Ignored %zd inf results, and %zd subnormals\n", inf_results, subnormal_results);
  return 0;
}

Comparing Zen 3 versus Oryon. For the reciprocal test, Oryon produced the same number of infs and subnormals.

Ignored 33554430 inf results, and 33554434 subnormals

For the reciprocal sqrt, Same story

Ignored 2164260863 inf results, and 1 subnormals

I must say I am not sure what to make of the message. I think I understand what you're saying, while at the same time it doesn't invalidate the fact that the reciprocal estimate on an AFP/RPRES system (frecpe or fsqrtpe) does not seem to be within the error margin required by the intel sse/avx instructions of 1.5*2^-12 relative error against the real reciprocal.

At the same time, the only values I am checking are related to reciprocal of 0.0, -0.0, and reciprocal sqrt of 0.0, -0.0 and negative numbers (which should be inf, -inf and -nan respectively) - ie. not dealing or looking at subnormals.

pmatos added 2 commits February 17, 2025 18:50

Improve reciprocal estimate and tests

1809296

Reciprocal estimations did not have enough accuracy. Tests were enabled to check for accurate values of reciprocals.

instcountci: Improve reciprocal estimate and tests

df9bbc3

pmatos force-pushed the ReciprocalsFix branch from c651402 to df9bbc3 Compare February 17, 2025 17:51

pmatos added 3 commits February 20, 2025 10:22

Fix VFRSqrtScalarInsert incorrectly zeroing out top bits with AFP but…

f63c449

… not RPRES support

wip

95df42c

wip

7c310fc

pmatos added 6 commits February 20, 2025 18:56

instcountci update

80f9778

wip

c277a63

instcountci update

a1f90ef

wip

24171b5

wip

83bec98

wip

e84b32c

pmatos marked this pull request as draft February 21, 2025 10:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve reciprocal estimate and tests #4363

Improve reciprocal estimate and tests #4363

pmatos commented Feb 17, 2025

Sonicadvance1 commented Feb 17, 2025

pmatos commented Feb 18, 2025

Sonicadvance1 commented Feb 18, 2025

pmatos commented Feb 20, 2025

Improve reciprocal estimate and tests #4363

Are you sure you want to change the base?

Improve reciprocal estimate and tests #4363

Conversation

pmatos commented Feb 17, 2025

Sonicadvance1 commented Feb 17, 2025

pmatos commented Feb 18, 2025

Sonicadvance1 commented Feb 18, 2025

pmatos commented Feb 20, 2025