Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve reciprocal estimate and tests #4363

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

pmatos
Copy link
Collaborator

@pmatos pmatos commented Feb 17, 2025

Reciprocal estimations did not have enough accuracy. Tests were enabled to check for accurate values of reciprocals.

Reciprocal estimations did not have enough accuracy. Tests were enabled
to check for accurate values of reciprocals.
@Sonicadvance1
Copy link
Member

I need some more documentation for what this is trying to solve. As far as I'm aware the FEAT_RPRES extension gives us the exact amount of precision required for SSE reciprocals, but doesn't match 3DNow! reciprocals (12-bit versus 14-bit).

The non-RPRES implementation using FDIV and giving more precision than required just being an artifact and is allowed by spec, but can't use reciprocal estimate alone their because ARM's non-RPRES path is only 8-bit and needs a single step to improve precision. So I had just chose fdiv instead of rcpe+rcps there since fdiv alone is technically faster on hardware we cared about.

@pmatos
Copy link
Collaborator Author

pmatos commented Feb 18, 2025

I need some more documentation for what this is trying to solve. As far as I'm aware the FEAT_RPRES extension gives us the exact amount of precision required for SSE reciprocals, but doesn't match 3DNow! reciprocals (12-bit versus 14-bit).

This doesn't seem to be the case. I cannot find ARM docs on the exact precision provided by frsqrte, but all my measurements point out that frsqrte(x) is less precise than 1/sqrt(x) by more than the intel error margin for SSE of 1.5*2^-12. Tested it with https://gist.github.com/pmatos/c86455f1b4139186602a461e3b071275

Similarly for frecpe.

The non-RPRES implementation using FDIV and giving more precision than required just being an artifact and is allowed by spec, but can't use reciprocal estimate alone their because ARM's non-RPRES path is only 8-bit and needs a single step to improve precision. So I had just chose fdiv instead of rcpe+rcps there since fdiv alone is technically faster on hardware we cared about.

Running my test on 100M tests results in the following (on a Lenovo T14s):

  • 76.8% of frsqrte results are outside the 1.5*10^-12 error margin.
  • 0% of frsqrte+frsqrts results are outside the error margin.
  • frsqrte+frsqrts seems to still be faster than fdiv+fsqrt.

What I did was to enable all the tests requiring precision checks.
I modified the tests to check for the necessary precision. 3Dnow has sometimes 14 or 15 bits of precision requirements while others have 1.5*2^-12 relative error requirement. I also added checks for +inf and -inf results and for negative reciprocal square roots that are allowed on 3dnow instructions.

If you prefer can I split the PR into several small PRs.

@Sonicadvance1
Copy link
Member

The most simple documentation comes from the FEAT_RPRES text.

FEAT_RPRES allows an increase in the precision of the single-precision floating-point reciprocal
estimate and reciprocal square root estimate from an 8-bit mantissa to a 12-bit mantissa.
<...>
For more information, see RecipEstimate() and RecipSqrtEstimate().

For RecipEstimate
Image_2025-02-18_12-07-08

Similar for RecipSqrtEstimate but annoyingly split between two pages.

I think a thing to note is that subnormal/inf behaviour will end up resulting in different behaviour between an fdiv implementation and a reciprocal implementation and an easy way to check that (since this is only a 32-bit search space) is to just brute force all the values.

#include <cstdio>
#include <cstdint>
#include <xmmintrin.h>
#include <emmintrin.h>

constexpr uint32_t MASK_12BIT = 0x007ff800U;
constexpr uint32_t ONE_ULP_12BIT = 0x00000800U;

constexpr uint32_t EXP_MASK = 0x7f800000;
constexpr uint32_t INF = 0x7f800000;
constexpr uint32_t SUBNORMAL = 0;

[[maybe_unused]]
static uint32_t recip(uint32_t in) {
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_rcp_ss(val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_div(uint32_t in) {
  __m128 one = _mm_cvtsi32_si128(0x3f800000);
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_div_ss(one, val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_sqrt(uint32_t in) {
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_rsqrt_ss(val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_sqrt_div(uint32_t in) {
  __m128 one = _mm_cvtsi32_si128(0x3f800000);
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_sqrt_ss(val);
  val = _mm_div_ss(one, val);
  return _mm_cvtsi128_si32(val);
}

#define RECIP
#ifdef RECIP
auto Test1 = recip;
auto Test2 = recip_div;
#else
auto Test1 = recip_sqrt;
auto Test2 = recip_sqrt_div;
#endif

int main(int argc, char **argv) {
  size_t inf_results{};
  size_t subnormal_results{};
  for (uint64_t val = 0; val < 0x1'0000'0000ULL; ++val) {
    auto recip_val = Test1(val);
    auto div_val = Test2(val);
    auto recip_val_masked = recip_val & MASK_12BIT;
    auto div_val_masked = div_val & MASK_12BIT;

    // Ignore results that generate INF and Subnormals
    if ((recip_val & EXP_MASK) == INF) {
      ++inf_results;
      continue;
    }
    if ((recip_val & EXP_MASK) == SUBNORMAL) {
      ++subnormal_results;
      continue;
    }

    if (!((recip_val_masked + ONE_ULP_12BIT) >= div_val_masked ||
          (recip_val_masked - ONE_ULP_12BIT) <= div_val_masked)) {
      fprintf(stderr, "OUTSIDE OF RANGE: 0x%08x turned in to 0x%08x and 0x%08x\n", (uint32_t)val, recip_val, div_val);
    }
  }
  fprintf(stderr, "Ignored %zd inf results, and %zd subnormals\n", inf_results, subnormal_results);
  return 0;
}

Comparing Zen 3 versus Oryon.
For the reciprocal test, Oryon produced the same number of infs and subnormals.

Ignored 33554430 inf results, and 33554434 subnormals

For the reciprocal sqrt, Same story

Ignored 2164260863 inf results, and 1 subnormals

@pmatos
Copy link
Collaborator Author

pmatos commented Feb 20, 2025

The most simple documentation comes from the FEAT_RPRES text.

FEAT_RPRES allows an increase in the precision of the single-precision floating-point reciprocal
estimate and reciprocal square root estimate from an 8-bit mantissa to a 12-bit mantissa.
<...>
For more information, see RecipEstimate() and RecipSqrtEstimate().

For RecipEstimate Image_2025-02-18_12-07-08

Similar for RecipSqrtEstimate but annoyingly split between two pages.

I think a thing to note is that subnormal/inf behaviour will end up resulting in different behaviour between an fdiv implementation and a reciprocal implementation and an easy way to check that (since this is only a 32-bit search space) is to just brute force all the values.

#include <cstdio>
#include <cstdint>
#include <xmmintrin.h>
#include <emmintrin.h>

constexpr uint32_t MASK_12BIT = 0x007ff800U;
constexpr uint32_t ONE_ULP_12BIT = 0x00000800U;

constexpr uint32_t EXP_MASK = 0x7f800000;
constexpr uint32_t INF = 0x7f800000;
constexpr uint32_t SUBNORMAL = 0;

[[maybe_unused]]
static uint32_t recip(uint32_t in) {
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_rcp_ss(val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_div(uint32_t in) {
  __m128 one = _mm_cvtsi32_si128(0x3f800000);
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_div_ss(one, val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_sqrt(uint32_t in) {
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_rsqrt_ss(val);
  return _mm_cvtsi128_si32(val);
}

[[maybe_unused]]
static uint32_t recip_sqrt_div(uint32_t in) {
  __m128 one = _mm_cvtsi32_si128(0x3f800000);
  __m128 val = _mm_cvtsi32_si128(in);
  val = _mm_sqrt_ss(val);
  val = _mm_div_ss(one, val);
  return _mm_cvtsi128_si32(val);
}

#define RECIP
#ifdef RECIP
auto Test1 = recip;
auto Test2 = recip_div;
#else
auto Test1 = recip_sqrt;
auto Test2 = recip_sqrt_div;
#endif

int main(int argc, char **argv) {
  size_t inf_results{};
  size_t subnormal_results{};
  for (uint64_t val = 0; val < 0x1'0000'0000ULL; ++val) {
    auto recip_val = Test1(val);
    auto div_val = Test2(val);
    auto recip_val_masked = recip_val & MASK_12BIT;
    auto div_val_masked = div_val & MASK_12BIT;

    // Ignore results that generate INF and Subnormals
    if ((recip_val & EXP_MASK) == INF) {
      ++inf_results;
      continue;
    }
    if ((recip_val & EXP_MASK) == SUBNORMAL) {
      ++subnormal_results;
      continue;
    }

    if (!((recip_val_masked + ONE_ULP_12BIT) >= div_val_masked ||
          (recip_val_masked - ONE_ULP_12BIT) <= div_val_masked)) {
      fprintf(stderr, "OUTSIDE OF RANGE: 0x%08x turned in to 0x%08x and 0x%08x\n", (uint32_t)val, recip_val, div_val);
    }
  }
  fprintf(stderr, "Ignored %zd inf results, and %zd subnormals\n", inf_results, subnormal_results);
  return 0;
}

Comparing Zen 3 versus Oryon. For the reciprocal test, Oryon produced the same number of infs and subnormals.

Ignored 33554430 inf results, and 33554434 subnormals

For the reciprocal sqrt, Same story

Ignored 2164260863 inf results, and 1 subnormals

I must say I am not sure what to make of the message. I think I understand what you're saying, while at the same time it doesn't invalidate the fact that the reciprocal estimate on an AFP/RPRES system (frecpe or fsqrtpe) does not seem to be within the error margin required by the intel sse/avx instructions of 1.5*2^-12 relative error against the real reciprocal.

At the same time, the only values I am checking are related to reciprocal of 0.0, -0.0, and reciprocal sqrt of 0.0, -0.0 and negative numbers (which should be inf, -inf and -nan respectively) - ie. not dealing or looking at subnormals.

@pmatos pmatos marked this pull request as draft February 21, 2025 10:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants