Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

poly_tobytes() is too slow #152

Closed
hanno-becker opened this issue Sep 22, 2024 · 1 comment · Fixed by #146
Closed

poly_tobytes() is too slow #152

hanno-becker opened this issue Sep 22, 2024 · 1 comment · Fixed by #146
Labels
enhancement New feature or request

Comments

@hanno-becker
Copy link
Contributor

hanno-becker commented Sep 22, 2024

Our adaptation

void poly_tobytes(uint8_t r[KYBER_POLYBYTES], const poly *a)
{
    unsigned int i;
    uint16_t t0, t1;

    for (i = 0; i < KYBER_N / 2; i++)
    {
        // map to positive standard representatives
        // REF-CHANGE: Hoist signed-to-unsigned conversion into separate function
        t0 = scalar_signed_to_unsigned_q_16(a->coeffs[2 * i]);
        t1 = scalar_signed_to_unsigned_q_16(a->coeffs[2 * i + 1]);
        r[3 * i + 0] = (t0 >> 0);
        r[3 * i + 1] = (t0 >> 8) | (t1 << 4);
        r[3 * i + 2] = (t1 >> 4);
    }
}

of poly_frombytes() is very slow due to the calls to scalar_signed_to_unsigned_q_16(), and -- within there -- cmov_int16. This accounts for ~20% performance loss on a MLKEM-512 key generation on Apple M1.

When using the reference implementation

void poly_tobytes(uint8_t r[KYBER_POLYBYTES], const poly *ap)
{
  unsigned int i;
  uint16_t t0, t1;
  const int16_t *a = (const int16_t*) ap;

  for(i=0;i<KYBER_N/2;i++) {
    // map to positive standard representatives
    t0  = a[2*i];
    t0 += ((int16_t)t0 >> 15) & KYBER_Q;
    t1 = a[2*i+1];
    t1 += ((int16_t)t1 >> 15) & KYBER_Q;
    r[3*i+0] = (t0 >> 0);
    r[3*i+1] = (t0 >> 8) | (t1 << 4);
    r[3*i+2] = (t1 >> 4);
  }
}

in turn, the Apple compiler does an amazing job vectorizing the code.

We don't want the compiler to introduce a branch for ((int16_t)t1 >> 15) & KYBER_Q;, but in practice, this does not seem to happen.

@hanno-becker hanno-becker added the enhancement New feature or request label Sep 22, 2024
@hanno-becker
Copy link
Contributor Author

Resolved by #146

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant