```c
void poly_tobytes(uint8_t r[KYBER_POLYBYTES], const poly *a)
{
    unsigned int i;
    uint16_t t0, t1;

    for (i = 0; i < KYBER_N / 2; i++)
    {
        // map to positive standard representatives
        // REF-CHANGE: Hoist signed-to-unsigned conversion into separate function
        t0 = scalar_signed_to_unsigned_q_16(a->coeffs[2 * i]);
        t1 = scalar_signed_to_unsigned_q_16(a->coeffs[2 * i + 1]);
        r[3 * i + 0] = (t0 >> 0);
        r[3 * i + 1] = (t0 >> 8) | (t1 << 4);
        r[3 * i + 2] = (t1 >> 4);
    }
}
```
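For context, `scalar_signed_to_unsigned_q_16()` is, schematically, a constant-time conditional addition of `KYBER_Q` built on a `cmov_int16` primitive. A minimal sketch follows; the real definitions live elsewhere in this repository, so the signatures and bodies here are assumptions for illustration only:

```c
#include <stdint.h>

#define KYBER_Q 3329 /* as in the Kyber reference */

/* Sketch only: constant-time conditional move, meant to keep the
 * compiler from turning the selection into a data-dependent branch.
 * The actual cmov_int16 in this repository may differ. */
static void cmov_int16(int16_t *r, int16_t v, uint16_t b)
{
    b = -b;               /* b != 0 ? 0xFFFF : 0x0000 */
    *r ^= b & ((*r) ^ v); /* select v when the mask is all-ones */
}

/* Sketch only: map a coefficient in (-KYBER_Q, KYBER_Q) to its
 * representative in [0, KYBER_Q) without a visible branch. */
static uint16_t scalar_signed_to_unsigned_q_16(int16_t c)
{
    cmov_int16(&c, c + KYBER_Q, (uint16_t)(c < 0));
    return (uint16_t)c;
}
```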
Our adaptation of `poly_tobytes()` is very slow due to the calls to `scalar_signed_to_unsigned_q_16()` and, within there, `cmov_int16`. This accounts for a ~20% performance loss on MLKEM-512 key generation on an Apple M1.

When using the reference implementation instead, the Apple compiler does an amazing job vectorizing the code.

We don't want the compiler to introduce a branch for `((int16_t)t1 >> 15) & KYBER_Q;`, but in practice this does not seem to happen.
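For comparison, this is (reconstructed from the upstream Kyber reference code, so treat exact formatting as approximate) how `poly_tobytes` performs the mapping inline via the branchless mask-and-add quoted above; this is the form the Apple compiler vectorizes well:

```c
void poly_tobytes(uint8_t r[KYBER_POLYBYTES], const poly *a)
{
    unsigned int i;
    uint16_t t0, t1;

    for (i = 0; i < KYBER_N / 2; i++)
    {
        // map to positive standard representatives:
        // ((int16_t)t >> 15) is all-ones exactly when t is negative,
        // so KYBER_Q is added only to negative coefficients
        t0  = a->coeffs[2 * i];
        t0 += ((int16_t)t0 >> 15) & KYBER_Q;
        t1  = a->coeffs[2 * i + 1];
        t1 += ((int16_t)t1 >> 15) & KYBER_Q;

        r[3 * i + 0] = (t0 >> 0);
        r[3 * i + 1] = (t0 >> 8) | (t1 << 4);
        r[3 * i + 2] = (t1 >> 4);
    }
}
```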