```c
void poly_tobytes(uint8_t r[KYBER_POLYBYTES], const poly *a)
{
    unsigned int i;
    uint16_t t0, t1;

    for (i = 0; i < KYBER_N / 2; i++)
    {
        // map to positive standard representatives
        // REF-CHANGE: Hoist signed-to-unsigned conversion into separate function
        t0 = scalar_signed_to_unsigned_q_16(a->coeffs[2 * i]);
        t1 = scalar_signed_to_unsigned_q_16(a->coeffs[2 * i + 1]);
        r[3 * i + 0] = (t0 >> 0);
        r[3 * i + 1] = (t0 >> 8) | (t1 << 4);
        r[3 * i + 2] = (t1 >> 4);
    }
}
```
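For context, `scalar_signed_to_unsigned_q_16()` is, schematically, a constant-time conditional addition of `KYBER_Q` built on a `cmov_int16` primitive. A minimal sketch follows; the real definitions live elsewhere in this repository, so the signatures and bodies here are assumptions for illustration only:

```c
#include <stdint.h>

#define KYBER_Q 3329 /* as in the Kyber reference */

/* Sketch only: constant-time conditional move, meant to keep the
 * compiler from turning the selection into a data-dependent branch.
 * The actual cmov_int16 in this repository may differ. */
static void cmov_int16(int16_t *r, int16_t v, uint16_t b)
{
    b = -b;               /* b != 0 ? 0xFFFF : 0x0000 */
    *r ^= b & ((*r) ^ v); /* select v when the mask is all-ones */
}

/* Sketch only: map a coefficient in (-KYBER_Q, KYBER_Q) to its
 * representative in [0, KYBER_Q) without a visible branch. */
static uint16_t scalar_signed_to_unsigned_q_16(int16_t c)
{
    cmov_int16(&c, c + KYBER_Q, (uint16_t)(c < 0));
    return (uint16_t)c;
}
```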
Our adaptation of `poly_tobytes()` is very slow due to the calls to `scalar_signed_to_unsigned_q_16()` and, within there, `cmov_int16`. This accounts for a ~20% performance loss on MLKEM-512 key generation on an Apple M1.

When using the reference implementation instead, the Apple compiler does an amazing job vectorizing the code.

We don't want the compiler to introduce a branch for `((int16_t)t1 >> 15) & KYBER_Q;`, but in practice this does not seem to happen.
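For comparison, this is (reconstructed from the upstream Kyber reference code, so treat exact formatting as approximate) how `poly_tobytes` performs the mapping inline via the branchless mask-and-add quoted above; this is the form the Apple compiler vectorizes well:

```c
void poly_tobytes(uint8_t r[KYBER_POLYBYTES], const poly *a)
{
    unsigned int i;
    uint16_t t0, t1;

    for (i = 0; i < KYBER_N / 2; i++)
    {
        // map to positive standard representatives:
        // ((int16_t)t >> 15) is all-ones exactly when t is negative,
        // so KYBER_Q is added only to negative coefficients
        t0  = a->coeffs[2 * i];
        t0 += ((int16_t)t0 >> 15) & KYBER_Q;
        t1  = a->coeffs[2 * i + 1];
        t1 += ((int16_t)t1 >> 15) & KYBER_Q;

        r[3 * i + 0] = (t0 >> 0);
        r[3 * i + 1] = (t0 >> 8) | (t1 << 4);
        r[3 * i + 2] = (t1 >> 4);
    }
}
```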