-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scalar field? #1
Comments
I have a complete Ed25519 + ECDSA implementation that I have not had time to release yet :/ I suggest however to use Stein's algorithm on May I ask for which target platform you make a Rust implementation? Is it Cortex-M4 or more meant for Cortex-A processors? |
By this you mean 99% assembly? :)
Thanks for the rapid answer! I've seen this approach motivated with "masking scalar", assumed the point is to get around non-constant time nature; having a name for the approach is very helpful.
Target is Cortex-M4/M33 (specifically NXP LPC55S69), use case the rewrite of SoloKeys (FIDO2/PIV) in Rust. The github.com/RustCrypto organization covers symmetric crypto/hashes quite well, but for asymmetric crypto I've found a dearth of usable+understandable implementations. By this I mean that the high-level structure should be easy to understand mathematically and follow in code, while the actual field arithmetic should be super fast and use UMAAL. Previously I put together github.com/ycrypto/salty which merges Haase's fe25519 arithmetic (for his CPace/AuCPace PAKEs) with basically TweetNaCl, I intend to do the same with github.com/ycrypto/nisty (which currently just wraps micro-ecc, and for which on a high level there's github.com/RustCrypto/elliptic-curves/tree/master/p256 as prior art). I'm actually not sure if your or Haase's fe25519 (code: https://github.com/BjoernMHaase/fe25519/blob/master/STM32F407/crypto/asm/cortex_m4_mpy_fe25519.S, approach: https://eprint.iacr.org/2018/286) is faster hehe. If I may ask, did you craft the assembly entirely by hand, or use some assembly-generating framework based on general principles? The hole in my understandability argument above is following the assembly itself, which I hope to improve with "helpful explanations", but some kind of optimality argument based on principled building blocks would be even nicer... |
All assembler code has been carefully been written by hand, to get an optimal register allocation etc. Haase's work is very similar, but has slightly more overhead. The multiplication routine is just a standard "schoolbook multiplication" implementation, but ordered in a way to reduce the memory overhead (since the full data does not fit in registers). The reduction part at the end of the P-256 field multiplication is just a carefully unrolled Montgomery reduction, again to optimize register allocation. I've added a new repo here: https://github.com/Emill/P256-Cortex-M4 which adds ECDSA support and hence implements arithmetics for the group order. I've chosen to use a variant of Algorithm 12.1 from https://gcd.cr.yp.to/safegcd-20190413.pdf for the group inversion. This allows to avoid forcing the user to generate another 256-bit random value, but still gives more or less the same performance for the inversion itself as the method I described earlier. The repo is not 99% asm, but all relevant low level code is, where we need the performance. The high level parts are written in C. |
Very interesting, thank you for sharing! The choice of license is also rather appreciated :) |
Hi Emil,
first of all thanks a lot for releasing these nice UMAAL-based implementations of the base field of P256 under a permissive license! I'm building a Rust implementation of P256 ECDH/ECDSA around them, using your speed optimized
P256_{add,sub,mul,sqr}mod
routines as computational core.I now find myself in the strange situation where the entire ephemeral (public) point calculation in ECDSA is twice as fast as inverting the ephemeral scalar
k
with Euler's theorem and my own Barrett reduction-based Rust implementation 🤦♂️. Did you ever put thought into speeding up the scalar field, or know of an existing UMAAL-based implementation for it? I think any "N256_mulmod
" assembly routine would give a major speed bump, even if not completely optimized - my lack of assembly skills are currently preventing me from adapting yourP256_mulmod
ton = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551
.The text was updated successfully, but these errors were encountered: