Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help out for batched random key generation? #705

Open
stevefan1999-personal opened this issue Sep 23, 2024 · 9 comments
Open

Help out for batched random key generation? #705

stevefan1999-personal opened this issue Sep 23, 2024 · 9 comments

Comments

@stevefan1999-personal
Copy link

stevefan1999-personal commented Sep 23, 2024

I'm trying to generate vanity wallet addresses, which are originally in ed25519, so I start by generating random bytes from a PRNG enough to cast into a secret key and then generate key pairs, but I ran into performance issue so I want to debug it:

for _ in 0..cli.threads {
    let thr = {
        let counter = counter.clone();
        move || {
            let mut csprng = fastrand::Rng::new();
            let mut pair = SecretKey::default();

            loop {
                let key: SigningKey = SigningKey::from_bytes(&pair);
                counter.fetch_add(1, Ordering::Relaxed);
            }
        }
    };

    thread::spawn(thr);
}

On my 3700X PC, and running in full 16 threads, it only runs 300K keypairs/second with a precise monotonic clock. However, the C vanity address generator on my same PC, with same parameters can generate 30M keypairs/second with random data, so I think there could be something wrong here

So I started digging, which means we start from here:

pub fn from_bytes(secret_key: &SecretKey) -> Self {
let verifying_key = VerifyingKey::from(&ExpandedSecretKey::from(secret_key));
Self {
secret_key: *secret_key,
verifying_key,
}
}

And then here:

impl From<&SecretKey> for ExpandedSecretKey {
#[allow(clippy::unwrap_used)]
fn from(secret_key: &SecretKey) -> ExpandedSecretKey {
let hash = Sha512::default().chain_update(secret_key).finalize();
ExpandedSecretKey::from_bytes(hash.as_ref())
}
}

And then here:

impl ExpandedSecretKey {
/// Construct an `ExpandedSecretKey` from an array of 64 bytes. In the spec, the bytes are the
/// output of a SHA-512 hash. This clamps the first 32 bytes and uses it as a scalar, and uses
/// the second 32 bytes as a domain separator for hashing.
pub fn from_bytes(bytes: &[u8; 64]) -> Self {
// TODO: Use bytes.split_array_ref once it’s in MSRV.
let mut scalar_bytes: [u8; 32] = [0u8; 32];
let mut hash_prefix: [u8; 32] = [0u8; 32];
scalar_bytes.copy_from_slice(&bytes[00..32]);
hash_prefix.copy_from_slice(&bytes[32..64]);
// For signing, we'll need the integer, clamped, and converted to a Scalar. See
// PureEdDSA.keygen in RFC 8032 Appendix A.
let scalar = Scalar::from_bytes_mod_order(clamp_integer(scalar_bytes));
ExpandedSecretKey {
scalar,
hash_prefix,
}
}

And then here:

pub fn from_bytes_mod_order(bytes: [u8; 32]) -> Scalar {
// Temporarily allow s_unreduced.bytes > 2^255 ...
let s_unreduced = Scalar { bytes };
// Then reduce mod the group order and return the reduced representative.
let s = s_unreduced.reduce();
debug_assert_eq!(0u8, s[31] >> 7);
s
}

And eventually here:

let xR = UnpackedScalar::mul_internal(&x, &constants::R);
let x_mod_l = UnpackedScalar::montgomery_reduce(&xR);

I noticed that despite I'm using AVX512, the flame graph shows that those UnpackedScalar::mul_internal and UnpackedScalar::montgomery_reduce are still using generic serial implementation and took a large group of execution time, I'm not sure if this can be sped-up and improved? Or maybe I should look for something else for address generation?

I noticed that the wallet vanity address generator in C uses SUPERCOP's version, and their ed25519 key generation is much faster as well.

@stevefan1999-personal
Copy link
Author

I made a minimally viable benchmark program here in https://github.com/stevefan1999-personal/ed25519-gen-speed-test. I used the fastest random number generator possible to get rid of its effect.

On my 3700X PC, this generates:

7 keys per second on average
268020 keys per second on average
284770 keys per second on average
264184 keys per second on average
252750 keys per second on average
245968 keys per second on average
241423 keys per second on average
237701 keys per second on average
234541 keys per second on average
232072 keys per second on average
229985 keys per second on average
228280 keys per second on average

@stevefan1999-personal
Copy link
Author

stevefan1999-personal commented Sep 23, 2024

I reran the key generation benchmark with "fast" featured enabled back on for ed25519-dalek, and performance improved 3 times, and this is the hot path tree diagram that I got
image

@tarcieri
Copy link
Contributor

@stevefan1999-personal
Copy link
Author

Take a look at the MultiscalarMul trait: https://docs.rs/curve25519-dalek/4.1.3/curve25519_dalek/edwards/struct.EdwardsPoint.html#impl-MultiscalarMul-for-EdwardsPoint

May I get some papers or lecture notes on how to use that?

@tarcieri
Copy link
Contributor

It takes an iterator over Scalars and an iterator over EdwardsPoints and batch multiplies the scalars by the points, so you can generate a batch of candidate scalars and multiply several of them by the base point at once

@stevefan1999-personal
Copy link
Author

It takes an iterator over Scalars and an iterator over EdwardsPoints and batch multiplies the scalars by the points, so you can generate a batch of candidate scalars and multiply several of them by the base point at once

If I read that correctly, I can use multiscalar_mul to do something that means scalar * constants::ED25519_BASEPOINT_TABLE but similar to a dot product operation instead, right?

@stevefan1999-personal
Copy link
Author

Okay, so I came up with this:

    let mut rng = Rng::new();

    let scalars: Vec<ExpandedSecretKey> = (0..16)
        .map(|_| {
            let mut secret_key: SecretKey = SecretKey::default();
            rng.fill(&mut secret_key);
            ExpandedSecretKey::from(&secret_key)
        })
        .collect();

    EdwardsPoint::multiscalar_mul(
        scalars.iter().map(|x| x.scalar),
        scalars.iter().map(|_| constants::ED25519_BASEPOINT_POINT),
    );

However, it returns an Edwards point, so how do I separate the multiplied (scalar * ed25519 base point) to their own respective points?

@tarcieri
Copy link
Contributor

Aah sorry, that's the wrong API shape for your problem

@stevefan1999-personal
Copy link
Author

stevefan1999-personal commented Sep 23, 2024

Aah sorry, that's the wrong API shape for your problem

Ahh I think it is close. I just need to generate N scalars from entropy, and get the [for some scalar in scalars (scalar * ed25519 base point)] without having it dot product summed, but do so in an optimized, data-parallel way. This way I don't have to blow up the CPU cache while doing SIMD in lockstep.

Also, I tried to simplify the public key generation, is this correct?

                    let mut secret_key = SecretKey::default();
                    csprng.fill_bytes(&mut secret_key);
                    let pub_key =
                        EdwardsPoint::mul_base(&Scalar::from_bytes_mod_order(clamp_integer(
                            Sha512::default().chain_update(secret_key).finalize()[..32]
                                .try_into()
                                .unwrap(),
                        )))
                        .compress()
                        .to_bytes();

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants