Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arithmetizing Keccak-p permutation #1559

Open
Al-Kindi-0 opened this issue Oct 31, 2024 · 8 comments
Open

Arithmetizing Keccak-p permutation #1559

Al-Kindi-0 opened this issue Oct 31, 2024 · 8 comments

Comments

@Al-Kindi-0
Copy link
Collaborator

We want to arithmetize (using an AIR) the Keccak-p permutation. Keccak-p is parametrized using two parameters $b$ and $n_r$ where $n_r$ is the number of rounds and $b$ is the size in bits of the input/output size of the permutation. For us, the choice $(b, n_r) = (1600, 24)$ is the most relevant as this is the choice used in SHA3. The Keccak-p permutation operates on a state $\mathsf{s}$ comprised of $1600$ bits and interpreted as a cube of of size $5 \times 5 \times 64$ and indexed with variables $(x, y, z)$ where the inner-most $64$-bit vector has index $(x, y) = (0, 0)$. If we denote by $\mathsf{a}(x, y, z)$ the value of the $(x, y, z)$-bit of the cube then we can use the following correspondence $$\mathsf{s}[w\cdot(5y + x) + z] = \mathsf{a}[x, y, z]$$
For a fixed $(x,y)$, we call the vector of $64$-bits indexed by $z$ a lane and we denote it by $\mathsf{a}[x, y]$. We overload bit operations to also denote their vectorized versions operating on lanes.

We can now describe the Keccak-p permutation as the repeated application, for $n_r = 24$ rounds, of the following composition of functions $$\iota_r \circ \chi \circ \pi \circ \rho \circ \theta$$
where only $\iota_r$ depends on the round index.

Description

The $\theta$ step function

  1. For $x= 0,\cdots, 4$ & $z= 0,\cdots,63$:
    1. $C[x, z] := \mathsf{a}[x, 0, z] + \mathsf{a}[x, 1, z] + \mathsf{a}[x, 2, z] + \mathsf{a}[x, 3, z] + \mathsf{a}[x, 4, z]$
  2. For $x= 0,\cdots, 4$ & $z= 0,\cdots,63$:
    1. $D[x, z] := C[x - 1, z] + C[x + 1, z - 1]$
  3. For $x= 0,\cdots, 4$, $y= 0,\cdots, 4$, & $z= 0,\cdots,63$:
    1. $\mathsf{a} [x, y, z] = \mathsf{a} [x, y, z] + D[x, z]$

where index arithmetics is done modulo the size of the corresponding index-space and addition of bits is done in the finite binary field of size $2$ i.e., xor.

Note that the above can be written more succinctly using vectorized notation as follows:

  1. For $x= 0,\cdots, 4$:
    1. $C[x] := \mathsf{a}[x, 0] + \mathsf{a}[x, 1] + \mathsf{a}[x, 2] + \mathsf{a}[x, 3] + \mathsf{a}[x, 4]$
  2. For $x= 0,\cdots, 4$:
    1. $D[x] := C[x - 1] + \mathsf{ROT}(C[x + 1], 1)$, where $\mathsf{ROT}(., k)$ is cyclic rotation by $k$ bits
  3. For $x= 0,\cdots, 4$, & $y= 0,\cdots, 4$:
    1. $\mathsf{a} [x, y] = \mathsf{a} [x, y] + D[x]$

Basically, we have two operations on $64$-bit vectors, i.e. lanes, $\mathsf{XOR}$ and $\mathsf{ROT}$.

The $\rho$ step function:

  1. $\mathsf{a}[0, 0] = \mathsf{a}[0, 0]$ i.e., the inner-most lane is fixed by the $\rho$ map.
  2. $(x, y) = (1, 0)$
  3. For $t =0, \cdots 23:$
    1. $\mathsf{a}[x, y] = \mathsf{ROT}(\mathsf{a}[x, y], \frac{(t+1)\cdot(t+2)}{2})$
    2. $(x, y) = (y, (2x + 3y) % 5)$

The $\pi$ step function:

$\mathsf{a}[x, y] = \mathsf{a}[x + 3y, x]$

The $\chi$ step function:

$\mathsf{a}[x, y] = \mathsf{a}[x, y] + ( \neg\mathsf{a}[x+1, y]) \wedge \mathsf{a}[x+2, y]$

The $\iota_r$ step function:

$\mathsf{a}[0, 0] = \mathsf{a}[0, 0] + \mathsf{RC}[r]$

where the round constant $\mathsf{RC}[r]$ for round $r$ is a 64-bit vector.

Arithmetization strategies

Using Look-Up Tables

Since all of the arithmetic operations happen at the level of lanes, and since our field is a 64-bit prime field, we have to either work with the individual bits of each lane or work with limbs of size less than $64$ and use lookup tables (LUT). From a practical point of view, limbs of size $8$ are probably the best choice. We can probably use limbs of size $10$ but then things become more complicated and will require additional LUTs which might end-up hurting performance.
Let's discuss the approach using LUTs. Suppose we choose limb size $8$ and hence we can represent each lane using $8$ limbs and equivalently $8$ field elements.
In this setting:

  1. $\mathsf{AND}, \mathsf{XOR}$ are done limb-wise using LUTs.

  2. $\mathsf{NOT}$ can be done using a simple subtraction from $2^8 - 1$ at the level of each individual limb.

  3. $\mathsf{ROT}(\cdot, k)$ for $k= 1, \cdots, 63$: Note that since we are working with limbs each of size $8$ bits, we can without loss of generality focus on the case when $k= 1, \cdots, 7$ as all other cases are easily reducible with not extra arithmetic cost to this case. Since each lane is $64$-bit we will need to work with two groups each made of 4 limbs so that we avoid overflow. Let $\mathsf{L}$ and $\mathsf{H}$ denote the lower, respectively higher, 4 limbs. Take for example $\mathsf{L}$, then we need to prove that:

    1. There exists a tuple $(a, b, c, d, e)$ such that $\mathsf{L}[0] + \mathsf{L}[1] \cdot 2^8 + \mathsf{L}[2] \cdot 2^{16} + \mathsf{L}[3] \cdot 2^{24} = a + b \cdot 2^k + c \cdot 2^{k + 8} + d \cdot 2^{k + 16} + e \cdot 2^{k + 24}$,
    2. $b, c$ and $d$ are bytes i.e., in the range $0, \cdots, 2^{16} - 1$,
    3. $e || a$ is a byte i.e., in the range $0, \cdots, 2^{16} - 1$,

    A similar thing is done for $\mathsf{H}$ and the result of the rotation is computed by patching the split bits of $\mathsf{L}$ and $\mathsf{H}$.

In total, we will need 1 LUT for $\mathsf{AND}$, 1 LUT for $\mathsf{XOR}$, 7 LUT for all possible byte splittings as used in $\mathsf{ROT}$, and 1 LUT to range check bytes.

We now give an account of the number of cells needed in order to arithmetize one round of the Keccak-p permutation:

  1. We need $5 \times 5 \times 8$ cells to store the initial state. Recall that each field element stores $8$ bits.
  2. In the first step of the $\theta$ step function, we need, for each $x$, to $\mathsf{XOR}$ $5$ terms where each term is a lane. Hence we need $4 \times 8$ intermediate cells for a total of $4 \times 8 \times 5$ intermediate cells.
  3. We need $5 \times 8$ to store the result of $\mathsf{ROT}$ and $5 \times 10$ for the extra witness elements to do the rotation.
  4. We need $5 \times 8$ to store the result of the $\mathsf{XOR}$ operation needed to compute $D$.
  5. We need $5 \times 5 \times 8$ to store the output of the $\theta$ step function.
  6. We need $5 \times 5 \times 8$ to store the result of the rotations and $5 \times 5 \times 10$ for the extra witness elements to do the rotations.
  7. We need $5 \times 5 \times 8$ to do the $\mathsf{AND}$ operations and $5 \times 5 \times 8$ to do the final $\mathsf{XOR}$ for the $\chi$ step function.
  8. We need $8$ to store the result of the $\iota_r$ step function.

The total number of columns is thus $1508$ of which $900$ will be subject to lookups. This means, assuming that the max degree of the AIR is $3$, that we will need $450$ helper columns for LogUp in the auxiliary trace. Hence, assuming a degree $2$ extension, we will need a total of $2408$ columns in the base field without accounting for the multiplicity columns for LogUp.

Working with individual bits

There is another approach described by Angus Gruen which doesn't make use of any LUT and instead works with individual bits at some of the intermediate steps of the permutation function.
More precisely, the state is either considered as a cube of bits of dimensions $5 \times 5\times 64$ and denoted by $A[x, y, z]$, or as a cube of $u32$-s of dimension $5 \times 5\times 2$ and denoted by $\textbf{A}[x, y, j]$. Using this dual view, we can now first describe one round of the permutation and then the constraints to verify it.

  1. The input state is stored as $\textbf{A}[x, y, j]$ for $j = 0, 1$

  2. Checking $\theta$ step function:

    Denote by $A^{'}[x, y, z]$ the output of the $\theta$ step function as an element in $5 \times 5\times 64$. Then, using Lemma 2.1 in the aforementioned work, verifying this step is equivalent to the following:

    1. Define $C^{'}[x, z] := C[x, z] + C[x - 1, z] + C[x + 1, z - 1]$
    2. Enforce $\mathsf{XOR}_3(C[x, z], C[x - 1, z], C[x + 1, z - 1]) = C^{'}[x, z]$ where $\mathsf{XOR}_3(x, y, z) := x + y + z - 2(xy + xz + yz) + 4xyz$
    3. Enforce $\textbf{A}[x, y, j] = \sum_{i=0}^{31}2^i \cdot \mathsf{XOR}_3(A^{'}[x, y, i + 32j], C[x, i + 32j], C^{'}[x, i + 32j])$ for $j = 0, 1$
    4. Letting $D[x, z] := \left(\sum_{y=0}^4 A^{'}[x, y, z] \right) - C^{'}[x, z]$, enforce $D[x, z] \cdot (D[x, z] - 2) \cdot (D[x, z] - 4) = 0$
  3. Checking the $\rho$ step is trivial, and can be absorbed in the the check for the $\chi$ step, as we now have the $A^{'}[x, y, z]$ bit view representation.

  4. Checking the $\pi$ step is also trivial and can be absorbed in the check for the $\chi$ step.

  5. Checking the $\chi$ step function:

    1. Define $B[x, y, z] = A^{'}[3y + x, x, z - r[x, y]]$ and note that $B$ represents the output of the applications of the $\rho$ and $\pi$ steps.
    2. Let $\mathsf{A}^{''}[x, y, j]$ be the output of the $\chi$ step represented as an array of $u32$-s. Verifying the $\chi$ step then boils down to enforcing $\sum_{i=0}^{31} 2^i \mathsf{XORNAND}(B[x, y, i + 32j], B[x + 1, y, i + 32j], B[x + 2, y, i + 32j]) = \mathsf{A}^{''}[x, y, j]$ for $j = 0, 1$, where $\mathsf{XORNAND}(x, y, z) := x + z - yz - 2xz + xyz$.
  6. Checking the $\iota_r$ step:

    1. Let $A_{0, 0}^{''}[k]$ denote the individual bits of $\mathsf{A}^{''}[0, 0, j]$. Then checking the $\iota_r$ is equivalent to checking that:
      1. $\sum_{i=0}^{31} 2^i A_{0, 0}^{''}[i + 32j] = \mathsf{A}^{''}[0, 0, j]$ for $j = 0, 1$
      2. $\sum_{i=0}^{31} 2^i \mathsf{XOR}(A_{0, 0}^{''}[i + 32j], RC[k, i + 32j]) = \mathsf{A}^{'''}[0, 0, j]$ where $\mathsf{A}^{'''}$ is the $u32$ representation of the $(0, 0)$ lane at the end of the current round.

We can now give an account of the number of cells needed to verify one round of the Keccak-p permutation function:

  1. $5 \times 5 \times 2$ cells to store $\textbf{A}[x, y, j]$ as $u32$ elements
  2. $5 \times 64$ cells to store $C[x, z]$
  3. $5 \times 5 \times 64$ cells to store $A^{'}$
  4. $5 \times 64$ cells to store $C^{'}$
  5. $5 \times 5 \times 2$ cells to store $\mathsf{A}^{''}[x, y, j]$
  6. $64$ cells to store the bits of $\mathsf{A}^{''}[0, 0, j]$ for $j = 0, 1$
  7. $2$ cells to store $\mathsf{A}^{'''}[0, 0, j]$ for $j = 0, 1$

The total is 2406 cells (all base field) to verify the correctness of one of the 24 rounds comprising a call to the Keccak-p permutation.

Some thoughts

It seems that the second approach is better performance wise if one is thinking of using only Keccak-p and its derivatives. If one the other hand, other (classic) hash functions are to be supported, e.g., Blake, then the LUT-based approach might make sense as the cost associated with LUT-s could be amortized across arithmetizations.

@bobbinth
Copy link
Contributor

Awesome write-up! Thank you!

One context in which we may want to use this is to verify that a bunch of Keccak hashes were done correctly. This would involve having a dedicated proof for these Keccak hashes which can then be verified in the VM. For example, such a proof may involve computing 32 keccak hashes on 32 different inputs.

I think the questions here are:

  1. How AIR for such a computation could look like? For example, we may need to use RPO/RPX to commit to all the inputs and then we'd need to make sure that a given input maps to a given digest.
  2. In both constructions, the number of columns is quite large. How will it affect the recursive verifier? This is specifically relevant for computing leaf hashes. For example, just hashing a single row would require 300 cycles. We also need to compute random linear combinations of all columns. Assuming there are 2400 of them, how many cycles would it take?

@Al-Kindi-0
Copy link
Collaborator Author

1. How AIR for such a computation could look like? For example, we may need to use RPO/RPX to commit to all the inputs and then we'd need to make sure that a given input maps to a given digest.

I think that would be the general idea. It would entail that we would have Keccak and RPX both and in parallel hashing the same stream of data and hence be able to use the RPX digest as public input. This means that in the AIR for Keccak we will also have to add RPX alongside it, from a cost point of view this is negligible relative to the pure cost of arithmetizing Keccak.

2. In both constructions, the number of columns is quite large. How will it affect the recursive verifier? This is specifically relevant for computing leaf hashes. For example, just hashing a single row would require 300 cycles. We also need to compute random linear combinations of all columns. Assuming there are 2400 of them, how many cycles would it take?

I did some rough cycle counting based on our current (partial) implementation of the recursive verifier for Miden VM. Since we will have a smaller number of constraints than in the VM, I am assuming that the cost to generate the randomness in the case of Keccak will be the same as that of our recursive verifier (but even if this is not true, this will still be a fraction of the cost of leaf-unhashing). Thus the main component that will be different between our recursive verifier for the VM and a recursive verifier for Keccak is leaf unhashing and computing the random linear combinations of the DEEP quotients. I am counting roughly $3320$ cycles per FRI query versus our current $463$ per FRI query.
This means that with our current parameter setting of $27$ FRI queries, we would need $89640$ instead of $12501$ for the two aforementioned steps i.e., leaf unhashing and computing DEEP queries.

Note that constraint evaluation in the case of the Keccak AIR will be significantly lower than constraint evaluation for the Miden VM case, and hence the final cost for the whole recursive verifier for the Keccak AIR should be well bellow $2^{17}$ cycles.

@bobbinth
Copy link
Contributor

bobbinth commented Nov 1, 2024

Thus the main component that will be different between our recursive verifier for the VM and a recursive verifier for Keccak is leaf unhashing and computing the random linear combinations of the DEEP quotients. I am counting roughly 3320 cycles per FRI query versus our current 463 per FRI query.

How does this roughly break down between unhashing and liner combinations? Seems like linear combinations would dominate here, right?

@Al-Kindi-0
Copy link
Collaborator Author

Exactly, the cost for unhashing is around $400$ while the cost for the random linear combination is around $2780$

@bobbinth
Copy link
Contributor

bobbinth commented Nov 3, 2024

One more radical (and longer-term) approach to bring the cycle-count down, is it extend the hasher chiplet to enable hashing and computing random linear combination at the same time. This would require the hasher chiplet to read memory while it performs hashing and is probably tricky to do in general, but if we can come up with a good way to do this, then from the stack standpoint, leaf unhashing and computing random linear combinations may be done with just a few instructions regardless of the leaf size.

@Al-Kindi-0
Copy link
Collaborator Author

I agree, I think given some prior discussions related to this topic, the right "primitive" would be unhashing while at the same time computing an inner product against a vector in memory. This would have potentially other uses besides computing the DEEP queries, e.g., evaluating large polynomials at a random point.

@bobbinth
Copy link
Contributor

bobbinth commented Nov 4, 2024

In regards to recursive proof verification, I think there are 2 general things we should investigate for how to offload computation to chiplets:

  1. Leaf unhashing + DEEP composition (as discussed above).
  2. FRI verification.

It maybe that we can replace our current specialized instructions for linear combinations and FRI folding with something even higher-level that may allow us to reduce recursive proof verification cycle count even beyond what we have now.

@Al-Kindi-0
Copy link
Collaborator Author

Makes sense! I have a feeling that we can get a long way with a couple of instructions that compute inner products, both in base field and extension field, to optimize both points in your comment.
I will open an issue to discuss this soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants