
AIR chiplet for arithmetic circuit evaluation #1610

Open
Al-Kindi-0 opened this issue Dec 24, 2024 · 18 comments

@Al-Kindi-0
Collaborator

Feature description

Introduction

A large portion of the recursive verifier work goes into the constraint evaluation step. This usually entails evaluating a large algebraic expression over the extension field. If we think of this algebraic expression as an arithmetic circuit, then this circuit has the following inputs:

  1. The out-of-domain (OOD) point $z$
  2. The OOD trace columns evaluation frame: these are the (claimed) evaluations of the committed trace polynomials at the OOD point $z$ and $gz$
  3. The OOD constraint columns evaluation frame: these are the (claimed) evaluations of the committed quotient segment polynomials at the OOD point $z$
  4. The constraint composition randomness: these are the values used for combining the different constraints
  5. The public inputs: these are values (over base field) and are usually inputs to boundary constraints
  6. The divisors: these are expressions of the form $\frac{1}{z^a - b}$

The output of the circuit is a single value (in the extension field), which should then be constrained to equal zero.

Hence, the problem of constraint evaluation can be described by the following components:

  1. Inputs: these are the input values to the circuit
  2. Arithmetic circuit description:
    1. Gate types: we can assume that we have only addition and multiplication gates
    2. Circuit topology: each gate's inputs and outputs.

An AIR for evaluating arithmetic circuits

Assume throughout that each gate is fan-in 2 but fan-out any positive integer.

Now, suppose that the description of the circuit is stored in memory as a sequence of tuples describing each of the gates where each tuple is made up of

  1. op: either add or mul describing the type of the gate
  2. [l_inp_id, r_inp_id]: the ids of the left and right input wires to the gate
  3. out_id: the id of the current gate which also serves as the id of the output wires of the gate
  4. m: a multiplicity parameter describing the fan-out of the gate
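As a minimal sketch of the evaluation semantics implied by this description (gate tuples flattened to `(op, l_inp_id, r_inp_id, out_id, m)`; arithmetic done modulo the Goldilocks prime, whereas the actual chiplet works over a degree-2 extension of it):

```python
# Illustrative evaluation of a fan-in-2 circuit given as a flat gate list.
P = 2**64 - 2**32 + 1  # Goldilocks prime used by the VM's base field

def eval_circuit(inputs, gates):
    # wire ids 0..len(inputs)-1 are the circuit inputs
    wires = {i: v % P for i, v in enumerate(inputs)}
    out = None
    for op, l, r, out_id, _m in gates:
        a, b = wires[l], wires[r]
        wires[out_id] = (a + b) % P if op == "add" else (a * b) % P
        out = wires[out_id]
    return out  # the last gate's output is the circuit output

# (x + y) * x with x = 3, y = 4
gates = [("add", 0, 1, 2, 1), ("mul", 2, 0, 3, 1)]
assert eval_circuit([3, 4], gates) == 21
```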

From a user point of view, the circuit will be specified by a digest and loaded to memory from the advice provider by un-hashing the said digest. Let circuit_ptr be a pointer to the memory region containing this circuit description.
As for the inputs to the circuit, we assume that they are stored in a contiguous region of memory, and let input_ptr be a pointer to this memory region.

Chiplet description

Layout

The following is a pictorial description of the layout of the chiplet. It can be thought of as being composed of three portions:

[eval_chiplet layout diagram]

  1. Selectors: these are the selectors used to switch between the different modes of the chiplet. See below.
  2. Hardcoded portion:
    1. op, [l_inp_id, r_inp_id], out_id and m are as described above.
  3. Dynamic portion:
    1. [l_inp_val, r_inp_val] are the current gate's left and right input values. These are over the extension field over which the constraint evaluation is done.
    2. out_val is the current gate's output value. This is over the extension field as well.
    3. input_ptr is the pointer to the inputs as described above.
    4. circuit_ptr is the pointer to the circuit description as described above.
    5. clk is the clock cycle at which the memory accesses happen.
    6. ctx is the context in which the memory accesses happen.

Note that there is an extra bus that will be used to wire the gates. The idea is for each gate to send output requests of the form (out_id, out_val) with multiplicity m. Similarly, each gate sends input requests of the form (l_inp_id, l_inp_val) and (r_inp_id, r_inp_val). We will use selectors to turn these requests on/off, see the next section. Moreover, we will append a label to each type of interaction with the bus in order to distinguish them.
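The wiring bus can be pictured as a multiset check; a minimal Python sketch, with illustrative gate tuples `(op, l_id, r_id, out_id, m)` and a hypothetical `inp` op standing in for input-loading rows:

```python
from collections import Counter

# Every (out_id, out_val) sent onto the bus with multiplicity m must be
# matched by exactly m input requests across all gates; the bus balances
# iff the two multisets are equal.
def bus_balances(gates, wires):
    sent, requested = Counter(), Counter()
    for op, l, r, out_id, m in gates:
        if m:
            sent[(out_id, wires[out_id])] += m
        if op in ("add", "mul"):  # input-loading rows request nothing
            requested[(l, wires[l])] += 1
            requested[(r, wires[r])] += 1
    return sent == requested

# x (wire 0) is consumed twice, y (wire 1) once; the final output (wire 3)
# is not consumed, so its multiplicity is 0.
wires = {0: 3, 1: 4, 2: 7, 3: 21}
gates = [("inp", 0, 0, 0, 2), ("inp", 0, 0, 1, 1),
         ("add", 0, 1, 2, 1), ("mul", 2, 0, 3, 0)]
assert bus_balances(gates, wires)
```

A prover claiming a wrong multiplicity leaves the two multisets unequal, which is the soundness intuition discussed further below.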

Workings

Loading inputs

This is indicated by a flag f_load derived from selectors. Here the chiplet makes read requests (MEM_READ_OP_LABEL, clk, ctx, input_ptr) to memory and populates out_val with the received values representing the circuit inputs.
The chiplet also makes read requests (MEM_READ_OP_LABEL, clk, ctx, circuit_ptr) to memory and populates op, [l_inp_id, r_inp_id], out_id and m. In practice, the only relevant datum at this step is m: op and [l_inp_id, r_inp_id] are not used at this stage, and out_id can be "computed" by the chiplet by starting from 0 and incrementing by 1 for each subsequent row while f_load == 1.
At this stage we do not enforce any constraints except for outputting the gate values. This is done by sending (out_id, out_val) with multiplicity m to the wiring bus.
Both input_ptr and circuit_ptr are incremented after each request.
When the Loading stage ends, which is determined when the next row is the first row of the next stage, the chiplet sends a message to the decoder with the current value of the input_ptr pointer, call it input_end_ptr, to indicate that it has finished loading the inputs.

Evaluating

This is indicated by a flag f_eval derived from selectors and mutually exclusive with f_load. Here the chiplet stops reading the inputs and continues reading the circuit description, now reading the full description of each gate, i.e., op, [l_inp_id, r_inp_id], out_id and m. Moreover, we now impose the gate constraint out_val = l_inp_val * r_inp_val or out_val = l_inp_val + r_inp_val, depending on op, where the addition/product is over the extension field.
From the wiring point of view, the chiplet makes, to the wiring bus, for each gate, requests to get inputs (l_inp_id, l_inp_val) and (r_inp_id, r_inp_val) and requests to output (out_id, out_val) with multiplicity m.

Outputting

This is indicated by a flag f_out derived from selectors and mutually exclusive with f_eval and f_load. Here the chiplet stops reading the inputs and circuit description. f_out also indicates that the out_val of the previous gate is the output of the circuit. Given the above, we can add a constraint for out_val to be zero or just output it by including it in the final message to the decoder. This message will include the final value of circuit_ptr, call it circuit_end_ptr, to signal that the chiplet has read and evaluated all of the circuit.

Usage

We will assume that the inputs have already been loaded to memory, which is the case for the applications we have in mind, i.e., recursive verification. The program wishing to evaluate the circuit on the aforementioned inputs will then un-hash the circuit to some region of memory. At this point the program calls the instruction that does the evaluation, say eval_circuit, with inputs input_ptr, circuit_ptr, input_end_ptr and circuit_end_ptr. This corresponds to the decoder making requests to the circuit evaluation chiplet of the form (BEGIN_EVAL_OP_LABEL, clk, ctx, input_ptr) and (BEGIN_EVAL_OP_LABEL, clk, ctx, circuit_ptr) to initialize the evaluation, and (END_EVAL_OP_LABEL, clk, ctx, input_end_ptr) and (END_EVAL_OP_LABEL, clk, ctx, circuit_end_ptr) to request a response from the chiplet, where *_end_ptr is, as before, the value of the pointer at the boundary of the memory region of either the circuit inputs or the circuit description.
The chiplet will then respond first by initializing circuit evaluation and for this we will probably need a fourth flag, say f_init. Here, the chiplet will make responses to the decoder (BEGIN_EVAL_OP_LABEL, clk, ctx, input_ptr) and (BEGIN_EVAL_OP_LABEL, clk, ctx, circuit_ptr) to indicate the start of the circuit evaluation. The next chiplet row will then be the first row in the Loading phase.
After executing Loading, the chiplet will make the response (END_EVAL_OP_LABEL, clk, ctx, input_end_ptr) to signal that it has completed loading the circuit inputs from memory.

Next comes the Evaluating phase after which the chiplet transitions to the Outputting phase and here it will issue a response (END_EVAL_OP_LABEL, clk, ctx, circuit_end_ptr) to the decoder indicating that it is done evaluating the circuit, which will imply that the circuit evaluated to zero or, if we choose to send the evaluation to the decoder, we can include it in this last message.

Remarks and Open Questions:

  1. A full description of the flags is TBD.
  2. A description of the constraints on the flags is TBD. We are going to probably need constraints for correctly transitioning the circuit evaluation chiplet from one state to the other in a set sequential order.
  3. It might be better to make the gates more expressive at the cost of adding more columns. The rationale for this would be to decrease the gate count and hence the chiplet's execution trace length.

Why is this feature needed?

As the VM's complexity increases, more AIR constraints are added, which in turn translates to more work for the recursive verifier at the constraint evaluation step. Thus we need to find a way to accelerate this step, and the current issue aims at investigating how to do so.

@Al-Kindi-0 Al-Kindi-0 added the enhancement New feature or request label Dec 24, 2024
@bobbinth
Contributor

Looks great! Thank you for such a detailed description! A few thoughts/comments:

Gate definition

One of the things we probably want to do is to make sure that we can encode a gate in a single field element. This way, we can increase the efficiency of "unhashing" of a circuit in memory. From your description, a gate is a tuple (op, l_in_id, r_in_id, out_id, m). To fit into a field element, we'd need to limit l_in_id, r_in_id, and out_id to something like $2^{16}$ and m to $2^{14}$.

Alternatively, we could skip $m$ (the prover will need to fill it in in the chiplet) and then l_in_id, r_in_id, and out_id could be limited by $2^{20}$.
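As a rough illustration of this second variant (no m, three 20-bit ids), a hypothetical packing into a single Goldilocks element could look like the following; the op codes and bit positions are assumptions, not a fixed format:

```python
# Pack (op, l_in_id, r_in_id, out_id) into one field element: three
# 20-bit ids plus a 2-bit op give 62 bits, which fits below the
# Goldilocks prime p = 2^64 - 2^32 + 1.
P = 2**64 - 2**32 + 1
OPS = {"add": 0, "mul": 1, "read": 2}  # op codes are illustrative

def encode_gate(op, l_in_id, r_in_id, out_id):
    assert max(l_in_id, r_in_id, out_id) < 2**20
    word = l_in_id | (r_in_id << 20) | (out_id << 40) | (OPS[op] << 60)
    assert word < P  # must remain a valid field element
    return word

def decode_gate(word):
    return (word >> 60, word & 0xFFFFF, (word >> 20) & 0xFFFFF, (word >> 40) & 0xFFFFF)

assert decode_gate(encode_gate("mul", 7, 11, 12)) == (1, 7, 11, 12)
```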

Selectors

We could avoid selectors by splitting loading inputs, evaluating, and reading output into 3 separate instructions (or maybe evaluating and reading output could be combined into a single instruction). The instructions could look like so:

#! Inputs: [input_ptr, input_end_ptr, eval_id]
#! Output: [input_ptr + 1, input_end_ptr, eval_id]
load_input

#! Inputs: [circuit_ptr, circuit_end_ptr, eval_id]
#! Output: [result_1, result_0, eval_id]
eval_circuit

Here, eval_id would be some unique value that the user would pick for a given evaluation (e.g., it could be based on the clock cycle).

The user would then load inputs into the evaluator bus by executing load_input in a loop, and then call eval_circuit using the same eval_id.

This way, the only memory reads the chiplet needs to do are to load the gates. This may simplify things a bit, though I'm not sure by how much.

Another potential option is to encode all actions into op. This way, in addition to add and mul we'd also have read operation which would read an input from a given memory location.

Input format

One thing that is not entirely clear to me is what's the best way to arrange inputs in memory. I believe currently we "interleave" current/next row values of the OOD point in a single word. Would this still work? Or would we need to make it so that each input value is stored in a separate word?

Reading gates from memory

If I understood correctly, the described design assumes that we do 1 memory read of a single element per gate (i.e., per row of the chiplet). Since we can actually read 4 elements per memory read, I wonder if there is some way we can improve on this.

Separately, I'm wondering if reading 1 element per row is compatible with element-addressable memory work (I think it should be, but we should double-check).

Cost estimates

I'd like to start thinking through some cost estimates for this. For example, how many hashes and memory reads would be required for a circuit of 5K gates and 200 inputs (I think these are ballpark numbers of what we'd need for the VM circuit). My initial estimates are:

  • 5K gates require:
    • 5K rows in the hasher chiplet (8 gates can be "unhashed" in a single permutation).
    • 5K rows in the memory chiplet during evaluation (unless we can find a way to read 4 gates at a time).
  • 200 inputs require:
    • 800 rows in the hasher chiplet (2 inputs can be "unhashed" in a single permutation).
    • 200 rows in the memory chiplet during input loading.

If we are verifying multiple proofs, circuit "unhashing" costs would be amortized. So, overall, verifying a proof would add between 6K and 11K rows to the chiplets table.
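The estimates above can be reproduced with a small model; the per-permutation figures are the ones quoted in the list, and the 5K-row amortization term is the circuit "unhashing":

```python
# Hypothetical back-of-the-envelope model of the row counts: 8 rows per
# hash permutation, 8 gates or 2 inputs "unhashed" per permutation, and
# one memory-chiplet row per gate / per input.
ROWS_PER_PERM = 8

def chiplet_rows(gates, inputs, gates_per_perm=8, inputs_per_perm=2):
    hasher = (gates // gates_per_perm + inputs // inputs_per_perm) * ROWS_PER_PERM
    memory = gates + inputs
    return hasher, memory

hasher, memory = chiplet_rows(5000, 200)
assert hasher + memory == 11_000        # first proof: ~11K rows
assert hasher + memory - 5000 == 6_000  # circuit "unhashing" amortized: ~6K rows
```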

It might be better to make the gates more expressive at the cost of adding more columns. The rationale for this would be to decrease the gate count and hence the chiplet's execution trace length.

My intuition here is that we should keep the gates pretty simple: they are just additions/multiplications in the degree 2 extension field.

One potential exception to this could be something that could help us with MDS matrix multiplication because this part alone would contribute over 600 gates to the circuit.

@bobbinth
Contributor

Another potential option is to encode all actions into op. This way, in addition to add and mul we'd also have read operation which would read an input from a given memory location.

This may actually work quite well. We could have 4 operations:

  • add - addition in the extension field.
  • mul - multiplication in the extension field.
  • read - read the input from memory.
  • const - set an input to some constant value.

One question is whether we also need a sub operation (it can be emulated with multiplying by $-1$, but maybe good to have a dedicated operation for it).

The layout could look something like this:
[chiplet layout diagram]

Where:

  • The numbers under each element represent the number of columns - so, 15 columns total.
  • I've assumed that we can arrange inputs and circuit to be next to each other in memory and so we'd need only 1 memory pointer. Not 100% sure if this will work though.

@Al-Kindi-0
Collaborator Author

Alternatively, we could skip m (the prover will need to fill it in in the chiplet) and then l_in_id, r_in_id, and out_id could be limited by 2^20.

I like this idea, but it seems a bit counterintuitive to me that it could be sound, though I can't immediately see a flaw in it.
This means that we can probably encode up to $2^{20}$ unique gate identifiers.

Another potential option is to encode all actions into op. This way, in addition to add and mul we'd also have read operation which would read an input from a given memory location.

I think we should go with this option. As you describe in your second comment, and if I understand correctly, this will require that the input data and circuit data be interleaved in memory. This is because for a read at address addr we would request a read at addr + 1 and addr + 2 in order to populate, say, l_in_id with the received value. Here I am assuming element-addressable memory. Alternatively, we could just lay things out so that they are word-aligned, which might require some padding.
For a trace that is not too wide, ensuring the above layout of input data and circuit data should not be too difficult. We can, after un-hashing the circuit description to memory, just populate the appropriate memory slots containing the circuit description with the input data. This should not be costly provided moderate trace widths.

One thing that is not entirely clear to me is what's the best way to arrange inputs in memory. I believe currently we "interleave" current/next row values of the OOD point in a single word. Would this still work? Or would we need to make it so that each input value is stored in a separate word?

If we go with the solution in the previous bullet point then this should not be an issue as far as I can see.

If I understood correctly, the described design assumes that we do 1 memory read of a single element per gate (i.e., per row of the chiplet). Since we can actually read 4 elements per memory read, I wonder if there is some way we can improve on this.

I think we should be able to read 1 element per gate once element addressable memory is done, but we will have to double check in the relevant PR.
Reading a word would be nice from a memory-chiplet-length point of view but I am not sure how to do it efficiently.

800 rows in the hasher chiplet (2 inputs can be "unhashed" in a single permutation).

When I think of an input, I think of an element of the extension field; is it the same here?

One potential exception to this could be something that could help us with MDS matrix multiplication because this part alone would contribute over 600 gates to the circuit.

We can resort to this if the circuits with only mul / add gates become too large.

@Al-Kindi-0
Collaborator Author

Another potential option is to encode all actions into op. This way, in addition to add and mul we'd also have read operation which would read an input from a given memory location.

This may actually work quite well. We could have 4 operations:

* `add` - addition in the extension field.

* `mul` - multiplication in the extension field.

* `read` - read the input from memory.

* `const` - set an input to some constant value.

One question is whether we also need a sub operation (it can be emulated with multiplying by $-1$, but maybe good to have a dedicated operation for it).

The layout could look something like this: [chiplet layout diagram]

Where:

* The numbers under each element represent the number of columns - so, 15 columns total.

* I've assumed that we can arrange inputs and circuit to be next to each other in memory and so we'd need only 1 memory pointer. Not 100% sure if this will work though.

One thing that is not clear to me from the above is what would the interface between the arithmetic evaluation chiplet and the decoder look like.

@bobbinth
Contributor

bobbinth commented Jan 3, 2025

Alternatively, we could skip m (the prover will need to fill it in in the chiplet) and then l_in_id, r_in_id, and out_id could be limited by 2^20.

I like this idea but it seems a bit counterintuitive to me that it could be sound though I can't immediately see a flaw in it. This means that we can probably encode up to 2^20 unique gate identifiers.

Maybe including $m$ into the gate encoding is a safer way to go. I think 2^16 gates should be sufficient in most cases, and we can even push it to 2^17 if we reduce max value of $m$ to 1024.

I think we should go with this option. As you describe in your second comment and if I understand correctly, this will require that the input data and circuit data be interleaved in memory. This is because for a read at address addr we would request a read at addr + 1 and addr + 2 in order to populate say l_in_id with the received value. Here I am assuming element addressable memory. Alternatively, we could just layout things so that things are word-aligned which might require some padding.

For a trace that is not too wide, ensuring the above layout of input data and circuit data should not be too difficult. We can, after un-hashing the circuit description to memory, just populate the appropriate memory slots containing the circuit description with the input data. This should not be costly provided moderate trace widths.

The way I was imagining this is that inputs would be aligned on word boundaries and each input word would be something like [gate, 0, in_0, in_1], where gate is the encoding of the gate and (in_0, in_1) would be the extension field element (this could also support a degree 3 field extension, if needed).

When computing the commitment to a circuit, input gates would be represented with full words [gate, 0, 0, 0]. This is a bit wasteful, but should simplify data movement.

Then, before invoking the chiplet, we'd need to populate the input values for each of the input cells - though, as you've mentioned, this may take quite a few cycles as we'd need to manually copy over each input into the appropriate gate.

The main benefit of this is that in the chiplet we need to do only 1 memory read per row:

  • For read operation we'd read the entire word from memory and then we'd increment the memory pointer by 4.
  • For add/mul operations we'd read a single element from memory and increment the memory pointer by 1.

At the same time, as mentioned above, copying over inputs will probably require quite a bit of data movement - so, maybe there is a better approach.

800 rows in the hasher chiplet (2 inputs can be "unhashed" in a single permutation).

When I think of input I think of a field element in the extension field, is it the same here?

Correct. Here I was assuming we'd have an arrangement of single extension field element per word (with the "unused" elements set to zero). This is different from what we do now where we pack 2 field elements per word. With 2 elements per word we'd need only 400 rows, but I do think that there may be benefit to this field element arrangement (e.g., would be much easier to go to degree 3 extension).

One thing that is not clear to me from the above is what would the interface between the arithmetic evaluation chiplet and the decoder look like.

I think the decoder would view it as a single instruction - something like:

#! Inputs: [circuit_ptr, circuit_end_ptr]
#! Output: [result_1, result_0]
eval

This instruction would need to be used in procedures which have some additional knowledge about the circuit being evaluated because these procedures would need to populate the inputs in memory. The process could look like so:

  1. "Unhash" the circuit to memory (only needs to be done once per circuit).
  2. "Unhash" the inputs to memory.
  3. Populate the "input words" in the circuit description (but as mentioned above, maybe there is a better approach).
  4. Call the eval instruction.

We'd have multiple procedures which look like this but for different circuits - e.g., evaluate_vm_constraints, evaluate_keccak_constraints, evaluate_ecdsa_constraints. Since each procedure is specialized, we should be able to use repeat loops rather than while loops which should make things a bit more efficient.

@Al-Kindi-0
Collaborator Author

Maybe including m into the gate encoding is a safer way to go. I think 2^16 gates should be sufficient in most cases, and we can even push it to 2^17 if we reduce max value of m to 1024.

Agreed! An upper bound of 1024 on m should be more than enough. This could be doubled inside the circuit at the cost of one additional gate if a larger m is ever needed.

When computing the commitment to a circuit, input gates would be represented with full words [gate, 0, 0, 0]. This is a bit wasteful, but should simplify data movement.

Yes, that's what padding was referring to. I think there are typically two extreme configurations that we face:

  1. The VM trace which is narrow but has a complex constraint evaluation circuit. Here the number of inputs is manageable and hence both the amount of padding as well as the amount of data movement should be negligible when compared to the cost of evaluating the circuit.
  2. Chiplet traces which are wide but usually have low complexity constraint evaluation circuits. Here the number of inputs will be large and hence we will incur a significant overhead due to the padding and data movement. On the other hand, the circuit will not be too complex.

All in all, I think we should be fine with the above proposal as it has the advantage of simplifying the design of the arithmetic circuit evaluation chiplet, namely reduction in the number of committed columns.

Correct. Here I was assuming we'd have an arrangement of single extension field element per word (with the "unused" elements set to zero). This is different from what we do now where we pack 2 field elements per word. With 2 elements per word we'd need only 400 rows, but I do think that there may be benefit to this field element arrangement (e.g., would be much easier to go to degree 3 extension).

Makes sense!

This instruction would need to be used in procedures which have some additional knowledge about the circuit being evaluated because these procedures would need to populate the inputs in memory. The process could look like so:

Makes sense! One more thing: how would the communication between the chiplet and the decoder work out? Requests from the decoder are straightforward, but responses are not clear to me, especially when we have multiple invocations of eval in the same execution trace of the chiplet.

@bobbinth
Contributor

bobbinth commented Jan 5, 2025

All in all, I think we should be fine with the above proposal as it has the advantage of simplifying the design of the arithmetic circuit evaluation chiplet, namely reduction in the number of committed columns.

If we remove $m$ from gate encoding, I think there is a good way for optimizing this even further. Here is how it could work:

Let's say we use 00 to encode read operation. Then, an input gate could be encoded like so:

0 | 00 (2 bits for op) | 40 zeros | out_id (20 bits)

This means that the field element encoding the input gate for wire ID 0 would be 0, for ID 1 it would be 1, etc.

Then, we lay out the circuit in memory like so:

[circuit memory layout diagram]

In the input gate section (and probably in the other gate sections as well), we make sure that the out wire ID is always increasing (i.e., incremented by 1). So, the section could look something like this:

[0, 0, 0, 0], [1, 0, 0, 0], ..., [n, 0, 0, 0]

Where $0, 1, ..., n$ are the gate IDs.

Then, the procedure for evaluating the circuit becomes:

  1. "Unhash" the circuit to memory (only needs to be done once per circuit).
  2. "Unhash" the inputs to memory overwriting the "inputs" section of the circuit.
  3. Loop through the inputs section of the circuit and populate the gate elements.
  4. Call the eval instruction.

After step 2 above, the inputs section will look something like this:

[0, 0, v0_0, v0_1], [0, 0, v1_0, v1_1], ..., [0, 0, vn_0, vn_1]

Where $(v_0, v_1)$ are the degree 2 elements defining the inputs.

After step 3, we want the section to look like so:

[0, 0, v0_0, v0_1], [1, 0, v1_0, v1_1], ..., [n, 0, vn_0, vn_1]

Assuming the number of inputs is fixed (e.g., 200), we should be able to accomplish this with the following loop:

# => [input_ptr, gate_id = 0, ...]

repeat.200
    dup.1 dup.1 mem_store
    # => [input_ptr, gate_id, ...]

    add.4 swap add.1 swap
    # => [input_ptr + 4, gate_id + 1, ...]
end

This does take up ~10 cycles per input (so, for 200 inputs it'll be around 2K cycles, and for 2000 inputs, 20K cycles) - but I think that's not too bad.
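The effect of the MASM loop can be modeled in a few lines of Python (an illustrative flat-list memory model, element-addressable with word stride 4):

```python
# Walk the inputs section word by word, writing an increasing gate id
# into the first slot of each input word, turning [0, 0, v0, v1] into
# [gate_id, 0, v0, v1].
def populate_gate_ids(memory, input_ptr, num_inputs):
    gate_id = 0
    for _ in range(num_inputs):
        memory[input_ptr] = gate_id
        input_ptr += 4   # advance to the next input word
        gate_id += 1
    return memory

mem = [0, 0, 1, 2,  0, 0, 3, 4,  0, 0, 5, 6]
assert populate_gate_ids(mem, 0, 3) == [0, 0, 1, 2,  1, 0, 3, 4,  2, 0, 5, 6]
```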

If the above works, then the main thing to figure out is how to remove $m$ from gate definitions. I think this should be doable.

@Al-Kindi-0
Collaborator Author

I am not sure the above would work as it assumes that we can un-hash the inputs to memory for free as

[0, 0, v0_0, v0_1], [0, 0, v1_0, v1_1], ..., [0, 0, vn_0, vn_1]

The main reason is that, for our use case, the input will have other uses than just arithmetic circuit evaluation (ACE). For example, the OOD frame is used in the computation of the random linear combination which goes into computing the queries to the DEEP polynomial.
This is a deal breaker currently with our use of rcomb_* but will not be so if we transition to the proposal of using inner_prod_*. Still, even if we transition to inner_prod_*, we are going to incur a cost for laying out the OOD frame in a way that favors ACE, as it will most certainly not be convenient for DEEP queries; i.e., not a deal breaker, but with a cost.
To summarize, with the new proposal for inner_prod_* the above could be made to work, but the savings in the cost of laying out the inputs as described above will imply, potentially, comparable costs in other places, e.g., DEEP queries, thus potentially negating the savings.

@bobbinth bobbinth added this to the v0.13.0 milestone Jan 8, 2025
@bobbinth
Contributor

bobbinth commented Jan 8, 2025

So, for OOD frame we need to do 2 things:

  1. Provide OOD frame as an input to the arithmetic circuit evaluation chiplet.
  2. Compute $\sum_{i=0}^k \alpha^{i + 1}\cdot T_i(z)$ and $\sum_{i=0}^k\alpha^{i + 1} \cdot T_i(z \cdot g)$ using the Horner evaluation instruction.
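For concreteness, the sum in point 2 can be computed with a Horner-style loop; a base-field sketch with illustrative values (the real computation is over the degree-2 extension and uses the Horner evaluation instruction):

```python
# Computes sum_{i=0}^{k} alpha^(i+1) * T_i(z) for the OOD values T_i(z).
P = 2**64 - 2**32 + 1

def combine_ood(ood_values, alpha):
    acc = 0
    for t in reversed(ood_values):
        acc = (acc * alpha + t) % P
    return (acc * alpha) % P  # one extra factor of alpha shifts i to i+1

# with T = [3, 5] and alpha = 2: 2*3 + 4*5 = 26
assert combine_ood([3, 5], 2) == 26
```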

And the layout that would be suitable for circuit evaluation would make computing linear combinations difficult and vice versa. Unless we can come up with a way that works for both.

@Al-Kindi-0
Collaborator Author

Indeed, and since bullet point 2 is more performance critical (as it is invoked number of queries many times) it should, I believe, always take precedence.

@bobbinth
Contributor

bobbinth commented Jan 8, 2025

bullet point 2 is more performance critical (as it is invoked number of queries many times)

I thought we need to do these only once as they only involve the OOD frame (and not the queries).

@Al-Kindi-0
Collaborator Author

Indeed, that comment between parentheses is not correct.
So we would have to compare the cost of computing the sum in the second bullet point without the Horner evaluation instruction versus the cost of not using a layout that favors circuit evaluation.
There might, however, be a way to make Horner evaluation work with the circuit evaluation layout but will have to look more into it.

@Al-Kindi-0
Collaborator Author

There might, however, be a way to make Horner evaluation work with the circuit evaluation layout but will have to look more into it.

I think there is, will double check and write it soon.

@Al-Kindi-0
Collaborator Author

Let $(c_i)_{0\leq i\leq 2n-2}$ be the coefficients of the polynomial represented with the proposed with-gaps layout in memory. This means that if $(a_i)_{ 0 \leq i \leq n -1}$ are the coefficients of the polynomial with the without-gaps layout then we have the following relations $c_{2i + 1} = 0$ and $c_{2i} = a_i$.

Now, we can use horner_eval_ext on the polynomial with coefficients $c_i$ to evaluate it at some $x$, and this would give us $$c_0 + c_1 \cdot x + c_2 \cdot x^2 + \cdots + c_{2n - 2} \cdot x^{2n-2}$$

Using the above relation between $c_i$ and $a_i$ we get
$$a_0 + a_1 \cdot x^2 + a_2 \cdot (x^2)^2 + \cdots + (x^2)^{n-1} \cdot a_{n - 1}$$

This means that evaluating $(a_i)_{ 0 \leq i \leq n -1}$ at $\alpha$ is equal to the evaluation of $(c_i)_{ 0 \leq i \leq 2n -2}$ at $\beta$ where $$\alpha = \beta^2$$

In other words, evaluating the polynomial with coefficients $(c_i)_{ 0 \leq i \leq 2n -2}$ at the square-root of $\alpha$ is equal to the evaluation of the original polynomial at $\alpha$.
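A quick numeric check of this identity over a prime field (the modulus and coefficient values are illustrative):

```python
# Padding with zero coefficients and evaluating at beta matches
# evaluating the original polynomial at alpha = beta^2.
P = 2**64 - 2**32 + 1

def horner(coeffs, x):
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

a = [5, 9, 2, 7]                        # original coefficients a_i
c = [v for a_i in a for v in (a_i, 0)]  # with-gaps layout: c_{2i} = a_i, c_{2i+1} = 0
beta = 123_456_789
assert horner(c, beta) == horner(a, pow(beta, 2, P))
```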

Sadly, we are not always guaranteed to find $\beta$ for a given $\alpha$ but we can get an estimate on the probability of failing to do so.
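The failure probability can be estimated via Euler's criterion: in a field with $q$ elements, exactly $(q-1)/2$ of the nonzero elements are squares, so a uniformly random $\alpha$ has no square root with probability about $1/2$ (here $\alpha$ lives in the degree-2 extension, where the same criterion applies with $q^2$ in place of $q$). A small sketch over a toy prime field:

```python
# Euler's criterion: a nonzero element of a field with q elements is a
# square iff alpha^((q - 1) / 2) == 1. The modulus 23 is a toy example.
def is_square(alpha, q):
    return alpha == 0 or pow(alpha, (q - 1) // 2, q) == 1

# exactly (q - 1) / 2 = 11 of the 22 nonzero residues mod 23 are squares
assert sum(is_square(a, 23) for a in range(1, 23)) == 11
```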

@Al-Kindi-0
Collaborator Author

Considering the cost of evaluating the x-independent term, assuming a main trace of width 80, an aux trace of width 8, and a constraint trace of width 8, we can use roughly the following code:

# Since we have 80 main trace columns, 8 aux trace columns and 8 constraint columns, the number of terms
# in the sum related to `z` is 80 + 8 + 8 = 96. Since we process at each iteration 4 terms, the loop has to be 
# repeated 24 times.
repeat.24
    adv_pipe
    # => [0, 0, w1, w0, 0, 0, v1, v0, C, data_ptr, alpha_ptr, acc1, acc0, ...]

    movup.5 movup.5
    # => [0, 0, 0, 0, w1, w0, v1, v0, C, data_ptr, alpha_ptr, acc1, acc0, ...]

    swapw exec.constants::TMPX mem_storew
    # => [w1, w0, v1, v0, 0, 0, 0, 0, C, data_ptr, alpha_ptr, acc1, acc0, ...]

    adv_pipe
    # => [0, 0, w1`, w0`, 0, 0, v1`, v0`, C, data_ptr + 2, alpha_ptr, acc1, acc0, ...]

    movup.5 movup.5
    # => [0, 0, 0, 0, w1`, w0`, v1`, v0`, C, data_ptr + 2, alpha_ptr, acc1, acc0, ...]

    exec.constants::TMPX mem_loadw swapw
    # => [w1`, w0`, v1`, v0`, w1, w0, v1, v0, C, data_ptr + 2, alpha_ptr, acc1, acc0, ...]

    horner_eval_ext
end
# Since we have 80 main trace columns and 8 aux trace columns (note that there are no terms coming from
# the constraint trace in the sum associated to `zg`), the number of terms in the sum related to `zg` is 80 + 8 = 88.
# Since we process 4 terms at each iteration, the loop has to be repeated 22 times.
repeat.22
    adv_pipe
    # => [0, 0, w1, w0, 0, 0, v1, v0, C, data_ptr, alpha_ptr, acc1, acc0, ...]

    movup.5 movup.5
    # => [0, 0, 0, 0, w1, w0, v1, v0, C, data_ptr, alpha_ptr, acc1, acc0, ...]

    swapw exec.constants::TMPX mem_storew
    # => [w1, w0, v1, v0, 0, 0, 0, 0, C, data_ptr, alpha_ptr, acc1, acc0, ...]

    adv_pipe
    # => [0, 0, w1`, w0`, 0, 0, v1`, v0`, C, data_ptr + 2, alpha_ptr, acc1, acc0, ...]

    movup.5 movup.5
    # => [0, 0, 0, 0, w1`, w0`, v1`, v0`, C, data_ptr + 2, alpha_ptr, acc1, acc0, ...]

    exec.constants::TMPX mem_loadw swapw
    # => [w1`, w0`, v1`, v0`, w1, w0, v1, v0, C, data_ptr + 2, alpha_ptr, acc1, acc0, ...]

    horner_eval_ext
end

Note that the cost of the

[0, 0, v0_0, v0_1], [0, 0, v1_0, v1_1], ..., [0, 0, vn_0, vn_1]

alignment is $15$ cycles per iteration. Since we have $24 + 22 = 46$ iterations, the total overhead can be estimated as $46 \cdot 15 = 690$ cycles.
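The iteration and overhead counts above can be double-checked with a few lines of Python (trace widths as assumed in this comment):

```python
main_w, aux_w, constraint_w = 80, 8, 8    # assumed trace widths from above
terms_z = main_w + aux_w + constraint_w   # terms in the sum related to z
terms_zg = main_w + aux_w                 # terms in the sum related to zg
iters = terms_z // 4 + terms_zg // 4      # 4 terms processed per iteration
overhead = iters * 15                     # 15-cycle alignment cost per iteration
print(iters, overhead)                    # 46 690
```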

@plafer
Copy link
Contributor

plafer commented Jan 15, 2025

> Maybe including m into the gate encoding is a safer way to go. I think 2^16 gates should be sufficient in most cases, and we can even push it to 2^17 if we reduce max value of m to 1024.

> If we remove m from gate encoding, I think there is a good way for optimizing this even further. Here is how it could work:

I think it's safe to remove m from the gate encoding as I don't see how the prover could cheat. For a given wire, m is equivalent to the number of gates that use that wire as input. And the prover is forced to read all the gates from memory. So if the prover tries to cheat and use an incorrect value for m, then the bus will simply fail (since the prover is forced to populate it properly with all the gates that use the wire as input). So it's probably fine to do as initially proposed and use 2^20 as an upper bound for the 3 values (without encoding m)?

@Al-Kindi-0
Copy link
Collaborator Author

Yeah, for a committed sound circuit, there should be one and only one way for the prover to fill in the m column.
We can thus go up to a maximum of $2^{20}$ gates per evaluated circuit.

@bobbinth
Copy link
Contributor

Yes, I think we can safely get rid of $m$ from gate descriptions. The way I imagine this will work in the processor is as follows:

  • When the processor sees the eval_circuit instruction, it reads the circuit from memory and builds an internal representation of this circuit. In this internal representation, it will track how many times a given wire has appeared as an input into a gate. This will give the processor $m$ values for all wires.
  • The internal representation of the circuit can then be used to evaluate the circuit itself quickly and put the results onto the stack.
  • It can also be passed to the chiplet trace builder to correctly insert the value of $m$ for every gate.
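The counting step described in the bullets above can be sketched in Python. The gate encoding here is purely hypothetical (each gate is a tuple (op, left, right) with its output wire appended after the inputs; this is not the actual chiplet format), but it shows how evaluating the circuit and deriving the $m$ values per wire happen in a single pass:

```python
from collections import Counter

def eval_circuit(inputs, gates, p):
    """Evaluate an add/mul circuit mod p; also count wire fan-out (the m values).

    Hypothetical encoding: gates is a list of (op, left, right) tuples, where
    op is "add" or "mul" and left/right index previously defined wires; each
    gate's output becomes the next wire index.
    """
    wires = list(inputs)
    fanout = Counter()  # m value per wire: times it appears as a gate input
    for op, l, r in gates:
        fanout[l] += 1
        fanout[r] += 1
        if op == "add":
            wires.append((wires[l] + wires[r]) % p)
        else:
            wires.append((wires[l] * wires[r]) % p)
    return wires[-1], fanout

p = 2**64 - 2**32 + 1
# compute (x + y) * x with x on wire 0, y on wire 1
out, m = eval_circuit([3, 4], [("add", 0, 1), ("mul", 2, 0)], p)
assert out == 21   # (3 + 4) * 3
assert m[0] == 2   # wire 0 feeds two gates, so its m value is 2
```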

Regarding computing $\sum_{i=0}^k \alpha^{i + 1}\cdot T_i(z)$ and $\sum_{i=0}^k\alpha^{i + 1} \cdot T_i(z \cdot g)$ - don't we need to compute them separately? Meaning:

  • First, we iterate over values of $T_i(z)$ to compute $\sum_{i=0}^k \alpha^{i + 1}\cdot T_i(z)$.
  • Then, we iterate over values of $T_i(z \cdot g)$ to compute $\sum_{i=0}^k\alpha^{i + 1} \cdot T_i(z \cdot g)$.

So, our current "zig-zag" arrangement of $T_i(z)$ and $T_i(z \cdot g)$ wouldn't work. Instead, we'd need to lay out the OOD frame to first have all the $T_i(z)$ values and then $T_i(z \cdot g)$ values.

I think the approaches outlined in #1610 (comment) could work. Another potential approach is to have the horner_eval_ext instruction work with only 2 elements at a time (instead of 4). So, it could look something like this (assuming 88 values):

# compute sum of alpha^{i + 1} * T_i(z)
repeat.44
     adv_pipe
    # => [0, 0, w1, w0, 0, 0, v1, v0, C, data_ptr, alpha_ptr, acc1, acc0, ...]

    hperm
    # => [0, 0, w1, w0, 0, 0, v1, v0, C', data_ptr, alpha_ptr, acc1, acc0, ...]

    # compute acc' = ((acc + v) * alpha + w) * alpha - please double check this formula
    horner_eval_ext
    # => [0, 0, w1, w0, 0, 0, v1, v0, C, data_ptr + 8, alpha_ptr, acc1', acc0', ...]
end

# TODO: save acc to memory

# compute sum of alpha^{i + 1} * T_i(z * g)
repeat.44
     adv_pipe
    # => [0, 0, w1, w0, 0, 0, v1, v0, C, data_ptr, alpha_ptr, acc1, acc0, ...]

    hperm
    # => [0, 0, w1, w0, 0, 0, v1, v0, C', data_ptr, alpha_ptr, acc1, acc0, ...]

    # compute acc' = ((acc + v) * alpha + w) * alpha - please double check this formula
    horner_eval_ext
    # => [0, 0, w1, w0, 0, 0, v1, v0, C, data_ptr + 8, alpha_ptr, acc1', acc0', ...]
end

# TODO: save acc to memory
# TODO: verify that OOD frame was un-hashed correctly

Even though we now need to do 44 iterations (and 88 total for the OOD frame), each iteration requires only 3 cycles. And so, we get a total of 264 cycles to both unhash the OOD frame and to compute $\sum_{i=0}^k \alpha^{i + 1}\cdot T_i(z)$ and $\sum_{i=0}^k\alpha^{i + 1} \cdot T_i(z \cdot g)$.
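For what it's worth, the accumulator formula in the code comments above, acc' = ((acc + v) * alpha + w) * alpha, does check out, provided v is the higher-degree element of each pair and pairs are processed from highest degree down. A quick Python sketch over the base field (the real instruction works over the quadratic extension; the prime, point, and T values here are just for illustration):

```python
p = 2**64 - 2**32 + 1          # Goldilocks prime; base field for illustration

def pair_step(acc, v, w, alpha):
    # acc' = ((acc + v) * alpha + w) * alpha, as in the comments above
    return (((acc + v) * alpha + w) * alpha) % p

T = [3, 5, 7, 11]              # T_0..T_3 (an even number of terms)
alpha = 123456789
acc = 0
for i in range(len(T) - 2, -2, -2):   # pairs, highest-degree first
    acc = pair_step(acc, T[i + 1], T[i], alpha)

# direct computation of sum_{i=0}^{k} alpha^{i+1} * T_i
expected = sum(pow(alpha, i + 1, p) * t for i, t in enumerate(T)) % p
assert acc == expected
```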

There are probably some details I'm missing, but accounting for these, I think we should be able to perform this part of the process in under 500 cycles.
