-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AIR chiplet for arithmetic circuit evaluation #1610
Comments
Looks great! Thank you for such a detailed description! A few thoughts/comments: Gate definitionOne of the things we probably want to do is to make sure that we can encode a gate in a single field element. This way, we can increase the efficiency of "unhashing" of a circuit in memory. From your description, a gate is a tuple Alternatively, we could skip SelectorsWe could avoid selectors by splitting loading inputs, evaluating, and reading output into 3 separate instructions (or maybe evaluating and reading output could be combined into a single instruction). The instructions could look like so:
Here, The user would the load inputs into the evaluator bus by executing This way, the only memory reads the chiplet needs to do is to load the gates. This may simplify things a bit, though not sure by how much. Another potential option is to encode all actions into Input formatOne thing that is not entirely clear to me is what's the best way to arrange inputs in memory. I believe currently we "interleave" current/next row values of the OOD point in a single word. Would this still work? Or would we need to make it so that each input value is stored in a separate word? Reading gates from memoryIf I understood correctly, the described design assumes that we do 1 memory read of a single element per gate (i.e., per row of the chiplet). Since we can actually read 4 elements per memory read, I wonder if there is some way we can improve on this. Separately, I'm wondering if reading 1 element per row is compatible with element-addressable memory work (I think it should be, but we should double-check). Cost estimatesI'd like to start thinking through some cost estimates for this. For example, how many hashes and memory reads would be required for a circuit of 5K gates and 200 inputs (I think these are ballpark numbers of what we'd need for the VM circuit). My initial estimates are:
If we are verifying multiple proofs, circuit "unhashing" costs would be amortized. So, overall, verifying a proof would add between 6K and 11K rows to the chiplets table.
My intuition here is that we should keep the gates pretty simple: they are just additions/multiplications in the degree 2 extension field. One potential exception to this could be something that could help us with MDS matrix multiplication because this part alone would contribute over 600 gates to the circuit. |
I like this idea but it seems a bit counterintuitive to me that it could be sound though I can't immediately see a flaw in it.
I think we should go with this option. As you describe in your second comment and if I understand correctly, this will require that the input data and circuit data be interleaved in memory. This is because for a
If we go with the solution in the previous bullet point then this should not be an issue as far as I can see.
I think we should be able to read 1 element per gate once element addressable memory is done, but we will have to double check in the relevant PR.
When I think of input I think of a field element in the extension field, is it the same here?
We can resort to this if the circuits with only mul / add gates become too large. |
Maybe including
The way I was imagining this is that inputs would be aligned on word boundaries and each input word would be something like When computing the commitment to a circuit, input gates would be represented with full words Then, before invoking the chiplet, we'd need to populate the input values for each of the input cells - though, as you've mentioned, this may take quite a few cycles as we'd need to manually copy over each input into the appropriate gate. The main benefit of this is that in the chiplet we need to do only 1 memory read per row:
At the same time, as mentioned above, copying over inputs will probably require quite a bit of data movement - so, maybe there is a better approach.
Correct. Here I was assuming we'd have an arrangement of single extension field element per word (with the "unused" elements set to zero). This is different from what we do now where we pack 2 field elements per word. With 2 elements per word we'd need only 400 rows, but I do think that there may be benefit to this field element arrangement (e.g., would be much easier to go to degree 3 extension).
I think the decoder would view it as a single instruction - something like:
This instruction would need to be used in procedures which have some additional knowledge about the circuit being evaluated because these procedures would need to populate the inputs in memory. The process could look like so:
We'd have multiple procedures which look like this but for different circuits - e.g., |
Agreed! An upper bound of 1024 on m should be more than enough. This could be doubled inside the circuit at the cost of one additional gate if there is a need for a larger m sometime.
Yes, that's what padding was referring to. I think there are typically two extreme configurations that we face:
All in all, I think we should be fine with the above proposal as it has the advantage of simplifying the design of the arithmetic circuit evaluation chiplet, namely reduction in the number of committed columns.
Makes sense!
Makes sense! One more thing, how would the communication between the chiplet and the decoder work out? Requests from the decoder are straightforward but responses are not clear to me especially when we have multiple invocations of |
If we remove Let's say we use
This means that field element encoding value for wire with ID 0 would be 0, for ID 1 would be 1 etc. Then, we lay out the circuit in memory like so: In the input gate section (and probably in the other gate section as well), we make sure that
Where Then, the procedure for evaluating the circuit becomes:
After step 2 above, the inputs section will look something like this:
Where After step 3, we want the section to look like so:
Assuming the number of inputs is fixed (e.g., 200), we should be able to accomplish this with the following loop:
This does take up ~10 cycles per input (so, for 200 inputs it'll be around 2K cycles, and for 2000 inputs, 20K cycles) - but I think that's not too bad. If the above works, then the main thing to figure out is how to remove |
I am not sure the above would work as it assumes that we can un-hash the inputs to memory for free as
The main reason is that, for our use case, the input will have other uses than just arithmetic circuit evaluation (ACE). For example, the OOD frame is used in the the computation of the random linear combination which goes into computing the queries to the DEEP polynomial. |
So, for OOD frame we need to do 2 things:
And the layout that would be suitable for circuit evaluation would make computing linear combinations difficult and vice versa. Unless we can come up with a way that works for both. |
Indeed, and since bullet point 2 is more performance critical (as it is invoked number of queries many times) it should, I believe, always take precedence. |
I thought we need to do these only once as they only involve the OOD frame (and not the queries). |
Indeed, that comment between parentheses is not correct. |
I think there is, will double check and write it soon. |
Let Now, we can use Using the above relation between This means that evaluating In other words, evaluating the polynomial with coefficients Sadly, we are not always guaranteed to find |
Considering the cost of evaluating the x-independent term, assuming a trace of width 80 aux trace of width 8 and constraint trace of width 8, we can use roughly the following code:
Note that the cost for the
alignment is |
I think it's safe to remove |
Yeah, for a committed sound circuit, there should be one and only way for the prover to fill in the |
Yes, I think we can safely get rid of
Regarding computing
So, our current "zig-zag" arrangement of I think the approaches outlines in #1610 (comment) could work. Another potential approach is to have
Even though we now need to do 44 iterations (and 88 total for the OOD frame), each iteration iteration requires only 3 cycles. And so, we get a total of 264 cycles to both unhash the OOD frame and to compute There are probably some details I'm missing, but accounting for these, I think we should be able to perform this part of the process in under 500 cycles. |
Feature description
Introduction
A large portion of the recursive verifier work goes into the constraint evaluation step. This usually entails evaluating a large algebraic expression over the extension field. If we think of this algebraic expression as an arithmetic circuit then this circuit has as inputs:
The output of the circuit is then a single value (in the extension field) which should then be constrained to be equal to zero.
Hence, the problem of constraints evaluation can be described by the following components:
An AIR for evaluating arithmetic circuits
Assume throughout that each gate is fan-in 2 but fan-out any positive integer.
Now, suppose that the description of the circuit is stored in memory as a sequence of tuples describing each of the gates where each tuple is made up of
op
: eitheradd
ormul
describing the type of the gate[l_inp_id, r_inp_id]
: the ids of the left and right input wires to the gateout_id
: the id of the current gate which also serves as the id of the output wires of the gatem
: a multiplicity parameter describing the fan-out of the gateFrom a user point of view, the circuit will be specified by a digest and loaded to memory from the advice provider by un-hashing the said digest. Let
circuit_ptr
be a pointer to the memory region containing this circuit description.As for the inputs to the circuit, we assume that they are stored in a contiguous region of memory, and let
input_ptr
be a pointer to this memory region.Chiplet description
Layout
The following is a pictorial description of the layout of the chiplet. It can be thought of as being composed of three portions:
op
,[l_inp_id, r_inp_id]
,out_id
andm
are as described above.[l_inp_val, r_inp_val]
are the current gates left and right input values. These are over the extension field over which the constraints evaluation is done.out_val
is the current gates output values. These are over the extension field as well.input_ptr
is the pointer to the inputs as described above.circuit_ptr
is the pointer to the circuit description as described above.clk
is the clock cycle at which the memory access' happen.ctx
is the context at which the memory access' happen.Note that there is an extra bus that will be used to wire the gates. The idea is for each gate to send
output
requests of the form(out_id, out_val)
with multiplicitym
. Similarly each gate sends ainput
request of the form(l_inp_id, l_inp_val)
and(r_inp_id, r_inp_val)
. We will use selectors to turn on/off these requests, see next section. Moreover, we will append a label to each type of interaction with the bus in order to distinguish them.Workings
Loading inputs
This is indicated by a flag
f_load
derived fromselectors
. Here the chiplet makes read requests(MEM_READ_OP_LABEL, clk, ctx, input_ptr)
to memory and populatesout_val
with the received values representing the circuit inputs.The chiplet also makes read requests
(MEM_READ_OP_LABEL, clk, ctx, circuit_ptr)
to memory and populatesop
,[l_inp_id, r_inp_id]
,out_id
andm
. In practice and at this step, the only relevant data ism
asop
,[l_inp_id, r_inp_id]
are not relevant at this stage andout_id
can be "computed" by the chiplet by starting from0
and incrementing it by1
for each subsequent row whilef_load == 1
.At this stage we do not enforce any constraints but for the outputting of the gates values. This is done by sending
(out_id, out_val)
with multiplicitym
to the wiring bus.Both
input_ptr
andcircuit_ptr
are incremented after each request.When the
Loading
stage ends, which is determined when the next row is the first row of the next stage, the chiplet sends a message to the decoder with the current value of theinput_ptr
pointer, call itinput_end_ptr
, to indicate that it has finished loading the inputs.Evaluating
This is indicated by a flag
f_eval
derived fromselectors
and mutually exclusive withf_load
. Here the chiplet stops reading the inputs but continues reading the circuit description but now reads the full description of each gate i.e.,op
,[l_inp_id, r_inp_id]
,out_id
andm
. Moreover, we now impose the gate constraintout_val = l_inp_val * r_inp_val
orout_val = l_inp_val + r_inp_val
depending onop
where the addition/product is over the extension field.From the wiring point of view, the chiplet makes, to the wiring bus, for each gate, requests to get inputs
(l_inp_id, l_inp_val)
and(r_inp_id, r_inp_val)
and requests to output(out_id, out_val)
with multiplicitym
.Outputting
This is indicated by a flag
f_out
derived fromselectors
and mutually exclusive withf_eval
andf_load
. Here the chiplet stops reading the inputs and circuit description.f_out
also indicates that theout_val
of the previous gate is the output of the circuit. Given the above, we can add a constraint forout_val
to be zero or just output it by including it in the final message to the decoder. This message will include the final value ofcircuit_ptr
, call itcircuit_end_ptr
, to signal that the chiplet has read and evaluated all of the circuit.Usage
We will assume that the inputs have already been loaded to memory which is the case for the applications we have in mind i.e., recursive verification. Then the program wishing to evaluate the circuit on the aforementioned inputs will un-hash the circuit to some region of memory. At this point the program calls the instruction that does the evaluation, say
eval_circuit
, with inputsinput_ptr
,circuit_ptr
,input_end_ptr
andcircuit_end_ptr
and this will correspond to the decoder making a request to the circuit evaluation chiplet of the form(BEGIN_EVAL_OP_LABEL, clk, ctx, input_ptr)
and(BEGIN_EVAL_OP_LABEL, clk, ctx, circuit_ptr)
to initialize the evaluation and(END_EVAL_OP_LABEL, clk, ctx, input_end_ptr)
and(END_EVAL_OP_LABEL, clk, ctx, circuit_end_ptr)
to request a response from the chiplet, where*_end_ptr
is, as before, the value of the pointer at the boundary of the memory region of either the circuit inputs or the circuit description.The chiplet will then respond first by initializing circuit evaluation and for this we will probably need a fourth flag, say
f_init
. Here, the chiplet will make responses to the decoder(BEGIN_EVAL_OP_LABEL, clk, ctx, input_ptr)
and(BEGIN_EVAL_OP_LABEL, clk, ctx, circuit_ptr)
to indicate the start of the circuit evaluation. The next chiplet row will then be the first row in theLoading
phase.After executing
Loading
, the chiplet will make the response(END_EVAL_OP_LABEL, clk, ctx, input_end_ptr)
to signal that it has completed loading the circuit inputs from memory.Next comes the
Evaluating
phase after which the chiplet transitions to theOutputting
phase and here it will issue a response(END_EVAL_OP_LABEL, clk, ctx, circuit_end_ptr)
to the decoder indicating that it is done evaluating the circuit, which will imply that the circuit evaluated to zero or, if we choose to send the evaluation to the decoder, we can include it in this last message.Remarks and Open Questions:
Why is this feature needed?
As the VM's complexity increases, more AIR constraints are added which in turns translates to more work for the recursive verifier at the constraints evaluation step. Thus we need to find a way to accelerate this step and the current issue aims at investigating how to do so.
The text was updated successfully, but these errors were encountered: