This repository has been archived by the owner on May 3, 2024. It is now read-only.

Circuit implementation for compression #42

Open
Brechtpd opened this issue Nov 21, 2022 · 14 comments · May be fixed by #55

Comments

@Brechtpd

Brechtpd commented Nov 21, 2022

This is to reduce the amount of data we have to push on-chain for blocks. Sticking to the same compression scheme as optimism would be nice, and DEFLATE should be straightforward to implement as a circuit.
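For reference, a raw DEFLATE round-trip can be sketched with Python's `zlib`; the header/window settings here are illustrative assumptions, not the actual on-chain format:

```python
import zlib

def deflate(data: bytes) -> bytes:
    # wbits=-15 requests a raw DEFLATE stream (no zlib header or checksum)
    co = zlib.compressobj(level=9, method=zlib.DEFLATED, wbits=-15)
    return co.compress(data) + co.flush()

def inflate(blob: bytes) -> bytes:
    # second argument -15 matches the raw DEFLATE framing above
    return zlib.decompress(blob, -15)

txlist = b"example txlist bytes " * 32
assert inflate(deflate(txlist)) == txlist
```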

First step to implement this would be to just do huffman decoding using a static huffman tree, decoding an arbitrary length data stream. The huffman tree is stored in an advice column so other rows can do a lookup to decode the data using the tree.

@xiaodino

Is the DEFLATE compression circuit a new, standalone circuit like evm_circuit, tx_circuit, and keccak_circuit?

@johntaiko

Can we prove that the encoded txlist produces the same result (data, Huffman table) as the public input?
Because DEFLATE decoding has dynamic symbol lengths determined by the Huffman table, we need to check the data bit by bit.
So maybe we should assign the data (both original and encoded) one bit per row?

@Brechtpd
Author

Brechtpd commented Dec 14, 2022

Can we prove that the encoded txlist produces the same result (data, Huffman table) as the public input?

Not sure I understand what you mean by this? Do you mean decoding the data in the smart contract?

Because DEFLATE decoding has dynamic symbol lengths determined by the Huffman table, we need to check the data bit by bit.

A well-known trick for decoding Huffman codes is to use a fixed-width input lookup table. For example, say there are codewords 01 -> value_a and 0001 -> value_b. To decode these you would create a lookup table like this:

| code | length | value   |
|------|--------|---------|
| 0100 | 2      | value_a |
| 0101 | 2      | value_a |
| 0110 | 2      | value_a |
| 0111 | 2      | value_a |
| 0001 | 4      | value_b |
Then while decoding the data in the circuit, you always use 4-bit inputs to find the decoded value. However, when value_a is decoded you only jump ahead 2 bits in the input data stream (vs. 4 bits when decoding value_b). That's where the length field in the table comes in: for each decoded value you also know how many bits to advance.
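That lookup-table trick can be sketched in plain Python, using the two hypothetical codewords from the example above:

```python
# Hypothetical two-codeword table: "01" -> value_a, "0001" -> value_b.
CODES = {"01": "value_a", "0001": "value_b"}
MAX_LEN = max(len(c) for c in CODES)  # fixed lookup width (4 bits here)

def build_lookup(codes, max_len):
    """Expand each codeword to every max_len-bit string it prefixes."""
    table = {}
    for code, value in codes.items():
        pad = max_len - len(code)
        for i in range(2 ** pad):
            suffix = format(i, f"0{pad}b") if pad else ""
            table[code + suffix] = (len(code), value)
    return table

def decode(bits, codes=CODES, max_len=MAX_LEN):
    table = build_lookup(codes, max_len)
    pos, out = 0, []
    while pos < len(bits):
        # always read a fixed-width window, zero-padding the final one
        window = bits[pos:pos + max_len].ljust(max_len, "0")
        length, value = table[window]
        out.append(value)
        pos += length  # jump ahead only by the real codeword length
    return out

# "01" + "0001" + "01" decodes to value_a, value_b, value_a
assert decode("01000101") == ["value_a", "value_b", "value_a"]
```

In the circuit the same idea becomes a lookup argument against a table with exactly these (code, length, value) rows.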

So maybe we should assign the data (both original and encoded) one bit per row?

I think just decoding 1 codeword/row makes sense, no matter how many bits are decoded.

(Each row will accumulate the encoded bits and decoded bits in two accumulators so we can check that the input data stream processed by the circuit is indeed the correct one, and the output data stream can be exported to other circuits. But you don't have to worry about this for now.)

@johntaiko

If we use a fixed codeword table, the compression ratio may be poor.

@Brechtpd
Author

We should be able to load any Huffman table (even one provided in the input per data stream) into a lookup table like this, but of course that's a bit more complex than just using a fixed Huffman table for now. Let's go step by step without making things too complicated, so we get the core mechanisms implemented first; we may also want to add dictionary support... :)

@johntaiko

johntaiko commented Dec 14, 2022

We should be able to load any Huffman table (even one provided in the input per data stream) into a lookup table like this, but of course that's a bit more complex than just using a fixed Huffman table for now. Let's go step by step without making things too complicated, so we get the core mechanisms implemented first; we may also want to add dictionary support... :)

I got it

@johntaiko

johntaiko commented Jan 10, 2023

Hi @Brechtpd, I discussed this with @smtmfft, and we came up with a new approach for using the compression circuit.
The previous approach:

  • We get compressed_raw_public_inputs from the witness (which comes from taiko-node).
  • The compress-circuit decompresses compressed_raw_public_inputs into raw_public_inputs as the circuit output, constraining rlc(compressed_raw_public_inputs) == rlc(public_inputs:layer1) at the same time.
  • The pi-circuit uses the decompressed raw_public_inputs as its witness.

The new approach:
(attached diagram: "Untitled drawing")

  • We get raw_public_inputs from the witness.
  • The compress-circuit compresses raw_public_inputs into compressed_raw_public_inputs, and constrains rlc(compressed_raw_public_inputs) == rlc(public_inputs:layer1).
  • The pi-circuit doesn't need to change at all.

So the main difference between the two approaches is that the new approach compresses the pi in-circuit and removes one compression step in taiko-node.
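For reference, the rlc(...) used in these constraints is just a Horner-style accumulator over the field; the modulus and challenge below are illustrative placeholders, not the actual circuit values:

```python
MODULUS = (1 << 255) - 19  # placeholder prime, not the real circuit field

def rlc(data: bytes, r: int, p: int = MODULUS) -> int:
    """Random linear combination of a byte stream with challenge r."""
    acc = 0
    for byte in data:
        acc = (acc * r + byte) % p  # fold each byte in with challenge r
    return acc
```

Layer 1 and the circuit each compute this accumulator over the same compressed stream with the same challenge r, so equal accumulators imply (with overwhelming probability) equal streams.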

cc @xiaodino

@Brechtpd
Author

Not completely sure I understand what the difference is. You mean the witness data sent to the circuit from the node software is the uncompressed data instead of the compressed data? If so, yeah, that makes sense. Though the node will still have to support compression (for block submission) and decompression (for reading the transactions from the chain). But for the data sent to the prover, compressed and uncompressed seem equivalent, so whichever is easier to support (which I guess is the uncompressed data, because it doesn't need any changes in the node's witness generation).

As for the circuit, it doesn't seem to matter whether you see it as compressing or decompressing the data, because it's the exact same circuit for both; you probably just stress the compress/decompress distinction because of how you look at it?

In the diagram, there's an arrow from zk-evm -> layer1 with (witness:raw_public_inputs). raw_public_inputs is not the compressed data, but we're sending that to layer 1? It should be compressed_public_inputs, right?

Also, the pi-circuit doesn't need any changes no matter how you look at it, no? There are just some extra constraints in the circuit that make sure rlc(raw_public_inputs) == rlc(public_inputs:layer1), so some extra (de)compression constraints are added, but that's it? Depending on what the circuit gets from the node, it just compresses or decompresses that data, so both can be put into the circuit at witness generation time, while the actual circuit constraints remain unchanged.

@johntaiko

Not completely sure I understand what the difference is. You mean the witness data sent to the circuit from the node software is the uncompressed data instead of the compressed data?

Yes.

(which I guess is the uncompressed data because it doesn't need any changes in the node witness generation).

Yeah, I think this is the main benefit when using uncompressed data.

As for the circuit, it doesn't seem to matter whether you see it as compressing or decompressing the data, because it's the exact same circuit for both; you probably just stress the compress/decompress distinction because of how you look at it?

Totally agree.

In the diagram, there's an arrow from zk-evm -> layer1 with (witness:raw_public_inputs). raw_public_inputs is not the compressed data, but we're sending that to layer 1? It should be compressed_public_inputs, right?

Oh sorry, the diagram has some mistakes. The zkevm gets its witness from the layer2 node; the layer2 node proposes compressed_public_inputs to layer1 for verification.

(updated diagram attached: "Untitled drawing (1)")

Also the pi-circuit doesn't need any changes no matter how you look at it no?

I think so.

There are just some extra constraints in the circuit that make sure rlc(raw_public_inputs) == rlc(public_inputs:layer1)

Do you mean we should constrain rlc(compressed_public_inputs) == rlc(public_inputs:layer1)?
Because the pi data from layer1 is compressed.

@Brechtpd
Author

Do you mean we should constrain rlc(compressed_public_inputs) == rlc(public_inputs:layer1)?
Because the pi data from layer1 is compressed.

Whoops, sorry, my mistake. I meant to say something like rlc(raw_public_inputs) == rlc(decompress(public_inputs:layer1)). Just that the decompressed bitstream matches the current raw_public_inputs in the public input, nothing else interesting.
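Outside the circuit, that check corresponds to something like the following sketch, using zlib's DEFLATE as a stand-in for the real codec; the field modulus and challenge are placeholders:

```python
import zlib

P = (1 << 255) - 19  # placeholder prime, not the real circuit field

def rlc(data: bytes, r: int, p: int = P) -> int:
    acc = 0
    for byte in data:
        acc = (acc * r + byte) % p
    return acc

def check(raw_public_inputs: bytes, layer1_public_inputs: bytes, r: int) -> bool:
    # rlc(raw_public_inputs) == rlc(decompress(public_inputs:layer1))
    decompressed = zlib.decompress(layer1_public_inputs)
    return rlc(raw_public_inputs, r) == rlc(decompressed, r)

raw = b"raw public inputs " * 8
assert check(raw, zlib.compress(raw), r=123456789)
```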

@johntaiko

Whoops, sorry, my mistake. I meant to say something like rlc(raw_public_inputs) == rlc(decompress(public_inputs:layer1)). Just that the decompressed bitstream matches the current raw_public_inputs in the public input, nothing else interesting.

I don't know if I'm missing some context, but I think constraining rlc(raw_public_inputs) == rlc(decompress(public_inputs:layer1)) is not a good approach, because the pi from layer1 is rlc(compressed_public_inputs:layer1).

So if we constrain the equivalence between the two RLCs of decompressed data, then the RLC of the compressed data from layer1 is redundant.

@Brechtpd
Author

My way of writing that was, I guess, still unfortunate :). I'm just trying to say that the circuit constrains the data in the raw_public_inputs column to match the decompressed data from the compressed_public_inputs column. But no matter how that is done, we still need to make sure that the input data (which will be the compressed data) matches the data the smart contract expects as the public input.

@johntaiko

johntaiko commented Jan 12, 2023

I don't have permission to associate PR #55, so I'm just replying here. (@Brechtpd, could you add me to the zkevm group?)

There are still some unresolved TODOs marked in the PR, and I think they will be done soon.

But I have no idea how to split the compressed pi into a stream of 32-byte words for the RLC encoding in the circuit; I need help. @Brechtpd
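One straightforward way to split the compressed bytes into 32-byte words is sketched below; zero-padding the final word is an assumed convention here, not something settled in this thread:

```python
def to_words(data: bytes, word_size: int = 32) -> list[bytes]:
    """Split a byte stream into fixed-size words, zero-padding the last one."""
    return [data[i:i + word_size].ljust(word_size, b"\x00")
            for i in range(0, len(data), word_size)]

words = to_words(b"\x11" * 40)
assert len(words) == 2 and all(len(w) == 32 for w in words)
```

Whatever padding rule is chosen, the RLC (or the length) has to distinguish padding from real trailing zero bytes, so the byte length likely needs to be committed to as well.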

@johntaiko johntaiko linked a pull request Jan 12, 2023 that will close this issue
@Brechtpd Brechtpd linked a pull request Jan 12, 2023 that will close this issue