We are partnering with zkPass to create an interactive proof of IRS-reported taxable income, obtained from the IRS website (https://www.irs.gov). The proof is then used to establish accredited investor status under the most commonly used financial criterion for individuals:
Income over $200,000 (individually) or $300,000 (with spouse or partner) in each of the prior two years, and reasonably expects the same for the current year
This can be done with privacy and integrity by having the user interact with the data requestor as follows:
- prove, using the zkPass 3P-TLS protocol, that they received the account transcripts for the prior two years, which would be two PDF documents, from the IRS Transcript Delivery System (TDS). A redacted and desensitized sample of the author's 2022 IRS account transcript can be found here. Since account transcripts contain sensitive personal information, such as the last four digits of the Social Security number (SSN) and an abbreviated home address, the transcripts are not revealed to the data requestor.
- prove, using the zkPass data processing protocol (here with the RISC Zero backend), that the two account transcripts:
  - match the user's profile
  - are new documents just issued by the IRS
  - match the requested years (such as 2022)
  - show a taxable income larger than US$200,000
This can be illustrated with the following diagram.
flowchart LR
A[Login to IRS] --> B{zkPass<br/>3P-TLS<br/>protocol}
B --> C[Retrieve source-authenticated<br/>account transcripts]
C --> D{zkPass<br/>data processing<br/>protocol}
D --> E[Proof of the income data in<br/>the account transcripts]
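The four checks in the second step can be pictured as a plain predicate over facts extracted from the two transcripts. The following is a minimal sketch and not the code in this repository: the `TranscriptFacts` and `Request` types, their field names, and the freshness window are hypothetical, and the threshold follows the individual criterion quoted above.

```rust
/// Facts extracted from one account transcript (hypothetical field names).
#[derive(Debug, Clone)]
pub struct TranscriptFacts {
    /// Identifier printed on the transcript, e.g. the last four SSN digits.
    pub taxpayer_id_suffix: String,
    /// Tax year the transcript covers.
    pub tax_year: u16,
    /// When the IRS generated the document, as a Unix timestamp.
    pub issued_unix: u64,
    /// Taxable income in whole US dollars.
    pub taxable_income_usd: u64,
}

/// What the data requestor asks the user to prove (hypothetical).
pub struct Request<'a> {
    pub expected_id_suffix: &'a str,
    pub years: [u16; 2],
    pub now_unix: u64,
    pub max_age_secs: u64,
    pub min_income_usd: u64,
}

/// Returns true only if both transcripts pass all four checks:
/// profile match, requested year, freshness, and income above the threshold.
pub fn meets_income_test(req: &Request, transcripts: &[TranscriptFacts; 2]) -> bool {
    req.years.iter().zip(transcripts.iter()).all(|(year, t)| {
        t.taxpayer_id_suffix == req.expected_id_suffix
            && t.tax_year == *year
            && req.now_unix.saturating_sub(t.issued_unix) <= req.max_age_secs
            && t.taxable_income_usd > req.min_income_usd
    })
}
```

In a zero-knowledge setting, only the boolean result would need to leave the prover, so the identifier, dates, and exact income can stay private.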
zkPass is a full-stack solution for data ownership. It consists of various tools and applications that enable verifiable data sharing, with privacy and integrity guarantees.
- Data-feed:
  - HTTPS web connections (with 3P-TLS)
  - Electronic passports
  - DKIM emails
  - Digitally signed PDFs
- Data-processing:
- Data-consuming:
  - On-chain identities
  - Off-chain verification
Currently, the zkPass testnet already supports a long list of data feeds, spanning internet companies, traditional industries, and governments.
- banks and governments: Nagarik App, ANZ Bank, Australian myGov
- education: Coursera, Hubspot Academy
- video games: GOG.com
- real-world identities and assets: Ferrari, Uber
- cryptocurrency exchanges: OKX, Binance
- social platforms: Instagram, Twitter, Quora, TikTok, Medium, Reddit, Discord
This repository aims to add the IRS to the list, but the techniques presented here (PDF proofs) generalize to many other settings.
- An example is CeDiploma, an electronic diploma provider with clients including Stanford and UC Berkeley, which embeds the digital signature in the PDF.
- Another example is Docusign. The signed document together with the summary document, which are two PDF documents, can be used to prove that signers with those email addresses made the signature. See here for an example SAFT agreement (from https://saft-project.org/) and its eSignature summary.
Further development of our PDF proofs would enable this large class of applications.
For the IRS, we rely on the zkPass 3P-TLS protocol to prove the connection with the IRS website, which works as follows.
- The user makes the usual HTTPS connection with the IRS website, with the encrypted network traffic routed through the validator. The validator here acts like a network proxy or, in layperson's terms, a VPN.
- After the HTTPS connection concludes, the validator asks the user to generate a zero-knowledge proof about the encrypted traffic data. The validator independently verifies the integrity of the TLS connection, through PKI certificates.
- As we discussed above, zkPass supports multiproof: several backends, including RISC Zero, which we use, can generate this zero-knowledge proof. The backend is usually chosen to optimize performance. RISC Zero is suitable for proofs that involve RAM-model computation rather than circuit-model computation.
This style of protocol has been studied for many years, starting with TLSNotary more than a decade ago (now a PSE project funded by the Ethereum Foundation). Academic work, including BlindCA (IEEE S&P 2019), DECO (ACM CCS 2020), Oblivious TLS (CT-RSA 2021), MPCAuth (IEEE S&P 2023), and DiStefano from Brave Browser, has moved it forward.
Note: zkPass also has a version of the 3P-TLS protocol, implemented in their TransGate extension, that additionally secret-shares the TLS keys between the user and the validator. We found it unnecessary in most of today's network environments (IP spoofing on public networks is nearly impossible given additional security mechanisms such as Cloudflare IP hiding and the port randomization now in Linux), and its overhead does not work well for users with slow network connections, such as users in certain firewalled regions.
As part of our partnership, we are working with zkPass specifically on the RISC Zero backend. Here, we briefly compare it with the existing IZK backend and the Groth16 backend.
IZK and RISC Zero are both more general and more performant than Groth16. In particular, Groth16 is not suitable for computation that lacks a fixed pattern (such as parsing a PDF) and cannot easily be expressed as a circuit. IZK and RISC Zero do not suffer from this limitation.
IZK and RISC Zero, however, are close competitors.
- IZK for arithmetic circuits, such as for ZKML, is fairly efficient. As long as the circuit is not enormously large, IZK is a clear winner over RISC Zero.
- IZK for RISC-V is still in its infancy. An academic prototype of RISC-V IZK was presented at ACM CCS 2021, offering a clock rate of 6.6 kHz over a 100 Mbps network. RISC Zero can reach 91.2 kHz with hardware acceleration, but only 12.8 kHz without it. The comparison is not yet apples-to-apples, as IZK is likely communication-bound while RISC Zero is computation-bound.
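To put the clock rates above in perspective, the sketch below converts a cycle count into wall-clock time at each quoted rate. The function and constant names are made up for illustration, and the 1,000,000-cycle workload used in the note afterwards is an arbitrary number rather than a measurement from this repository. The caveat about communication-bound versus computation-bound systems still applies.

```rust
/// Clock rates quoted in the text, in kHz (thousand RISC-V cycles per second).
pub const IZK_RISCV_KHZ: f64 = 6.6;
pub const RISC0_ACCEL_KHZ: f64 = 91.2;
pub const RISC0_CPU_KHZ: f64 = 12.8;

/// Seconds needed to prove `cycles` cycles at a clock rate of `khz`.
pub fn proving_seconds(cycles: u64, khz: f64) -> f64 {
    cycles as f64 / (khz * 1_000.0)
}

/// Ratio of a given clock rate to the RISC-V IZK prototype's rate.
pub fn speedup_over_izk(khz: f64) -> f64 {
    khz / IZK_RISCV_KHZ
}
```

Under these figures, a workload of 1,000,000 cycles would take roughly 152 seconds on the RISC-V IZK prototype, about 78 seconds on RISC Zero without acceleration, and about 11 seconds with acceleration.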
Nevertheless, there are two fundamental differences between IZK and RISC Zero.
- Developer ecosystem as of today. Programming in IZK does not yet have the same developer ecosystem as programming in RISC Zero, where developers can write code in Rust or, more importantly, port existing Rust code, which enables fast prototyping. As complicated applications, such as the one we present below, emerge in zkPass's ecosystem, the RISC Zero backend is a nice addition to zkPass's tech stack.
- Only IZK offers non-transferability (deniability). A lesser-known property that is unique to IZK, but crucial especially for web2 applications, is that the original proofs are non-transferable. That is, a proof can be directed at a designated receiver who can verify its correctness, while everyone else cannot distinguish a real proof from a forged one. This helps with data privacy, as a user may not want the recipient to be able to pass the proof, which relates to the user's data, on to someone else. It may also help avoid liabilities, as such proofs would not be convincing evidence in a court of law.
We have been following the IZK area for a while; see here for our presentation at the Decompute conference, which was held during Token2049 Singapore 2023.
The PDF proof protocol in this repository requires very little domain expertise in zero knowledge. In fact, we wrote the entire thing in Rust, using existing Rust crates (md5, rc4, libflate) out of the box without any RISC-Zero-specific optimization, and then copy-pasted the same Rust code into RISC Zero, where it worked. One can cross-check irs/src/test.rs and irs0/methods/guest/src/main.rs for more detail.
The current implementation of the PDF proofs consists of the following steps.
flowchart LR
A[parsing] --> B[decryption]
B --> C[decompression]
- parsing
  - starting from the end of the file and reading the trailer of the PDF file
    - A fun fact is that a PDF reader is supposed to read the PDF backwards. The code in simple-pdf-parser/src/parser/trailer.rs does so, and it obtains the document ID, the encryption descriptor object's ID, and the offset to the cross-reference table.
  - reading the cross-reference table
    - To find the offset to the encryption descriptor, it goes through the cross-reference table of the PDF and finds the corresponding entry. This is done in simple-pdf-parser/src/parser/xref_tables.rs.
  - reading the encryption descriptor object
    - Now, it goes to the encryption descriptor and reads the information needed for decryption (note that the IRS account transcripts are not password-protected in practice, but the PDF standard still has the data encrypted under a dummy password). This is done in simple-pdf-parser/src/parser/find_encrypt_obj.rs.
  - reading the object that stores the data of interest
    - The IRS account transcript likely stores the data under a specific object ID, so we can jump to it directly using the cross-reference table. This is done in simple-pdf-parser/src/parser/stream.rs.
- decryption
  - computing the PDF workspace key (targeting PDF 1.4)
    - There is a workspace key for the entire PDF, which can be used to derive the key for each object in the document. For the IRS account transcripts that we care about, key derivation uses MD5. The code in simple-pdf-decrypt/src/compute_key/compute_workspace_key.rs does so.
  - computing the object encryption key
    - Now that we have the workspace key, we can derive the object encryption key accordingly. This again uses MD5. The code in simple-pdf-decrypt/src/compute_key/compute_object_key.rs does so.
  - decrypting the object
    - Decryption is done by running the RC4 stream cipher. See simple-pdf-decrypt/src/decrypt/mod.rs.
- decompression
  - inflating the compressed PDF object (a.k.a. FlateDecode)
    - The decrypted data is still unintelligible because it is compressed using the DEFLATE algorithm. We use an existing Rust implementation of the decompression algorithm. See irs/src/lib.rs for how we use the libflate crate.
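The parsing and decryption steps above can be sketched without any dependencies. This is an illustration under stated assumptions and not the repository's code: the function names are hypothetical, the key layout follows the PDF 1.4 standard security handler as we understand it, and the MD5 call itself is left out (the repository uses the md5 and rc4 crates for these steps).

```rust
/// Find the last `startxref` keyword, scanning from the end of the file,
/// and return the cross-reference table offset written after it.
pub fn find_startxref(pdf: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    let pos = pdf.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let digits: Vec<u8> = pdf[pos + KEYWORD.len()..]
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse().ok()
}

/// Build the MD5 input for a per-object key: the workspace key, then the low
/// 3 bytes of the object number and the low 2 bytes of the generation number,
/// least significant byte first.
pub fn object_key_input(workspace_key: &[u8], obj_num: u32, gen_num: u16) -> Vec<u8> {
    let mut input = workspace_key.to_vec();
    input.extend_from_slice(&obj_num.to_le_bytes()[..3]);
    input.extend_from_slice(&gen_num.to_le_bytes());
    input
}

/// The RC4 key is the first min(n + 5, 16) bytes of the MD5 digest,
/// where n is the workspace key length.
pub fn object_key_len(workspace_key_len: usize) -> usize {
    (workspace_key_len + 5).min(16)
}

/// Plain RC4; encryption and decryption are the same operation.
pub fn rc4(key: &[u8], data: &[u8]) -> Vec<u8> {
    assert!(!key.is_empty(), "RC4 key must not be empty");
    // Key-scheduling algorithm: start from the identity permutation.
    let mut s = [0u8; 256];
    for (i, v) in s.iter_mut().enumerate() {
        *v = i as u8;
    }
    let mut j = 0u8;
    for i in 0..256 {
        j = j.wrapping_add(s[i]).wrapping_add(key[i % key.len()]);
        s.swap(i, j as usize);
    }
    // Pseudo-random generation algorithm, XORed into the data.
    let (mut i, mut j) = (0u8, 0u8);
    data.iter()
        .map(|&byte| {
            i = i.wrapping_add(1);
            j = j.wrapping_add(s[i as usize]);
            s.swap(i as usize, j as usize);
            byte ^ s[s[i as usize].wrapping_add(s[j as usize]) as usize]
        })
        .collect()
}
```

A caller would hash `object_key_input(...)` with MD5, truncate the digest to `object_key_len(...)` bytes, and pass that as the key to `rc4` together with the encrypted stream bytes. The hand-written `rc4` here only keeps the sketch dependency-free.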
The rest of the code walks through the body object of the IRS account transcript. This can be illustrated with the following figure. Our example focuses on two fields. One can look up more fields if needed, which would not add much overhead, as most of the RISC-V cycles are spent on key derivation and decompression.
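A field lookup of this kind could look roughly like the sketch below. It rests on several assumptions that are not confirmed by this repository: that the decompressed body is a PDF content stream whose text is drawn with simple `(...) Tj` operators, that a value is the string shown right after its label, and that the label reads `TAXABLE INCOME:`. All function names are hypothetical.

```rust
/// Collect the literal strings shown by `Tj` operators, e.g. `(TOTAL:) Tj`.
/// Simplification: ignores escapes, nested parentheses, hex strings, and `TJ` arrays.
pub fn shown_strings(content: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut rest = content;
    while let Some(open) = rest.find('(') {
        let after_open = &rest[open + 1..];
        let close = match after_open.find(')') {
            Some(close) => close,
            None => break,
        };
        let tail = &after_open[close + 1..];
        if tail.trim_start().starts_with("Tj") {
            out.push(&after_open[..close]);
        }
        rest = tail;
    }
    out
}

/// Return the string that immediately follows `label` in stream order.
pub fn field_after<'a>(strings: &[&'a str], label: &str) -> Option<&'a str> {
    let idx = strings.iter().position(|s| s.trim() == label)?;
    strings.get(idx + 1).copied()
}

/// Parse an amount such as `$123,456.00` into whole dollars, dropping cents.
pub fn parse_dollars(text: &str) -> Option<u64> {
    let cleaned = text.trim().trim_start_matches('$').replace(',', "");
    cleaned.split('.').next()?.parse().ok()
}
```

Looking up an additional field then only costs another pass over the already-decompressed strings, which is consistent with the observation that extra fields add little overhead.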
Proof generation on my Mac Studio (with an M2 Ultra chip) takes about 13 seconds.
To run the code in the irs/ or irs0/ folder, an IRS account transcript is needed. The one we used for internal testing is the author's real 2022 IRS tax transcript, and he is reluctant to include it in the public GitHub repository. US residents should have little difficulty obtaining an IRS account transcript from their online account. If you genuinely need one for testing but cannot obtain one, please reach out to the author through [email protected].
Future optimization of PDF proofs is very plausible. In fact, part of the proof generation can be delegated: the user shares a redacted version of the PDF and only handles the fraction of the proof generation that relates to the sensitive data in the unredacted version (which is like a finishing touch). We are keen to formalize this as "patchwork proofs".
This work is a partnership between L2 Iterative and zkPass, focused on integrating the RISC Zero backend into zkPass.
The code in this repository is, at the moment, specific to the demo of proofs of IRS account transcripts. We have not used much third-party code, so we would like to license it under MIT or Apache 2.0. Future development that introduces new code may call for a different license, in which case it would be updated in future versions of this repository.