Adds gen_substrs support #2

Closed
wants to merge 4 commits into from

@ewynx ewynx commented Sep 19, 2024

Description

Note: this builds on the original Noir support that @olehmisar built.

Extraction of 1 or more substrings can be requested in both settings (raw, decomposed):

  • raw: pass a JSON file with -s <SUBSTRS_JSON_PATH> that contains the state transitions that should be revealed
  • decomposed: mark the part as public

Note that the number of substrings to extract is not necessarily known beforehand. For example, for the regex (substr=(a|b)+ )+, where the substring to be extracted is (a|b), the input substr=a substr=b should yield 2 substrings and the input substr=b should yield 1 substring.

The information about where substrings should be extracted within the regex is stored in substrings in the RegexAndDFA. In the raw setting only substring_ranges is filled, so that is the field this implementation uses.

How it works

The adjustments when gen_substrs = true are:

  • the return type of regex_match becomes a Vec of BoundedVec. The length of each substring is not known beforehand, but it is bounded by N. The number of substrings returned is not known beforehand either
  • each substring (BoundedVec) is filled with the input bytes that are read while in the corresponding state. The generated code checks when to start and when to stop filling a substring

If the bool is false, nothing changes.
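
To illustrate the filling logic described above, here is a minimal Rust sketch for the example regex (substr=(a|b)+ )+. It is not the generated Noir code: the DFA is written by hand, the state numbers, the constants PREFIX, REVEAL_STATE and ACCEPT_STATE, and the functions step and extract are all hypothetical, and a plain Vec<u8> stands in for BoundedVec. It assumes that a byte belongs to a substring exactly when the DFA is in the reveal state after reading it.

```rust
// Hand-written DFA for the regex (substr=(a|b)+ )+ (illustrative only).
// States 0..=6 consume the literal prefix "substr=", state 7 expects the
// first a|b, state 8 is the (hypothetical) reveal state, and state 9 is
// reached after the trailing space and is the only accepting state.
const PREFIX: &[u8] = b"substr=";
const REVEAL_STATE: usize = 8;
const ACCEPT_STATE: usize = 9;

// One DFA transition; None means the input does not match.
fn step(state: usize, byte: u8) -> Option<usize> {
    match (state, byte) {
        (s, b) if s < PREFIX.len() && b == PREFIX[s] => Some(s + 1),
        (7, b'a' | b'b') => Some(8),
        (8, b'a' | b'b') => Some(8),
        (8, b' ') => Some(9),
        (9, b's') => Some(1), // start of the next "substr=" repetition
        _ => None,
    }
}

// Runs the DFA and collects one substring per visit to the reveal state.
// Returns None if the input is rejected.
fn extract(input: &[u8]) -> Option<Vec<Vec<u8>>> {
    let mut state = 0;
    let mut substrings: Vec<Vec<u8>> = Vec::new();
    let mut in_substring = false;
    for &byte in input {
        state = step(state, byte)?;
        if state == REVEAL_STATE {
            if !in_substring {
                // Entering the reveal state: start a new substring.
                substrings.push(Vec::new());
                in_substring = true;
            }
            if let Some(current) = substrings.last_mut() {
                current.push(byte);
            }
        } else {
            // Leaving the reveal state: stop filling the current substring.
            in_substring = false;
        }
    }
    if state == ACCEPT_STATE {
        Some(substrings)
    } else {
        None
    }
}
```

With this sketch, extract(b"substr=a substr=b ") gives the two substrings a and b, and extract(b"substr=b ") gives the single substring b, which is why the result has to be a growable collection rather than a fixed-size array.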

Note: for the raw setting the boolean was set to true by default. I changed the default to false, the same as in the decomposed setting. I asked in the TG group whether this is OK (open question).

Testing

Decomposed

Create a file called substring_test.json containing:

{
  "parts":[
      {
          "is_public": false,
          "regex_def": "email was meant for @"
      },
      {
          "is_public": true,
          "regex_def": "[a-z]+"
      },
      {
          "is_public": false,
          "regex_def": "\\. And the next substring:"
      },
      {
          "is_public": true,
          "regex_def": "[a-z]+"
      }
  ]
}

Run

cargo run --bin zk-regex decomposed -d substring_test.json --noir-file-path testfile.nr -g true

Raw

This is the example from the README.md; create ./simple_regex_substrs.json containing:

    {
        "transitions": [
            [
                [
                    2,
                    3
                ]
            ],
            [
                [
                    6,
                    7
                ],
                [
                    7,
                    7
                ]
            ],
            [
                [
                    8,
                    9
                ]
            ]
        ]
    }

Run

cargo run --bin zk-regex raw -r "1=(a|b) (2=(b|c)+ )+d" -s ./simple_regex_substrs.json --noir-file-path simple_regex_substrs.nr -g true

For example, for the input 1=b 2=bbcb 2=c 2=bb d, the corresponding Noir code should return the following 5 substrings: b, bbcb, c, bb, and d.

Note that the second substring is marked by 2 transitions, [6,7] and [7,7], which is needed for the Circom implementation. In Noir, however, we only need to know that the substring is being read while in state 7, so this is the information that is taken from substring_ranges.
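
As a rough sketch of that reduction (not the code in this PR), the transitions from the JSON above can be collapsed to the set of states they lead into. The function name reveal_states and the shape of its argument are hypothetical, and the sketch assumes that the target state of each transition is the one that marks a byte as part of the substring.

```rust
use std::collections::BTreeSet;

// For each substring group, keep only the target state of every
// (from, to) transition. Duplicate targets collapse in the set.
// The input type is an assumption made for this illustration.
fn reveal_states(substring_ranges: &[Vec<(usize, usize)>]) -> Vec<BTreeSet<usize>> {
    substring_ranges
        .iter()
        .map(|transitions| transitions.iter().map(|&(_from, to)| to).collect())
        .collect()
}
```

For the transitions in simple_regex_substrs.json this reduces the three groups to the states 3, 7 and 9 respectively; the two transitions [6,7] and [7,7] of the second group both map to state 7.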

PR Checklist*

  • I have tested the changes locally.
  • I have formatted the changes with Prettier and/or cargo fmt on default settings.

olehmisar and others added 4 commits September 19, 2024 14:13
…aw setting.

The substrings are returned as BoundedVec since we don't know their exact length upfront, but we know they're not longer than N.
To support both settings (decomposed and raw) we have to use `substring_ranges` instead of `substring_boundaries`.
…gex and input. This fix makes sure this is supported.

Changes:
- regex_match returns a Vec of substrings instead of an array with known length
- per state where substrings have to be extracted; add the byte either to a new substring or an already started one

Note that substr_count is used to extract the correct "current" substring from the Vec. This is a workaround - first implementation was using `pop` but this gave an error.
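
The substr_count workaround from the commit message above can be sketched in Rust as follows. This is only an illustration of the pattern, not the generated Noir: the type Bounded and the function append_byte are hypothetical stand-ins, and the sketch assumes the "current" substring is read by index, modified, and written back instead of being popped.

```rust
// Fixed-capacity byte buffer standing in for a BoundedVec with bound N.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounded<const N: usize> {
    storage: [u8; N],
    len: usize,
}

impl<const N: usize> Bounded<N> {
    fn new() -> Self {
        Self { storage: [0; N], len: 0 }
    }

    fn push(&mut self, byte: u8) {
        assert!(self.len < N, "capacity exceeded");
        self.storage[self.len] = byte;
        self.len += 1;
    }

    fn as_slice(&self) -> &[u8] {
        &self.storage[..self.len]
    }
}

// Adds `byte` either to a new substring or to the one already started.
// `substr_count` tracks how many substrings exist, so the current one
// always sits at index `substr_count - 1`.
fn append_byte<const N: usize>(
    substrings: &mut Vec<Bounded<N>>,
    substr_count: &mut usize,
    byte: u8,
    starts_new: bool,
) {
    if starts_new {
        substrings.push(Bounded::new());
        *substr_count += 1;
    }
    assert!(*substr_count > 0, "the first byte must start a new substring");
    let index = *substr_count - 1;
    // Read a copy of the current substring, extend it, and write it back.
    let mut current = substrings[index];
    current.push(byte);
    substrings[index] = current;
}
```

The counter makes the index of the current substring explicit, so no element ever has to be removed from and re-added to the collection.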

ewynx commented Oct 2, 2024

#8 contains this feature, closing this PR.

@ewynx ewynx closed this Oct 2, 2024