Support CelOptions.maxRegexProgramSize(int) to limit RE2 program size #545

Closed
@sergiitk

Description

Feature request checklist

  • There are no issues that match the desired change
  • The change is large enough it can't be addressed with a simple Pull Request
  • If this is a bug, please file a Bug Report.

Change

To make CEL environment setup consistent across CEL implementations, I propose adding CelOptions.maxRegexProgramSize(int) to CEL-Java. This option should work similarly to InterpreterOptions.regex_max_program_size in CEL-Cpp (see nuances below).

RE2 program size should be verified when the CEL program is created from the AST (AFAIK this is how cel-cpp works).

Nuances

The program size represents a very approximate measure of a regexp's "cost". There are no guarantees on the implementation details or claims about the properties of the program size (except "larger numbers are more expensive than smaller numbers").

Currently, the program size equals the number of instructions in the compiled regex program. However, that number depends on the concrete RE2 implementation, which has the following implication:

Important

There's no guarantee that the RE2 program size has exactly the same value in C++, Go, and Java.
We should communicate this in the docs.

For example:

pattern     cpp re2-2024-07-02   java re2j v1.8   go v1.24.1
""          4                    3                3
"a"         5                    3                3
"^"         2                    3                3
"^$"        2                    4                4
"a+b"       7                    5                5
"a+b?"      8                    6                6
"(a+b)"     9                    7                7
"a+b.*"     15                   7                7
"(a+b?)"    10                   8                8
# (go v1.24.1 is identical to re2j v1.8)
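
The practical consequence is that one numeric limit can accept a pattern in one implementation and reject it in another. The sketch below replays the C++ and Java sizes listed above against a single limit. The class and method names are hypothetical, and it assumes a pattern is rejected only when its program size is strictly greater than the limit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ProgramSizeComparison {
  // Program sizes copied from the figures above: {cpp re2-2024-07-02, java re2j v1.8}.
  static final Map<String, int[]> SIZES = new LinkedHashMap<>();

  static {
    SIZES.put("", new int[] {4, 3});
    SIZES.put("a", new int[] {5, 3});
    SIZES.put("^", new int[] {2, 3});
    SIZES.put("^$", new int[] {2, 4});
    SIZES.put("a+b", new int[] {7, 5});
    SIZES.put("a+b?", new int[] {8, 6});
    SIZES.put("(a+b)", new int[] {9, 7});
    SIZES.put("a+b.*", new int[] {15, 7});
    SIZES.put("(a+b?)", new int[] {10, 8});
  }

  // Assumption: rejection happens only when the size is strictly above the limit.
  static boolean rejected(int programSize, int limit) {
    return programSize > limit;
  }

  // True when C++ and Java would disagree on this pattern for the given limit.
  static boolean disagree(String pattern, int limit) {
    int[] sizes = SIZES.get(pattern);
    return rejected(sizes[0], limit) != rejected(sizes[1], limit);
  }

  public static void main(String[] args) {
    int limit = 5;
    for (String pattern : SIZES.keySet()) {
      if (disagree(pattern, limit)) {
        System.out.println("limit " + limit + " disagrees on: \"" + pattern + "\"");
      }
    }
  }
}
```

Under these assumptions, a limit of 5 splits the implementations on "a+b" only: its size is 7 in C++ and 5 in Java. A limit tuned against one implementation therefore needs re-checking before being reused in another.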

Context

Unlike the canonical C++ RE2 implementation, re2j (Java RE2 port) didn't expose the program size in public APIs. To address this inconsistency I made google/re2j#180, and it was recently merged.

When the next re2j version is released, we'll be able to determine the program size using Pattern.programSize().
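
Once that accessor is available, the check itself could be quite small. Below is a minimal sketch, not the proposed CEL-Java implementation: RegexProgramSizeGuard is a hypothetical name, and the size lookup is injected as a function so the sketch does not depend on an unreleased re2j version. In real use the function would presumably wrap Pattern.compile(pattern).programSize().

```java
import java.util.function.ToIntFunction;

final class RegexProgramSizeGuard {
  private final int maxProgramSize;
  // Stand-in for the real lookup, e.g. p -> Pattern.compile(p).programSize() in re2j.
  private final ToIntFunction<String> programSizeOf;

  RegexProgramSizeGuard(int maxProgramSize, ToIntFunction<String> programSizeOf) {
    this.maxProgramSize = maxProgramSize;
    this.programSizeOf = programSizeOf;
  }

  // Throws if the pattern's program size is strictly greater than the limit (assumed semantics).
  void check(String pattern) {
    int size = programSizeOf.applyAsInt(pattern);
    if (size > maxProgramSize) {
      throw new IllegalArgumentException(
          "regex program size " + size + " exceeds limit " + maxProgramSize);
    }
  }

  public static void main(String[] args) {
    // Fake sizer for demonstration only: real sizes come from the regex engine.
    RegexProgramSizeGuard guard = new RegexProgramSizeGuard(5, String::length);
    guard.check("abc"); // within the limit, no exception
    try {
      guard.check("abcdefg");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Injecting the size function keeps the limit logic testable without the regex engine and makes it explicit that the number compared against the limit is implementation-specific.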

Example

// Proposed API: CelOptions.maxRegexProgramSize(int) does not exist in CEL-Java yet.
private static final CelRuntime CEL_RUNTIME = CelRuntimeFactory
    .standardCelRuntimeBuilder()
    .setOptions(CelOptions.current().maxRegexProgramSize(5).build())
    .build();

@Test
public void regex_maxProgramSize() throws Exception {
  CelCompiler celCompiler = CelCompilerFactory.standardCelCompilerBuilder()
      .setResultType(SimpleType.BOOL)
      .build();

  // 'a+b?' has program size 6 in re2j v1.8, which exceeds the limit of 5.
  String expr = "matches('foobar', 'a+b?')";
  CelRuntime.Program program = CEL_RUNTIME.createProgram(celCompiler.compile(expr).getAst());
  CelEvaluationException celErr = assertThrows(CelEvaluationException.class, program::eval);
  assertThat(celErr.getErrorCode()).isEqualTo(CelErrorCode.INVALID_ARGUMENT);
}

Some other known regex program sizes for Java can be found in re2j's PatternTest.java.

Related

CC @l46kok, @TristonianJones
