refactor!: Rules are loaded into a radix trie instead of a list #1358

Merged
dadrus merged 77 commits into main from feat/radix_trie
Apr 30, 2024

@dadrus dadrus commented Apr 12, 2024

Related issue(s)

closes #652
closes #661
closes #1037
closes #1038

Checklist

  • I agree to follow this project's Code of Conduct.
  • I have read, and I am following this repository's Contributing Guidelines.
  • I have read the Security Policy.
  • I have referenced an issue describing the bug/feature request.
  • I have added tests that prove the correctness of my implementation.
  • I have updated the documentation.

Background

The current implementation for rule management and matching is pretty simple. It is based on a list of precompiled glob or regex expressions. There are however multiple drawbacks/challenges:

  • While the deletion, update or insertion of rules is pretty simple and fast, the lookup is slow: every request iterates over the list of rules (O(n)) and evaluates the expression of each visited rule, which is itself up to O(n) in the worst case. The overall time complexity therefore lies between O(n) and O(n²), which is actually terrible.
  • Matching happens either via regular expressions or glob expressions. Even though glob expressions are much faster than regular expressions, they don't allow for groups or named matches. For that reason, heimdall does not implement an API which could be used in rule pipelines to get access to the matched parts of a path or similar. There are however workarounds (see Named capturing matchers #1038).
  • The <*> glob expression matches all characters except / and ., which are separators in paths and hosts respectively. Used in paths, that expression is quite useless, as it captures path segments containing dots in an unexpected way (see Make single star globs more context-aware and useful #1037).
  • The only way to ensure that more specific rules are matched before more general ones is to place the rules in one rule set and to order them so that the more specific rules come first. That works, but requires explicit knowledge about the implementation (see Add support for closest match while matching the urls for rule selection #661).
  • It does not allow having read and write related functionality for the same URL handled by different rules.

Description

This PR addresses the challenges listed above and replaces the list based rule management implementation, as well as the glob/regex based lookup implementation, with a new implementation based on a radix tree. The new implementation

  • has O(log(n)) time complexity for request to rule matching,
  • allows context based matching using (named) wildcards, with captured values being accessible in the pipeline,
  • does not require a specific ordering of rules,
  • introduces backtracking while matching rules, which was not possible at all before,
  • allows definition of multiple rules for the same path expression, e.g. to have separate rules for read and write requests,
  • allows for further match conditions in the future, e.g. taking specific headers or query parameters into account, and
  • enables usage of context aware glob expressions, which use . as separator for host related expressions and / for path related ones.

The following sections describe the corresponding user facing changes in detail.

Rule Matching Configuration

The old matching related configuration was

match:
  url: <some glob or regex expression>
  strategy: <glob or regex>

The new matching related configuration, introduced by this PR, looks as follows:

match:
  # mandatory
  path: <path expression>

  # optional, inherited from the default rule
  backtracking_enabled: false

  # optional
  with:
    # optional, if not set matches all schemes
    scheme: <either http or https>
    # optional, if not set matches all methods
    methods:
      - <list of HTTP methods to match>
    # optional, if not set matches all hosts
    host_glob: <glob expression>
    # optional, if not set matches all hosts
    host_regex: <regex expression>
    # optional, if not set matches all paths
    path_glob: <glob expression>
    # optional, if not set matches all paths
    path_regex: <regex expression>

With

  • path defining the primary expression used to match the rule,
  • with defining additional conditions, which must all be met for the rule to actually match. These are:
    • scheme - the HTTP scheme to match. If not set, both "http" and "https" schemes are accepted.
    • methods - the list of HTTP methods to match. If not set, all methods are accepted.
    • host_glob and host_regex - additional expressions to validate the host value against. The two are mutually exclusive. If not set, any host is accepted.
    • path_glob and path_regex - additional expressions to further narrow down the URL path of the given request after the path expression has matched. The two are mutually exclusive as well. If not set, any path is accepted.
  • backtracking_enabled configuring the backtracking behavior if the additional conditions fail. If enabled, the lookup in the radix tree will traverse back to a less specific path expression and potentially match a less specific rule.

There may be multiple rules with the same path expression, but different additional conditions, e.g. different required methods.

Here is an example:

match:
  path: /abs/foo/:something
  with:
    scheme: https
    methods:
      - GET
      - POST
    host_glob: "*.example.com"
    path_regex: "^/abs/foo/(bar|baz)"
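
To illustrate how the additional conditions of the example above could be evaluated once the path expression has matched, here is a minimal Go sketch. It is not heimdall's implementation: the types conditions and request, as well as the helper globToRegexp, are hypothetical names introduced for illustration, and the glob handling covers only * and ** under the assumption that * does not cross the separator (. for hosts).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// globToRegexp is a hypothetical helper that translates a small glob subset
// into a regular expression: "**" matches anything, "*" matches anything
// except the separator sep.
func globToRegexp(glob, sep string) *regexp.Regexp {
	var sb strings.Builder
	sb.WriteString("^")
	for i := 0; i < len(glob); i++ {
		switch {
		case strings.HasPrefix(glob[i:], "**"):
			sb.WriteString(".*")
			i++ // skip the second star
		case glob[i] == '*':
			sb.WriteString("[^" + regexp.QuoteMeta(sep) + "]*")
		default:
			sb.WriteString(regexp.QuoteMeta(glob[i : i+1]))
		}
	}
	sb.WriteString("$")
	return regexp.MustCompile(sb.String())
}

// conditions mirrors the "with" block of a rule; empty fields match everything.
type conditions struct {
	scheme    string
	methods   []string
	hostGlob  string
	pathRegex string
}

// request holds the parts of an incoming request that the conditions inspect.
type request struct {
	scheme, method, host, path string
}

// satisfiedBy reports whether all configured conditions hold for r.
func (c conditions) satisfiedBy(r request) bool {
	if c.scheme != "" && c.scheme != r.scheme {
		return false
	}
	if len(c.methods) > 0 {
		allowed := false
		for _, m := range c.methods {
			allowed = allowed || m == r.method
		}
		if !allowed {
			return false
		}
	}
	if c.hostGlob != "" && !globToRegexp(c.hostGlob, ".").MatchString(r.host) {
		return false
	}
	if c.pathRegex != "" && !regexp.MustCompile(c.pathRegex).MatchString(r.path) {
		return false
	}
	return true
}

func main() {
	c := conditions{
		scheme:    "https",
		methods:   []string{"GET", "POST"},
		hostGlob:  "*.example.com",
		pathRegex: "^/abs/foo/(bar|baz)",
	}
	fmt.Println(c.satisfiedBy(request{"https", "GET", "api.example.com", "/abs/foo/bar"}))
	fmt.Println(c.satisfiedBy(request{"https", "DELETE", "api.example.com", "/abs/foo/bar"}))
}
```

In this sketch a request only satisfies the rule if every configured condition holds; a condition that is not set accepts any value.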

The configuration of the default_rule has been extended as well:

# in heimdall config file
default_rule:
  # optional, defaults to false
  backtracking_enabled: false

  # other properties
  ...

For security reasons backtracking is disabled by default. It can be enabled globally on the default rule level, and also enabled or disabled on the level of a particular upstream rule.

Expressions

The previous section mentions glob and regex expressions, which can be used to further constrain the host (including the port) and the path of a request. All of these expression types were already shown in the example above. The path expression additionally allows the usage of wildcards when specifying path segments.

There are two types of wildcards available:

  • free wildcard, which can be defined using * and
  • single wildcard, which can be defined using :

Both can be named and unnamed. Named wildcards allow accessing the matched segments in the pipeline of the rule using the defined name as a key. An unnamed free wildcard is defined as ** and an unnamed single wildcard as :*. A named wildcard uses an identifier instead of the *, e.g. *name for a free wildcard and :name for a single wildcard.

The value of the path segment (or path segments, for a free wildcard) available via the wildcard name is decoded. E.g. if you define the path to be matched in a rule as /file/:name, and the actual path of the request is /file/%5Bid%5D, you'll get [id] when accessing the captured path segment via the name key.

There are some simple rules, which must be followed while using wildcards:

  • One can use as many single wildcards as needed, in any segment.
  • A segment must start with : or * to define a wildcard.
  • No segments are allowed after a free wildcard, regardless of whether it is named or not.
  • If a regular segment must start with : or *, but should not be considered as a wildcard, it must be escaped with \.

Here are some path examples:

  • /apples/and/bananas - Matches exactly the given path
  • /apples/and/:something - Matches /apples/and/bananas, /apples/and/oranges and the like, but not /apples/and/bananas/andmore or /apples/or/bananas. Since a named single wildcard is used, the actual value of the path segment matched by :something can be accessed in the rule pipeline using something as a key.
  • /apples/and/some:thing - Matches exactly /apples/and/some:thing, as the : is not at the start of the segment
  • /apples/and/some** - Matches exactly /apples/and/some**, as the * is not at the start of the segment
  • /apples/:junction/:something - Similar to /apples/and/:something above, but also matches /apples/or/bananas in addition to /apples/and/bananas and /apples/and/oranges.
  • /apples/** - Matches any path starting with /apples/
  • /apples/*remainingpath - Same as above, but uses a named free wildcard
  • /apples/**/bananas - Is invalid, as there is a path segment after a free wildcard
  • /apples/\*remainingpath - Matches exactly /apples/*remainingpath
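
The wildcard rules and path examples above can be sketched as a small segment-by-segment matcher in Go. This is only an illustration of the described semantics, not the radix tree used by this PR: matchPath, splitPath and store are hypothetical helpers, an expression with segments after a free wildcard is simply treated as non-matching here (whereas the text above calls it invalid), and captured values are decoded with url.PathUnescape from the standard library.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitPath splits a path or path expression into its segments.
func splitPath(p string) []string {
	return strings.Split(strings.TrimPrefix(p, "/"), "/")
}

// store records a decoded capture unless the wildcard is unnamed ("" or "*").
func store(captures map[string]string, name, raw string) {
	if name == "" || name == "*" {
		return
	}
	if decoded, err := url.PathUnescape(raw); err == nil {
		raw = decoded
	}
	captures[name] = raw
}

// matchPath is a hypothetical helper: it matches rawPath (as received over
// the wire) against a path expression and returns the named captures.
func matchPath(expr, rawPath string) (map[string]string, bool) {
	exprSegs, pathSegs := splitPath(expr), splitPath(rawPath)
	captures := map[string]string{}

	for i, seg := range exprSegs {
		if i >= len(pathSegs) {
			return nil, false
		}
		switch {
		case strings.HasPrefix(seg, "*"):
			// Free wildcard: only accepted as the last segment, consumes the rest.
			if i != len(exprSegs)-1 {
				return nil, false
			}
			store(captures, seg[1:], strings.Join(pathSegs[i:], "/"))
			return captures, true
		case strings.HasPrefix(seg, ":"):
			// Single wildcard: consumes exactly one non-empty segment.
			if pathSegs[i] == "" {
				return nil, false
			}
			store(captures, seg[1:], pathSegs[i])
		default:
			// Literal segment; a leading backslash escapes ':' or '*'.
			if pathSegs[i] != strings.TrimPrefix(seg, `\`) {
				return nil, false
			}
		}
	}
	if len(pathSegs) != len(exprSegs) {
		return nil, false
	}
	return captures, true
}

func main() {
	captures, ok := matchPath("/apples/:junction/:something", "/apples/or/bananas")
	fmt.Println(ok, captures["junction"], captures["something"])

	captures, ok = matchPath("/file/:name", "/file/%5Bid%5D")
	fmt.Println(ok, captures["name"])
}
```

Note that literal segments are compared against the encoded path, while only the captured values are decoded, which mirrors the behavior described for named wildcards.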

How to migrate match expressions of old rules to new ones

  • If you had something like url: http://127.0.0.1:9090/foo/<**>, you can replace it with
    match:
      path: /foo/**
      with:
        scheme: http
        host_glob: 127.0.0.1:9090
      
  • If you had something like url: http://127.0.0.1:9090/<{,**.css,**.js,**.ico}>, you need two rules now, one to match / and one to match the resources:
    # in one rule
    match:
      path: /
      with:
        scheme: http
        host_glob: 127.0.0.1:9090
      
    
    # in the other rule
    match:
      path: /**
      with:
        path_glob: "{/**.css,/**.js,/**.ico}"
        scheme: http
        host_glob: 127.0.0.1:9090
  • If you had something like url: http://<**>/profile/api, you can replace it with
    match:
      path: /profile/api
      with:
        scheme: http

Rule Matching Specificity & Backtracking

As written above, before this PR, the only way to ensure that more specific rules are matched before more general ones was to place the rules in one rule set and to order them with the more specific rules placed before the more general ones.

This PR makes that requirement obsolete. The implementation ensures that more specific path expressions are matched first, regardless of the placement of rules in a rule set. Indeed, the more specific rules are matched first even if the corresponding rules are defined in different rule sets. This PR also introduces optional backtracking for rule matching, which extends the existing capabilities related to defaults.

The following example demonstrates the aspects described above.

Imagine there are the following three rules:

  • rule 1

    id: rule1
    match:
      path: /files/**
    execute:
      - <pipeline definition>
  • rule 2

    id: rule2
    match:
      path: /files/:team/:name
      with:
        path_regex: ^/files/(team1|team2)/.*
      backtracking_enabled: true
    execute:
      - <pipeline definition>
  • rule 3

    id: rule3
    match:
      path: /files/team3/:name
    execute:
      - <pipeline definition>

The request to /files/team1/document.pdf will be matched by the rule with id rule2, as it is more specific than rule 1. So the pipeline of rule 2 will be executed.

The request to /files/team3/document.pdf will be matched by rule 3, as it is more specific than rules 1 and 2. Again, the corresponding pipeline will be executed.

The request to /files/team4/document.pdf will initially be matched by the path expression of rule 2 as well, but the regular expression ^/files/(team1|team2)/.* will fail. Since backtracking is enabled for that rule, backtracking will start and the request will be matched by rule 1.

This allows not only providing additional fallbacks (defaults), but also reducing and simplifying rules. Here is an additional example:

Imagine you have a pretty complex rule, which covers read and write access to a resource:

version: "1alpha3"
name: articles rules
rules:
  - id: poc:articles:articles_access
    match:
      url: http://127.0.0.1:9090/articles/<**>
    methods:
      - GET
      - POST
      - PUT
    execute:
      - authenticator: kratos_session
      - authenticator: anonymous
      - contextualizer: subscription_plan
        if: Subject.ID != "anonymous"
      - contextualizer: user_article_stats
        if: Subject.ID != "anonymous"
      - contextualizer: opa
        if: Request.Method == "GET"
        config: {values: {policy: can_read}}
      - authorizer: deny_all
        if: Subject.ID == "anonymous" && (Request.Method == "POST" || Request.Method == "PUT")
      - authorizer: opa
        if: Request.Method == "POST" || Request.Method == "PUT"
        config: { values: {policy: can_write}}

You can now split it into two rules:

version: "1alpha4"
name: articles rules
rules:
  - id: poc:articles:articles_read_access
    match:
      path: /articles/:*
      with:
        methods: [ GET ]
    execute:
      - authenticator: kratos_session
      - authenticator: anonymous
      - contextualizer: subscription_plan
        if: Subject.ID != "anonymous"
      - contextualizer: user_article_stats
        if: Subject.ID != "anonymous"
      - contextualizer: opa
        config: {values: {policy: can_read}}

  - id: poc:articles:articles_write_access
    match:
      path: /articles/:*
      with:
        methods: [ PUT, POST ]
    execute:
      - authenticator: kratos_session
      - contextualizer: subscription_plan
      - contextualizer: user_article_stats
      - authorizer: opa
        config: { values: {policy: can_write}}

which is much easier to understand and also to test. Rules for the same path expression must come from the same rule set though. That way, the rule settings cannot be overwritten maliciously by another rule.

Since multiple rules with the same path expression might be present in a rule set, multiple rules could be matched based on their additional conditions definitions. Here is an example:

- id: rule1
  match:
      path: /articles/:id
      with:
        methods: [ POST ]
  execute:
    - <pipeline definition>

- id: rule2
  match:
      path: /articles/:id
      with:
        methods: [ POST ]
  execute:
    - <pipeline definition>

Such conflicting configurations cannot be avoided while loading a rule set and there might be valid reasons to have different rules with more specific additional conditions for the same path expression as well. For that reason, heimdall will use the first matching rule when the incoming request is matched by multiple rules.
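
The "first matching rule wins" behavior for rules sharing a path expression can be sketched in Go as follows. The sketch assumes that "first" refers to the order in which the rules are defined and only considers the methods condition; the names rule, acceptsMethod and selectRule are hypothetical.

```go
package main

import "fmt"

// rule is reduced to what matters here: an id and the methods condition.
type rule struct {
	id      string
	methods []string // empty means "all methods"
}

// acceptsMethod reports whether the rule's methods condition allows method.
func acceptsMethod(r rule, method string) bool {
	if len(r.methods) == 0 {
		return true
	}
	for _, m := range r.methods {
		if m == method {
			return true
		}
	}
	return false
}

// selectRule returns the first rule, in definition order, accepting method.
func selectRule(rules []rule, method string) (string, bool) {
	for _, r := range rules {
		if acceptsMethod(r, method) {
			return r.id, true
		}
	}
	return "", false
}

func main() {
	// Both rules share the same path expression and the same condition.
	conflicting := []rule{
		{id: "rule1", methods: []string{"POST"}},
		{id: "rule2", methods: []string{"POST"}},
	}
	id, ok := selectRule(conflicting, "POST")
	fmt.Println(id, ok)
}
```

The same selection also covers the read/write split shown earlier: a GET request picks the rule restricted to GET, while PUT and POST requests pick the write rule.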

Path Segments Encoding

Unlike previously, the rules are now matched by traversing a radix tree. That means, rule specific settings cannot be taken into account as long as a rule is not found in the tree.

That also means,

  • if you define the path to be matched for a rule as /foo/bar, it will never match /foo%2Fbar (%2F is an encoded slash) and vice versa, and
  • if you define the path to be matched for a rule as /foo/[id], it will never match, as path segments are typically encoded and the actual path looks like /foo/%5Bid%5D.

In other words, you must specify the expected path as it comes over the wire, as long as you're not using wildcards.

Beyond that, the semantics of allow_encoded_slashes (introduced in #1071) have not been changed.
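
The difference between the path as it comes over the wire and its decoded form can be made visible with Go's net/url package. The following sketch is not taken from heimdall; wirePath and literalMatch are hypothetical helpers, and example.com is a placeholder host. It compares a literal path expression against the escaped path of a request URL.

```go
package main

import (
	"fmt"
	"net/url"
)

// wirePath returns the path of rawURL in its encoded form, i.e. as it was
// sent over the wire, using URL.EscapedPath from the standard library.
func wirePath(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return u.EscapedPath(), nil
}

// literalMatch reports whether a wildcard-free path expression equals the
// encoded path of the request URL.
func literalMatch(expr, rawURL string) bool {
	p, err := wirePath(rawURL)
	return err == nil && p == expr
}

func main() {
	requests := []string{
		"http://example.com/foo%2Fbar",
		"http://example.com/foo/%5Bid%5D",
	}
	exprs := []string{"/foo/bar", "/foo%2Fbar", "/foo/[id]", "/foo/%5Bid%5D"}

	for _, r := range requests {
		for _, e := range exprs {
			fmt.Printf("%-14s vs %-32s -> %v\n", e, r, literalMatch(e, r))
		}
	}
}
```

In this sketch only the encoded expressions /foo%2Fbar and /foo/%5Bid%5D match their respective requests, which corresponds to the two bullet points above.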

Access to Matched Values

As written above, the usage of named wildcards enables access to matched values in rule pipelines. The corresponding key value pairs are available in the .Request.URL.Captures object. Here is an example (similar to the one described in #1038 as suggestion for the new API):

rules:
  - id: rule:1
    match:
      path: /files/:uuid/delete
      with:
        host_glob: hosty.mchostface
    execute:
      - authorizer: openfga_check
        config:
          payload: |
            {
              "user": "{{ .Subject.ID }}",
              "relation": "can_delete",
              "object": "file:{{ .Request.URL.Captures.uuid }}"
            }

Breaking Changes Introduced by this PR

  • the definition of the match property is completely different (See the section "Rule Matching Configuration" above).
  • the methods property has been moved into the redesigned match object (there under the with property) and is optional. The configured HTTP verbs are therefore used while matching the rule and not after the rule has been matched, allowing for the definition of different rules for the same path.
  • the default rule does not have a methods property any more. That means, heimdall will never respond with 405 Method Not Allowed any more.
  • Since 405 Method Not Allowed is not returned by heimdall any more, there is no way to overwrite the corresponding response code to something else. So, support for respond.with.method_error in the configuration of the decision and proxy services has been dropped.
  • unlike previously, the rules are matched by traversing a radix tree. That means, you must specify the path expression so that it corresponds to the value coming over the wire (See the section "Path Segments Encoding" above).
  • Support for rule_path_match_prefix on endpoint configurations for http_endpoint and cloud_blob providers has been dropped. Same functionality is now given by allowing rules for the same path expressions to come from the same rule set only.
  • Default rule rejects requests with encoded slashes in the path of the URL with 400 Bad Request.

BEGIN_COMMIT_OVERRIDE
perf: O(log(n)) time complexity for lookup of rules (#1358)
feat: Support for free and single (named) wildcards for request path matching and access of the captured values from the pipeline (#1358)
feat: Support for backtracking while matching rules (#1358)
feat: Multiple rules can be defined for the same path, e.g. to have separate rules for read and write requests (#1358)
feat: Glob expressions are context aware and use . for host related expressions and / for path related ones as separators (#1358)
refactor!: Rule matching configuration API redesigned (#1358)
refactor!: Default rule rejects requests with encoded slashes in the path of the URL with 400 Bad Request (#1358)
refactor!: Support for rule_path_match_prefix on endpoint configurations for http_endpoint and cloud_blob providers has been dropped (#1358)
END_COMMIT_OVERRIDE


codecov bot commented Apr 12, 2024

Codecov Report

Attention: Patch coverage is 97.69585% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 89.69%. Comparing base (2766094) to head (6147096).
Report is 1 commits behind head on main.

Files                                Patch %   Lines
internal/rules/repository_impl.go    90.72%    6 Missing and 3 partials ⚠️
cmd/validate/ruleset.go              40.00%    3 Missing ⚠️
internal/rules/config/rule.go        85.71%    1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1358      +/-   ##
==========================================
+ Coverage   89.28%   89.69%   +0.40%     
==========================================
  Files         270      270              
  Lines        8870     9051     +181     
==========================================
+ Hits         7920     8118     +198     
+ Misses        703      691      -12     
+ Partials      247      242       -5     

@dadrus dadrus changed the title wip: Radix trie for rule management and fast lookup feat: Radix trie for rule management and fast lookup Apr 29, 2024
@dadrus dadrus changed the title feat: Radix trie for rule management and fast lookup refactor!: Rules are loaded into a radix trie instead of a list Apr 29, 2024
@dadrus dadrus self-assigned this Apr 29, 2024
@dadrus dadrus merged commit f2f6867 into main Apr 30, 2024
27 checks passed
@dadrus dadrus deleted the feat/radix_trie branch April 30, 2024 17:52
@dadrus dadrus mentioned this pull request Sep 10, 2024