refactor!: Rules are loaded into a radix trie instead of a list #1358

dadrus · 2024-04-12T19:56:48Z

Related issue(s)

closes #652
closes #661
closes #1037
closes #1038

Checklist

I agree to follow this project's Code of Conduct.
I have read, and I am following this repository's Contributing Guidelines.
I have read the Security Policy.
I have referenced an issue describing the bug/feature request.
I have added tests that prove the correctness of my implementation.
I have updated the documentation.

Background

The current implementation for rule management and matching is pretty simple. It is based on a list of precompiled glob or regex expressions. There are however multiple drawbacks/challenges:

While the deletion, update, or insertion of rules is simple and fast, the lookup has a worst-case time complexity between O(n) and O(n²): iterating over the list of rules is O(n), and matching each expression costs up to O(n) on top of that for every rule visited. That is actually terrible.
Matching happens either via regular expressions or glob expressions. Even though glob expressions are much faster than regular expressions, they don't allow for groups or named matches. For that reason, heimdall does not implement an API which could be used in rule pipelines to get access to the matched parts of a path or similar. There are, however, workarounds (see Named capturing matchers #1038).
The <*> glob expression matches all characters except / and ., which are separators in paths and hosts respectively. Used in paths, that expression is quite useless, as it captures path segments containing dots in an unexpected way (see Make single star globs more context-aware and useful #1037).
The only way to ensure that more specific rules are matched before more general ones is to place the rules in one rule set and order them with the more specific rules before the more general ones. That works, but requires explicit knowledge of the implementation (see Add support for closest match while matching the urls for rule selection #661).
It does not allow read and write related functionality for the same URL to be handled by different rules.

Description

This PR addresses the challenges described above and replaces the list-based rule management implementation, as well as the glob/regex-based lookup implementation, with a new implementation based on a radix tree. The new implementation

has O(log(n)) time complexity for request to rule matching,
allows context-based matching using (named) wildcards, with captured values being accessible in the pipeline,
does not require a specific ordering of rules,
introduces backtracking while matching rules, which was not possible before,
allows definition of multiple rules for the same path expression, e.g. to have separate rules for read and write requests,
allows for further match conditions in the future, e.g. taking specific headers or query parameters into account,
enables usage of context-aware glob expressions, which use . for host related expressions and / for path related ones as separators.

The following sections describe the corresponding user facing changes in detail.

Rule Matching Configuration

The old matching related configuration was

match:
  url: <some glob or regex expression>
  strategy: <glob or regex>

The new matching related configuration, introduced by this PR looks as follows:

match:
  # mandatory
  path: <path expression>

  # optional, inherited from the default rule
  backtracking_enabled: false

  # optional
  with:
    # optional, if not set matches all schemes
    scheme: <either http or https>
    # optional, if not set matches all methods
    methods:
      - <list of HTTP methods to match>
    # optional, if not set matches all hosts
    host_glob: <glob expression>
    # optional, if not set matches all hosts
    host_regex: <regex expression>
    # optional, if not set matches all paths
    path_glob: <glob expression>
    # optional, if not set matches all paths
    path_regex: <regex expression>

The properties have the following meaning:

path defines the primary expression used to match the rule.
with defines additional conditions which must be met for the rule to actually match. These are:
- scheme - the HTTP scheme to match. If not set, both "http" and "https" schemes are accepted.
- methods - the list of HTTP methods to match. If not set, all methods are matched.
- host_glob and host_regex - additional expressions to validate the host value against. The two are mutually exclusive. If not set, any host is accepted.
- path_glob and path_regex - mutually exclusive as well. They allow narrowing down the URL path of the given request further after it has been matched.
backtracking_enabled configures the backtracking behavior if the additional conditions fail. If enabled, the lookup in the radix tree will traverse back to a less specific path expression and potentially match a less specific rule.

There may be multiple rules with the same path expression but different additional conditions, e.g. required methods.

Here is an example:

match:
  path: /abs/foo/:something
  with:
    scheme: https
    methods:
      - GET
      - POST
    host_glob: "*.example.com"
    path_regex: "^/abs/foo/(bar|baz)"
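To make the interplay of the additional conditions concrete, here is a minimal Python sketch that evaluates the with block from the example above against a request. It is not heimdall's implementation: the names conditions_met and glob_to_regex are hypothetical, the glob translation only understands * and **, and the assumption that * stops at the separator while ** does not is a simplification of the context-aware globbing described in this PR.

```python
import re


def glob_to_regex(pattern, separator):
    """Translate a glob consisting of literals, `*` and `**` into a regex.

    Assumption of this sketch: `*` stops at the separator, `**` does not.
    Other glob syntax (`?`, `[...]`, `{...}`) is deliberately left out.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append(f"[^{re.escape(separator)}]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def conditions_met(conditions, scheme, method, host, path):
    """Check the optional `with` conditions; an unset condition accepts anything."""
    if "scheme" in conditions and conditions["scheme"] != scheme:
        return False
    if "methods" in conditions and method not in conditions["methods"]:
        return False
    if "host_glob" in conditions:
        # Hosts use "." as the separator.
        if not glob_to_regex(conditions["host_glob"], ".").fullmatch(host):
            return False
    if "host_regex" in conditions and not re.search(conditions["host_regex"], host):
        return False
    if "path_glob" in conditions:
        # Paths use "/" as the separator.
        if not glob_to_regex(conditions["path_glob"], "/").fullmatch(path):
            return False
    if "path_regex" in conditions and not re.search(conditions["path_regex"], path):
        return False
    return True


# The `with` block of the example rule above, written as a plain dict.
example = {
    "scheme": "https",
    "methods": ["GET", "POST"],
    "host_glob": "*.example.com",
    "path_regex": "^/abs/foo/(bar|baz)",
}
```

With these definitions, conditions_met(example, "https", "GET", "api.example.com", "/abs/foo/bar") is True, while changing the scheme to "http", the method to "DELETE", or the path to "/abs/foo/qux" makes it False.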

The configuration of the default_rule has been extended as well:

# in heimdall config file
default_rule:
  # optional, defaults to false
  backtracking_enabled: false

  # other properties
  ...

For security reasons backtracking is disabled by default. It can be enabled globally on the default rule level, and also enabled or disabled on the level of a particular upstream rule.

Expressions

The previous section talks about glob and regex expressions, which can be used to further narrow down the host and port, as well as the path matched by the path expression. All of these expression types were already shown in the example above. The path expression additionally allows the use of wildcards when specifying path segments.

There are two types of wildcards available:

free wildcard, which can be defined using * and
single wildcard, which can be defined using :

Both can be named or unnamed. Named wildcards allow access to the matched segments in the pipeline of the rule using the defined name as a key. An unnamed free wildcard is defined as ** and an unnamed single wildcard is defined as :*. A named wildcard uses an identifier instead of the *, e.g. *name for a free wildcard and :name for a single wildcard.

The value of the path segment (or path segments) available via the wildcard name is decoded. E.g. if you define the path to be matched in a rule as /file/:name, and the actual path of the request is /file/%5Bid%5D, you'll get [id] when accessing the captured path segment via the name key.

There are some simple rules which must be followed when using wildcards:

One can use as many single wildcards as needed in any segment
A segment must start with : or * to define a wildcard
No segments are allowed after a free (named) wildcard
If a regular segment must start with : or *, but should not be considered a wildcard, it must be escaped with \.

Here are some path examples:

/apples/and/bananas - Matches exactly the given path
/apples/and/:something - Matches /apples/and/bananas, /apples/and/oranges and the like, but not /apples/and/bananas/andmore or /apples/or/bananas. Since a named single wildcard is used, the actual value of the path segment matched by :something can be accessed in the rule pipeline using something as a key.
/apples/and/some:thing - Matches exactly /apples/and/some:thing
/apples/and/some** - Matches exactly /apples/and/some**
/apples/:junction/:something - Similar to the above, but also matches /apples/or/bananas in addition to /apples/and/bananas and /apples/and/oranges.
/apples/** - Matches any path starting with /apples/
/apples/*remainingpath - Same as above, but uses a named free wildcard
/apples/**/bananas - Is invalid, as there is a path segment after a free wildcard
/apples/\*remainingpath - Matches exactly /apples/*remainingpath
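The wildcard rules and the examples above can be sketched in a few lines of Python. This is an illustration of the described semantics only, not the radix tree used by heimdall: the function name match_path is hypothetical, segments are compared one by one, and the behavior for edge cases that the description leaves open (e.g. empty segments) is an assumption of the sketch.

```python
from urllib.parse import unquote


def match_path(expression, path):
    """Return the captured values if `path` matches `expression`, else None.

    `path` is expected as it comes over the wire (still percent-encoded);
    captured values are decoded, as described above.
    """
    expr_segments = expression.lstrip("/").split("/")
    path_segments = path.lstrip("/").split("/")
    captures = {}

    for index, segment in enumerate(expr_segments):
        if segment.startswith("*"):
            # Free wildcard: must be last and consumes all remaining segments.
            if index != len(expr_segments) - 1:
                raise ValueError("no segments are allowed after a free wildcard")
            rest = path_segments[index:]
            if not rest:
                return None
            if segment != "**":  # unnamed free wildcards capture nothing
                captures[segment[1:]] = unquote("/".join(rest))
            return captures

        if index >= len(path_segments):
            return None

        if segment.startswith(":"):
            # Single wildcard: consumes exactly one segment.
            if segment != ":*":  # unnamed single wildcards capture nothing
                captures[segment[1:]] = unquote(path_segments[index])
            continue

        # Regular segment; a leading "\" escapes ":" or "*".
        literal = segment[1:] if segment[:2] in ("\\:", "\\*") else segment
        if literal != path_segments[index]:
            return None

    return captures if len(expr_segments) == len(path_segments) else None
```

For instance, match_path("/apples/and/:something", "/apples/and/oranges") returns {"something": "oranges"}, match_path("/apples/*remainingpath", "/apples/and/bananas") returns {"remainingpath": "and/bananas"}, and match_path("/file/:name", "/file/%5Bid%5D") returns {"name": "[id]"}. A return value of None means no match, while an empty dict means a match without named captures.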

How to migrate match expressions of old rules to new ones

If you had something like url: http://127.0.0.1:9090/foo/<**>, you can replace it with

match:
  path: /foo/**
  with:
    scheme: http
    host_glob: 127.0.0.1:9090

If you had something like url: http://127.0.0.1:9090/<{,**.css,**.js,**.ico}>, you need two rules now, one to match / and one to match the resources:

# in one rule
match:
  path: /
  with:
    scheme: http
    host_glob: 127.0.0.1:9090
  

# in the other rule
match:
  path: /**
  with:
    path_glob: >
      {/**.css,
       /**.js,
       /**.ico}
    scheme: http
    host_glob: 127.0.0.1:9090

If you had something like url: http://<**>/profile/api, you can replace it with
```
match:
  path: /profile/api
  with:
    scheme: http
```
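For simple cases, the migration steps shown above are mechanical enough to script. The following Python sketch is a hypothetical helper (the name migrate is not part of heimdall) that only handles old url values consisting of a scheme, a literal host or <**>, and a path made of literal segments with an optional trailing <**>. Anything else, such as the resource list example above, is rejected and needs manual migration.

```python
def migrate(url):
    """Convert a simple old-style `url` value into a new-style `match` dict."""
    scheme, _, rest = url.partition("://")
    host, _, path = rest.partition("/")

    if "<" in host and host != "<**>":
        raise ValueError(f"host {host!r} needs manual migration")

    old_segments = path.split("/")
    new_segments = []
    for index, segment in enumerate(old_segments):
        if segment == "<**>" and index == len(old_segments) - 1:
            # A trailing <**> becomes an unnamed free wildcard.
            new_segments.append("**")
        elif "<" in segment:
            raise ValueError(f"segment {segment!r} needs manual migration")
        else:
            new_segments.append(segment)

    conditions = {"scheme": scheme}
    if host != "<**>":
        # A host of <**> matched every host, so no host condition is needed.
        conditions["host_glob"] = host

    return {"path": "/" + "/".join(new_segments), "with": conditions}
```

Calling migrate("http://127.0.0.1:9090/foo/<**>") yields {"path": "/foo/**", "with": {"scheme": "http", "host_glob": "127.0.0.1:9090"}}, which corresponds to the first example above.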

Rule Matching Specificity & Backtracking

As written above, before this PR the only way to ensure that more specific rules were matched before more general ones was to place the rules in one rule set and order them with the more specific rules before the more general ones.

This PR makes that requirement obsolete. The implementation ensures that more specific path expressions are matched first, regardless of the placement of rules in a rule set. Indeed, the more specific rules are matched first even if the corresponding rules are defined in different rule sets. This PR also introduces optional backtracking for rule matching, which extends the existing capabilities related to defaults.

The following example demonstrates the aspects described above.

Imagine there are the following three rules:

rule 1

id: rule1
match:
  path: /files/**
execute:
  - <pipeline definition>

rule 2

id: rule2
match:
  path: /files/:team/:name
  with:
    path_regex: ^/files/(team1|team2)/.*
  backtracking_enabled: true
execute:
  - <pipeline definition>

rule 3

id: rule3
match:
  path: /files/team3/:name
execute:
  - <pipeline definition>

The request to /files/team1/document.pdf will be matched by rule 2, as it is more specific than rule 1. So the pipeline of rule 2 will be executed.

The request to /files/team3/document.pdf will be matched by rule 3, as it is more specific than rules 1 and 2. Again, the corresponding pipeline will be executed.

However, although the path of the request to /files/team4/document.pdf will be matched by rule 2, the regular expression ^/files/(team1|team2)/.* will fail. Since backtracking is enabled, backtracking will start and the request will be matched by rule 1.
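The three outcomes described above can be reproduced with a small Python sketch. It does not walk a radix tree. Instead it sorts the candidate rules from most to least specific, under the assumption that a static segment is more specific than a single wildcard, which in turn is more specific than a free wildcard. The names find_rule, path_matches and specificity are hypothetical, and only path_regex is considered as an additional condition.

```python
import re


def segments(path):
    return path.lstrip("/").split("/")


def path_matches(expression, path):
    """Segment-wise match of a path expression (captures are ignored here)."""
    expr, actual = segments(expression), segments(path)
    for index, segment in enumerate(expr):
        if segment.startswith("*"):
            return len(actual) > index
        if index >= len(actual):
            return False
        if not segment.startswith(":") and segment != actual[index]:
            return False
    return len(expr) == len(actual)


def specificity(expression):
    # Assumed ranking: static segment (2) > single wildcard (1) > free wildcard (0).
    rank = {"*": 0, ":": 1}
    return [rank.get(segment[:1], 2) for segment in segments(expression)]


def find_rule(rules, path):
    """Return the id of the matching rule, or None if no upstream rule matches."""
    candidates = [rule for rule in rules if path_matches(rule["path"], path)]
    candidates.sort(key=lambda rule: specificity(rule["path"]), reverse=True)
    for rule in candidates:
        regex = rule.get("path_regex")
        if regex is None or re.search(regex, path):
            return rule["id"]
        if not rule.get("backtracking_enabled", False):
            # Additional conditions failed and backtracking is off: stop here.
            return None
    return None


RULES = [
    {"id": "rule1", "path": "/files/**"},
    {"id": "rule2", "path": "/files/:team/:name",
     "path_regex": r"^/files/(team1|team2)/.*", "backtracking_enabled": True},
    {"id": "rule3", "path": "/files/team3/:name"},
]
```

Here find_rule(RULES, "/files/team1/document.pdf") returns "rule2", find_rule(RULES, "/files/team3/document.pdf") returns "rule3", and find_rule(RULES, "/files/team4/document.pdf") returns "rule1" because of backtracking. With backtracking_enabled set to False on rule2, the last call returns None, which in heimdall would presumably mean that only the default rule, if any, can still apply.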

This allows not only providing additional fallbacks (defaults), but also further reduction and simplification of rules. Here is an additional example:

Imagine you have a pretty complex rule which covers read and write access to a resource:

version: "1alpha3"
name: articles rules
rules:
  - id: poc:articles:articles_access
    match:
      url: http://127.0.0.1:9090/articles/<**>
    methods:
      - GET
      - POST
      - PUT
    execute:
      - authenticator: kratos_session
      - authenticator: anonymous
      - contextualizer: subscription_plan
        if: Subject.ID != "anonymous"
      - contextualizer: user_article_stats
        if: Subject.ID != "anonymous"
      - contextualizer: opa
        if: Request.Method == "GET"
        config: {values: {policy: can_read}}
      - authorizer: deny_all
        if: Subject.ID == "anonymous" && (Request.Method == "POST" || Request.Method == "PUT")
      - authorizer: opa
        if: Request.Method == "POST" || Request.Method == "PUT"
        config: { values: {policy: can_write}}

You can now split it into two rules:

version: "1alpha4"
name: articles rules
rules:
  - id: poc:articles:articles_read_access
    match:
      path: /articles/:*
      with:
        methods: [ GET ]
    execute:
      - authenticator: kratos_session
      - authenticator: anonymous
      - contextualizer: subscription_plan
        if: Subject.ID != "anonymous"
      - contextualizer: user_article_stats
        if: Subject.ID != "anonymous"
      - contextualizer: opa
        config: {values: {policy: can_read}}

  - id: poc:articles:articles_write_access
    match:
      path: /articles/:*
      with:
        methods: [ PUT, POST ]
    execute:
      - authenticator: kratos_session
      - contextualizer: subscription_plan
      - contextualizer: user_article_stats
      - authorizer: opa
        config: { values: {policy: can_write}}

which are much easier to understand and to test. Rules for the same path expression must come from the same rule set, though. That way, the rule settings cannot be overwritten maliciously by another rule.

Since multiple rules with the same path expression might be present in a rule set, multiple rules could be matched based on the definitions of their additional conditions. Here is an example:

- id: rule1
  match:
    path: /articles/:id
    with:
      methods: [ POST ]
  execute:
    - <pipeline definition>

- id: rule2
  match:
    path: /articles/:id
    with:
      methods: [ POST ]
  execute:
    - <pipeline definition>

Such conflicting configurations cannot be avoided while loading a rule set and there might be valid reasons to have different rules with more specific additional conditions for the same path expression as well. For that reason, heimdall will use the first matching rule when the incoming request is matched by multiple rules.
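As a rough illustration of the "first matching rule wins" behavior, the following Python sketch selects among rules that share a path expression. It assumes that "first" refers to the order in which the rules are defined in the rule set, and it only looks at the methods condition. The function name select_rule is hypothetical.

```python
def select_rule(rules_for_path, method):
    """Return the id of the first rule whose methods condition accepts `method`."""
    for rule in rules_for_path:
        methods = rule.get("methods")
        # An unset methods condition accepts every method.
        if methods is None or method in methods:
            return rule["id"]
    return None


# The two conflicting rules from the example above, reduced to id and methods.
conflicting = [
    {"id": "rule1", "methods": ["POST"]},
    {"id": "rule2", "methods": ["POST"]},
]
```

With this input, select_rule(conflicting, "POST") returns "rule1", so rule2 is never reached, and select_rule(conflicting, "GET") returns None.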

Path Segments Encoding

Unlike previously, the rules are now matched by traversing a radix tree. That means, rule specific settings cannot be taken into account as long as a rule is not found in the tree.

That also means,

if you define the path to be matched for a rule as /foo/bar, it will never match /foo%2Fbar (%2F is an encoded slash) and vice versa, and
if you define the path to be matched for a rule as /foo/[id], it will never match, as path segments are typically encoded and the actual path looks like /foo/%5Bid%5D.

In other words, you must specify the expected path as it comes over the wire, as long as you're not using wildcards.
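One way to find out what a static path looks like on the wire is to percent-encode it segment by segment. The Python sketch below uses urllib.parse.quote from the standard library for that purpose. The helper name wire_form is hypothetical, and real clients may encode some characters differently, so it is worth comparing the result with actual traffic.

```python
from urllib.parse import quote


def wire_form(path):
    """Percent-encode each path segment, keeping the "/" separators intact."""
    return "/".join(quote(segment, safe="") for segment in path.split("/"))


# A slash inside a segment is encoded as %2F and therefore forms a different path.
encoded_slash = "/" + quote("foo/bar", safe="")
```

wire_form("/foo/[id]") gives "/foo/%5Bid%5D", which is the form to use in a rule instead of /foo/[id], and encoded_slash is "/foo%2Fbar", which is a different path than /foo/bar.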

Beyond that, the semantics of allow_encoded_slashes (introduced in #1071) have not been changed.

Access to Matched Values

As written above, the usage of named wildcards enables access to matched values in rule pipelines. The corresponding key-value pairs are available in the .Request.URL.Captures object. Here is an example (similar to the one described in #1038 as a suggestion for the new API):

rules:
  - id: rule:1
    match:
      path: /files/:uuid/delete
      with:
        host_glob: hosty.mchostface
    execute:
      - authorizer: openfga_check
        config:
          payload: |
            {
              "user": "{{ .Subject.ID }}",
              "relation": "can_delete",
              "object": "file:{{ .Request.URL.Captures.uuid }}"
            }

Breaking Changes Introduced by this PR

The definition of the match property is completely different (see the section "Rule Matching Configuration" above).
The methods property has been moved into the redesigned match object (under the with property) and is optional. The configured HTTP verbs are therefore used while matching the rule and not after the rule has been matched, allowing for the definition of different rules for the same path.
The default rule does not have a methods property any more. That means heimdall will never respond with 405 Method Not Allowed any more.
Since 405 Method Not Allowed is not returned by heimdall any more, there is no longer a way to override the corresponding response code. So, support for respond.with.method_error in the configuration of the decision and proxy services has been dropped.
Unlike previously, the rules are matched by traversing a radix tree. That means you must specify the path expression so that it corresponds to the value coming over the wire (see the section "Path Segments Encoding" above).
Support for rule_path_match_prefix on endpoint configurations for http_endpoint and cloud_blob providers has been dropped. The same functionality is now provided by allowing rules for the same path expression to come from the same rule set only.
The default rule rejects requests with encoded slashes in the path of the URL with 400 Bad Request.

BEGIN_COMMIT_OVERRIDE
perf: O(log(n)) time complexity for lookup of rules (#1358)
feat: Support for free and single (named) wildcards for request path matching and access of the captured values from the pipeline (#1358)
feat: Support for backtracking while matching rules (#1358)
feat: Multiple rules can be defined for the same path, e.g. to have separate rules for read and write requests (#1358)
feat: Glob expressions are context aware and use . for host related expressions and / for path related ones as separators (#1358)
refactor!: Rule matching configuration API redesigned (#1358)
refactor!: Default rule rejects requests with encoded slashes in the path of the URL with 400 Bad Request (#1358)
refactor!: Support for rule_path_match_prefix on endpoint configurations for http_endpoint and cloud_blob providers has been dropped (#1358)
END_COMMIT_OVERRIDE

codecov · 2024-04-12T20:03:03Z

Codecov Report

Attention: Patch coverage is 97.69585%, with 15 lines in your changes missing coverage. Please review.

Project coverage is 89.69%. Comparing base (2766094) to head (6147096).
Report is 1 commit behind head on main.

Files	Patch %	Lines
internal/rules/repository_impl.go	90.72%	6 Missing and 3 partials ⚠️
cmd/validate/ruleset.go	40.00%	3 Missing ⚠️
internal/rules/config/rule.go	85.71%	1 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1358      +/-   ##
==========================================
+ Coverage   89.28%   89.69%   +0.40%     
==========================================
  Files         270      270              
  Lines        8870     9051     +181     
==========================================
+ Hits         7920     8118     +198     
+ Misses        703      691      -12     
+ Partials      247      242       -5


dadrus added 2 commits April 12, 2024 21:52

new string utils function to find a common prefix length of two strings

8fcb57c

initial implementation based on work from @davidspek

7700693

dadrus added 10 commits April 13, 2024 15:55

find method ensures unnamed wildcards are not included in the result

23deeaa

linter warnings resolved

ecd178b

first working version

55de764

ruleset version updated to 1alpha4 respectively to v1alpha4 for kubernetes provider

e47a724

ruleset version in helm chart updated

c467a90

validation command test data updated

7a3787e

first updates to examples

be24aa1

first changes to the docs

109c95d

some linter warnings resolved

1256454

Merge branch 'main' into feat/radix_trie

d6d6c8f

github-actions bot reviewed Apr 17, 2024

internal/rules/repository_impl.go: two review comments, both outdated and resolved

dadrus added 2 commits April 20, 2024 20:04

some errors fixed, enhanced internal apis and linter warnings resolved

3c6590f

some code reformatting

77312a0

github-advanced-security bot found potential problems Apr 20, 2024

internal/rules/repository_impl.go: one finding, fixed and resolved

dadrus added 10 commits April 21, 2024 15:05

rule repository updated to cover additional new cases, implementation simplified

655c590

more code comments

8ee0a96

indextree simplifications and refactorings

fc537cc

repository impl updated to comply with the updated indextree api

c6f5790

package indextree renamed to radixtree

672c4db

radixtree package move to x

4dad2ce

further renamings

4f89a3e

only rules from the same rule set can added to the same node

0abc474

better error handling and less dependencies in radix tree impl

effbc98

lookup of requests with escaped parts in URL fixed

55d7c23

github-actions bot reviewed Apr 22, 2024

internal/rules/repository_impl_test.go: one review comment, outdated and resolved

dadrus added 2 commits April 22, 2024 10:27

useless test removed

a94d5a2

code simplifications

6cd57a3

dadrus added 13 commits April 29, 2024 10:23

more docs updates

e70c988

json schema updated - suport for rule_path_match_prefix dropped

dc3f1bf

description of rule_path_match_prefix removed from the documentation of affected providers

18ade03

config reference updated to reflect the changes introduced in this PR

da6787e

more descriptions about rule matching and backtracking

212865e

json schema updated - backtracking_enabled added to the default_rule

5c248ef

radixtree implementation updated to allow switching on or off backtracking

127d4df

CRD updated to allow usage of the new backtracking_enabled property

eab6827

default rule impl updated to support the new property

115b96e

upstream rule config impl updated to allow usage of the new property

07ec621

making use of the new backtracking_enabled property in rule implementation

b371752

configs used for validation tests updated

da82458

example config updated

a8a6cba

dadrus changed the title ~~wip: Radix trie for rule management and fast lookup~~ feat: Radix trie for rule management and fast lookup Apr 29, 2024

dadrus changed the title ~~feat: Radix trie for rule management and fast lookup~~ refactor!: Rules are loaded into a radix trie instead of a list Apr 29, 2024

dadrus added 3 commits April 29, 2024 17:14

docker compose quickstarts heimdall config updated

e0e21c9

made configs better readable by adding empty lines; metallb config updated

767f8ef

notes in examples readme added

b0d0d3d

dadrus self-assigned this Apr 29, 2024

dadrus added 6 commits April 30, 2024 10:22

support for and documentation of the respond.with.method_error removed

7bd03a3

more docu

8622ea2

made backtracking_enabled configurable together with additional conditions only

8b33ec7

documentation updated

2fd38b0

example rule updated

92b368b

better config validation converade in tests

6147096

dadrus merged commit f2f6867 into main Apr 30, 2024
27 checks passed

dadrus deleted the feat/radix_trie branch April 30, 2024 17:52

github-actions bot mentioned this pull request May 6, 2024

chore(main): release 0.15.0 #1405

Merged

dadrus mentioned this pull request Aug 17, 2024

Support for Matching Multiple Paths Using Lists #1718

Closed

3 tasks

dadrus mentioned this pull request Sep 10, 2024

feat: Route based matching of rules #1766

Merged

6 tasks
