Oathkeeper bombards Ory Network with requests after upgrade to 40.x #1161

Withel · 2024-04-25T08:23:40Z

Preflight checklist

I could not find a solution in the existing issues, docs, nor discussions.
I agree to follow this project's Code of Conduct.
I have read and am following this repository's Contribution Guidelines.
I have joined the Ory Community Slack.
I am signed up to the Ory Security Patch Newsletter.

Ory Network Project

https://vibrant-dubinsky-6d1qtx5k0i.projects.oryapis.com

Describe the bug

After upgrading from 39.4 to 40.7, @tricky42 notified us, about huge amount of requests made from oathkeeper to ory network's /introspect endpoint, as can be seen on the first screenshot (Upgrade was performed on 23.04 at ~15:58). From our perspective we don't see any issues in the logs or metrics. On the second screenshot you can see oathkeeper traffic (spike is from the moment of upgrade). We didn't change any configuration between the versions (except the naming of log level), and we suspect that the cache might not be working properly.

Screenshot 1:

Screenshot 2:

Reproducing the bug

Upgrade oathkeeper to 40.7 from 39.x

Relevant log output

No response

Relevant configuration

authenticators:
    oauth2_introspection:
      config:
        cache:
          enabled: true
          max_cost: 100
          ttl: 1h
        introspection_request_headers:
          Authorization: [REDACTED]
        introspection_url: https://[REDACTED]/admin/oauth2/introspect
        retry:
          give_up_after: 3s
          max_delay: 500ms
        token_from:
          header: Authorization
      enabled: true

Version

40.7

On which operating system are you observing this issue?

None

In which environment are you deploying?

Kubernetes with Helm

Additional Context

No response

aeneasr · 2024-04-25T08:24:53Z

Thank you for the report! Can you pin-point which version introduced this regression? It would make the search for the regression much easier!

Withel · 2024-04-25T08:28:01Z

I'm not sure if I understand you correctly, as I wrote in the description, we have upgraded from 39.4 to 40.7 or do you mean something else?

marcinfigiel · 2024-04-25T08:34:33Z

It's worth mentioning that this is our second attempt with the upgrade to 0.40.x. First time we tried with 0.40.6 and had the same effect.

aeneasr · 2024-04-25T10:00:19Z

Since there are a couple of versions between 39.4 and 40.6 I wanted to know if you specifically can pin point which version exactly introduced the issue, making it easier to find the root cause

marcinfigiel · 2024-04-25T10:02:40Z

Since there are a couple of versions between 39.4 and 40.6 I wanted to know if you specifically can pin point which version exactly introduced the issue, making it easier to find the root cause

Unfortunately not, we've only tried these two versions :(

Withel · 2024-05-08T13:08:27Z

@aeneasr we've managed to pin-point exact version which introduces this issue. This happens between v0.39.4 and v0.40.0. Hope this helps, please let me know if you need anything else.

aeneasr · 2024-05-09T10:35:27Z

I think your max_cost is just way too low. I see that we also use a misleading configuration in ristretto. The default max_cost is set to 100000000 which I think is way too high.

Try playing around with that value to see if it has an impact.

aeneasr · 2024-05-09T10:43:43Z

Fix for some of the config values: 2373057

aeneasr · 2024-05-09T11:02:53Z

Basically before we were using the internal cost which is I think the key length + cost function. Since your maxcost is like 100 the cache probably ran out of space after one or two keys so it's constantly evicting your values.

The fixes ignore the internal cost so you actually get 1 cost = 1 token instead of 1 token = 1 cost + cost of key

Withel · 2024-05-09T14:15:42Z

Soooo, with the help of @Demonsthere, we checked following values of the max_cost value:

~15:05 CET: 100M (default) -> 300-400 req/m with initial spike 
~15:35 CET: 50M -> 300-400 req/m
~15:55 CET: 5M -> for a second dropped to ~270 req/m, then increased to ~300-400 req/m

And here's the graph:

I hope this helps, for the time being we will go back to the 39.4 and we'll wait for the further updates. Please let us know if you need anything else.

aeneasr · 2024-05-09T15:15:39Z

So what you're saying is that it doesn't have an effect?

Withel · 2024-05-10T08:29:43Z

So, I must admitt that I made a mistake that I realised just after posting last comment. Unfortunately, I didnt keep an eye on the pods after deployment and turned out that they were not restarting after changing max_count. From what I know oathkeeper does not have configuration hot reload, thus I will have to redo those experiments. Nevertheless I’m not sure when I’ll be able to perform those experiments since right now Im out of the office, but Ill do this as soon as possible. Apologies for that.

aeneasr · 2024-05-10T08:35:23Z

Hot reloading only works for things that can be changed during runtime. Caches unfortunately are large memory objects that are allocated at process start and can not be changed at runtime.

Withel · 2024-05-14T14:19:36Z

So, I have redo the tests (now making sure that after each update, pods are restarted correctly) and here are the results. Values same as before:

Before - 39.4 with max_count set to 100
9:30 CET - 40.7 with max_count set to 100 
~9:55 CET - 40.7 with max_count removed (set to 100M, default)
~10:20 CET - 40.7 with max_count set to 50M
~10:50 CET - 40.7 with max_count set to 5M

Results (thanks to @tricky42 curtesy):

Also, we were suspecting that maybe oathkeeper containers are running out of memory, but we confirmed that it's not the case:

aeneasr · 2024-05-24T10:17:59Z

OK so increasing the cache size fixes the problem?

Withel added the bug Something is not working. label Apr 25, 2024

Demonsthere mentioned this issue May 16, 2024

Upgrade Oathkeeper helm chart 0.41 causes 503 #1165

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Oathkeeper bombards Ory Network with requests after upgrade to 40.x #1161

Oathkeeper bombards Ory Network with requests after upgrade to 40.x #1161

Withel commented Apr 25, 2024

aeneasr commented Apr 25, 2024

Withel commented Apr 25, 2024

marcinfigiel commented Apr 25, 2024

aeneasr commented Apr 25, 2024

marcinfigiel commented Apr 25, 2024

Withel commented May 8, 2024

aeneasr commented May 9, 2024

aeneasr commented May 9, 2024

aeneasr commented May 9, 2024

Withel commented May 9, 2024

aeneasr commented May 9, 2024

Withel commented May 10, 2024

aeneasr commented May 10, 2024

Withel commented May 14, 2024

aeneasr commented May 24, 2024

Oathkeeper bombards Ory Network with requests after upgrade to 40.x #1161

Oathkeeper bombards Ory Network with requests after upgrade to 40.x #1161

Comments

Withel commented Apr 25, 2024

Preflight checklist

Ory Network Project

Describe the bug

Reproducing the bug

Relevant log output

Relevant configuration

Version

On which operating system are you observing this issue?

In which environment are you deploying?

Additional Context

aeneasr commented Apr 25, 2024

Withel commented Apr 25, 2024

marcinfigiel commented Apr 25, 2024

aeneasr commented Apr 25, 2024

marcinfigiel commented Apr 25, 2024

Withel commented May 8, 2024

aeneasr commented May 9, 2024

aeneasr commented May 9, 2024

aeneasr commented May 9, 2024

Withel commented May 9, 2024

aeneasr commented May 9, 2024

Withel commented May 10, 2024

aeneasr commented May 10, 2024

Withel commented May 14, 2024

aeneasr commented May 24, 2024