
The dashboard DOS attacks the service during initial load #372

Closed
SGudbrandsson opened this issue Jun 21, 2023 · 10 comments

Comments

@SGudbrandsson

Description

helm-dashboard seems to fetch all the information for all the services on the initial call.
This effectively DOSes the service itself: the CPU spikes, and I never get results for the resources view.

This seems to be initiated from datadog-rum-v4.js, which is set up by the analytics part.
We're using the komodorio/helm-dashboard:1.3.2 Docker image.

This causes the whole system to stop responding until Kubernetes restarts the dashboard.

I'm trying to modify the deployment to disable tracking, but this is very annoying. If you want tracking, you should probably change it so it doesn't DOS the service.

Screenshots

(three screenshots attached)


@SGudbrandsson
Author

Disabling analytics didn't do anything.

Looks like it's https://github.com/komodorio/helm-dashboard/blob/main/pkg/dashboard/static/list-view.js#L14 looping through all deployments, calling https://github.com/komodorio/helm-dashboard/blob/main/pkg/dashboard/static/list-view.js#L111 in a recursion.
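
For illustration, here's a minimal sketch of capping how many health requests are in flight at once, instead of firing one per release immediately on load (the `fetchHealth` helper and endpoint path are assumptions for the example, not the project's actual API):

```typescript
// Hypothetical helper; the real endpoint shape may differ.
async function fetchHealth(release: string): Promise<string> {
  const res = await fetch(`/api/helm/releases/${release}/health`);
  return res.text();
}

// Run fn over items with at most `limit` promises in flight at a time.
async function mapLimited<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++; // index is claimed synchronously, so workers never collide
      results[i] = await fn(items[i]);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}

// e.g. at most 5 concurrent health checks instead of one request per release:
// const statuses = await mapLimited(releaseNames, 5, fetchHealth);
```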

@undera
Collaborator

undera commented Jun 22, 2023

Hi
It has nothing to do with analytics. It's just how the app works: it queries the health status for each of the releases. There's no recursion, just a single loop over the list; the longer your list, the longer it takes.

I guess you have many releases installed, don't you? How many?

At the moment I'm not sure what the best solution would be. Probably we should only query the health status for those releases that are visible on the list. That could be part of the V2 effort (#233).
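
As a rough illustration of that idea, health could be fetched only when a row scrolls into view, using the browser's standard IntersectionObserver (the markup, the `data-release` attribute, and the `fetchHealth` helper are assumptions, not the dashboard's actual code):

```typescript
// Assumed helper; see the earlier sketch.
declare function fetchHealth(release: string): Promise<string>;

const seen = new Set<string>();

const observer = new IntersectionObserver((entries) => {
  for (const entry of entries) {
    if (!entry.isIntersecting) continue;
    const el = entry.target as HTMLElement;
    const release = el.dataset.release; // hypothetical data attribute on each row
    if (!release || seen.has(release)) continue;
    seen.add(release); // query each release's health at most once
    fetchHealth(release).then((status) => {
      el.textContent = status;
    });
    observer.unobserve(el); // stop watching once the request has been issued
  }
});

// Observe one health cell per release row:
document
  .querySelectorAll<HTMLElement>("[data-release]")
  .forEach((el) => observer.observe(el));
```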

One workaround is to narrow down the list of namespaces in scope via the --namespace parameter.

Another immediate option is Komodor's platform, where you get the same functionality without the scalability issues.

@SGudbrandsson
Author

:/

We deploy multiple times a day, so the list is long.

Looking at the code, it's doing a lot of things that can be optimized.

To get the health of a single service, the app fetches all Helm apps and all Helm releases and parses them before checking the health status of that one app.

There are some architectural decisions you could make to improve this, such as:

  1. Cache Helm releases for ~20 seconds (helps with bursts).
  2. Collapse requests: if there are multiple in-flight requests to the Kubernetes API for the same data, fold them into one and hand the same result back to every caller (see the sketch after this list).
  3. For a health check, only check health. Don't download the whole catalog for every single application.
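
To illustrate points 1 and 2 concretely, here's a minimal sketch of a TTL cache combined with in-flight request collapsing; all names here are hypothetical, not the project's code:

```typescript
type Entry<V> = { value: V; expires: number };

// Cache with a TTL, plus collapsing: concurrent callers asking for the
// same key while a request is in flight all share that one request.
class CollapsingCache<V> {
  private cache = new Map<string, Entry<V>>();
  private inflight = new Map<string, Promise<V>>();

  constructor(private ttlMs: number) {}

  async get(key: string, load: () => Promise<V>): Promise<V> {
    const hit = this.cache.get(key);
    if (hit && hit.expires > Date.now()) return hit.value; // fresh cache hit

    const pending = this.inflight.get(key);
    if (pending) return pending; // collapse into the in-flight request

    const p = load()
      .then((value) => {
        this.cache.set(key, { value, expires: Date.now() + this.ttlMs });
        return value;
      })
      .finally(() => this.inflight.delete(key));
    this.inflight.set(key, p);
    return p;
  }
}

// Usage: bursts of callers share one upstream call, and repeats within
// 20 seconds are served from cache (fetchAllReleases is hypothetical).
// const releases = new CollapsingCache<unknown[]>(20_000);
// const all = await releases.get("releases", () => fetchAllReleases());
```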

This is a cool concept that could be improved quite a bit with some small performance optimizations.

I don't think we'll be able to use this or Komodor's platform (I guess it's running the same thing behind the scenes).
We'll find another solution 😢

@undera
Collaborator

undera commented Jun 23, 2023

@SGudbrandsson About the Komodor platform: no, it does not use the same code or the same approach, so it does not suffer from this problem.

Thanks for sharing your observations; they will help improve the product.

@undera
Collaborator

undera commented Jun 23, 2023

You were absolutely right about fetching all releases to find a single one; I'm already fixing that in #373.

@SGudbrandsson
Author

Wow, you're amazing!
Thanks for checking this out.

I'm also in contact with sales at Komodor.io about the platform, to see if it fits our use case (we use GKE Autopilot, so node-based pricing doesn't work).

@seboudry

Hi!

Hope to see a release with these improvements 😉

Running on an on-prem cluster, this DOSes the API server and control-plane nodes.

With around 200 releases on our cluster, there were ~2k packets per second sent and 5 MB/s of received bandwidth. Huge!

Can't try this tool for now 😢

@seboudry

hi @undera !

any chance to get a new release? 😉

@undera
Collaborator

undera commented Nov 13, 2024

> hi @undera !
> any chance to get a new release? 😉

Hell yeah!! I've opened this Pandora's box

@undera
Collaborator

undera commented Nov 13, 2024

There you go; after 2 retries it completed.
