
[DONOT_MERGE] node-density-cni on 500 nodes #641

Open · wants to merge 3 commits into base: master
Conversation

venkataanil
Collaborator

The following changes are required to run node-density-cni on 500+ nodes:
1) requestTimeout is set to 60 sec to avoid Client.Timeout errors while checking for created objects.
2) It is better to use metrics-aggregated.yml instead of metrics.yml, since it carries reduced/aggregated metrics. kube-burner looks for a metrics.yml file in the current directory (or generates one if not found) and uses it for metrics. This patch adds a metrics.yml to the e2e workload directory containing only the required metrics.
3) At large scale (500+ nodes), Prometheus fails to scrape containerCPU-AggregatedWorkers, containerMemory-AggregatedWorkers, nodeCPU-AggregatedWorkers, nodeMemoryUtilization-AggregatedWorkers, podStatusCount and podDistribution, so they are removed from metrics.yml.
4) After the test finishes, the node count must be reduced manually and 'kube-burner index' run to scrape the containerCPU-AggregatedWorkers and containerMemory-AggregatedWorkers metrics if needed.
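Changes 1) and 2) can be sketched as a kube-burner global config stanza. This is a sketch only: the field names (`requestTimeout`, `metricsProfile`) are recalled from kube-burner's config schema, and the exact layout of the e2e workload config may differ.

```yaml
# Hedged sketch of the relevant kube-burner global config; not the
# literal file from this patch.
global:
  requestTimeout: 60s          # raised to avoid Client.Timeout errors at 500+ nodes
  metricsProfile: metrics.yml  # trimmed metrics profile shipped in the e2e workload directory
```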

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization
  • Documentation Update

Description

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please describe the System Under Test.
  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

The following changes are required to run node-density-cni on 500+ nodes:
1) requestTimeout is set to 60 sec to avoid Client.Timeout errors
while checking for created objects.
2) It is better to use metrics-aggregated.yml instead of metrics.yml,
since it carries reduced/aggregated metrics. kube-burner looks for a
metrics.yml file in the current directory (or generates one if not
found) and uses it for metrics. This patch adds a metrics.yml to the
e2e workload directory containing only the required metrics.
3) At large scale (500+ nodes), Prometheus fails to scrape
containerCPU-AggregatedWorkers, containerMemory-AggregatedWorkers,
nodeCPU-AggregatedWorkers, nodeMemoryUtilization-AggregatedWorkers,
podStatusCount and podDistribution, so they are removed from
metrics.yml.
4) After the test finishes, the node count must be reduced manually and
'kube-burner index' run to scrape the containerCPU-AggregatedWorkers
and containerMemory-AggregatedWorkers metrics if needed.
Label the workers with ovnic and then scrape metrics from only these
workers.
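The post-test step above can be sketched as follows. The label key, node names, Prometheus URL/token variables, and the exact `kube-burner index` flags are assumptions (flag names vary across kube-burner versions); the `run` wrapper only echoes commands so the sketch is a dry run.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the post-test re-index step. Replace the echo in
# run() with "$@" to actually execute against a cluster.
set -eo pipefail
run() { echo "+ $*"; }

# Label a handful of workers so only they are scraped afterwards
# (node names and the "ovnic" label key are placeholders).
for n in worker-0 worker-1; do
  run oc label node "$n" ovnic=true --overwrite
done

# Re-run indexing over the test window to collect the aggregated
# container CPU/memory metrics that failed to scrape at full scale.
run kube-burner index -c config.yml -m metrics.yml \
  -u "${PROMETHEUS_URL:-https://prometheus.example.com}" \
  -t "${PROMETHEUS_TOKEN:-changeme}"
```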

node-density-cni on 500 nodes runs for 2 hours 15 minutes.
Scraping metrics from all 500 nodes for that entire duration is
overkill, so metrics are scraped from only 10 worker nodes when the
worker node count is more than 120.
Scale the machineset to the desired count before running kube-burner.
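Scaling the machineset ahead of the run can be sketched with `oc scale`; the machineset name below is a placeholder, and the `run` wrapper echoes the command rather than executing it.

```shell
#!/usr/bin/env bash
# Dry-run sketch: scale a worker machineset before running kube-burner.
# Replace the echo in run() with "$@" to execute for real.
run() { echo "+ $*"; }

# "my-cluster-worker-us-west-2a" is a hypothetical machineset name;
# openshift-machine-api is the standard namespace for machinesets.
run oc scale machineset my-cluster-worker-us-west-2a \
  -n openshift-machine-api --replicas=143
```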

run.sh accepts the number of workers to scale in each availability
zone of the region as environment variables. Currently this patch is
hard-coded to support the us-west-2 region with 4 availability zones.
For example, to scale US_WEST_2A to 143 nodes and US_WEST_2B to 188
nodes, pass them to run.sh like below:

US_WEST_2A=143 US_WEST_2B=188 WORKLOAD=node-density-cni ./run.sh

Note: nodes are scaled 50 at a time.
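The 50-at-a-time ramp can be sketched as a small bash loop; `scale_to` is a hypothetical stand-in for the actual `oc scale` call in run.sh, which would also wait for new nodes to become Ready between steps.

```shell
#!/usr/bin/env bash
# Sketch of the per-availability-zone ramp: step the replica count up
# by 50 until the target is reached, clamping the final step.
scale_to() { echo "scale ${1} -> ${2}"; }  # placeholder for oc scale + wait

ramp() {  # ramp <machineset> <current> <target>
  local ms=$1 cur=$2 target=$3
  while [ "$cur" -lt "$target" ]; do
    cur=$(( cur + 50 ))
    [ "$cur" -gt "$target" ] && cur=$target
    scale_to "$ms" "$cur"
  done
}

# e.g. ramp the us-west-2a machineset from 0 to 143: 50, 100, 143
ramp worker-us-west-2a 0 143
```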