Load tests with provider-azure #404

Closed
ulucinar opened this issue Mar 3, 2023 · 14 comments

@ulucinar
Collaborator

ulucinar commented Mar 3, 2023

We would like to perform some load tests to better understand the scaling characteristics of provider-azure. The most recent experiments related to provider performance are here, but they focused on parameter optimization rather than load testing. These tests can also help us give the community sizing & scaling guidance.

We may do a set of experiments (with the latest available version of provider-azure) in which we gradually increase the number of MRs provisioned until we saturate the compute resources of upbound/provider-azure. I suggest we initially use an EKS cluster with a worker instance type of m5.2xlarge (32 GB memory, 8 vCPUs), with the vanilla provider and its default parameters (especially the default max-reconcile-rate of 10, as suggested here), so that we can better relate our results to those of the previous experiments, whose results were also used to choose the current default provider parameters.

We can also make use of the existing tooling from here & here to conduct these tests. We should collect & report at least the following for each experiment:

  • The types and number of MRs provisioned during the test.
  • Success rate for reaching the Ready=True, Synced=True state within 10 min: how many of the MRs acquired these conditions during that interval and how many failed to do so?
  • Using the available Prometheus metrics from the provider, what was the peak & average memory/CPU utilization? The Prometheus and Grafana stack can be installed from the prometheus-community Helm repository with something like: helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus --set namespaceOverride=prometheus --set grafana.namespaceOverride=prometheus --set kube-state-metrics.namespaceOverride=prometheus --set prometheus-node-exporter.namespaceOverride=prometheus --create-namespace. We may include Grafana dashboard screenshots like here.
  • kubectl get managed -o yaml output at the end of the experiment.
  • Time-to-readiness metrics as defined here. Histograms like we have there would be great, but we can also derive them later.
  • go run github.com/upbound/uptest/cmd/ttr@fix-69 output (related to the item above).
  • ps -o pid,ppid,etime,comm,args output from the provider container. We can do this at the end of each experiment run or, better, report periodically during the course of the experiment with something like: while true; do date; k exec -it <provider pod> -- ps -o pid,ppid,etime,comm,args; done, logging the output to a file (see the sketch after this list). You can refer to our conversation with @mmclane here for more context on why we do this.
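To make the periodic sampling above easier to run end-to-end, here is a minimal monitoring-loop sketch (not the exact tooling referenced above); the namespace, pod-name match, and sampling interval are assumptions to adjust for the actual installation:

```bash
#!/usr/bin/env bash
# Sketch: every 30s, log the provider's process list and the number of MRs
# reporting Ready=True. Namespace and pod-name match are assumptions; adjust
# them to the actual provider installation.
set -euo pipefail

NAMESPACE="${NAMESPACE:-crossplane-system}"
LOG_FILE="${LOG_FILE:-load-test-$(date +%Y%m%d-%H%M%S).log}"

while true; do
  {
    date
    # ps output from the (first) provider-azure pod.
    POD="$(kubectl -n "$NAMESPACE" get pods -o name | grep provider-azure | head -n1)"
    kubectl -n "$NAMESPACE" exec "$POD" -- ps -o pid,ppid,etime,comm,args
    # Count of MRs with Ready=True across all managed resources.
    kubectl get managed -o json \
      | jq '[.items[] | select(any(.status.conditions[]?; .type=="Ready" and .status=="True"))] | length' \
      | xargs -I{} echo "Ready MRs: {}"
  } >> "$LOG_FILE" 2>&1
  sleep 30
done
```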

As long as we have not saturated the provider's compute resources, we can iterate with a new experiment, increasing the number of MRs in increments of 5 or 10. I think we can initially start with 30, i.e., a count at which we expect a 100% success rate, with all provisioned MRs becoming ready within the allocated 10 min.
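For the initial run, the MRs could be generated with a simple loop; a rough sketch, assuming we start with ResourceGroup MRs and a ProviderConfig named default (the API version and field names below follow upbound/provider-azure examples and should be verified against the provider version under test):

```bash
#!/usr/bin/env bash
# Sketch: create N ResourceGroup MRs for a load-test run.
# apiVersion and field names are assumptions -- verify with `kubectl explain`
# against the installed provider version before use.
COUNT="${1:-30}"   # start with 30 MRs, then grow in increments of 5 or 10

for i in $(seq 1 "$COUNT"); do
  cat <<EOF | kubectl apply -f -
apiVersion: azure.upbound.io/v1beta1
kind: ResourceGroup
metadata:
  name: loadtest-rg-${i}
spec:
  forProvider:
    location: West Europe
  providerConfigRef:
    name: default
EOF
done
```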

@Piotr1215

Similar issue for testing AWS Provider: crossplane-contrib/provider-upjet-aws#576

@Piotr1215

For testing, pick a resource with a TTR of about 15 sec. The command go run github.com/upbound/uptest/cmd/ttr@fix-69 is run without parameters.

@Piotr1215

For testing we are going to use a single-node EKS cluster with an m5.2xlarge instance (32 GB memory, 8 vCPUs).
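For reference, a cluster matching this spec can be created with something along these lines (a sketch, assuming eksctl; the cluster name and region are placeholders, and 1.25 is the Kubernetes version used in the tests below):

```bash
# Sketch: single-node EKS cluster for the load tests.
# Cluster name and region are placeholders, not taken from the actual test setup.
eksctl create cluster \
  --name provider-azure-load-test \
  --region eu-west-1 \
  --version 1.25 \
  --nodes 1 \
  --node-type m5.2xlarge
```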

@Piotr1215

Baseline test with provider-azure v0.28 on EKS with 1 node (m5.2xlarge, 32 GB memory, 8 vCPUs), Kubernetes version 1.25.
The resource used for testing was a resource group. Compared to the AWS counterparts, the test results show less memory and CPU utilization but significantly longer TTRs.

[screenshot: test results]

@jeanduplessis
Collaborator

Curious if we have a hypothesis about why this test shows less memory and CPU utilization compared to AWS. Could it be the type of resource? Something in the upstream Terraform provider? Something else?

I assume TTR could be influenced by the actual cloud platform response time.

@Piotr1215

My initial thought is the type of resource. I'm planning to run a container registry (the same resource type as in the AWS tests) and see if this affects the results.

We've discussed the TTR, and the initial hypothesis is that it might be a combination of the lower-level Go SDK and connectivity to the Azure Resource Manager.
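The Registry MR I have in mind looks roughly like the sketch below; the API group/version and forProvider fields are written from memory for upbound/provider-azure v0.28 and may differ, so they should be checked against the provider's CRDs before use:

```bash
# Sketch of a single Registry MR; field names are assumptions -- verify with
# `kubectl explain` against the installed provider version.
cat <<EOF | kubectl apply -f -
apiVersion: containerregistry.azure.upbound.io/v1beta1
kind: Registry
metadata:
  name: loadtestregistry1
spec:
  forProvider:
    location: West Europe
    resourceGroupName: loadtest-rg-1   # assumed to already exist
    sku: Basic
  providerConfigRef:
    name: default
EOF
```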

@Piotr1215

Piotr1215 commented Mar 13, 2023

I have run additional tests with a Registry resource to see if there are significant differences compared to the resource group tests; results below. CPU and memory utilization is higher, but interestingly, the TTR for 100 resources is even lower.

Test with provider Azure v0.28 and EKS with 1 node m5.2xlarge - 32 GB Memory - 8 vCPUs. Kubernetes version 1.25.

[screenshot: test results]

@Piotr1215

Piotr1215 commented Mar 14, 2023

Ran 100 Registries one more time, with the following results. Including Prometheus graphs this time.

Test with provider Azure v0.28 and EKS with 1 node m5.2xlarge - 32 GB Memory - 8 vCPUs. Kubernetes version 1.25.

[screenshots: test results and Prometheus CPU/memory utilization graphs]

@negz
Member

negz commented Mar 15, 2023

Apologies, I'm having trouble following. Are these latest numbers the baseline (i.e. still provider-azure v0.28), or are they measuring an improved build of the provider (e.g. one sharing Terraform provider processes)?

@Piotr1215

Sorry for not including this info in the comment (comment updated). For now, the tests are all against the same provider version, node, and Kubernetes version.

@negz
Member

negz commented Mar 16, 2023

@Piotr1215 One thing I'm having trouble following is whether any of your numbers are derived from an experimental build of the provider (i.e. one with long lived Terraform provider processes), or whether these are all just numbers for the "normal" provider-azure build.

Or, put otherwise, should I interpret that at the moment our most efficient provider-azure builds use about 4 vCPUs and 1GB of memory to reconcile 100 managed resources?

@Piotr1215

@negz apologies for the confusion, all the tests so far are against the normal build of the Azure provider (v0.28). We are gathering a baseline to compare subsequent tests with experimental images against. Hopefully tomorrow I can start running the tests against the experimental build and document the results.

Or, put otherwise, should I interpret that at the moment our most efficient provider-azure builds use about 4 vCPUs and 1GB of memory to reconcile 100 managed resources?

The node could probably handle a few more resources; I could try increasing the number of MRs until failure to see what the upper load limit is for this version of the provider.

@Piotr1215

Comparing test results for provider-azure with different image versions.
Test setup: EKS with 1 node (m5.2xlarge, 32 GB memory, 8 vCPUs), Kubernetes version 1.25, comparing the regular v0.28 image and the new improved image ulucinar/provider-azure-amd64:d0932e28, deploying 1, 10, 50, and 100 MRs of kind: Registry.

Note that the average time to readiness for 100 MRs with the new image might not be accurate due to a bug in the performance testing tool.

[screenshot: comparison of test results for the two images]

@Piotr1215

A new sizing guide has been published based on the findings from the performance tests: https://github.com/upbound/upjet/blob/main/docs/sizing-guide.md
