Load tests with provider-azure #404

Closed
ulucinar opened this issue Mar 3, 2023 · 14 comments

@ulucinar
Collaborator

ulucinar commented Mar 3, 2023

We would like to perform some load tests to better understand the scaling characteristics of provider-azure. The most recent experiments related to provider performance are here, but they focused on parameter optimization rather than load testing. These tests can also help us give the community sizing & scaling guidance.

We may do a set of experiments (with the latest available version of provider-azure) in which we gradually increase the number of MRs provisioned until we saturate the compute resources of upbound/provider-azure. I suggest we initially use an EKS cluster with a worker instance type of m5.2xlarge (32 GB memory, 8 vCPUs), with the vanilla provider and its default parameters (especially the default max-reconcile-rate of 10, as suggested here), so that we can better relate our results to those of the previous experiments, whose results were also used to choose the current default provider parameters.

We can also make use of the existing tooling from here & here to conduct these tests. We should collect & report at least the following for each experiment:

  • The types and number of MRs provisioned during the test.
  • Success rate for reaching the Ready=True, Synced=True state within 10 min: how many of the MRs acquired these conditions during that interval and how many failed to do so?
  • Using the available Prometheus metrics from the provider, what was the peak & average memory/CPU utilization? The Prometheus and Grafana stack can be installed from the prometheus-community Helm repository with something like: helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus --set namespaceOverride=prometheus --set grafana.namespaceOverride=prometheus --set kube-state-metrics.namespaceOverride=prometheus --set prometheus-node-exporter.namespaceOverride=prometheus --create-namespace. We may include Grafana dashboard screenshots like here.
  • kubectl get managed -o yaml output at the end of the experiment.
  • Time-to-readiness metrics as defined here. Histograms like we have there would be great, but we can also derive them later.
  • go run github.com/upbound/uptest/cmd/ttr@fix-69 output (related to the item above).
  • ps -o pid,ppid,etime,comm,args output from the provider container. We can do this at the end of each experiment run or, better, report periodically during the course of the experiment with something like: while true; do date; k exec -it <provider pod> -- ps -o pid,ppid,etime,comm,args; done, logging the output to a file (see the sketch after this list). You can refer to our conversation with @mmclane here for more context on why we do this.
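To make the periodic sampling above easier to run end-to-end, here is a minimal monitoring-loop sketch (not the exact tooling referenced above); the namespace, pod-name match, and sampling interval are assumptions to adjust for the actual installation:

```bash
#!/usr/bin/env bash
# Sketch: every 30s, log the provider's process list and the number of MRs
# reporting Ready=True. Namespace and pod-name match are assumptions; adjust
# them to the actual provider installation.
set -euo pipefail

NAMESPACE="${NAMESPACE:-crossplane-system}"
LOG_FILE="${LOG_FILE:-load-test-$(date +%Y%m%d-%H%M%S).log}"

while true; do
  {
    date
    # ps output from the (first) provider-azure pod.
    POD="$(kubectl -n "$NAMESPACE" get pods -o name | grep provider-azure | head -n1)"
    kubectl -n "$NAMESPACE" exec "$POD" -- ps -o pid,ppid,etime,comm,args
    # Count of MRs with Ready=True across all managed resources.
    kubectl get managed -o json \
      | jq '[.items[] | select(any(.status.conditions[]?; .type=="Ready" and .status=="True"))] | length' \
      | xargs -I{} echo "Ready MRs: {}"
  } >> "$LOG_FILE" 2>&1
  sleep 30
done
```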

As long as we have not saturated the provider's compute resources, we can iterate with a new experiment, increasing the number of MRs in increments of 5 or 10. I think we can initially start with 30, i.e., a count at which we expect a 100% success rate, with all provisioned MRs becoming ready within the allocated 10 min.
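For the initial run, the MRs could be generated with a simple loop; a rough sketch, assuming we start with ResourceGroup MRs and a ProviderConfig named default (the API version and field names below follow upbound/provider-azure examples and should be verified against the provider version under test):

```bash
#!/usr/bin/env bash
# Sketch: create N ResourceGroup MRs for a load-test run.
# apiVersion and field names are assumptions -- verify with `kubectl explain`
# against the installed provider version before use.
COUNT="${1:-30}"   # start with 30 MRs, then grow in increments of 5 or 10

for i in $(seq 1 "$COUNT"); do
  cat <<EOF | kubectl apply -f -
apiVersion: azure.upbound.io/v1beta1
kind: ResourceGroup
metadata:
  name: loadtest-rg-${i}
spec:
  forProvider:
    location: West Europe
  providerConfigRef:
    name: default
EOF
done
```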

@Piotr1215

Similar issue for testing AWS Provider: crossplane-contrib/provider-upjet-aws#576

@Piotr1215

For testing, pick a resource with a TTR of about 15 sec. The command go run github.com/upbound/uptest/cmd/ttr@fix-69 is run without parameters.

@Piotr1215

For testing we are going to use a single-node EKS cluster with an m5.2xlarge instance (32 GB memory, 8 vCPUs).
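For reference, a cluster matching this spec can be created with something along these lines (a sketch, assuming eksctl; the cluster name and region are placeholders, and 1.25 is the Kubernetes version used in the tests below):

```bash
# Sketch: single-node EKS cluster for the load tests.
# Cluster name and region are placeholders, not taken from the actual test setup.
eksctl create cluster \
  --name provider-azure-load-test \
  --region eu-west-1 \
  --version 1.25 \
  --nodes 1 \
  --node-type m5.2xlarge
```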

@Piotr1215

Baseline test with provider-azure v0.28 on EKS with 1 node (m5.2xlarge, 32 GB memory, 8 vCPUs), Kubernetes version 1.25.
The resource used for testing was a resource group. Compared to the AWS counterparts, the test results show less memory and CPU utilization but significantly longer TTRs.

[screenshot: test results]

@jeanduplessis
Collaborator

Curious if we have a hypothesis about why this test shows less memory and CPU utilization compared to AWS. Could it be the type of resource? Something in the upstream Terraform provider? Something else?

I assume TTR could be influenced by the actual cloud platform response time.

@Piotr1215

My initial thought is the type of resource. I'm planning to run a container registry (the same resource type as in the AWS tests) and see if this affects the results.

We've discussed the TTR, and the initial hypothesis is that it might be a combination of the lower-level Go SDK and connectivity to the Azure Resource Manager.
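The Registry MR I have in mind looks roughly like the sketch below; the API group/version and forProvider fields are written from memory for upbound/provider-azure v0.28 and may differ, so they should be checked against the provider's CRDs before use:

```bash
# Sketch of a single Registry MR; field names are assumptions -- verify with
# `kubectl explain` against the installed provider version.
cat <<EOF | kubectl apply -f -
apiVersion: containerregistry.azure.upbound.io/v1beta1
kind: Registry
metadata:
  name: loadtestregistry1
spec:
  forProvider:
    location: West Europe
    resourceGroupName: loadtest-rg-1   # assumed to already exist
    sku: Basic
  providerConfigRef:
    name: default
EOF
```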

@Piotr1215

Piotr1215 commented Mar 13, 2023

I have run additional tests with a Registry resource to see if there are significant differences compared to the resource group tests; results below. CPU and memory utilization is higher, but interestingly, the TTR for 100 resources is even lower.

Test with provider Azure v0.28 and EKS with 1 node m5.2xlarge - 32 GB Memory - 8 vCPUs. Kubernetes version 1.25.

[screenshot: test results]

@Piotr1215

Piotr1215 commented Mar 14, 2023

Ran 100 Registries one more time, with the following results. Including Prometheus graphs this time.

Test with provider Azure v0.28 and EKS with 1 node m5.2xlarge - 32 GB Memory - 8 vCPUs. Kubernetes version 1.25.

[screenshots: test results and Prometheus CPU/memory utilization graphs]

@negz
Member

negz commented Mar 15, 2023

Apologies, I'm having trouble following. Are these latest numbers the baseline (i.e. still provider-azure v0.28), or are they measuring an improved build of the provider (e.g. one sharing Terraform provider processes)?

@Piotr1215

Sorry for not including this info in the comment (comment updated). For now, the tests are all against the same provider version, node, and Kubernetes version.

@negz
Member

negz commented Mar 16, 2023

@Piotr1215 One thing I'm having trouble following is whether any of your numbers are derived from an experimental build of the provider (i.e. one with long lived Terraform provider processes), or whether these are all just numbers for the "normal" provider-azure build.

Or, put otherwise, should I interpret that at the moment our most efficient provider-azure builds use about 4 vCPUs and 1GB of memory to reconcile 100 managed resources?

@Piotr1215

@negz apologies for the confusion, all the tests so far are against the normal build of the Azure provider (v0.28). We are gathering a baseline to compare subsequent tests with experimental images against. Hopefully tomorrow I can start running the tests against the experimental build and document the results.

Or, put otherwise, should I interpret that at the moment our most efficient provider-azure builds use about 4 vCPUs and 1GB of memory to reconcile 100 managed resources?

The node could probably handle a few more resources; I could try increasing the number of MRs until failure to see what the upper load limit is for this version of the provider.

@Piotr1215

Comparing test results for provider-azure with different image versions.
Test setup: EKS with 1 node (m5.2xlarge, 32 GB memory, 8 vCPUs), Kubernetes version 1.25, comparing the regular v0.28 image and the new improved image ulucinar/provider-azure-amd64:d0932e28, deploying 1, 10, 50, and 100 MRs of kind: Registry.

Note that the average time to readiness for 100 MRs with the new image might not be accurate due to a bug in the performance testing tool.

[screenshot: comparison of test results for the two images]

@Piotr1215

A new sizing guide has been published based on the findings from the performance tests: https://github.com/upbound/upjet/blob/main/docs/sizing-guide.md
