Load tests with provider-azure #404
Similar issue for testing the AWS provider: crossplane-contrib/provider-upjet-aws#576
For testing, pick a resource with a TTR of 15 sec. Command:
For testing we are going to use a single-node EKS cluster with the following spec:
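For reference, a single-node cluster of this shape can be described with an eksctl config along these lines (the cluster name and region are illustrative assumptions; the instance type matches the m5.2xlarge proposed in the issue body):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: provider-azure-loadtest   # hypothetical name
  region: us-east-1               # hypothetical region
nodeGroups:
  - name: ng-1
    instanceType: m5.2xlarge      # 8 vCPUs, 32 GB memory
    desiredCapacity: 1            # single worker node
```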
Baseline test with provider Azure
Curious if we have a hypothesis around this test showing less memory and CPU utilization compared to AWS. Could it be the type of resource? Something in the upstream Terraform provider? Something else? I assume TTR could be influenced by the actual cloud platform response time.
My initial thought is the type of resource. I'm planning to run tests against a container registry (the same resource type as in the AWS tests) and see if this affects the results. We've discussed the TTR, and the initial hypothesis is that it might be a combination of the lower-level Go SDK and connectivity to Azure Resource Manager.
I have run an additional round of tests with provider Azure.
Apologies, I'm having trouble following. Is this latest round of numbers the baseline (i.e. still provider-azure v0.28), or is it measuring an improved build of the provider (e.g. one sharing Terraform provider processes)?
Sorry for not including this info in the comment (comment updated). The tests so far are all against the same provider version, node, and k8s version.
@Piotr1215 One thing I'm having trouble following is whether any of your numbers are derived from an experimental build of the provider (i.e. one with long-lived Terraform provider processes), or whether these are all just numbers for the "normal" provider-azure build. Or, put otherwise, should I interpret that at the moment our most efficient provider-azure builds use about 4 vCPUs and 1 GB of memory to reconcile 100 managed resources?
@negz apologies for the confusion, all the tests so far are against normal build of the Azure provider (v 0.28). We are gathering a baseline to compare subsequent tests with experimental images against it. Hopefully tomorrow I can start running the tests against the experimental build and document the results.
The node could probably handle a few more resources; I could try increasing the number of MRs until failure to find the upper load-spike limit for this version of the provider.
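Ramping up the MR count can be scripted. A minimal sketch, assuming an upjet-based provider-azure where `ResourceGroup` (API group `azure.upbound.io/v1beta1`) is a fast-TTR resource; the function name, manifest names, and location are hypothetical and would be adjusted to whichever resource is actually under test:

```shell
#!/usr/bin/env bash
# Emit N ResourceGroup manifests on stdout so the whole batch can be
# applied in one shot (and deleted the same way when the run ends).
emit_mrs() {
  local count="$1"
  for i in $(seq 1 "$count"); do
    cat <<EOF
---
apiVersion: azure.upbound.io/v1beta1
kind: ResourceGroup
metadata:
  name: loadtest-rg-$i
spec:
  forProvider:
    location: "West Europe"
EOF
  done
}

# Usage (hypothetical):
#   emit_mrs 30 | kubectl apply -f -
#   emit_mrs 30 | kubectl delete -f -
```

Generating the manifests separately from applying them keeps each experiment reproducible: the same batch can be re-applied against different provider builds.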
Comparing test results for provider Azure with different versions of the image.
A new sizing guide has been published based on the findings from the performance tests: https://github.com/upbound/upjet/blob/main/docs/sizing-guide.md
We would like to perform some load tests to better understand the scaling characteristics of provider-azure. The most recent experiments related to provider performance are here, but they were for parameter optimization rather than load testing. These tests can also help us give the community sizing & scaling guidance.
We may do a set of experiments (with the latest available version of provider-azure) in which we gradually increase the number of MRs provisioned until we saturate the compute resources of upbound/provider-azure. I suggest we use an EKS cluster with a worker instance type of m5.2xlarge (32 GB memory, 8 vCPUs), initially with the vanilla provider and with the default parameters (especially the default `max-reconcile-rate` of 10, as suggested here), so that we can better relate our results to those of the previous experiments, and also because the current default provider parameters were chosen using the results of those experiments. We can also make use of the existing tooling from here & here to conduct these tests. We should collect & report at least the following for each experiment:
- `Ready=True, Synced=True` state in 10 min: during an interval of 10 min, how many of the MRs could acquire these conditions and how many failed to do so?
- Resource utilization metrics from kube-prometheus-stack, installed with:

  ```console
  helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus --set namespaceOverride=prometheus --set grafana.namespaceOverride=prometheus --set kube-state-metrics.namespaceOverride=prometheus --set prometheus-node-exporter.namespaceOverride=prometheus --create-namespace
  ```

  from the `prometheus-community` Helm repository. We may include the Grafana dashboard screenshots like here.
- `kubectl get managed -o yaml` output at the end of the experiment.
- `go run github.com/upbound/uptest/cmd/ttr@fix-69` output (related to the above item).
- `ps -o pid,ppid,etime,comm,args` output from the provider container. We can do this at the end of each experiment run or, better, we can have reporting during the course of the experiment with something like:

  ```console
  while true; do date; kubectl exec -it <provider pod> -- ps -o pid,ppid,etime,comm,args; done
  ```

  and log the output to a file. You can refer to our conversation with @mmclane here for more context on why we do this.

As long as we have not saturated the compute resources of the provider, we can iterate with a new experiment with more MRs, in increments of 5 or 10. I think we can initially start with 30 (let's start with something with a 100% success rate, i.e., all MRs provisioned can become ready in the allocated time, i.e., in 10 min).
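The success-rate count above can also be automated. A minimal sketch, assuming the default `kubectl get managed` table layout where `READY` and `SYNCED` are the second and third columns (column positions may vary by provider and kubectl version, so treat this as an assumption to verify):

```shell
#!/usr/bin/env bash
# Count how many managed resources have reached Ready=True AND Synced=True.
# Reads the table printed by `kubectl get managed` on stdin; skips the header.
count_ready_synced() {
  awk 'NR > 1 && $2 == "True" && $3 == "True" { n++ } END { print n + 0 }'
}

# Usage (hypothetical): run at the end of the 10-minute window:
#   kubectl get managed | count_ready_synced
```

Comparing this count against the number of MRs applied gives the per-experiment success rate directly.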