feat: create a dedicated pool for prometheus and beef up the instance #1238
Conversation
📝 Walkthrough

This pull request introduces a dedicated Prometheus node pool in an Azure Kubernetes cluster and updates its configuration. The changes involve creating a new node pool with specific characteristics, including spot instance configuration, node labels, and taints. Additionally, the Prometheus stack configuration is modified to adjust resource allocation, storage, and pod scheduling tolerations to accommodate the new node pool.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant K8s as Kubernetes Cluster
    participant NodePool as Prometheus Node Pool
    participant PrometheusService as Prometheus Service
    NodePool->>K8s: Create Spot Node Pool
    K8s->>NodePool: Configure Node Labels & Taints
    PrometheusService->>K8s: Apply Tolerations
    K8s->>NodePool: Schedule Prometheus Pods
```
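To make the walkthrough concrete, here is a minimal sketch of what such a dedicated spot node pool can look like in Terraform. This is illustrative only: the resource name, cluster reference, label key, and taint value are assumptions, not the actual contents of k6_tests_rg_k8s.tf.

```hcl
# Sketch only; names and values below are assumptions, not the PR's actual code.
resource "azurerm_kubernetes_cluster_node_pool" "prometheus" {
  name                  = "prometheus"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k6tests.id # assumed cluster reference
  vm_size               = "Standard_D3_v2"
  node_count            = 1

  # Run the pool on spot instances to keep costs down in the test environment.
  priority        = "Spot"
  eviction_policy = "Delete"
  spot_max_price  = -1 # pay up to the current on-demand price

  # Label and taint the nodes so that only Prometheus pods are scheduled here.
  node_labels = {
    "purpose" = "prometheus"
  }
  node_taints = [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule", # applied by AKS to spot pools
    "purpose=prometheus:NoSchedule",                         # assumed dedicated-pool taint
  ]
}
```

Pods that should land on this pool then need matching tolerations, which is what the values template changes reviewed below take care of.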
Possibly related PRs
Suggested reviewers
Terraform environment test

Format and Style 🖌

| Context | Values |
| --- | --- |
| Pusher | @monteiro-renato |
| Action | push |
| Working Directory | ./infrastructure/adminservices-test/altinn-monitor-test-rg |
| State File | github.com/altinn/altinn-platform/environments/test/altinn-monitor-test-rg.tfstate |
| Plan File | github.com_altinn_altinn-platform_environments_test_altinn-monitor-test-rg.tfstate.tfplan |
Actionable comments posted: 1
🧹 Nitpick comments (2)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (1)
34-36: Consider adding CPU requests. While the memory request is appropriate, consider adding CPU requests to prevent resource contention and ensure predictable performance.
```diff
 resources:
   requests:
     memory: 8Gi
+    cpu: 2000m
```
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)
58-75: Address the disk encryption finding. The static analysis tool flagged that the node pool does not enable host-based encryption; consider enabling it to protect sensitive monitoring data.
Add the following configuration to enable encryption:
resource "azurerm_kubernetes_cluster_node_pool" "prometheus" { # ... existing configuration ... + enable_host_encryption = true }
🧰 Tools
🪛 checkov (3.2.334)
[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources
(CKV_AZURE_227)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1 hunks)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (1 hunks)
🧰 Additional context used
🪛 checkov (3.2.334)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf
[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources
(CKV_AZURE_227)
🔇 Additional comments (3)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (2)
25-33: LGTM! Tolerations match the node pool configuration. The tolerations are correctly configured to allow scheduling on the dedicated Prometheus node pool and spot instances.
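For reference, tolerations that match both the AKS spot taint and a dedicated-pool taint typically look something like the following in kube-prometheus-stack values. This is a sketch; the key/value used for the dedicated pool taint is an assumption and may differ from the actual template.

```yaml
# Sketch only; the "purpose=prometheus" key/value is an assumption.
prometheus:
  prometheusSpec:
    tolerations:
      # Tolerate the taint AKS applies to spot node pools
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      # Tolerate the dedicated Prometheus pool's taint
      - key: "purpose"
        operator: "Equal"
        value: "prometheus"
        effect: "NoSchedule"
```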
44-44: LGTM! Storage increase aligns with requirements. The storage increase from 5Gi to 15Gi is appropriate given the PR objective to handle more resource-intensive tests.
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)
64-66: Verify spot instance stability for the Prometheus workload. While using spot instances for testing is cost-effective, ensure you have monitoring in place for:
- Spot instance termination events
- Impact on Prometheus data collection during evictions
✅ Verification successful
Spot instance configuration is appropriate for this test environment
The setup is resilient to spot instance terminations because:
- Prometheus metrics are continuously streamed to Azure Monitor, ensuring data durability
- No local persistence eliminates the risk of data loss during spot instance evictions
- The environment is specifically designed for testing purposes where temporary disruptions are acceptable
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check for spot instance terminations in the last 24 hours
kubectl get events --field-selector reason=SpotTermination -A

# Check for Prometheus pod restarts
kubectl get pods -l app=prometheus -o wide --all-namespaces | grep -v "Running\|NAME"
```

Length of output: 347
Script:
```bash
#!/bin/bash
# Check for monitoring and persistence configurations
fd . --type f -e tf -e yaml -e yml | xargs rg -l "prometheus|monitoring|persistence|backup|alert"

# Check for specific spot instance configurations
rg -l "spot_instance|spot_price|priority.*Spot" --type tf

# Look for related kubernetes configurations
fd . --type f -e yaml -e yml | xargs rg -l "prometheus|tolerations|nodeSelector"
```

Length of output: 1043
Script:
```bash
#!/bin/bash
# Check Prometheus configuration
cat infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus.tf

# Check for persistence and storage configurations
rg -A 5 "persistence:|storage:|volume:" infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus.tf

# Check tolerations and node affinity
cat infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k6_operator_values.yaml
```

Length of output: 2863
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (outdated review comment, resolved)
Force-pushed from 937b91d to dd388aa
Actionable comments posted: 1
🧹 Nitpick comments (1)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)
64-66: Consider monitoring spot instance evictions. Since this is a single-node pool using spot instances, consider:
- Setting up alerts for spot instance termination notifications
- Monitoring Prometheus availability and recovery after evictions
Would you like me to help create monitoring and alerting configurations for spot instance events?
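As a starting point, here is a hedged sketch of what such alerting could look like as a PrometheusRule. The metric expression, node-name pattern, and namespace are assumptions and would need to be adapted to this cluster.

```yaml
# Hypothetical sketch; the node-name pattern and namespace are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-eviction-alerts
  namespace: monitoring
spec:
  groups:
    - name: spot-instances
      rules:
        - alert: PrometheusSpotNodeMissing
          # Fires if no node from the dedicated prometheus pool is registered,
          # e.g. after a spot eviction and before a replacement node joins.
          expr: absent(kube_node_info{node=~"aks-prometheus-.*"})
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "No node from the prometheus spot pool has been registered for 10 minutes"
```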
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1 hunks)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (1 hunks)
🧰 Additional context used
🪛 checkov (3.2.334)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf
[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources
(CKV_AZURE_227)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (go)
🔇 Additional comments (3)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (2)
25-36: LGTM! Well-structured configuration for the dedicated Prometheus node pool. The tolerations and resource configurations are properly aligned with the node pool setup:
- Correct tolerations for both spot instances and prometheus workload
- Appropriate memory request of 8Gi for resource-intensive tests
44-44: Verify whether the retention period needs adjustment with the increased storage. While increasing storage to 15Gi is good for handling more data, consider whether the current 1d retention period (line 39) is still optimal for your use case with the increased storage capacity.
Would you like me to help calculate an optimal retention period based on your data ingestion rate and the new storage capacity?
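For rough sizing: Prometheus disk usage is approximately ingestion rate (samples/s) x retention window (s) x 1-2 bytes per sample, so the retention window can be raised until that product approaches the 15Gi volume. Below is a sketch of how a time-based and a size-based cap could be expressed in the kube-prometheus-stack values; the concrete numbers are assumptions, not recommendations.

```yaml
# Illustrative values only; derive the real retention from the observed ingestion rate.
prometheus:
  prometheusSpec:
    retention: 3d        # time-based cap (assumed value)
    retentionSize: 13GB  # size-based cap, kept below the 15Gi volume to leave headroom
```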
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)
58-75: LGTM! Well-designed dedicated node pool for Prometheus. The configuration is well-structured with an appropriate VM size (Standard_D3_v2, which provides 4 vCPUs and 14 GiB of memory) to handle the 8Gi memory request.
🧰 Tools
🪛 checkov (3.2.334)
[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources
(CKV_AZURE_227)
Dagfinn is running more resource-intensive tests and Prometheus is starting to struggle.
This PR adds a dedicated node(pool) for Prometheus, sets memory requests, and increases the storage requests.
Summary by CodeRabbit