
feat: create a dedicated pool for prometheus and beef up the instance #1238

Merged
merged 1 commit into main from feat-dedicated-node-for-prometheus on Jan 17, 2025

Conversation

@monteiro-renato (Contributor) commented on Jan 17, 2025

Dagfinn is running more resource-intensive tests and Prometheus is starting to struggle.

[screenshot omitted]

This PR adds a dedicated node pool for Prometheus, sets a memory request, and increases the storage request. (A hedged sketch of what the node pool resource could look like follows the summary below.)

Summary by CodeRabbit

  • Infrastructure
    • Added a dedicated Prometheus node pool to the Kubernetes cluster.
    • Configured node pool with spot pricing and specific workload labels.
    • Updated Prometheus service configuration with enhanced resource allocation and scheduling tolerations.
    • Increased Prometheus storage claim from 5Gi to 15Gi.
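For context, here is a minimal sketch of what the dedicated spot node pool could look like. The cluster reference, pool/label/taint names, and node count are assumptions; the VM size (Standard_D3_v2) and the workload=prometheus toleration key come from the review and plan output further down.

resource "azurerm_kubernetes_cluster_node_pool" "prometheus" {
  name                  = "prometheus"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k6tests.id # assumed cluster reference
  vm_size               = "Standard_D3_v2"                      # 4 vCPU / 14 GiB, covers the 8Gi request
  node_count            = 1                                     # assumed single node
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1 # cap at the on-demand price

  node_labels = {
    workload = "prometheus" # assumed label matching the toleration in the values template
  }

  node_taints = [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule", # applied to spot pools by AKS
    "workload=prometheus:NoSchedule",                        # keeps unrelated pods off the pool
  ]
}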

coderabbitai bot (Contributor) commented on Jan 17, 2025

📝 Walkthrough

This pull request introduces a dedicated Prometheus node pool in an Azure Kubernetes cluster and updates its configuration. The changes involve creating a new node pool with specific characteristics, including spot instance configuration, node labels, and taints. Additionally, the Prometheus stack configuration is modified to adjust resource allocation, storage, and pod scheduling tolerations to accommodate the new node pool.

Changes

infrastructure/adminservices-test/.../k6_tests_rg_k8s.tf
  Added a new azurerm_kubernetes_cluster_node_pool resource named "prometheus" with spot instance configuration, a specific VM size, and custom node labels/taints.

infrastructure/adminservices-test/.../k6_tests_rg_kube_prometheus_stack_values.tftpl
  Updated the Prometheus configuration (see the wiring sketch after this list):
  - Added tolerations for spot and prometheus-labeled nodes
  - Added an 8Gi memory resource request
  - Increased the storage claim from 5Gi to 15Gi
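A hedged sketch of how a values template like this is typically wired into the Helm release: the release name, namespace, and chart version come from the plan output below, while the chart repository URL and the (omitted) template variables are assumptions.

resource "helm_release" "kube_prometheus_stack" {
  name       = "kube-prometheus-stack"
  namespace  = "monitoring"
  repository = "https://prometheus-community.github.io/helm-charts" # assumed upstream chart repo
  chart      = "kube-prometheus-stack"
  version    = "67.9.0" # version reported in the plan metadata

  values = [
    # Template variables (remote-write URL, tenant/client IDs, ...) are omitted here.
    templatefile("${path.module}/k6_tests_rg_kube_prometheus_stack_values.tftpl", {})
  ]
}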

Sequence Diagram

sequenceDiagram
    participant K8s as Kubernetes Cluster
    participant NodePool as Prometheus Node Pool
    participant PrometheusService as Prometheus Service

    NodePool->>K8s: Create Spot Node Pool
    K8s->>NodePool: Configure Node Labels & Taints
    PrometheusService->>K8s: Apply Tolerations
    K8s->>NodePool: Schedule Prometheus Pods

Suggested reviewers

  • khanrn
  • tjololo

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

github-actions bot commented on Jan 17, 2025

Terraform environment test

Format and Style 🖌 success

Initialization ⚙️ success

Validation 🤖 success

Validation Output

Success! The configuration is valid.


Plan 📖 success

Show Plan

[Lines containing Refreshing state removed]
[Truncated to 120000 bytes! See log output for complete plan]
Acquiring state lock. This may take a few moments...
data.azurerm_client_config.current: Reading...
data.azurerm_monitor_data_collection_rule.prometheus: Reading...
data.azurerm_client_config.current: Read complete after 0s [id=Y2xpZW50Q29uZmlncy9jbGllbnRJZD1mMDNmNzU3OS01ZjM5LTRkN2YtYmNlYS04ZTVjYjZkN2FjYTQ7b2JqZWN0SWQ9MDJiMTdkMjgtN2VjOC00MDg0LTkyY2UtZDI2MDUxZmE0OTk4O3N1YnNjcmlwdGlvbklkPTFjZThlOWFmLWMyZDYtNDRlNy05YzVlLTA5OWEzMDgwNTZmZTt0ZW5hbnRJZD1jZDAwMjZkOC0yODNiLTRhNTUtOWJmYS1kMGVmNGE4YmEyMWM=]
data.azurerm_resource_group.rg_dp_stag: Reading...
data.azurerm_resource_group.rg_dp_prod: Reading...
data.azurerm_resource_group.rg_dp_prod: Read complete after 1s [id=/subscriptions/c595f787-450d-4c57-84fa-abc5f95d5459/resourceGroups/dp-be-prod-rg]
data.azurerm_log_analytics_workspace.dp_law_prod: Reading...
data.azurerm_monitor_data_collection_rule.prometheus: Read complete after 1s [id=/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/resourceGroups/MA_k6tests-amw_norwayeast_managed/providers/Microsoft.Insights/dataCollectionRules/k6tests-amw]
data.azurerm_resource_group.rg_dp_stag: Read complete after 1s [id=/subscriptions/e4926efc-0577-47b3-9c3d-757925630eca/resourceGroups/dp-be-staging-rg]
data.azurerm_log_analytics_workspace.dp_law_prod: Read complete after 1s [id=/subscriptions/c595f787-450d-4c57-84fa-abc5f95d5459/resourceGroups/dp-be-prod-rg/providers/Microsoft.OperationalInsights/workspaces/dp-be-prod-insightsWorkspace]
data.azurerm_log_analytics_workspace.dp_law_stag: Reading...
data.azurerm_resource_group.rd_dp_test["dp-be-yt01-rg"]: Reading...
data.azurerm_log_analytics_workspace.dp_law_test["dp-be-test-insightsWorkspace"]: Reading...
data.azurerm_log_analytics_workspace.dp_law_test["dp-be-yt01-insightsWorkspace"]: Reading...
data.azurerm_resource_group.rd_dp_test["dp-be-test-rg"]: Reading...
data.azurerm_resource_group.rd_dp_test["dp-be-test-rg"]: Read complete after 1s [id=/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-test-rg]
data.azurerm_log_analytics_workspace.dp_law_test["dp-be-test-insightsWorkspace"]: Read complete after 1s [id=/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-test-rg/providers/Microsoft.OperationalInsights/workspaces/dp-be-test-insightsWorkspace]
data.azurerm_log_analytics_workspace.dp_law_stag: Read complete after 1s [id=/subscriptions/e4926efc-0577-47b3-9c3d-757925630eca/resourceGroups/dp-be-staging-rg/providers/Microsoft.OperationalInsights/workspaces/dp-be-staging-insightsWorkspace]
data.azurerm_log_analytics_workspace.dp_law_test["dp-be-yt01-insightsWorkspace"]: Read complete after 1s [id=/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-yt01-rg/providers/Microsoft.OperationalInsights/workspaces/dp-be-yt01-insightsWorkspace]
data.azurerm_resource_group.rd_dp_test["dp-be-yt01-rg"]: Read complete after 2s [id=/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-yt01-rg]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # azurerm_role_assignment.monitoring_reader_rg_dp_prod must be replaced
-/+ resource "azurerm_role_assignment" "monitoring_reader_rg_dp_prod" {
      + condition_version                      = (known after apply)
      ~ id                                     = "/subscriptions/c595f787-450d-4c57-84fa-abc5f95d5459/resourceGroups/dp-be-prod-rg/providers/Microsoft.Authorization/roleAssignments/cb246258-ec98-85cd-e3c8-05a7556e3133" -> (known after apply)
      ~ name                                   = "cb246258-ec98-85cd-e3c8-05a7556e3133" -> (known after apply)
      ~ principal_type                         = "ServicePrincipal" -> (known after apply)
      ~ role_definition_id                     = "/subscriptions/c595f787-450d-4c57-84fa-abc5f95d5459/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" -> "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" # forces replacement
      ~ role_definition_name                   = "Monitoring Reader" -> (known after apply)
        # (6 unchanged attributes hidden)
    }

  # azurerm_role_assignment.monitoring_reader_rg_dp_stag must be replaced
-/+ resource "azurerm_role_assignment" "monitoring_reader_rg_dp_stag" {
      + condition_version                      = (known after apply)
      ~ id                                     = "/subscriptions/e4926efc-0577-47b3-9c3d-757925630eca/resourceGroups/dp-be-staging-rg/providers/Microsoft.Authorization/roleAssignments/8304cd3a-36bb-1d80-be0c-606c07d74021" -> (known after apply)
      ~ name                                   = "8304cd3a-36bb-1d80-be0c-606c07d74021" -> (known after apply)
      ~ principal_type                         = "ServicePrincipal" -> (known after apply)
      ~ role_definition_id                     = "/subscriptions/e4926efc-0577-47b3-9c3d-757925630eca/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" -> "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" # forces replacement
      ~ role_definition_name                   = "Monitoring Reader" -> (known after apply)
        # (6 unchanged attributes hidden)
    }

  # azurerm_role_assignment.monitoring_reader_rg_dp_test["dp-be-test-rg"] must be replaced
-/+ resource "azurerm_role_assignment" "monitoring_reader_rg_dp_test" {
      + condition_version                      = (known after apply)
      ~ id                                     = "/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-test-rg/providers/Microsoft.Authorization/roleAssignments/b8a3381c-1f71-e5ed-3ccb-fb2c8b6989e1" -> (known after apply)
      ~ name                                   = "b8a3381c-1f71-e5ed-3ccb-fb2c8b6989e1" -> (known after apply)
      ~ principal_type                         = "ServicePrincipal" -> (known after apply)
      ~ role_definition_id                     = "/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" -> "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" # forces replacement
      ~ role_definition_name                   = "Monitoring Reader" -> (known after apply)
        # (6 unchanged attributes hidden)
    }

  # azurerm_role_assignment.monitoring_reader_rg_dp_test["dp-be-yt01-rg"] must be replaced
-/+ resource "azurerm_role_assignment" "monitoring_reader_rg_dp_test" {
      + condition_version                      = (known after apply)
      ~ id                                     = "/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-yt01-rg/providers/Microsoft.Authorization/roleAssignments/3592ba27-02a2-6dc7-da3d-9a00c93e5f49" -> (known after apply)
      ~ name                                   = "3592ba27-02a2-6dc7-da3d-9a00c93e5f49" -> (known after apply)
      ~ principal_type                         = "ServicePrincipal" -> (known after apply)
      ~ role_definition_id                     = "/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" -> "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" # forces replacement
      ~ role_definition_name                   = "Monitoring Reader" -> (known after apply)
        # (6 unchanged attributes hidden)
    }

  # helm_release.kube_prometheus_stack will be updated in-place
  ~ resource "helm_release" "kube_prometheus_stack" {
        id                         = "kube-prometheus-stack"
      ~ metadata                   = [
          - {
              - app_version    = "v0.79.2"
              - chart          = "kube-prometheus-stack"
              - first_deployed = 1733406099
              - last_deployed  = 1737123696
              - name           = "kube-prometheus-stack"
              - namespace      = "monitoring"
              - notes          = <<-EOT
                    kube-prometheus-stack has been installed. Check its status by running:
                      kubectl --namespace monitoring get pods -l "release=kube-prometheus-stack"
                    
                    Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
                    
                    kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
                    The exposed metrics can be found here:
                    https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics
                    
                    The metrics are exported on the HTTP endpoint /metrics on the listening port.
                    In your case, kube-prometheus-stack-kube-state-metrics.monitoring.svc.cluster.local:8080/metrics
                    
                    They are served either as plaintext or protobuf depending on the Accept header.
                    They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.
                    
                    1. Get the application URL by running these commands:
                      export POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/name=prometheus-node-exporter,app.kubernetes.io/instance=kube-prometheus-stack" -o jsonpath="{.items[0].metadata.name}")
                      echo "Visit http://127.0.0.1:9100 to use your application"
                      kubectl port-forward --namespace monitoring $POD_NAME 9100
                EOT
              - revision       = 7
              - values         = jsonencode(
                    {
                      - alertmanager = {
                          - enabled = true
                        }
                      - crds         = {
                          - enabled = false
                        }
                      - grafana      = {
                          - enabled = false
                        }
                      - prometheus   = {
                          - enabled        = true
                          - prometheusSpec = {
                              - enableRemoteWriteReceiver = true
                              - externalLabels            = {
                                  - cluster = "k6tests-cluster"
                                }
                              - nodeSelector              = {
                                  - prometheus = true
                                }
                              - podMetadata               = {
                                  - labels = {
                                      - "azure.workload.identity/use" = "true"
                                    }
                                }
                              - priorityClassName         = "system-cluster-critical"
                              - remoteWrite               = [
                                  - {
                                      - azureAd = {
                                          - cloud = "AzurePublic"
                                          - sdk   = {
                                              - tenantId = "cd0026d8-283b-4a55-9bfa-d0ef4a8ba21c"
                                            }
                                        }
                                      - url     = "https://k6tests-amw-0vej.norwayeast-1.metrics.ingest.monitor.azure.com/dataCollectionRules/dcr-81e9cf1b38fb4648b047399c5593ebda/streams/Microsoft-PrometheusMetrics/api/v1/write?api-version=2023-04-24"
                                    },
                                ]
                              - resources                 = {
                                  - requests = {
                                      - memory = "8Gi"
                                    }
                                }
                              - retention                 = "1d"
                              - storageSpec               = {
                                  - volumeClaimTemplate = {
                                      - spec = {
                                          - resources = {
                                              - requests = {
                                                  - storage = "15Gi"
                                                }
                                            }
                                        }
                                    }
                                }
                              - tolerations               = [
                                  - {
                                      - effect   = "NoSchedule"
                                      - key      = "kubernetes.azure.com/scalesetpriority"
                                      - operator = "Equal"
                                      - value    = "spot"
                                    },
                                  - {
                                      - effect   = "NoSchedule"
                                      - key      = "workload"
                                      - operator = "Equal"
                                      - value    = "prometheus"
                                    },
                                ]
                            }
                          - serviceAccount = {
                              - annotations = {
                                  - "azure.workload.identity/client-id" = "c4eeeead-8978-41a4-a989-e6e4aa9df603"
                                }
                            }
                        }
                    }
                )
              - version        = "67.9.0"
            },
        ] -> (known after apply)
        name                       = "kube-prometheus-stack"
      ~ status                     = "failed" -> "deployed"
      ~ values                     = [
          ~ <<-EOT
                crds:
                  enabled: false
                alertmanager:
                  enabled: true
                grafana:
                  enabled: false
                prometheus:
                  enabled: true
                  serviceAccount:
                    annotations:
                      azure.workload.identity/client-id: "c4eeeead-8978-41a4-a989-e6e4aa9df603"
                  prometheusSpec:
                    podMetadata:
                      labels:
                        azure.workload.identity/use: "true"
                    externalLabels:
                      cluster: "k6tests-cluster"
                    enableRemoteWriteReceiver: true
                    remoteWrite:
                      - url: "https://k6tests-amw-0vej.norwayeast-1.metrics.ingest.monitor.azure.com/dataCollectionRules/dcr-81e9cf1b38fb4648b047399c5593ebda/streams/Microsoft-PrometheusMetrics/api/v1/write?api-version=2023-04-24"
                        azureAd:
                          cloud: "AzurePublic"
                          sdk:
                            tenantId: "cd0026d8-283b-4a55-9bfa-d0ef4a8ba21c"
              +     tolerations:
              +       - key: "kubernetes.azure.com/scalesetpriority"
              +         operator: "Equal"
              +         value: "spot"
              +         effect: "NoSchedule"
              +       - key: "workload"
              +         operator: "Equal"
              +         value: "prometheus"
              +         effect: "NoSchedule"
              +     resources:
              +       requests:
              +         memory: 8Gi
                    priorityClassName: "system-cluster-critical"
                    retention: 1d
                    storageSpec:
                      volumeClaimTemplate:
                        spec:
                          resources:
                            requests:
              -               storage: 5Gi
              +               storage: 15Gi
            EOT,
        ]
        # (25 unchanged attributes hidden)
    }

Plan: 4 to add, 1 to change, 4 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: tfplan.out

To perform exactly these actions, run the following command to apply:
    terraform apply "tfplan.out"
Releasing state lock. This may take a few moments...

Context Values
Pusher: @monteiro-renato
Action: push
Working Directory: ./infrastructure/adminservices-test/altinn-monitor-test-rg
State File: github.com/altinn/altinn-platform/environments/test/altinn-monitor-test-rg.tfstate
Plan File: github.com_altinn_altinn-platform_environments_test_altinn-monitor-test-rg.tfstate.tfplan

@monteiro-renato monteiro-renato marked this pull request as ready for review January 17, 2025 12:57
@monteiro-renato monteiro-renato requested a review from a team as a code owner January 17, 2025 12:57
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (1)

34-36: Consider adding CPU requests.

While the memory request is appropriate, consider adding CPU requests to prevent resource contention and ensure predictable performance.

     resources:
       requests:
         memory: 8Gi
+        cpu: 2000m
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)

58-75: Address disk encryption security concern.

The static analysis tool identified a security concern regarding disk encryption. Consider enabling encryption for the node pool to protect sensitive monitoring data.

Add the following configuration to enable encryption:

 resource "azurerm_kubernetes_cluster_node_pool" "prometheus" {
   # ... existing configuration ...
+  enable_host_encryption = true
 }
🧰 Tools
🪛 checkov (3.2.334)

[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources

(CKV_AZURE_227)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aca73d4 and 937b91d.

📒 Files selected for processing (2)
  • infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1 hunks)
  • infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (1 hunks)
🧰 Additional context used
🪛 checkov (3.2.334)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf

[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources

(CKV_AZURE_227)

🔇 Additional comments (3)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (2)

25-33: LGTM! Tolerations match the node pool configuration.

The tolerations are correctly configured to allow scheduling on the dedicated Prometheus node pool and spot instances.


44-44: LGTM! Storage increase aligns with requirements.

The storage increase from 5Gi to 15Gi is appropriate given the PR objective to handle more resource-intensive tests.

infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)

64-66: Verify spot instance stability for Prometheus workload.

While using spot instances for testing is cost-effective, ensure you have monitoring in place for:

  1. Spot instance termination events
  2. Impact on Prometheus data collection during evictions
✅ Verification successful

Spot instance configuration is appropriate for this test environment

The setup is resilient to spot instance terminations because:

  • Prometheus metrics are continuously streamed to Azure Monitor, ensuring data durability
  • No local persistence eliminates the risk of data loss during spot instance evictions
  • The environment is specifically designed for testing purposes where temporary disruptions are acceptable
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for spot instance terminations in the last 24 hours
kubectl get events --field-selector reason=SpotTermination -A

# Check for Prometheus pod restarts
kubectl get pods -l app=prometheus -o wide --all-namespaces | grep -v "Running\|NAME"

Length of output: 347


Script:

#!/bin/bash
# Check for monitoring and persistence configurations
fd . --type f -e tf -e yaml -e yml | xargs rg -l "prometheus|monitoring|persistence|backup|alert" 

# Check for specific spot instance configurations
rg -l "spot_instance|spot_price|priority.*Spot" --type tf

# Look for related kubernetes configurations
fd . --type f -e yaml -e yml | xargs rg -l "prometheus|tolerations|nodeSelector"

Length of output: 1043


Script:

#!/bin/bash
# Check Prometheus configuration
cat infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus.tf

# Check for persistence and storage configurations
rg -A 5 "persistence:|storage:|volume:" infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus.tf

# Check tolerations and node affinity
cat infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k6_operator_values.yaml

Length of output: 2863

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)

64-66: Consider monitoring spot instance evictions.

Since this is a single-node pool using spot instances, consider:

  1. Setting up alerts for spot instance termination notifications
  2. Monitoring Prometheus availability and recovery after evictions

Would you like me to help create monitoring and alerting configurations for spot instance events?
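As a starting point, a minimal sketch of such an alert, assuming the kubernetes provider's kubernetes_manifest resource is in use, that kube-state-metrics node metrics are available, that the operator picks up rules carrying the release label, and that the spot nodes follow the default aks-<pool>- naming:

resource "kubernetes_manifest" "prometheus_pool_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "prometheus-pool-alerts"
      namespace = "monitoring"
      labels = {
        release = "kube-prometheus-stack" # assumed to match the chart's default rule selector
      }
    }
    spec = {
      groups = [{
        name = "prometheus-node-pool"
        rules = [{
          alert = "PrometheusSpotNodeNotReady"
          # Fires when a node in the dedicated pool stops reporting Ready, e.g. after a spot eviction.
          expr  = "kube_node_status_condition{node=~\"aks-prometheus-.*\", condition=\"Ready\", status=\"true\"} == 0"
          "for" = "5m"
          labels = {
            severity = "warning"
          }
          annotations = {
            summary = "A node in the prometheus spot pool has been NotReady for 5 minutes."
          }
        }]
      }]
    }
  }
}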

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 937b91d and dd388aa.

📒 Files selected for processing (2)
  • infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1 hunks)
  • infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (1 hunks)
🧰 Additional context used
🪛 checkov (3.2.334)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf

[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources

(CKV_AZURE_227)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (3)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (2)

25-36: LGTM! Well-structured configuration for the dedicated Prometheus node pool.

The tolerations and resource configurations are properly aligned with the node pool setup:

  • Correct tolerations for both spot instances and prometheus workload
  • Appropriate memory request of 8Gi for resource-intensive tests

44-44: Verify if retention period needs adjustment with increased storage.

While increasing storage to 15Gi is good for handling more data, consider if the current 1d retention period (line 39) is still optimal for your use case with the increased storage capacity.

Would you like me to help calculate an optimal retention period based on your data ingestion rate and the new storage capacity?
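For a rough estimate, Prometheus' operational guidance boils down to needed_disk ≈ retention_time_seconds × ingested_samples_per_second × bytes_per_sample, with roughly 1 to 2 bytes per sample after compression. At 1d retention, a 15Gi volume (~1.6 × 10^10 bytes over 86,400 seconds) budgets about 1.9 × 10^5 bytes per second, i.e. on the order of 10^5 ingested samples per second at ~2 bytes per sample, leaving WAL overhead and compaction headroom aside; the actual ingestion rate here would need to be measured rather than assumed.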

infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)

58-75: LGTM! Well-designed dedicated node pool for Prometheus.

The configuration is well-structured with appropriate VM size (Standard_D3_v2) to handle the 8Gi memory request.

🧰 Tools
🪛 checkov (3.2.334)

[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources

(CKV_AZURE_227)

@monteiro-renato monteiro-renato merged commit 58d6566 into main Jan 17, 2025
8 checks passed
@monteiro-renato monteiro-renato deleted the feat-dedicated-node-for-prometheus branch January 17, 2025 13:36