
feat: create a dedicated pool for prometheus and beef up the instance #1238

Merged
merged 1 commit into main from feat-dedicated-node-for-prometheus on Jan 17, 2025

Conversation

@monteiro-renato (Contributor) commented on Jan 17, 2025

Dagfinn is running more resource-intensive tests and Prometheus is starting to struggle.

[screenshot omitted]

This PR adds a dedicated node pool for Prometheus, sets a memory request, and increases the storage request. (A hedged sketch of what the node pool resource could look like follows the summary below.)

Summary by CodeRabbit

  • Infrastructure
    • Added a dedicated Prometheus node pool to the Kubernetes cluster.
    • Configured node pool with spot pricing and specific workload labels.
    • Updated Prometheus service configuration with enhanced resource allocation and scheduling tolerations.
    • Increased Prometheus storage claim from 5Gi to 15Gi.
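For context, here is a minimal sketch of what the dedicated spot node pool could look like. The cluster reference, pool/label/taint names, and node count are assumptions; the VM size (Standard_D3_v2) and the workload=prometheus toleration key come from the review and plan output further down.

resource "azurerm_kubernetes_cluster_node_pool" "prometheus" {
  name                  = "prometheus"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k6tests.id # assumed cluster reference
  vm_size               = "Standard_D3_v2"                      # 4 vCPU / 14 GiB, covers the 8Gi request
  node_count            = 1                                     # assumed single node
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1 # cap at the on-demand price

  node_labels = {
    workload = "prometheus" # assumed label matching the toleration in the values template
  }

  node_taints = [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule", # applied to spot pools by AKS
    "workload=prometheus:NoSchedule",                        # keeps unrelated pods off the pool
  ]
}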

coderabbitai bot (Contributor) commented on Jan 17, 2025

📝 Walkthrough

This pull request introduces a dedicated Prometheus node pool in an Azure Kubernetes cluster and updates its configuration. The changes involve creating a new node pool with specific characteristics, including spot instance configuration, node labels, and taints. Additionally, the Prometheus stack configuration is modified to adjust resource allocation, storage, and pod scheduling tolerations to accommodate the new node pool.

Changes

infrastructure/adminservices-test/.../k6_tests_rg_k8s.tf
  Added a new azurerm_kubernetes_cluster_node_pool resource named "prometheus" with spot instance configuration, a specific VM size, and custom node labels/taints.

infrastructure/adminservices-test/.../k6_tests_rg_kube_prometheus_stack_values.tftpl
  Updated the Prometheus configuration (see the wiring sketch after this list):
  - Added tolerations for spot and prometheus-labeled nodes
  - Added an 8Gi memory resource request
  - Increased the storage claim from 5Gi to 15Gi
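A hedged sketch of how a values template like this is typically wired into the Helm release: the release name, namespace, and chart version come from the plan output below, while the chart repository URL and the (omitted) template variables are assumptions.

resource "helm_release" "kube_prometheus_stack" {
  name       = "kube-prometheus-stack"
  namespace  = "monitoring"
  repository = "https://prometheus-community.github.io/helm-charts" # assumed upstream chart repo
  chart      = "kube-prometheus-stack"
  version    = "67.9.0" # version reported in the plan metadata

  values = [
    # Template variables (remote-write URL, tenant/client IDs, ...) are omitted here.
    templatefile("${path.module}/k6_tests_rg_kube_prometheus_stack_values.tftpl", {})
  ]
}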

Sequence Diagram

sequenceDiagram
    participant K8s as Kubernetes Cluster
    participant NodePool as Prometheus Node Pool
    participant PrometheusService as Prometheus Service

    NodePool->>K8s: Create Spot Node Pool
    K8s->>NodePool: Configure Node Labels & Taints
    PrometheusService->>K8s: Apply Tolerations
    K8s->>NodePool: Schedule Prometheus Pods

Suggested reviewers

  • khanrn
  • tjololo

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

github-actions bot commented on Jan 17, 2025

Terraform environment test

Format and Style 🖌 success

Initialization ⚙️ success

Validation 🤖 success

Validation Output

Success! The configuration is valid.


Plan 📖 success

Show Plan

[Lines containing Refreshing state removed]
[Truncated to 120000 bytes! See log output for complete plan]
Acquiring state lock. This may take a few moments...
data.azurerm_client_config.current: Reading...
data.azurerm_monitor_data_collection_rule.prometheus: Reading...
data.azurerm_client_config.current: Read complete after 0s [id=Y2xpZW50Q29uZmlncy9jbGllbnRJZD1mMDNmNzU3OS01ZjM5LTRkN2YtYmNlYS04ZTVjYjZkN2FjYTQ7b2JqZWN0SWQ9MDJiMTdkMjgtN2VjOC00MDg0LTkyY2UtZDI2MDUxZmE0OTk4O3N1YnNjcmlwdGlvbklkPTFjZThlOWFmLWMyZDYtNDRlNy05YzVlLTA5OWEzMDgwNTZmZTt0ZW5hbnRJZD1jZDAwMjZkOC0yODNiLTRhNTUtOWJmYS1kMGVmNGE4YmEyMWM=]
data.azurerm_resource_group.rg_dp_stag: Reading...
data.azurerm_resource_group.rg_dp_prod: Reading...
data.azurerm_resource_group.rg_dp_prod: Read complete after 1s [id=/subscriptions/c595f787-450d-4c57-84fa-abc5f95d5459/resourceGroups/dp-be-prod-rg]
data.azurerm_log_analytics_workspace.dp_law_prod: Reading...
data.azurerm_monitor_data_collection_rule.prometheus: Read complete after 1s [id=/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/resourceGroups/MA_k6tests-amw_norwayeast_managed/providers/Microsoft.Insights/dataCollectionRules/k6tests-amw]
data.azurerm_resource_group.rg_dp_stag: Read complete after 1s [id=/subscriptions/e4926efc-0577-47b3-9c3d-757925630eca/resourceGroups/dp-be-staging-rg]
data.azurerm_log_analytics_workspace.dp_law_prod: Read complete after 1s [id=/subscriptions/c595f787-450d-4c57-84fa-abc5f95d5459/resourceGroups/dp-be-prod-rg/providers/Microsoft.OperationalInsights/workspaces/dp-be-prod-insightsWorkspace]
data.azurerm_log_analytics_workspace.dp_law_stag: Reading...
data.azurerm_resource_group.rd_dp_test["dp-be-yt01-rg"]: Reading...
data.azurerm_log_analytics_workspace.dp_law_test["dp-be-test-insightsWorkspace"]: Reading...
data.azurerm_log_analytics_workspace.dp_law_test["dp-be-yt01-insightsWorkspace"]: Reading...
data.azurerm_resource_group.rd_dp_test["dp-be-test-rg"]: Reading...
data.azurerm_resource_group.rd_dp_test["dp-be-test-rg"]: Read complete after 1s [id=/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-test-rg]
data.azurerm_log_analytics_workspace.dp_law_test["dp-be-test-insightsWorkspace"]: Read complete after 1s [id=/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-test-rg/providers/Microsoft.OperationalInsights/workspaces/dp-be-test-insightsWorkspace]
data.azurerm_log_analytics_workspace.dp_law_stag: Read complete after 1s [id=/subscriptions/e4926efc-0577-47b3-9c3d-757925630eca/resourceGroups/dp-be-staging-rg/providers/Microsoft.OperationalInsights/workspaces/dp-be-staging-insightsWorkspace]
data.azurerm_log_analytics_workspace.dp_law_test["dp-be-yt01-insightsWorkspace"]: Read complete after 1s [id=/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-yt01-rg/providers/Microsoft.OperationalInsights/workspaces/dp-be-yt01-insightsWorkspace]
data.azurerm_resource_group.rd_dp_test["dp-be-yt01-rg"]: Read complete after 2s [id=/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-yt01-rg]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # azurerm_role_assignment.monitoring_reader_rg_dp_prod must be replaced
-/+ resource "azurerm_role_assignment" "monitoring_reader_rg_dp_prod" {
      + condition_version                      = (known after apply)
      ~ id                                     = "/subscriptions/c595f787-450d-4c57-84fa-abc5f95d5459/resourceGroups/dp-be-prod-rg/providers/Microsoft.Authorization/roleAssignments/cb246258-ec98-85cd-e3c8-05a7556e3133" -> (known after apply)
      ~ name                                   = "cb246258-ec98-85cd-e3c8-05a7556e3133" -> (known after apply)
      ~ principal_type                         = "ServicePrincipal" -> (known after apply)
      ~ role_definition_id                     = "/subscriptions/c595f787-450d-4c57-84fa-abc5f95d5459/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" -> "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" # forces replacement
      ~ role_definition_name                   = "Monitoring Reader" -> (known after apply)
        # (6 unchanged attributes hidden)
    }

  # azurerm_role_assignment.monitoring_reader_rg_dp_stag must be replaced
-/+ resource "azurerm_role_assignment" "monitoring_reader_rg_dp_stag" {
      + condition_version                      = (known after apply)
      ~ id                                     = "/subscriptions/e4926efc-0577-47b3-9c3d-757925630eca/resourceGroups/dp-be-staging-rg/providers/Microsoft.Authorization/roleAssignments/8304cd3a-36bb-1d80-be0c-606c07d74021" -> (known after apply)
      ~ name                                   = "8304cd3a-36bb-1d80-be0c-606c07d74021" -> (known after apply)
      ~ principal_type                         = "ServicePrincipal" -> (known after apply)
      ~ role_definition_id                     = "/subscriptions/e4926efc-0577-47b3-9c3d-757925630eca/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" -> "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" # forces replacement
      ~ role_definition_name                   = "Monitoring Reader" -> (known after apply)
        # (6 unchanged attributes hidden)
    }

  # azurerm_role_assignment.monitoring_reader_rg_dp_test["dp-be-test-rg"] must be replaced
-/+ resource "azurerm_role_assignment" "monitoring_reader_rg_dp_test" {
      + condition_version                      = (known after apply)
      ~ id                                     = "/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-test-rg/providers/Microsoft.Authorization/roleAssignments/b8a3381c-1f71-e5ed-3ccb-fb2c8b6989e1" -> (known after apply)
      ~ name                                   = "b8a3381c-1f71-e5ed-3ccb-fb2c8b6989e1" -> (known after apply)
      ~ principal_type                         = "ServicePrincipal" -> (known after apply)
      ~ role_definition_id                     = "/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" -> "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" # forces replacement
      ~ role_definition_name                   = "Monitoring Reader" -> (known after apply)
        # (6 unchanged attributes hidden)
    }

  # azurerm_role_assignment.monitoring_reader_rg_dp_test["dp-be-yt01-rg"] must be replaced
-/+ resource "azurerm_role_assignment" "monitoring_reader_rg_dp_test" {
      + condition_version                      = (known after apply)
      ~ id                                     = "/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/resourceGroups/dp-be-yt01-rg/providers/Microsoft.Authorization/roleAssignments/3592ba27-02a2-6dc7-da3d-9a00c93e5f49" -> (known after apply)
      ~ name                                   = "3592ba27-02a2-6dc7-da3d-9a00c93e5f49" -> (known after apply)
      ~ principal_type                         = "ServicePrincipal" -> (known after apply)
      ~ role_definition_id                     = "/subscriptions/8a353de8-d81d-468d-a40d-f3574b6bb3f4/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" -> "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/providers/Microsoft.Authorization/roleDefinitions/43d0d8ad-25c7-4714-9337-8ba259a9fe05" # forces replacement
      ~ role_definition_name                   = "Monitoring Reader" -> (known after apply)
        # (6 unchanged attributes hidden)
    }

  # helm_release.kube_prometheus_stack will be updated in-place
  ~ resource "helm_release" "kube_prometheus_stack" {
        id                         = "kube-prometheus-stack"
      ~ metadata                   = [
          - {
              - app_version    = "v0.79.2"
              - chart          = "kube-prometheus-stack"
              - first_deployed = 1733406099
              - last_deployed  = 1737123696
              - name           = "kube-prometheus-stack"
              - namespace      = "monitoring"
              - notes          = <<-EOT
                    kube-prometheus-stack has been installed. Check its status by running:
                      kubectl --namespace monitoring get pods -l "release=kube-prometheus-stack"
                    
                    Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
                    
                    kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
                    The exposed metrics can be found here:
                    https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics
                    
                    The metrics are exported on the HTTP endpoint /metrics on the listening port.
                    In your case, kube-prometheus-stack-kube-state-metrics.monitoring.svc.cluster.local:8080/metrics
                    
                    They are served either as plaintext or protobuf depending on the Accept header.
                    They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.
                    
                    1. Get the application URL by running these commands:
                      export POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/name=prometheus-node-exporter,app.kubernetes.io/instance=kube-prometheus-stack" -o jsonpath="{.items[0].metadata.name}")
                      echo "Visit http://127.0.0.1:9100 to use your application"
                      kubectl port-forward --namespace monitoring $POD_NAME 9100
                EOT
              - revision       = 7
              - values         = jsonencode(
                    {
                      - alertmanager = {
                          - enabled = true
                        }
                      - crds         = {
                          - enabled = false
                        }
                      - grafana      = {
                          - enabled = false
                        }
                      - prometheus   = {
                          - enabled        = true
                          - prometheusSpec = {
                              - enableRemoteWriteReceiver = true
                              - externalLabels            = {
                                  - cluster = "k6tests-cluster"
                                }
                              - nodeSelector              = {
                                  - prometheus = true
                                }
                              - podMetadata               = {
                                  - labels = {
                                      - "azure.workload.identity/use" = "true"
                                    }
                                }
                              - priorityClassName         = "system-cluster-critical"
                              - remoteWrite               = [
                                  - {
                                      - azureAd = {
                                          - cloud = "AzurePublic"
                                          - sdk   = {
                                              - tenantId = "cd0026d8-283b-4a55-9bfa-d0ef4a8ba21c"
                                            }
                                        }
                                      - url     = "https://k6tests-amw-0vej.norwayeast-1.metrics.ingest.monitor.azure.com/dataCollectionRules/dcr-81e9cf1b38fb4648b047399c5593ebda/streams/Microsoft-PrometheusMetrics/api/v1/write?api-version=2023-04-24"
                                    },
                                ]
                              - resources                 = {
                                  - requests = {
                                      - memory = "8Gi"
                                    }
                                }
                              - retention                 = "1d"
                              - storageSpec               = {
                                  - volumeClaimTemplate = {
                                      - spec = {
                                          - resources = {
                                              - requests = {
                                                  - storage = "15Gi"
                                                }
                                            }
                                        }
                                    }
                                }
                              - tolerations               = [
                                  - {
                                      - effect   = "NoSchedule"
                                      - key      = "kubernetes.azure.com/scalesetpriority"
                                      - operator = "Equal"
                                      - value    = "spot"
                                    },
                                  - {
                                      - effect   = "NoSchedule"
                                      - key      = "workload"
                                      - operator = "Equal"
                                      - value    = "prometheus"
                                    },
                                ]
                            }
                          - serviceAccount = {
                              - annotations = {
                                  - "azure.workload.identity/client-id" = "c4eeeead-8978-41a4-a989-e6e4aa9df603"
                                }
                            }
                        }
                    }
                )
              - version        = "67.9.0"
            },
        ] -> (known after apply)
        name                       = "kube-prometheus-stack"
      ~ status                     = "failed" -> "deployed"
      ~ values                     = [
          ~ <<-EOT
                crds:
                  enabled: false
                alertmanager:
                  enabled: true
                grafana:
                  enabled: false
                prometheus:
                  enabled: true
                  serviceAccount:
                    annotations:
                      azure.workload.identity/client-id: "c4eeeead-8978-41a4-a989-e6e4aa9df603"
                  prometheusSpec:
                    podMetadata:
                      labels:
                        azure.workload.identity/use: "true"
                    externalLabels:
                      cluster: "k6tests-cluster"
                    enableRemoteWriteReceiver: true
                    remoteWrite:
                      - url: "https://k6tests-amw-0vej.norwayeast-1.metrics.ingest.monitor.azure.com/dataCollectionRules/dcr-81e9cf1b38fb4648b047399c5593ebda/streams/Microsoft-PrometheusMetrics/api/v1/write?api-version=2023-04-24"
                        azureAd:
                          cloud: "AzurePublic"
                          sdk:
                            tenantId: "cd0026d8-283b-4a55-9bfa-d0ef4a8ba21c"
              +     tolerations:
              +       - key: "kubernetes.azure.com/scalesetpriority"
              +         operator: "Equal"
              +         value: "spot"
              +         effect: "NoSchedule"
              +       - key: "workload"
              +         operator: "Equal"
              +         value: "prometheus"
              +         effect: "NoSchedule"
              +     resources:
              +       requests:
              +         memory: 8Gi
                    priorityClassName: "system-cluster-critical"
                    retention: 1d
                    storageSpec:
                      volumeClaimTemplate:
                        spec:
                          resources:
                            requests:
              -               storage: 5Gi
              +               storage: 15Gi
            EOT,
        ]
        # (25 unchanged attributes hidden)
    }

Plan: 4 to add, 1 to change, 4 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: tfplan.out

To perform exactly these actions, run the following command to apply:
    terraform apply "tfplan.out"
Releasing state lock. This may take a few moments...

Context Values
Pusher: @monteiro-renato
Action: push
Working Directory: ./infrastructure/adminservices-test/altinn-monitor-test-rg
State File: github.com/altinn/altinn-platform/environments/test/altinn-monitor-test-rg.tfstate
Plan File: github.com_altinn_altinn-platform_environments_test_altinn-monitor-test-rg.tfstate.tfplan

@monteiro-renato monteiro-renato marked this pull request as ready for review January 17, 2025 12:57
@monteiro-renato monteiro-renato requested a review from a team as a code owner January 17, 2025 12:57
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (1)

34-36: Consider adding CPU requests.

While the memory request is appropriate, consider adding CPU requests to prevent resource contention and ensure predictable performance.

     resources:
       requests:
         memory: 8Gi
+        cpu: 2000m
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)

58-75: Address disk encryption security concern.

The static analysis tool identified a security concern regarding disk encryption. Consider enabling encryption for the node pool to protect sensitive monitoring data.

Add the following configuration to enable encryption:

 resource "azurerm_kubernetes_cluster_node_pool" "prometheus" {
   # ... existing configuration ...
+  enable_host_encryption = true
 }
🧰 Tools
🪛 checkov (3.2.334)

[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources

(CKV_AZURE_227)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aca73d4 and 937b91d.

📒 Files selected for processing (2)
  • infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1 hunks)
  • infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (1 hunks)
🧰 Additional context used
🪛 checkov (3.2.334)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf

[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources

(CKV_AZURE_227)

🔇 Additional comments (3)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (2)

25-33: LGTM! Tolerations match the node pool configuration.

The tolerations are correctly configured to allow scheduling on the dedicated Prometheus node pool and spot instances.


44-44: LGTM! Storage increase aligns with requirements.

The storage increase from 5Gi to 15Gi is appropriate given the PR objective to handle more resource-intensive tests.

infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)

64-66: Verify spot instance stability for Prometheus workload.

While using spot instances for testing is cost-effective, ensure you have monitoring in place for:

  1. Spot instance termination events
  2. Impact on Prometheus data collection during evictions
✅ Verification successful

Spot instance configuration is appropriate for this test environment

The setup is resilient to spot instance terminations because:

  • Prometheus metrics are continuously streamed to Azure Monitor, ensuring data durability
  • No local persistence eliminates the risk of data loss during spot instance evictions
  • The environment is specifically designed for testing purposes where temporary disruptions are acceptable
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for spot instance terminations in the last 24 hours
kubectl get events --field-selector reason=SpotTermination -A

# Check for Prometheus pod restarts
kubectl get pods -l app=prometheus -o wide --all-namespaces | grep -v "Running\|NAME"

Length of output: 347


Script:

#!/bin/bash
# Check for monitoring and persistence configurations
fd . --type f -e tf -e yaml -e yml | xargs rg -l "prometheus|monitoring|persistence|backup|alert" 

# Check for specific spot instance configurations
rg -l "spot_instance|spot_price|priority.*Spot" --type tf

# Look for related kubernetes configurations
fd . --type f -e yaml -e yml | xargs rg -l "prometheus|tolerations|nodeSelector"

Length of output: 1043


Script:

#!/bin/bash
# Check Prometheus configuration
cat infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus.tf

# Check for persistence and storage configurations
rg -A 5 "persistence:|storage:|volume:" infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus.tf

# Check tolerations and node affinity
cat infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k6_operator_values.yaml

Length of output: 2863

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)

64-66: Consider monitoring spot instance evictions.

Since this is a single-node pool using spot instances, consider:

  1. Setting up alerts for spot instance termination notifications
  2. Monitoring Prometheus availability and recovery after evictions

Would you like me to help create monitoring and alerting configurations for spot instance events?
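As a starting point, a minimal sketch of such an alert, assuming the kubernetes provider's kubernetes_manifest resource is in use, that kube-state-metrics node metrics are available, that the operator picks up rules carrying the release label, and that the spot nodes follow the default aks-<pool>- naming:

resource "kubernetes_manifest" "prometheus_pool_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "prometheus-pool-alerts"
      namespace = "monitoring"
      labels = {
        release = "kube-prometheus-stack" # assumed to match the chart's default rule selector
      }
    }
    spec = {
      groups = [{
        name = "prometheus-node-pool"
        rules = [{
          alert = "PrometheusSpotNodeNotReady"
          # Fires when a node in the dedicated pool stops reporting Ready, e.g. after a spot eviction.
          expr  = "kube_node_status_condition{node=~\"aks-prometheus-.*\", condition=\"Ready\", status=\"true\"} == 0"
          "for" = "5m"
          labels = {
            severity = "warning"
          }
          annotations = {
            summary = "A node in the prometheus spot pool has been NotReady for 5 minutes."
          }
        }]
      }]
    }
  }
}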

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 937b91d and dd388aa.

📒 Files selected for processing (2)
  • infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1 hunks)
  • infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (1 hunks)
🧰 Additional context used
🪛 checkov (3.2.334)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf

[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources

(CKV_AZURE_227)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (3)
infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_kube_prometheus_stack_values.tftpl (2)

25-36: LGTM! Well-structured configuration for the dedicated Prometheus node pool.

The tolerations and resource configurations are properly aligned with the node pool setup:

  • Correct tolerations for both spot instances and prometheus workload
  • Appropriate memory request of 8Gi for resource-intensive tests

44-44: Verify if retention period needs adjustment with increased storage.

While increasing storage to 15Gi is good for handling more data, consider if the current 1d retention period (line 39) is still optimal for your use case with the increased storage capacity.

Would you like me to help calculate an optimal retention period based on your data ingestion rate and the new storage capacity?
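For a rough estimate, Prometheus' operational guidance boils down to needed_disk ≈ retention_time_seconds × ingested_samples_per_second × bytes_per_sample, with roughly 1 to 2 bytes per sample after compression. At 1d retention, a 15Gi volume (~1.6 × 10^10 bytes over 86,400 seconds) budgets about 1.9 × 10^5 bytes per second, i.e. on the order of 10^5 ingested samples per second at ~2 bytes per sample, leaving WAL overhead and compaction headroom aside; the actual ingestion rate here would need to be measured rather than assumed.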

infrastructure/adminservices-test/altinn-monitor-test-rg/k6_tests_rg_k8s.tf (1)

58-75: LGTM! Well-designed dedicated node pool for Prometheus.

The configuration is well-structured with appropriate VM size (Standard_D3_v2) to handle the 8Gi memory request.

🧰 Tools
🪛 checkov (3.2.334)

[HIGH] 58-75: Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources

(CKV_AZURE_227)

@monteiro-renato monteiro-renato merged commit 58d6566 into main Jan 17, 2025
8 checks passed
@monteiro-renato monteiro-renato deleted the feat-dedicated-node-for-prometheus branch January 17, 2025 13:36