Monitoring

Metrics

As part of our monitoring effort there are certain key metrics which can serve as an early warning system for potential issues and also shed some light on the day-to-day performance of the cluster as a whole. Some of these metrics are tracked using Azure Insights and have been pinned to our dashboard in the Azure portal; they are listed below:

  • Average Node CPU Usage

Tracks the CPU usage of the cluster's nodes. Monitoring CPU usage helps ensure efficient resource allocation and detect potential performance issues; it can also serve as a vital point of reference when potential security threats arise.

NodeCPU

As can be seen in the graph above, the only metric displayed is node CPU usage. It is drawn as a solid line when data is being received, or as a dashed line when data is missing for the time period (e.g. a deployment is ongoing) or there were issues retrieving the data for the specified time. Large periods of missing data can be a potential indicator of issues within the cluster.

  • Average Pod Count

This chart displays the average number of pods running within the cluster. It can be useful when trying to measure the cluster's capacity and workload distribution.

PodCount

Like the node CPU chart, this chart tracks a single metric to keep it clean and readable. It displays the average number of healthy running pods within the cluster and shows the most activity during scaling and deployment events, where the average may increase or decrease in line with the activity taking place. During deployments this chart can also be one of the first in which we see a drop if there have been issues, e.g. pods in an error status.

  • Used Disk Percentage

Monitoring disk usage is critical to prevent storage-related issues. This chart allows us to monitor active disk usage and view key points of increased activity, and from this data we can plan to scale accordingly. It can also be a key metric for cluster security, as random spikes can indicate malicious activity.

UsedDisk

We track two metrics within this chart: the maximum disk usage, for which we have an alert set up to notify us when usage breaches a specified threshold, and the average disk usage. Comparing these metrics against each other can be useful when the alert has been triggered and when making plans for the future of the cluster.
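As a sketch of how this chart's data can also be pulled from Log Analytics (assuming Container insights is enabled and writing node disk metrics to the InsightsMetrics table; the time range and bin size here are illustrative), a query along these lines returns the average and maximum used disk percentage per node:

```
// Average and maximum node disk usage percentage per node, in 5-minute bins
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Namespace == "container.azm.ms/disk" and Name == "used_percent"
| summarize AvgUsedPercent = avg(Val), MaxUsedPercent = max(Val) by bin(TimeGenerated, 5m), Computer
| order by TimeGenerated asc
```

Comparing MaxUsedPercent against AvgUsedPercent in the results mirrors the two series plotted on the chart.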

  • Bytes Read and Written per Second

Monitoring data I/O can prove vital when identifying potential performance bottlenecks. The purpose of this chart is to visualise data transfer rates.

I/O

As stated, this chart shows the average transfer rates in and out of the cluster. As can be seen in the image above, there are spikes in incoming traffic when a connection is established and an action is applied, either to the cluster or within the applications. These spikes can be a key performance indicator when it comes to lagging processing times, as they allow us to view when a bottleneck might have begun; from there we can use this data alongside some of our other configured charts for root cause analysis. Additionally, large amounts of unexpected incoming activity can be an indication of security issues.

Logs

Some of the logs analysed were:

  • Average Node CPU Usage Percentage per Minute

Similarly to the Node CPU chart, this log query returns similar results but in a written format, allowing for a more granular, text-based approach with the option to customise the query and fine-tune the output as required.
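As a sketch of such a query (assuming Container insights is enabled and populating the standard Perf table counters for K8SNode; the time range and bin size are illustrative), node CPU usage can be expressed as a percentage of capacity per minute:

```
// Average node CPU usage percentage per 1-minute bin, per node
let usage = Perf
    | where TimeGenerated > ago(1h)
    | where ObjectName == "K8SNode" and CounterName == "cpuUsageNanoCores"
    | summarize UsedNanoCores = avg(CounterValue) by bin(TimeGenerated, 1m), Computer;
let capacity = Perf
    | where TimeGenerated > ago(1h)
    | where ObjectName == "K8SNode" and CounterName == "cpuCapacityNanoCores"
    | summarize CapacityNanoCores = max(CounterValue) by bin(TimeGenerated, 1m), Computer;
usage
| join kind=inner (capacity) on TimeGenerated, Computer
| extend CpuUsagePercentage = round(UsedNanoCores * 100.0 / CapacityNanoCores, 2)
| project TimeGenerated, Computer, CpuUsagePercentage
| order by TimeGenerated asc
```

Adjusting the bin size or adding a filter such as `where Computer == "<node-name>"` is an example of the fine-tuning mentioned above.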

  • Average Node Memory Usage Percentage per Minute

In the same way we use CPU allocation to determine performance within the cluster, the logs for memory usage allow us to do the same. They detail which resources make the most frequent calls and use the most memory, and from this we can plan further cluster provisioning (e.g. volumes) or investigate why these resources are so memory hungry.
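A minimal sketch of the corresponding memory query (same Perf/K8SNode assumptions as the CPU query above) summarises the working set per node per minute; combining it with the memoryCapacityBytes counter in the same way as the CPU query yields a percentage:

```
// Average node memory working set per 1-minute bin, per node
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SNode" and CounterName == "memoryWorkingSetBytes"
| summarize AvgWorkingSetBytes = avg(CounterValue) by bin(TimeGenerated, 1m), Computer
| order by TimeGenerated asc
```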

  • Pod Counts with Phase

These logs detail the life events of each pod within the cluster and as such allow us to ensure even workload distribution and further inspect pods in undesirable states. They also allow us to monitor the time it takes for new pods to become live, which can be a useful performance indicator for cluster health.
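A minimal sketch of this query (assuming the Container insights KubePodInventory table, with the pod phase recorded in the PodStatus column; the time range and bin size are illustrative):

```
// Count of distinct pods in each phase (Pending / Running / Succeeded / Failed / Unknown)
KubePodInventory
| where TimeGenerated > ago(1h)
| summarize PodCount = dcount(Name) by PodStatus, bin(TimeGenerated, 5m)
| order by TimeGenerated asc
```

Filtering on, for example, `PodStatus == "Failed"` quickly surfaces pods in undesirable states.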

  • Warning in Container Logs

This log is a key-term search; for the purposes of this project we are checking for warnings within the containers. In an ideal world this log would always be empty, however when things do go wrong it can be our most valuable resource. As can be seen in the image below, at present there are no warnings; due to the nature of this query it can be used to check for a broad range of terms.

WarningsQueryResults
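A minimal sketch of this key-term search (assuming container stdout/stderr is being collected into the ContainerLog table; clusters using the newer ContainerLogV2 schema hold the message in a LogMessage column instead, and the search term and time range here are illustrative):

```
// Key-term search for warnings in container logs
ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "warning"
| project TimeGenerated, Computer, ContainerID, LogEntry
| order by TimeGenerated desc
```

Swapping "warning" for any other term gives the broad-range search described above.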

  • Monitoring Kubernetes Events

This log set contains information about Kubernetes events, e.g. pod scheduling, scaling activities, and errors, and can serve as a granular insight into cluster performance and health.
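A minimal sketch of an events query (assuming the Container insights KubeEvents table; the time range and projected columns are illustrative):

```
// Recent Kubernetes events, most recent first
KubeEvents
| where TimeGenerated > ago(1h)
| project TimeGenerated, Namespace, Name, ObjectKind, Reason, Message
| order by TimeGenerated desc
```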

Alarms

Alerts

The above are some of the alerts currently set up within the cluster, configured to trigger should the specified resource breach a threshold of 80%/90%. Should this threshold be breached, an alert is sent (in our case by email, although SMS can also be configured) to a specified action group containing a defined list of users to whom the alert will be sent. When the alert has been received it is then at the discretion of the responding team to investigate why the alert was triggered and take the appropriate measures to rectify the issue.

Response Strategies

In some situations the response to alerts can be very clear; however, when it comes to our cluster it is always good to have a defined approach for dealing with situations should they arise. Should any of our alerts be triggered, the main steps I would consider taking are:

  1. Checking the alert that was triggered
  2. Checking metric charts to see if the alert was triggered by a spike in activity or a gradual incline in activity
  3. Check against historical data for the metric to see if it has happened before / whether it is a common occurrence
  4. Retrieve and view log data for the metric triggered using log analytics queries to help better diagnose the issue and perform root cause analysis
  5. Take remediation steps if an urgent fix is required (if time allows, this will be the planning of remediation steps)
  6. (Apply plan from previous step)
  7. Create a record of the issue and the steps taken to resolve it so that, should the problem arise again, it can serve as a point of reference

e.g. in the case that the CPU alert is triggered due to high demand for the application, we might allocate additional resources by scaling the cluster node pool, should our configured auto-scaling not be sufficient. This can be done using `az aks nodepool scale`, or we can scale the deployment using `kubectl scale deployment <deployment-name> --replicas=<desired-replica-count>`.
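As an illustrative follow-up to step 4 for the CPU example above (same Perf/K8SNode assumptions as in the Logs section; the 30-minute window is arbitrary), a query like this ranks nodes by average CPU usage so we can see where the pressure is before deciding whether to scale the node pool or the deployment:

```
// Rank nodes by average CPU usage over the last 30 minutes
Perf
| where TimeGenerated > ago(30m)
| where ObjectName == "K8SNode" and CounterName == "cpuUsageNanoCores"
| summarize AvgCpuNanoCores = avg(CounterValue) by Computer
| order by AvgCpuNanoCores desc
```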