
Observability Requirements


Overview

These requirements are largely written from the point of view of the FOM production support/sustainment role, focusing on availability and performance (security-related monitoring was considered out of scope). These specific requirements all derive from the following observability/alerting epic: "As FOM production support/sustainment, I want information on how FOM is running in production to address issues proactively or reactively so that users (public, industry, and ministry) have a great user experience."

Observability Requirements

| Observability Requirement | Corresponding Metrics/Alerts |
| --- | --- |
| Is FOM currently running overall? | % of failed external health checks; alert on failing external health checks. (Internal health checks could be used, but they are typically wired up as container health checks to restart instances automatically, so they are not useful to alert on, and they can produce false positives and false negatives.) See the health-check sketch after this table. |
| Are user interactions generally successful? | % of web responses with 5xx response codes (4xx response codes may be relevant as well). |
| Is end-user performance reasonable? | Web response times, typically measured at the 50th, 90th, 95th, and 99th percentiles. See the percentile sketch after this table. |
| Are there particular operations that are too slow? | Response times (50th/90th/95th percentile) by API endpoint exceeding a threshold (e.g. 1500 ms). |
| Are scheduled (batch) processes running? | Alert on failed executions of scheduled processes. See the scheduled-task sketch after this table. |
| Is FOM at risk of exhausting storage? | Monitor the trend in % of allocated storage free (unused); alert if less than 10%. |
| Is FOM at risk of exhausting infrastructure resource capacity (compute, memory)? | Monitor CPU and memory utilization rates for the backend and database, with the ability to cross-correlate with response times. (Sharply rising higher-percentile response times are often a sign of an approaching resource bottleneck.) |
| Troubleshoot availability or performance issues relating to peak traffic | Volume of activity: # of web requests per unit time, # of database connections/transactions per unit time. |
| Troubleshoot the root cause of availability or performance issues | Execution time of web requests for each infrastructure component and each layer in the backend; access to application logs with search capabilities. |
| Understand normal vs. abnormal system behavior to respond proactively and help troubleshoot | Access to historical metrics (at least a couple of weeks prior, if not more). Ability to compare past and current metrics. Ability to view different metrics for the same time period to determine correlations. |
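
To make the health-check requirement concrete, below is a minimal sketch of a health endpoint that an external monitor could poll. Express, the `/health` path, and the `checkDatabase` probe are assumptions for illustration; FOM's actual framework and routes may differ.

```typescript
// Minimal health-check endpoint sketch (assumed Express setup).
// An external uptime monitor polls this URL and alerts on failures.
import express, { Request, Response } from "express";

const app = express();

app.get("/health", async (_req: Request, res: Response) => {
  try {
    // Hypothetical dependency probe, e.g. a cheap `SELECT 1` query.
    await checkDatabase();
    res.status(200).json({ status: "ok" });
  } catch (err) {
    // A 5xx response marks the check as failed for the external monitor.
    res.status(503).json({ status: "unavailable" });
  }
});

// Placeholder for a real dependency probe; replace with actual checks.
async function checkDatabase(): Promise<void> {
  // ...
}

app.listen(3000);
```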
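The percentile-based requirements can be illustrated with a small calculation. This sketch uses the nearest-rank method over raw samples; in practice a monitoring platform computes these percentiles for you, typically from histograms, and the sample values here are made up.

```typescript
// Nearest-rank percentile over raw response-time samples (illustrative only).
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Index of the p-th percentile in the sorted list (nearest-rank method).
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const responseTimesMs = [120, 95, 410, 1600, 230, 88, 310, 540, 1900, 175];
for (const p of [50, 90, 95, 99]) {
  console.log(`p${p}: ${percentile(responseTimesMs, p)} ms`);
}
// An alert rule could then fire when, say, p90 for an endpoint
// exceeds the 1500 ms threshold from the table above.
```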
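For the scheduled-process requirement, one simple pattern is to have the batch entry point exit nonzero on failure, so the scheduler (e.g. an OpenShift/Kubernetes CronJob) records a failed execution that can be alerted on. This is a sketch under that assumption; `runBatchJob` is a hypothetical task name.

```typescript
// Sketch of a scheduled (batch) process wrapper that makes failures visible.
async function runBatchJob(): Promise<void> {
  // ... actual batch work goes here ...
}

async function main(): Promise<void> {
  try {
    await runBatchJob();
    console.log(JSON.stringify({ level: "info", msg: "batch job succeeded" }));
  } catch (err) {
    // Structured error log so the failure is searchable in application logs.
    console.error(
      JSON.stringify({ level: "error", msg: "batch job failed", err: String(err) })
    );
    process.exitCode = 1; // nonzero exit signals a failed execution to the scheduler
  }
}

main();
```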

Implementation Considerations

  • Using alerts requires deciding who receives them, what the communications channel will be (e.g. email distribution list, Teams channel, SMS), what the expected actions are, and what the response-time objective is (SLO/SLA). Third-party tools (e.g. PagerDuty) can be used by larger teams to coordinate the dispatch, response, and resolution of alerts.
  • When prioritizing the implementation of alerts, functions whose failure users will notice and can easily report to the team are less critical to alert on. For example, in FOM the public site needs an outage alert much more than the admin site, since public users are unlikely to report an outage to the team.
  • Platforms used (e.g. OpenShift vs. AWS) and technology stacks (e.g. Spring Boot) provide a defined set of metrics as a starting point. Third-party tools (e.g. Sentry, New Relic, Azure Application Insights) can be integrated with the code base to provide additional metrics beyond those available in the platform/technology stack (see the sketch after this list).
  • Key metrics are typically displayed in a dashboard with the ability to change time periods and to compare different metrics for the same time period. An example from AWS: [Image: AWS Sample Dashboard]
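
As one concrete (assumed) example of integrating additional metrics into a Node.js code base, the sketch below uses prom-client, a common Prometheus client; the tools named above (Sentry, New Relic, Application Insights) offer analogous APIs. The metric name `http_request_duration_seconds` and the bucket boundaries are illustrative choices, not FOM's actual configuration.

```typescript
// Sketch: exposing custom application metrics with prom-client.
import express from "express";
import client from "prom-client";

const app = express();

// Built-in process metrics (CPU, memory, event loop lag, ...).
client.collectDefaultMetrics();

// Custom histogram of request durations, from which a dashboard can
// derive the 50th/90th/95th/99th percentile views discussed above.
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 1.5, 3, 5],
});

// Record the duration of every request once the response finishes.
app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on("finish", () => {
    end({ method: req.method, route: req.path, status: String(res.statusCode) });
  });
  next();
});

// Scrape endpoint for the monitoring platform.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(3000);
```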