From fe7581de86b3bea10411b85dac32960d36a6ebc9 Mon Sep 17 00:00:00 2001
From: worktheclock <85885287+worktheclock@users.noreply.github.com>
Date: Tue, 30 Jul 2024 16:18:38 +0200
Subject: [PATCH] Monitoring overview language edit

---
 .../operate/secure-and-monitor/monitor.mdx | 120 +++++++++---------
 1 file changed, 62 insertions(+), 58 deletions(-)

diff --git a/astro/src/content/docs/operate/secure-and-monitor/monitor.mdx b/astro/src/content/docs/operate/secure-and-monitor/monitor.mdx
index f3ffaa5bd0..36d78bca19 100644
--- a/astro/src/content/docs/operate/secure-and-monitor/monitor.mdx
+++ b/astro/src/content/docs/operate/secure-and-monitor/monitor.mdx
@@ -11,44 +11,46 @@ import LoadTestingTips from 'src/content/docs/get-started/run-in-the-cloud/_load

## Overview

-Once you've installed FusionAuth and your applications use it for authentication, you have to ensure FusionAuth remains online and usable. There is an ecosystem of tools, protocols, and paid services to monitor applications and provide telemetry (remote measurement) that you need to understand.
+Once you've installed FusionAuth and your applications use it for authentication, you'll want to ensure FusionAuth remains online and operational. An ecosystem of tools, protocols, and paid services is available to monitor applications and provide telemetry (remote measurement) to help you maintain FusionAuth's availability and performance.

-This overview article will introduce you to various metrics (measurements) you can extract from FusionAuth, common monitoring tools to process, store, and visualize them, and explain which tools and services to choose for your requirements. Your needs will depend on how much information you need about the state of FusionAuth, how much time you are willing to spend setting up and maintaining your monitoring system, and how much you are willing to pay for cloud services.
+This overview article will introduce you to various metrics (measurements) you can extract from FusionAuth and common monitoring tools to process, store, and visualize them.
+
+Your choice of tools and services will depend on the specific information you need about the state of FusionAuth, the time you are willing to spend setting up and maintaining your monitoring system, and your budget for cloud services.

### What Is Monitoring?

-The aim of monitoring FusionAuth is to be able to see at any time if the service is alive (Docker container is running) and functioning correctly (e.g., users can sign in and logs have no errors), and to be alerted (by email or chat app) if either of these conditions is not met.
+The aim of monitoring FusionAuth is to be able to see at any time if the service is alive (the Docker container is running) and functioning correctly (users can sign in and logs have no errors), and to be alerted (by email or chat app) if either of these conditions is not met.

A complete monitoring system involves five activities:
+
- Measuring — Getting operational data about FusionAuth and Docker.
-- Processing — Filtering, indexing, and aggregating data into counts and booleans that represent if the system is working correctly.
+- Processing — Filtering, indexing, and aggregating data into counts and booleans that represent whether the system is working correctly.
- Storing — Storing metrics, logs, and aggregates.
- Displaying — Showing the information in a dashboard. (A query language might do the aggregation here instead of the processing step.)
- Alerting — Notifying administrators when services degrade or fail.
-While you can use stock Docker monitoring apps to monitor Docker containers and collect logs, you always need to write custom code or [webhooks](https://fusionauth.io/docs/apis/webhooks) to get FusionAuth specific information, like numbers of successful logins. +While you can use stock Docker-monitoring apps to monitor Docker containers and collect logs, you will need to write custom code or [webhooks](https://fusionauth.io/docs/apis/webhooks) to get FusionAuth-specific information, like the number of successful logins. ## What FusionAuth Metrics Can Be Monitored? -The table below lists metadata FusionAuth offers for monitoring purposes. +The table below lists the metadata available from FusionAuth for monitoring purposes. | Name | Contains | Ingestion Options | More Information | |---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------| -| Metrics | System metrics, such as JVM buffers, memory, and threads. | Can be consumed via API and available in Prometheus compatible format. | [More info](/docs/operate/secure-and-monitor/monitor#metrics) | -| System Logs | Exceptions, stack traces, database connection issues, Elasticsearch connection issues. Minimal tracing capability unless debug logging is enabled. Logs are also written to the Docker standard output (terminal). | Can be exported as a zip file via API. | [More info](/docs/operate/troubleshooting/troubleshooting#system-log-ui) | -| Audit Logs | Created by the admin user interface actions. This is just an API, so a customer can also call this API. This is a free form API, and the audit log contents are whatever you put in. | This can be consumed via web hook or API. | [More info](/docs/apis/audit-logs) | -| Event Logs | Debug information for external integrations with IdPs, SMTP etc. In general, runtime errors that are typically not caused by FusionAuth and cannot be communicated well at runtime. Examples: a template rendering error in a custom theme, an exception connecting to SMTP due to bad credentials, a failure in a SAML exchange, a connection to a webhook is failing. | This can be consumed via web hook or API. | [More info](/docs/operate/troubleshooting/troubleshooting#event-log) | -| Login records | Each successful login is recorded here. Includes IP information, application, user, and timestamp. | Can be consumed via API. This record itself is not sent through a webhook, but a login success or login failure can be consumed via a web hook. | [More info](/docs/apis/login#export-login-records) | +| Metrics | System metrics, such as JVM buffers, memory, and threads. | Can be consumed via API, available in Prometheus-compatible format. | [More info](/docs/operate/secure-and-monitor/monitor#metrics) | +| System Logs | Exceptions, stack traces, database connection issues, Elasticsearch connection issues. Minimal tracing capability unless debug logging is enabled. Logs are also written to the Docker standard output (terminal). | Can be exported as a ZIP file via API. 
| [More info](/docs/operate/troubleshooting/troubleshooting#system-log-ui) |
+| Audit Logs | Created by admin user interface actions. The audit log is written through a standard API, so you can also call the API to add your own entries. It is free-form and contains whatever data you specify. | Can be consumed via webhook or API. | [More info](/docs/apis/audit-logs) |
+| Event Logs | Debug information for external integrations with IdPs, SMTP, and other external services. In general, runtime errors that are typically not caused by FusionAuth and cannot be communicated well at runtime. For example, a template rendering error in a custom theme, an exception connecting to SMTP due to bad credentials, a failure in a SAML exchange, or a failing connection to a webhook. | Can be consumed via webhook or API. | [More info](/docs/operate/troubleshooting/troubleshooting#event-log) |
+| Login records | Each successful login is recorded here. Includes IP information, application, user, and timestamp. | Can be consumed via API. This record itself is not sent through a webhook, but a login success or login failure can be consumed via a webhook. | [More info](/docs/apis/login#export-login-records) |
| Webhooks | Triggered by events as documented, sending is configurable. Contains IP and location information when available. | Can be sent to an HTTP endpoint or a Kafka topic. | [More info](/docs/extend/events-and-webhooks/) |

-Most of the items above are given as a ZIP file when requested by an API call. Webhooks are different — they are pushed from FusionAuth to a URL you specify, not called in an API by a monitoring tool.
+Most of the items above are retrieved with an API call, in some cases as a ZIP file. Webhooks are different: they are pushed from FusionAuth to a URL you specify rather than being pulled from an API by a monitoring tool.

-The next few subsections give more detail about each item above.
+The next few subsections detail each of these items.

### Log Files

-The system log files will be placed in the `logs` directory under the FusionAuth installation unless you are running FusionAuth in a container. In the latter case, the log output will be sent to STDOUT by default.
-You may also set up a [Docker logging driver](https://docs.docker.com/config/containers/logging/configure/) to direct log files elsewhere.
+The system log files will be placed in the `logs` directory under the FusionAuth installation unless you are running FusionAuth in a container. In that case, the log output will be sent to STDOUT by default. You may also set up a [Docker logging driver](https://docs.docker.com/config/containers/logging/configure/) to direct log files elsewhere.

System logs running in non-containerized instances can also be exported via the [Export System Logs API](/docs/apis/system#export-system-logs).

@@ -56,13 +58,13 @@ System logs running in non-containerized instances can also be exported via the

### Application Logging

-There are a few different APIs exposing FusionAuth application specific information you may want to ingest into your monitoring system:
+A few different APIs expose FusionAuth application-specific information you may want to ingest into your monitoring system:

* [FusionAuth administrative user interface audit logs](/docs/apis/audit-logs)
* [Logs and errors from asynchronous code execution](/docs/apis/event-logs)
* [Login records](/docs/apis/login#export-login-records)

-In general these are APIs you will have to poll to ingest.
+In general, these are APIs you will have to poll to ingest.
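To make that polling pattern concrete, here is a minimal sketch of a poller written in Python with the `requests` library. The endpoint path, query parameter, and response field name are assumptions based on the Audit Log API linked above, so check that page for the exact contract; the same loop shape works for the event log and login record APIs.

```python
# Minimal audit log poller (sketch). Assumes a FusionAuth API key with permission to
# read audit logs; the endpoint path, "start" parameter, and "auditLogs" response field
# are illustrative and should be verified against the Audit Log API documentation.
import os
import time

import requests

FUSIONAUTH_URL = os.environ.get("FUSIONAUTH_URL", "http://localhost:9011")
API_KEY = os.environ["FUSIONAUTH_API_KEY"]  # FusionAuth API keys go in the Authorization header
POLL_SECONDS = 60


def fetch_audit_logs_since(since_ms: int) -> list:
    response = requests.get(
        f"{FUSIONAUTH_URL}/api/system/audit-log/search",
        headers={"Authorization": API_KEY},
        params={"start": since_ms},
        timeout=10,
    )
    response.raise_for_status()
    return response.json().get("auditLogs", [])


if __name__ == "__main__":
    last_poll_ms = int(time.time() * 1000)
    while True:
        for entry in fetch_audit_logs_since(last_poll_ms):
            # Forward each entry to your log pipeline (Logstash, Loki, a file, ...) here.
            print(entry)
        last_poll_ms = int(time.time() * 1000)
        time.sleep(POLL_SECONDS)
```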
### Log File Formats @@ -72,31 +74,31 @@ In general these are APIs you will have to poll to ingest. | Audit Logs | Zipped CSV | Specified by the `zoneId` parameter | Specified by the `dateTimeSecondsFormat` parameter, defaults to `M/d/yyyy hh:mm:ss a z` | [API](/docs/apis/audit-logs) | | Event Logs | JSON | UTC | [Instant](/docs/reference/data-types#instants) | [API](/docs/apis/event-logs) | | Login Records | Zipped CSV | Specified by the `zoneId` parameter | Specified by the `dateTimeSecondsFormat` parameter, defaults to `M/d/yyyy hh:mm:ss a z` | [API](/docs/apis/login#export-login-records) | -| System Logs | Zipped directory of files. Log entries separated by newlines, but may be unstructured (stack traces, etc). | For log entries, the timezone of the server. The `zoneId` parameter, if provided, is used to build the filename. | `yyyy-MM-dd h:mm:ss.SSS a` | [API](/docs/apis/system#export-system-logs) | +| System Logs | Zipped directory of files. Log entries separated by newlines, but may be unstructured (for example, stack traces). | For log entries, the timezone of the server. The `zoneId` parameter, if provided, is used to build the filename. | `yyyy-MM-dd h:mm:ss.SSS a` | [API](/docs/apis/system#export-system-logs) | ### Application Events -There are a number of events, such as a failed authentication or an account deletion, that you may want to ingest into your monitoring system. These are available as webhooks. +You may want to ingest application events such as failed authentication or account deletion into your monitoring system. These are available as webhooks. Here's [the list of all available events](/docs/extend/events-and-webhooks/events/). ### Metrics -You can pull system metrics from the [System API](/docs/apis/system#retrieve-system-status). The format of these metrics are evolving and thus are not documented. +You can pull system metrics from the [System API](/docs/apis/system#retrieve-system-status). The format of these metrics is evolving and thus not documented. -You can also enable JMX metrics as outlined in the [Troubleshooting documentation](/docs/operate/troubleshooting/troubleshooting#enabling-jmx). +You can also enable JMX metrics as outlined in the [troubleshooting documentation](/docs/operate/troubleshooting/troubleshooting#enabling-jmx). #### Prometheus Endpoint -The default system metrics are also available in a Prometheus compatible form. [This tutorial](/docs/operate/secure-and-monitor/prometheus) explains how to set up Prometheus to monitor FusionAuth metrics. +Default system metrics are also available in a Prometheus-compatible form. [This tutorial](/docs/operate/secure-and-monitor/prometheus) explains how to set up Prometheus to monitor FusionAuth metrics. ## Overview Of Popular Monitoring Tools -This section introduces the most common monitoring tools. At the end of the section is a table showing where each tool fits in the five activities (measuring, processing, storing, displaying, alerting). +This section introduces the most common monitoring tools. The table at the end of the section illustrates each tool's suitability for the five monitoring activities: measuring, processing, storing, displaying, and alerting. ### What Is OpenTelemetry? @@ -104,70 +106,71 @@ This section introduces the most common monitoring tools. 
At the end of the sect You can either use an OpenTelemetry library as a module in your code, which you call manually to send it application-specific metrics (instrumentation), or you can use a stock OpenTelemetry instrumentation agent, like the Java Virtual Machine agent, that collects general information about your application as it runs. -You could also collect metrics about your application through another tool and send them to an OpenTelemetry Collector service through OTLP (OpenTelemetry Protocol), a standardized way of exchanging telemetry data between tools. The Collector acts a central point to receive all metrics, potentially do some simple filtering, and forward the metrics to other services. +You could also collect metrics about your application through another tool and send them to an OpenTelemetry Collector service through OTLP (OpenTelemetry Protocol), a standardized way of exchanging telemetry data between tools. The collector acts as a central point to receive all metrics, potentially do some simple filtering, and forward the metrics to other services. OpenTelemetry does not handle storage, data aggregation (counting things), or visualization (dashboards). -For example, you could run the OpenTelemetry Java agent before running your app. The agent would monitor RAM use of your app as a metric and send it to an OpenTelemetry collector, which would forward the metric to another service for saving, counting, and displaying in a dashboard. +For example, you could run the OpenTelemetry Java agent before running your app. The agent would monitor the RAM use of your app as a metric and send it to an OpenTelemetry Collector, which would forward the metric to another service for saving, counting, and displaying in a dashboard. -There is a guide to monitoring FusionAuth with OpenTelemetry [here](/docs/operate/secure-and-monitor/opentelemetry). +Read the [guide to monitoring FusionAuth with OpenTelemetry](/docs/operate/secure-and-monitor/opentelemetry). ### What Is Prometheus? -[Prometheus](https://prometheus.io/docs/introduction/overview) is a database specifically designed to store metrics efficiently, as well as provide data through a custom query language. Prometheus also provides alerts through its [AlertManager](https://github.com/prometheus/alertmanager) and simplistic charts, but for proper dashboards you need Grafana. Prometheus also provides some instrumentation tools to obtain metrics for networks, machines, and some [programming languages](https://prometheus.io/docs/instrumenting/clientlibs/). +[Prometheus](https://prometheus.io/docs/introduction/overview) is a monitoring and alerting toolkit with a built-in time series database designed to efficiently store metrics and provide data through a custom query language. Prometheus offers instrumentation tools to collect metrics from networks, machines, and some [programming languages](https://prometheus.io/docs/instrumenting/clientlibs/). While Prometheus includes basic charting capabilities, more advanced dashboards require integration with Grafana. You can use the Prometheus [Alertmanager](https://github.com/prometheus/alertmanager) to handle alerts. Prometheus can be used for all activities in the monitoring chain, though some at a very basic level, but should [not be used to store logs](https://prometheus.io/docs/introduction/faq/#how-to-feed-logs-into-prometheus). 
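Before you configure a Prometheus scrape job, it can be useful to confirm that FusionAuth's Prometheus-compatible endpoint (covered in the Metrics section above and in the linked tutorial) actually responds. The sketch below assumes the endpoint is served at `/api/prometheus/metrics` and accepts a FusionAuth API key in the `Authorization` header; verify both details against the tutorial for your FusionAuth version.

```python
# Quick sanity check of the Prometheus-compatible metrics endpoint (sketch).
# The /api/prometheus/metrics path and API-key authentication are assumptions;
# confirm them in the FusionAuth Prometheus setup guide before relying on this.
import os

import requests

FUSIONAUTH_URL = os.environ.get("FUSIONAUTH_URL", "http://localhost:9011")
API_KEY = os.environ["FUSIONAUTH_API_KEY"]

response = requests.get(
    f"{FUSIONAUTH_URL}/api/prometheus/metrics",
    headers={"Authorization": API_KEY},
    timeout=10,
)
response.raise_for_status()

# The body is plain-text Prometheus exposition format, one metric or comment per line.
for line in response.text.splitlines()[:20]:
    print(line)
```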
-For example, you could run the Prometheus network monitoring agent, which would monitor the amount of data travelling your network every second as a metric, and save it to the Prometheus database. You could then run an ad hoc query to visualize the network usage in the last hour as a chart with the Prometheus web interface.
+For example, you could run the Prometheus network monitoring agent to monitor the amount of data traveling over your network every second as a metric and save the data to the Prometheus database. You could then run an ad hoc query to visualize the network usage in the last hour as a chart with the Prometheus web interface.

### What Is Grafana?

-[Grafana](https://grafana.com/grafana) is an open-source dashboard app, that can be self-hosted, or cloud-hosted with a fee. Grafana does not aggregate or process data itself, but can call data providers to get your metrics with aggregation, if the provider supports a query language that does aggregation. Grafana can also trigger [alerts](https://grafana.com/docs/grafana/latest/alerting/fundamentals/).
+[Grafana](https://grafana.com/grafana) is an open-source dashboard app that can be self-hosted or cloud-hosted for a fee. Grafana does not aggregate or process data but can retrieve aggregated metrics from data providers that support a query language capable of aggregation. Grafana can also trigger [alerts](https://grafana.com/docs/grafana/latest/alerting/fundamentals/).

For example, you could create a dashboard in Grafana that queries Prometheus every few seconds to get the network usage over the last hour and display it as a chart.

### What Is Grafana Loki?

-Although Grafana does not store data, Grafana [Loki](https://grafana.com/oss/loki/), a separate tool that is part of the Grafana brand, does store data. Loki receives and stores logs. It has a similar purpose to Elasticsearch and Logstash (explained in the next section).
+Although Grafana does not store data, [Grafana Loki](https://grafana.com/oss/loki/), a separate tool that is part of the Grafana brand, does store data. Loki receives and stores logs. It has a similar purpose to Elasticsearch and Logstash (explained in the next section).

-Unlike Logstash, Loki does not do significant processing to logs it receives. So Loki can run faster, but provides less data for indexed searching.
+Unlike Logstash, Loki does not do significant processing of the logs it receives. So Loki can run faster, but provides less data for indexed searching.

For example, you could redirect the output of your app's logs from the Docker container standard output to Loki for storing.

-### What Are Elasticsearch, Logstash, and Kibana (ELK)?
+### What Are Elasticsearch, Logstash, And Kibana (ELK)?

-Elasticsearch is a database to store indexed non-relational data, which provides a query tool to search the data. You can either save logs directly to Elasticsearch or first process and index them by sending the logs to Logstash, which forwards the processed logs to Elasticsearch for saving.
+Elasticsearch is a database that stores and searches indexed non-relational data. You can either save logs directly to Elasticsearch, or first send logs to Logstash, which processes and indexes them and then forwards them to Elasticsearch for saving.

-Both tools are managed by the Elastic company, and are designed to integrate with the Elastic dashboard tool, Kibana.
+Both tools are managed by the Elastic company and are designed to integrate with the Elastic dashboard tool, Kibana.
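To illustrate the direct-to-Elasticsearch route, the sketch below indexes a single FusionAuth event document using Elasticsearch's REST document API. The node URL, index name, and document fields are illustrative assumptions for a local, unsecured development node; in production you would more likely ship data through Elastic Agent, Beats, or Logstash as described above.

```python
# Index one event document into Elasticsearch via its _doc API (sketch).
# The node URL, index name, and document shape are illustrative assumptions.
import requests

ELASTICSEARCH_URL = "http://localhost:9200"  # assumed local, unsecured dev node
INDEX = "fusionauth-events"                  # hypothetical index name

document = {
    "@timestamp": "2024-07-30T14:18:38Z",
    "event_type": "user.login.failed",       # e.g. taken from a FusionAuth webhook payload
    "applicationId": "example-app-id",
    "ipAddress": "203.0.113.10",
}

response = requests.post(f"{ELASTICSEARCH_URL}/{INDEX}/_doc", json=document, timeout=10)
response.raise_for_status()
print(response.json()["result"])  # "created" on success
```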
You could use Elasticsearch similarly to both Prometheus and Loki, in the same way as their examples. ### What Are OpenSearch And OpenSearch Dashboard? -[OpenSearch and OpenSearch Dashboard](https://opensearch.org/faq) are the open source versions of the ElasticSearch and Kibana tools. OpenSearch was forked from Elastic at version 7.10 and is maintained by AWS. +[OpenSearch and OpenSearch Dashboard](https://opensearch.org/faq) are the open-source versions of the ElasticSearch and Kibana tools. OpenSearch was forked from Elastic at version 7.10 and is maintained by AWS. -Note that while not strictly open source for all uses, Elastic products are both free for almost all use and the code is available (gratis and libre). Only reselling Elastic products as a hosted service in competition with Elastic Cloud is disallowed. So until the features of Elastic and OpenSearch diverge significantly, you can treat them as equivalents for your monitoring purposes. +Note that while not strictly open source for all uses, both Elastic products are free for most uses and the code is available (gratis and libre). Only reselling Elastic products as a hosted service in competition with Elastic Cloud is disallowed. So until the features of Elastic and OpenSearch diverge significantly, you can treat them as equivalents for your monitoring purposes. ### What Are Jaeger, Zipkin, and Grafana Tempo? -[Jaeger](https://www.jaegertracing.io/docs/1.59/) is an open source monitoring tool made by [Uber](https://www.uber.com/en-IE/blog/distributed-tracing) to understand user requests across distributed microservices. You don't need to use Jaeger with FusionAuth since FusionAuth is a single service. +[Jaeger](https://www.jaegertracing.io/docs/1.59/) is an open-source monitoring tool made by [Uber](https://www.uber.com/en-IE/blog/distributed-tracing) to understand user requests across distributed microservices. You don't need to use Jaeger with FusionAuth since FusionAuth is a single service. -[Zipkin](https://zipkin.io/) is similar, except made by Twitter, and it's older than Jaeger. [Grafana Tempo](https://grafana.com/oss/tempo/) is also similar, but it's the most modern of the three distributed tracing options. +[Zipkin](https://zipkin.io/), made by Twitter, is similar to Jaeger but was developed earlier. [Grafana Tempo](https://grafana.com/oss/tempo/) is another similar tool, and is the most modern of the three distributed tracing options. ### What Tools Are Available To Send Alerts To Administrators? -Some of the monitoring components discussed in the previous sections are able to send system administrators alerts when their apps fail. There are three general places an alert can be sent to: -- Email: for which you will probably need you to pay a subscription to a mail sender service, unless your Internet provider allows you access to the SMTP protocol. -- Computer chat apps: of which the most popular are Slack and Discord. They can be used to receive alert messages in a specific chat channel for free. Slack and Discord clients can be installed on your phone to be sure you never miss an alert. -- Phone chat apps: of which the most popular are WhatsApp and Threema. While WhatsApp requires a phone number to sign up, Threema needs no phone number, but requires you to buy the mobile app for a small one-time fee. Both services charge a fee for businesses to send messages to users. +Some of the monitoring components discussed in the previous sections can alert system administrators when applications fail. 
Alerts can be sent to: -Grafana also has a tool called OnCall, which consolidates and redirects multiple alerts from multiple systems. If you have support staff that monitor an alert system in shifts, rather than wait to receive an email, this might be useful to you. +- Email: You will probably need to subscribe to a mail sender service to receive email alerts, unless your internet provider allows you access to the SMTP protocol. +- Computer chat apps: Computer chat apps like Slack and Discord can receive alert messages in a specific channel for free. Slack and Discord clients can be installed on your phone to be sure you never miss an alert. +- Phone chat apps: For example, WhatsApp and Threema. WhatsApp requires a phone number to sign up. Threema does not need a phone number but requires you to buy the mobile app for a small one-time fee. Both services charge a fee for businesses to send messages to users. + +Grafana's OnCall tool consolidates and redirects multiple alerts from various systems. This might be a good option for you if you have support staff that monitor an alert system in shifts rather than wait to receive an email. ### What Can You Do With A Custom Script? As a simple and free alternative to all the tools above, you could write a small web service in JavaScript, Go, or Python to continuously call the FusionAuth website or API, and if it fails, send an alert to the administrator. -If you want to monitor any FusionAuth metrics directly, instead of monitoring only the FusionAuth container, you will need to write a custom script to call the FusionAuth API anyway. +If you want to monitor any FusionAuth metrics directly instead of only the FusionAuth container, you will need to write a custom script to call the FusionAuth API anyway. ### Monitoring Tools Compared @@ -182,12 +185,12 @@ Remember that Prometheus, Loki, and Grafana all work together, in the same way a | OpenTelemetry | ✅ | ✅ | ⛔ | ⛔ | ⛔ | ⛔ | | Elastic | ✅ Elastic Agent or Beats | ✅ Logstash | ✅ Elasticsearch | ✅ Kibana | ✅ Kibana | ⛔ | | OpenSearch | ✅ | ✅ | ✅ | ✅ OpenSearch Dashboard | ✅ OpenSearch Dashboard | ⛔ | -| Prometheus | ✅ | ✅ | ✅ | ✔️ | ✅ AlertManager | ⛔ | +| Prometheus | ✅ | ✅ | ✅ | ✅ | ✅ AlertManager | ⛔ | | Grafana | ⛔ | ⛔ | ⛔ | ✅ | ✅ OnCall | ⛔ | | Loki (logs only) | ⛔ | ✅ | ✅ | ⛔ | ⛔ | ⛔ | | Email, Slack, Discord, WhatsApp, Threema | ⛔ | ⛔ | ⛔ | ⛔ | ⛔ | ✅ | -There are other paid hosted services not discussed in this article, like Splunk (integration guide with FusionAuth [here](/docs/operate/secure-and-monitor/splunk)) and DataDog, that have their own tools (that usually implement the OpenTelemetry Protocol). +There are other paid and hosted monitoring services that aren't discussed in this article, like Splunk (integration guide with FusionAuth [here](/docs/operate/secure-and-monitor/splunk)) and DataDog, which have their own tools and typically implement the OpenTelemetry Protocol. ## Which Tools Should You Choose? @@ -196,7 +199,7 @@ Which combination of monitoring tools you choose is based on: - How much time you want to spend building and maintaining a monitoring system. Less time means you'll have less information about your system, and possibly spend more on specialized monitoring cloud services that handle maintenance for you. - How much money you want to spend. If you want to pay as little as possible, there are free tools available for all aspects of monitoring, but you will have to install them on your own server and monitor them too. 
-If you already use a paid cloud monitoring service, they probably have all the components you need to monitor FusionAuth. Continue using that. +If you already use a paid cloud monitoring service, it probably has all the components you need to monitor FusionAuth. Continue using that. ### A Sample Of Monitoring Tool Choices @@ -204,20 +207,21 @@ Below are some examples of how you could monitor FusionAuth. The table gives an | System | Coverage | Cost | Notes | |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| A custom web service that either checks the FusionAuth container is running, or calls the FusionAuth API to check there are no errors. If either condition is not met, the service alerts the administrator by posting a message to Slack or Discord. | You can make the service check as many or few metrics as you want. | Free | You need to write custom code. You won't have a dashboard or a store of historical performance data and logs. | -| Install the ElasticSearch/OpenSearch stack in Docker on your server. Monitor your Docker container cluster with ElasticAgent. | Low. You know only if your container is running or not. | Free | Setting up the Elastic stack yourself is initially complex and time-consuming. You will automatically have dashboards available in Kibana. | -| As above, the Elastic stack, but also save logs to Elasticsearch, and write a custom web service to upload FusionAuth metrics via the Elastic API | Very high. You will know every detail about your instance and can search stored data. | Free | A large amount of work needed. | +| A custom web service that either checks the FusionAuth container is running or calls the FusionAuth API to check there are no errors. If either condition is not met, the service alerts the administrator by posting a message to Slack or Discord. | You can make the service check as many or as few metrics as you want. | Free | You need to write custom code. You won't have a dashboard or a store of historical performance data and logs. | +| Install the ElasticSearch or OpenSearch stack in Docker on your server. Monitor your Docker container cluster with ElasticAgent. | Low. You know only if your container is running or not. | Free | Setting up the Elastic stack yourself is initially complex and time-consuming. You will automatically have dashboards available in Kibana. | +| As above, the Elastic stack, but also save logs to Elasticsearch, and write a custom web service to upload FusionAuth metrics via the Elastic API | Very high. You will know every detail about your instance and can search stored data. | Free | Substantial work needed. | | As above, the Elastic stack with custom code, but use Elastic Cloud as your paid monitoring host. | Very high | High | Much less work needed than self-hosting, and no need to worry about a failing monitoring service you need to maintain. 
| -| Run Prometheus in a container and point it to the stock FusionAuth endpoints for Prometheus. | Medium | Free | Simple to set up. Will provide you with a simple dashboard, metric storage, and alerts, all provide by Prometheus alone. | -| As above, Prometheus with FusionAuth endpoints, but save logs with Grafana Loki, and create comprehensive permanent dashboards with Grafana. | High | Free | More complex to set up and maintain, but will provide very detailed information with high quality usability for administrators. | -| A custom web service to extract FusionAuth metrics, plus the tools of a paid monitoring service like DataDog, Splunk, etc... | High | High | You will need to write a custom web service here, because no monitoring service integrates naturally with FusionAuth. The service's existing monitoring components should cover everything else you need: container monitoring, log indexing. If you are a large firm that already uses a paid service for your other software, this is the easiest choice. | +| Run Prometheus in a container and point it to the stock FusionAuth endpoints for Prometheus. | Medium | Free | Simple to set up. Will provide you with a simple dashboard, metric storage, and alerts, all provided by Prometheus alone. | +| As above, Prometheus with FusionAuth endpoints, but save logs with Grafana Loki, and create comprehensive permanent dashboards with Grafana. | High | Free | More complex to set up and maintain, but will provide very detailed information with high-quality usability for administrators. | +| A custom web service to extract FusionAuth metrics, plus the tools of a paid monitoring service like DataDog or Splunk. | High | High | You will need to write a custom web service as no monitoring service integrates naturally with FusionAuth. The service's existing monitoring components should cover everything else you need, such as container monitoring and log indexing. If you are a large firm that already uses a paid service for your other software, this is the easiest choice. | + +In any of these systems, you can configure a trigger to send alerts to your preferred chat app using its API, for example, to a Slack channel using the Slack API. Most paid services, like Elastic Cloud, will have stock Slack integrations that won't need you to code an API call yourself. -Alerting can be done by configuring a trigger in any of the systems above, that sends a message to your chat app of preference via its API (e.g., Slack API). Most paid services, like Elastic Cloud, will have stock Slack integrations that don't require you to code an API call yourself. +### Monitoring Monitors -### Monitoring monitors +Remember that if you're using a self-hosted monitoring system, you won't be alerted if the system dies — nothing monitors the monitor. To overcome this you will need to: -Remember for all the self-hosted monitoring systems, if the system dies you won't be alerted — nothing monitors the monitor. To overcome this you will need to: -- Pay an external monitoring service instead of self-hosting. +- Pay for an external monitoring service instead of self-hosting. - Use a simple heartbeat monitoring service that checks only that your hosted monitoring service is contactable. - Write a service to monitor the monitor on a separate self-hosted server.
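The heartbeat service mentioned in the last two options can be very small. Below is a hedged sketch in Python: it assumes FusionAuth's status endpoint at `/api/status` (any URL that returns HTTP 200 when healthy will do) and a Slack incoming webhook for the alert. Run it on a separate host from the system it watches, otherwise it shares the same blind spot.

```python
# Minimal heartbeat checker (sketch): poll an HTTP endpoint and post to a Slack
# incoming webhook when it stops answering. Point TARGET_URL at FusionAuth itself,
# or at your self-hosted monitoring stack to "monitor the monitor".
# The /api/status path is an assumption; any URL that returns 200 when healthy works.
import os
import time

import requests

TARGET_URL = os.environ.get("TARGET_URL", "http://localhost:9011/api/status")
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # create one in your Slack workspace
CHECK_SECONDS = 60


def healthy(url: str) -> bool:
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    while True:
        if not healthy(TARGET_URL):
            requests.post(
                SLACK_WEBHOOK_URL,
                json={"text": f"Heartbeat check failed for {TARGET_URL}"},
                timeout=5,
            )
        time.sleep(CHECK_SECONDS)
```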