# ADR #010: Incentivized Testnet Monitoring

## Changelog

- 2022-7-19: Started
- 2022-7-22: Add section on "How can node operators monitor their node?"

## Context

We're adding telemetry to celestia-node by instrumenting our codebase with metrics (see [ADR-009-telemetry](./adr-009-telemetry.md)). If the option to report metrics is enabled on celestia-node, then celestia-node will push metrics via the [OTLP Exporter](https://opentelemetry.io/docs/reference/specification/protocol/exporter/) to an [OTEL Collector](https://opentelemetry.io/docs/collector/) instance. The Celestia team will manage an OTEL Collector instance during the incentivized testnet (and likely beyond) to collect metrics from nodes on the network. Celestia-node operators also have the option of running their own OTEL Collector instance to collect metrics from their nodes.

We would like to make the metrics collected in the Celestia-team-managed OTEL Collector actionable by making them queryable in internal Grafana dashboards. We additionally want a subset of metrics to be queryable by a public incentivized testnet leaderboard frontend.

We would also like to make it possible for node operators to monitor their own nodes by deploying their own OTEL Collector instance.

This document proposes a strategy for making data in the Celestia-team-managed OTEL Collector available for use in internal Grafana dashboards and a public leaderboard. Additionally, it describes how a node operator can monitor their own node by deploying their own OTEL Collector instance.

A diagram of the proposed monitoring architecture is available (and open for comment) here: <https://lucid.app/lucidchart/d957570f-9c06-4a82-9843-00d8232f734a/edit?invitationId=inv_247247d6-3a67-40b1-8513-1818caadd627>.

### Where to export data to?

Grafana can query data from [multiple data sources](https://grafana.com/docs/grafana/latest/datasources/#supported-data-sources). This document explores two of these data sources:

1. [Prometheus](https://github.com/prometheus/prometheus) is an open-source time series database written in Go. Prometheus uses the [PromQL](https://prometheus.io/docs/prometheus/latest/querying/basics/) query language. We can deploy Prometheus ourselves or use a hosted Prometheus provider (e.g. [Google](https://cloud.google.com/stackdriver/docs/managed-prometheus), [AWS](https://aws.amazon.com/prometheus/), [Grafana](https://grafana.com/go/hosted-prometheus-monitoring/), etc.). Prometheus is pull-based, which means services that would like to expose Prometheus metrics must provide an HTTP endpoint (e.g. `/metrics`) that a Prometheus instance can poll (see [instrumenting a Go application for Prometheus](https://prometheus.io/docs/guides/go-application/)). Prometheus is used by [Cosmos SDK telemetry](https://docs.cosmos.network/main/core/telemetry.html) and [Tendermint telemetry](https://docs.tendermint.com/v0.35/nodes/metrics.html), so one major benefit of using Prometheus is that metrics emitted by celestia-core, celestia-app, and celestia-node can share the same database.
1. [InfluxDB](https://github.com/influxdata/influxdb) is another open-source time series database written in Go. InfluxDB is free to deploy, but there is a commercial offering from [influxdata](https://www.influxdata.com/get-influxdb/) that provides clustering and on-prem deployments. InfluxDB uses the [InfluxQL](https://docs.influxdata.com/influxdb/v1.8/query_language/) query language, which appears less capable for advanced queries ([ref](https://www.robustperception.io/translating-between-monitoring-languages/)). InfluxDB is push-based, which means services can push metrics directly to an InfluxDB instance ([ref](https://logz.io/blog/prometheus-influxdb/#:~:text=InfluxDB%20is%20a%20push%2Dbased,and%20Prometheus%20fetches%20them%20periodically.)). See [Prometheus vs. InfluxDB](https://prometheus.io/docs/introduction/comparison/#prometheus-vs-influxdb) for a more detailed comparison.

If alternative data sources should be evaluated, please share them with us.

### How to export data out of OTEL Collector?

[Exporters](https://opentelemetry.io/docs/collector/configuration/#exporters) provide a way to export data from an OTEL Collector to a supported destination.

We can configure OTEL Collector to export data to Prometheus like this:

```yaml
exporters:
  # Data sources: metrics
  prometheus:
    endpoint: "prometheus:8889"
    namespace: "default"
```

We must additionally enable this exporter via configuration like this:

```yaml
service:
  pipelines:
    metrics:
      exporters: [prometheus]
```
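
Because Prometheus is pull-based, a Prometheus server must also be configured to scrape the endpoint that the `prometheus` exporter above exposes. Here is a minimal sketch of that scrape config (the `otel-collector` target hostname and the scrape interval are assumptions, not values defined in this ADR):

```yaml
# prometheus.yml (sketch)
scrape_configs:
  - job_name: "otel-collector"
    scrape_interval: 15s
    static_configs:
      # host:port where the Collector's prometheus exporter listens (8889 above)
      - targets: ["otel-collector:8889"]
```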

OTEL Collector support for exporting to InfluxDB is still in [beta](https://github.com/open-telemetry/opentelemetry-collector#beta=). See [InfluxDB Exporter](https://pkg.go.dev/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/influxdbexporter#section-readme).
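
Should we evaluate InfluxDB, a minimal sketch of that exporter configuration follows. The field names are taken from the InfluxDB Exporter README linked above, and every value is a placeholder; double-check the README before relying on this:

```yaml
exporters:
  influxdb:
    endpoint: http://localhost:8086   # placeholder InfluxDB address
    org: my-org                       # placeholder organization
    bucket: my-bucket                 # placeholder bucket
    token: my-token                   # placeholder auth token
```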

### How to query data in Prometheus from Grafana?

In order to query Prometheus data from Grafana, we must add a Prometheus data source. The steps are outlined [here](https://prometheus.io/docs/visualization/grafana/#creating-a-prometheus-data-source).
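
For a self-hosted Grafana, the data source can also be added via Grafana's provisioning mechanism instead of the UI. A sketch (the file path and the Prometheus URL are assumptions for a local setup):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090   # assumed Prometheus address
    isDefault: true
```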

### How to query data in Prometheus from the incentivized testnet leaderboard?

The Prometheus server exposes an HTTP API for querying metrics (see [docs](https://prometheus.io/docs/prometheus/latest/querying/api/#querying-exemplars)).

Implementation details for the incentivized testnet leaderboard are not yet known (it will likely be built by an external vendor). Two possible implementations are:

1. If the incentivized testnet leaderboard has a dedicated backend, it can query the HTTP API above.
1. If the incentivized testnet leaderboard has **no** dedicated backend and the frontend queries Prometheus directly, then the TypeScript library [prometheus-query-js](https://github.com/samber/prometheus-query-js) may be helpful.

### How can a node operator monitor their own node?

Node operators have the option of running their own instance of OTEL Collector to collect metrics from their nodes. Rough steps:

1. [Install celestia-node](https://docs.celestia.org/developers/celestia-node).
1. Start a Grafana instance. If you'd like to use a cloud-hosted Grafana, sign up for an account on <https://grafana.com/>.
1. [Install OTEL Collector](https://opentelemetry.io/docs/collector/getting-started/). If on a Linux machine, follow [these steps](https://opentelemetry.io/docs/collector/getting-started/#linux-packaging=). OTEL Collector should start automatically immediately after installation.
1. Configure OTEL Collector to receive metrics from celestia-node by confirming your `/etc/otelcol/config.yaml` has the default config:

    ```yaml
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    ```

    This starts the [OTLP receiver](https://github.com/open-telemetry/opentelemetry-collector/blob/main/receiver/otlpreceiver/README.md) on port 4317 for gRPC and 4318 for HTTP. Celestia-node will by default emit HTTP metrics to `localhost:4318`, so if you deployed OTEL Collector on the same machine as celestia-node, you can preserve the default config.
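
    If the Collector runs on a different machine than celestia-node, the receiver has to listen on an externally reachable interface, and celestia-node has to be configured to emit metrics to that address instead of `localhost:4318`. A sketch of the receiver override (the `endpoint` values are assumptions, not values from this ADR):

    ```yaml
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"   # listen on all interfaces (assumed override)
          http:
            endpoint: "0.0.0.0:4318"
    ```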

1. Configure OTEL Collector to send metrics to Prometheus. If you are using cloud-hosted Grafana, add something like the following to your `/etc/otelcol/config.yaml` (replace `USER:PASSWORD` with your own remote-write credentials):

    ```yaml
    exporters:
      prometheusremotewrite:
        endpoint: https://USER:PASSWORD@prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
    ```

1. Configure OTEL Collector to enable the `otlp` receiver and the `prometheusremotewrite` exporter. In `/etc/otelcol/config.yaml`:

    ```yaml
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [prometheusremotewrite]
    ```

    See [this article](https://grafana.com/blog/2022/05/10/how-to-collect-prometheus-metrics-with-the-opentelemetry-collector-and-grafana/) for more details. You may need to specify port 443 in the endpoint, like this: `endpoint: "https://USER:PASSWORD@prometheus-prod-01-eu-west-0.grafana.net:443/api/prom/push"`.

1. Restart OTEL Collector with `sudo systemctl restart otelcol`.
1. Confirm that OTEL Collector restarted correctly with `systemctl status otelcol.service` and by checking for errors in `journalctl | grep otelcol | grep Error`.
1. Start celestia-node.
1. Verify that metrics are being displayed in Grafana.
1. [Optional] Import an [OpenTelemetry Collector dashboard](https://grafana.com/grafana/dashboards/12553-opentelemetry-collector/) into Grafana to monitor your OTEL Collector.
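
For reference, the snippets from steps 4 to 6 above combine into a single `/etc/otelcol/config.yaml` along these lines (a sketch; the `USER:PASSWORD` endpoint is a placeholder for your own credentials):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheusremotewrite:
    endpoint: https://USER:PASSWORD@prometheus-prod-01-eu-west-0.grafana.net/api/prom/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```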

### Should we host a Prometheus instance ourselves or use a hosted provider?

We already host a Prometheus instance on DigitalOcean (host mamaki-prometheus). This seems like a good option during development. Grafana also offers hosted Prometheus; see [Grafana pricing](https://grafana.com/pricing/) for details.

### Should we host a Grafana instance ourselves or use a hosted provider?

We already host a Grafana instance on DigitalOcean (host mamaki-prometheus). This seems like a good option during development. Grafana offers a hosted service; see [Grafana pricing](https://grafana.com/pricing/) for details.

### Should we host separate Prometheus instances per use case, i.e. one for internal dashboards and one for the public leaderboard?

The Prometheus docs state the following with regard to [denial of service](https://prometheus.io/docs/operating/security/#denial-of-service):

> There are some mitigations in place for excess load or expensive queries. However, if too many or too expensive queries/metrics are provided components will fall over. It is more likely that a component will be accidentally taken out by a trusted user than by malicious action.

So if we are concerned about the public leaderboard crashing the Prometheus instance that we use for internal dashboards, we may want to host two separate instances. This seems feasible by configuring OTEL Collector to export to two different Prometheus instances (see the sketch below). This is not a one-way-door decision, so I suggest sticking with one instance for now; if we observe scenarios where the Prometheus instance falls over, we can explore a hosted option or running separate instances per use case.
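
A sketch of how that could look in the Collector config, using the Collector's `type/name` convention to define two `prometheusremotewrite` exporters (both endpoints are hypothetical):

```yaml
exporters:
  prometheusremotewrite/internal:
    endpoint: https://internal-prometheus.example.com/api/v1/write
  prometheusremotewrite/leaderboard:
    endpoint: https://leaderboard-prometheus.example.com/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite/internal, prometheusremotewrite/leaderboard]
```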

## Status

Proposed

## References

- <https://github.com/celestiaorg/celestia-node/pull/901>
- <https://github.com/celestiaorg/celestia-node/pull/907>
- <https://celestia-team.slack.com/archives/C03QAJVLHK3/p1658169362548589>
- <https://www.notion.so/celestiaorg/Telemetry-Dashboard-d85550a3caee4004b00a2e3bf82619b1>
- <https://opentelemetry.io/docs/collector/>