docs(adr-009): cover metrics and more info on tracing
Wondertan committed Aug 10, 2022
1 parent 8c335dd commit de5f85d
Showing 1 changed file with 190 additions and 15 deletions: docs/adr/adr-009-telemetry.md
* 2022-07-15: Stylistic improvements from @rootulp and @bidon15
* 2022-07-29: Formatting fixes
* 2022-07-29: Clarify and add more info regarding Uptrace
* 2022-08-09: Cover metrics and add more info about tracing

## Authors

This ADR is intended to outline the decisions on how to proceed with:
* Integration plan according to the priorities and the requirements
* What observability tools/dependencies to integrate
* Integration design into Celestia-Node for each observability option
* A reference document explaining "whats" and "hows" during integration in some part of the codebase
* A primer for any developer in celestia-node to quickly onboard into Telemetry

## Decisions

### Plan

#### First Priority

The first priority lies in the "ShrEx" stack analysis results for the Celestia project. The outcome will tell us whether
our current [Full Node reconstruction](https://github.com/celestiaorg/celestia-node/issues/602) qualities conform to
the main network requirements, subsequently affecting the development roadmap of celestia-node before the main
network launch. Based on the former, the plan is focused on unblocking the reconstruction
so the decision for the celestia-node team is to cover with traces only the _necessary_
code as the initial response to the ADR, leaving the rest to be integrated in the background by the devs in the team
once they are free, as well as for efficient bootstrapping into the code for new devs.

___Update:___ The `ShrEx` analysis is neither the blocker nor the highest priority at the moment of writing.

#### Second Priority

The next biggest priority - the incentivized testnet - can be largely covered with traces as well. All participants will submit
traces from their nodes to any backend endpoint we provide during the whole network lifespan. Later on, we will be
able to verify the data of each participant by querying historical traces. This is a feature that some backend solutions
provide, which we can also use to extract valuable insight into how the network performs in a macro view.

Even though the incentivized testnet goal can be largely covered by traces in terms of observability, metrics for this
priority are desirable, as metrics provide:

* Easily queryable time-series data
* Extensive tooling to build visualizations for that data

Both can facilitate the implementation of a global network observability dashboard and participant validation for the goal.

#### Third Priority

Enabling total observability of the node through metrics and traces.

### Tooling/Dependencies

#### Telemetry Golang API/Shim

The decision is to use [opentelemetry-go](https://github.com/open-telemetry/opentelemetry-go) for both Metrics and Tracing:

* A minimal and Go-savvy API/shim which gathers years of experience from OpenCensus/OpenMetrics and the [CNCF](https://www.cncf.io/)
* Backends/exporters for all the existing time-series monitoring DBs, e.g. Prometheus and InfluxDB, as well as tracing backends
like Jaeger, Uptrace, etc.
* <https://github.com/uptrace/opentelemetry-go-extra/tree/main/otelzap> integration with the logging engine we use - Zap
* Provides first-class support/implementation for/of the generic [OTLP](https://opentelemetry.io/docs/reference/specification/protocol/) (OpenTelemetry Protocol)
  * A generic format for any telemetry data
  * Allows integrating otel-go once and using it with any known backend, either
    * Supporting OTLP natively
    * Or through the [OTel Collector](https://opentelemetry.io/docs/collector/)
* Allows exporting telemetry to one endpoint only ([opentelemetry-go#3055](https://github.com/open-telemetry/opentelemetry-go/issues/3055))

The discussion over this decision can be found in [celestia-node#663](https://github.com/celestiaorg/celestia-node/issues/663)
and props to @liamsi for initial kickoff and a deep dive into OpenTelemetry.
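
To make the API choice more concrete, here is a minimal, illustrative sketch (not taken from the codebase; the package, span, and attribute names are assumptions) of what instrumenting code with otel-go looks like:

```go
package sample

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// tracer is a package-level tracer, named after the package it instruments.
var tracer = otel.Tracer("share/availability")

// SampleSquare wraps a hypothetical operation in a span and attaches
// contextual data to it as attributes.
func SampleSquare(ctx context.Context, width int) error {
	ctx, span := tracer.Start(ctx, "sample-square")
	defer span.End()

	span.SetAttributes(attribute.Int("width", width))

	// the actual sampling logic would go here, propagating ctx further
	_ = ctx
	return nil
}
```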

#### Tracing Backends

For tracing, there are 4 modern OSS tools that are recommended. All of them have bidirectional support with OpenTelemetry:

* [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
  * Tracing data proxy from an OTLP client to __any__ backend
  * Supports a [long list of backends](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter)
* [Uptrace](https://get.uptrace.dev/guide/#what-is-uptrace)
  * The most recent (~1 year)
  * Tied to ClickHouse DB
  * The most lightweight
  * Supports OTLP
  * OSS and can be deployed locally
  * Provides a hosted solution
* [Jaeger](https://www.jaegertracing.io/)
  * The most mature
  * Started by Uber, now supported by CNCF
  * Supports multiple storages (ScyllaDB, InfluxDB, Amazon DynamoDB)
  * Supports OTLP
* [Grafana Tempo](https://grafana.com/oss/tempo/)
  * Deep integration with Grafana/Prometheus
  * Relatively new (~2 years)
Each of these backends can be used independently, depending on the use case.

> I am personally planning to set up the lightweight Uptrace for the local light node. Just to play around and observe
> things
>
> __UPDATE__: It turns out it is not that straightforward and adds additional overhead. See #Other-Findings

There is no strict decision on which of these backends to use and where. People taking ownership of any of the listed vectors
are free to use any recommended solution or any unlisted one. The only backend requirement is support of OTLP, natively
or through the OTel Collector. The latter, though, introduces an additional infrastructure piece which adds unnecessary complexity
for node runners, and is thus not recommended.

#### Metrics Backend

We only consider OSS backends.

* [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
  * Push based
  * Metrics data proxy from an OTLP client to __any__ backend
  * Supports a [long list of backends](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter)
* [Netdata](https://github.com/netdata/netdata)
  * Push based
  * Widely supported option in the Linux community
  * Written in C
  * A decade of experience, optimized for bare metal
  * Perfect for local monitoring setups
  * Unfortunately, does not support OTLP
* [Uptrace](https://get.uptrace.dev/guide/#what-is-uptrace)
  * The most recent (~1 year)
  * Tied to ClickHouse DB
  * The most lightweight
  * Supports OTLP
  * OSS and can be deployed locally
  * Provides a hosted solution
* Prometheus + Grafana
  * Pull based
  * No native OTLP support
    * Though there is a [spec](https://github.com/open-telemetry/wg-prometheus/blob/main/specification.md) to fix this
    * Still, it can be used with the OTel Collector

Similarly, there is no strictness around the backend solution, with OTLP support being the only requirement - natively or through an OTLP exporter.

## Design

Here is the result of the above code sending traces visualized on Jaeger UI

#### Backends connection

Example for Jaeger

```go
// Create the Jaeger exporter
exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(url)))
if err != nil {
	return nil, err
}
// then the tracer provider
tp := tracesdk.NewTracerProvider(
	// Always be sure to batch in production.
	// ...
)

// ...
tp.Shutdown(ctx)
```

We decided to use the OTLP backend, and it is almost identical in terms of setup.
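
For reference, a minimal sketch of the analogous OTLP setup (not taken from the codebase; the endpoint and helper name are illustrative, and `WithInsecure` is only for local experiments without TLS):

```go
import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
)

// newOTLPTracerProvider mirrors the Jaeger example above, but exports spans
// over OTLP/HTTP to the given endpoint.
func newOTLPTracerProvider(ctx context.Context, endpoint string) (*tracesdk.TracerProvider, error) {
	exp, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint(endpoint),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	// The provider itself is constructed exactly as in the Jaeger example.
	return tracesdk.NewTracerProvider(
		tracesdk.WithBatcher(exp),
	), nil
}
```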

### Metrics Design

Metrics allow collecting time-series data from different measurable points in the application. Every measurable can
be covered via the 6 instruments OpenTelemetry provides (a short usage sketch follows the list below):

* ___Counter___ - synchronous instrument that measures additive non-decreasing values.
* ___UpDownCounter___ - synchronous instrument which measures additive values that increase or decrease with time.
* ___Histogram___ - synchronous instrument that produces a histogram from recorded values.
* ___CounterObserver___ - asynchronous instrument that measures additive non-decreasing values.
* ___UpDownCounterObserver___ - asynchronous instrument that measures additive values that can increase or decrease with time.
* ___GaugeObserver___ - asynchronous instrument that measures non-additive values for which a sum does not produce a meaningful or correct result.
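
As a quick illustration, a synchronous ___Counter___ could be used as follows. This is a hypothetical sketch reusing the same otel-go metric API version as the example below (`meter`, `instrument`, `unit`, and `attribute` refer to the same packages and variables as there); the instrument and attribute names are assumptions:

```go
// requestsC counts served requests; created once during setup.
requestsC, err := meter.SyncInt64().Counter(
	"requests",
	instrument.WithUnit(unit.Dimensionless),
	instrument.WithDescription("Number of served requests"),
)
if err != nil {
	panic(err)
}

// Synchronous instruments are recorded inline, right where the event happens.
requestsC.Add(ctx, 1, attribute.String("method", "GetShare"))
```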

#### Integration Example

Consider that we want to report the current network height as a metric.

First of all, a global pkg-level meter has to be defined in the code related to the desired metric. In our case, it is the `header` pkg.

```go
var meter = global.MeterProvider().Meter("header")
```

Next, we should understand which instrument to use. At first glance, a ___Counter___ instrument should fit for chain height,
as it is a non-decreasing value, and then we should decide whether we need a sync or async version of it. For our case,
both would work, and it is more a question of the precision we want: sync metering would report every height change, while
the async version would poke the `header` pkg API periodically to get the metered data. For our example, we will go with the latter.

```go
// MonitorHead enables Otel metrics to monitor head.
func MonitorHead(store Store) {
	headC, _ := meter.AsyncInt64().Counter(
		"head",
		instrument.WithUnit(unit.Dimensionless),
		instrument.WithDescription("Subjective head of the node"),
	)

	err := meter.RegisterCallback(
		[]instrument.Asynchronous{
			headC,
		},
		func(ctx context.Context) {
			head, err := store.Head(ctx)
			if err != nil {
				headC.Observe(ctx, 0, attribute.String("err", err.Error()))
				return
			}

			headC.Observe(
				ctx,
				head.Height,
				attribute.Int("square_size", len(head.DAH.RowsRoots)),
			)
		},
	)
	if err != nil {
		panic(err)
	}
}
```

The example follows a purely API-based approach without the need to integrate the metric deeper into the implementation
internals, which is nice and keeps metering decoupled from business logic. The `MonitorHead` func simply accepts the `Store`
interface and reads the information about the latest subjective header via `Head` on the node.

The API-based approach should be followed for any info-level metric. Even if there is no API to get the required metric,
such an API should be introduced. However, this approach is not always possible: sometimes deeper integration with the code
logic is necessary to analyze performance, or there are security and/or encapsulation considerations.

In the example, we can also see how any additional data can be added to the instruments via attributes or labels. It is
important to add to the metrics only absolutely necessary data (more on that in the Other Findings section below) or data which is
common over multiple time-series. In this case, we attach the `square_size` of the height to know the block size at that
height. This allows us to query reported heights with some square size using the backend UI. Note that only
powers of two (with 256 being the current limit) are possible as unique values for this attribute, so it won't put pressure on the
metrics backend.

For Go code examples of other metric instruments, consult the [Uptrace OTel docs](https://uptrace.dev/opentelemetry/go-metrics.html#getting-started).

#### Backends Connection

Example for OTLP extracted from our code

```go
opts := []otlpmetrichttp.Option{
	otlpmetrichttp.WithCompression(otlpmetrichttp.GzipCompression),
	otlpmetrichttp.WithEndpoint(cmd.Flag(metricsEndpointFlag).Value.String()),
}
if ok, err := cmd.Flags().GetBool(metricsTlS); err != nil {
	panic(err)
} else if !ok {
	opts = append(opts, otlpmetrichttp.WithInsecure())
}

exp, err := otlpmetrichttp.New(cmd.Context(), opts...)
if err != nil {
	return err
}

pusher := controller.New(
	processor.NewFactory(
		selector.NewWithHistogramDistribution(),
		exp,
	),
	controller.WithExporter(exp),
	controller.WithCollectPeriod(2*time.Second),
	controller.WithResource(resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String(fmt.Sprintf("Celestia-%s", env.NodeType.String())),
		// Here we can add more attributes with Node information
	)),
)
```
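
What typically follows such a setup - sketched here under the assumption that the same otel metric SDK (the basic `controller`) is used, so the exact calls may differ from the actual codebase - is starting the controller and registering it as the global meter provider, so that package-level meters like the `header` one above begin reporting:

```go
// Register the controller as the global MeterProvider so that package-level
// meters (e.g. global.MeterProvider().Meter("header")) report through it.
global.SetMeterProvider(pusher)

// Start collecting and pushing metrics with the configured period.
if err := pusher.Start(cmd.Context()); err != nil {
	return err
}

// On shutdown, the controller should be stopped to flush remaining metrics:
// defer pusher.Stop(cmd.Context())
```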

## Considerations

* Tracing performance
  * _Every_ traced method calling two more functions and making a network request can affect overall performance
* Metrics backend performance
  * Mainly, we should avoid sending too much data to the metrics backend through labels, e.g. hash, uuid, etc.,
    so as not to overload it. Metrics are only for metrics and not for indexing.
* Security and exported data protection
  * OTLP provides TLS support (a sketch follows this list)
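
For example, enabling TLS on the OTLP metrics exporter could look like the following sketch, reusing the `opts` slice from the metrics backend connection example above. The certificate path is purely illustrative, and the snippet additionally requires the `crypto/tls`, `crypto/x509`, and `os` imports:

```go
// Load the CA certificate the backend's TLS certificate is signed with.
caCert, err := os.ReadFile("/etc/celestia/otel-ca.pem")
if err != nil {
	return err
}
pool := x509.NewCertPool()
pool.AppendCertsFromPEM(caCert)

// Configure the exporter with TLS instead of appending WithInsecure.
opts = append(opts, otlpmetrichttp.WithTLSClientConfig(&tls.Config{
	RootCAs: pool,
}))
```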

## Other Findings

### Labels and High Cardinality

The high cardinality (many different label values) issue should always be kept in mind while introducing new metrics
and labels for them. Each metric should be attached with only absolutely necessary labels, staying away from
sending __unique__ label values each time, e.g. hash, uuid, etc. Doing the opposite can dramatically increase the
amount of data stored. See <https://prometheus.io/docs/practices/naming/#labels>.
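
To illustrate with the `MonitorHead` example above (the `head.Hash()` call is a hypothetical counter-example, not something the codebase does):

```go
// Fine: bounded cardinality - square_size only takes powers of two up to 256.
headC.Observe(ctx, head.Height, attribute.Int("square_size", len(head.DAH.RowsRoots)))

// Avoid: unbounded cardinality - every height produces a brand-new label value,
// multiplying the number of stored time-series.
headC.Observe(ctx, head.Height, attribute.String("hash", head.Hash().String()))
```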

### Tracing and Logging

As you will see in the examples below, tracing looks similar to logging and has almost the same semantics. In fact,
