diff --git a/docs/adr/adr-009-telemetry.md b/docs/adr/adr-009-telemetry.md
new file mode 100644
index 0000000000..8fcfec2dd7
--- /dev/null
+++ b/docs/adr/adr-009-telemetry.md
@@ -0,0 +1,466 @@
+# ADR #009: Telemetry
+
+## Changelog
+
+* 2022-07-04: Started
+* 2022-07-10: Initial Draft finished
+* 2022-07-11: Stylistic improvements from @renaynay
+* 2022-07-14: Stylistic improvements from @liamsi
+* 2022-07-15: Stylistic improvements from @rootulp and @bidon15
+* 2022-07-29: Formatting fixes
+* 2022-07-29: Clarify and add more info regarding Uptrace
+* 2022-08-09: Cover metrics and add more info about trace
+
+## Authors
+
+@Wondertan @liamsi
+
+## Glossary
+
+* `ShrEx` - P2P Share Exchange Stack
+
+> It's all ogre now
+
+## Context
+
+Celestia Node needs deeper observability of each module and its components. The only integrated observability solution
+we have is logging, and there are two more options from the observability triangle (tracing, metrics and logs) that we
+need to explore.
+
+There are several priorities and reasons why we need deeper observability:
+
+* Establishing a metrics/data-driven engineering culture for celestia-node devs
+  * Metrics and tracing allow extracting dry facts out of any software about its performance, liveness, bottlenecks,
+  regressions, etc., at whole-system scale, so devs can respond reliably
+  * Based on these, all improvements can be proven with data _before_ and _after_ a change
+* Roadmap adjustment after analysis of the current `ShrEx` stack based on real-world data from:
+  * Full Node reconstruction qualities
+  * Data availability sampling
+* Incentivized Testnet
+  * Tracking participants
+  * Validating completed tasks with transparent evidence
+  * Harvesting valuable data/insights/traces that we can analyze and improve on
+* Monitoring dashboards
+  * For Celestia's own DA network infrastructure, e.g. DA Network Bootstrappers
+  * For the node operators
+* Extending the debugging arsenal for the networking-heavy DA layer
+  * Local development
+  * Issues found with [Testground testing](https://github.com/celestiaorg/test-infra)
+  * Production
+
+This ADR is intended to outline the decisions on how to proceed with:
+
+* The integration plan according to the priorities and the requirements
+* What observability tools/dependencies to integrate
+* The integration design into Celestia-Node for each observability option
+* A reference document explaining the "whats" and "hows" of integration in any part of the codebase
+* A primer for any developer in celestia-node to quickly onboard into Telemetry
+
+## Decisions
+
+### Plan
+
+#### First Priority
+
+The first priority is the `ShrEx` stack analysis results for the Celestia project. The outcome will tell us whether
+our current [Full Node reconstruction](https://github.com/celestiaorg/celestia-node/issues/602) qualities conform to
+the main network requirements, subsequently affecting the development roadmap of celestia-node before the main
+network launch. Based on that, the plan focuses on unblocking the reconstruction analysis first and then proceeding
+with steadily covering our codebase with traces for the complex codepaths, as well as metrics and dashboards for
+"measurables".
+
+Fortunately, the `ShrEx` analysis can be performed with _tracing_ only (more on that in the Tracing Design section below),
+so the decision for the celestia-node team is to cover with traces only the code _necessary_ for the current `ShrEx` stack
+as the initial response to this ADR, leaving the rest to be integrated in the background by devs on the team as they free
+up, which also serves as an efficient way for new devs to bootstrap into the code.
+
+___Update:___ The `ShrEx` analysis is neither the blocker nor the highest priority at the moment of writing.
+
+#### Second Priority
+
+The next biggest priority, the incentivized testnet, can be largely covered with traces as well. All participants will
+submit traces from their nodes to a backend endpoint we provide during the whole network lifespan. Later on, we will be
+able to verify the data of each participant by querying historical traces. This is a feature that some backend solutions
+provide, and we can also use it to extract valuable insight into how the network performs at a macro level.
+
+Even though the incentivized testnet goal can be largely covered by traces in terms of observability, metrics for this
+priority are desirable, as they provide:
+
+* Easily queryable time-series data
+* Extensive tooling to build visualizations for that data
+
+Both can facilitate the implementation of a global network observability dashboard and participant validation for this goal.
+
+#### Third Priority
+
+Enabling total observability of the node through metrics and traces.
+
+### Tooling/Dependencies
+
+#### Telemetry Golang API/Shim
+
+The decision is to use [opentelemetry-go](https://github.com/open-telemetry/opentelemetry-go) for both Metrics and Tracing
+(a brief wiring sketch follows at the end of this subsection):
+
+* Minimal and Go-savvy API/shim which gathers years of experience from OpenCensus/OpenMetrics and [CNCF](https://www.cncf.io/)
+* Backends/exporters for all the existing time-series monitoring DBs, e.g. Prometheus, InfluxDB, as well as tracing backends
+  like Jaeger, Uptrace, etc.
+* Works together with the logging engine we use - Zap
+* Provides first-class support for the generic [OTLP](https://opentelemetry.io/docs/reference/specification/protocol/) (OpenTelemetry Protocol)
+  * Generic format for any telemetry data.
+  * Allows integrating otel-go once and using it with any known backend, either
+    * Supporting OTLP natively
+    * Or through the [OTel Collector](https://opentelemetry.io/docs/collector/)
+  * Allows exporting telemetry to one endpoint only ([opentelemetry-go#3055](https://github.com/open-telemetry/opentelemetry-go/issues/3055))
+
+The discussion behind this decision can be found in [celestia-node#663](https://github.com/celestiaorg/celestia-node/issues/663);
+props to @liamsi for the initial kickoff and a deep dive into OpenTelemetry.
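+
+To illustrate the "integrate once, use with any backend" point above, here is a minimal sketch (not existing code; the
+`InitTracing` helper is illustrative, and the `semconv` import version should match the pinned otel-go release) of how
+instrumented packages depend only on the otel API, while the concrete exporter is plugged in once at node start-up:
+
+```go
+import (
+	"go.opentelemetry.io/otel"
+	"go.opentelemetry.io/otel/sdk/resource"
+	tracesdk "go.opentelemetry.io/otel/sdk/trace"
+	semconv "go.opentelemetry.io/otel/semconv/v1.10.0"
+)
+
+// InitTracing wires any SpanExporter (Jaeger, OTLP, stdout, ...) into a global
+// TracerProvider. Instrumented packages only ever call otel.Tracer(...), so
+// swapping the backend never touches their code.
+func InitTracing(exp tracesdk.SpanExporter, service string) *tracesdk.TracerProvider {
+	tp := tracesdk.NewTracerProvider(
+		tracesdk.WithBatcher(exp),
+		tracesdk.WithResource(resource.NewWithAttributes(
+			semconv.SchemaURL,
+			semconv.ServiceNameKey.String(service),
+		)),
+	)
+	otel.SetTracerProvider(tp)
+	return tp
+}
+```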
+
+#### Tracing Backends
+
+For tracing, there are four modern OSS tools that we recommend. All of them have bidirectional support with OpenTelemetry:
+
+* [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
+  * Tracing data proxy from an OTLP client to __any__ backend
+  * Supports a [long list of backends](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter)
+* [Uptrace](https://get.uptrace.dev/guide/#what-is-uptrace)
+  * The most recent (~1 year)
+  * Tied to ClickHouse DB
+  * The most lightweight
+  * Supports OTLP
+  * OSS and can be deployed locally
+  * Provides a hosted solution
+* [Jaeger](https://www.jaegertracing.io/)
+  * The most mature
+  * Started by Uber, now supported by CNCF
+  * Supports multiple storage backends (ScyllaDB, InfluxDB, Amazon DynamoDB)
+  * Supports OTLP
+* [Grafana Tempo](https://grafana.com/oss/tempo/)
+  * Deep integration with Grafana/Prometheus
+  * Relatively new (~2 years)
+  * Uses Azure, GCS, S3 or local disk for storage
+
+Each of these backends can be used independently, depending on the use case. For us, the main use cases are:
+
+* Local development/debugging for a private or even public network setup
+* Data collection from the Testground test runs
+* Bootstrappers monitoring infrastructure
+* Data collection from the incentivized testnet participants
+
+> I am personally planning to set up the lightweight Uptrace for the local light node, just to play around and observe
+> things.
+>
+> __UPDATE__: It turns out it is not that straightforward and adds additional overhead. See [Other Findings](#other-findings) below.
+
+There is no strict decision on which of these backends to use and where. People taking ownership of any of the listed
+use cases are free to use any recommended solution, or any unlisted one. The only backend requirement is support for
+OTLP, natively or through the OTel Collector. The latter, though, introduces an additional infrastructure piece that
+adds unnecessary complexity for node runners, and is thus not recommended.
+
+#### Metrics Backend
+
+We only consider OSS backends.
+
+* [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
+  * Push based
+  * Metrics data proxy from an OTLP client to __any__ backend
+  * Supports a [long list of backends](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter)
+* [Netdata](https://github.com/netdata/netdata)
+  * Push based
+  * Widely supported option in the Linux community
+  * Written in C
+  * A decade of experience, optimized down to bare metal
+  * Perfect for local monitoring setups
+  * Unfortunately, does not support OTLP
+* [Uptrace](https://get.uptrace.dev/guide/#what-is-uptrace)
+  * The most recent (~1 year)
+  * Tied to ClickHouse DB
+  * The most lightweight
+  * Supports OTLP
+  * OSS and can be deployed locally
+  * Provides a hosted solution
+* Prometheus + Grafana
+  * Pull based
+  * No native OTLP support
+    * Though there is a [spec](https://github.com/open-telemetry/wg-prometheus/blob/main/specification.md) to fix this
+    * Still, it can be used with the OTel Collector
+
+Similarly, there is no strict choice of backend solution; the only requirement is OTLP support, natively or through an
+OTLP exporter.
+
+## Design
+
+### Tracing Design
+
+Tracing allows us to see _how_ any process progresses through different modules, APIs and networks, as well as the
+timing of each operation and any events or errors as they occur.
+
+A visual example of a generic tracing dashboard provided via the [Uptrace](https://uptrace.dev/) backend:
+![tracing](img/tracing-dashboard.png)
+
+Mainly, for `ShrEx` and reconstruction analysis we need to know whether the reconstruction succeeded and the time it took
+for big block sizes (EDS >= 128). Tracing in this case would provide all the data for the whole reconstruction operation
+and for each sub-operation within reconstruction, e.g. time spent specifically on erasure coding.
+> NOTE: The exact compute time is not available unless [rsmt2d#107](https://github.com/celestiaorg/rsmt2d/issues/107)
+> is fixed.
+
+#### Spans
+
+A span represents an operation (unit of work) in a trace. Spans keep the time when the operation _started_ and _ended_,
+any additional user-defined _attributes_, the operation status (success, or error together with the error itself) and
+events/logs that may happen during the operation.
+
+Spans also form a parent tree, meaning that each span associated with a process can have multiple sub-processes or child
+spans, and vice versa. Altogether, this allows us to see the whole execution trace of any part of the system, no matter
+how complex it is. This is exactly what we need to analyze our reconstruction performance.
+
+#### Tracing Integration Example
+
+First, we define a global pkg-level tracer to create spans from within the `ipld` pkg. Basically, it groups spans under
+a common logical namespace and extends the full name of each span.
+
+```go
+var tracer = otel.Tracer("ipld")
+```
+
+Then, we define a root span in `ipld.Retriever`:
+
+```go
+import (
+	"context"
+	"encoding/hex"
+
+	"go.opentelemetry.io/otel"
+	"go.opentelemetry.io/otel/attribute"
+)
+
+func (r *Retriever) Retrieve(ctx context.Context, dah *da.DataAvailabilityHeader) (*rsmt2d.ExtendedDataSquare, error) {
+	ctx, span := tracer.Start(ctx, "retrieve-square")
+	defer span.End()
+
+	span.SetAttributes(
+		attribute.Int("size", len(dah.RowsRoots)),
+		attribute.String("data_hash", hex.EncodeToString(dah.Hash())),
+	)
+	...
+}
+```
+
+Next, the child span in `ipld.Retriever.Reconstruct`:
+
+```go
+	ctx, span := tracer.Start(ctx, "reconstruct-square")
+	defer span.End()
+
+	// and try to repair with what we have
+	err := rs.squareImported.Repair(rs.dah.RowsRoots, rs.dah.ColumnRoots, rs.codec, rs.treeFn)
+	if err != nil {
+		span.RecordError(err)
+		return nil, err
+	}
+```
+
+And lastly, the quadrant request event:
+
+```go
+	span.AddEvent("requesting quadrant", trace.WithAttributes(
+		attribute.Int("axis", q.source),
+		attribute.Int("x", q.x),
+		attribute.Int("y", q.y),
+		attribute.Int("size", len(q.roots)),
+	))
+```
+
+> The above are only examples related to our code and are subject to change.
+
+Here is the result of the above code sending traces, visualized in the Jaeger UI:
+![tracing](img/trace-jaeger.png)
+
+#### Backends Connection
+
+Example for Jaeger:
+
+```go
+	// Create the Jaeger exporter
+	exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(url)))
+	if err != nil {
+		return nil, err
+	}
+	// then the tracer provider
+	tp := tracesdk.NewTracerProvider(
+		// Always be sure to batch in production.
+		tracesdk.WithBatcher(exp),
+		// Record information about this application in a Resource.
+		tracesdk.WithResource(resource.NewWithAttributes(
+			semconv.SchemaURL,
+			semconv.ServiceNameKey.String(service),
+			attribute.String("environment", environment),
+			attribute.Int64("ID", id),
+		)),
+	)
+	// and set it globally to be used across packages
+	otel.SetTracerProvider(tp)
+
+	// then close it elsewhere
+	tp.Shutdown(ctx)
+```
+
+We decided to use an OTLP backend, and the setup is almost identical.
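+
+For illustration, here is a minimal sketch of the only part that changes, using the OTLP/HTTP trace exporter (the
+`newOTLPExporter` helper and the example endpoint are placeholders, not existing code):
+
+```go
+import (
+	"context"
+
+	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
+	tracesdk "go.opentelemetry.io/otel/sdk/trace"
+)
+
+// newOTLPExporter creates an OTLP trace exporter talking to an OTLP/HTTP endpoint,
+// e.g. an OTel Collector or a backend with native OTLP support.
+func newOTLPExporter(ctx context.Context, endpoint string) (tracesdk.SpanExporter, error) {
+	return otlptracehttp.New(ctx,
+		otlptracehttp.WithEndpoint(endpoint), // e.g. "localhost:4318"
+		otlptracehttp.WithInsecure(),         // drop this for TLS-protected endpoints
+	)
+}
+```
+
+The returned exporter is then passed to `tracesdk.NewTracerProvider` exactly as in the Jaeger example above.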
+
+### Metrics Design
+
+Metrics allow collecting time-series data from different measurable points in the application. Every measurable can
+be covered via the six instruments OpenTelemetry provides:
+
+* ___Counter___ - synchronous instrument that measures additive non-decreasing values.
+* ___UpDownCounter___ - synchronous instrument which measures additive values that increase or decrease with time.
+* ___Histogram___ - synchronous instrument that produces a histogram from recorded values.
+* ___CounterObserver___ - asynchronous instrument that measures additive non-decreasing values.
+* ___UpDownCounterObserver___ - asynchronous instrument that measures additive values that can increase or decrease with time.
+* ___GaugeObserver___ - asynchronous instrument that measures non-additive values for which a sum does not produce a meaningful or correct result.
+
+#### Metrics Integration Example
+
+Consider that we want to report the current network height as a metric.
+
+First of all, the global pkg-level meter has to be defined in the code related to the desired metric. In our case, it is the `header` pkg.
+
+```go
+var meter = global.MeterProvider().Meter("header")
+```
+
+Next, we should understand which instrument to use. At first glance, a ___Counter___ instrument fits the chain height,
+as it is a non-decreasing value; then we should decide whether we need a sync or an async version of it. For our case,
+both would work, and it is more a question of the precision we want: sync metering would report every height change,
+while async metering would poke the `header` pkg API periodically to get the metered data. For our example, we will go
+with the latter.
+
+```go
+// MonitorHead enables Otel metrics to monitor head.
+func MonitorHead(store Store) {
+	headC, _ := meter.AsyncInt64().Counter(
+		"head",
+		instrument.WithUnit(unit.Dimensionless),
+		instrument.WithDescription("Subjective head of the node"),
+	)
+
+	err := meter.RegisterCallback(
+		[]instrument.Asynchronous{
+			headC,
+		},
+		func(ctx context.Context) {
+			head, err := store.Head(ctx)
+			if err != nil {
+				headC.Observe(ctx, 0, attribute.String("err", err.Error()))
+				return
+			}
+
+			headC.Observe(
+				ctx,
+				head.Height,
+				attribute.Int("square_size", len(head.DAH.RowsRoots)),
+			)
+		},
+	)
+	if err != nil {
+		panic(err)
+	}
+}
+```
+
+The example follows a purely API-based approach without the need to integrate the metric deeper into the implementation
+internals, which is nice and keeps metering decoupled from business logic. The `MonitorHead` func simply accepts the
+`Store` interface and reads the information about the latest subjective header via `Head` on the node.
+
+The API-based approach should be followed for any info-level metric. Even if there is no API to get the required metric,
+such an API should be introduced. However, this approach is not always possible: sometimes deeper integration with the
+code logic is necessary to analyze performance, or there are security and/or encapsulation considerations.
+
+In the example, we can also see how additional data can be added to the instruments via attributes or labels. It is
+important to add only absolutely necessary data (more on that in [Other Findings](#other-findings) below) to the metrics,
+or data which is common over multiple time series. In this case, we attach the `square_size` of the height to know the
+block size at that height. This allows us to query reported heights with a given square size using the backend UI. Note
+that only powers of two (with 256 being the current limit) are possible as unique values for this attribute, so it won't
+put pressure on the metrics backend.
+
+For Go code examples of other metric instruments, consult the [Uptrace Otel docs](https://uptrace.dev/opentelemetry/go-metrics.html#getting-started);
+a brief synchronous sketch is also shown below.
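+
+As a complement to the asynchronous example above, here is a minimal sketch of a synchronous instrument using the same
+otel-go metric API; the `das_sample_time_ms` histogram, the `recordSample` helper and its attributes are illustrative,
+not existing code, and they reuse the package-level `meter` defined above:
+
+```go
+import (
+	"context"
+	"time"
+
+	"go.opentelemetry.io/otel/attribute"
+	"go.opentelemetry.io/otel/metric/instrument"
+	"go.opentelemetry.io/otel/metric/unit"
+)
+
+var sampleTime, _ = meter.SyncInt64().Histogram(
+	"das_sample_time_ms",
+	instrument.WithUnit(unit.Milliseconds),
+	instrument.WithDescription("Time taken to sample a single share"),
+)
+
+func recordSample(ctx context.Context, start time.Time, squareSize int) {
+	// Synchronous instruments record at the call site, so every event is reported.
+	sampleTime.Record(ctx,
+		time.Since(start).Milliseconds(),
+		attribute.Int("square_size", squareSize),
+	)
+}
+```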
+
+#### Backends Connection
+
+Example for OTLP, extracted from our code:
+
+```go
+	opts := []otlpmetrichttp.Option{
+		otlpmetrichttp.WithCompression(otlpmetrichttp.GzipCompression),
+		otlpmetrichttp.WithEndpoint(cmd.Flag(metricsEndpointFlag).Value.String()),
+	}
+	if ok, err := cmd.Flags().GetBool(metricsTlS); err != nil {
+		panic(err)
+	} else if !ok {
+		opts = append(opts, otlpmetrichttp.WithInsecure())
+	}
+
+	exp, err := otlpmetrichttp.New(cmd.Context(), opts...)
+	if err != nil {
+		return err
+	}
+
+	pusher := controller.New(
+		processor.NewFactory(
+			selector.NewWithHistogramDistribution(),
+			exp,
+		),
+		controller.WithExporter(exp),
+		controller.WithCollectPeriod(2*time.Second),
+		controller.WithResource(resource.NewWithAttributes(
+			semconv.SchemaURL,
+			semconv.ServiceNameKey.String(fmt.Sprintf("Celestia-%s", env.NodeType.String())),
+			// Here we can add more attributes with Node information
+		)),
+	)
+```
+
+## Considerations
+
+* Tracing performance
+  * Tracing _every_ method means each call invokes two extra functions and produces data sent over the network, which
+  can affect overall performance
+* Metrics backend performance
+  * Mainly, we should avoid sending too much data to the metrics backend through labels (e.g. hashes, UUIDs, etc.) so
+  as not to overload it. Metrics are only for metrics, not for indexing.
+* Security and exported data protection
+  * OTLP provides TLS support
+
+## Other Findings
+
+### Labels and High Cardinality
+
+The high-cardinality (many different label values) issue should always be kept in mind while introducing new metrics
+and labels for them. Each metric should carry only the absolutely necessary labels, and we should stay away from metrics
+sending __unique__ label values each time, e.g. hashes, UUIDs, etc. Doing the opposite can dramatically increase the
+amount of data stored.
+
+### Tracing and Logging
+
+As seen in the examples above, tracing looks similar to logging and has almost the same semantics. In fact, tracing is
+debug logging on steroids, and we can potentially consider dropping conventional _debug_ logging once we fully cover
+our codebase with tracing. Just like logs, traces can be piped out to stdout as a pretty-printed event log.
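+
+For example, a minimal sketch of piping traces to stdout with the standard otel-go stdout exporter (this is not code we
+currently ship; the `initStdoutTracing` helper is illustrative):
+
+```go
+import (
+	"go.opentelemetry.io/otel"
+	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
+	tracesdk "go.opentelemetry.io/otel/sdk/trace"
+)
+
+// initStdoutTracing pretty-prints every finished span to stdout,
+// which can serve as (or complement) debug logging during local development.
+func initStdoutTracing() (*tracesdk.TracerProvider, error) {
+	exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
+	if err != nil {
+		return nil, err
+	}
+	tp := tracesdk.NewTracerProvider(tracesdk.WithBatcher(exp))
+	otel.SetTracerProvider(tp)
+	return tp, nil
+}
+```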
+
+### Uptrace
+
+It turns out that running only Uptrace locally, without the OTel Collector, is a PITA. It requires either:
+
+* Using their [uptrace-go](https://github.com/uptrace/uptrace-go/blob/master/example/metrics/main.go) custom OTel wrapper
+  * For some undocumented reason they decided to go with a custom wrapper, while it is possible to use OTel with Uptrace
+  directly
+* Or direct usage, which also involves additional friction and does not work with defaults. It requires:
+  * Token auth to send data
+  * A custom URL and path
+  * Maintaining a config for Uptrace itself and ClickHouse
+
+Overall, it is not a user-friendly alternative to the better-known projects, even though it does not require running the
+OTel Collector and absorbs both tracing and metrics.
+
+## Further Readings
+
+* [Uptrace tracing tools comparison](https://get.uptrace.dev/compare/distributed-tracing-tools.html)
+* [Uptrace guide](https://get.uptrace.dev/guide/)
+* [Uptrace OpenTelemetry Docs](https://opentelemetry.uptrace.dev/)
+  * Provides a simple Go API guide for metrics and traces
+* [OpenTelemetry Docs](https://opentelemetry.io/docs/)
+* [Prometheus Docs](https://prometheus.io/docs/introduction/overview)
+
+## Status
+
+Proposed
diff --git a/docs/adr/img/trace-jaeger.png b/docs/adr/img/trace-jaeger.png
new file mode 100644
index 0000000000..dc96f9651f
Binary files /dev/null and b/docs/adr/img/trace-jaeger.png differ
diff --git a/docs/adr/img/tracing-dashboard.png b/docs/adr/img/tracing-dashboard.png
new file mode 100644
index 0000000000..bccb9af346
Binary files /dev/null and b/docs/adr/img/tracing-dashboard.png differ