docs(adr): initial draft for telemetry adr
Wondertan committed Jul 11, 2022
1 parent 3f2d743 commit f1866fd
Showing 2 changed files with 220 additions and 0 deletions.
docs/adr/adr-009-telemetry.md (220 additions, 0 deletions)
@@ -0,0 +1,220 @@
# ADR #009: Telemetry

## Changelog

* 2022-07-04: Started
* 2022-07-10: Initial Draft finished

## Authors

@Wondertan @liamsi

## Context

> Now I know why I don't like writing ADRs - because I cannot run/test them and see if they work or not.
> Hoping that quality team feedback will solve this problem!

Celestia Node needs deeper observability of each module and its components. The only integrated observability solution
we have is logging, and there are two more options from the observability triangle (tracing, metrics and logs) that we need to explore.

There are several priorities and "why"s behind the need for deeper observability:
* Analysis of the current p2p share exchange or "ShrEx" stack
  * So we can evaluate real-world Full Node reconstruction qualities, along with data availability sampling
  * And unblock celestia-node development planning
* Incentivized Testnet
  * Tracking participants and validating that they perform tasks correctly
  * So all participants provide us with valuable data/insights/traces that we can analyze and improve on
* Monitoring of Celestia's own DA network infrastructure, e.g. DA Network Bootstrappers
* Monitoring dashboard for the node operators
* Extending the debugging arsenal for the networking-heavy DA layer
  * Local development
  * Issues found with Testground
  * Production
* Establishing a metrics/data-driven engineering culture for celestia-node devs
  * Metrics and tracing allow extracting dry facts out of any software about its performance, liveness, bottlenecks,
    regressions, etc., on a whole-system scale, so devs can reliably respond
  * Based on these, all improvements can be proven with data _before_ and _after_ a change

This ADR is intended to outline the decisions on how to proceed with:
* Integration plan according to the priorities and the requirements
* What observability tools/dependencies to integrate
* Integration design into Celestia-Node for each observability option
* A reference document explaining the "whats" and "hows" when integrating telemetry into any part of the codebase, e.g. for a new dev

## Decisions

### Plan

The first "ShrEx" stack analysis priority is critical for Celestia project. The analysis results will tell us whether
our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting
the development roadmap of the celestia-node before the main network launch, therefore is a potential blocker to the
launch, which needs to be resolved ASAP. Basing on the former, the plan is focused on unblocking the reconstruction
story first and then proceed with steady covering of our codebase with traces for the complex codepaths as well as
metrics and dashboards for "measurables".

Fortunately, the "ShrEx" analysis can be performed with _tracing_ only (more on that in [Tracing](./#Tracing)), so the
decision for the celestia-node team is to cover with traces only the code _necessary_ for the current "ShrEx" stack as
the initial response to this ADR. The rest can be integrated in the background by devs on the team as they free up,
which also serves as efficient bootstrapping into the code for new devs.

The next biggest priority, the incentivized Testnet, can also be largely covered with traces. All participants will submit
traces from their nodes to a backend endpoint we provide, during the whole network lifespan. Later on, we will be
able to verify the data of each participant by querying historical traces. This is a feature that some backend solutions
provide, and we can also use it to extract valuable insight into how the network performs in a macro view.

### Tooling/Dependencies

#### Golang API/Shim

The decision is to use [opentelemetry-go](https://github.com/open-telemetry/opentelemetry-go) for both Metrics and Tracing:
* A minimal, Go-savvy API/shim that gathers years of experience from OpenCensus/OpenMetrics and the [CNCF](https://www.cncf.io/)
* Backends/exporters for all the existing timeseries monitoring DBs, e.g. Prometheus and InfluxDB, as well as tracing backends
like Jaeger, Uptrace, etc.
* Integration with the logging engine we use, Zap, via [otelzap](https://github.com/uptrace/opentelemetry-go-extra/tree/main/otelzap) (see the sketch below)
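
A minimal sketch of what the otelzap wiring could look like, based on the otelzap README; the function name, message and
fields are illustrative only:
```go
import (
	"context"

	"github.com/uptrace/opentelemetry-go-extra/otelzap"
	"go.uber.org/zap"
)

func logWithSpan(ctx context.Context) {
	// Wrap a regular Zap logger; records logged through the wrapper are also
	// attached as events to the span carried in ctx (if any).
	log := otelzap.New(zap.NewExample())

	// Ctx(ctx) binds the log record to the currently active span.
	log.Ctx(ctx).Info("sampling finished", zap.Int("samples", 16))
}
```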

The discussion over this decision can be found in [celestia-node#663](https://github.com/celestiaorg/celestia-node/issues/663);
props to @liamsi for the initial kickoff and a deep dive into OpenTelemetry.

#### Tracing Backends

For tracing, three modern OSS tools are recommended. All of them integrate with OpenTelemetry:
* [Uptrace](https://get.uptrace.dev/guide/#what-is-uptrace)
  * The most recent (~1 year old)
  * The richest UI
  * Tied to ClickHouse DB
  * Made by OpenTelemetry
  * The most lightweight
* [Jaeger](https://www.jaegertracing.io/)
  * The most mature
  * Started by Uber, now supported by CNCF
  * Supports multiple storage backends (ScyllaDB, InfluxDB, Amazon DynamoDB)
* [Grafana Tempo](https://grafana.com/oss/tempo/)
  * Deep integration with Grafana/Prometheus
  * Relatively new (~2 years old)
  * Uses Azure, GCS, S3 or local disk for storage

Each of these backends can be used independently, depending on the use case. For us, the main use cases are:
* Local development/debugging on a private network or even a public network setup
> I am personally planning to set up the lightweight Uptrace for the local light node. Just to play around and observe
> things
* Data collection from the Testground test runs
* Bootstrappers monitoring infrastructure
* Data collection from the incentivized testnet participants

There is no strict decision on which of these backends to use and where. People taking ownership of any of the listed
vectors are free to use any of the recommended solutions, or any solution not listed here.

#### Metrics Backend

// WIP

## Design

### Tracing

Tracing allows us to see _how_ any process progresses through different modules, APIs and networks, as well as the timings of
each operation and any events or errors as they occur.

A visual example of a generic tracing dashboard provided via the [Uptrace](https://uptrace.dev/) backend:
![tracing](img/tracing-dashboard.png)

Mainly, for "ShrEx" and reconstruction analysis we need to know whether the reconstruction succeeded and the time it took.
Tracing in this case would provide both, for the whole reconstruction operation as well as for each sub-operation of every
process and node involved.

#### Spans

A span represents an operation (unit of work) in a trace. Spans keep the time when the operation _started_ and _ended_, any
additional user-defined _attributes_, the operation status (success, or error together with the error itself), and any
events/logs that may happen during the operation.

Spans also form a parent tree, meaning that each span associated with a process can have multiple sub-processes or child
spans, and vice versa. Altogether, this allows us to see the whole execution trace of any part of the system, no
matter how complex it is. This is exactly what we need to analyze our reconstruction performance.

An example of a root span in `ipld.Retriever`:
```go
import "go.opentelemetry.io/otel"

var tracer = otel.Tracer("share")

func (r *Retriever) Retrieve(ctx context.Context, dah *da.DataAvailabilityHeader) (*rsmt2d.ExtendedDataSquare, error) {
ctx, span := tracer.Start(ctx, "retrieve-square")
defer span.End()

span.SetAttributes(
attribute.Int("size", len(dah.RowsRoots)),
attribute.String("data_hash", hex.EncodeToString(dah.Hash())),
)
...
}
```

The child span in `ipld.Retriever.Reconstruct`:
```go
ctx, span := tracer.Start(ctx, "reconstruct-square")
defer span.End()

// and try to repair with what we have
err := rs.squareImported.Repair(rs.dah.RowsRoots, rs.dah.ColumnRoots, rs.codec, rs.treeFn)
if err != nil {
span.RecordError(err)
return nil, err
}
```

And the quadrant request event:
```go
span.AddEvent("requesting quadrant", trace.WithAttributes(
attribute.Int("axis", q.source),
attribute.Int("x", q.x),
attribute.Int("y", q.y),
attribute.Int("size", len(q.roots)),
))
```
> The above are only examples related to our code and are subject to change.

#### Backends connection

Jaeger example (the exporter and provider setup follows the upstream opentelemetry-go Jaeger exporter example; the
`semconv` version should match the otel release in use):
```go
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.10.0"
)

func tracerProvider(url, service, environment string, id int64) (*tracesdk.TracerProvider, error) {
	// Create the Jaeger exporter
	exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(url)))
	if err != nil {
		return nil, err
	}
	// then the tracer provider
	tp := tracesdk.NewTracerProvider(
		// Always be sure to batch in production.
		tracesdk.WithBatcher(exp),
		// Record information about this application in a Resource.
		tracesdk.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(service),
			attribute.String("environment", environment),
			attribute.Int64("ID", id),
		)),
	)
	// and set it globally to be used across packages
	otel.SetTracerProvider(tp)
	return tp, nil
}

// then close it elsewhere, e.g. on node shutdown:
// tp.Shutdown(ctx)
```

### Metrics

// WIP

## Considerations

* Tracing performance
  * _Every_ traced method calls two more functions and may make a network request, which can affect overall performance;
    batching and sampling can bound this overhead (see the sketch below)
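
A minimal sketch of bounding the overhead with the SDK's built-in sampler and batcher, assuming a ratio-based sampling
policy is acceptable for us (the 10% ratio and function name are illustrative only):
```go
import (
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
)

// newSampledProvider records only a fraction of traces and exports them
// asynchronously in batches, keeping the per-call overhead low.
func newSampledProvider(exp tracesdk.SpanExporter) *tracesdk.TracerProvider {
	return tracesdk.NewTracerProvider(
		// Sample ~10% of new traces; child spans follow the parent's decision.
		tracesdk.WithSampler(tracesdk.ParentBased(tracesdk.TraceIDRatioBased(0.1))),
		// Export finished spans in batches instead of one by one.
		tracesdk.WithBatcher(exp),
	)
}
```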

## Other

As shown in the examples above, tracing looks similar to logging and has almost the same semantics. In fact,
tracing is debug logging on steroids, and we can potentially consider dropping conventional _debug_ logging once we
fully cover our codebases with tracing. Just like logging, traces can be piped out to stdout as a pretty-printed
event log (see the sketch below).
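
A minimal sketch of the stdout option using the stdouttrace exporter; the function name is illustrative only:
```go
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
)

// stdoutTracerProvider prints every finished span to stdout as indented JSON,
// which reads much like a structured debug log.
func stdoutTracerProvider() (*tracesdk.TracerProvider, error) {
	exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		return nil, err
	}
	tp := tracesdk.NewTracerProvider(tracesdk.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	return tp, nil
}
```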

## Status

Proposed
Binary file added docs/adr/img/tracing-dashboard.png
