docs(adr): adr-009: fix formatting

Wondertan committed Jul 29, 2022 · 1 parent 74df42f · commit d71dd47
Showing 1 changed file with 59 additions and 41 deletions: docs/adr/adr-009-telemetry.md
* 2022-07-11: Stylistic improvements from @renaynay
* 2022-07-14: Stylistic improvements from @liamsi
* 2022-07-15: Stylistic improvements from @rootulp and @bidon15
* 2022-07-29: Formatting fixes

## Authors

@Wondertan @liamsi

## Glossary
* `ShrEx` - P2P Share Exchange Stack

> It's all ogre now

## Context

> Now I know why I don't like writing ADRs - because I cannot run/test them and see if they work or not.
> Hoping that quality team feedback will solve this problem!

Celestia Node needs deeper observability of each module and its components. The only integrated observability solution
we have is logging, and there are two more options from the observability triangle (tracing, metrics, and logs) that we need to explore.

There are several priorities and reasons why we need deeper observability:

* Establishing metrics/data driven engineering culture for celestia-node devs
* Metrics and tracing allow extracting dry facts out of any software on its performance, liveness, bottlenecks,
regressions, etc., on a whole-system scale, so devs can reliably respond
* Production

This ADR is intended to outline the decisions on how to proceed with:

* Integration plan according to the priorities and the requirements
* What observability tools/dependencies to integrate
* Integration design into Celestia-Node for each observability option
#### First Priority

The first priority lies on "ShrEx" stack analysis results for the Celestia project. The outcome will tell us whether
our current [Full Node reconstruction](https://github.com/celestiaorg/celestia-node/issues/602) qualities conform to
the main network requirements, subsequently affecting the development roadmap of celestia-node before the main
network launch. Based on the former, the plan is to focus on unblocking the reconstruction
analysis first and then proceed with steadily covering our codebase with traces for the complex codepaths, as well as
metrics and dashboards for "measurables".

Fortunately, the `ShrEx` analysis can be performed with _tracing_ only (more on that in the Tracing Design section below),
so the decision for the celestia-node team is to cover with traces only the code _necessary_ for the current "ShrEx" stack
as the initial response to this ADR, leaving the rest to be integrated in the background by the devs on the team
once they are free, as well as for efficient bootstrapping into the code for new devs.

#### Second Priority

The next biggest priority, the incentivized testnet, can be largely covered with traces as well. All participants will submit
traces from their nodes to any backend endpoint we provide during the whole network lifespan. Later on, we will be
able to verify the data of each participant by querying historical traces. This is a feature that some backend solutions
provide, which we can also use to extract valuable insight into how the network performs in a macro view.

### Tooling/Dependencies

#### Golang API/Shim

The decision is to use [opentelemetry-go](https://github.com/open-telemetry/opentelemetry-go) for both Metrics and Tracing:

* Minimal and golang savvy API/shim which gathers years of experience from OpenCensus/OpenMetrics and [CNCF](https://www.cncf.io/)
* Backends/exporters for all the existing timeseries monitoring DBs, e.g. Prometheus and InfluxDB, as well as tracing
backends like Jaeger, Uptrace, etc.
* <https://github.com/uptrace/opentelemetry-go-extra/tree/main/otelzap> integrates with the logging engine we use - Zap

The discussion over this decision can be found in [celestia-node#663](https://github.com/celestiaorg/celestia-node/issues/663)
and props to @liamsi for initial kickoff and a deep dive into OpenTelemetry.

#### Tracing Backends

For tracing, there are 3 modern OSS tools that are recommended. All of them have bidirectional support with OpenTelemetry:

* [Uptrace](https://get.uptrace.dev/guide/#what-is-uptrace)
* The most recent (~1 year)
* The richest UI
* Relatively new (~2 years)
* Uses Azure, GCS, S3 or local disk for storage

Each of these backends can be used independently, depending on the use case. For us, these are the main use cases:

* Local development/debugging for the private network or even public network setup
* Data collection from the Testground test runs
* Bootstrappers monitoring infrastructure
* Data collection from the incentivized testnet participants

> I am personally planning to set up the lightweight Uptrace for the local light node. Just to play around and observe
> things

### Tracing Design

Tracing allows us to see _how_ any process progresses through different modules, APIs, and networks, as well as the
timings of each operation and any events or errors as they occur.

A visual example of a generic tracing dashboard provided via the [Uptrace](https://uptrace.dev/) backend:
![tracing](img/tracing-dashboard.png)

Mainly, for `ShrEx` and reconstruction analysis we need to know whether the reconstruction succeeded and the time it
took for big block sizes (EDS >= 128). Tracing in this case would provide all the data for the whole reconstruction
operation and for each sub-operation within reconstruction, e.g. time spent specifically on erasure coding.

> NOTE: The exact compute time is not available unless [rsmt2d#107](https://github.com/celestiaorg/rsmt2d/issues/107)
> is fixed.

#### Spans

A span represents an operation (unit of work) in a trace. Spans keep the time when the operation _started_ and _ended_,
any additional user-defined _attributes_, the operation status (success, or error with the error itself), and any
events/logs that may happen during the operation.

Spans also form a parent tree, meaning that each span associated with a process can have multiple sub-processes or child
spans and vice versa. Altogether, this feature allows us to see the whole trace of execution of any part of the system,
no matter how complex it is. This is exactly what we need to analyze our reconstruction performance.

#### Integration Example

First, we define a global pkg-level tracer to create spans from within the `ipld` pkg. Basically, it groups spans under
a common logical namespace and extends the full name of each span.

```go
var tracer = otel.Tracer("ipld")
```

Then, we define a root span in `ipld.Retriever`:

```go
import "go.opentelemetry.io/otel"

func (r *Retriever) Retrieve(ctx context.Context, dah *da.DataAvailabilityHeader) (*rsmt2d.ExtendedDataSquare, error) {
	ctx, span := tracer.Start(ctx, "retrieve-square")
	defer span.End()
	// ...
}
```

Next, the child span in `ipld.Retriever.Reconstruct`:

```go
ctx, span := tracer.Start(ctx, "reconstruct-square")
defer span.End()

// and try to repair with what we have
err := rs.squareImported.Repair(rs.dah.RowsRoots, rs.dah.ColumnRoots, rs.codec, rs.treeFn)
if err != nil {
	span.RecordError(err)
	return nil, err
}
```

And lastly, the quadrant request event:

```go
span.AddEvent("requesting quadrant", trace.WithAttributes(
attribute.Int("axis", q.source),
attribute.Int("size", len(q.roots)),
))
```

> The above are only examples related to our code and are subject to change.

Here is the result of the above code sending traces, visualized in the Jaeger UI:
![tracing](img/trace-jaeger.png)

#### Backends connection

Jaeger example

```go
// Create the Jaeger exporter
exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(url)))
if err != nil {
return nil, err
}
// then the tracer provider
tp := tracesdk.NewTracerProvider(
	// Always be sure to batch in production.
	tracesdk.WithBatcher(exp),
	// Record information about this application in a Resource.
	tracesdk.WithResource(resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String(service),
		attribute.String("environment", environment),
		attribute.Int64("ID", id),
	)),
)
// and set it globally to be used across packages
otel.SetTracerProvider(tp)
// then close it elsewhere
tp.Shutdown(ctx)
```

## Other

As you will see in the examples below, tracing looks similar to logging and has almost the same semantics. In fact,
tracing is debug logging on steroids, and we can potentially consider dropping conventional _debug_ logging once we
fully cover our codebase with tracing. As with logging, traces can be piped out to stdout as a prettyprinted
event log.

## Further Readings

* [Uptrace tracing tools comparison](https://get.uptrace.dev/compare/distributed-tracing-tools.html)
* [Uptrace guide](https://get.uptrace.dev/guide/)
* [Uptrace OpenTelemetry Docs](https://opentelemetry.uptrace.dev/)
  * Provides simple Go API guide for metrics and traces
* [OpenTelemetry Docs](https://opentelemetry.io/docs/)
* [Prometheus Docs](https://prometheus.io/docs/introduction/overview)

## Status

Proposed
