-
Notifications
You must be signed in to change notification settings - Fork 973
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs(adr): ADR #009: Telemetry #901
base: main
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## main #901 +/- ##
==========================================
- Coverage 58.53% 58.11% -0.43%
==========================================
Files 125 125
Lines 7409 7415 +6
==========================================
- Hits 4337 4309 -28
- Misses 2617 2647 +30
- Partials 455 459 +4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏🏻
docs/adr/adr-009-telemetry.md
Outdated
### Plan | ||
|
||
The first "ShrEx" stack analysis priority is critical for Celestia project. The analysis results will tell us whether | ||
our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting | |
our current Full Node reconstruction qualities conforms to mainnet requirements, subsequently affecting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I like verbosity in this case
docs/adr/adr-009-telemetry.md
Outdated
|
||
The first "ShrEx" stack analysis priority is critical for Celestia project. The analysis results will tell us whether | ||
our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting | ||
the development roadmap of the celestia-node before the main network launch, therefore is a potential blocker to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the development roadmap of the celestia-node before the main network launch, therefore is a potential blocker to the | |
the development roadmap of the celestia-node before the main network launch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? It is a potential blocker.
## Considerations | ||
|
||
* Tracing performance | ||
* _Every_ method calling two more functions making network request can affect overall performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meaning this could degrade overall node performance?
Could we resolve this with having light || comprehensive
tracing modes eventually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. IO is expensive. Doing it everywhere degrades overall performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we resolve this with having light || comprehensive tracing modes eventually?
Maybe. It is not yet clear what the difference would be. For now, if you wanna have tracing, performance might be sacrificed. If you turn off it, then things should be almost the same.
dd36676
to
1e2e2d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good. But I will still insist that this ADR should not exclusively focus on tracing as metrics can be very insightful and are very necessary. I feel if I just approve it as is, it could be misunderstood as a go for only priotizing tracing in the foreseeable future. But we need both.
docs/adr/adr-009-telemetry.md
Outdated
* So we can evaluate real world Full Node reconstruction qualities, along with data availability sampling | ||
* And adjust our roadmap accordingly. | ||
* Incentivized Testnet | ||
* Tracking participants and validation that do task correctly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Tracking participants and validation that do task correctly | |
* Tracking participation and progress on incentivized testnet tasks. |
validators are tracked differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liamsi, This is not about validators, but about validation of participants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed with @YazzyYaz, there is going to be a different set of tasks for which participants will get additional points. They may and may not take this. We need to verify what they claimed to do.
docs/adr/adr-009-telemetry.md
Outdated
our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting | ||
the development roadmap of the celestia-node before the main network launch, therefore is a potential blocker to the | ||
launch, which needs to be resolved ASAP. Basing on the former, the plan is focused on unblocking the reconstruction | ||
story first and then proceed with steady covering of our codebase with traces for the complex codepaths as well as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but you could also just link to the blcok reconstruction issue s.t. new hires can look into what is meant by "the reconstruction story" independently of tracing.
Thanks for the review, @liamsi. I wasn't planning to merge it before the Metrics sections are filled up. The goal was to unblock the reconstruction analysis ASAP and to leave metrics integration for later in the background |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing this ADR! Makes me want to learn more about https://opentelemetry.io/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome kick-start! So stoked to see this alive! 🌟
Left some suggestions for readability improvements as I needed to go up and down several times to get the idea of the ADR.
df7e853
to
62027a7
Compare
Based on celestiaorg#901 and likely shouldn't be merged before it
9d87da3
to
9581d16
Compare
…ootulp + add integrated trace visualization with Jaeger UI
9581d16
to
260f7be
Compare
260f7be
to
de5f85d
Compare
de5f85d
to
4163cb6
Compare
Ok, so the metrics are now fully covered in this ADR and this part is ready for review. The next part is to finish the list of places where metrics and traces should be integrated. |
@@ -0,0 +1,469 @@ | |||
# ADR #009: Telemetry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion:
# ADR #009: Telemetry | |
# ADR #009: Telemetry and General Observability via Tracing and Metrics |
Based on celestiaorg#901 and likely shouldn't be merged before it
Based on celestiaorg#901 and likely shouldn't be merged before it
Based on celestiaorg#901 and likely shouldn't be merged before it
Co-authored-by: Rootul P <[email protected]>
Co-authored-by: Ismail Khoffi <[email protected]>
Co-authored-by: Rootul P <[email protected]>
Co-authored-by: Rootul P <[email protected]>
Co-authored-by: Rootul P <[email protected]>
Co-authored-by: Rootul P <[email protected]>
Co-authored-by: Rootul P <[email protected]>
span.SetAttributes( | ||
attribute.Int("size", len(dah.RowsRoots)), | ||
attribute.String("data_hash", hex.EncodeToString(dah.Hash())), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cardinality issue right here using datahash as attribute. The topic of high cardinality worth its own paragraph as it seems it happens to be encountered by devs from time to time.
Next, we should understand what instrument to use. On the first glance, for chain height a ___Counter___ instrument should | ||
fit, as it is a non-decreasing value, and then we should think whether we need a sync or async version of it. For our case, | ||
both would work and its more the question of precision we want. Sync metering would report every height change, while | ||
the async would poke `header` pkg API periodically to get the metered data. For our example, we will go with the latter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section is missing the point that Counter will be iterated from multiple sources, that might include same application being restarted. For specified case only async Gauge can be used to reflect current network height.
func MonitorHead(store Store) { | ||
headC, _ := meter.AsyncInt64().Counter( | ||
"head", | ||
instrument.WithUnit(unit.Dimensionless), | ||
instrument.WithDescription("Subjective head of the node"), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics API for Counter initialisation and async callbacks has changed, so examples are outdated
func(ctx context.Context) { | ||
head, err := store.Head(ctx) | ||
if err != nil { | ||
headC.Observe(ctx, 0, attribute.String("err", err.Error())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should never use errors as attributes. It causes cardinality issues and is not useful, as it is impossible to query such errors from metrics
#### Backends Connection | ||
|
||
Example for OTLP extracted from our code | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example needs to be updated. Primaraly it misses global attributes we set:
- network name
- node type
- peer ID
Those are crucial for metrics query and filtering and must be a part of adr
if err != nil { | ||
panic(err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets calm down here and panic 😆
Rendered
This is an initial draft for the Telemetry ADR focused on Tracing, as it solves most of our needs(reasoning is in the document). There are also a few other things that I am going to add there soon, but the document is already ready for review and ready to kick off discussions. The main goal, for now, was to unblock reconstruction work and partially incentivized testnet.
Main TODOs:
Closes #663