Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(adr): ADR #009: Telemetry #901

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

docs(adr): ADR #009: Telemetry #901

wants to merge 17 commits into from

Conversation

Wondertan
Copy link
Member

@Wondertan Wondertan commented Jul 11, 2022

Rendered

This is an initial draft for the Telemetry ADR focused on Tracing, as it solves most of our needs(reasoning is in the document). There are also a few other things that I am going to add there soon, but the document is already ready for review and ready to kick off discussions. The main goal, for now, was to unblock reconstruction work and partially incentivized testnet.

Main TODOs:

  • Tracing integration points in the code(almost there)
  • Design section for metrics(WIP)
    • List of measurables for celestia-node

Closes #663

@Wondertan Wondertan added docs:adr ADR kind:misc Attached to miscellaneous PRs labels Jul 11, 2022
@Wondertan Wondertan self-assigned this Jul 11, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jul 11, 2022

Codecov Report

Merging #901 (84b1b99) into main (3f2d743) will decrease coverage by 0.42%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #901      +/-   ##
==========================================
- Coverage   58.53%   58.11%   -0.43%     
==========================================
  Files         125      125              
  Lines        7409     7415       +6     
==========================================
- Hits         4337     4309      -28     
- Misses       2617     2647      +30     
- Partials      455      459       +4     
Impacted Files Coverage Δ
ipld/nmt_adder.go 68.00% <0.00%> (-24.00%) ⬇️
logs/logs.go 85.71% <0.00%> (-14.29%) ⬇️
node/p2p/routing.go 60.00% <0.00%> (-7.75%) ⬇️
fraud/pb/proof.pb.go 29.22% <0.00%> (-4.29%) ⬇️
fraud/bad_encoding.go 60.76% <0.00%> (-2.31%) ⬇️
ipld/retriever.go 90.38% <0.00%> (-1.93%) ⬇️
node/p2p/p2p.go 93.02% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f2d743...84b1b99. Read the comment docs.

@Wondertan
Copy link
Member Author

Wondertan commented Jul 11, 2022

Also, asked for a review from @sysrex as it touches on infrastructure/tools and @YazzyYaz as the document touches the incentivised testnet

Copy link
Member

@renaynay renaynay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏🏻

### Plan

The first "ShrEx" stack analysis priority is critical for Celestia project. The analysis results will tell us whether
our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting
our current Full Node reconstruction qualities conforms to mainnet requirements, subsequently affecting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I like verbosity in this case


The first "ShrEx" stack analysis priority is critical for Celestia project. The analysis results will tell us whether
our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting
the development roadmap of the celestia-node before the main network launch, therefore is a potential blocker to the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the development roadmap of the celestia-node before the main network launch, therefore is a potential blocker to the
the development roadmap of the celestia-node before the main network launch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? It is a potential blocker.

## Considerations

* Tracing performance
* _Every_ method calling two more functions making network request can affect overall performance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meaning this could degrade overall node performance?

Could we resolve this with having light || comprehensive tracing modes eventually?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. IO is expensive. Doing it everywhere degrades overall performance.

Copy link
Member Author

@Wondertan Wondertan Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we resolve this with having light || comprehensive tracing modes eventually?

Maybe. It is not yet clear what the difference would be. For now, if you wanna have tracing, performance might be sacrificed. If you turn off it, then things should be almost the same.

Copy link
Member

@liamsi liamsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good. But I will still insist that this ADR should not exclusively focus on tracing as metrics can be very insightful and are very necessary. I feel if I just approve it as is, it could be misunderstood as a go for only priotizing tracing in the foreseeable future. But we need both.

* So we can evaluate real world Full Node reconstruction qualities, along with data availability sampling
* And adjust our roadmap accordingly.
* Incentivized Testnet
* Tracking participants and validation that do task correctly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Tracking participants and validation that do task correctly
* Tracking participation and progress on incentivized testnet tasks.

validators are tracked differently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liamsi, This is not about validators, but about validation of participants

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed with @YazzyYaz, there is going to be a different set of tasks for which participants will get additional points. They may and may not take this. We need to verify what they claimed to do.

our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting
the development roadmap of the celestia-node before the main network launch, therefore is a potential blocker to the
launch, which needs to be resolved ASAP. Basing on the former, the plan is focused on unblocking the reconstruction
story first and then proceed with steady covering of our codebase with traces for the complex codepaths as well as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you could also just link to the blcok reconstruction issue s.t. new hires can look into what is meant by "the reconstruction story" independently of tracing.

@Wondertan
Copy link
Member Author

this looks good. But I will still insist that this ADR should not exclusively focus on tracing as metrics can be very insightful and are very necessary. I feel if I just approve it as is, it could be misunderstood as a go for only priotizing tracing in the foreseeable future. But we need both.

Thanks for the review, @liamsi. I wasn't planning to merge it before the Metrics sections are filled up. The goal was to unblock the reconstruction analysis ASAP and to leave metrics integration for later in the background

Copy link
Contributor

@rootulp rootulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing this ADR! Makes me want to learn more about https://opentelemetry.io/

Copy link
Member

@Bidon15 Bidon15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome kick-start! So stoked to see this alive! 🌟
Left some suggestions for readability improvements as I needed to go up and down several times to get the idea of the ADR.

rootulp added a commit to rootulp/celestia-node that referenced this pull request Jul 20, 2022
Based on celestiaorg#901 and
likely shouldn't be merged before it
@tzdybal tzdybal self-requested a review July 21, 2022 11:42
@Wondertan Wondertan mentioned this pull request Jul 22, 2022
14 tasks
@Wondertan Wondertan force-pushed the hlib/adr-metrics branch 2 times, most recently from 9d87da3 to 9581d16 Compare July 29, 2022 16:47
@Wondertan
Copy link
Member Author

Ok, so the metrics are now fully covered in this ADR and this part is ready for review. The next part is to finish the list of places where metrics and traces should be integrated.

rootulp
rootulp previously approved these changes Aug 10, 2022
@@ -0,0 +1,469 @@
# ADR #009: Telemetry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion:

Suggested change
# ADR #009: Telemetry
# ADR #009: Telemetry and General Observability via Tracing and Metrics

renaynay pushed a commit to renaynay/celestia-node that referenced this pull request Aug 15, 2022
Based on celestiaorg#901 and
likely shouldn't be merged before it
distractedm1nd pushed a commit to renaynay/celestia-node that referenced this pull request Sep 19, 2022
Based on celestiaorg#901 and
likely shouldn't be merged before it
distractedm1nd pushed a commit to distractedm1nd/celestia-node that referenced this pull request Sep 21, 2022
Based on celestiaorg#901 and
likely shouldn't be merged before it
@Frierened Frierened mentioned this pull request Feb 14, 2024
@Wondertan Wondertan requested a review from adlerjohn as a code owner February 26, 2024 10:55
@ramin ramin requested a review from walldiss February 26, 2024 12:06
Comment on lines +232 to +235
span.SetAttributes(
attribute.Int("size", len(dah.RowsRoots)),
attribute.String("data_hash", hex.EncodeToString(dah.Hash())),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cardinality issue right here using datahash as attribute. The topic of high cardinality worth its own paragraph as it seems it happens to be encountered by devs from time to time.

Comment on lines +323 to +326
Next, we should understand what instrument to use. On the first glance, for chain height a ___Counter___ instrument should
fit, as it is a non-decreasing value, and then we should think whether we need a sync or async version of it. For our case,
both would work and its more the question of precision we want. Sync metering would report every height change, while
the async would poke `header` pkg API periodically to get the metered data. For our example, we will go with the latter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is missing the point that Counter will be iterated from multiple sources, that might include same application being restarted. For specified case only async Gauge can be used to reflect current network height.

Comment on lines +330 to +335
func MonitorHead(store Store) {
headC, _ := meter.AsyncInt64().Counter(
"head",
instrument.WithUnit(unit.Dimensionless),
instrument.WithDescription("Subjective head of the node"),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics API for Counter initialisation and async callbacks has changed, so examples are outdated

func(ctx context.Context) {
head, err := store.Head(ctx)
if err != nil {
headC.Observe(ctx, 0, attribute.String("err", err.Error()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should never use errors as attributes. It causes cardinality issues and is not useful, as it is impossible to query such errors from metrics

Comment on lines +378 to +381
#### Backends Connection

Example for OTLP extracted from our code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example needs to be updated. Primaraly it misses global attributes we set:

  • network name
  • node type
  • peer ID
    Those are crucial for metrics query and filtering and must be a part of adr

Comment on lines +355 to +357
if err != nil {
panic(err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets calm down here and panic 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs:adr ADR kind:misc Attached to miscellaneous PRs
Projects
No open projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

Decide on a metrics solution (e.g. Prometheus) and create lightweight spec
9 participants