docs(adr): ADR #009: Telemetry #901

Wondertan · 2022-07-11T08:38:52Z

This is an initial draft of the Telemetry ADR, focused on tracing, as it solves most of our needs (the reasoning is in the document). There are a few other things I am going to add soon, but the document is already ready for review and for kicking off discussions. The main goal, for now, is to unblock the reconstruction work and, partially, the incentivized testnet.

Main TODOs:

- Tracing integration points in the code (almost there)
- Design section for metrics (WIP)
- List of measurables for celestia-node

Closes #663

codecov-commenter · 2022-07-11T08:44:05Z

Codecov Report

Merging #901 (84b1b99) into main (3f2d743) will decrease coverage by 0.42%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #901      +/-   ##
==========================================
- Coverage   58.53%   58.11%   -0.43%     
==========================================
  Files         125      125              
  Lines        7409     7415       +6     
==========================================
- Hits         4337     4309      -28     
- Misses       2617     2647      +30     
- Partials      455      459       +4

| Impacted Files | Coverage Δ | |
|---|---|---|
| ipld/nmt_adder.go | `68.00% <0.00%> (-24.00%)` | ⬇️ |
| logs/logs.go | `85.71% <0.00%> (-14.29%)` | ⬇️ |
| node/p2p/routing.go | `60.00% <0.00%> (-7.75%)` | ⬇️ |
| fraud/pb/proof.pb.go | `29.22% <0.00%> (-4.29%)` | ⬇️ |
| fraud/bad_encoding.go | `60.76% <0.00%> (-2.31%)` | ⬇️ |
| ipld/retriever.go | `90.38% <0.00%> (-1.93%)` | ⬇️ |
| node/p2p/p2p.go | `93.02% <0.00%> (ø)` | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f2d743...84b1b99. Read the comment docs.

Wondertan · 2022-07-11T08:48:28Z

Also, I asked for a review from @sysrex, as the document touches on infrastructure/tools, and from @YazzyYaz, as it touches on the incentivized testnet.

docs/adr/adr-009-telemetry.md

renaynay

👏🏻

docs/adr/adr-009-telemetry.md

renaynay · 2022-07-11T11:02:35Z

docs/adr/adr-009-telemetry.md

+### Plan
+
+The first "ShrEx" stack analysis priority is critical for Celestia project. The analysis results will tell us whether
+our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting 


Suggested change

- our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting
+ our current Full Node reconstruction qualities conforms to mainnet requirements, subsequently affecting

Hmm, I like verbosity in this case

renaynay · 2022-07-11T11:03:26Z

docs/adr/adr-009-telemetry.md

+
+The first "ShrEx" stack analysis priority is critical for Celestia project. The analysis results will tell us whether
+our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting 
+the development roadmap of the celestia-node before the main network launch, therefore is a potential blocker to the 


Suggested change

- the development roadmap of the celestia-node before the main network launch, therefore is a potential blocker to the
+ the development roadmap of the celestia-node before the main network launch.

Why? It is a potential blocker.

docs/adr/adr-009-telemetry.md

renaynay · 2022-07-11T11:16:10Z

docs/adr/adr-009-telemetry.md

+## Considerations
+
+* Tracing performance
+  * _Every_ method calling two more functions making network request can affect overall performance


meaning this could degrade overall node performance?

Could we resolve this with having light || comprehensive tracing modes eventually?

Yes. IO is expensive. Doing it everywhere degrades overall performance.

> Could we resolve this with having light || comprehensive tracing modes eventually?

Maybe. It is not yet clear what the difference would be. For now, if you want tracing, performance might be sacrificed. If you turn it off, things should be almost the same.

liamsi

This looks good. But I will still insist that this ADR should not focus exclusively on tracing, as metrics can be very insightful and are very necessary. I feel that if I just approve it as is, it could be misunderstood as a go-ahead for prioritizing only tracing in the foreseeable future. But we need both.

docs/adr/adr-009-telemetry.md

liamsi · 2022-07-13T09:25:23Z

docs/adr/adr-009-telemetry.md

+  * So we can evaluate real world Full Node reconstruction qualities, along with data availability sampling
+  * And adjust our roadmap accordingly.
+* Incentivized Testnet
+  * Tracking participants and validation that do task correctly


Suggested change

- * Tracking participants and validation that do task correctly
+ * Tracking participation and progress on incentivized testnet tasks.

validators are tracked differently.

@liamsi, this is not about validators, but about validating participants.

As we discussed with @YazzyYaz, there is going to be a separate set of tasks for which participants will get additional points. They may or may not take these on. We need to verify what they claim to have done.

docs/adr/adr-009-telemetry.md

liamsi · 2022-07-13T09:28:36Z

docs/adr/adr-009-telemetry.md

+our current Full Node reconstruction qualities conforms to the main network requirements, subsequently affecting 
+the development roadmap of the celestia-node before the main network launch, therefore is a potential blocker to the 
+launch, which needs to be resolved ASAP. Basing on the former, the plan is focused on unblocking the reconstruction 
+story first and then proceed with steady covering of our codebase with traces for the complex codepaths as well as 


But you could also just link to the block reconstruction issue so that new hires can look into what is meant by "the reconstruction story" independently of tracing.

docs/adr/adr-009-telemetry.md

Wondertan · 2022-07-13T11:16:17Z

> this looks good. But I will still insist that this ADR should not exclusively focus on tracing as metrics can be very insightful and are very necessary. I feel if I just approve it as is, it could be misunderstood as a go for only prioritizing tracing in the foreseeable future. But we need both.

Thanks for the review, @liamsi. I wasn't planning to merge it before the Metrics sections are filled up. The goal was to unblock the reconstruction analysis ASAP and to leave metrics integration for later in the background

rootulp

Thanks for sharing this ADR! Makes me want to learn more about https://opentelemetry.io/

docs/adr/adr-009-telemetry.md

Bidon15

Awesome kick-start! So stoked to see this alive! 🌟
Left some suggestions for readability improvements as I needed to go up and down several times to get the idea of the ADR.

docs/adr/adr-009-telemetry.md

Wondertan · 2022-08-10T17:07:13Z

Ok, so the metrics are now fully covered in this ADR and this part is ready for review. The next part is to finish the list of places where metrics and traces should be integrated.

docs/adr/adr-009-telemetry.md

liamsi · 2022-08-12T07:38:48Z

docs/adr/adr-009-telemetry.md

@@ -0,0 +1,469 @@
+# ADR #009: Telemetry


Suggestion:

Suggested change

- # ADR #009: Telemetry
+ # ADR #009: Telemetry and General Observability via Tracing and Metrics

docs/adr/adr-009-telemetry.md

walldiss · 2024-03-14T12:39:40Z

docs/adr/adr-009-telemetry.md

+  span.SetAttributes(
+    attribute.Int("size", len(dah.RowsRoots)),
+    attribute.String("data_hash", hex.EncodeToString(dah.Hash())),
+  )


There is a cardinality issue right here: the data hash is used as an attribute. The topic of high cardinality is worth its own paragraph in the ADR, as devs seem to run into it from time to time.
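
To illustrate the cardinality concern, here is a minimal standard-library sketch, not taken from the ADR. It assumes a backend that keeps one entry per distinct attribute value; `seriesKey`, `countSeries`, `sizeBucket`, and the bucket boundaries are hypothetical names and numbers chosen for illustration. A per-block hash produces one entry per block, while a bucketed size stays bounded.

```go
package main

import "fmt"

// seriesKey is a hypothetical stand-in for the attribute name/value pair
// that a telemetry backend has to index separately.
type seriesKey struct {
	name  string
	value string
}

// countSeries returns how many distinct entries the given attribute values
// would create in such a backend.
func countSeries(name string, values []string) int {
	seen := make(map[seriesKey]struct{})
	for _, v := range values {
		seen[seriesKey{name: name, value: v}] = struct{}{}
	}
	return len(seen)
}

// sizeBucket maps a row count to one of three fixed labels
// (the boundaries are illustrative assumptions).
func sizeBucket(rows int) string {
	switch {
	case rows <= 32:
		return "small"
	case rows <= 128:
		return "medium"
	default:
		return "large"
	}
}

func main() {
	sizes := []int{16, 32, 64, 128}
	var hashes, buckets []string
	for height := 0; height < 1000; height++ {
		// A data hash is unique per block, so every block adds a new value.
		hashes = append(hashes, fmt.Sprintf("hash-%d", height))
		// A bucketed size can only take a handful of values.
		buckets = append(buckets, sizeBucket(sizes[height%len(sizes)]))
	}
	fmt.Println("data_hash series:", countSeries("data_hash", hashes))     // 1000
	fmt.Println("size_bucket series:", countSeries("size_bucket", buckets)) // 2
}
```

The design point is that the number of distinct values should not grow with chain height or traffic.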

walldiss · 2024-03-14T12:44:11Z

docs/adr/adr-009-telemetry.md

+Next, we should understand what instrument to use. On the first glance, for chain height a ___Counter___ instrument should
+fit, as it is a non-decreasing value, and then we should think whether we need a sync or async version of it. For our case,
+both would work and its more the question of precision we want. Sync metering would report every height change, while
+the async would poke `header` pkg API periodically to get the metered data. For our example, we will go with the latter.


This section misses the point that a Counter will be incremented from multiple sources, which may include the same application being restarted. For the specified case, only an async Gauge can be used to reflect the current network height.
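
The difference can be sketched with two hypothetical instruments written against the standard library only (`counter`, `gauge`, and `run` are illustrative names, not OpenTelemetry types). The sketch models the backend-side view under the assumption that a cumulative counter keeps accumulating across process restarts, while a gauge keeps only the last observation.

```go
package main

import "fmt"

// counter is a hypothetical monotonic instrument: it only accumulates deltas.
type counter struct{ total int64 }

// Add accumulates delta on top of everything reported before.
func (c *counter) Add(delta int64) { c.total += delta }

// gauge is a hypothetical instrument that keeps only the last observed value.
type gauge struct{ value int64 }

// Observe overwrites the previous observation.
func (g *gauge) Observe(v int64) { g.value = v }

// run simulates one process lifetime: the node starts, reads its stored head,
// and reports it once through both instruments.
func run(head int64, c *counter, g *gauge) {
	c.Add(head)
	g.Observe(head)
}

func main() {
	c, g := &counter{}, &gauge{}
	run(100, c, g) // first start, head is at height 100
	run(105, c, g) // restart, head is now at height 105

	fmt.Println("counter:", c.total) // 205: the two runs are summed
	fmt.Println("gauge:", g.value)   // 105: the current height
}
```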

walldiss · 2024-03-14T12:46:14Z

docs/adr/adr-009-telemetry.md

+func MonitorHead(store Store) {
+ headC, _ := meter.AsyncInt64().Counter(
+  "head",
+  instrument.WithUnit(unit.Dimensionless),
+  instrument.WithDescription("Subjective head of the node"),
+ )


Metrics API for Counter initialisation and async callbacks has changed, so examples are outdated

walldiss · 2024-03-14T12:47:01Z

docs/adr/adr-009-telemetry.md

+  func(ctx context.Context) {
+   head, err := store.Head(ctx)
+   if err != nil {
+    headC.Observe(ctx, 0, attribute.String("err", err.Error()))


We should never use errors as attributes. It causes cardinality issues and is not useful, as it is impossible to query such errors from metrics

walldiss · 2024-03-14T12:48:26Z

docs/adr/adr-009-telemetry.md

+#### Backends Connection
+
+Example for OTLP extracted from our code
+


This example needs to be updated. Primarily, it is missing the global attributes we set:

- network name
- node type
- peer ID

Those are crucial for querying and filtering metrics and must be part of the ADR.
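
A minimal sketch of what attaching these global attributes amounts to, using only the standard library. The struct, the `withGlobals` helper, the attribute keys (`network`, `node_type`, `peer_id`), and the sample values are all assumptions for illustration, not the names used in the celestia-node code.

```go
package main

import "fmt"

// globalAttrs holds the three identifying attributes named in the review.
type globalAttrs struct {
	Network  string
	NodeType string
	PeerID   string
}

// withGlobals returns a copy of attrs with the global attributes added, so
// every reported data point can be filtered by network, node type, and peer.
func withGlobals(g globalAttrs, attrs map[string]string) map[string]string {
	out := make(map[string]string, len(attrs)+3)
	for k, v := range attrs {
		out[k] = v
	}
	out["network"] = g.Network
	out["node_type"] = g.NodeType
	out["peer_id"] = g.PeerID
	return out
}

func main() {
	// All values below are made up for the example.
	g := globalAttrs{Network: "example-net", NodeType: "light", PeerID: "example-peer"}
	point := withGlobals(g, map[string]string{"instrument": "head"})
	fmt.Println(point["network"], point["node_type"], point["peer_id"], point["instrument"])
}
```

Setting these once at setup time, rather than at every call site, keeps them consistent across all metrics and traces.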

walldiss · 2024-03-14T12:52:22Z

docs/adr/adr-009-telemetry.md

+ if err != nil {
+  panic(err)
+ }


Lets calm down here and panic 😆
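
Reading the comment as a request not to panic in the setup code, one way to handle it is to wrap and return the error so the caller decides what to do. This is a sketch under that assumption; `newExporter` and `setupTelemetry` are hypothetical stand-ins for whatever call returned `err` in the ADR snippet.

```go
package main

import (
	"errors"
	"fmt"
)

// newExporter is a hypothetical constructor that fails on an empty endpoint.
func newExporter(endpoint string) (string, error) {
	if endpoint == "" {
		return "", errors.New("empty endpoint")
	}
	return "exporter@" + endpoint, nil
}

// setupTelemetry wraps and returns the error instead of panicking, leaving
// the decision (abort, retry, run without telemetry) to the caller.
func setupTelemetry(endpoint string) (string, error) {
	exp, err := newExporter(endpoint)
	if err != nil {
		return "", fmt.Errorf("telemetry: creating exporter: %w", err)
	}
	return exp, nil
}

func main() {
	if _, err := setupTelemetry(""); err != nil {
		// The node keeps running; only telemetry is affected.
		fmt.Println("telemetry disabled:", err)
	}
}
```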

Wondertan added docs:adr ADR kind:misc Attached to miscellaneous PRs labels Jul 11, 2022

Wondertan requested review from sysrex, YazzyYaz and vgonkivs July 11, 2022 08:38

Wondertan requested review from liamsi and renaynay as code owners July 11, 2022 08:38

Wondertan self-assigned this Jul 11, 2022

liamsi reviewed Jul 11, 2022

renaynay reviewed Jul 11, 2022

Wondertan force-pushed the hlib/adr-metrics branch from dd36676 to 1e2e2d5 Compare July 11, 2022 16:49

liamsi reviewed Jul 13, 2022

Wondertan mentioned this pull request Jul 14, 2022

[EPIC] E2E large block reconstruction test tracking issue #602

Closed

3 tasks

rootulp reviewed Jul 14, 2022

Bidon15 reviewed Jul 15, 2022

Wondertan force-pushed the hlib/adr-metrics branch from df7e853 to 62027a7 Compare July 15, 2022 12:03

rootulp added a commit to rootulp/celestia-node that referenced this pull request Jul 20, 2022

docs(adr): ADR celestiaorg#10: Monitoring

c419917

Based on celestiaorg#901 and likely shouldn't be merged before it

rootulp mentioned this pull request Jul 20, 2022

docs(adr): ADR #010: Incentivized Testnet Monitoring #922

Merged

tzdybal self-requested a review July 21, 2022 11:42

tzdybal reviewed Jul 21, 2022

Wondertan mentioned this pull request Jul 22, 2022

[EPIC] Telemetry tracking issue #260

Closed

14 tasks

Wondertan force-pushed the hlib/adr-metrics branch 2 times, most recently from 9d87da3 to 9581d16 Compare July 29, 2022 16:47

Wondertan added 4 commits August 10, 2022 19:00

docs(adr): initial draft fraft for telemtry adr

67a4743

docs(adr): stylistic improvements and comments from @renaynay

c7918ec

docs(adr): apply suggestions from @liamsi and add minor improvements

fbb379e

docs(adr): improve integration example to cover question raised by @r…

7ee4170

…ootulp + add integrated trace visualization with Jaeger UI

Wondertan force-pushed the hlib/adr-metrics branch from 9581d16 to 260f7be Compare August 10, 2022 17:01

Wondertan requested a review from distractedm1nd as a code owner August 10, 2022 17:01

Wondertan requested review from rootulp, Bidon15, liamsi, renaynay and tzdybal August 10, 2022 17:01

Wondertan force-pushed the hlib/adr-metrics branch from 260f7be to de5f85d Compare August 10, 2022 17:02

docs(adr-009): cover metrics and more info on tracing

4163cb6

Wondertan force-pushed the hlib/adr-metrics branch from de5f85d to 4163cb6 Compare August 10, 2022 17:05

rootulp previously approved these changes Aug 10, 2022

liamsi reviewed Aug 12, 2022

renaynay pushed a commit to renaynay/celestia-node that referenced this pull request Aug 15, 2022

docs(adr): ADR #10: Monitoring

42cd171

Based on celestiaorg#901 and likely shouldn't be merged before it

distractedm1nd pushed a commit to renaynay/celestia-node that referenced this pull request Sep 19, 2022

docs(adr): ADR #10: Monitoring

efb97c5

Based on celestiaorg#901 and likely shouldn't be merged before it

distractedm1nd pushed a commit to distractedm1nd/celestia-node that referenced this pull request Sep 21, 2022

docs(adr): ADR celestiaorg#10: Monitoring

2fb5669

Based on celestiaorg#901 and likely shouldn't be merged before it

Frierened mentioned this pull request Feb 14, 2024

chore: updated path #3187

Closed

ramin dismissed rootulp’s stale review via a140a98 February 26, 2024 10:52

ramin and others added 8 commits February 26, 2024 10:52

Update docs/adr/adr-009-telemetry.md

a140a98

Co-authored-by: Rootul P <[email protected]>

Update docs/adr/adr-009-telemetry.md

7ecab43

Co-authored-by: Ismail Khoffi <[email protected]>

Update docs/adr/adr-009-telemetry.md

8a4ae84

Co-authored-by: Rootul P <[email protected]>

Update docs/adr/adr-009-telemetry.md

5ea8373

Co-authored-by: Rootul P <[email protected]>

Update docs/adr/adr-009-telemetry.md

36fa82a

Co-authored-by: Rootul P <[email protected]>

Update docs/adr/adr-009-telemetry.md

05d51de

Co-authored-by: Rootul P <[email protected]>

Update docs/adr/adr-009-telemetry.md

9afb2eb

Co-authored-by: Rootul P <[email protected]>

Merge branch 'main' into hlib/adr-metrics

3bde8cc

Wondertan requested a review from adlerjohn as a code owner February 26, 2024 10:55

ramin requested a review from walldiss February 26, 2024 12:06

walldiss reviewed Mar 14, 2024
