[FeatureRequest/Idea] Make polkadot-introspector the ultimate debugging tool #764

Open
1 of 9 tasks
alexggh opened this issue Aug 12, 2024 · 5 comments

@alexggh
Contributor

alexggh commented Aug 12, 2024

Polkadot introspector should be extended with logic that helps us understand the state of the Polkadot network at a given time, so that developers can quickly diagnose which component or entity is not functioning properly. To that end, I think we should extend it to include some predefined logic that can answer specific protocol questions using configurable data sources.

Data sources:

  1. On chain data.
  2. DHT data; this should include/use the data obtained with https://github.com/lexnv/subp2p-explorer.
  3. Snapshot of some validators' logs, either from Loki or a file.
  4. Telemetry data.
  5. Grafana metrics.

Examples of queries/operations

This is a sample list of basic operations that I find useful during debugging. We should not limit ourselves to them: as a rule of thumb, with this tool we should be able to check any protocol invariant that we have data for.

  • Convert between peer-id, authority-id, account_id and any other per-validator key. #771
  • List backing validators for a parachain at relay chain x.
  • Check whether a validator can be reached, what version it is running, and stats about its performance.
  • Which validator/collator should have produced a missing block.
  • Validate any invariants on the path Collator -> Backing validators -> Author
  • PVF execution/compile time for a given parachain block.
  • Availability bits statistics.
  • Find relay/parachain blocks that weren't produced in time.
  • Which core was the parachain scheduled on.
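
As a rough illustration of the first bullet (converting between per-validator keys), a lookup index could look like the sketch below. The class and field names are hypothetical, the key values in the usage note are placeholders, and how the records get populated (on-chain data, DHT records) is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ValidatorIdentity:
    """Hypothetical record tying together the keys of one validator."""
    account_id: str
    authority_id: str
    peer_id: str


class IdentityIndex:
    """Resolve a validator from any one of its known keys."""

    def __init__(self) -> None:
        self._by_key: Dict[str, ValidatorIdentity] = {}

    def add(self, identity: ValidatorIdentity) -> None:
        # Register the same record under every key so lookups work both ways.
        for key in (identity.account_id, identity.authority_id, identity.peer_id):
            self._by_key[key] = identity

    def lookup(self, any_key: str) -> Optional[ValidatorIdentity]:
        # Returns None when the key is unknown.
        return self._by_key.get(any_key)
```

Usage would be along the lines of `index.lookup("<peer-id>").account_id`, with the index filled from whichever data sources are configured.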
@sandreim
Collaborator

Data sources:

  1. Snapshot of some validators' logs, either from Loki or a file.
  2. Grafana metrics.

I think some automation here would be very helpful; however, it seemed to be a PITA to integrate with Loki last time I tried.

Examples of queries/operations

This is a sample list of basic operations that I find useful during debugging. We should not limit ourselves to them: as a rule of thumb, with this tool we should be able to check any protocol invariant that we have data for.

  • Convert between peer-id, authority-id, account_id and any other per-validator key.
  • Check whether a validator can be reached, what version it is running, and stats about its performance.

These should be part of the whois tool.

  • Which validator/collator should have produced a missing block.

I don't think this is feasible for collators; for validators it should be easy to check using on-chain data.

  • Validate any invariants on the path Collator -> Backing validators -> Author

Do you have any idea how to approach this? Long ago I was experimenting with exposing NodeEvents from validators to debug the collator protocol on test nets.

  • PVF execution/compile time for a given parachain block.

https://github.com/ordian/kuddelmuddel already executes the block. Do we want to expand on that?

  • Availability bits statistics.

The data is already available in parachain tracer.

  • Find relay/parachain blocks that weren't produced in time.

The parachain tracer already shows the relay chain blocks where para slots were missed.

  • Which core was the parachain scheduled on.

This should be supported with the historical mode of parachain tracer.

@AndreiEres please share some thoughts

@alexggh
Contributor Author

alexggh commented Aug 12, 2024

I think some automation here would be very helpful; however, it seemed to be a PITA to integrate with Loki last time I tried.

I think @lexnv has something that he is using for this.

Which validator/collator should have produced a missing block.

I don't think this is feasible for collators; for validators it should be easy to check using on-chain data.

Most of the parachains use Aura, so we should be able to query the Aura authorities and reverse engineer from that whose turn it was to generate a block.
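
A minimal sketch of that reverse engineering, assuming the plain Aura round-robin rule (the author of a slot is the authority at index `slot % len(authorities)`) and an authority set that does not change over the inspected range. The function names are hypothetical, and parachains with customised block production would need different logic.

```python
from typing import Dict, List, Tuple


def aura_slot(timestamp_ms: int, slot_duration_ms: int) -> int:
    """Slot number that a given timestamp falls into."""
    return timestamp_ms // slot_duration_ms


def expected_author(authorities: List[str], slot: int) -> str:
    """Authority whose turn it is in `slot` under round-robin Aura."""
    if not authorities:
        raise ValueError("authority set is empty")
    return authorities[slot % len(authorities)]


def missed_slots(
    authorities: List[str],
    produced: Dict[int, str],
    first_slot: int,
    last_slot: int,
) -> List[Tuple[int, str]]:
    """List (slot, expected author) for every slot in the range without a block.

    `produced` maps slot number -> author of the block seen on chain.
    """
    return [
        (slot, expected_author(authorities, slot))
        for slot in range(first_slot, last_slot + 1)
        if slot not in produced
    ]
```

The authority list would come from the parachain's Aura authorities storage, and the set of produced slots from the block digests; both are inputs here rather than something the sketch fetches.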

Validate any invariants on the path Collator -> Backing validators -> Author
Do you have any idea how to approach this? Long ago I was experimenting with exposing NodeEvents from validators to debug the collator protocol on test nets.

For now I was thinking of using only the data that is already there and deducing from that. I was thinking more about things like: did the collator produce other blocks that ended up on chain? Did the backing validators back other parachains' blocks, or blocks from other collators? Did the author back things? Was the core ready for backing?

https://github.com/ordian/kuddelmuddel already executes the block. Do we want to expand on that?

I was thinking of just using it here, so that we have everything in one place; same thing with https://github.com/lexnv/subp2p-explorer. The goal is to make polkadot-introspector the ultimate debugging tool :D.

@lexnv
Contributor

lexnv commented Aug 12, 2024

It should be relatively straightforward to integrate with Grafana. I did a similar thing to automatically triage all warnings / errors from substrate: https://github.com/lexnv/sub-triage-logs/.
The goal of this tool is to ease the triaging process. Going manually through 4k warnings is not sustainable.

This tool fetches the polkadot-sdk repo and converts any log::warn!("...") and log::error!("...") into a regular expression. For example, log::warn!("{peer_id} banned, disconnecting, reason: {reason}") turns into Regex(".* banned, disconnecting, reason: .*").
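
The conversion step can be sketched roughly as follows. This is not the code from sub-triage-logs, only an illustration of the idea in Python; `format_to_regex` is a hypothetical name, and escaped braces (`{{`, `}}`) in format strings are not handled.

```python
import re

# Matches a Rust-style format placeholder such as {}, {peer_id} or {reason:?}.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def format_to_regex(fmt: str) -> str:
    """Turn a log format string into the source of a regular expression.

    Literal text is escaped so that characters like '.' or '(' match
    themselves, and every placeholder becomes the wildcard '.*'.
    """
    literal_parts = PLACEHOLDER.split(fmt)
    return ".*".join(re.escape(part) for part in literal_parts)


pattern = re.compile(
    format_to_regex("{peer_id} banned, disconnecting, reason: {reason}")
)
```

Matching a whole log message against `pattern` with `fullmatch` then tells us whether the message was produced by that log statement.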

Then it groups warnings / errors from Grafana or a local file. On top of that, you can add a closure to deduplicate further (as we do for the peer-banned reason).

A triage output looks like:

Count  Level  Triage report
3006   warn   Notification block pinning limit reached. Unpinning block with hash = .*
516    warn   💔 Error importing block .*: .* ( block has an unknown parent )
42     warn   🥩 ran out of peers to request justif #.* from
28     warn   .* banned, disconnecting, reason: .* ( Same block request multiple times )
13     warn   .* banned, disconnecting, reason: .* ( Peer disconnected )
4      warn   .* banned, disconnecting, reason: .* ( Open failure )
2      warn   ❌ Error while dialing .*: .*
1      error  🥩 Error: .*. Restarting voter.
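
The grouping and the optional deduplication closure could look roughly like this. It is a sketch in Python rather than the tool's actual code: the pattern list is hard-coded for illustration (the real tool derives it from the polkadot-sdk sources), and `triage` plus the `detail` capture group are hypothetical names.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional

# (label shown in the report, compiled pattern); illustrative entries only.
PATTERNS = [
    (
        ".* banned, disconnecting, reason: .*",
        re.compile(r".* banned, disconnecting, reason: (?P<detail>.*)"),
    ),
    (
        "Notification block pinning limit reached. Unpinning block with hash = .*",
        re.compile(
            r"Notification block pinning limit reached\. "
            r"Unpinning block with hash = .*"
        ),
    ),
]


def triage(
    lines: Iterable[str],
    dedup: Optional[Callable[[re.Match], Optional[str]]] = None,
) -> Counter:
    """Count log lines per pattern, optionally split further by `dedup`."""
    counts: Counter = Counter()
    for line in lines:
        for label, pattern in PATTERNS:
            match = pattern.fullmatch(line)
            if match is None:
                continue
            key = label
            # The closure may return extra detail to split one pattern
            # into several report rows (e.g. the ban reason).
            extra = dedup(match) if dedup is not None else None
            if extra:
                key = f"{label} ( {extra} )"
            counts[key] += 1
            break
        else:
            counts["<unmatched>"] += 1
    return counts
```

Calling `triage(lines, dedup=lambda m: m.groupdict().get("detail"))` yields one row per ban reason, similar to the report above.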

Check a validator can be reached

This can also be done with https://github.com/lexnv/subp2p-explorer after fetching a DHT record.

Telemetry data

We have a tracking issue paritytech/substrate-telemetry#588 to expose a friendly API to extract a bit more information from substrate-telemetry.

It would be beneficial to have this API, although it will not be sufficient for debugging. Most of the time we try to debug an issue that happened in the past, and by that time the information exposed by telemetry might be lost. For example, who is the peer that was banned 3 days ago? We can either introduce a new service to keep a history record of N days, or we can extend substrate-telemetry to keep the data around for us.
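
To make the "history record of N days" idea concrete, here is a minimal in-memory sketch. Everything in it is hypothetical (the event shape, the class name, the query method), and it assumes events are recorded in chronological order; a real service would need persistent storage.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Deque, List


@dataclass
class TelemetryEvent:
    """Hypothetical event shape; real telemetry messages carry more fields."""
    at: datetime
    kind: str
    peer_id: str


class HistoryStore:
    """Keep events for `retention_days` and answer questions about the past."""

    def __init__(self, retention_days: int) -> None:
        self._retention = timedelta(days=retention_days)
        self._events: Deque[TelemetryEvent] = deque()

    def record(self, event: TelemetryEvent) -> None:
        self._events.append(event)
        # Drop everything older than the retention window, measured from
        # the newest event (events are assumed to arrive in order).
        cutoff = event.at - self._retention
        while self._events and self._events[0].at < cutoff:
            self._events.popleft()

    def query(self, kind: str, since: datetime) -> List[TelemetryEvent]:
        return [e for e in self._events if e.kind == kind and e.at >= since]
```

With this, "who was banned 3 days ago" becomes a `query("peer_banned", since=...)` call, as long as the retention window covers that period.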

https://github.com/ordian/kuddelmuddel already executes the block. Do we want to expand on that?

IIUC, we'll have something similar to:

  • Have a separate repo/workspace that exposes a nice CLI interface (i.e. introspect-cli)
  • Each subcommand forwards its arguments to the appropriate tool
introspect-cli chain-data  ... (calls into polkadot-introspector)

introspect-cli p2p ... (calls into subp2p explorer)

introspect-cli telemetry ... (calls into substrate-telemetry CLI *to be added in the future*)

I would opt for a separate repo entirely to keep things simpler; it shouldn't matter that much here since it would only be a forwarding CLI.
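
A forwarding CLI of that shape can be very small. The sketch below is in Python for brevity and assumes the underlying tools are installed as executables on PATH; the binary names in `TOOLS` are assumptions, and the telemetry subcommand is left out because that CLI does not exist yet.

```python
import subprocess
from typing import List

# Subcommand -> executable to forward to (executable names are assumptions).
TOOLS = {
    "chain-data": "polkadot-introspector",
    "p2p": "subp2p-explorer",
}


def build_command(argv: List[str]) -> List[str]:
    """Translate `introspect-cli <subcommand> ...` into the real command line."""
    if not argv or argv[0] not in TOOLS:
        choices = "|".join(TOOLS)
        raise SystemExit(f"usage: introspect-cli <{choices}> [args...]")
    # Everything after the subcommand is passed through untouched.
    return [TOOLS[argv[0]], *argv[1:]]


def main(argv: List[str]) -> int:
    """Run the selected tool and return its exit code."""
    return subprocess.call(build_command(argv))
```

An entry point would call `main(sys.argv[1:])` and exit with the returned code, so the wrapper stays a thin dispatcher with no logic of its own.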

@AndreiEres
Collaborator

So, a small part of these queries already exist or are easy to implement.
@alexggh, could you prioritize the list by annoyance: what do you always use that requires a lot of manual work?

introspect-cli chain-data  ... (calls into polkadot-introspector)
introspect-cli p2p ... (calls into subp2p explorer)

Actually, the introspector is already not a single tool but a toolset. I don't like the idea of making a meta toolset of a toolset :-) But I need to mull over the requests.

@alexggh
Contributor Author

alexggh commented Aug 21, 2024

could you prioritize the list by annoyance: what do you always use that requires a lot of manual work?

The list at the top is already more or less prioritised.
