[FeatureRequest/Idea] Make polkadot-introspector the ultimate debugging tool #764

alexggh · 2024-08-12T08:28:33Z

Polkadot introspector should be extended with logic that helps us understand the state of the Polkadot network at a given time, so that developers can quickly diagnose which component or entity is not functioning properly. For that, I think we should extend it to include some predefined logic that can answer specific protocol questions using configurable data sources.

Data sources:

On chain data.
DHT data; it should include/use the data obtained with https://github.com/lexnv/subp2p-explorer.
Snapshot of some validators logs, either from Loki or a file.
Telemetry data.
Grafana metrics.

Examples of queries/operations

This is a sample list of basic operations that I find useful during debugging. We should not limit ourselves to them; as a rule of thumb, with this tool we should be able to check any protocol invariant that we have data for.

Convert between peer-id, authority-id, account_id and any other per-validator key. #771
List backing validators for a parachain at relay chain block x.
Check whether a validator can be reached, what version it is running, and stats about its performance.
Which validator/collator should have produced a missing block.
Validate any invariants on the path Collator -> Backing validators -> Author
PVF execution/compile time for a given parachain block.
Availability bits statistics.
Find relay/parachain blocks that weren't produced in time.
Which core was the parachain scheduled on.

sandreim · 2024-08-12T08:46:49Z

Data sources:

Snapshot of some validators logs, either from Loki or a file.

Grafana metrics.

I think some automation here would be very helpful; however, it seemed to be a PITA to integrate with Loki last time I tried.

Examples of queries/operations

This is a sample list of basic operations that I find useful during debugging. We should not limit ourselves to them; as a rule of thumb, with this tool we should be able to check any protocol invariant that we have data for.

Convert between peer-id, authority-id, account_id and any other per-validator key.

Check whether a validator can be reached, what version it is running, and stats about its performance.

These should be part of the whois tool.

Which validator/collator should have produced a missing block.

I don't think this is feasible for collators; for validators it should be easy to check with on-chain data.

Validate any invariants on the path Collator -> Backing validators -> Author

Do you have any idea how to approach this? Long ago I was experimenting with exposing NodeEvents from validators to debug the collator protocol on testnets.

PVF execution/compile time for a given parachain block.

https://github.com/ordian/kuddelmuddel already executes the block. Do we want to expand on that?

Availability bits statistics.

The data is already available in the parachain tracer.

Find relay/parachain blocks that weren't produced in time.

The parachain tracer already shows the relay chain blocks where para slots were missed.

Which core was the parachain scheduled on.

This should be supported by the historical mode of the parachain tracer.

@AndreiEres please share some thoughts

alexggh · 2024-08-12T09:03:37Z

I think some automation here would be very helpful; however, it seemed to be a PITA to integrate with Loki last time I tried.

I think @lexnv has something that he is using for this.

Which validator/collator should have produced a missing block.

I don't think this is feasible for collators; for validators it should be easy to check with on-chain data.

Most parachains use Aura, so we should be able to query the Aura authorities and work out from that whose turn it was to produce a block.
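
A rough sketch of that deduction: Aura hands out slots round-robin, so the expected author of a slot is the authority at index `slot mod number_of_authorities`. The function names, the authority labels and the 6-second slot duration below are illustrative assumptions only; in practice the authority list would have to be read from the parachain's Aura storage at the relevant block, and it can change over time.

```python
from typing import Dict, Iterable, Sequence


def slot_for_timestamp(timestamp_ms: int, slot_duration_ms: int) -> int:
    """Aura slot number that contains the given Unix timestamp (milliseconds)."""
    return timestamp_ms // slot_duration_ms


def expected_aura_author(authorities: Sequence[str], slot: int) -> str:
    """Round-robin rule: author index = slot mod number of authorities."""
    if not authorities:
        raise ValueError("authority set is empty")
    return authorities[slot % len(authorities)]


def missed_slots(
    authorities: Sequence[str],
    produced_slots: Iterable[int],
    first_slot: int,
    last_slot: int,
) -> Dict[int, str]:
    """Map every slot in [first_slot, last_slot] without a block to its expected author."""
    produced = set(produced_slots)
    return {
        slot: expected_aura_author(authorities, slot)
        for slot in range(first_slot, last_slot + 1)
        if slot not in produced
    }


# Hypothetical authority set; real values would be queried from the chain.
AUTHORITIES = ["alice", "bob", "charlie"]
# Blocks were seen for slots 3 and 5 only, so slots 4 and 6 were missed.
print(missed_slots(AUTHORITIES, [3, 5], 3, 6))
```

The sketch assumes a fixed authority set across the inspected range; a real check would need to re-read the set whenever it changes.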

Validate any invariants on the path Collator -> Backing validators -> Author
Do you have any idea how to approach this? Long ago I was experimenting with exposing NodeEvents from validators to debug the collator protocol on testnets.

For now I was thinking of using only the data that is already there and deducing from that. I was thinking more about things like: did the collator produce other blocks that ended up on chain? Did the backing validators back blocks of other parachains or from other collators? Did the author back things? Was the core ready for backing?

https://github.com/ordian/kuddelmuddel already executes the block. Do we want to expand on that?

I was thinking of just using it in here, so that we have everything in one place. Same thing with https://github.com/lexnv/subp2p-explorer; the goal is to make polkadot-introspector the ultimate debugging tool :D.

lexnv · 2024-08-12T09:41:33Z

It should be relatively straightforward to integrate with Grafana. I did a similar thing to automatically triage all warnings / errors from Substrate: https://github.com/lexnv/sub-triage-logs/.
The goal of this tool is to ease the triaging process. Going manually through 4k warnings is not sustainable.

This tool fetches the polkadot-sdk repo and converts any log::warn!("..") and log::error!("...") into a regular expression. For example, log::warn!("{peer_id} banned, disconnecting, reason: {reason}") turns into Regex(".* banned, disconnecting, reason: .*").
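
To illustrate the conversion idea (this is not the actual sub-triage-logs code, just a minimal sketch written in Python), every `{...}` placeholder in the format string can be replaced with `.*` while the literal text in between is escaped. The function name is made up, and escaped braces (`{{`, `}}`) are deliberately not handled here.

```python
import re

# Matches Rust-style format placeholders such as {}, {peer_id}, {:?} or {reason:?}.
# Assumption: the format string contains no escaped braces ("{{" or "}}").
_PLACEHOLDER = re.compile(r"\{[^{}]*\}")


def format_string_to_regex(fmt: str) -> "re.Pattern[str]":
    """Build a regex that matches log lines rendered from the given format string."""
    literal_parts = _PLACEHOLDER.split(fmt)
    # Escape the literal text so characters like '.' or '(' match themselves,
    # then glue the parts together with a wildcard for each placeholder.
    return re.compile(".*".join(re.escape(part) for part in literal_parts))


pattern = format_string_to_regex("{peer_id} banned, disconnecting, reason: {reason}")
line = "2024-08-12 09:00:00 12D3KooW banned, disconnecting, reason: Peer disconnected"
print(bool(pattern.search(line)))
```

Using `search` rather than `fullmatch` lets the pattern ignore whatever timestamp or target prefix the logger adds in front of the message.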

Then it groups warnings / errors from Grafana or a local file. You can also add a closure to deduplicate further (like we do for the peer-banned reason).

A triage output looks like:

Count	Level	Triage report
3006	warn	Notification block pinning limit reached. Unpinning block with hash = .*
516	warn	💔 Error importing block .*: .* ( block has an unknown parent )
42	warn	🥩 ran out of peers to request justif #.* from
28	warn	.* banned, disconnecting, reason: .* ( Same block request multiple times )
13	warn	.* banned, disconnecting, reason: .* ( Peer disconnected )
4	warn	.* banned, disconnecting, reason: .* ( Open failure )
2	warn	❌ Error while dialing .*: .*
1	error	🥩 Error: .*. Restarting voter.
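
For illustration, the grouping step behind such a report could look roughly like the sketch below. The patterns, the sample log lines and the `ban_reason` helper are made-up examples, and the real tool's matching and deduplication may well differ.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple

Pattern = Tuple[str, "re.Pattern[str]"]  # (level, compiled regex)


def triage(
    lines: Iterable[str],
    patterns: List[Pattern],
    dedup: Optional[Callable[[str], Optional[str]]] = None,
) -> Counter:
    """Count log lines per (level, pattern), optionally refined by a dedup closure."""
    counts: Counter = Counter()
    for line in lines:
        for level, pattern in patterns:
            if pattern.search(line) is None:
                continue
            key = pattern.pattern
            extra = dedup(line) if dedup is not None else None
            if extra:
                # Sub-group, e.g. by the ban reason, as in the report above.
                key = f"{key} ( {extra} )"
            counts[(level, key)] += 1
            break  # first matching pattern wins
    return counts


def ban_reason(line: str) -> Optional[str]:
    """Example dedup closure: extract the text after 'reason: ', if present."""
    match = re.search(r"reason: (.*)$", line)
    return match.group(1).strip() if match else None


PATTERNS: List[Pattern] = [
    ("warn", re.compile(r".* banned, disconnecting, reason: .*")),
]
LOG_LINES = [
    "peerA banned, disconnecting, reason: Peer disconnected",
    "peerB banned, disconnecting, reason: Peer disconnected",
    "peerC banned, disconnecting, reason: Open failure",
    "line that matches no known pattern",
]

for (level, report), count in triage(LOG_LINES, PATTERNS, ban_reason).most_common():
    print(f"{count}\t{level}\t{report}")
```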

Check a validator can be reached

This can also be done with https://github.com/lexnv/subp2p-explorer after fetching a DHT record.

Telemetry data

We have a tracking issue paritytech/substrate-telemetry#588 to expose a friendly API to extract a bit more information from substrate-telemetry.

It would be beneficial to have this API, although it will not be sufficient for debugging. Most of the time we try to debug an issue that happened in the past, and by that time the information exposed by telemetry might be lost. For example, who is the peer that was banned 3 days ago? We can either introduce a new service to keep a history record of N days, or we can extend substrate-telemetry to keep the data around for us.

https://github.com/ordian/kuddelmuddel already executes the block. Do we want to expand on that?

IIUC, we'll have something similar to:

Have a separate repo/workspace that exposes a nice CLI interface (ie introspect-cli)
Each subcommand forwards argument to the appropriate tool

introspect-cli chain-data  ... (calls into polkadot-introspector)

introspect-cli p2p ... (calls into subp2p explorer)

introspect-cli telemetry ... (calls into substrate-telemetry CLI *to be added in the future*)

I would opt for a separate repo entirely to keep things simpler. It shouldn't matter that much here since it would just be a forwarding CLI.
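
A forwarding CLI of this kind can be very small. The sketch below only shows the dispatch idea: the executable names in `TOOLS` are assumptions (in particular, no telemetry CLI exists yet), and a real implementation would more likely be written in Rust alongside the other tools.

```python
import subprocess
import sys
from typing import List, Optional, Sequence

# Hypothetical mapping from subcommand to the executable it forwards to.
# These binary names are placeholders, not confirmed names of released tools.
TOOLS = {
    "chain-data": "polkadot-introspector",
    "p2p": "subp2p-explorer",
    "telemetry": "substrate-telemetry-cli",
}


def build_command(argv: Sequence[str]) -> List[str]:
    """Translate `introspect-cli <subcommand> [args...]` into the forwarded argv."""
    if not argv:
        raise SystemExit("usage: introspect-cli <" + "|".join(TOOLS) + "> [args...]")
    subcommand, rest = argv[0], list(argv[1:])
    if subcommand not in TOOLS:
        raise SystemExit(f"unknown subcommand: {subcommand}")
    # Everything after the subcommand is passed through untouched.
    return [TOOLS[subcommand], *rest]


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Run the underlying tool and return its exit code."""
    command = build_command(sys.argv[1:] if argv is None else argv)
    return subprocess.call(command)


print(build_command(["p2p", "--some-flag", "value"]))
```

Since the wrapper never interprets the forwarded arguments, each tool keeps full ownership of its own flags and help output.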

AndreiEres · 2024-08-21T12:18:08Z

So, a small part of these queries already exist or are easy to implement.
@alexggh, could you prioritize the list by annoyance: what do you always use, and what requires a lot of manual work?

introspect-cli chain-data  ... (calls into polkadot-introspector)
introspect-cli p2p ... (calls into subp2p explorer)

Actually, the introspector is already not a single tool but a toolset. I don't like the idea of making a meta toolset of a toolset :-) But I need to mull over the requests.

alexggh · 2024-08-21T13:14:53Z

could you prioritize the list by annoyance: what do you always use, and what requires a lot of manual work?

The list at the top is more or less already prioritised.
