kad: get_providers() sometimes resolves with FinishedWithNoAdditionalRecord and no closest peers #5418

nazar-pc · 2024-05-25T08:30:21Z

I have an app that does a lot of get_providers() calls via Kademlia.

Most of the time they finish successfully, but occasionally the first and only query result is GetProvidersOk::FinishedWithNoAdditionalRecord with empty list of closest peers, which seems wrong both because the key is actually present on the network and due to closest peers list being empty.

It would be reasonable if the key was not found and I got a list of closest peers, but the fact that list of closest peers is empty while Kademlia k-buckets are not empty tells me something is off here, but I'm having a hard time tracing what, where and why.

Sometimes I first get GetProvidersOk::FoundProviders with empty list followed by GetProvidersOk::FinishedWithNoAdditionalRecord with a single peer ID, again for a key that definitely exists on the network.

Any suggestions for debugging steps here?

libp2p-kad 0.45.3 (latest at the moment of writing), disjoint_query_paths: true

The text was updated successfully, but these errors were encountered:

guillaumemichel · 2024-05-28T11:53:40Z

I am not very familiar with the disjoint paths query. It seems to be because the ClosestPeersIters are empty, but I am not sure why?

Does the same issue occur if you use another query routing technique (closest or fixed)?

Is your node connected with the rest of the network when the query fails?
Is it possible that the querying node is storing the provider record itself, and that all entries are expired?
Do you enable Caching? If yes, was the key requested already by the node?

nazar-pc · 2024-05-28T13:51:22Z

Does the same issue occur if you use another query routing technique (closest or fixed)?

I'm just calling https://docs.rs/libp2p-kad/0.45.3/libp2p_kad/struct.Behaviour.html#method.get_providers in this case. I guess I can try to modify the code to try it out, but it is slower to test that way.

Is your node connected with the rest of the network when the query fails?

I'm quite sure that I am. There are many concurrent queries happening and only a few end up like this and not even every time, sometimes I have 1000 successful in a row.

Is it possible that the querying node is storing the provider record itself, and that all entries are expired?

Not possible, the node making requests in thsi particular situation is such that physically and by design has no records.

Do you enable Caching? If yes, was the key requested already by the node?

I did not and wasn't even aware of such cache 😅

guillaumemichel · 2024-05-29T07:41:07Z

Do you enable Caching? If yes, was the key requested already by the node?

I did not and wasn't even aware of such cache 😅

By default you should get 1 peer in the FinishedWithNoAdditionalRecord.

Does the same issue occur if you use another query routing technique (closest or fixed)?

I'm just calling https://docs.rs/libp2p-kad/0.45.3/libp2p_kad/struct.Behaviour.html#method.get_providers in this case. I guess I can try to modify the code to try it out, but it is slower to test that way.

You can simply try to set disjoint_query_paths: false, and see if the problem still occurs.

Is the code you are working on public?

nazar-pc · 2024-05-29T07:44:24Z

You can simply try to set disjoint_query_paths: false, and see if the problem still occurs.

I will.

Is the code you are working on public?

Yes, all of it, but it is a big project and might be a bit tricky as a demo of this issue. Let me do some experiments and post results here.

nazar-pc · 2024-05-30T14:44:07Z

It does happen with and without disjoint query paths.

A smaller reproduction using https://github.com/subspace/subspace/tree/debug-piece-not-found-libp2p-wip is this:

cargo run --example benchmark -- --bootstrap-nodes=/dns/bootstrap-0.gemini-3h.subspace.network/tcp/30533/p2p/12D3KooWK7NuL4S6aEdy5gELnvhCGo6EyrWVARnBy7W4AJTVkaF1 --bootstrap-nodes=/dns/bootstrap-1.gemini-3h.subspace.network/tcp/30533/p2p/12D3KooWQK33n2raSXzjH8JyqbFtczBmbwZiK9Tpicdw3rveJesj --protocol-version=0c121c75f4ef450f40619e1fca9d1e8e7fbabc42c895bc4790801e85d5a91c34 --out-peers=100 --pending-out-peers=100 parallel --parallelism-level=100 --max-pieces=256 --start-with=256 --retries=0

You'll see some Piece not found and for some of them (not all, keys are printed in logs) you'll see the only records matched by key are these:

2024-05-30T14:37:33.057585Z ERROR subspace_networking::node_runner: Get providers query yielded no results key="90b2ce0508c101000000000000" closest_peers=0 closest_peers=[]
2024-05-30T14:37:33.057618Z  WARN benchmark: Piece not found. piece_index=449 key="90b2ce0508c101000000000000" pure_duration=9.319213208s full_duration=180.36635061s

While expectation is to see something like this (another one that was not found):

2024-05-30T14:37:27.970135Z ERROR subspace_networking::node_runner: Get providers query yielded 0 results key="90b2ce05089201000000000000"
2024-05-30T14:37:30.060845Z ERROR subspace_networking::node_runner: Get providers query yielded no results key="90b2ce05089201000000000000" closest_peers=1 closest_peers=[PeerId("12D3KooWLjVBE9kzygAVTL1qHfEF8WpupjkUqgqnpUYZTRPvjL3y")]
2024-05-30T14:37:50.274517Z  WARN benchmark: Piece not found. piece_index=402 key="90b2ce05089201000000000000" pure_duration=30.749968564s full_duration=197.583262458s

Error logs are not errors, I just made them errors so I have different colors 🙃

subspace-networking is the crate that encapsulates our libp2p-based networking stack used for these examples.

guillaumemichel · 2024-05-31T15:09:34Z

What is the size of the network?

Given the high duration of the lookups (I am not sure which duration should be considered), it seems like the node was able to establish some connections and perform the lookup even though it didn't find the record. So it should have some peers to return in the GetProvidersOk::FinishedWithNoAdditionalRecord.

nazar-pc · 2024-05-31T16:05:10Z

What is the size of the network?

We have over 8000 consensus nodes (clients) and at least the same number (likely higher) farmers that are serving data to clients.

So it should have some peers to return in the GetProvidersOk::FinishedWithNoAdditionalRecord.

That is my understanding as well.

nazar-pc mentioned this issue May 30, 2024

Kademlia bootstrap gets stuck forever in some cases #5432

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kad: get_providers() sometimes resolves with FinishedWithNoAdditionalRecord and no closest peers #5418

kad: get_providers() sometimes resolves with FinishedWithNoAdditionalRecord and no closest peers #5418

nazar-pc commented May 25, 2024 •

edited

Loading

guillaumemichel commented May 28, 2024

nazar-pc commented May 28, 2024

guillaumemichel commented May 29, 2024

nazar-pc commented May 29, 2024

nazar-pc commented May 30, 2024

guillaumemichel commented May 31, 2024

nazar-pc commented May 31, 2024

kad: get_providers() sometimes resolves with FinishedWithNoAdditionalRecord and no closest peers #5418

kad: get_providers() sometimes resolves with FinishedWithNoAdditionalRecord and no closest peers #5418

Comments

nazar-pc commented May 25, 2024 • edited Loading

guillaumemichel commented May 28, 2024

nazar-pc commented May 28, 2024

guillaumemichel commented May 29, 2024

nazar-pc commented May 29, 2024

nazar-pc commented May 30, 2024

guillaumemichel commented May 31, 2024

nazar-pc commented May 31, 2024

nazar-pc commented May 25, 2024 •

edited

Loading