Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(identify): make timeout and concurrent streams configurable #5654

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

vbhattaccmu
Copy link

@vbhattaccmu vbhattaccmu commented Oct 29, 2024

Description

Reference Issue 5653

Notes & open questions

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@vbhattaccmu vbhattaccmu changed the title feat: make stream_timeout and max concurrent streams configurable on identify feat: make timeout and concurrent streams configurable on identify Oct 29, 2024
@vbhattaccmu vbhattaccmu changed the title feat: make timeout and concurrent streams configurable on identify feat(identify): make timeout and concurrent streams configurable Oct 29, 2024
Copy link
Member

@dariusc93 dariusc93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can you update the libp2p-identify version in the workspace Cargo.toml?

@vbhattaccmu
Copy link
Author

LGTM. Can you update the libp2p-identify version in the workspace Cargo.toml?

have updated identify version on workspace toml.

Comment on lines 124 to 125
stream_timeout: Duration,
max_concurrent_streams_per_connection: usize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is triggering a too many argument linting from clippy. Maybe we could have a struct that takes in the options and pass it to the handler. Thoughts?

CC @jxs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is triggering a too many argument linting from clippy. Maybe we could have a struct that takes in the options and pass it to the handler. Thoughts?

sure. will add it inside a struct.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dariusc93 rather than creating a new struct can I directly pass behaviour config into the handler as most of the data is being used from there?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dariusc93 have passed config into hander and checks have passed.

Copy link
Member

@jxs jxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Thanks for your interest! On a technical level this looks good to me, but reading your motivation I am curious:

Motivation
In a large peer-to-peer (p2p) network with a lot of peers attempting to a dial a peer via swarm, some subsequent IdentifyReceived events are never triggered due to the active streams timing out or not having enough concurrent streams configured per connection.

Currently there is no way to increase this timeout or allow more concurrent streams as they are hardcoded to 60s and 10 respectively on identify handlers.

I would love to have these two attributes be made configurable such that the user inheriting the library can choose to set a value from config and the default be set to the original 60s and 10 respectively.

Do you have any kind of data that proves that the defaults didn't suffice your case? Do you think the values you would like to provide for your case would be better for the default case? There was a rational for setting them, see here so if we are introducing more knobs we should have data that backs their need and properly document their trade offs.

Thanks!

@vbhattaccmu
Copy link
Author

vbhattaccmu commented Oct 30, 2024

Hi! Thanks for your interest! On a technical level this looks good to me, but reading your motivation I am curious:

Motivation
In a large peer-to-peer (p2p) network with a lot of peers attempting to a dial a peer via swarm, some subsequent IdentifyReceived events are never triggered due to the active streams timing out or not having enough concurrent streams configured per connection.
Currently there is no way to increase this timeout or allow more concurrent streams as they are hardcoded to 60s and 10 respectively on identify handlers.
I would love to have these two attributes be made configurable such that the user inheriting the library can choose to set a value from config and the default be set to the original 60s and 10 respectively.

Do you have any kind of data that proves that the defaults didn't suffice your case? Do you think the values you would like to provide for your case would be better for the default case? There was a rational for setting them, see here so if we are introducing more knobs we should have data that backs their need and properly document their trade offs.

Thanks!

Currently I don't have a reproducible example from a public testnet/mainnet. I can try to reproduce it from a public devnet /testnet where I can share the logs of IdentifyReceived never getting triggered after swarm dial step.

I am also trying to execute a parametric sweep to know the optimal bounds of these parameters in different conditions, which is where having these attributes configurable will allow us to deploy and test p2p clients faster. Hence raised a PR. Have kept the original values intact as default.

For a network having lesser number of peers (around 5k or so) the default set combination works without any issues.
For a large network (around 100k peers), I have seen if the timeouts are increased to 120s and buffer size is increased to 20, identify events are triggered more often without timeouts. But for a very large network (say over a million peers on your routing table), the scale of these attributes are unknown.

@jxs
Copy link
Member

jxs commented Nov 5, 2024

Hi! Thanks for your interest! On a technical level this looks good to me, but reading your motivation I am curious:

Motivation
In a large peer-to-peer (p2p) network with a lot of peers attempting to a dial a peer via swarm, some subsequent IdentifyReceived events are never triggered due to the active streams timing out or not having enough concurrent streams configured per connection.
Currently there is no way to increase this timeout or allow more concurrent streams as they are hardcoded to 60s and 10 respectively on identify handlers.
I would love to have these two attributes be made configurable such that the user inheriting the library can choose to set a value from config and the default be set to the original 60s and 10 respectively.

Do you have any kind of data that proves that the defaults didn't suffice your case? Do you think the values you would like to provide for your case would be better for the default case? There was a rational for setting them, see here so if we are introducing more knobs we should have data that backs their need and properly document their trade offs.
Thanks!

Currently I don't have a reproducible example from a public testnet/mainnet. I can try to reproduce it from a public devnet /testnet where I can share the logs of IdentifyReceived never getting triggered after swarm dial step.

I am also trying to execute a parametric sweep to know the optimal bounds of these parameters in different conditions, which is where having these attributes configurable will allow us to deploy and test p2p clients faster. Hence raised a PR. Have kept the original values intact as default.

For a network having lesser number of peers (around 5k or so) the default set combination works without any issues. For a large network (around 100k peers), I have seen if the timeouts are increased to 120s and buffer size is increased to 20, identify events are triggered more often without timeouts. But for a very large network (say over a million peers on your routing table), the scale of these attributes are unknown.

thanks for the response, If you agree then I'd suggest we wait for the data on the parameters

@vbhattaccmu
Copy link
Author

Hi! Thanks for your interest! On a technical level this looks good to me, but reading your motivation I am curious:

Motivation
In a large peer-to-peer (p2p) network with a lot of peers attempting to a dial a peer via swarm, some subsequent IdentifyReceived events are never triggered due to the active streams timing out or not having enough concurrent streams configured per connection.
Currently there is no way to increase this timeout or allow more concurrent streams as they are hardcoded to 60s and 10 respectively on identify handlers.
I would love to have these two attributes be made configurable such that the user inheriting the library can choose to set a value from config and the default be set to the original 60s and 10 respectively.

Do you have any kind of data that proves that the defaults didn't suffice your case? Do you think the values you would like to provide for your case would be better for the default case? There was a rational for setting them, see here so if we are introducing more knobs we should have data that backs their need and properly document their trade offs.
Thanks!

Currently I don't have a reproducible example from a public testnet/mainnet. I can try to reproduce it from a public devnet /testnet where I can share the logs of IdentifyReceived never getting triggered after swarm dial step.
I am also trying to execute a parametric sweep to know the optimal bounds of these parameters in different conditions, which is where having these attributes configurable will allow us to deploy and test p2p clients faster. Hence raised a PR. Have kept the original values intact as default.
For a network having lesser number of peers (around 5k or so) the default set combination works without any issues. For a large network (around 100k peers), I have seen if the timeouts are increased to 120s and buffer size is increased to 20, identify events are triggered more often without timeouts. But for a very large network (say over a million peers on your routing table), the scale of these attributes are unknown.

thanks for the response, If you agree then I'd suggest we wait for the data on the parameters

Yes have prepared a setup. Will reach back with some data.

Copy link
Contributor

mergify bot commented Nov 6, 2024

This pull request has merge conflicts. Could you please resolve them @vbhattaccmu? 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants