
feat: add max-radius param with 5% default #1647

Merged
merged 3 commits into from
Feb 4, 2025

Conversation

KolbyML
Member

@KolbyML KolbyML commented Jan 22, 2025

What was wrong?

A longer description of the problem can be found in #1574.

The TL;DR is that we currently start at a 100 percent radius, when the node will end up at 1-3%. This means we download and purge content our node won't end up storing, which wastes a lot of CPU cycles and also consumes our uTP transfer limit, which could instead be used for content within our final 1-3% radius.

This is a big issue on the state network: currently we are trying to bridge latest state, but our nodes' uTP limit is always full, so trin-execution's offers of state diffs are often declined because all State nodes are at transfer capacity.

[screenshot: offer report]
^ here is a picture I took a while ago; at the bottom you can see we successfully offered to 0 peers and the content was declined 8 times. This is due to the State nodes being at uTP transfer capacity (in other words, hitting our rate limit).

Recently this issue has gotten worse, but even if it hadn't, this PR is required if we want to bridge latest state. Bridging state content is very expensive because there is just so much state to gossip. If we want Portal and the State network to be successful, we need at least a 10x performance improvement in Trin: currently Trin can only handle 5 Mbps up and down, which is way too slow; we won't be able to take a load on the network with this.

So this PR is the first step in a road of changes which will hopefully greatly increase Trin's performance. I don't think we have a choice: we need to be performant if we want users.

How was it fixed?

Add a CLI flag called --max-radius; by default the max radius is 5 percent. Assuming we have a network of over 20 nodes, we should still have full coverage, since node radii should end up below 3%, especially as the network's storage requirements keep growing over time. So I don't think we should have a default bigger than 5%.

For testing on hive we will use --max-radius 100

I think this solution is the best one for the foreseeable future. We could do a more complex scheme, but that wouldn't be worth it right now; instead we should focus on finding other bottlenecks in Trin, as this PR is good enough and gets us most of what we want. A more complex scheme would give diminishing returns, so I wouldn't expect us to look into it until Trin is very optimized; further optimizing this with a complex solution is a very low priority.
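As a rough sketch of how a percentage flag could map to a raw distance value (u128 stands in here for the network's 256-bit U256 distance type, and the function name is illustrative, not Trin's actual API):

```rust
/// Convert a --max-radius percentage (0-100) into a raw distance value.
/// u128 stands in for the 256-bit U256 distance type used on the network.
fn max_radius_from_percent(percent: u128) -> Result<u128, String> {
    if percent > 100 {
        return Err(format!("max radius must be 0-100, got {percent}"));
    }
    // 100% must map to the full distance: `MAX / 100 * 100` loses the
    // remainder of the integer division, so handle it as a special case.
    if percent == 100 {
        return Ok(u128::MAX);
    }
    Ok(u128::MAX / 100 * percent)
}
```

With this shape, `--max-radius 5` covers roughly 5% of the distance space, and the 100% special case avoids the truncated integer division falling short of the true maximum.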

@KolbyML KolbyML self-assigned this Jan 22, 2025
@KolbyML KolbyML force-pushed the default-max-radius-5-percent branch from d3a94af to 83ceaed Compare January 22, 2025 11:42
@@ -146,7 +148,7 @@ pub async fn run_trin(
&discovery,
utp_socket.clone(),
portalnet_config.clone(),
storage_config_factory.create(&Subnetwork::History)?,
storage_config_factory.create(&Subnetwork::History, trin_config.max_radius)?,
Collaborator

Is there any reason to leave this as Distance::MAX? Since it seems like the bottlenecking volume is only on the state network iiuc.

Member Author

@KolbyML KolbyML Jan 22, 2025

The bottlenecking is just more extreme on State. This issue occurs on History too; it's just that the radii on our fleet have settled by now. If we are targeting low hardware requirements, I think this setting should be used on the history network as well. I would argue there isn't a good reason to leave it at Distance::MAX compared to 5%.

Member Author

Also, since we run History and State at the same time, they affect each other in terms of performance, so we should do both: bad performance on the History network would degrade performance of the State network.

Collaborator

@morph-dev morph-dev left a comment

In general, I'm in favor of this direction.

A few concerns:

  1. We currently have only ~25 state nodes. Setting their radius to 5% would leave us with gaps.
    • we can/should get more state nodes
    • once live, big storage nodes should make this less of an issue (it would still be nice to have the entire domain space covered without them)
  2. Once this is merged, with the code as is, nothing would change for already-running nodes, even though changing them is what we want to accomplish in the first place. If you look at IdIndexedV1Store::init, it will fall under this condition, which means it will initialize the radius to the distance of the farthest content in the db. See my comment as well.
  3. Even when 2 is fixed, we have a not-properly-tested scenario where we prune a lot of data at trin startup, which could slow startup. This might not be a problem at all (it's a one-time thing), but I wanted to highlight it.

@@ -432,7 +433,7 @@ impl<TContentKey: OverlayContentKey> IdIndexedV1Store<TContentKey> {
self.radius = Distance::ZERO;
} else {
error!(Db = %self.config.content_type, "Farthest not found!");
self.radius = Distance::MAX;
self.radius = self.config.max_radius;
}
}
Some(farthest) => {
Collaborator

Should we cap self.radius to not be higher than max_radius in this case as well?

Member Author

Sure I will do min()

Member Author

OK, I added the min() so it will not return higher than the max_radius now.
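The agreed change can be sketched as a simple cap (u64 stands in for the Distance type; the function name is illustrative, not Trin's actual API):

```rust
/// Cap the radius derived from the farthest stored item at the configured
/// maximum, so init never returns a radius above max_radius.
fn capped_radius(farthest_distance: u64, max_radius: u64) -> u64 {
    farthest_distance.min(max_radius)
}
```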

@KolbyML
Member Author

KolbyML commented Jan 25, 2025

In general, I'm in favor of this direction.

Few concerns:

  1. We currently have only ~25 state nodes. Setting their radius to 5% would leave us with gaps.

    • we can/should get more state nodes
    • once live, big storage nodes should make this less of an issue (it would still be nice to have entire domain space covered without them)

All 25 state nodes are expected to settle well below a 1% radius. We can add more state nodes; there wouldn't be any pushback against this.

Collaborator

@morph-dev morph-dev left a comment

All 25 state nodes are expected to settle well below a 1% radius. We can add more state nodes; there wouldn't be any pushback against this.

When do you think we should add more nodes?


In general, I'm OK with this being merged, but I would like to see 1-2 more approvals from others.

I'm still concerned that we might have holes in the domain space. But luckily we have big_storage nodes and there are other clients as well, so none of the content should be lost.

Once this is merged, and before it's deployed, we have to make sure that ansible configuration for big_storage nodes is updated with this flag.

@KolbyML
Member Author

KolbyML commented Jan 28, 2025

All 25 state nodes are expected to settle well below a 1% radius. We can add more state nodes; there wouldn't be any pushback against this.

When do you think we should add more nodes?

We can double our fleet now if you want. I am not necessarily in a rush, but it couldn't hurt.

Collaborator

@njgheorghita njgheorghita left a comment

Looks good! Just a few nitpicks. And I'll echo what was already said, since it feels pretty important:

Once this is merged, and before it's deployed, we have to make sure that ansible configuration for big_storage nodes is updated with this flag.

));
}

// If the max radius is 100%, return the maximum distance as without it we will be 35 less than
Collaborator

Not sure I understand this comment?

Member Author

I tried to clarify it a bit

@@ -72,3 +73,30 @@ pub fn subnetwork_parser(subnetwork_string: &str) -> Result<Arc<Vec<Subnetwork>>

Ok(Arc::new(subnetworks))
}

pub fn max_radius_parser(max_radius_str: &str) -> Result<Distance, String> {
let max_radius_percentage = match max_radius_str.parse::<U256>() {
Collaborator

let Ok(max_radius_percentage) = max_radius_str.parse::<U256>() else {
    return Err(...)
};

Member Author

I want to print the error from parse
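A sketch of a parser along the lines discussed, keeping the underlying parse error in the message (u128 stands in for U256, and the exact error strings are illustrative):

```rust
/// Parse a max-radius percentage string, surfacing the parse error rather
/// than discarding it, then convert to a raw distance (u128 for U256).
fn max_radius_parser(max_radius_str: &str) -> Result<u128, String> {
    let max_radius_percentage = match max_radius_str.parse::<u128>() {
        Ok(percentage) => percentage,
        // A match (rather than let-else) lets us include `err` in the message.
        Err(err) => return Err(format!("Invalid max radius percentage: {err}")),
    };
    if max_radius_percentage > 100 {
        return Err(format!(
            "Max radius percentage must be between 0 and 100, got {max_radius_percentage}"
        ));
    }
    // 100% must return the full distance, since integer division truncates.
    if max_radius_percentage == 100 {
        return Ok(u128::MAX);
    }
    Ok(u128::MAX / 100 * max_radius_percentage)
}
```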

@KolbyML KolbyML force-pushed the default-max-radius-5-percent branch from cbc2cf2 to d64dda0 Compare February 2, 2025 16:44
@KolbyML
Member Author

KolbyML commented Feb 2, 2025

https://github.com/ethereum/cluster/pull/1144 — the ansible script is updated (or will be on Monday when this PR is merged)

@KolbyML
Member Author

KolbyML commented Feb 2, 2025

I am going to merge this PR on Monday after the cluster PR is merged.

@KolbyML KolbyML merged commit 0043f6b into ethereum:master Feb 4, 2025
11 checks passed
3 participants