feat: add max-radius param with 5% default #1647
Conversation
Force-pushed from d3a94af to 83ceaed.
```diff
@@ -146,7 +148,7 @@ pub async fn run_trin(
         &discovery,
         utp_socket.clone(),
         portalnet_config.clone(),
-        storage_config_factory.create(&Subnetwork::History)?,
+        storage_config_factory.create(&Subnetwork::History, trin_config.max_radius)?,
```
Is there any reason to leave this as `Distance::MAX`? Since it seems like the bottlenecking volume is only on the state network, iiuc.
The bottlenecking is just more extreme on State. This issue occurs on History too; it's just that the radii on our fleet are settled now. If we are targeting low hardware requirements, I think this setting should be used on the history network as well. I would argue there isn't a good reason to leave it at `Distance::MAX` compared to 5%.
Also, since we run History and State at the same time, they affect each other in terms of performance, so we should do both: bad performance on the History network would degrade performance of the State network.
In general, I'm in favor of this direction.
A few concerns:
1. We currently have only ~25 state nodes. Setting their radius to 5% would leave us with gaps.
   - We can/should get more state nodes.
   - Once live, big storage nodes should make this less of an issue (though it would still be nice to have the entire domain space covered without them).
2. Once this is merged, with the code as is, nothing would change for already running nodes, which is what we want to accomplish in the first place. If you look at `IdIndexedV1Store::init`, it will fall under this condition, which means that it will initialize the radius to the distance of the farthest content in the db. See my comment as well.
3. Even when 2 is fixed, we have a not-properly-tested scenario where we prune a lot of data, and how that affects trin startup time. This might not be a problem at all (it's a one-time thing), but I wanted to highlight it.
```diff
@@ -432,7 +433,7 @@ impl<TContentKey: OverlayContentKey> IdIndexedV1Store<TContentKey> {
                 self.radius = Distance::ZERO;
             } else {
                 error!(Db = %self.config.content_type, "Farthest not found!");
-                self.radius = Distance::MAX;
+                self.radius = self.config.max_radius;
             }
         }
         Some(farthest) => {
```
Should we cap `self.radius` so it's not higher than `max_radius` in this case as well?
Sure, I will do `min()`.
Ok, I added the min so it will not return higher than the `max_radius` now.
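For readers following along, here is a minimal, self-contained sketch of the clamping being discussed. It is not the real `IdIndexedV1Store` code: `Distance` below is a hypothetical stand-in type, and the error/zero branches are simplified away. The point is only that the radius derived from the farthest stored content is capped with `min()` so it never exceeds the configured `max_radius`.

```rust
use std::cmp::min;

// Hypothetical stand-in for trin's Distance type, just to make the sketch runnable.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Distance(u128);

// Simplified radius initialization: fall back to max_radius when no content is
// stored, and clamp the farthest-content distance to max_radius otherwise.
fn init_radius(farthest_content: Option<Distance>, max_radius: Distance) -> Distance {
    match farthest_content {
        None => max_radius,
        Some(farthest) => min(farthest, max_radius),
    }
}

fn main() {
    let max_radius = Distance(100);
    // Empty store: start at the configured cap rather than Distance::MAX.
    assert_eq!(init_radius(None, max_radius), max_radius);
    // Existing store with content farther out than the cap: clamp down.
    assert_eq!(init_radius(Some(Distance(250)), max_radius), max_radius);
    // Existing store already inside the cap: keep the smaller radius.
    assert_eq!(init_radius(Some(Distance(40)), max_radius), Distance(40));
}
```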
All 25 state nodes are expected to settle well below 1% radius. We can add more state nodes; there wouldn't be any pushback against this.
> All 25 state nodes are expected to settle well below 1% radius. We can add more state nodes; there wouldn't be any pushback against this.
When do you think we should add more nodes?
In general, I'm ok with this being merged, but I would like to see 1-2 more approvals from others.
I'm still concerned that we might have holes in the domain space. But luckily we have big_storage nodes and there are other clients as well, so none of the content should be lost.
Once this is merged, and before it's deployed, we have to make sure that ansible configuration for big_storage nodes is updated with this flag.
We can double our fleet now if you want. I am not necessarily in a rush, but it couldn't hurt.
Looks good! Just a few nitpicks. And I'll echo what was already said, but it feels pretty important:
> Once this is merged, and before it's deployed, we have to make sure that ansible configuration for big_storage nodes is updated with this flag.
crates/utils/src/cli.rs (Outdated)
```rust
        ));
    }

    // If the max radius is 100%, return the maximum distance as without it we will be 35 less than
```
Not sure I understand this comment?
I tried to clarify it a bit
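To spell out what that code comment is getting at: converting a percentage into a distance with integer division leaves a small gap at the top of the domain space, because `U256::MAX / 100 * 100` comes out 35 less than `U256::MAX` (the 35 in the comment). Here is a hedged sketch of that arithmetic, using `u128` as a stand-in for `U256` so it runs without external crates; for `u128` the gap happens to be 55 instead of 35.

```rust
// Sketch of the percentage -> max distance conversion discussed above,
// using u128 instead of U256.
fn percentage_to_max_distance(percentage: u128) -> u128 {
    if percentage >= 100 {
        // Special-case 100%: integer division would otherwise round down and
        // leave the top of the domain space uncovered.
        return u128::MAX;
    }
    u128::MAX / 100 * percentage
}

fn main() {
    // Without the special case, "100%" would fall short of the true maximum.
    let rounded = u128::MAX / 100 * 100;
    assert_eq!(u128::MAX - rounded, 55); // for U256 the gap is 35
    assert_eq!(percentage_to_max_distance(100), u128::MAX);

    // The 5% default covers roughly one twentieth of the domain space.
    assert!(percentage_to_max_distance(5) < u128::MAX / 10);
}
```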
```diff
@@ -72,3 +73,30 @@ pub fn subnetwork_parser(subnetwork_string: &str) -> Result<Arc<Vec<Subnetwork>>
     Ok(Arc::new(subnetworks))
 }
+
+pub fn max_radius_parser(max_radius_str: &str) -> Result<Distance, String> {
+    let max_radius_percentage = match max_radius_str.parse::<U256>() {
```
```rust
let Ok(max_radius_percentage) = max_radius_str.parse::<U256>() else {
    return Err(...);
};
```
I want to print the error from parse
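For context, a minimal sketch of the trade-off being discussed, using `u64` as a stand-in for `U256` and placeholder error strings (not trin's actual wording): `let Ok(..) = .. else { .. }` discards the `Err` value, while a `match` (or `map_err`) keeps it, so the underlying parse error can be included in the message returned to the user.

```rust
// Stand-in parser: the real function parses a U256 percentage and converts it
// into a Distance; u64 keeps this sketch dependency-free.
fn max_radius_parser(max_radius_str: &str) -> Result<u64, String> {
    // A match keeps the parse error around so it can be surfaced to the user;
    // a let-else would drop it.
    let max_radius_percentage = match max_radius_str.parse::<u64>() {
        Ok(value) => value,
        Err(err) => return Err(format!("Invalid max radius percentage: {err}")),
    };
    if max_radius_percentage > 100 {
        return Err(format!(
            "Max radius percentage must be between 0 and 100, got {max_radius_percentage}"
        ));
    }
    Ok(max_radius_percentage)
}

fn main() {
    assert_eq!(max_radius_parser("5"), Ok(5));
    // The returned error includes the underlying parse failure.
    assert!(max_radius_parser("five").unwrap_err().contains("invalid digit"));
    assert!(max_radius_parser("101").is_err());
}
```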
Force-pushed from cbc2cf2 to d64dda0.
The ansible script is updated in https://github.com/ethereum/cluster/pull/1144 (or will be on Monday when this PR is merged).
I am going to merge this PR on Monday, after the cluster PR is merged.
What was wrong?
A larger description of the problem can be viewed in #1574.
The TL;DR is that we currently start at a 100 percent radius when the node will end up at 1-3%. This means we download and purge content our node won't end up storing, which wastes a lot of CPU cycles and also takes up our uTP transfer limit, which could instead be used for content within our final 1-3% radius.
This is a big issue on the state network: currently we are trying to bridge latest state, but our nodes' uTP limit is always full, so offers of state diffs from `trin-execution` are often declined because all State nodes are at transfer capacity. The screenshot above (taken a while ago) shows, at the bottom, that we successfully offered to 0 peers and the content was declined 8 times. This is due to the State nodes being at uTP transfer capacity (or, in other words, hitting our rate limit).
Recently this issue has gotten worse, but even if it hadn't, this PR is required if we want to bridge latest state. Bridging state content is very expensive because there is just so much state to gossip. If we want Portal and the State network to be successful, we need at least a 10x performance improvement in Trin: currently Trin can only handle 5 Mbps up and down, which is way too slow; we won't be able to take a real load on the network with this.
So this PR is the first step on a road of changes which will hopefully greatly increase performance in Trin. I don't think we have a choice: we need to be performant if we want users.
How was it fixed?
Add a CLI flag called `max-radius`; by default the max radius is 5 percent. Assuming we have a network of over 20 nodes, we should have full coverage, since the nodes' radii should end up below 3%, especially as the network storage requirements keep growing over time. So I don't think we should have a default bigger than 5%.

For testing on hive we will use `--max-radius 100`.
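As a rough illustration only (this is not trin's actual CLI definition; the field type and range validation are simplified, and it assumes clap with the derive feature), a flag with a 5% default that hive runs could override with `--max-radius 100` might look like:

```rust
use clap::Parser;

#[derive(Parser, Debug)]
struct TrinConfig {
    /// Maximum storage radius, as a percentage of the full domain space.
    #[arg(
        long = "max-radius",
        default_value_t = 5,
        value_parser = clap::value_parser!(u32).range(0..=100)
    )]
    max_radius: u32,
}

fn main() {
    // Default: 5% max radius.
    let default_config = TrinConfig::parse_from(["trin"]);
    assert_eq!(default_config.max_radius, 5);

    // Hive testing would pass the full domain space explicitly.
    let hive_config = TrinConfig::parse_from(["trin", "--max-radius", "100"]);
    assert_eq!(hive_config.max_radius, 100);
}
```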
I think this solution is the best one for the foreseeable future. We could do a more complex scheme, but that wouldn't be worth it right now; instead we should focus on finding other bottlenecks in Trin, as this PR is good enough and gets us most of what we want. A more complex scheme would give diminishing returns, so I wouldn't expect us to look into it until Trin is very optimized; further optimizing this is a very low priority.