
Have one thread per chain on benchmark #3374

Merged
1 commit merged into main on Feb 21, 2025

Conversation

ndr-ds
Contributor

@ndr-ds ndr-ds commented Feb 19, 2025

Motivation

Right now, linera benchmark does everything from a single thread, which doesn't scale well.

Proposal

Spawn one task per chain, and split the desired BPS across the number of chains. For example, if the desired BPS is 10, and you have 10 chains, each task will need to reach 1 BPS.

Every task just tries to send block proposals as fast as possible. The tasks don't control the BPS themselves; a dedicated BPS control task does that. Once a task has successfully sent its share of blocks, it sends a message to the BPS control task. It uses a crossbeam channel for that, as crossbeam allows bounded channels with a buffer size of 0.
With a buffer size of 0, if the BPS control task is sleeping or processing another message, the sender task blocks. This is exactly what we want in order to properly control the BPS behavior. Crossbeam channels are faster than std channels and comparable in speed to tokio channels, but tokio channels don't allow a buffer size of 0, and they also don't block the sender task, since sending is async.
The BPS control task keeps a timer running. Once it receives num_chains messages, the total number of blocks to be sent has been reached. If more than a second has elapsed, we failed to achieve the desired BPS; otherwise, we achieved it.

This all runs sharing the same wallet and ClientContext. The only resource the different threads could race for is the wallet, but since we no longer save the benchmark chains to the wallet, that's not a concern: the wallet is only used at the start to create the chains from its default chain, and the threads never touch it after that.

Shutdown signals also continue to work (all threads are gracefully shut down), and we still close all the chains on exit.
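The scheme above can be sketched with std's sync_channel(0), which, like crossbeam's bounded(0), gives rendezvous semantics (send blocks until the control side calls recv). This is only an illustrative sketch; the function names, constants, and loop structure below are made up for the example and are not the PR's actual code.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical sketch of the per-chain-task scheme. Returns how many
// "rounds" (one full BPS quota each) finished within the deadline.
fn run_benchmark_rounds(num_chains: usize, bps_share: usize, rounds: usize) -> usize {
    // A bound of 0 means send() blocks until the control task calls recv(),
    // mirroring crossbeam's bounded(0) rendezvous behavior used in the PR.
    let (sender, receiver) = sync_channel::<()>(0);

    let mut workers = Vec::new();
    for _ in 0..num_chains {
        let sender = sender.clone();
        workers.push(thread::spawn(move || {
            for _ in 0..rounds {
                for _ in 0..bps_share {
                    // ... propose and confirm one block here ...
                }
                // One message per achieved share, not per block, so the
                // control task sees at most num_chains messages per second.
                sender.send(()).unwrap();
            }
        }));
    }
    drop(sender); // recv() returns Err once every worker is done

    // BPS control task: num_chains messages == one full second's quota.
    let mut achieved = 0;
    let mut recv_count = 0;
    let mut start = Instant::now();
    while let Ok(()) = receiver.recv() {
        recv_count += 1;
        if recv_count == num_chains {
            if start.elapsed() <= Duration::from_secs(1) {
                achieved += 1; // quota met within the deadline
            }
            recv_count = 0;
            start = Instant::now();
        }
    }
    for w in workers {
        w.join().unwrap();
    }
    achieved
}

fn main() {
    // With no real block-proposal work, every round beats the deadline.
    println!("rounds meeting the deadline: {}", run_benchmark_rounds(4, 2, 3));
}
```

Because the channel is unbuffered, a worker that gets ahead of schedule simply blocks on send until the control task is ready, which is what throttles the aggregate rate.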

Test Plan

Ran with a few different BPS/TPB variations; it seems to work. Shutdown also works correctly.

Release Plan

  • Nothing to do / These changes follow the usual release cycle.

Contributor Author

ndr-ds commented Feb 19, 2025

@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from 88a80ad to b6acbe6 Compare February 19, 2025 23:55
async move {
context
.run_benchmark(
bps_share,
Contributor


Do we really need this in addition to the BPS control task? Can't each chain task go as fast as possible because it will be blocked by the channel anyway if the desired BPS is reached?

Contributor Author

@ndr-ds ndr-ds Feb 20, 2025


If I do it like that, the number of messages the BPS control task has to process is directly proportional to the BPS value, which is not ideal, especially when we start reaching really high BPS values. If each task has a BPS share it is supposed to achieve, and it only sends a message when that share is achieved, then we have at most num_chains messages per second, and that can't easily become a bottleneck.
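The scaling argument can be made concrete with a toy calculation (the numbers and the function are hypothetical, just illustrating the two designs discussed above):

```rust
// Toy illustration: how many messages per second the BPS control task
// receives under each design. Numbers below are made up.
fn control_task_msgs_per_second(bps: u64, num_chains: u64, one_msg_per_share: bool) -> u64 {
    if one_msg_per_share {
        num_chains // at most one message per chain task per second
    } else {
        bps // one message per proposed block
    }
}

fn main() {
    let (bps, num_chains) = (10_000, 50);
    println!(
        "per-block: {}/s, per-share: {}/s",
        control_task_msgs_per_second(bps, num_chains, false),
        control_task_msgs_per_second(bps, num_chains, true),
    );
}
```

With per-block messages the control task's inbound rate grows with the target BPS; with per-share messages it stays capped at num_chains regardless of BPS.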

Contributor


I can't imagine that a bunch of () messages could become a bottleneck… but I've been proven wrong before. 😅

@ndr-ds ndr-ds changed the base branch from 02-18-stop_updating_the_wallet_during_linera_benchmark to graphite-base/3374 February 20, 2025 13:14
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from b6acbe6 to f15df95 Compare February 20, 2025 13:57
@ndr-ds ndr-ds force-pushed the graphite-base/3374 branch from 10e8da9 to 220ab76 Compare February 20, 2025 13:57
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from f15df95 to d3127e0 Compare February 20, 2025 14:21
@ndr-ds ndr-ds force-pushed the graphite-base/3374 branch from 220ab76 to 56771fa Compare February 20, 2025 14:31
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch 2 times, most recently from 2bc26c1 to 1222830 Compare February 20, 2025 14:38
@ndr-ds ndr-ds force-pushed the graphite-base/3374 branch from 56771fa to 108dccb Compare February 20, 2025 15:16
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from 1222830 to 09f92d3 Compare February 20, 2025 15:16
@ndr-ds ndr-ds changed the base branch from graphite-base/3374 to main February 20, 2025 15:16
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from 09f92d3 to bf4a170 Compare February 20, 2025 15:16
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from bf4a170 to 55f7881 Compare February 20, 2025 19:48
@ndr-ds ndr-ds changed the base branch from main to graphite-base/3374 February 20, 2025 22:47
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from 55f7881 to 742de30 Compare February 20, 2025 22:47
@ndr-ds ndr-ds changed the base branch from graphite-base/3374 to 02-19-adjust_dashboards_to_show_1m_metrics February 20, 2025 22:47
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from 742de30 to 19d9e36 Compare February 20, 2025 22:58
@ndr-ds ndr-ds changed the base branch from 02-19-adjust_dashboards_to_show_1m_metrics to graphite-base/3374 February 20, 2025 23:55
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from 19d9e36 to 09d9fe2 Compare February 20, 2025 23:55
@ndr-ds ndr-ds force-pushed the graphite-base/3374 branch from 60b267d to fb96a12 Compare February 20, 2025 23:55
@ndr-ds ndr-ds changed the base branch from graphite-base/3374 to main February 20, 2025 23:56
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from 09d9fe2 to 43ce4d1 Compare February 20, 2025 23:56
.await?;
let epoch = epoch.expect("default chain should have an epoch");
let committee = committees
.get(&epoch)
Contributor


If you remove here, you don't have to clone below.


@deuszx
Contributor

deuszx commented Feb 21, 2025

as we don't do any mutable operations to either since we don't alter the wallet for benchmarks anymore.

If we don't do any mutable operations (don't update the wallet/state/db) is it an honest benchmark?

Comment on lines +592 to +596
let start = Instant::now();
// Below all block proposals are supposed to succeed without retries, we
// must make sure that all incoming payments have been accepted on-chain
// and that no validator is missing user certificates.
self.process_inboxes_and_force_validator_updates().await;
info!(
"Processed inboxes and forced validator updates in {} ms",
start.elapsed().as_millis()
);
Contributor


As a reader I'm slightly confused - the method is called prepare_for_benchmark but here we force the processing of inboxes and create blocks?

Contributor Author


That's what we need to do to prepare for running the benchmark :)

info!(
"Got {} chains in {} ms",
key_pairs.len(),
start.elapsed().as_millis()
);
Contributor


Why not trace!?

Comment on lines +597 to +596
info!(
"Processed inboxes and forced validator updates in {} ms",
start.elapsed().as_millis()
);
Contributor


Looks like a TRACE-level log.

Contributor Author


This is not production code; it's a benchmarking tool. I want to know how long things are running for without having to turn on trace or debug logs.

fungible_application_id,
);

Ok((chain_clients, epoch, blocks_infos, committee.clone()))
Contributor


This method is doing at least four things:

  • clearing out inboxes, creating blocks
  • creating new chains for benchmark
  • supplying a test application with some tokens
  • calling make_benchmark_block_info on a default chain

@@ -1123,7 +1129,7 @@ where

/// Broadcasts certified blocks to validators.
#[instrument(level = "trace", skip(committee, delivery))]
async fn communicate_chain_updates(
pub async fn communicate_chain_updates(
Contributor


If we add pub then they become part of the API contract, and we can no longer change them.

// the desired BPS, the tasks would continue sending block proposals until the channel's
// buffer is filled, which would cause us to not properly control the BPS rate.
let (sender, receiver) = crossbeam_channel::bounded(0);
let bps_control_task = tokio::spawn(async move {
Contributor


It looks like bps_control_task is spawned on the same logical "level" as tasks that send the blocks. Personally I think it'd make more sense if the "control task" (supervisor) was spawning those tasks.

Contributor Author


The main task deals with awaiting the different spawned tasks. I don't see what the issue is. The control task controls the BPS, it is not supposed to create other things.

Contributor


It's the BPS task that is controlling when the job "is done" though. But OK, it doesn't necessarily mean it has to spawn the workers.

let mut start = time::Instant::now();
while let Ok(()) = receiver.recv() {
recv_count += 1;
if recv_count == num_chains {
Contributor


It's hard to see that the code here is called just before finishing the benchmark. If the BPS control task spawned the "worker tasks" but itself stayed in the "main thread" it'd be more linear.

Contributor Author


What you're describing only happens when we don't specify --bps. All senders will be closed, and this will exit

let default_chain_client = context.make_chain_client(default_chain_id)?;
let (epoch, committees) = default_chain_client
.epoch_and_committees(default_chain_id)
let (chain_clients, epoch, blocks_infos, committee) = context
Contributor


I find it weird that the benchmark load (operations etc.) is generated by a client context. Why is the benchmark process itself not deciding what the benchmark should look like and how?

Contributor Author


Makes sense, I can move the stuff that doesn't need the client context into Benchmark instead

Contributor Author

ndr-ds commented Feb 21, 2025

@deuszx I think I didn't express myself well there. All I meant to say was that we used to alter the wallet, and now we don't. The only resource the different threads could race for is the wallet, but since we don't save the benchmark chains to the wallet anymore, that's no longer a concern. So using the same wallet for multiple threads is feasible, because the wallet is only used to create the chains at the start using its default chain, then the wallet isn't used by the threads anymore.

@deuszx
Contributor

deuszx commented Feb 21, 2025

@deuszx I think I didn't express myself well there. All I meant to say was that we used to alter the wallet, and now we don't. The only resource the different threads could race for is the wallet, but since we don't save the benchmark chains to the wallet anymore, that's no longer a concern. So using the same wallet for multiple threads is feasible, because the wallet is only used to create the chains at the start using its default chain, then the wallet isn't used by the threads anymore.

I see, thanks. It was my misunderstanding then – I totally forgot that by "wallet" we mean the local wallet of the client that tracks chains.

@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from 43ce4d1 to f48ee51 Compare February 21, 2025 14:52
@ndr-ds ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from f48ee51 to fd0602b Compare February 21, 2025 14:53
Contributor Author

ndr-ds commented Feb 21, 2025

Merge activity

  • Feb 21, 10:27 AM EST: A user started a stack merge that includes this pull request via Graphite.
  • Feb 21, 10:28 AM EST: A user merged this pull request with Graphite.

@ndr-ds ndr-ds merged commit cfefa52 into main Feb 21, 2025
23 checks passed
@ndr-ds ndr-ds deleted the 02-19-have_one_thread_per_chain_on_benchmark branch February 21, 2025 15:28