Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Measure builtin instruction performance #3364

Open
tao-stones opened this issue Oct 29, 2024 · 15 comments
Open

Measure builtin instruction performance #3364

tao-stones opened this issue Oct 29, 2024 · 15 comments
Assignees

Comments

@tao-stones
Copy link

tao-stones commented Oct 29, 2024

SIMD-198 calls for assigning static execution cost (in CU) for each builtin instruction, vs current per buitin programs, it'd help to improve CU allocation, therefore block packing quality.

Instruction issue/PR status
System #3567 merged
Stake #3396 merged
Loader #3659 merged
Vote #4342 open
ALT #4406 merged
Config bpf migration (SIMD-140), ready for mainnet-beta
ComputeBudget #3614 merged
@tao-stones tao-stones self-assigned this Oct 29, 2024
@tao-stones
Copy link
Author

tao-stones commented Oct 29, 2024

Looking for most efficient and reliable way, of gathering builtin instructions CU / Performance data, I'm aware of few options:

  • Bench for each instruction is reliable and repeatable, but writing them all up is not a small task
  • query stats/metric isn't reliable, not repeatable and quicker. However, I am not able to find proper data source from metrics nor 3rd party data I know of.
  • replay ledger, isn't reliable, not repeatable, and not fast. But can always be done

I'll start with bench first. I tried first to construct simple transaction that only contains necessary data for targeting instruction, for example StakeInstruction::Initialize, then push the transaction through a bank for full scale load_execute_and_commit. This could work, but have too much overhead in bench result. I have since settled down by using invoke_context::mock_process_instruction() in bencher.

@apfitzge
Copy link

@tao-stones could we write benches for each instruction, on some typical x86 machines, and count instructions via something like criterion-perf-events?

Then just use that number as approximate CUs? That seems like a very consistent way to do it instead of benchmarking time -> convert to CU via some previous constant of cu/us.

wdyt?

@tao-stones
Copy link
Author

tao-stones commented Oct 31, 2024

Benching is the way to go, already trying out on Stake. Counting number of instructions is great way to have a consistent measurements (assume an execution path is chosen), @ksolana brought up similar ideas as well. Think worth trying. Also TIL criterion-perf-events

However, doing so will put builtins costs in different realm as the rest still in us/CU conversion land. So instead of builtins costs are inaccurate but stable relative to other instructions, they will be more accurate but not in same scale as the rest. Would that cause concern? (I need to think about it)

@apfitzge
Copy link

However, doing so will put builtins costs in different realm as the rest still in us/CU conversion land.

CUs used to just be bpf instructions.
I think it's somewhat reasonable to translate from "typicaly x86 instruction" to 1 CU. It's an overestimate since obviously native vs emulated perf implications.

@ksolana
Copy link

ksolana commented Oct 31, 2024

@tao-stones could we write benches for each instruction, on some typical x86 machines, and count instructions via something like criterion-perf-events?

Then just use that number as approximate CUs? That seems like a very consistent way to do it instead of benchmarking time -> convert to CU via some previous constant of cu/us.

wdyt?

Yeah, this is exactly what i was referring to during yesterday's conversation. We can hook up valgrind to get a 'trace' of instruction counts. There might be other ways to get it, but since this is a one off thing, using valgrind isn't a bad idea IMO. A bit tricky but once implemented it is scalable to all the instructions.

@tao-stones
Copy link
Author

tried criterion-perf-events on StakeInstruction::Initialize, setting measure as Instructions, it prints:

Gnuplot not found, using plotters backend
stakes/initialize       time:   [125501.2873 cycles 125501.9226 cycles 125502.5530 cycles]
                        change: [-0.0002% +0.0011% +0.0024%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

not sure how to read this, are the cycles are "count of instructions"?

@apfitzge
Copy link

apfitzge commented Nov 1, 2024

Hmm i think I linked the wrong library looking at that output.

this is what i was thinking of: https://github.com/bheisler/iai

https://bheisler.github.io/criterion.rs/book/iai/getting_started.html

@ksolana
Copy link

ksolana commented Nov 1, 2024

Ah nice! It already uses cachegrind(which is part of the valgrind). Simplifies our work 🔥

@tao-stones
Copy link
Author

iirc iai cannot exclude setup code, if true, then it is a deal breaker for this use case.

@ksolana
Copy link

ksolana commented Nov 1, 2024

Can we compute the setup cost and subtract from every calculation?

@tao-stones
Copy link
Author

That's doable, then the project is more like a one-time effort. Still worth trying. Maybe can have both criterion benchmarks and iai benches, then need something automatically subtract setup cost from iai results.

@tao-stones
Copy link
Author

Also, do you guys know if iai and valgrind are stable on MacOS?

@tao-stones
Copy link
Author

tao-stones commented Nov 1, 2024

Got iai work on linux, a result looks like this:

iai bench StakeInitialize
     Running benches/stake_iai.rs (target/release/deps/stake_iai-680ef0d7fd289c61)
bench_setup_initialize
  Instructions:              159403
  L1 Accesses:               202150
  L2 Accesses:                  255
  RAM Accesses:                 929
  Estimated Cycles:          235940

bench_initialize
  Instructions:              290080
  L1 Accesses:               369241
  L2 Accesses:                  630
  RAM Accesses:                1991
  Estimated Cycles:          442076

Note: simply subtracting these two Instructions doesn't give correct answer, cause there are more "inner" pre-execution steps (eg invoke context and accounts priming) and post-execution steps.

@ksolana
Copy link

ksolana commented Nov 1, 2024

Also, do you guys know if iai and valgrind are stable on MacOS?

it's a linux thing. MacOS stopped supporting valgrind in a reasonable way. And for these computations we want to measure X86 instruction count anyways.

@tao-stones
Copy link
Author

@KirillLykov @ksolana just fyi, separated re-pricing builtin instruction into simd-198, updated this issue's description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants