- Paolo Savini (Embecosm)
- Helene Chelin (Embecosm)
- Jeremy Bennett (Embecosm)
- Hugh O'Keeffe (Ashling)
- Nadim Shehayed (Ashling)
- Daniel Barboza (Ventana)
- WP1:
- Pull together all the memory/string operations benchmarks and obtain baseline QEMU instruction counts.
- We have scalar and vector versions of memset and memcpy ready for memory operation benchmarks.
- We are working on the framework to produce the baseline.
- Establish framework for single RVV instruction benchmarks and obtain example performance scores.
- We have the framework in place for a sample instruction and produce the score in MIPS.
- We now just need to find the first vector instructions to work on.
- Obtain reference QEMU SPEC CPU 2017 instruction count baselines with and without RVV.
- We got scalar and vector SPECCPU 2017 baseline results. See below.
- Decide on whether to enable any other ISA extensions (e.g. Zvfh/Zfh) and whether to use LTO.
- We propose to focus on the main V extension for the moment.
- We advice against using LTO: the focus is on the performance of QEMU, not the compiler.
- Tools to view the translation process of QEMU on RVV instructions are in place. See below.
- Pull together all the memory/string operations benchmarks and obtain baseline QEMU instruction counts.
- WP1:
- Get baseline scores for memory operation benchmarks.
- Identify the most promising load/store vector instruction to optimize.
- Prepare routine/nightly runs of benchmarks.
Our current set of agreed priorities are taken from the Statement of Work. This has the following priorities, which trade off functionality targeted versus architectures supported.
- vector load/store ops for x86_64 AVX
- vector load/store ops for AArch64/Neon
- vector integer ALU ops for x86_64 AVX
- vector load/store ops for Intel AVX10
For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.
- WP0: Infrastructure
- WP1: Analysis of vector load/store ops on x86_64 AVX
- WP2: Optimization of vector load/store ops on x86_64 AVX
- WP3: Analysis of vector load/store ops on AArch64/Neon
- WP4: Optimization of vector load/store ops on AArch64/Neon
- WP5: Analysis of integer ALU ops on x86_64 AVX
- WP6: Optimization of integer ALU ops on x86_64 AVX
- WP7: Analysis of vector load/store ops on Intel AVX10
- WP8: Optimization of vector load/store ops on Intel AVX10
These priorities can be revised by agreement with RISE during the project.
We have run all 20 SPEC CPU 2017 speed tests, 10 integer, 10 floating point, under QEMU. We have run the tests both with, and without RISC-V Vector (RVV) enabled. See the full results by following the link in the statistics secion below.
8 tests reported failed checks at the end. We believe 3 of these may be a failure of the checking infrastructure, and the runs are OK, but 5 are genuinely failing to run correctly. This is under investigation.
The runs were carried out on a 20-core Intel laptop running Ubuntu 23.04. The elapsed and compute times for each run were as follows.
Run | Elapsed time | Compute Time |
---|---|---|
Scalar | 1d 8h | 24d 19h |
Vector | 3d 7h | 52d 17h |
These data are not perfect for the following reasons
- they pre-date the release of GCC 14.1, and each run uses slightly different releases
- the vector runs also enable zvfh and zfh extensions
- not all runs completed successfully.
We will rerun these data on our servers (which have far more cores) using GCC 14.1 release, however the data as they stand are sufficient to inform our plans for this project. The key data table, on QEMU execution times is given here. We tabulate both the wall clock elapsed time and the total compute time needed across all cores. All times are in seconds.
Benchmark | Type | OK | S Wall | S Compute | V Wall | V Compute | V/S Compute |
---|---|---|---|---|---|---|---|
600.perlbench_s |
int | 50,917 | 10,406 | 58,799 | 12,937 | 1.24 | |
602.gcc_s |
int | 42,414 | 8,265 | 52,146 | 11,231 | 1.36 | |
603.bwaves_s |
fp | X | 3,292 | 8,122 | 1,769 | 4,408 | 0.54 |
605.mcf_s |
int | 23,201 | 5,130 | 25,132 | 5,637 | 1.10 | |
607.cactuBSSN_s |
fp | 55,220 | 239,892 | 54,241 | 246,158 | 1.03 | |
619.lbm_s |
fp | 16,790 | 38,708 | 17,398 | 39,353 | 1.02 | |
620.omnetpp_s |
int | 25,771 | 5,805 | 29,313 | 7,006 | 1.21 | |
621.wrf_s |
fp | 119,434 | 678,309 | 144,140 | 758,194 | 1.12 | |
623.xalancbmk_s |
int | 17,537 | 3,538 | 30,544 | 7,493 | 2.12 | |
625.x264_s |
int | 28,478 | 6,010 | 71,432 | 20,411 | 3.40 | |
627.cam4_s |
fp | X | - | - | 325 | 312 | - |
628.pop2_s |
fp | 123,726 | 65,563 | 285,214 | 215,536 | 3.29 | |
631.deepsjeng_s |
int | 40,689 | 9,130 | 42,067 | 12,133 | 1.33 | |
638.imagick_s |
fp | 109,173 | 888,627 | 244,575 | 3,028,893 | 3.41 | |
641.leela_s |
int | 41,555 | 9,276 | 45,114 | 13,244 | 1.43 | |
644.nab_s |
fp | 32,859 | 142,530 | 32,298 | 147,449 | 1.03 | |
648.exchange2_s |
int | 26,871 | 6,154 | 55,297 | 17,635 | 2.87 | |
649.fotonik3d_s |
fp | X | 0 | 0 | 0 | 0 | 0.40 |
654.roms_s |
fp | X | 1 | 1 | 1 | 1 | 1.14 |
657.xz_s |
int | X | 13,027 | 17,826 | 4,305 | 5,258 | 0.29 |
996.specrand_fs |
fp | 4 | 1 | 8 | 2 | 1.32 | |
998.specrand_is |
int | 5 | 1 | 5 | 1 | 1.07 | |
Total | 770,962 | 2,143,293 | 1,194,118 | 4,553,288 | 2.12 |
We make the following observations:
- the floating point tests generally take much long than the integer tests, suggesting that FP emulation in QEMU may also have a problem;
- for many benchmarks, the vector run is less than 10% longer than the scalar run, suggesting these benchmarks may have minimal vectorization taking place; and
- 4/5 of the failed benchmarks are FP benchmarks.
We would like to use a subset of the quicker SPEC CPU tests to allow us to check QEMU performance nightly. For this we would like a mix of integer and FP tests, all of which have significantly longer execution times when vectorized. This is slightly problematic, since the only floating point tests where vectorized execution is significantly longer both take over 3 days to complete when vectorized (638.imagick_s and 628.pop2_s). We therefore propose using the following strategy.
- Initially use 619.lbm_s (fp), 620.omnetpp_s (int) and 623.xalancbmk_s (int) as our nightly tests, which we anticipate will complete in around 8-12 hours.
- Repeat our baseline run using the SPEC CPU test datasets, rather than the ref datasets. These are smaller and should lead to shorter runs. This should give us a richer set of alternatives for nightly testing.
We have tools in place to see how QEMU is generating TCG operations on x86:
Given an RVV instruction like vle32.v you can see here which TCG operations are generated and then which x86 instructions are generated to support those:
RVV:
vle32.v v0,(a1)
TCG ops:
brcond_i64 vstart,vl,geu,$L1
add_i64 loc3,env,$0x200
add_i64 loc4,env,$0x200
call vle32_v,$0x0,$0,loc3,loc4,x11/a1,env,$0x60601
x86_64:
movl -16(%rbp), %ebx
testl %ebx, %ebx
jl 512
movb $0, -12(%rbp)
movq 4632(%rbp), %rbx
movq 4624(%rbp), %r12
cmpq %r12, %rbx
jae 33
leaq 512(%rbp), %rdi
leaq 512(%rbp), %rsi
movl $394753, %r8d # imm = 0x60601
movq %rbp, %rcx
movq 88(%rbp), %rdx
callq *481(%rip)
This is just a small example focused on vle32.v but we can compare directly the input asm with the corresponding TCG basic blocks and the native stranslation of such basic blocks. This is going to be extremely useful in our optimization work.
No available data yet.
We have an infrastructure in place to run performance tests around a single V instruction. E.g.:
int
main (int argc,
char *argv[])
{
size_t num_millions = atoi (argv[1]);
size_t num_loops = num_millions * 1000000;
for (size_t i = 0; i < num_loops; i++)
__asm__ ("vadd.vx\t v0, v0, %0" : : "r"(i) : "v0");
return 0;
}
The aim is to run one of these tests twice with a considerable amount of iterations and use the difference of execution time in order to obtain a MIPS measurement of the QEMU emulation time of that instruction.
~$ qemu-riscv64 main.exe 2000
~$ qemu-riscv64 main.exe 1000
to measure the MIPS of 1 billion iterations:
~$ 70 MIPS
You can find the baseline execution time and instruction count of the SPEC CPU 2017 benchmarks here
2024-05-08
-
Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.
-
Jeremy to set up "smoke test" SPEC CPU 2017 run using the integer and floating point
specrand
programs (should run in minutes). -
Daniel to provide git sendmail macro for QEMU patches
- COMPLETE.
send-email --suppress-cc=sob --to [email protected] \
--cc [email protected] --cc <list of maintainers>
-
Daniel to provide Paolo with list of maintainer email addresses
- COMPLETE.
-
Daniel to advise on Paolo of patch checking script.
- COMPLETE. Use
./script/checkpatch.pl
- COMPLETE. Use
2024-05-01
- Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
- WIP: we are working on running the reproducers to see the TCG ops generated by QEMU.
- Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
- The bionic benchmarks may be a useful source of small benchmarks.
The risk register is held in a shared spreadsheet
We will keep it updated continuously and report any changes each week.
There are no changes to the risk register this week.
Helene will be on vacation from the 1st to the 12th of May.