
QEMU TCG RVV optimization: Weekly report 2024-05-08

Overview

Introducing the project team

  • Paolo Savini (Embecosm)
  • Helene Chelin (Embecosm)
  • Jeremy Bennett (Embecosm)
  • Hugh O'Keeffe (Ashling)
  • Nadim Shehayed (Ashling)
  • Daniel Barboza (Ventana)

Work completed since last report

  • WP1:
    • Pull together all the memory/string operations benchmarks and obtain baseline QEMU instruction counts.
      • We have scalar and vector versions of memset and memcpy ready for memory operation benchmarks.
      • We are working on the framework to produce the baseline.
    • Establish framework for single RVV instruction benchmarks and obtain example performance scores.
      • We have the framework in place for a sample instruction, and it produces a score in MIPS.
      • We now just need to find the first vector instructions to work on.
    • Obtain reference QEMU SPEC CPU 2017 instruction count baselines with and without RVV.
      • We have obtained scalar and vector SPEC CPU 2017 baseline results. See below.
    • Decide on whether to enable any other ISA extensions (e.g. Zvfh/Zfh) and whether to use LTO.
      • We propose to focus on the main V extension for the moment.
      • We advise against using LTO: the focus is on the performance of QEMU, not the compiler.
    • Tools to view how QEMU translates RVV instructions are in place. See below.

Work planned for the coming week

  • WP1:
    • Get baseline scores for memory operation benchmarks.
    • Identify the most promising load/store vector instruction to optimize.
    • Prepare routine/nightly runs of benchmarks.

Current priorities

Our current set of agreed priorities is taken from the Statement of Work. It has the following priorities, which trade off the functionality targeted against the architectures supported.

  • vector load/store ops for x86_64 AVX
  • vector load/store ops for AArch64/Neon
  • vector integer ALU ops for x86_64 AVX
  • vector load/store ops for Intel AVX10

For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.

  • WP0: Infrastructure
  • WP1: Analysis of vector load/store ops on x86_64 AVX
  • WP2: Optimization of vector load/store ops on x86_64 AVX
  • WP3: Analysis of vector load/store ops on AArch64/Neon
  • WP4: Optimization of vector load/store ops on AArch64/Neon
  • WP5: Analysis of integer ALU ops on x86_64 AVX
  • WP6: Optimization of integer ALU ops on x86_64 AVX
  • WP7: Analysis of vector load/store ops on Intel AVX10
  • WP8: Optimization of vector load/store ops on Intel AVX10

These priorities can be revised by agreement with RISE during the project.

Detailed description of work

WP1

SPEC CPU 2017 base line results:

We have run all 20 SPEC CPU 2017 speed tests (10 integer, 10 floating point) under QEMU, both with and without RISC-V Vector (RVV) enabled. See the full results by following the link in the statistics section below.

Eight tests reported failed checks at the end. We believe 3 of these may be failures of the checking infrastructure, with the runs themselves OK, but 5 are genuinely failing to run correctly. This is under investigation.

The runs were carried out on a 20-core Intel laptop running Ubuntu 23.04. The elapsed and compute times for each run were as follows.

| Run    | Elapsed time | Compute time |
|--------|--------------|--------------|
| Scalar | 1d 8h        | 24d 19h      |
| Vector | 3d 7h        | 52d 17h      |

These data are not perfect for the following reasons:

  • they pre-date the release of GCC 14.1, and the runs use slightly different compiler snapshots
  • the vector runs also enable zvfh and zfh extensions
  • not all runs completed successfully.

We will rerun these benchmarks on our servers (which have far more cores) using the GCC 14.1 release; however, the data as they stand are sufficient to inform our plans for this project. The key data table, on QEMU execution times, is given here. We tabulate both the wall clock elapsed time and the total compute time needed across all cores, for the scalar (S) and vector (V) runs; the final column is the ratio of vector to scalar compute time, and benchmarks whose runs failed their checks are marked with an X in the Fail column. All times are in seconds.

| Benchmark       | Type | Fail | S Wall  | S Compute | V Wall    | V Compute | V/S Compute |
|-----------------|------|------|---------|-----------|-----------|-----------|-------------|
| 600.perlbench_s | int  |      | 50,917  | 10,406    | 58,799    | 12,937    | 1.24        |
| 602.gcc_s       | int  |      | 42,414  | 8,265     | 52,146    | 11,231    | 1.36        |
| 603.bwaves_s    | fp   | X    | 3,292   | 8,122     | 1,769     | 4,408     | 0.54        |
| 605.mcf_s       | int  |      | 23,201  | 5,130     | 25,132    | 5,637     | 1.10        |
| 607.cactuBSSN_s | fp   |      | 55,220  | 239,892   | 54,241    | 246,158   | 1.03        |
| 619.lbm_s       | fp   |      | 16,790  | 38,708    | 17,398    | 39,353    | 1.02        |
| 620.omnetpp_s   | int  |      | 25,771  | 5,805     | 29,313    | 7,006     | 1.21        |
| 621.wrf_s       | fp   |      | 119,434 | 678,309   | 144,140   | 758,194   | 1.12        |
| 623.xalancbmk_s | int  |      | 17,537  | 3,538     | 30,544    | 7,493     | 2.12        |
| 625.x264_s      | int  |      | 28,478  | 6,010     | 71,432    | 20,411    | 3.40        |
| 627.cam4_s      | fp   | X    | -       | -         | 325       | 312       | -           |
| 628.pop2_s      | fp   |      | 123,726 | 65,563    | 285,214   | 215,536   | 3.29        |
| 631.deepsjeng_s | int  |      | 40,689  | 9,130     | 42,067    | 12,133    | 1.33        |
| 638.imagick_s   | fp   |      | 109,173 | 888,627   | 244,575   | 3,028,893 | 3.41        |
| 641.leela_s     | int  |      | 41,555  | 9,276     | 45,114    | 13,244    | 1.43        |
| 644.nab_s       | fp   |      | 32,859  | 142,530   | 32,298    | 147,449   | 1.03        |
| 648.exchange2_s | int  |      | 26,871  | 6,154     | 55,297    | 17,635    | 2.87        |
| 649.fotonik3d_s | fp   | X    | 0       | 0         | 0         | 0         | 0.40        |
| 654.roms_s      | fp   | X    | 1       | 1         | 1         | 1         | 1.14        |
| 657.xz_s        | int  | X    | 13,027  | 17,826    | 4,305     | 5,258     | 0.29        |
| 996.specrand_fs | fp   |      | 4       | 1         | 8         | 2         | 1.32        |
| 998.specrand_is | int  |      | 5       | 1         | 5         | 1         | 1.07        |
| Total           |      |      | 770,962 | 2,143,293 | 1,194,118 | 4,553,288 | 2.12        |

We make the following observations:

  • the floating point tests generally take much longer than the integer tests, suggesting that FP emulation in QEMU may also have a problem;
  • for many benchmarks, the vector run is less than 10% longer than the scalar run, suggesting these benchmarks may have minimal vectorization taking place; and
  • 4/5 of the failed benchmarks are FP benchmarks.

We would like to use a subset of the quicker SPEC CPU tests to check QEMU performance nightly: a mix of integer and FP tests, all of which have significantly longer execution times when vectorized. This is slightly problematic, since the only floating point tests where vectorized execution is significantly longer (638.imagick_s and 628.pop2_s) both take over 3 days to complete when vectorized. We therefore propose the following strategy.

  • Initially use 619.lbm_s (fp), 620.omnetpp_s (int) and 623.xalancbmk_s (int) as our nightly tests, which we anticipate will complete in around 8-12 hours.
  • Repeat our baseline run using the SPEC CPU test datasets, rather than the ref datasets. These are smaller and should lead to shorter runs. This should give us a richer set of alternatives for nightly testing.

Analysis of RVV to TCG to x86:

We have tools in place to see how QEMU translates RVV instructions into TCG operations and then into native x86 code.

Given an RVV instruction like vle32.v, you can see below which TCG operations are generated and then which x86_64 instructions are generated to support them:

RVV:

    vle32.v v0,(a1)

TCG ops:

    brcond_i64 vstart,vl,geu,$L1
    add_i64 loc3,env,$0x200
    add_i64 loc4,env,$0x200
    call vle32_v,$0x0,$0,loc3,loc4,x11/a1,env,$0x60601
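
Our reading of this dump (our interpretation, not taken from the QEMU documentation): the brcond_i64 skips the operation when vstart >= vl (geu is an unsigned >= comparison), the add_i64 ops compute pointers into the vCPU state at offset 0x200 for the vector register operands, and the call invokes the out-of-line vle32_v helper, passing a descriptor ($0x60601) that encodes the current vector configuration.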

x86_64:

    movl    -16(%rbp), %ebx
    testl   %ebx, %ebx
    jl      512
    movb    $0, -12(%rbp)
    movq    4632(%rbp), %rbx
    movq    4624(%rbp), %r12
    cmpq    %r12, %rbx
    jae     33
    leaq    512(%rbp), %rdi
    leaq    512(%rbp), %rsi
    movl    $394753, %r8d                   # imm = 0x60601
    movq    %rbp, %rcx
    movq    88(%rbp), %rdx
    callq   *481(%rip)

This is just a small example focused on vle32.v, but it lets us directly compare the input assembly with the corresponding TCG basic blocks and the native translation of those blocks. This is going to be extremely useful in our optimization work.
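
For reference, dumps like the one above can be produced with QEMU's standard logging flags; the guest binary name and vector configuration below are illustrative:

    ~$ qemu-riscv64 -cpu rv64,v=true,vlen=128 \
           -d in_asm,op,out_asm -D qemu.log ./vle32_test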

Statistics

Memory/string operations

No data available yet.
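
For context, a minimal RVV memcpy along the lines of the canonical example in the RISC-V Vector specification looks as follows; the label and register allocation are illustrative, and our actual benchmark versions may differ:

    # Minimal RVV memcpy sketch (after the example in the RVV spec).
    # a0 = dst, a1 = src, a2 = byte count; returns dst in a0.
    memcpy_vec:
        mv      a3, a0                    # keep dst for the return value
    1:  vsetvli t0, a2, e8, m8, ta, ma    # vl = min(VLMAX, remaining bytes)
        vle8.v  v0, (a1)                  # load vl bytes from src
        sub     a2, a2, t0                # remaining -= vl
        add     a1, a1, t0                # src += vl
        vse8.v  v0, (a3)                  # store vl bytes to dst
        add     a3, a3, t0                # dst += vl
        bnez    a2, 1b                    # loop until all bytes are copied
        ret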

Individual RVV instruction performance

We have infrastructure in place to run performance tests around a single V instruction, e.g.:

    #include <stdlib.h>

    int
    main (int   argc,
          char *argv[])
    {
      /* Number of iterations to run, in millions, from the command line.  */
      size_t num_millions = atoi (argv[1]);
      size_t num_loops = num_millions * 1000000;

      /* Execute the instruction under test once per iteration.  */
      for (size_t i = 0; i < num_loops; i++)
        __asm__ ("vadd.vx\t v0, v0, %0" : : "r"(i) : "v0");

      return 0;
    }

The aim is to run one of these tests twice with substantially different iteration counts and use the difference in execution time to obtain a MIPS measurement of QEMU's emulation of that instruction:

    ~$ qemu-riscv64 main.exe 2000
    ~$ qemu-riscv64 main.exe 1000

Differencing the two runs measures the cost of the extra 1 billion iterations; for the vadd.vx test above this currently comes out at around 70 MIPS.
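
As a worked example of the arithmetic (the timings here are hypothetical):

    # suppose the two runs take 28.6 s and 14.3 s respectively:
    # (2000 - 1000) * 1,000,000 iterations / (28.6 - 14.3) s = ~70 MIPS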

SPEC CPU 2017 performance

You can find the baseline execution time and instruction count of the SPEC CPU 2017 benchmarks here.
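
As background (this is not necessarily the exact method used for the linked results), QEMU instruction counts can be gathered with the instruction-counting TCG plugin shipped in the QEMU tree; the plugin path depends on the build:

    ~$ qemu-riscv64 -plugin /path/to/libinsn.so -d plugin ./benchmark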

Actions

2024-05-08

  • Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.

  • Jeremy to set up "smoke test" SPEC CPU 2017 run using the integer and floating point specrand programs (should run in minutes).

  • Daniel to provide git send-email macro for QEMU patches

    • COMPLETE:

          git send-email --suppress-cc=sob --to [email protected] \
              --cc [email protected] --cc <list of maintainers>
  • Daniel to provide Paolo with list of maintainer email addresses

    • COMPLETE.
  • Daniel to advise Paolo on the patch checking script.

    • COMPLETE. Use ./scripts/checkpatch.pl

2024-05-01

  • Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
    • WIP: we are working on running the reproducers to see the TCG ops generated by QEMU.
  • Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
  • The bionic benchmarks may be a useful source of small benchmarks.

Risk register

The risk register is held in a shared spreadsheet.

We will keep it updated continuously and report any changes each week.

There are no changes to the risk register this week.

Planned absences

Helene will be on vacation from the 1st to the 12th of May.