
QEMU TCG RVV optimization: Weekly report 2024-05-08

Overview

Introducing the project team

  • Paolo Savini (Embecosm)
  • Helene Chelin (Embecosm)
  • Jeremy Bennett (Embecosm)
  • Hugh O'Keeffe (Ashling)
  • Nadim Shehayed (Ashling)
  • Daniel Barboza (Ventana)

Work completed since last report

  • WP1:
    • Pull together all the memory/string operations benchmarks and obtain baseline QEMU instruction counts.
      • We have scalar and vector versions of memset and memcpy ready for memory operation benchmarks.
      • We are working on the framework to produce the baseline.
    • Establish framework for single RVV instruction benchmarks and obtain example performance scores.
      • We have the framework in place for a sample instruction, and it produces a score in MIPS.
      • We now just need to find the first vector instructions to work on.
    • Obtain reference QEMU SPEC CPU 2017 instruction count baselines with and without RVV.
      • We have obtained scalar and vector SPEC CPU 2017 baseline results. See below.
    • Decide on whether to enable any other ISA extensions (e.g. Zvfh/Zfh) and whether to use LTO.
      • We propose to focus on the main V extension for the moment.
      • We advise against using LTO: the focus is on the performance of QEMU, not the compiler.
    • Tools to view how QEMU translates RVV instructions are in place. See below.

Work planned for the coming week

  • WP1:
    • Get baseline scores for memory operation benchmarks.
    • Identify the most promising load/store vector instruction to optimize.
    • Prepare routine/nightly runs of benchmarks.

Current priorities

Our current set of agreed priorities is taken from the Statement of Work. It has the following priorities, which trade off the functionality targeted against the architectures supported.

  • vector load/store ops for x86_64 AVX
  • vector load/store ops for AArch64/Neon
  • vector integer ALU ops for x86_64 AVX
  • vector load/store ops for Intel AVX10

For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.

  • WP0: Infrastructure
  • WP1: Analysis of vector load/store ops on x86_64 AVX
  • WP2: Optimization of vector load/store ops on x86_64 AVX
  • WP3: Analysis of vector load/store ops on AArch64/Neon
  • WP4: Optimization of vector load/store ops on AArch64/Neon
  • WP5: Analysis of integer ALU ops on x86_64 AVX
  • WP6: Optimization of integer ALU ops on x86_64 AVX
  • WP7: Analysis of vector load/store ops on Intel AVX10
  • WP8: Optimization of vector load/store ops on Intel AVX10

These priorities can be revised by agreement with RISE during the project.

Detailed description of work

WP1

SPEC CPU 2017 base line results:

We have run all 20 SPEC CPU 2017 speed tests (10 integer, 10 floating point) under QEMU, both with and without RISC-V Vector (RVV) enabled. See the full results by following the link in the statistics section below.

Eight tests reported failed checks at the end. We believe 3 of these may be failures of the checking infrastructure, with the runs themselves OK, but 5 are genuinely failing to run correctly. This is under investigation.

The runs were carried out on a 20-core Intel laptop running Ubuntu 23.04. The elapsed and compute times for each run were as follows.

| Run    | Elapsed time | Compute time |
|--------|--------------|--------------|
| Scalar | 1d 8h        | 24d 19h      |
| Vector | 3d 7h        | 52d 17h      |

These data are not perfect for the following reasons:

  • they pre-date the release of GCC 14.1, and the runs use slightly different compiler snapshots
  • the vector runs also enable zvfh and zfh extensions
  • not all runs completed successfully.

We will rerun these benchmarks on our servers (which have far more cores) using the GCC 14.1 release; however, the data as they stand are sufficient to inform our plans for this project. The key data table, on QEMU execution times, is given here. We tabulate both the wall clock elapsed time and the total compute time needed across all cores, for the scalar (S) and vector (V) runs; the final column is the ratio of vector to scalar compute time, and benchmarks whose runs failed their checks are marked with an X in the Fail column. All times are in seconds.

| Benchmark       | Type | Fail | S Wall  | S Compute | V Wall    | V Compute | V/S Compute |
|-----------------|------|------|---------|-----------|-----------|-----------|-------------|
| 600.perlbench_s | int  |      | 50,917  | 10,406    | 58,799    | 12,937    | 1.24        |
| 602.gcc_s       | int  |      | 42,414  | 8,265     | 52,146    | 11,231    | 1.36        |
| 603.bwaves_s    | fp   | X    | 3,292   | 8,122     | 1,769     | 4,408     | 0.54        |
| 605.mcf_s       | int  |      | 23,201  | 5,130     | 25,132    | 5,637     | 1.10        |
| 607.cactuBSSN_s | fp   |      | 55,220  | 239,892   | 54,241    | 246,158   | 1.03        |
| 619.lbm_s       | fp   |      | 16,790  | 38,708    | 17,398    | 39,353    | 1.02        |
| 620.omnetpp_s   | int  |      | 25,771  | 5,805     | 29,313    | 7,006     | 1.21        |
| 621.wrf_s       | fp   |      | 119,434 | 678,309   | 144,140   | 758,194   | 1.12        |
| 623.xalancbmk_s | int  |      | 17,537  | 3,538     | 30,544    | 7,493     | 2.12        |
| 625.x264_s      | int  |      | 28,478  | 6,010     | 71,432    | 20,411    | 3.40        |
| 627.cam4_s      | fp   | X    | -       | -         | 325       | 312       | -           |
| 628.pop2_s      | fp   |      | 123,726 | 65,563    | 285,214   | 215,536   | 3.29        |
| 631.deepsjeng_s | int  |      | 40,689  | 9,130     | 42,067    | 12,133    | 1.33        |
| 638.imagick_s   | fp   |      | 109,173 | 888,627   | 244,575   | 3,028,893 | 3.41        |
| 641.leela_s     | int  |      | 41,555  | 9,276     | 45,114    | 13,244    | 1.43        |
| 644.nab_s       | fp   |      | 32,859  | 142,530   | 32,298    | 147,449   | 1.03        |
| 648.exchange2_s | int  |      | 26,871  | 6,154     | 55,297    | 17,635    | 2.87        |
| 649.fotonik3d_s | fp   | X    | 0       | 0         | 0         | 0         | 0.40        |
| 654.roms_s      | fp   | X    | 1       | 1         | 1         | 1         | 1.14        |
| 657.xz_s        | int  | X    | 13,027  | 17,826    | 4,305     | 5,258     | 0.29        |
| 996.specrand_fs | fp   |      | 4       | 1         | 8         | 2         | 1.32        |
| 998.specrand_is | int  |      | 5       | 1         | 5         | 1         | 1.07        |
| Total           |      |      | 770,962 | 2,143,293 | 1,194,118 | 4,553,288 | 2.12        |

We make the following observations:

  • the floating point tests generally take much longer than the integer tests, suggesting that FP emulation in QEMU may also have a problem;
  • for many benchmarks, the vector run is less than 10% longer than the scalar run, suggesting these benchmarks may have minimal vectorization taking place; and
  • 4/5 of the failed benchmarks are FP benchmarks.

We would like to use a subset of the quicker SPEC CPU tests to check QEMU performance nightly: a mix of integer and FP tests, all of which have significantly longer execution times when vectorized. This is slightly problematic, since the only floating point tests where vectorized execution is significantly longer (638.imagick_s and 628.pop2_s) both take over 3 days to complete when vectorized. We therefore propose the following strategy.

  • Initially use 619.lbm_s (fp), 620.omnetpp_s (int) and 623.xalancbmk_s (int) as our nightly tests, which we anticipate will complete in around 8-12 hours.
  • Repeat our baseline run using the SPEC CPU test datasets, rather than the ref datasets. These are smaller and should lead to shorter runs. This should give us a richer set of alternatives for nightly testing.

Analysis of RVV to TCG to x86:

We have tools in place to see how QEMU translates RVV instructions into TCG operations and then into native x86 code.

Given an RVV instruction like vle32.v, you can see below which TCG operations are generated and then which x86_64 instructions are generated to support them:

RVV:

    vle32.v v0,(a1)

TCG ops:

    brcond_i64 vstart,vl,geu,$L1
    add_i64 loc3,env,$0x200
    add_i64 loc4,env,$0x200
    call vle32_v,$0x0,$0,loc3,loc4,x11/a1,env,$0x60601
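
Our reading of this dump (our interpretation, not taken from the QEMU documentation): the brcond_i64 skips the operation when vstart >= vl (geu is an unsigned >= comparison), the add_i64 ops compute pointers into the vCPU state at offset 0x200 for the vector register operands, and the call invokes the out-of-line vle32_v helper, passing a descriptor ($0x60601) that encodes the current vector configuration.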

x86_64:

    movl    -16(%rbp), %ebx
    testl   %ebx, %ebx
    jl      512
    movb    $0, -12(%rbp)
    movq    4632(%rbp), %rbx
    movq    4624(%rbp), %r12
    cmpq    %r12, %rbx
    jae     33
    leaq    512(%rbp), %rdi
    leaq    512(%rbp), %rsi
    movl    $394753, %r8d                   # imm = 0x60601
    movq    %rbp, %rcx
    movq    88(%rbp), %rdx
    callq   *481(%rip)

This is just a small example focused on vle32.v, but it lets us directly compare the input assembly with the corresponding TCG basic blocks and the native translation of those blocks. This is going to be extremely useful in our optimization work.
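
For reference, dumps like the one above can be produced with QEMU's standard logging flags; the guest binary name and vector configuration below are illustrative:

    ~$ qemu-riscv64 -cpu rv64,v=true,vlen=128 \
           -d in_asm,op,out_asm -D qemu.log ./vle32_test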

Statistics

Memory/string operations

No data available yet.
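
For context, a minimal RVV memcpy along the lines of the canonical example in the RISC-V Vector specification looks as follows; the label and register allocation are illustrative, and our actual benchmark versions may differ:

    # Minimal RVV memcpy sketch (after the example in the RVV spec).
    # a0 = dst, a1 = src, a2 = byte count; returns dst in a0.
    memcpy_vec:
        mv      a3, a0                    # keep dst for the return value
    1:  vsetvli t0, a2, e8, m8, ta, ma    # vl = min(VLMAX, remaining bytes)
        vle8.v  v0, (a1)                  # load vl bytes from src
        sub     a2, a2, t0                # remaining -= vl
        add     a1, a1, t0                # src += vl
        vse8.v  v0, (a3)                  # store vl bytes to dst
        add     a3, a3, t0                # dst += vl
        bnez    a2, 1b                    # loop until all bytes are copied
        ret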

Individual RVV instruction performance

We have infrastructure in place to run performance tests around a single V instruction, e.g.:

    #include <stdlib.h>

    int
    main (int   argc,
          char *argv[])
    {
      /* Number of iterations to run, in millions, from the command line.  */
      size_t num_millions = atoi (argv[1]);
      size_t num_loops = num_millions * 1000000;

      /* Execute the instruction under test once per iteration.  */
      for (size_t i = 0; i < num_loops; i++)
        __asm__ ("vadd.vx\t v0, v0, %0" : : "r"(i) : "v0");

      return 0;
    }

The aim is to run one of these tests twice with substantially different iteration counts and use the difference in execution time to obtain a MIPS measurement of QEMU's emulation of that instruction:

    ~$ qemu-riscv64 main.exe 2000
    ~$ qemu-riscv64 main.exe 1000

Differencing the two runs measures the cost of the extra 1 billion iterations; for the vadd.vx test above this currently comes out at around 70 MIPS.
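
As a worked example of the arithmetic (the timings here are hypothetical):

    # suppose the two runs take 28.6 s and 14.3 s respectively:
    # (2000 - 1000) * 1,000,000 iterations / (28.6 - 14.3) s = ~70 MIPS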

SPEC CPU 2017 performance

You can find the baseline execution time and instruction count of the SPEC CPU 2017 benchmarks here.
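
As background (this is not necessarily the exact method used for the linked results), QEMU instruction counts can be gathered with the instruction-counting TCG plugin shipped in the QEMU tree; the plugin path depends on the build:

    ~$ qemu-riscv64 -plugin /path/to/libinsn.so -d plugin ./benchmark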

Actions

2024-05-08

  • Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.

  • Jeremy to set up "smoke test" SPEC CPU 2017 run using the integer and floating point specrand programs (should run in minutes).

  • Daniel to provide git send-email macro for QEMU patches

    • COMPLETE:

          git send-email --suppress-cc=sob --to [email protected] \
              --cc [email protected] --cc <list of maintainers>
  • Daniel to provide Paolo with list of maintainer email addresses

    • COMPLETE.
  • Daniel to advise Paolo on the patch checking script.

    • COMPLETE. Use ./scripts/checkpatch.pl

2024-05-01

  • Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
    • WIP: we are working on running the reproducers to see the TCG ops generated by QEMU.
  • Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
  • The bionic benchmarks may be a useful source of small benchmarks.

Risk register

The risk register is held in a shared spreadsheet.

We will keep it updated continuously and report any changes each week.

There are no changes to the risk register this week.

Planned absences

Helene will be on vacation from the 1st to the 12th of May.