RISE RP005 QEMU weekly report 2024-05-29

Overview

Introducing the project team

  • Paolo Savini (Embecosm)
  • Helene Chelin (Embecosm)
  • Jeremy Bennett (Embecosm)
  • Hugh O'Keeffe (Ashling)
  • Nadim Shehayed (Ashling)
  • Daniel Barboza (Ventana)

Work completed since last report

  • WP1:

    • Prepare routine/nightly runs of benchmarks.
      • The infrastructure is set up. We need to populate it with the latest testing tools.
    • Run the single instruction benchmarks with vle64.v/vse64.v (see below).
  • WP2:

    • We are working on the optimization of the vext_ldst_us helper function for vle8.v.
      • The aim is to combine multiple byte loads and stores into a single wider access.

Work planned for the coming week

  • WP1:

    • Run all the latest tests.
  • WP2:

    • Implement a more efficient loop for vle8.v.
    • Explore optimization through the usage of builtins like __builtin_memcpy.

Current priorities

Our current set of agreed priorities is taken from the Statement of Work. It lists the following priorities, which trade off the functionality targeted against the architectures supported.

  • vector load/store ops for x86_64 AVX
  • vector load/store ops for AArch64/Neon
  • vector integer ALU ops for x86_64 AVX
  • vector load/store ops for Intel AVX10

For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.

  • WP0: Infrastructure
  • WP1: Analysis of vector load/store ops on x86_64 AVX
  • WP2: Optimization of vector load/store ops on x86_64 AVX
  • WP3: Analysis of vector load/store ops on AArch64/Neon
  • WP4: Optimization of vector load/store ops on AArch64/Neon
  • WP5: Analysis of integer ALU ops on x86_64 AVX
  • WP6: Optimization of integer ALU ops on x86_64 AVX
  • WP7: Analysis of vector load/store ops on Intel AVX10
  • WP8: Optimization of vector load/store ops on Intel AVX10

These priorities can be revised by agreement with RISE during the project.

Detailed description of work

WP1

Performance of vle64.v and vse64.v

Following on from our previous report looking at vle8.v and vse8.v, we have carried out a detailed performance analysis of the vle64.v and vse64.v instructions. The only difference is that the element width (the notional unit of copying) is 64 bits rather than 8 bits. Full details can be found in this Google Sheet.

As was the case with vle8.v and vse8.v, the QEMU performance (measured as ns/instruction) is proportional to the number of bytes being copied, up to the point where the vector register group (VLEN*LMUL bits) is full. For example, this is the data for reading up to 1,024 double words with LMUL=8:

| length (double words) | VLEN=128 | VLEN=256 | VLEN=512 | VLEN=1024 |
|---|---|---|---|---|
| 1 | 23.55 | 23.44 | 23.41 | 23.48 |
| 2 | 41.84 | 41.86 | 41.89 | 41.97 |
| 4 | 69.12 | 68.91 | 70.12 | 69.38 |
| 8 | 123.19 | 123.81 | 123.31 | 125.81 |
| 16 | 242.12 | 247.38 | 241.62 | 243.38 |
| 32 | 281.50 | 541.25 | 581.25 | 548.50 |
| 64 | 282.00 | 547.00 | 1104.00 | 1068.00 |
| 128 | 282.00 | 541.00 | 1084.00 | 2109.00 |
| 256 | 282.00 | 538.00 | 1078.00 | 2102.00 |
| 512 | 279.00 | 541.00 | 1076.00 | 2105.00 |
| 1,024 | 279.00 | 537.00 | 1063.00 | 2100.00 |
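
The saturation points in this table line up with the maximum number of elements a register group can hold, which the RVV specification defines as VLMAX = LMUL × VLEN / SEW. The following is a minimal sketch of that arithmetic. The function names are ours, for illustration only, and the sketch assumes the benchmark requests `length` elements per instruction; using `min()` is a simplification of how `vl` is chosen, which holds for the power-of-two lengths used here.

```python
def vlmax(vlen_bits, lmul, sew_bits):
    """Maximum number of elements in one register group (RVV: LMUL * VLEN / SEW)."""
    return vlen_bits * lmul // sew_bits


def elements_per_instruction(length, vlen_bits, lmul, sew_bits):
    """Elements actually handled by one instruction (simplified model)."""
    return min(length, vlmax(vlen_bits, lmul, sew_bits))


# vle64.v with LMUL=8: where does each VLEN column stop growing?
for vlen in (128, 256, 512, 1024):
    print(f"VLEN={vlen}: VLMAX={vlmax(vlen, 8, 64)} double words")
```

Under these assumptions the register group is full at 16, 32, 64 and 128 double words for VLEN=128, 256, 512 and 1024 respectively, which is roughly where each column levels off.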

The interesting case is when we compare the 8-bit and 64-bit versions loading or storing the same total number of bytes. The following table shows the time per instruction (in ns) when loading various lengths of data (in bytes) using vle8.v and vle64.v with VLEN=1024 and LMUL=8.

| length (bytes) | vle8.v | vle64.v | ratio |
|---|---|---|---|
| 8 | 108.75 | 23.48 | 4.63 |
| 16 | 210.00 | 41.97 | 5.00 |
| 32 | 396.50 | 69.38 | 5.71 |
| 64 | 770.00 | 125.81 | 6.12 |
| 128 | 1520.00 | 243.38 | 6.25 |
| 256 | 3014.00 | 548.50 | 5.49 |
| 512 | 6034.00 | 1068.00 | 5.65 |
| 1,024 | 11973.00 | 2109.00 | 5.68 |
| 2,048 | 11964.00 | 2102.00 | 5.69 |
| 4,096 | 11970.00 | 2105.00 | 5.69 |
| 8,192 | 11967.00 | 2100.00 | 5.70 |

The results are summarized in this graph.

Figure: QEMU vse8.v performance for LMUL=8
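
To make the comparison concrete, the ratio column and an equivalent cost per byte can be recomputed from the 1,024-byte row of the VLEN=1024, LMUL=8 table. This is only a sketch of the arithmetic; the helper names are ours and the figures are copied from that row.

```python
# ns/instruction for a 1,024-byte load with VLEN=1024, LMUL=8 (from the table).
NBYTES = 1024
VLE8_NS = 11973.00
VLE64_NS = 2109.00


def ns_per_byte(ns_per_inst, nbytes):
    """Average emulation cost of one byte moved by the instruction."""
    return ns_per_inst / nbytes


def ratio(slow_ns, fast_ns):
    """How many times slower the first instruction is than the second."""
    return slow_ns / fast_ns


print(f"vle8.v : {ns_per_byte(VLE8_NS, NBYTES):.2f} ns/byte")
print(f"vle64.v: {ns_per_byte(VLE64_NS, NBYTES):.2f} ns/byte")
print(f"ratio  : {ratio(VLE8_NS, VLE64_NS):.2f}")
```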

We see the same effect even for smaller vectors and LMUL. The following table is for VLEN=128 and LMUL=1, again with lengths in bytes and times in ns/instruction.

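| length (bytes) | vle8.v | vle64.v | ratio |
|---|---|---|---|
| 8 | 108.88 | 23.74 | 4.59 |
| 16 | 209.88 | 41.45 | 5.06 |
| 32 | 210.50 | 55.25 | 3.81 |
| 64 | 209.00 | 48.75 | 4.29 |
| 128 | 207.00 | 50.25 | 4.12 |
| 256 | 212.00 | 57.50 | 3.69 |
| 512 | 210.00 | 55.00 | 3.82 |
| 1,024 | 213.00 | 56.00 | 3.80 |
| 2,048 | 207.00 | 52.00 | 3.98 |
| 4,096 | 210.00 | 51.00 | 4.12 |
| 8,192 | 208.00 | 51.00 | 4.08 |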

The same effect is seen when comparing store instructions between vse8.v and vse64.v, although the ratios are marginally smaller.

The obvious quick-win strategy is to perform simple block loads and stores (which we surmise are the majority) using double words wherever possible. We are working on an implementation of this.
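
The idea can be illustrated with a small model that counts accesses. This is not QEMU's vext_ldst_us code: it is a hypothetical Python sketch that ignores alignment, endianness, masking and fault handling, and only shows how a contiguous byte copy splits into double-word accesses plus a byte-wise tail.

```python
def copy_bytewise(dst, src, n):
    """Copy n bytes one at a time; returns the number of accesses made."""
    for i in range(n):
        dst[i] = src[i]
    return n


def copy_wordwise(dst, src, n, word=8):
    """Copy n bytes using `word`-byte chunks, then a byte-wise tail.

    Returns the number of accesses made.
    """
    accesses = 0
    i = 0
    # Bulk of the copy: one access per full double word.
    while i + word <= n:
        dst[i:i + word] = src[i:i + word]
        i += word
        accesses += 1
    # Tail: remaining bytes that do not fill a double word.
    while i < n:
        dst[i] = src[i]
        i += 1
        accesses += 1
    return accesses


src = bytes(range(100))
by_byte = bytearray(100)
by_word = bytearray(100)
print(copy_bytewise(by_byte, src, 100))   # 100 accesses
print(copy_wordwise(by_word, src, 100))   # 12 double words + 4 bytes = 16 accesses
```

In this model a 100-byte copy drops from 100 accesses to 16. The measured ratios above are smaller than 8, so the real saving will depend on the fixed per-instruction and per-access overheads in QEMU.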

Statistics

Memory/string operations

No changes to report since last week.

| length | s time | v1 time | v8 time | s Micount | v1 Micount | v8 Micount | s ns/inst | v1 ns/inst | v8 ns/inst |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.16 | 0.14 | 0.12 | 73 | 19 | 19 | 2.19 | 7.37 | 6.32 |
| 2 | 0.22 | 0.12 | 0.14 | 89 | 19 | 19 | 2.47 | 6.32 | 7.37 |
| 4 | 0.24 | 0.20 | 0.17 | 121 | 19 | 19 | 1.98 | 10.53 | 8.95 |
| 8 | 0.23 | 0.29 | 0.27 | 95 | 19 | 19 | 2.42 | 15.26 | 14.21 |
| 16 | 0.23 | 0.44 | 0.44 | 111 | 19 | 19 | 2.07 | 23.16 | 23.16 |
| 32 | 0.28 | 0.86 | 0.81 | 143 | 26 | 19 | 1.96 | 33.08 | 42.63 |
| 64 | 0.43 | 1.74 | 1.45 | 207 | 40 | 19 | 2.08 | 43.50 | 76.32 |
| 128 | 0.57 | 3.34 | 2.85 | 293 | 68 | 19 | 1.95 | 49.12 | 150.00 |
| 256 | 0.89 | 6.62 | 5.67 | 451 | 124 | 26 | 1.97 | 53.39 | 218.08 |
| 512 | 1.52 | 13.28 | 11.27 | 767 | 236 | 40 | 1.98 | 56.27 | 281.75 |
| 1,024 | 2.83 | 26.28 | 22.45 | 1,448 | 460 | 68 | 1.95 | 57.13 | 330.15 |
| 2,048 | 5.49 | 52.41 | 44.85 | 2,810 | 908 | 124 | 1.95 | 57.72 | 361.69 |
| 4,096 | 10.72 | 105.43 | 89.67 | 5,534 | 1,804 | 236 | 1.94 | 58.44 | 379.96 |
| 8,192 | 21.21 | 210.61 | 179.43 | 10,933 | 3,596 | 460 | 1.94 | 58.57 | 390.07 |
| 16,384 | 42.54 | 421.70 | 359.58 | 21,731 | 7,180 | 908 | 1.96 | 58.73 | 396.01 |
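
The ns/inst columns appear to be derived from the other two column groups. The sketch below assumes that the time columns are in seconds and the Micount columns are in millions of instructions; this is inferred from the figures rather than stated in the report, and the helper name is ours.

```python
def ns_per_instruction(time_s, micount):
    """Average ns per instruction.

    Assumes time_s is in seconds and micount is in millions of instructions:
    seconds -> ns is a factor of 1e9, millions -> units is 1e6, net factor 1e3.
    """
    return time_s / micount * 1000.0


# Spot checks against rows of the table (length 1 and length 16,384).
print(f"{ns_per_instruction(0.16, 73):.2f}")     # scalar, length 1
print(f"{ns_per_instruction(0.14, 19):.2f}")     # v1, length 1
print(f"{ns_per_instruction(359.58, 908):.2f}")  # v8, length 16,384
```

With these assumptions the computed values match the table entries 2.19, 7.37 and 396.01.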

Individual RVV instruction performance

No changes since last week.

These results are in ns/instruction.

| length | LMUL | VLEN=128 | VLEN=256 | VLEN=512 | VLEN=1024 |
|---|---|---|---|---|---|
| 1 | 1 | 21.13 | 21.15 | 21.13 | 21.14 |
| 2 | 1 | 33.44 | 33.48 | 33.50 | 33.52 |
| 4 | 1 | 62.16 | 62.16 | 62.09 | 62.12 |
| 8 | 1 | 108.88 | 108.88 | 108.94 | 109.06 |
| 16 | 1 | 209.88 | 210.12 | 209.75 | 209.75 |
| 32 | 1 | 210.50 | 396.50 | 396.75 | 397.25 |
| 64 | 1 | 209.00 | 396.50 | 772.00 | 771.50 |
| 128 | 1 | 207.00 | 397.00 | 771.00 | 1524.00 |
| 256 | 1 | 212.00 | 399.00 | 772.00 | 1523.00 |
| 512 | 1 | 210.00 | 398.00 | 771.00 | 1521.00 |
| 1,024 | 1 | 213.00 | 398.00 | 773.00 | 1521.00 |
| 2,048 | 1 | 207.00 | 395.00 | 773.00 | 1519.00 |
| 4,096 | 1 | 210.00 | 395.00 | 771.00 | 1522.00 |
| 8,192 | 1 | 208.00 | 401.00 | 768.00 | 1521.00 |

SPEC CPU 2017 performance

You can find the baseline execution time and instruction count of the SPEC CPU 2017 benchmarks here

Actions

2024-05-22

  • Jeremy to run baseline results for the other flavours of the vle*.v/vse*.v instructions.
    • COMPLETE (see details above)
  • Paolo to check the ARM SVE example mentioned in the GitLab issue (see 2024-05-01).
    • COMPLETE: it will be a good example once we need to implement vle8.v with more direct access to the host.

2024-05-15

  • Jeremy to look at the impact of masked vs. unmasked and strided vs. unstrided on vector operations.
    • Lower priority.

2024-05-08

  • Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.
    • Low priority; deferred to prioritize the smoke tests work.

2024-05-01

  • Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
    • IN PROGRESS: Reproduction deferred to prioritize ARM analysis.
    • So far we have not seen the execution time difference reported in the issue. We need to check the context.
    • The bionic benchmarks may be a useful source of small benchmarks.
    • Taking the ARM example: it might be tricky to map each load/store to the right host operations, but that is the kind of optimization we are aiming for.
  • Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.

Risk register

The risk register is held in a shared spreadsheet.

We will keep it updated continuously and report any changes each week.

There are no changes to the risk register this week.

Planned absences

Jeremy will be on vacation from the 7th to the 16th of June. Paolo will be on vacation from the 20th to the 24th of June.