- Paolo Savini (Embecosm)
- Helene Chelin (Embecosm)
- Jeremy Bennett (Embecosm)
- Hugh O'Keeffe (Ashling)
- Nadim Shehayed (Ashling)
- Daniel Barboza (Ventana)
- WP1:
  - Prepare routine/nightly runs of benchmarks. The infrastructure is set up; we need to populate it with the latest testing tools.
  - Run the single instruction benchmarks with vle64.v/vse64.v (see below).
- WP2:
  - We are working on the optimization of the vext_ldst_us helper function for vle8.v; the aim is to combine multiple byte loads and stores into one.
- WP1:
  - Run all the latest tests.
- WP2:
  - Implement a more efficient loop for vle8.v.
  - Explore optimization through the use of builtins such as __builtin_memcpy (a brief sketch follows below).
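To illustrate the kind of change we have in mind, here is a minimal sketch contrasting a per-byte copy loop with a single __builtin_memcpy call for a simple unit-stride, unmasked access. This is not the actual QEMU helper; the function names and signatures below are ours for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only -- not QEMU code.  A per-byte loop of the
 * kind currently used for vle8.v elements, and the equivalent single
 * call to __builtin_memcpy, which the compiler can expand into wide
 * host loads and stores. */

static void copy_bytewise(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];             /* one access per element */
    }
}

static void copy_builtin(uint8_t *dst, const uint8_t *src, size_t len)
{
    __builtin_memcpy(dst, src, len); /* one call for the whole block */
}
```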
Our current set of agreed priorities is taken from the Statement of Work. It lists the following priorities, which trade off the functionality targeted against the architectures supported.
- vector load/store ops for x86_64 AVX
- vector load/store ops for AArch64/Neon
- vector integer ALU ops for x86_64 AVX
- vector load/store ops for Intel AVX10
For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.
- WP0: Infrastructure
- WP1: Analysis of vector load/store ops on x86_64 AVX
- WP2: Optimization of vector load/store ops on x86_64 AVX
- WP3: Analysis of vector load/store ops on AArch64/Neon
- WP4: Optimization of vector load/store ops on AArch64/Neon
- WP5: Analysis of integer ALU ops on x86_64 AVX
- WP6: Optimization of integer ALU ops on x86_64 AVX
- WP7: Analysis of vector load/store ops on Intel AVX10
- WP8: Optimization of vector load/store ops on Intel AVX10
These priorities can be revised by agreement with RISE during the project.
Following on from our previous report looking at vle8.v and vse8.v, we have carried out a detailed performance analysis of the vle64.v and vse64.v instructions. The only difference is that the element width (the notional unit of copying) is 64 bits rather than 8 bits. Full details can be found in this Google Sheet.
As was the case with vle8.v and vse8.v, the QEMU performance (measured as ns/instruction) is proportional to the number of bytes being copied, up to the point where the vector register group (VLEN*LMUL) is full. For example, this is the data for reading up to 1,024 double words with LMUL=8; with VLEN=1024 the register group holds 8,192 bits (128 double words), which is where that column levels off:
length | VLEN=128 | VLEN=256 | VLEN=512 | VLEN=1024 |
---|---|---|---|---|
1 | 23.55 | 23.44 | 23.41 | 23.48 |
2 | 41.84 | 41.86 | 41.89 | 41.97 |
4 | 69.12 | 68.91 | 70.12 | 69.38 |
8 | 123.19 | 123.81 | 123.31 | 125.81 |
16 | 242.12 | 247.38 | 241.62 | 243.38 |
32 | 281.50 | 541.25 | 581.25 | 548.50 |
64 | 282.00 | 547.00 | 1104.00 | 1068.00 |
128 | 282.00 | 541.00 | 1084.00 | 2109.00 |
256 | 282.00 | 538.00 | 1078.00 | 2102.00 |
512 | 279.00 | 541.00 | 1076.00 | 2105.00 |
1,024 | 279.00 | 537.00 | 1063.00 | 2100.00 |
The interesting case is when we compare the 8-bit and 64-bit versions loading or storing the same total number of bytes. The following table shows the time per instruction (in ns) when loading various lengths of data (in bytes) using vle8.v and vle64.v with VLEN=1024 and LMUL=8.
length | vle8.v | vle64.v | ratio |
---|---|---|---|
8 | 108.75 | 23.48 | 4.63 |
16 | 210.00 | 41.97 | 5.00 |
32 | 396.50 | 69.38 | 5.71 |
64 | 770.00 | 125.81 | 6.12 |
128 | 1520.00 | 243.38 | 6.25 |
256 | 3014.00 | 548.50 | 5.49 |
512 | 6034.00 | 1068.00 | 5.65 |
1,024 | 11973.00 | 2109.00 | 5.68 |
2,048 | 11964.00 | 2102.00 | 5.69 |
4,096 | 11970.00 | 2105.00 | 5.69 |
8,192 | 11967.00 | 2100.00 | 5.70 |
The results are summarized in this graph.
Even for smaller vectors and lower LMUL we see the effect. The following table is for VLEN=128 and LMUL=1:
length | vle8.v | vle64.v | ratio |
---|---|---|---|
8 | 108.88 | 23.74 | 4.59 |
16 | 209.88 | 41.45 | 5.06 |
32 | 210.50 | 55.25 | 3.81 |
64 | 209.00 | 48.75 | 4.29 |
128 | 207.00 | 50.25 | 4.12 |
256 | 212.00 | 57.50 | 3.69 |
512 | 210.00 | 55.00 | 3.82 |
1,024 | 213.00 | 56.00 | 3.80 |
2,048 | 207.00 | 52.00 | 3.98 |
4,096 | 210.00 | 51.00 | 4.12 |
8,192 | 208.00 | 51.00 | 4.08 |
The same effect is seen when comparing the store instructions vse8.v and vse64.v, although the ratios are marginally smaller.
The obvious quick-win strategy is to perform simple block loads and stores (which we surmise are the majority) using double words wherever possible. We are working on an implementation of this.
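As a rough illustration, and assuming a unit-stride, unmasked access that maps to one contiguous host buffer, the sketch below copies the bulk of the data in 64-bit chunks and falls back to bytes for the tail. The function name and signature are hypothetical and do not reflect the actual vext_ldst_us helper, which must also handle masking, element ordering, and page boundaries.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the "quick win": service a simple block copy
 * in 64-bit chunks instead of byte by byte.  Not the QEMU helper. */
static void block_copy_dwords(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t ndwords = len / 8;

    /* Bulk of the data: one double-word copy per iteration.  A memcpy
     * of a fixed 8-byte size is alignment-safe and is typically lowered
     * to a single 64-bit load/store pair. */
    for (size_t i = 0; i < ndwords; i++) {
        uint64_t tmp;
        memcpy(&tmp, s + i * 8, sizeof(tmp));
        memcpy(d + i * 8, &tmp, sizeof(tmp));
    }

    /* Remaining len % 8 bytes are copied byte by byte. */
    for (size_t i = ndwords * 8; i < len; i++) {
        d[i] = s[i];
    }
}
```

In the real helper this fast path would only apply to the simple block case; masked or strided accesses would keep the existing element-by-element path.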
No changes to report since last week.
length | s time | v1 time | v8 time | s Micount | v1 Micount | v8 Micount | s ns/inst | v1 ns/inst | v8 ns/inst |
---|---|---|---|---|---|---|---|---|---|
1 | 0.16 | 0.14 | 0.12 | 73 | 19 | 19 | 2.19 | 7.37 | 6.32 |
2 | 0.22 | 0.12 | 0.14 | 89 | 19 | 19 | 2.47 | 6.32 | 7.37 |
4 | 0.24 | 0.20 | 0.17 | 121 | 19 | 19 | 1.98 | 10.53 | 8.95 |
8 | 0.23 | 0.29 | 0.27 | 95 | 19 | 19 | 2.42 | 15.26 | 14.21 |
16 | 0.23 | 0.44 | 0.44 | 111 | 19 | 19 | 2.07 | 23.16 | 23.16 |
32 | 0.28 | 0.86 | 0.81 | 143 | 26 | 19 | 1.96 | 33.08 | 42.63 |
64 | 0.43 | 1.74 | 1.45 | 207 | 40 | 19 | 2.08 | 43.50 | 76.32 |
128 | 0.57 | 3.34 | 2.85 | 293 | 68 | 19 | 1.95 | 49.12 | 150.00 |
256 | 0.89 | 6.62 | 5.67 | 451 | 124 | 26 | 1.97 | 53.39 | 218.08 |
512 | 1.52 | 13.28 | 11.27 | 767 | 236 | 40 | 1.98 | 56.27 | 281.75 |
1,024 | 2.83 | 26.28 | 22.45 | 1,448 | 460 | 68 | 1.95 | 57.13 | 330.15 |
2,048 | 5.49 | 52.41 | 44.85 | 2,810 | 908 | 124 | 1.95 | 57.72 | 361.69 |
4,096 | 10.72 | 105.43 | 89.67 | 5,534 | 1,804 | 236 | 1.94 | 58.44 | 379.96 |
8,192 | 21.21 | 210.61 | 179.43 | 10,933 | 3,596 | 460 | 1.94 | 58.57 | 390.07 |
16,384 | 42.54 | 421.70 | 359.58 | 21,731 | 7,180 | 908 | 1.96 | 58.73 | 396.01 |
No changes since last week.
These results are in ns/instruction.
length | LMUL | VLEN=128 | VLEN=256 | VLEN=512 | VLEN=1024 |
---|---|---|---|---|---|
1 | 1 | 21.13 | 21.15 | 21.13 | 21.14 |
2 | 1 | 33.44 | 33.48 | 33.50 | 33.52 |
4 | 1 | 62.16 | 62.16 | 62.09 | 62.12 |
8 | 1 | 108.88 | 108.88 | 108.94 | 109.06 |
16 | 1 | 209.88 | 210.12 | 209.75 | 209.75 |
32 | 1 | 210.50 | 396.50 | 396.75 | 397.25 |
64 | 1 | 209.00 | 396.50 | 772.00 | 771.50 |
128 | 1 | 207.00 | 397.00 | 771.00 | 1524.00 |
256 | 1 | 212.00 | 399.00 | 772.00 | 1523.00 |
512 | 1 | 210.00 | 398.00 | 771.00 | 1521.00 |
1,024 | 1 | 213.00 | 398.00 | 773.00 | 1521.00 |
2,048 | 1 | 207.00 | 395.00 | 773.00 | 1519.00 |
4,096 | 1 | 210.00 | 395.00 | 771.00 | 1522.00 |
8,192 | 1 | 208.00 | 401.00 | 768.00 | 1521.00 |
You can find the baseline execution time and instruction count of the SPEC CPU 2017 benchmarks here.
2024-05-22
- Jeremy to run baseline results for the other flavours of the vle*.v/vse*.v instructions.
  - COMPLETE (see details above).
- Paolo to check the ARM SVE example mentioned in the GitLab issue (see 2024-05-01).
  - COMPLETE: a good example for when we need to implement vle8.v with more direct access to the host.
2024-05-15
- Jeremy to look at the impact of masked vs. unmasked and strided vs. unstrided vector operations.
  - Lower priority.
2024-05-08
- Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.
  - Low priority; deferred to prioritize the smoke test work.
2024-05-01
- Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
  - IN PROGRESS: reproduction deferred to prioritize the ARM analysis.
  - So far we have not seen the execution time difference reported in the issue; we need to check the context.
  - The bionic benchmarks may be a useful source of small benchmarks.
  - Regarding the ARM example: it might be tricky to map each load/store to the right host operations, but that is the kind of optimization we are aiming for.
- Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
The risk register is held in a shared spreadsheet. We will keep it updated continuously and report any changes each week.
There are no changes to the risk register this week.
Jeremy will be on vacation from the 7th to the 16th of June. Paolo will be on vacation from the 20th to the 24th of June.