- Paolo Savini (Embecosm)
- Helene Chelin (Embecosm)
- Jeremy Bennett (Embecosm)
- Hugh O'Keeffe (Ashling)
- Nadim Shehayed (Ashling)
- Daniel Barboza (Ventana)
- WP1:
  - Prepare routine/nightly runs of benchmarks. The infrastructure is set up; we need to populate it with the latest testing tools.
  - Run the single instruction benchmarks with vle64.v/vse64.v (see below).
- WP2:
  - We are working on the optimization of the vext_ldst_us helper function for vle8.v; the aim is to combine multiple byte loads and stores into one.
- WP1:
  - Run all the latest tests.
- WP2:
  - Implement a more efficient loop for vle8.v.
  - Explore optimization through the use of builtins such as __builtin_memcpy (a brief sketch follows below).
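To illustrate the kind of change we have in mind, here is a minimal sketch contrasting a per-byte copy loop with a single __builtin_memcpy call for a simple unit-stride, unmasked access. This is not the actual QEMU helper; the function names and signatures below are ours for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only -- not QEMU code.  A per-byte loop of the
 * kind currently used for vle8.v elements, and the equivalent single
 * call to __builtin_memcpy, which the compiler can expand into wide
 * host loads and stores. */

static void copy_bytewise(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];             /* one access per element */
    }
}

static void copy_builtin(uint8_t *dst, const uint8_t *src, size_t len)
{
    __builtin_memcpy(dst, src, len); /* one call for the whole block */
}
```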
Our current set of agreed priorities is taken from the Statement of Work. It lists the following priorities, which trade off the functionality targeted against the architectures supported.
- vector load/store ops for x86_64 AVX
- vector load/store ops for AArch64/Neon
- vector integer ALU ops for x86_64 AVX
- vector load/store ops for Intel AVX10
For each of these there will be an analysis phase and an optimization phase, leading to the following set of work packages.
- WP0: Infrastructure
- WP1: Analysis of vector load/store ops on x86_64 AVX
- WP2: Optimization of vector load/store ops on x86_64 AVX
- WP3: Analysis of vector load/store ops on AArch64/Neon
- WP4: Optimization of vector load/store ops on AArch64/Neon
- WP5: Analysis of integer ALU ops on x86_64 AVX
- WP6: Optimization of integer ALU ops on x86_64 AVX
- WP7: Analysis of vector load/store ops on Intel AVX10
- WP8: Optimization of vector load/store ops on Intel AVX10
These priorities can be revised by agreement with RISE during the project.
Following on from our previous report looking at vle8.v and vse8.v, we have carried out a detailed performance analysis of the vle64.v and vse64.v instructions. The only difference is that the element width (the notional unit of copying) is 64 bits rather than 8 bits. Full details can be found in this Google Sheet.
As was the case with vle8.v and vse8.v, the QEMU performance (measured as ns/instruction) is proportional to the number of bytes being copied, up to the point where the vector register group (VLEN*LMUL) is full. For example, this is the data for reading up to 1,024 double words with LMUL=8; with VLEN=1024 the register group holds 8,192 bits (128 double words), which is where that column levels off:
length | VLEN=128 | VLEN=256 | VLEN=512 | VLEN=1024 |
---|---|---|---|---|
1 | 23.55 | 23.44 | 23.41 | 23.48 |
2 | 41.84 | 41.86 | 41.89 | 41.97 |
4 | 69.12 | 68.91 | 70.12 | 69.38 |
8 | 123.19 | 123.81 | 123.31 | 125.81 |
16 | 242.12 | 247.38 | 241.62 | 243.38 |
32 | 281.50 | 541.25 | 581.25 | 548.50 |
64 | 282.00 | 547.00 | 1104.00 | 1068.00 |
128 | 282.00 | 541.00 | 1084.00 | 2109.00 |
256 | 282.00 | 538.00 | 1078.00 | 2102.00 |
512 | 279.00 | 541.00 | 1076.00 | 2105.00 |
1,024 | 279.00 | 537.00 | 1063.00 | 2100.00 |
The interesting case is when we compare the 8-bit and 64-bit versions loading or storing the same total number of bytes. The following table shows the time per instruction (in ns) when loading various lengths of data (in bytes) using vle8.v and vle64.v with VLEN=1024 and LMUL=8.
length | vle8.v | vle64.v | ratio |
---|---|---|---|
8 | 108.75 | 23.48 | 4.63 |
16 | 210.00 | 41.97 | 5.00 |
32 | 396.50 | 69.38 | 5.71 |
64 | 770.00 | 125.81 | 6.12 |
128 | 1520.00 | 243.38 | 6.25 |
256 | 3014.00 | 548.50 | 5.49 |
512 | 6034.00 | 1068.00 | 5.65 |
1,024 | 11973.00 | 2109.00 | 5.68 |
2,048 | 11964.00 | 2102.00 | 5.69 |
4,096 | 11970.00 | 2105.00 | 5.69 |
8,192 | 11967.00 | 2100.00 | 5.70 |
The results are summarized in this graph.
Even for smaller vectors and lower LMUL we see the effect. The following table is for VLEN=128 and LMUL=1:
length | vle8.v | vle64.v | ratio |
---|---|---|---|
8 | 108.88 | 23.74 | 4.59 |
16 | 209.88 | 41.45 | 5.06 |
32 | 210.50 | 55.25 | 3.81 |
64 | 209.00 | 48.75 | 4.29 |
128 | 207.00 | 50.25 | 4.12 |
256 | 212.00 | 57.50 | 3.69 |
512 | 210.00 | 55.00 | 3.82 |
1,024 | 213.00 | 56.00 | 3.80 |
2,048 | 207.00 | 52.00 | 3.98 |
4,096 | 210.00 | 51.00 | 4.12 |
8,192 | 208.00 | 51.00 | 4.08 |
The same effect is seen when comparing the store instructions vse8.v and vse64.v, although the ratios are marginally smaller.
The obvious quick-win strategy is to perform simple block loads and stores (which we surmise are the majority) using double words wherever possible. We are working on an implementation of this.
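As a rough illustration, and assuming a unit-stride, unmasked access that maps to one contiguous host buffer, the sketch below copies the bulk of the data in 64-bit chunks and falls back to bytes for the tail. The function name and signature are hypothetical and do not reflect the actual vext_ldst_us helper, which must also handle masking, element ordering, and page boundaries.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the "quick win": service a simple block copy
 * in 64-bit chunks instead of byte by byte.  Not the QEMU helper. */
static void block_copy_dwords(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t ndwords = len / 8;

    /* Bulk of the data: one double-word copy per iteration.  A memcpy
     * of a fixed 8-byte size is alignment-safe and is typically lowered
     * to a single 64-bit load/store pair. */
    for (size_t i = 0; i < ndwords; i++) {
        uint64_t tmp;
        memcpy(&tmp, s + i * 8, sizeof(tmp));
        memcpy(d + i * 8, &tmp, sizeof(tmp));
    }

    /* Remaining len % 8 bytes are copied byte by byte. */
    for (size_t i = ndwords * 8; i < len; i++) {
        d[i] = s[i];
    }
}
```

In the real helper this fast path would only apply to the simple block case; masked or strided accesses would keep the existing element-by-element path.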
No changes to report since last week.
length | s time | v1 time | v8 time | s Micount | v1 Micount | v8 Micount | s ns/inst | v1 ns/inst | v8 ns/inst |
---|---|---|---|---|---|---|---|---|---|
1 | 0.16 | 0.14 | 0.12 | 73 | 19 | 19 | 2.19 | 7.37 | 6.32 |
2 | 0.22 | 0.12 | 0.14 | 89 | 19 | 19 | 2.47 | 6.32 | 7.37 |
4 | 0.24 | 0.20 | 0.17 | 121 | 19 | 19 | 1.98 | 10.53 | 8.95 |
8 | 0.23 | 0.29 | 0.27 | 95 | 19 | 19 | 2.42 | 15.26 | 14.21 |
16 | 0.23 | 0.44 | 0.44 | 111 | 19 | 19 | 2.07 | 23.16 | 23.16 |
32 | 0.28 | 0.86 | 0.81 | 143 | 26 | 19 | 1.96 | 33.08 | 42.63 |
64 | 0.43 | 1.74 | 1.45 | 207 | 40 | 19 | 2.08 | 43.50 | 76.32 |
128 | 0.57 | 3.34 | 2.85 | 293 | 68 | 19 | 1.95 | 49.12 | 150.00 |
256 | 0.89 | 6.62 | 5.67 | 451 | 124 | 26 | 1.97 | 53.39 | 218.08 |
512 | 1.52 | 13.28 | 11.27 | 767 | 236 | 40 | 1.98 | 56.27 | 281.75 |
1,024 | 2.83 | 26.28 | 22.45 | 1,448 | 460 | 68 | 1.95 | 57.13 | 330.15 |
2,048 | 5.49 | 52.41 | 44.85 | 2,810 | 908 | 124 | 1.95 | 57.72 | 361.69 |
4,096 | 10.72 | 105.43 | 89.67 | 5,534 | 1,804 | 236 | 1.94 | 58.44 | 379.96 |
8,192 | 21.21 | 210.61 | 179.43 | 10,933 | 3,596 | 460 | 1.94 | 58.57 | 390.07 |
16,384 | 42.54 | 421.70 | 359.58 | 21,731 | 7,180 | 908 | 1.96 | 58.73 | 396.01 |
No changes since last week.
These results are in ns/instruction.
length | LMUL | VLEN=128 | VLEN=256 | VLEN=512 | VLEN=1024 |
---|---|---|---|---|---|
1 | 1 | 21.13 | 21.15 | 21.13 | 21.14 |
2 | 1 | 33.44 | 33.48 | 33.50 | 33.52 |
4 | 1 | 62.16 | 62.16 | 62.09 | 62.12 |
8 | 1 | 108.88 | 108.88 | 108.94 | 109.06 |
16 | 1 | 209.88 | 210.12 | 209.75 | 209.75 |
32 | 1 | 210.50 | 396.50 | 396.75 | 397.25 |
64 | 1 | 209.00 | 396.50 | 772.00 | 771.50 |
128 | 1 | 207.00 | 397.00 | 771.00 | 1524.00 |
256 | 1 | 212.00 | 399.00 | 772.00 | 1523.00 |
512 | 1 | 210.00 | 398.00 | 771.00 | 1521.00 |
1,024 | 1 | 213.00 | 398.00 | 773.00 | 1521.00 |
2,048 | 1 | 207.00 | 395.00 | 773.00 | 1519.00 |
4,096 | 1 | 210.00 | 395.00 | 771.00 | 1522.00 |
8,192 | 1 | 208.00 | 401.00 | 768.00 | 1521.00 |
You can find the baseline execution time and instruction count of the SPEC CPU 2017 benchmarks here.
2024-05-22
- Jeremy to run baseline results for the other flavours of the vle*.v/vse*.v instructions.
  - COMPLETE (see details above).
- Paolo to check the ARM SVE example mentioned in the GitLab issue (see 2024-05-01).
  - COMPLETE: a good example for when we need to implement vle8.v with more direct access to the host.
2024-05-15
- Jeremy to look at the impact of masked vs. unmasked and strided vs. unstrided vector operations.
  - Lower priority.
2024-05-08
- Jeremy to characterise QEMU floating point performance and file it as a performance regression issue in QEMU GitLab.
  - Low priority; deferred to prioritize the smoke test work.
2024-05-01
- Paolo to review the generic issue from Palmer Dabbelt to identify ideas for optimization and benchmarks to reuse.
  - IN PROGRESS: reproduction deferred to prioritize the ARM analysis.
  - So far we have not seen the execution time difference reported in the issue; we need to check the context.
  - The bionic benchmarks may be a useful source of small benchmarks.
  - Regarding the ARM example: it might be tricky to map each load/store to the right host operations, but that is the kind of optimization we are aiming for.
- Daniel to advise Paolo on best practice for preparing QEMU upstream submissions.
The risk register is held in a shared spreadsheet. We will keep it updated continuously and report any changes each week.
There are no changes to the risk register this week.
Jeremy will be on vacation from the 7th to the 16th of June. Paolo will be on vacation from the 20th to the 24th of June.