Optimizations for Armv8-A #50

volyrique · 2019-05-11T17:25:38Z

This commit adapts the past SIMD optimizations to the Neon extension that is part of the Armv8-A architecture. The changes apply only to the AArch64 execution state (32-bit code would require more work due to the smaller general-purpose registers, so I didn't bother).

In order to gather some performance data, I compiled the included benchmark bench.c using:
gcc -Wall -Wextra -Ofast -flto -march=native -g -o bench bench.c picohttpparser.c
And then ran it with:
taskset -c 1 time -f "%e" ./bench

I ran the benchmark on Ubuntu 18.04 using 3 different cloud instances:

a1.large on Amazon Web Services - uses AWS Graviton processors that are apparently based on Arm Cortex-A72
c1.large.arm on Packet - uses Cavium ThunderX processors (note that it is the first version)
c2.large.arm also on Packet - uses Ampere eMAG processors

Here are the median results from 20 runs - all times are in seconds:

Instance	Base time	New time	Change
a1.large	7.62	4.63	-39.24%
c1.large.arm	16.23	13.36	-17.68%
c2.large.arm	6.29	6.88	+9.38%

Standard errors are 0.1% or less in all cases.

I don't have a good explanation for the regression on Ampere eMAG right now, but I noticed that compiling with Clang produced slightly better times (though still slower than the baseline), so a probable partial explanation is that the software support for the microarchitecture (which is the most recent one) can be improved (or it will be a while until the enhancements make their way into the OS images that can actually be deployed). Unfortunately, I couldn't find any optimization guide for the processor, and the support for the hardware performance counters seemed flaky, so it was a bit difficult to do a deeper analysis.

It should also be possible to optimize the parse_headers() function using the TBL instruction, but that would require transforming token_char_map into a bit array (or something similar), so that it fits into at most 4 vector registers.

I also have an initial implementation (not tested much and certainly not benchmarked) using the Scalable Vector Extension (SVE) in a branch in my fork of the repository.

kazuho · 2019-05-17T03:03:45Z

Thank you for the PR.

The results are interesting. I'll check the numbers on A52 that I happen to have.

volyrique · 2019-06-02T04:08:47Z

I figured out that header values had a higher chance of being large, so I decided to unroll the vector loop in get_token_to_eol() a bit. Here are the new results:

Instance	Base time	New time	Change
a1.large	7.62	5.13	-32.68%
m6g.large	5.19	2.97	-42.77%
c1.large.arm	16.23	12.54	-22.74%
c2.large.arm	6.29	6.20	-1.43%

The result on the Amazon instance is a bit worse (but still significantly faster than the scalar version), while all the other values have improved; in particular, there is no longer a performance regression on the Ampere eMAG. Standard errors are 0.26% or less.

P.S. I added results for a m6g.large instance on Amazon Web Services, which uses an AWS Graviton2 processor (based on Arm Neoverse N1) and which has recently become available. It ran Ubuntu 18.04 as in the other cases.

volyrique · 2019-10-31T01:10:20Z

Now that Travis CI supports testing in an Arm64 environment, I have enabled it for this project.

I think I also have a pretty good idea about why the performance on the Ampere eMAG is not that good. After some experiments, I have determined that the vector instruction throughput on that machine is 0.50 (instructions per cycle), while on Arm Cortex-A72 it is 1.49 (probably 1.5 - there is some measurement noise). Those values are for vector bitwise operations and comparisons, which are the main operations executed by my optimization. For comparison, the scalar addition throughput is 1.99 in both cases (again, probably 2.00). As a result, it is worth vectorizing on the Ampere machine mainly if there is a significant amount of data to process, so it is not surprising that the second version of my changes, which has raised the threshold for switching from scalar to vector code, behaves better.

As for the hardware performance counters being problematic on the Ampere eMAG - it turns out that there are no problems if the counters are specified explicitly on the perf command line using the IDs from the Ampere documentation.

volyrique · 2023-07-20T23:21:16Z

Optimized implementations of the parse_token() function using the TBL instruction or SVE gather loads are available in this and that branch respectively, but neither of them are as convincing as the changes I have proposed here.

The regular SVE optimizations have been merged into this PR.

enghitalo · 2024-02-16T00:47:25Z

up

These changes apply only to the AArch64 execution state.

volyrique force-pushed the armv8-a branch from 2355e5f to c91d34c Compare June 1, 2019 22:44

volyrique force-pushed the armv8-a branch from c91d34c to c219556 Compare June 2, 2019 04:29

volyrique force-pushed the armv8-a branch from c219556 to 29f8c6a Compare October 27, 2019 15:38

volyrique force-pushed the armv8-a branch from 29f8c6a to ddec707 Compare October 3, 2020 14:30

volyrique force-pushed the armv8-a branch from ddec707 to c383cbf Compare December 28, 2020 16:20

volyrique force-pushed the armv8-a branch from c383cbf to 5c4fe2c Compare June 12, 2022 00:44

volyrique force-pushed the armv8-a branch 3 times, most recently from eb11606 to c3967fc Compare September 10, 2022 20:12

volyrique force-pushed the armv8-a branch from c3967fc to af782f3 Compare March 5, 2023 20:04

volyrique force-pushed the armv8-a branch from af782f3 to 23cd1a2 Compare April 1, 2023 20:19

volyrique force-pushed the armv8-a branch from 23cd1a2 to a638f26 Compare February 3, 2024 17:55

Optimizations for Armv8-A

a9b9b56

These changes apply only to the AArch64 execution state.

volyrique force-pushed the armv8-a branch from a638f26 to a9b9b56 Compare February 24, 2024 16:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimizations for Armv8-A #50

Optimizations for Armv8-A #50

volyrique commented May 11, 2019

kazuho commented May 17, 2019 •

edited

Loading

volyrique commented Jun 2, 2019 •

edited

Loading

volyrique commented Oct 31, 2019 •

edited

Loading

volyrique commented Jul 20, 2023

enghitalo commented Feb 16, 2024

Optimizations for Armv8-A #50

Are you sure you want to change the base?

Optimizations for Armv8-A #50

Conversation

volyrique commented May 11, 2019

kazuho commented May 17, 2019 • edited Loading

volyrique commented Jun 2, 2019 • edited Loading

volyrique commented Oct 31, 2019 • edited Loading

volyrique commented Jul 20, 2023

enghitalo commented Feb 16, 2024

kazuho commented May 17, 2019 •

edited

Loading

volyrique commented Jun 2, 2019 •

edited

Loading

volyrique commented Oct 31, 2019 •

edited

Loading