Skip to content

Latest commit

 

History

History
1053 lines (906 loc) · 65.1 KB

README.md

File metadata and controls

1053 lines (906 loc) · 65.1 KB

SLOTHY Tutorial

This tutorial introduces you to using the SLOTHY superoptimizer for optimizing assembly programs for a specific microarchitecture. It goes beyond what is written in the README or the SLOTHY paper in that it gives more examples on how we, the developers of SLOTHY, typically use SLOTHY to optimize cryptographic code. At the end of the tutorial, you should be familiar with the workflow of using SLOTHY as well as a number of common ways to debug or improve your results.

Introduction to SLOTHY

SLOTHY is a fixed instruction superoptimizer: Its input is assembly and its output is semantically-equivalent optimized assembly using the same instructions and data flow. The fact that SLOTHY does not change instructions is very important both theoretically (in terms of complexity of optimization) and practically (in terms of developer control) and sets SLOTHY apart from synthesizing superoptimizers like souper.

Concretely, SLOTHY performs three main jobs:

  1. (Re-)schedule instructions to hide latencies and improve utilization of all execution units.
  2. Rename registers in case this enables a better scheduling.
  3. Perform software pipelining (aka periodic loop interleaving). We will cover software pipelining in more depth later in this tutorial.

SLOTHY performs these jobs by first lifting the input assembly into a data-flow graph (DFG) modelling dependencies between instructions. At this level, the ordering of instructions and the choice of register names is no longer visible. The goal of SLOTHY, then, is to find a traversal/lowering of the DFG that results in the least number of pipeline stalls. A traversal/lowering of the graph is assigning to each instruction an index at which the instruction will be in the output, plus a choice of registers to be used for its outputs. SLOTHY does so by turning the graph together with information about the (micro)architecture into constraints that are fed into an external constraint solver; so far, we have been using Google OR-tools, but in principle one can use other solvers as well. Constraints come in two flavours: Architectural and microarchitectural. Architectural constraints simply ensure that the resulting code is architecturally valid (e.g. SLOTHY does not use a vector register in a scalar instruction) and functionally correct (it has the same DFG). Microarchitectural constraints imply (hopefully) that the code will run fast on the target; SLOTHY models microarchitectures in terms of issue width, instruction latencies, throughput, forwarding paths, and the number of execution units able to execute certain instructions. We refer to the SLOTHY paper for details of the constraint model, which are not relevant here.

Note again that SLOTHY does (largely) not change instructions: Instruction selection is left to the developer. For cryptographic code -- which is what SLOTHY was developed for -- instruction selection is a core focus of research and highly-optimized instruction sequences implementing a cryptographic (sub-)routine usually exist. Tight control over the choice of instructions is also important from a security perspective, as variable-time instructions have to be avoided.

High-assurance cryptography: While formal verification is not part of SLOTHY itself, there is potential for combining existing formal verification tools with SLOTHY. From a high level, formal verification should be relatively simple, owing to the fact that SLOTHY does not change the DFG: In fact, SLOTHY itself includes a selfcheck that lifts the output assembly back into a DFG and confirms that it is isomorphic to the input DFG via the permutation found by SLOTHY. However, while this is a strong indicator of correctness of the output assembly, it does not amount to a formal verification, as pitfalls do remain (notably bad user configurations and subtleties around modelling memory and load/store offsets which we will not discuss in this tutorial). Research into combining SLOTHY with trusted verification infrastructure is therefore needed. As a first promising example, AWS-LC has recently integrated an implementation of X25519 that was auto-generated by SLOTHY and formally verified using the HOL-Light proof assistant.

Table of contents

  1. Installation. This is limited to the fastest way of installing SLOTHY using pip. For more complete instructions, see the README.
  2. Getting started
  3. Using SLOTHY for your own code
  4. Using SLOTHY's Software Pipelining
  5. Checking the quality of SLOTHY optimizations
  6. Optimizing a full Neon NTT
  7. Optimizing larger pieces of code
  8. Adding a new microarchitecture

The SLOTHY calling code used for the parts 3-7 is located in tutorial-{3a,3b,4,5,6,7}.py.

1. Installation

SLOTHY requires python3 (>= 3.10). The easiest way to install the dependencies of SLOTHY is using pip. It's advised to make use of virtual environment.

The following steps should get you started:

git clone https://github.com/slothy-optimizer/slothy
cd slothy
# setup venv
python3 -m venv venv
source venv/bin/activate
# install dependencies
pip install -r requirements.txt

You can try to run SLOTHY on one of the examples that come with SLOTHY to make sure it runs without errors:

python3 example.py --examples simple0

We will look into more examples shortly and discuss input, output, and available flags.

2. Getting Started

The simplest way to get started using SLOTHY is by trying out some of the examples that come with SLOTHY. Once you work on your own code, you will likely be using the slothy-cli command or calling the SLOTHY module from your own Python script for invoking SLOTHY allowing you to control all the different options SLOTHY has. However, for now we will be using the example.py script and containing a number of examples including the ones we have optimized in the SLOTHY paper. You can run python3 example.py --help to see all examples available.

Let's look at a very simple example from the previous section called aarch64_simple0. You can find the corresponding code in examples/naive/aarch64/aarch64_simple0.s:

ldr q0, [x1, #0]
ldr q1, [x2, #0]

ldr q8,  [x0]
ldr q9,  [x0, #1*16]
ldr q10, [x0, #2*16]
ldr q11, [x0, #3*16]

mul v24.8h, v9.8h, v0.h[0]
sqrdmulh v9.8h, v9.8h, v0.h[1]
mls v24.8h, v9.8h, v1.h[0]
sub     v9.8h,    v8.8h, v24.8h
add     v8.8h,    v8.8h, v24.8h

mul v24.8h, v11.8h, v0.h[0]
sqrdmulh v11.8h, v11.8h, v0.h[1]
mls v24.8h, v11.8h, v1.h[0]
sub     v11.8h,    v10.8h, v24.8h
add     v10.8h,    v10.8h, v24.8h

str q8,  [x0], #4*16
str q9,  [x0, #-3*16]
str q10, [x0, #-2*16]
str q11, [x0, #-1*16]

It contains a straight-line piece of assembly for the Armv8-A architecture. This architecture implements the Neon vector instruction extension and all the instructions in this example are Neon vector instructions. If you have never written Neon assembly before, you do not have to worry about it at this point. All you need to know about the code is that it loads some vectors from memory, performs some arithmetic operations, and writes back the result to memory. Note that there are two independent streams of computation on the four vectors loaded from memory, and, hence, there is quite some possibilities to re-order this code without affecting its semantics. This code is able to run on a variety of different microarchitectures, ranging from low-end energy efficient in-order cores like the Arm Cortex-A55 to high-end out-of-order CPUs with very complex pipelines like the Apple M1 or Arm Neoverse server CPUs. For the in-order cores, the instruction scheduling plays the most essential role as poorly scheduled code is very likely to have poor performance, and hence, we will focus on the Cortex-A55 architecture in the following. Note, however, that SLOTHY has been used to also obtain significant speed-ups for out-of-order cores.

SLOTHY comes with models for various Arm architectures, including the power-efficient, in-order Cortex-A55, so we can now optimize this piece of code for that microarchitecture. example.py contains the needed SLOTHY incarnations for convenience, so we can simply run python3 example.py --examples aarch64_simple0_a55 which will optimize for the Cortex-A55 microarchitecture. You can check example.py for the details. This will optimize the piece of code above and write the output code to examples/opt/aarch64/aarch64_simple0_opt_a55.s. SLOTHY should print something similar to this:

INFO:aarch64_simple0_a55:Instructions in body: 20
INFO:aarch64_simple0_a55.slothy:Perform internal binary search for minimal number of stalls...
INFO:aarch64_simple0_a55.slothy:Attempt optimization with max 32 stalls...
INFO:aarch64_simple0_a55.slothy:Objective: minimize number of stalls
INFO:aarch64_simple0_a55.slothy:Invoking external constraint solver (OR-Tools CP-SAT v9.7.2996) ...
INFO:aarch64_simple0_a55.slothy:[0.0653s]: Found 1 solutions so far... objective 19.0, bound 12.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:[0.0801s]: Found 2 solutions so far... objective 18.0, bound 12.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:OPTIMAL, wall time: 0.180540 s
INFO:aarch64_simple0_a55.slothy:Booleans in result: 449
INFO:aarch64_simple0_a55.slothy.selfcheck:OK!
INFO:aarch64_simple0_a55.slothy:Minimum number of stalls: 18

You can follow the steps SLOTHY performs and see the calls the constraint solver trying to find a re-scheduling of this code containing at most 32 stalls (a default starting point we have set here to speed up the example). At the same time it is trying to minimize the number of stalls. This is passed as an objective to the constraint solver (OR-tools) which tries to find a solution with the minimum number of stalls. The best solution it can find has 16 stalls -- which is guaranteed to be the minimum number of stalls given this piece of code and the model of the microarchitecture in SLOTHY. In the last step, SLOTHY will transform the found traversal of the DFG into actual assembly and write it to the file. To make sure everything worked out as expected, it will perform a selfcheck which consists of transforming the output assembly into a DFG again and testing that the resulting graph is isomorphic to the input DFG.

We can now take a look at the output assembly in examples/opt/aarch64/aarch64_simple0_opt_a55.s:

ldr q8, [x1, #0]                        // *...................
// gap                                  // ....................
// gap                                  // ....................
// gap                                  // ....................
ldr q30, [x0, #16]                      // ...*................
// gap                                  // ....................
// gap                                  // ....................
// gap                                  // ....................
ldr q25, [x0]                           // ..*.................
// gap                                  // ....................
// gap                                  // ....................
// gap                                  // ....................
mul v13.8H, v30.8H, v8.H[0]             // ......*.............
// gap                                  // ....................
sqrdmulh v21.8H, v30.8H, v8.H[1]        // .......*............
// gap                                  // ....................
ldr q30, [x0, #48]                      // .....*..............
// gap                                  // ....................
// gap                                  // ....................
// gap                                  // ....................
ldr q3, [x2, #0]                        // .*..................
// gap                                  // ....................
// gap                                  // ....................
// gap                                  // ....................
sqrdmulh v5.8H, v30.8H, v8.H[1]         // ............*.......
// gap                                  // ....................
mul v30.8H, v30.8H, v8.H[0]             // ...........*........
// gap                                  // ....................
mls v13.8H, v21.8H, v3.H[0]             // ........*...........
// gap                                  // ....................
ldr q15, [x0, #32]                      // ....*...............
// gap                                  // ....................
// gap                                  // ....................
// gap                                  // ....................
mls v30.8H, v5.8H, v3.H[0]              // .............*......
// gap                                  // ....................
add v8.8H, v25.8H, v13.8H               // ..........*.........
// gap                                  // ....................
sub v20.8H, v25.8H, v13.8H              // .........*..........
// gap                                  // ....................
// gap                                  // ....................
// gap                                  // ....................
str q8, [x0], #4*16                     // ................*...
// gap                                  // ....................
add v26.8H, v15.8H, v30.8H              // ...............*....
// gap                                  // ....................
str q20, [x0, #-48]                     // .................*..
// gap                                  // ....................
sub v5.8H, v15.8H, v30.8H               // ..............*.....
// gap                                  // ....................
str q26, [x0, #-32]                     // ..................*.
// gap                                  // ....................
// gap                                  // ....................
// gap                                  // ....................
str q5, [x0, #-16]                      // ...................*
// gap                                  // ....................

// original source code
// ldr q0, [x1, #0]                       // *...................
// ldr q1, [x2, #0]                       // ......*.............
// ldr q8,  [x0]                          // ..*.................
// ldr q9,  [x0, #1*16]                   // .*..................
// ldr q10, [x0, #2*16]                   // ..........*.........
// ldr q11, [x0, #3*16]                   // .....*..............
// mul v24.8h, v9.8h, v0.h[0]             // ...*................
// sqrdmulh v9.8h, v9.8h, v0.h[1]         // ....*...............
// mls v24.8h, v9.8h, v1.h[0]             // .........*..........
// sub     v9.8h,    v8.8h, v24.8h        // .............*......
// add     v8.8h,    v8.8h, v24.8h        // ............*.......
// mul v24.8h, v11.8h, v0.h[0]            // ........*...........
// sqrdmulh v11.8h, v11.8h, v0.h[1]       // .......*............
// mls v24.8h, v11.8h, v1.h[0]            // ...........*........
// sub     v11.8h,    v10.8h, v24.8h      // .................*..
// add     v10.8h,    v10.8h, v24.8h      // ...............*....
// str q8,  [x0], #4*16                   // ..............*.....
// str q9,  [x0, #-3*16]                  // ................*...
// str q10, [x0, #-2*16]                  // ..................*.
// str q11, [x0, #-1*16]                  // ...................*

At the top you can see the re-scheduled assembly and at the bottom you find the original source code as a comment. As comments next to the two sections, you can also see a visual representation on how these instructions have been rescheduled. You can see that various instructions have been moved around to achieve fewer stalls.

Note that if you do run SLOTHY again, it may produce a different scheduling with the same minimal number of stalls. This is expected and due to the constraint solver not producing deterministic outputs.

In the scheduled code, you can see // gap where SLOTHY would expect a "gap" in the current model: This is not a pipeline stall in the sense of a wasted cycle, but rather an issue slot of the CPU that was not used. The Cortex-A55 is a dual-issue CPU meaning in ideal circumstances 2 instructions can be issued per cycle. However, the Neon pipeline can only issue a single (128-bit/q-form) Neon instruction per cycle. Since our code only consists of (128-bit/q-form) Neon instructions, the best we can hope for is a single gap after each instruction. To make use of these issue slots one would have to mix in scalar instructions (or use 64-bit (d-form) Neon instructions).

Also note the registers used: In the original code v24 was as a temporary register in both computation streams preventing to effectively interleave them. SLOTHY renamed those registers to be able to interleave both computations. Other registers have also been arbitrarily renamed, but without any specific reason.

3. Writing your own calling code

When writing your own calls to SLOTHY, there are generally two options: (1) Using SLOTHY as a Python module, or (2) using slothy-cli using command line options. We will continue with (1) to demonstrate some features. To reproduce the example above, you can place the following code into your own Python script in the root directory of SLOTHY:

import logging
import sys

from slothy import Slothy

import slothy.targets.aarch64.aarch64_neon as AArch64_Neon
import slothy.targets.aarch64.cortex_a55 as Target_CortexA55

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

arch = AArch64_Neon
target = Target_CortexA55

slothy = Slothy(arch, target)

# example
slothy.load_source_from_file("examples/naive/aarch64/aarch64_simple0.s")
slothy.config.variable_size=True
slothy.config.constraints.stalls_first_attempt=32

slothy.optimize()
slothy.write_source_to_file("opt/aarch64_simple0_a55.s")

You will need to pass to SLOTHY both the architecture model (containing the instruction mnemonics and which registers are input and outputs for each instruction) and the microarchitectual model (containing latencies, throughputs, execution units, etc.). In this case, we use the AArch64+Neon architecture model and the Arm Cortex-A55 microarchitecture model that come with SLOTHY.

The calls to SLOTHY should be self-explanatory:

  • load_source_from_file loads an assembly file to be optimized.
  • slothy.config can be used to configure SLOTHY. For the documentation of the configuration options, see the comments in config.py.
  • optimize performs the actual optimizations by calling the external constraint solver.
  • write_source_to_file writes back the optimized assembly to a file.

Setting slothy.config.variable_size results in the number of stalls being a parameter of the model that the constraint solver is trying to minimize within a static 'stall budget'. By default, SLOTHY would start with a stall budget of 0 and exponentially increase until a solution is found. To speed this process up, we set stalls_first_attempt=32, starting the search with a sufficient stall budget of 32 cycles.

The variable_size may not perform well for large examples. The default strategy (variable_size=False) is, hence, to pass a fixed number of allowed stalls to the constraint solver and to have SLOTHY perform an 'external' binary search to find the minimum number of stalls for which a solution exists.

Even with this small Neon example, you can see that understanding the input code is much easier than the output code. In fact, the input code can be further clarified through the use of macros and register aliases, leading to the following 'clean' version from examples/naive/aarch64/aarch64_simple0_macros.s which makes it apparent that our example is just a pair of NTT butterflies using Barrett multiplication. Note that the .req and .macro directives used here are commonly supported assembly directives.

qdata0   .req q8
qdata1   .req q9
qdata2   .req q10
qdata3   .req q11

qtwiddle .req q0

data0    .req v8
data1    .req v9
data2    .req v10
data3    .req v11

twiddle  .req v0
modulus  .req v1

tmp      .req v12

data_ptr      .req x0
twiddle_ptr   .req x1

.macro barmul out, in, twiddle, modulus
    mul      \out.8h,   \in.8h, \twiddle.h[0]
    sqrdmulh \in.8h,    \in.8h, \twiddle.h[1]
    mls      \out.8h,   \in.8h, \modulus.h[0]
.endm

.macro butterfly data0, data1, tmp, twiddle, modulus
    barmul \tmp, \data1, \twiddle, \modulus
    sub    \data1.8h, \data0.8h, \tmp.8h
    add    \data0.8h, \data0.8h, \tmp.8h
.endm

start:

    ldr qtwiddle, [twiddle_ptr, #0]

    ldr qdata0, [data_ptr, #0*16]
    ldr qdata1, [data_ptr, #1*16]
    ldr qdata2, [data_ptr, #2*16]
    ldr qdata3, [data_ptr, #3*16]

    butterfly data0, data1, tmp, twiddle, modulus
    butterfly data2, data3, tmp, twiddle, modulus

    str qdata0, [data_ptr], #4*16
    str qdata1, [data_ptr, #-3*16]
    str qdata2, [data_ptr, #-2*16]
    str qdata3, [data_ptr, #-1*16]

end:

SLOTHY will then internally expand all macros and the resulting DFG will be exactly the same as before. To make this work, we have to slightly change the SLOTHY code:

# example
slothy.load_source_from_file("../examples/naive/aarch64/aarch64_simple0_macros.s")
slothy.config.variable_size=True
slothy.config.constraints.stalls_first_attempt=32

slothy.optimize(start="start", end="end")
slothy.write_source_to_file("opt/aarch64_simple0_macros_a55.s")

The difference is that, we have to explicitly pass start and end labels to SLOTHY. This is because SLOTHY does not understand the code before that and the parsing would fail if run on that part of the code.

We have found it very useful to base assembly optimization on a 'clean' version as above and automate its optimization using SLOTHY, which is the reason why we believe that SLOTHY can help with the development of auditable and maintainable high-performance assembly.

4. Software Pipelining

One of the most powerful features of SLOTHY is software pipelining. The core idea of software pipelining is that loop scheduling can be improved by moving some instructions to earlier or later iterations of the loop, that is, by interleaving loop iterations. Note that this does not mean that the loop has to be unrolled: By maintaining the periodicity of the interleaved code, it is possible to keep it within a loop, thereby retaining code compactness. Only the first and last iteration(s) may require to be treated separately; those are called the preamble and postamble, respectively.

Let's look at an example demonstrating how SLOTHY can perform software pipelining for you. Consider the simple case of performing the code from the previous example within a loop with a fixed number of iterations (>=2). This is exactly what the aarch64_simple0_loop example in SLOTHY does:

... // .req and .macro as above

count .req x2
ldr qtwiddle, [twiddle_ptr, #0]
ldr qmodulus, modulus_ptr, #0

mov count, #16
start:

    ldr qtwiddle, [twiddle_ptr, #0]

    ldr qdata0, [data_ptr, #0*16]
    ldr qdata1, [data_ptr, #1*16]
    ldr qdata2, [data_ptr, #2*16]
    ldr qdata3, [data_ptr, #3*16]

    butterfly data0, data1, tmp, twiddle, modulus
    butterfly data2, data3, tmp, twiddle, modulus

    str qdata0, [data_ptr], #4*16
    str qdata1, [data_ptr, #-3*16]
    str qdata2, [data_ptr, #-2*16]
    str qdata3, [data_ptr, #-1*16]

    subs count, count, #1
    cbnz count, start

Let's use SLOTHY to superoptimize this loop:

slothy.load_source_from_file("examples/naive/aarch64/aarch64_simple0_loop.s")
slothy.config.variable_size=True
slothy.config.constraints.stalls_first_attempt=32

slothy.config.sw_pipelining.enabled = True
slothy.config.sw_pipelining.optimize_preamble = False
slothy.config.sw_pipelining.optimize_postamble = False
slothy.optimize_loop("start")
slothy.write_source_to_file("opt/aarch64_simple0_loop_a55.s")

Software pipelining needs to be enabled by setting slothy.config.sw_pipelining.enabled = True. We also need to specifically tell SLOTHY that we would like to optimize the loop starting at start -- SLOTHY will automatically detect that the loop ends at cbnz count, start. Finally, optimize_preamble = False and optimize_preamble = False prevent SLOTHY from optimizing the loop preamble and postamble (first/last iteration), which it would by default -- you normally want this set, but we unset it here to simplify the output. This is what it will look like:

// ...
count .req x2
ldr qtwiddle, [twiddle_ptr, #0]
ldr qmodulus, [modulus_ptr, #0]
mov count, #16
        ldr q3, [x0, #16]
        sqrdmulh v7.8H, v3.8H, v0.H[1]
        sub count, count, #1
start:
        mul v3.8H, v3.8H, v0.H[0]               // ....*.............
        // gap                                  // ..................
        ldr q19, [x0, #48]                      // ...*..............
        // gap                                  // ..................
        // gap                                  // ..................
        // gap                                  // ..................
        ldr q15, [x0, #0]                       // *.................
        // gap                                  // ..................
        // gap                                  // ..................
        // gap                                  // ..................
        mls v3.8H, v7.8H, v1.H[0]               // ......*...........
        // gap                                  // ..................
        mul v13.8H, v19.8H, v0.H[0]             // .........*........
        // gap                                  // ..................
        sqrdmulh v19.8H, v19.8H, v0.H[1]        // ..........*.......
        // gap                                  // ..................
        ldr q7, [x0, #32]                       // ..*...............
        // gap                                  // ..................
        // gap                                  // ..................
        // gap                                  // ..................
        sub v17.8H, v15.8H, v3.8H               // .......*..........
        // gap                                  // ..................
        add v10.8H, v15.8H, v3.8H               // ........*.........
        // gap                                  // ..................
        mls v13.8H, v19.8H, v1.H[0]             // ...........*......
        // gap                                  // ..................
        str q17, [x0, #16]                      // ...............*..
        // gap                                  // ..................
        ldr q3, [x0, #80]                       // .e................
        // gap                                  // ..................
        // gap                                  // ..................
        // gap                                  // ..................
        add v15.8H, v7.8H, v13.8H               // .............*....
        // gap                                  // ..................
        str q10, [x0], #4*16                    // ..............*...
        // gap                                  // ..................
        sub v13.8H, v7.8H, v13.8H               // ............*.....
        // gap                                  // ..................
        str q15, [x0, #-32]                     // ................*.
        // gap                                  // ..................
        sqrdmulh v7.8H, v3.8H, v0.H[1]          // .....e............
        // gap                                  // ..................
        str q13, [x0, #-16]                     // .................*
        // gap                                  // ..................

        // original source code
        // ldr q8, [x0, #0*16]                      // .......|.*...............
        // ldr q9, [x0, #1*16]                      // e......|..........e......
        // ldr q10, [x0, #2*16]                     // .......|.....*...........
        // ldr q11, [x0, #3*16]                     // .......|*................
        // mul      v12.8h,   v9.8h, v0.h[0]        // .......*.................
        // sqrdmulh v9.8h,    v9.8h, v0.h[1]        // .....e.|...............e.
        // mls      v12.8h,   v9.8h, v1.h[0]        // .......|..*..............
        // sub    v9.8h, v8.8h, v12.8h              // .......|......*..........
        // add    v8.8h, v8.8h, v12.8h              // .......|.......*.........
        // mul      v12.8h,   v11.8h, v0.h[0]       // .......|...*.............
        // sqrdmulh v11.8h,    v11.8h, v0.h[1]      // .......|....*............
        // mls      v12.8h,   v11.8h, v1.h[0]       // .......|........*........
        // sub    v11.8h, v10.8h, v12.8h            // ...*...|.............*...
        // add    v10.8h, v10.8h, v12.8h            // .*.....|...........*.....
        // str q8, [x0], #4*16                      // ..*....|............*....
        // str q9, [x0, #-3*16]                     // .......|.........*.......
        // str q10, [x0, #-2*16]                    // ....*..|..............*..
        // str q11, [x0, #-1*16]                    // ......*|................*

        sub count, count, #1
        cbnz count, start
        mul v3.8H, v3.8H, v0.H[0]
        ldr q19, [x0, #48]
        ldr q15, [x0, #0]
        mls v3.8H, v7.8H, v1.H[0]
        mul v13.8H, v19.8H, v0.H[0]
        sqrdmulh v19.8H, v19.8H, v0.H[1]
        ldr q7, [x0, #32]
        sub v17.8H, v15.8H, v3.8H
        add v10.8H, v15.8H, v3.8H
        mls v13.8H, v19.8H, v1.H[0]
        str q17, [x0, #16]
        add v15.8H, v7.8H, v13.8H
        str q10, [x0], #4*16
        sub v13.8H, v7.8H, v13.8H
        str q15, [x0, #-32]
        str q13, [x0, #-16]

Let's start by looking at the optimized loop body going from start: to cbnz count, start: We see that the loop now has 4 blocks of 3 gaps meaning that SLOTHY predicts 4 stalls of 1 cycle each. This compares to 7 stalls in the version without software pipelining. We see that 2 load instructions are marked as early instructions (annotated (e)), meaning they have been moved into the previous iteration: Intuitively, this makes sense: We know statically what data we need to load for the next iteration, and loads have a fairly long latency, so we can improve performance by issuing loads early. For the code to still be correct, SLOTHY decreases the number of iterations by one (sub count, count, #1), adds the missing early-instructions for the first iteration before the loop, and finally adds the non-early instructions of the last iteration after the loop.

Another experimental feature that can be witnessed in this example is address offset fixup. The two ldrs that were moved into the previous iteration have been reordered with the str _, [x0], #64 which modifies the address register. SLOTHY is aware of this and has adjusted the immediate offsets in ldr accordingly. Without this, software pipelining would not be possible here. Address offset fixup is an important yet somewhat subtle feature, and mistakes in its handling are currently not caught by SLOTHY's selfcheck. Going into the details of why that is goes too far for this tutorial, but it is one of the reasons why the selfcheck does, as it stands, not replace a formal verification.

5. Checking the quality of SLOTHY optimizations

You may ask how we know that SLOTHY has actually done something useful here? Sure enough, the interleaving in the above example looks somewhat sensible, and SLOTHY's model predicts only few full-cycle stalls. However, at this point we don't have any indicator of the impact of SLOTHY's optimizations on real hardware.

Indeed, developing accurate microarchitectural models for SLOTHY is a time-consuming and iterative process: It usually takes a while until you have refined things to the point where SLOTHY's prediction closely relates to performance on real hardware. The most common refinement steps are:

  1. There is a mistake in the microarchitectural model mismatching what is written in the Software Optimization Guide (SWOG);
  2. Some aspect of the microarchtecture (e.g., certain forwarding paths or other latency exceptions) is not documented in the SWOG.

We briefly discuss two ways that we found useful to evaluate the quality SLOTHY's optimizations and drive the refinement of microarchitectural models.

First, one useful tool for approximate but independent (of SLOTHY) performance evaluation is LLVM's Machine Code Analyzer. If you have llvm-mca available in your PATH (you may have to compile LLVM >= 18 yourself), you can make use of it in SLOTHY by setting the with_llvm_mca flag. Let's look at the last example and enable LLVM MCA:

slothy.load_source_from_file("../examples/naive/aarch64/aarch64_simple0_loop.s")
slothy.config.variable_size=True
slothy.config.constraints.stalls_first_attempt=32

slothy.config.sw_pipelining.enabled = True
slothy.config.sw_pipelining.optimize_preamble = False
slothy.config.sw_pipelining.optimize_postamble = False
slothy.config.with_llvm_mca = True
slothy.optimize_loop("start")
slothy.write_source_to_file("./aarch64_simple0_loop_mca_a55.s")

This will call LLVM MCA on both the original code and the optimized code and append the LLVM MCA statistics as a comment to the output. Somewhere in the code you will see:

// LLVM MCA STATISTICS (ORIGINAL) BEGIN
//
// Iterations:        100
// Instructions:      2000
// Total Cycles:      3102
// Total uOps:        2100
//
// Dispatch Width:    2
// uOps Per Cycle:    0.68
// IPC:               0.64
// Block RThroughput: 10.5

...

// LLVM MCA STATISTICS (OPTIMIZED) BEGIN
//
// Iterations:        100
// Instructions:      2000
// Total Cycles:      2102
// Total uOps:        2100
//
// Dispatch Width:    2
// uOps Per Cycle:    1.00
// IPC:               0.95
// Block RThroughput: 10.5

This suggests that our optimizations were actually useful: With respect to LLVM-MCA's scheduling model of Cortex-A55, the cycle count per iteration was reduced from 31 cycles to 21 cycles.

But LLVM MCA gives you more: It outputs a timeline view showing how each instruction travels through the pipeline:

// Timeline view (ORIGINAL):
//                     0123456789          0123456789          0123456789          0123456789          01234
// Index     0123456789          0123456789          0123456789          0123456789          0123456789
//
// [0,0]     DeeE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q0, [x1]
// [0,1]     .DeeE.    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q1, [x2]
// [0,2]     . DeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q8, [x0]
// [0,3]     .  DeeE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q9, [x0, #16]
// [0,4]     .   DeeE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q10, [x0, #32]
// [0,5]     .    DeeE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q11, [x0, #48]
// [0,6]     .    .DeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   mul	v24.8h, v9.8h, v0.h[0]
// [0,7]     .    . DeeeE   .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   sqrdmulh	v9.8h, v9.8h, v0.h[1]
// [0,8]     .    .    .DeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   mls	v24.8h, v9.8h, v1.h[0]
// [0,9]     .    .    .    DeE  .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   sub	v9.8h, v8.8h, v24.8h
// [0,10]    .    .    .    .DeE .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   add	v8.8h, v8.8h, v24.8h
// [0,11]    .    .    .    . DeeeE   .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   mul	v24.8h, v11.8h, v0.h[0]
// [0,12]    .    .    .    .  DeeeE  .    .    .    .    .    .    .    .    .    .    .    .    .    .   .   sqrdmulh	v11.8h, v11.8h, v0.h[1]
// [0,13]    .    .    .    .    . DeeeE   .    .    .    .    .    .    .    .    .    .    .    .    .   .   mls	v24.8h, v11.8h, v1.h[0]
// [0,14]    .    .    .    .    .    .DeE .    .    .    .    .    .    .    .    .    .    .    .    .   .   sub	v11.8h, v10.8h, v24.8h
// [0,15]    .    .    .    .    .    . DeE.    .    .    .    .    .    .    .    .    .    .    .    .   .   add	v10.8h, v10.8h, v24.8h
// [0,16]    .    .    .    .    .    .  DE.    .    .    .    .    .    .    .    .    .    .    .    .   .   str	q8, [x0], #64
// [0,17]    .    .    .    .    .    .   DE    .    .    .    .    .    .    .    .    .    .    .    .   .   stur	q9, [x0, #-48]
// [0,18]    .    .    .    .    .    .    DE   .    .    .    .    .    .    .    .    .    .    .    .   .   stur	q10, [x0, #-32]
// [0,19]    .    .    .    .    .    .    .DE  .    .    .    .    .    .    .    .    .    .    .    .   .   stur	q11, [x0, #-16]
// [1,0]     .    .    .    .    .    .    .DeeE.    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q0, [x1]
// [1,1]     .    .    .    .    .    .    . DeeE    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q1, [x2]
// [1,2]     .    .    .    .    .    .    .  DeeE   .    .    .    .    .    .    .    .    .    .    .   .   ldr	q8, [x0]
// [1,3]     .    .    .    .    .    .    .   DeeE  .    .    .    .    .    .    .    .    .    .    .   .   ldr	q9, [x0, #16]
// [1,4]     .    .    .    .    .    .    .    DeeE .    .    .    .    .    .    .    .    .    .    .   .   ldr	q10, [x0, #32]
// [1,5]     .    .    .    .    .    .    .    .DeeE.    .    .    .    .    .    .    .    .    .    .   .   ldr	q11, [x0, #48]
// [1,6]     .    .    .    .    .    .    .    . DeeeE   .    .    .    .    .    .    .    .    .    .   .   mul	v24.8h, v9.8h, v0.h[0]
// [1,7]     .    .    .    .    .    .    .    .  DeeeE  .    .    .    .    .    .    .    .    .    .   .   sqrdmulh	v9.8h, v9.8h, v0.h[1]
// [1,8]     .    .    .    .    .    .    .    .    . DeeeE   .    .    .    .    .    .    .    .    .   .   mls	v24.8h, v9.8h, v1.h[0]
// [1,9]     .    .    .    .    .    .    .    .    .    .DeE .    .    .    .    .    .    .    .    .   .   sub	v9.8h, v8.8h, v24.8h
// [1,10]    .    .    .    .    .    .    .    .    .    . DeE.    .    .    .    .    .    .    .    .   .   add	v8.8h, v8.8h, v24.8h
// [1,11]    .    .    .    .    .    .    .    .    .    .  DeeeE  .    .    .    .    .    .    .    .   .   mul	v24.8h, v11.8h, v0.h[0]
// [1,12]    .    .    .    .    .    .    .    .    .    .   DeeeE .    .    .    .    .    .    .    .   .   sqrdmulh	v11.8h, v11.8h, v0.h[1]
// [1,13]    .    .    .    .    .    .    .    .    .    .    .  DeeeE  .    .    .    .    .    .    .   .   mls	v24.8h, v11.8h, v1.h[0]
// [1,14]    .    .    .    .    .    .    .    .    .    .    .    . DeE.    .    .    .    .    .    .   .   sub	v11.8h, v10.8h, v24.8h
// [1,15]    .    .    .    .    .    .    .    .    .    .    .    .  DeE    .    .    .    .    .    .   .   add	v10.8h, v10.8h, v24.8h
// [1,16]    .    .    .    .    .    .    .    .    .    .    .    .   DE    .    .    .    .    .    .   .   str	q8, [x0], #64
// [1,17]    .    .    .    .    .    .    .    .    .    .    .    .    DE   .    .    .    .    .    .   .   stur	q9, [x0, #-48]
// [1,18]    .    .    .    .    .    .    .    .    .    .    .    .    .DE  .    .    .    .    .    .   .   stur	q10, [x0, #-32]
// [1,19]    .    .    .    .    .    .    .    .    .    .    .    .    . DE .    .    .    .    .    .   .   stur	q11, [x0, #-16]
// [2,0]     .    .    .    .    .    .    .    .    .    .    .    .    . DeeE    .    .    .    .    .   .   ldr	q0, [x1]
// [2,1]     .    .    .    .    .    .    .    .    .    .    .    .    .  DeeE   .    .    .    .    .   .   ldr	q1, [x2]
// [2,2]     .    .    .    .    .    .    .    .    .    .    .    .    .   DeeE  .    .    .    .    .   .   ldr	q8, [x0]
// [2,3]     .    .    .    .    .    .    .    .    .    .    .    .    .    DeeE .    .    .    .    .   .   ldr	q9, [x0, #16]
// [2,4]     .    .    .    .    .    .    .    .    .    .    .    .    .    .DeeE.    .    .    .    .   .   ldr	q10, [x0, #32]
// [2,5]     .    .    .    .    .    .    .    .    .    .    .    .    .    . DeeE    .    .    .    .   .   ldr	q11, [x0, #48]
// [2,6]     .    .    .    .    .    .    .    .    .    .    .    .    .    .  DeeeE  .    .    .    .   .   mul	v24.8h, v9.8h, v0.h[0]
// [2,7]     .    .    .    .    .    .    .    .    .    .    .    .    .    .   DeeeE .    .    .    .   .   sqrdmulh	v9.8h, v9.8h, v0.h[1]
// [2,8]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DeeeE  .    .    .   .   mls	v24.8h, v9.8h, v1.h[0]
// [2,9]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DeE.    .    .   .   sub	v9.8h, v8.8h, v24.8h
// [2,10]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DeE    .    .   .   add	v8.8h, v8.8h, v24.8h
// [2,11]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DeeeE .    .   .   mul	v24.8h, v11.8h, v0.h[0]
// [2,12]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DeeeE.    .   .   sqrdmulh	v11.8h, v11.8h, v0.h[1]
// [2,13]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DeeeE .   .   mls	v24.8h, v11.8h, v1.h[0]
// [2,14]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DeE   .   sub	v11.8h, v10.8h, v24.8h
// [2,15]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   DeE  .   add	v10.8h, v10.8h, v24.8h
// [2,16]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DE  .   str	q8, [x0], #64
// [2,17]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .DE .   stur	q9, [x0, #-48]
// [2,18]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    . DE.   stur	q10, [x0, #-32]
// [2,19]    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  DE   stur	q11, [x0, #-16]
...

// Timeline view (OPTIMIZED):
//                     0123456789          0123456789          0123456789
// Index     0123456789          0123456789          0123456789          01234
//
// [0,0]     DeeE .    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q7, [x1]
// [0,1]     .DeeE.    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q31, [x0, #16]
// [0,2]     . DeeE    .    .    .    .    .    .    .    .    .    .    .   .   ldr	q11, [x0, #48]
// [0,3]     .   DeeeE .    .    .    .    .    .    .    .    .    .    .   .   mul	v20.8h, v31.8h, v7.h[0]
// [0,4]     .    DeeeE.    .    .    .    .    .    .    .    .    .    .   .   sqrdmulh	v31.8h, v31.8h, v7.h[1]
// [0,5]     .    .DeeeE    .    .    .    .    .    .    .    .    .    .   .   mul	v18.8h, v11.8h, v7.h[0]
// [0,6]     .    . DeeeE   .    .    .    .    .    .    .    .    .    .   .   sqrdmulh	v7.8h, v11.8h, v7.h[1]
// [0,7]     .    .  DeeE   .    .    .    .    .    .    .    .    .    .   .   ldr	q11, [x2]
// [0,8]     .    .   DeeE  .    .    .    .    .    .    .    .    .    .   .   ldr	q8, [x0]
// [0,9]     .    .    .DeeeE    .    .    .    .    .    .    .    .    .   .   mls	v20.8h, v31.8h, v11.h[0]
// [0,10]    .    .    . DeeeE   .    .    .    .    .    .    .    .    .   .   mls	v18.8h, v7.8h, v11.h[0]
// [0,11]    .    .    .  DeeE   .    .    .    .    .    .    .    .    .   .   ldr	q7, [x0, #32]
// [0,12]    .    .    .    DeE  .    .    .    .    .    .    .    .    .   .   sub	v31.8h, v8.8h, v20.8h
// [0,13]    .    .    .    .DeE .    .    .    .    .    .    .    .    .   .   add	v11.8h, v8.8h, v20.8h
// [0,14]    .    .    .    . DeE.    .    .    .    .    .    .    .    .   .   sub	v20.8h, v7.8h, v18.8h
// [0,15]    .    .    .    . DE .    .    .    .    .    .    .    .    .   .   str	q31, [x0, #16]
// [0,16]    .    .    .    .  DeE    .    .    .    .    .    .    .    .   .   add	v7.8h, v7.8h, v18.8h
// [0,17]    .    .    .    .   DE    .    .    .    .    .    .    .    .   .   str	q11, [x0], #64
// [0,18]    .    .    .    .    DE   .    .    .    .    .    .    .    .   .   stur	q7, [x0, #-32]
// [0,19]    .    .    .    .    .DE  .    .    .    .    .    .    .    .   .   stur	q20, [x0, #-16]
// [1,0]     .    .    .    .    .DeeE.    .    .    .    .    .    .    .   .   ldr	q7, [x1]
// [1,1]     .    .    .    .    . DeeE    .    .    .    .    .    .    .   .   ldr	q31, [x0, #16]
// [1,2]     .    .    .    .    .  DeeE   .    .    .    .    .    .    .   .   ldr	q11, [x0, #48]
// [1,3]     .    .    .    .    .    DeeeE.    .    .    .    .    .    .   .   mul	v20.8h, v31.8h, v7.h[0]
// [1,4]     .    .    .    .    .    .DeeeE    .    .    .    .    .    .   .   sqrdmulh	v31.8h, v31.8h, v7.h[1]
// [1,5]     .    .    .    .    .    . DeeeE   .    .    .    .    .    .   .   mul	v18.8h, v11.8h, v7.h[0]
// [1,6]     .    .    .    .    .    .  DeeeE  .    .    .    .    .    .   .   sqrdmulh	v7.8h, v11.8h, v7.h[1]
// [1,7]     .    .    .    .    .    .   DeeE  .    .    .    .    .    .   .   ldr	q11, [x2]
// [1,8]     .    .    .    .    .    .    DeeE .    .    .    .    .    .   .   ldr	q8, [x0]
// [1,9]     .    .    .    .    .    .    . DeeeE   .    .    .    .    .   .   mls	v20.8h, v31.8h, v11.h[0]
// [1,10]    .    .    .    .    .    .    .  DeeeE  .    .    .    .    .   .   mls	v18.8h, v7.8h, v11.h[0]
// [1,11]    .    .    .    .    .    .    .   DeeE  .    .    .    .    .   .   ldr	q7, [x0, #32]
// [1,12]    .    .    .    .    .    .    .    .DeE .    .    .    .    .   .   sub	v31.8h, v8.8h, v20.8h
// [1,13]    .    .    .    .    .    .    .    . DeE.    .    .    .    .   .   add	v11.8h, v8.8h, v20.8h
// [1,14]    .    .    .    .    .    .    .    .  DeE    .    .    .    .   .   sub	v20.8h, v7.8h, v18.8h
// [1,15]    .    .    .    .    .    .    .    .  DE.    .    .    .    .   .   str	q31, [x0, #16]
// [1,16]    .    .    .    .    .    .    .    .   DeE   .    .    .    .   .   add	v7.8h, v7.8h, v18.8h
// [1,17]    .    .    .    .    .    .    .    .    DE   .    .    .    .   .   str	q11, [x0], #64
// [1,18]    .    .    .    .    .    .    .    .    .DE  .    .    .    .   .   stur	q7, [x0, #-32]
// [1,19]    .    .    .    .    .    .    .    .    . DE .    .    .    .   .   stur	q20, [x0, #-16]
// [2,0]     .    .    .    .    .    .    .    .    . DeeE    .    .    .   .   ldr	q7, [x1]
// [2,1]     .    .    .    .    .    .    .    .    .  DeeE   .    .    .   .   ldr	q31, [x0, #16]
// [2,2]     .    .    .    .    .    .    .    .    .   DeeE  .    .    .   .   ldr	q11, [x0, #48]
// [2,3]     .    .    .    .    .    .    .    .    .    .DeeeE    .    .   .   mul	v20.8h, v31.8h, v7.h[0]
// [2,4]     .    .    .    .    .    .    .    .    .    . DeeeE   .    .   .   sqrdmulh	v31.8h, v31.8h, v7.h[1]
// [2,5]     .    .    .    .    .    .    .    .    .    .  DeeeE  .    .   .   mul	v18.8h, v11.8h, v7.h[0]
// [2,6]     .    .    .    .    .    .    .    .    .    .   DeeeE .    .   .   sqrdmulh	v7.8h, v11.8h, v7.h[1]
// [2,7]     .    .    .    .    .    .    .    .    .    .    DeeE .    .   .   ldr	q11, [x2]
// [2,8]     .    .    .    .    .    .    .    .    .    .    .DeeE.    .   .   ldr	q8, [x0]
// [2,9]     .    .    .    .    .    .    .    .    .    .    .  DeeeE  .   .   mls	v20.8h, v31.8h, v11.h[0]
// [2,10]    .    .    .    .    .    .    .    .    .    .    .   DeeeE .   .   mls	v18.8h, v7.8h, v11.h[0]
// [2,11]    .    .    .    .    .    .    .    .    .    .    .    DeeE .   .   ldr	q7, [x0, #32]
// [2,12]    .    .    .    .    .    .    .    .    .    .    .    . DeE.   .   sub	v31.8h, v8.8h, v20.8h
// [2,13]    .    .    .    .    .    .    .    .    .    .    .    .  DeE   .   add	v11.8h, v8.8h, v20.8h
// [2,14]    .    .    .    .    .    .    .    .    .    .    .    .   DeE  .   sub	v20.8h, v7.8h, v18.8h
// [2,15]    .    .    .    .    .    .    .    .    .    .    .    .   DE   .   str	q31, [x0, #16]
// [2,16]    .    .    .    .    .    .    .    .    .    .    .    .    DeE .   add	v7.8h, v7.8h, v18.8h
// [2,17]    .    .    .    .    .    .    .    .    .    .    .    .    .DE .   str	q11, [x0], #64
// [2,18]    .    .    .    .    .    .    .    .    .    .    .    .    . DE.   stur	q7, [x0, #-32]
// [2,19]    .    .    .    .    .    .    .    .    .    .    .    .    .  DE   stur	q20, [x0, #-16]

However, LLVM MCA's model might not be accurate either and cannot replacement measurements on real hardware -- so let's do that. Here, we use a profiling tool we wrote as part of the pqax benchmarking framework. It takes an assembly snippet as input and automatically generates a program running and benchmarking prefixes of the input, and combining them into a performance diagram similar to the one generated by LLVM-MCA. Here's the output in our case:

===== Stepwise profiling =======
[  0]:                     ldr q0, [x1, #0] ......*.....................................
[  1]:                  ldr q8, [x0, #0*16] .......*....................................
[  2]:                  ldr q9, [x0, #1*16] .........*..................................
[  3]:                 ldr q10, [x0, #2*16] ...........*................................
[  4]:                 ldr q11, [x0, #3*16] .............*..............................
[  5]:    mul      v12.8h,   v9.8h, v0.h[0] ...............*............................
[  6]:    sqrdmulh v9.8h,    v9.8h, v0.h[1] ................*...........................
[  7]:    mls      v12.8h,   v9.8h, v1.h[0] .................*..........................
[  8]:          sub    v9.8h, v8.8h, v12.8h .....................*......................
[  9]:          add    v8.8h, v8.8h, v12.8h ........................*...................
[ 10]:   mul      v12.8h,   v11.8h, v0.h[0] .........................*..................
[ 11]:  sqrdmulh v11.8h,    v11.8h, v0.h[1] ..........................*.................
[ 12]:   mls      v12.8h,   v11.8h, v1.h[0] ...........................*................
[ 13]:        sub    v11.8h, v10.8h, v12.8h ...............................*............
[ 14]:        add    v10.8h, v10.8h, v12.8h ..................................*.........
[ 15]:                  str q8, [x0], #4*16 ...................................*........
[ 16]:                 str q9, [x0, #-3*16] .....................................*......
[ 17]:                str q10, [x0, #-2*16] .....................................*......
[ 18]:                str q11, [x0, #-1*16] ........................................*...

===== Stepwise profiling (OPTIMIZED) =======
[  0]:  ldr q18, [x0, #16]              // .*........................
[  1]:  sqrdmulh v8.8H, v6.8H, v2.H[1]  // ..*.......................
[  2]:  mul v23.8H, v6.8H, v2.H[0]      // ...*......................
[  3]:  ldr q31, [x0, #32]              // ....*.....................
[  4]:  mul v3.8H, v18.8H, v2.H[0]      // ......*...................
[  5]:  mls v23.8H, v8.8H, v1.H[0]      // .......*..................
[  6]:  sqrdmulh v9.8H, v18.8H, v2.H[1] // ........*.................
[  7]:  ldr q15, [x0, #0]               // .........*................
[  8]:  sub v11.8H, v31.8H, v23.8H      // ...........*..............
[  9]:  mls v3.8H, v9.8H, v1.H[0]       // ............*.............
[ 10]:  add v16.8H, v31.8H, v23.8H      // .............*............
[ 11]:  str q11, [x0, #48]              // ..............*...........
[ 12]:  ldr q2, [x1, #0]                // ...............*..........
[ 13]:  add v13.8H, v15.8H, v3.8H       // .................*........
[ 14]:  str q16, [x0, #32]              // ..................*.......
[ 15]:  sub v7.8H, v15.8H, v3.8H        // ...................*......
[ 16]:  str q13, [x0], #4*16            // ....................*.....
[ 17]:  ldr q6, [x0, #48]               // .....................*....
[ 18]:  str q7, [x0, #-48]              // .......................*..

We can see that SLOTHY's predictions were exactly right, and that LLVM-MCA's model is off in a few places. So, in a nutshell, we'd say that LLVM-MCA is great for quick evaluation of performance, but when you get down to the last cycle and fine-tuning your model, there is no way around measurements on real hardware.

6. Optimizing a full Neon NTT

The examples previously considered were all toy examples, so you may wonder how to apply SLOTHY to actual cryptographic code. Let's look at a real-world example: The Kyber number-theoretic transform -- a core arithmetic function of the Kyber key-encapsulation mechanism making up a large chunk of the total run-time. The target platform is again the Arm Cortex-A55 and the code primarily consists of Neon vector instructions. We'll consider a straightforward implementation available here: ntt_kyber_123_4567.s. If you have ever written an NTT, it should be fairly easy to understand what the code is doing. The code consists of 2 main loops implementing layers 1+2+3 and 4+5+6+7 of the NTT. The actual operations are wrapped in macros implementing butterflies on single vector registers. Note that this code performs very poorly: No consideration was given to the intricacies of the microarchitecture.

Let's run SLOTHY on this code:

slothy.load_source_from_file("examples/naive/aarch64/ntt_kyber_123_4567.s")
slothy.config.sw_pipelining.enabled = True
slothy.config.inputs_are_outputs = True
slothy.config.sw_pipelining.minimize_overlapping = False
slothy.config.variable_size = True
slothy.config.reserved_regs = [f"x{i}" for i in range(0, 7)] + ["x30", "sp"]
slothy.config.constraints.stalls_first_attempt = 64
slothy.optimize_loop("layer123_start")
slothy.optimize_loop("layer4567_start")
slothy.write_source_to_file("opt/ntt_kyber_123_4567_opt_a55.s")

We simply optimize both loops separately. You will notice some additional flags we have set. To read the documentation of those, please have a look at config.py. We have set an additional flag: inputs_are_outputs = True. This tells SLOTHY that the registers that are used as inputs to the loop (e.g., the pointer to the polynomial input) are also outputs of the entire loop; otherwise, SLOTHY could overwrite them in the postamble once they are no longer needed. You most likely want inputs_are_outputs=True whenever you are optimizing a loop. We also use the reserved_regs option to tell SLOTHY that registers x0, ..., x7, x30, sp are used for other purposes and should not be used by SLOTHY. When optimizing only parts of a function, it is essential to tell SLOTHY which registers should not be used: By default SLOTHY will use any of the architectural registers. If you are familiar with inline assembly, SLOTHY's reserved_regs are essentially the complement of the 'clobber list'.

When running this example, you will notice that it has a significantly longer runtime. On my Intel i7-1360P it takes approximately 15 minutes to optimize both loops. You may instead look at an optimized version of the same code examples/opt/aarch64/ntt_kyber_123_4567_opt_a55.s. You notice that both loops have many early instructions, and coming up with this code by hand would be tedious, time-consuming and error-prone.

7. Optimizing larger pieces of code

We've seen that the code above can be optimized relatively fast (within seconds to minutes on a laptop). When using a more powerful machine and allowing optimization times of hours, one can scale this up to larger examples. We've successfully used (vanilla) SLOTHY for optimized code snippets of up to 180 instructions. However, for larger code at a certain point the constraint solving becomes prohibitively expensive and we need to use a different strategy.

One such example is the X25519 implementation we looked at in the SLOTHY paper available in X25519-AArch64-simple.s It is a hybrid vector-scalar implementation based on an implementation by Lenngren. Its core loop consists of 958 instructions which well exceeds what SLOTHY can currently optimize in a single pass.

However, we can still make use of SLOTHY to optimize this code by employing heuristics. One particularly useful heuristics supported by SLOTHY is the splitting heuristic. When a piece of code is too large to be optimized at once, it splits it into multiple overlapping pieces that are optimized separately. With this approach one loses the optimality guarantees as it may be that there is a solution that SLOTHY cannot find due to the splitting. However, by repeatedly running SLOTHY using the splitting heuristic, we managed to outperform the state-of-the-art and get very close to optimal results (in terms of IPC).

To demonstrate the splitting heuristic we can use the following SLOTHY call:

# example
slothy.load_source_from_file("../examples/naive/aarch64/X25519-AArch64-simple.s")

# first pass: replace symbolic register names by architectural registers
slothy.config.inputs_are_outputs=True
slothy.config.outputs=["x0"]
slothy.config.constraints.functional_only = True
slothy.config.constraints.allow_reordering = False
slothy.optimize(start="mainloop", end="end_label")
slothy.config.constraints.functional_only = False
slothy.config.constraints.allow_reordering = True

# second pass: splitting heuristic
slothy.config.variable_size=True
slothy.config.constraints.stalls_first_attempt=32
slothy.config.split_heuristic = True
slothy.config.split_heuristic_stepsize = 0.05
slothy.config.split_heuristic_factor = 10
slothy.config.split_heuristic_repeat = 2
slothy.config.split_heuristic_estimate_performance = False
slothy.optimize(start="mainloop", end="end_label")
slothy.write_source_to_file("opt/X25519-AArch64-simple_opt.s")

The splitting heuristic can be turned on by setting slothy.config.split_heuristic = True. It has three main parameters:

  • split_heuristic_factor : Determines the size of each split. In this case, 10 means that we will be optimizing 10% of the original code at a time.
  • split_heuristic_stepsize : Controls the degree of overlapping of the sliding window. Setting it to 0.05 means the sliding window moves by 5% every time. We will start with optimizing the first 10% ([0,0.1]) of the code, then [0.05,0.15], [0.1,0.20], ...
  • split_heuristic_repeat: The number of times the optimization should be repeated.
  • split_heuristic_estimate_performance: After running the split heuristic, don't attempt to estimate the stalls in the final code (that tends to be quite slow).

You will notice in the example above, that there is another call to slothy.optimize() prior to that. This is needed as the input implementation is using symbolic register names which is a feature unrelated to the splitting heuristic that we want to demonstrate here. It allows a developer of the code to leave the register allocation up to SLOTHY. Unfortunately, it is not compatible with the splitting heuristic (as register allocation can't be performed locally), and hence we first need to do the register allocation on the full code before we continue. We can configure SLOTHY to only consider register allocation by setting the allow_reordering=False (disabling the ordering constraints) and functional_only=True (disabling the microarchitectural constraints). In this way, the constraints remain manageable, and SLOTHY finds a register allocation within a few minutes.

Running this example takes around 15 minutes. You can instead look at the output available in opt/X25519-AArch64-simple_opt.s The output will look similar to the previous examples and contains significantly less pipeline stalls than the input. For achieving the best performance, we require a few more calls to SLOTHY. You can find the script we used here - it runs around 1.5 hours.

8. Adding a new microarchitecture

You may wonder how to extend SLOTHY to include a new microarchitecture. For example, you may want to optimize code for a newer iteration of the Arm Cortex-A55, e.g., the Arm Cortex-A510. To understand what is needed for that, let's look at the microarchitectural model for the Cortex-A55 available in slothy/targets/aarch64/cortex_a55.py.

Skipping some boilerplate code, you will see the following structure:

from slothy.targets.aarch64.aarch64_neon import *

issue_rate = 2
class ExecutionUnit(Enum):
    """Enumeration of execution units in Cortex-A55 model"""
    SCALAR_ALU0=1
    SCALAR_ALU1=2
    SCALAR_MAC=3
    SCALAR_LOAD=4
    SCALAR_STORE=5
    VEC0=6
    VEC1=7
    # ...

execution_units = {
        // ...
}

inverse_throughput = {
        // ...
}

default_latencies = {
        // ...
}


def get_latency(src, out_idx, dst):
    // ...
    latency = lookup_multidict(
        default_latencies, src)
    // ...
    return latency

def get_units(src):
    units = lookup_multidict(execution_units, src)
    if isinstance(units,list):
        return units
    return [units]

def get_inverse_throughput(src):
    return lookup_multidict(
        inverse_throughput, src)

Going through the snippet, we can see the core components:

  • Definition of the issue_rate corresponding to the number of issue slots available per cycle. Since the Cortex-A55 is a dual-issue CPU, this is two.
  • Definition of an Enum modelling the different execution units available. In this case, we model 2 scalar units, one MAC unit, 2 64-bit vector units, one load unit, and one store unit.
  • Finally, we need to implement the functions get_latency, get_units, get_inverse_throughput returning the latency, occupied execution units, and throughputs. The input to these functions is a class from the architectural model representing the instruction in question. For example, the class vmull in aarch64_neon.py corresponds to the umull instruction. We commonly implement this using dictionaries above.

For example, for the (128-bit/qform) vmull instruction, we can find in the Arm Cortex-A55 Software Optimization Guide that it occupies both vector execution units, has an inverse throughput of 1, and a latency of 4 cycles. We can model this in the following way:

execution_units = {
    ( vmull ): [[ExecutionUnit.VEC0, ExecutionUnit.VEC1]],
}

inverse_throughput = {
    ( vmull ) : 1,
}

default_latencies = {
    ( vmull ) : 4,
}

We mostly use the tuple-syntax, so we can group together instructions that belong together. For example, later we may want to add the Neon add. From the SWOG we can see that (128-bit/qform) add occupies both 64-bit vector execution units, has a latency of 3 cycles, and throughput of 1 cycle. We can extend the above model as follows:

execution_units = {
    ( vmull, vadd ): [[ExecutionUnit.VEC0, ExecutionUnit.VEC1]],
}

inverse_throughput = {
    ( vmull, vadd ) : 1,
}

default_latencies = {
    ( vmull ) : 4,
    ( vadd ) : 3,
}

(When looking at the actual model, you will notice that this is not quite how it is modelled. You will see that for some instructions, we have to distinguish between the q-form (128-bit) and the d-form (64-bit) of the instruction. Q-form instructions occupy both vector execution units, while most D-form instructions occupy only 1. Latencies also vary depending on the actual form.)

Note that both the architectural model and the micro-architectural model can be built lazily: As long as the corresponding instruction do not appear in your input, you may leave out their description. As soon as you hit an instruction that is not part of the architectural or micro-architectural model, you will see an error.

Troubleshooting

  • ModuleNotFoundError: No module named 'ortools'

This suggests that you have not installed the required dependencies needed by SLOTHY. Either you need to follow the installation instructions, or if you have done that already, you likely forgot to enter the virtual environment you have installed them in using source venv/bin/activate. You will have to run this every time you open a new terminal.

  • The selfcheck passes but the code is functionally incorrect!

The most common reason for this is a bad configuration: Check that you all registers that must be kept for the sake of the surrounding code are marked as reserved_regs.

Another possibility, albeit hopefully rare by now, is a failure during address offset fixup: This feature is not yet stable, and the selfcheck is currently blind to erroneous calculations here. If you are sure your configuration is correct, you might want to check the adjusted address offsets manually. If you find a bug, let us know!