Home

Overview

This repository contains the code for the highly experimental and completely unfinished high-level loop pipelining pass as presented during the April 2015 EuroLLVM developers meeting.

Intended functionality

The aim of this high-level software pipelining implementation is to provide a portable modulo scheduling implementation at LLVM's IR level. The current implementation uses the Swing Modulo Scheduling algorithm to schedule IR operations in a pipelined fashion to expose instruction-level parallelism, and relies on the existing scheduler in the target back-end to utilize this improved instruction-level parallelism when scheduling the actual machine instructions for the target.
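As an illustration of what scheduling "in a pipelined fashion" means, the sketch below maps a cycle in the flat, non-overlapped schedule of one iteration to a stage and a slot within the kernel, given an initiation interval (II). This is the standard modulo scheduling arithmetic rather than code from this repository; the names `KernelSlot`, `placeInKernel` and `stageCount` are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Position of an operation inside the pipelined kernel.
// Hypothetical type, used only for this illustration.
struct KernelSlot {
  unsigned Stage; // how many iterations behind the newest one this op runs
  unsigned Cycle; // cycle within the kernel, in the range [0, II)
};

// Map a cycle of the flat single-iteration schedule to its kernel position.
// An II of zero is treated as one to keep the arithmetic well defined.
KernelSlot placeInKernel(unsigned FlatCycle, unsigned II) {
  II = std::max(II, 1u);
  return {FlatCycle / II, FlatCycle % II};
}

// Number of loop iterations that overlap inside the kernel.
unsigned stageCount(const std::vector<unsigned> &FlatCycles, unsigned II) {
  if (FlatCycles.empty())
    return 0;
  unsigned MaxStage = 0;
  for (unsigned C : FlatCycles)
    MaxStage = std::max(MaxStage, placeInKernel(C, II).Stage);
  return MaxStage + 1;
}
```

With an II of 2, an operation at flat cycle 5 lands in stage 2, kernel cycle 1, so three iterations overlap in the kernel.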

The modulo scheduling algorithm requires information about the available resources of the target architecture from its back-end. This is implemented through two hooks in the TargetTransformInfo (TTI) interface: one returns the number of available scalar processing elements, the other the number of available vector processing elements.
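A minimal sketch of how two such counts could feed the scheduler is shown below: it computes the usual resource-constrained lower bound on the initiation interval. The struct `PipelinerResources` and the function `resourceMII` are hypothetical stand-ins, not the actual hook names or signatures, and the pass itself may derive this bound differently.

```cpp
#include <algorithm>

// Hypothetical stand-in for the values returned by the two TTI hooks.
// The field names are illustrative and not taken from the pass.
struct PipelinerResources {
  unsigned ScalarPEs; // number of available scalar processing elements
  unsigned VectorPEs; // number of available vector processing elements
};

// Ceiling division; a unit count of zero is treated as one so that
// the function never divides by zero.
static unsigned ceilDiv(unsigned Num, unsigned Den) {
  Den = std::max(Den, 1u);
  return (Num + Den - 1) / Den;
}

// Resource-constrained minimum initiation interval: each resource class
// needs at least ceil(ops / units) cycles per iteration, and the most
// heavily used class sets the bound. The result is never below one.
unsigned resourceMII(unsigned NumScalarOps, unsigned NumVectorOps,
                     const PipelinerResources &Res) {
  unsigned Scalar = ceilDiv(NumScalarOps, Res.ScalarPEs);
  unsigned Vector = ceilDiv(NumVectorOps, Res.VectorPEs);
  return std::max({Scalar, Vector, 1u});
}
```

For a loop body with 8 scalar and 2 vector operations on a target reporting 4 scalar and 1 vector processing elements, both classes need 2 cycles, so no schedule can start a new iteration more often than every 2 cycles.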

Current limitations

The current implementation works for several very basic kernels and is capable of significantly improving the ILP for those cases, but fails on many of the more complex kernels.

Some of the key problems that still need to be solved are:

Bookkeeping of live variables. The live variable tracking in the current implementation is incorrect in places, which results in the generation of incorrect code. The problem becomes observable when more than two iterations of the loop body are overlapped within the kernel: live variables get mixed up and incorrect results are produced. Fixing this problem should also greatly improve the performance benefits of the high-level loop pipelining pass.
Instruction patterns are broken into pieces by the scheduling algorithm. This pass breaks patterns, such as multiply-add, and can redistribute their parts across different loop iterations, which prevents the back-end from properly recognizing them during instruction selection. This is a problem because it can completely remove the benefits of software pipelining and can introduce significant regressions. I currently do not have a proper solution to this problem, but several options were presented during the EuroLLVM talk mentioned above.
Hooking into existing targets. This pass was developed for an out-of-tree target and has not yet been tested with any of the in-tree targets, none of which has an appropriate implementation of the new TTI hooks. VLIW targets such as the Hexagon and R600 targets are obvious candidates, but this pass can also provide benefits for superscalar targets.
Regression tests need to be added to further verify the functionality and performance of both the pass and the generated code.
Register file pressure is not taken into account. Loop pipelining can significantly increase the number of live variables. It would make sense to add a heuristic to prevent this increased register file pressure from introducing performance regressions.
Hardware resources are currently modelled through two TTI hooks. There may be other hooks which could be added or used to replace the current ones.
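To make the register pressure concern concrete, the sketch below gives a rough estimate of how many registers a pipelined loop needs: a value that stays live for longer than one initiation interval (II) has several copies live at once, one per overlapping iteration. This is a common textbook approximation and not a heuristic implemented in this pass; `ValueLifetime`, `liveCopies`, `estimateRegisterPressure` and `fitsRegisterBudget` are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Lifetime of one value in the flat single-iteration schedule, in cycles.
// Hypothetical type, used only for this illustration.
struct ValueLifetime {
  unsigned DefCycle;     // cycle in which the value is produced
  unsigned LastUseCycle; // cycle of its last use
};

// Copies of a value that are live at the same time in the steady state:
// ceil(lifetime / II), and at least one.
unsigned liveCopies(const ValueLifetime &V, unsigned II) {
  II = std::max(II, 1u);
  unsigned Length =
      V.LastUseCycle > V.DefCycle ? V.LastUseCycle - V.DefCycle : 0;
  return std::max(1u, (Length + II - 1) / II);
}

// Rough register demand: sum of the live copies of every value.
unsigned estimateRegisterPressure(const std::vector<ValueLifetime> &Values,
                                  unsigned II) {
  unsigned Total = 0;
  for (const ValueLifetime &V : Values)
    Total += liveCopies(V, II);
  return Total;
}

// Guard a candidate schedule against a register budget (an assumed input).
bool fitsRegisterBudget(const std::vector<ValueLifetime> &Values, unsigned II,
                        unsigned Budget) {
  return estimateRegisterPressure(Values, II) <= Budget;
}
```

A pass could use such a check to fall back to a larger II, or to skip pipelining altogether, when the estimate exceeds the budget. The same arithmetic shows why overlapping more than two iterations stresses live variable bookkeeping: a value that lives for 5 cycles at an II of 2 has three copies in flight.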
