Performance of the loop nest can benefit from loop interchange, but an scalar reduction prevents it.
Replace the scalar reduction by a vector, particularly by the vector that contains the final result of the original loop. Then, perform loop fission to isolate the computation of the scalar reduction in a perfect loop nesting and enable loop interchange. This typically leads to creating two loops: the first loop initializes the scalar reduction; and the second loop computes the scalar reduction. All these operations are performed directly on the vector of final results of the original loop. See below a code example on how it is done.
Inefficient memory access pattern and low locality of reference are among the main reasons for low performance on modern computer systems. Matrices are stored in a row-major order in C and column-major order in Fortran. Iterating over them column-wise (in C) and row-wise (in Fortran) is inefficient, because it uses the memory subsystem suboptimally.
Nested loops that iterate over matrices inefficiently can be optimized by applying loop interchange. Using loop interchange, the inefficient matrix access pattern is replaced with a more efficient one. Often, loop interchange enables vectorization of the innermost loop which additionally improves performance.
In order to perform the loop interchange, the loops need to be perfectly nested, i.e. all the statements need to be inside the innermost loop. However, due to the initialization of a reduction variable, loop interchange is not directly applicable.
for (int i = 0; i < n; i++) {
double s = 0.0;
for (int j = 0; j < n; j++) {
s += a[j][i];
}
b[i] = s;
}
With regards to the innermost loop, the memory access pattern of the matrix a
is strided, and this loop can benefit from loop interchange. However, the
reduction variable initialization (line 2) and usage (line 8) prevent it.
To make the loop vectorizable, we can remove the temporary scalar value s
and
replace it with direct writes to b[i]
:
for (int i = 0; i < n; i++) {
b[i] = 0.0;
for (int j = 0; j < n; j++) {
b[i] += a[j][i];
}
}
After doing this, we can use loop fission to move the statement on line 2 to a separate loop. The result looks like this:
for (int i = 0; i < n; i++) {
b[i] = 0.0;
}
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
b[i] += a[j][i];
}
}
The first loop (line 1) is not performance critical, since it is not nested. On the other hand, the second loop (line 5) is performance critical, since it contains the loop nest.
Fortunately, the loop nest is now perfectly nested, making loop interchange
applicable. The final result has the nested loops in ji
order instead of the
original ij
order:
for (int i = 0; i < n; i++) {
b[i] = 0.0;
}
for (int j = 0; j < n; j++) {
for (int i = 0; i < n; i++) {
b[i] += a[j][i];
}
}
do i = 1, size(b, 1)
s = 0.0
do j = 1, size(a, 2)
s = s + a(i, j)
end do
b(i) = s
end do
With regards to the innermost loop, the memory access pattern of the matrix a
is strided, and this loop can benefit from loop interchange. However, the
reduction variable initialization (line 2) and usage (line 8) prevent it.
To make the loop vectorizable, we can remove the temporary scalar value s
and
replace it with direct writes to b(i)
:
do i = 1, size(b, 1)
b(i) = 0.0
do j = 1, size(a, 2)
b(i) = b(i) + a(i, j)
end do
end do
After doing this, we can use loop fission to move the statement on line 2 to a separate loop. The result looks like this:
do i = 1, size(b, 1)
b(i) = 0.0
end do
do i = 1, size(b, 1)
do j = 1, size(a, 2)
b(i) = b(i) + a(i, j)
end do
end do
The first loop (line 1) is not performance critical, since it is not nested. On the other hand, the second loop (line 5) is performance critical, since it contains the loop nest.
Fortunately, the loop nest is now perfectly nested, making loop interchange
applicable. The final result has the nested loops in ji
order instead of the
original ij
order:
do i = 1, size(b, 1)
b(i) = 0.0
end do
do j = 1, size(a, 2)
do i = 1, size(b, 1)
b(i) = b(i) + a(i, j)
end do
end do