The loop can be partially vectorized by moving a gather memory access pattern to a separate loop that stores the values in consecutive memory locations.
Rewrite the loop to enable partial vectorization by moving the gather memory access pattern to a first loop and the rest of the loop body in a second loop. Next, store the gathered data in consecutive memory locations using a temporary vector that is written in the first loop and read in the second loop.
Vectorization is one of the most important ways to speed up computation in the loop. In practice, loops may contain vectorizable statements, but vectorization may be either inhibited or inefficient due to the usage of data stored in non- consecutive memory locations. Programs exhibit different types of memory access patterns that lead to non-consecutive memory access, e.g. strided, indirect, random accesses.
Thus, loop fission enables the partial vectorization by moving the gather memory access pattern to a separate loop. It computes a temporary array that stores the used data in consecutive memory locations. After loop fission, we end up with two loops. The first loop computes the gathers and will not be vectorized. The second loop computes the remainder of the original loop and will be vectorized.
Note
Loop fission introduces additional memory overheads, which is needed to allocate, read/write and deallocate the memory of the temporary vector that stores the used data in consecutive memory locations. The implementation of loop fission must take this into consideration to produce performant code.
Have a look at the following loop. This computationally-intensive loop might benefit from vectorization, but the loop will not be vectorized because the gather memory access pattern to array X is reading non-consecutive memory locations.
for (int i = 0; i < n; ++i) {
D[i] = a * X[index[i]] + Y[i]
}
After applying loop fission, the original loop is split into two loops. The
first loop gathers the non-consecutive data and stores the values in consecutive
storage using the array X_index_i
. The second loop uses the data from that
array, and the compiler will be able to vectorize the second loop. Overall, the
original loop is partially vectorized through loop fission.
for (int i = 0; i < n; ++i) {
X_index_i[i] = X[index[i]];
}
for (int i = 0; i < n; ++i) {
D[i] = a * X_index_i[i] + Y[i]
}
Have a look at the following loop. This computationally-intensive loop might
benefit from vectorization, but the loop will not be vectorized because the
gather memory access pattern to array X
is reading non-consecutive memory
locations.
do i = 1, size(D, 1)
D(i) = a * X(index(i)) + Y(i)
end do
After applying loop fission, the original loop is split into two loops. The
first loop gathers the non-consecutive data and stores the values in consecutive
storage using the array X_index_i
. The second loop uses the data from that
array, and the compiler will be able to vectorize the second loop. Overall, the
original loop is partially vectorized through loop fission.
real(kind=real32), allocatable :: X_index_i(:)
allocate(X_index_i(size(X, 1)))
do i = 1, size(X, 1)
X_index_i(i) = X(index(i))
end do
do i = 1, size(D, 1)
D(i) = a * X_index_i(i) + Y(i)
end do