Tested algorithms and implementations:
- Stack Blur (Anti-Grain Geometry 2.5 by Maxim Shemanarev)
- Recursive Blur (Anti-Grain Geometry 2.5 by Maxim Shemanarev)
- My unoptimized implementation of Stack Blur
- My optimized implementations of Stack Blur using SSE2, SSSE3, SSE4.1
Note: the AGG versions were slightly modified so that they can be used with multiple threads and to suppress some compiler warnings.
The Stack Blur algorithm was invented by Mario Klingemann.
https://medium.com/@quasimondo
AGG (Anti-Grain Geometry) is a library written by Maxim Shemanarev.
Do not use the SIMD versions with compiler optimizations disabled; they will be too slow.
Use at least the '-O1' optimization level (GCC and Clang).
All tested versions use a 32-bit pixel format (8 bits per component; the order of the components does not matter).
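The component order presumably does not matter because every 8-bit component is processed in the same way and independently of the others. The sketch below is not the blur itself; it only illustrates that idea by averaging two pixels component by component. The helper name average_pixels is hypothetical and does not come from the tested implementations.

```cpp
#include <cstdint>

// Hypothetical helper (illustration only, not part of the tested code):
// averages two 32-bit pixels one 8-bit component at a time.
// Every component gets identical treatment, so the result is the same
// whether the bytes are interpreted as RGBA, BGRA, ARGB, and so on.
std::uint32_t average_pixels(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const std::uint32_t ca = (a >> shift) & 0xFFu;  // component of pixel a
        const std::uint32_t cb = (b >> shift) & 0xFFu;  // same component of pixel b
        result |= ((ca + cb) / 2u) << shift;            // the average always fits in 8 bits
    }
    return result;
}
```

For example, average_pixels(0x10203040, 0x30405060) returns 0x20304050 regardless of which byte is treated as red, green, blue or alpha.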
Clang optimizes much better than GCC with the same flags. Both tested compilers are from the MSYS2/MinGW64 toolchain.
The fastest implementation I could write takes about 0.7 ms for a 1280x720 32bpp frame on an AMD Ryzen 7 2700 with SSE4.1 and 16 threads.
Suppose we have a loop:
for ( int i = 0; i < 8; i++ ) ...
And we want to split it across 3 threads.
A naive split gives 8 iterations / 3 threads = 2 iterations per thread, plus 2 remaining iterations that go to the last thread:
- Thread #1: [0;2) - size 2
- Thread #2: [2;4) - size 2
- Thread #3: [4;8) - size 4
This approach has two disadvantages:
- Non-uniform range distribution
- The last thread has the biggest block and often (though not always) begins its execution after all the previous threads have already started
Spreading the remaining iterations over the first threads, one each, gives a more uniform distribution:
- Thread #1: [0;3) - size 3
- Thread #2: [3;6) - size 3
- Thread #3: [6;8) - size 2
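The distribution with sizes 3, 3, 2 can be computed with integer division and remainder. Here is a minimal sketch, assuming half-open ranges [begin;end) as in the lists above. The function name split_range and its signature are hypothetical, chosen only for this illustration, and are not taken from the actual implementation.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (illustration only): splits [0; total) into `threads`
// consecutive half-open ranges. The first (total % threads) ranges receive
// one extra iteration, so block sizes differ by at most 1.
std::vector<std::pair<int, int>> split_range(int total, int threads)
{
    assert(total >= 0 && threads > 0);

    const int base  = total / threads;  // iterations every thread gets
    const int extra = total % threads;  // leftover iterations to spread out

    std::vector<std::pair<int, int>> ranges;
    int begin = 0;
    for (int t = 0; t < threads; t++) {
        const int size = base + (t < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}
```

For 8 iterations and 3 threads, split_range(8, 3) yields [0;3), [3;6), [6;8). If there are more threads than iterations, the trailing ranges are empty, so a caller may want to skip them instead of starting a thread for each.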
TODO:
- Recursive Blur SIMD version
- Gaussian Blur SIMD version