Skip to content

Latest commit

 

History

History
148 lines (115 loc) · 5.79 KB

README.md

File metadata and controls

148 lines (115 loc) · 5.79 KB

fftconv

Extremely fast CPU 1D convolutions. Faster than Intel IPP and Apple Accelerate on their respective platforms

Kernel size = 245

It's well know that convolution in the time domain is equivalent to multiplication in the frequency domain (circular convolution). With the Fast Fourier Transform, we can reduce the time complexity of a discrete convolution from O(n^2) to O(n log(n)), where n is the larger of the two array sizes. The overlap-add method is a fast convolution method commonly use in FIR filtering, where the discrete signal is often much longer than the FIR filter kernel.

Usage

Check this repo to see how to use fftconv as a custom port through VCPKG.

  • fftconv::convolve_fftw implements FFT convolution.
  • fftconv::oaconvolve_fftw implements FFT convolution using the overlap-add method, much faster when one sequence is much longer than the other (e.g. in FIR filtering).

All convolution functions support float and double and use a C++20 std::span interface.

template <FloatOrDouble Real>
void oaconvolve_fftw(const std::span<const Real> arr,
                     const std::span<const Real> kernel, std::span<Real> res);

Python bindings are provided through Cython.

Build the test and benchmark

This benchmark is out of date. Check this repo for the up-to-date benchmarks.

The only dependency of fftconv is fftw3. Since the float and double interface of fftw3 are used, link with -lfftw -lfftwf.

Benchmark and test dependencies:

Python

TODO The Python wrapper is currently out of date.

A Cython wrapper is provided. Dependencies:

  • Cython for C++ bindings
  • numpy (benchmarked against)
  • numba (benchmarked against)
  • scipy (benchmarked against)
  • matplotlib (plot results)
python3 setup.py build_ext -i
python3 test.py # run the python test/benchmark

Benchmark results

CPU: Intel i7 Comet Lake

C++.

The test_fftconv binary gives an easy benchmark that runs every test case 5000 times. The bench_fftconv uses google-benchmark and gives much more reliable measures. Use ./script/run_bench to run the benchmark and generate figures.

Output from bench_fftconv (accurate bench) raw result saved in ./bench_result.json. Plot generated from plot_bench.py:

Output from test_fftconv (simple bench)

% ./build/test_fftconv
=== test case (1664, 65) ===
All tests passed.
    (5000 runs) convolve_fftw took 82ms
    (5000 runs) oaconvolve_fftw took 36ms
    (5000 runs) convolve_armadillo took 108ms
=== test case (2816, 65) ===
All tests passed.
    (5000 runs) convolve_fftw took 111ms
    (5000 runs) oaconvolve_fftw took 60ms
    (5000 runs) convolve_armadillo took 174ms
=== test case (2304, 65) ===
All tests passed.
    (5000 runs) convolve_fftw took 536ms
    (5000 runs) oaconvolve_fftw took 52ms
    (5000 runs) convolve_armadillo took 147ms
=== test case (4352, 65) ===
All tests passed.
    (5000 runs) convolve_fftw took 335ms
    (5000 runs) oaconvolve_fftw took 86ms
    (5000 runs) convolve_armadillo took 276ms

Python.

% python3 test.py
=== test case (1664, 65) ===
Vectors are equal.
    (5000 runs) convolve_fftw took 73ms
    (5000 runs) oaconvolve_fftw took 38ms
    (5000 runs) np.convolve took 140ms
    (5000 runs) numba.njit(np.convolve) took 1409ms
    (5000 runs) scipy.signal.convolve took 162ms
    (5000 runs) scipy.signal.fftconvolve took 199ms
    (5000 runs) scipy.signal.oaconvolve took 321ms
=== test case (2816, 65) ===
Vectors are equal.
    (5000 runs) convolve_fftw took 96ms
    (5000 runs) oaconvolve_fftw took 60ms
    (5000 runs) np.convolve took 236ms
    (5000 runs) numba.njit(np.convolve) took 2883ms
    (5000 runs) scipy.signal.convolve took 256ms
    (5000 runs) scipy.signal.fftconvolve took 256ms
    (5000 runs) scipy.signal.oaconvolve took 362ms
=== test case (2304, 65) ===
Vectors are equal.
    (5000 runs) convolve_fftw took 281ms
    (5000 runs) oaconvolve_fftw took 53ms
    (5000 runs) np.convolve took 194ms
    (5000 runs) numba.njit(np.convolve) took 2215ms
    (5000 runs) scipy.signal.convolve took 213ms
    (5000 runs) scipy.signal.fftconvolve took 240ms
    (5000 runs) scipy.signal.oaconvolve took 346ms
=== test case (4352, 65) ===
Vectors are equal.
    (5000 runs) convolve_fftw took 326ms
    (5000 runs) oaconvolve_fftw took 82ms
    (5000 runs) np.convolve took 358ms
    (5000 runs) numba.njit(np.convolve) took 3657ms
    (5000 runs) scipy.signal.convolve took 378ms
    (5000 runs) scipy.signal.fftconvolve took 365ms
    (5000 runs) scipy.signal.oaconvolve took 395ms

The Python wrapper is almost as fast as the C++ code, as it has very little overhead.

Implementation Details

TODO