FFT benchmark performance calculation #5256

neoskypig · 2020-09-10T08:53:41Z

In apps/fft/main.cpp, I am confused by the code below. Why does the Halide running time need to be divided by reps, while the FFTW time does not?
double halide_t = benchmark(samples, 1, [&]() { bench_c2c.realize(R_c2c); }) * 1e6 / reps;
double fftw_t = benchmark(samples, reps, [&]() { fftwf_execute(c2c_plan); }) * 1e6;

abadams · 2020-09-10T16:15:54Z

Maintainer

In the first case, the Halide pipeline itself runs the FFT reps times internally, so we call the lambda once, time it, and divide by the number of reps we asked bench_c2c to do. In the second case, reps is passed as an argument to benchmark, which calls the lambda that many times internally and then does the division itself. Both numbers are therefore times per single FFT.
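To make the two timing styles concrete, here is a minimal sketch. It does not use Halide or FFTW: `bench_min`, `one_transform`, `time_batched_us`, and `time_looped_us` are hypothetical stand-ins written for illustration, and `bench_min` only assumes the behaviour described above (run the lambda `iterations` times per sample and report a per-iteration time).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <limits>

// Hypothetical stand-in for a benchmark helper (not the real Halide one):
// runs `op` `iterations` times per sample and returns the best
// per-iteration time in seconds over all samples.
double bench_min(uint64_t samples, uint64_t iterations,
                 const std::function<void()> &op) {
    assert(samples > 0 && iterations > 0);
    double best = std::numeric_limits<double>::infinity();
    for (uint64_t s = 0; s < samples; s++) {
        auto t0 = std::chrono::steady_clock::now();
        for (uint64_t i = 0; i < iterations; i++) op();
        auto t1 = std::chrono::steady_clock::now();
        double elapsed = std::chrono::duration<double>(t1 - t0).count();
        best = std::min(best, elapsed / iterations);  // division done here
    }
    return best;
}

// Hypothetical workload standing in for one FFT.
uint64_t g_calls = 0;          // counts how many "transforms" were run
volatile double g_sink = 0.0;  // keeps the loop from being optimized away
void one_transform() {
    double acc = 0.0;
    for (int i = 0; i < 1000; i++) acc += i * 0.5;
    g_sink = acc;
    g_calls++;
}

// Style A (like the Halide line): the lambda does `reps` transforms itself,
// the helper is told iterations = 1, and the caller divides by `reps`.
double time_batched_us(uint64_t samples, uint64_t reps) {
    return bench_min(samples, 1, [&]() {
               for (uint64_t r = 0; r < reps; r++) one_transform();
           }) * 1e6 / reps;
}

// Style B (like the FFTW line): the lambda does one transform, and the
// helper loops `reps` times and divides internally.
double time_looped_us(uint64_t samples, uint64_t reps) {
    return bench_min(samples, reps, [&]() { one_transform(); }) * 1e6;
}
```

Calling `time_batched_us(3, 10)` and `time_looped_us(3, 10)` each runs the workload 30 times and returns microseconds per transform. The difference is only where the loop and the division live, not what is being measured.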

neoskypig · 2020-09-13T23:50:02Z

Author

@abadams thanks for your feedback. Why not use the same method for both timings? I found that if we change them to use the same method, the Halide version's performance score becomes worse.

dsharletg · 2020-09-14T00:20:39Z

Maintainer

You might be able to recover most of the lost performance by changing the Halide target to include Target::NoAsserts and Target::NoBoundsQuery. There is also a small amount of overhead associated with calling realize that might be affecting it. This can be avoided by AOT-compiling the pipeline with a generator.

The reasoning behind the benchmarking is:

- When I looked, it seemed like fftw benchmarked to/from the same buffer, which means the input and output will be cached (if it is small enough), which is also what this FFT program does.
- In practice, we use this FFT by running it in batches, like the Halide program does here. And typically, the input and output buffers are part of a larger pipeline, so it is reasonable to assume they are cached.
- It might be possible to make FFTW also run in batches, which would be the most fair comparison, but I am not sure this would allow running to/from the same buffer (for caching). But even if you could, you wouldn't be able to fuse other work with the FFTs, which you can with the Halide version.

abadams · 2020-09-14T00:22:37Z

Maintainer

In other words, this benchmark is representative of how the Halide FFT is used in production, and how fftw would be used in production if it were to be used instead.
