Skip to content

Commit

Permalink
sw/fft: Add more details on perf calculation
Browse files Browse the repository at this point in the history
  • Loading branch information
mp-17 committed Jan 11, 2025
1 parent d3d0689 commit 7059213
Show file tree
Hide file tree
Showing 2 changed files with 22 additions and 0 deletions.
21 changes: 21 additions & 0 deletions sw/spatzBenchmarks/dp-fft/main.c
Original file line number Diff line number Diff line change
Expand Up @@ -114,6 +114,7 @@ int main() {

// Display runtime
if (cid == 0) {
// See the bottom of the file for additional info on the performance calculation
long unsigned int performance =
1000 * 5 * NFFT * log2_nfft / timer;
long unsigned int utilization =
Expand Down Expand Up @@ -146,3 +147,23 @@ int main() {

return 0;
}

// Comments on performance calculation:
// Number of mathematical operations: 5 * NFFT * log2(NFFT)
// Each fp-add, fp-mul, fp-sub, fp-div corresponds to a floating-point operation
// and can be executed in one cycle by each FPU.
// Instead, macc-like instructions, i.e. multiply-and-accumulate, correspond to two operations
// and an FPU can compute one macc instruction per cycle, i.e., two fp-operations per cycle in this case.

// Max perf: 5/4 * #FPUs [DP-FLOP/cycle] -> we replace the 5/4 with the equivalent 1250/1000
// Why the 5/4 factor?
// If we had only macc-like instructions in the kernel, the max throughput would be 2 DP-FLOP/(cycle)
// for each FPU.
// Instead, if we only had non-macc instructions in the kernel, the max throughput would be 1 DP-FLOP/(cycle)
// for each FPU.
// In an intermediate case, the value is between 2 and 1 DP-FLOP/(cycle) for each FPU.
// The kernel loop is composed of 8 vector FP instructions. Two of them are macc-like and six are not.
// Thus, in each loop and for a vector length of one element (the vector length gets simplified anyway),
// we have 10 operations executed in 8 cycles in the best case possible with 100% utilization.
// Therefore, the maximum performance at 100% utilization is 10/8 DP-FLOP/(cycle) per FPU, i.e.,
// 5/4 DP-FLOP/(cycle) per FPU
1 change: 1 addition & 0 deletions sw/spatzBenchmarks/sp-fft/main.c
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,7 @@ int main() {

// Display runtime
if (cid == 0) {
// See the bottom of the file dp-fft/main.c for further info on the performance calculation
long unsigned int performance =
1000 * 5 * NFFT * log2_nfft / timer;
long unsigned int utilization =
Expand Down

0 comments on commit 7059213

Please sign in to comment.