From fee77a7294fe87dd4d583b895b646b1687b6bd2c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Francisco=20Jos=C3=A9=20Letterio?= <40742817+Fletterio@users.noreply.github.com>
Date: Wed, 22 Jan 2025 16:34:37 -0300
Subject: [PATCH] Update index.md

---
 .../index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blog/2025/2025-01-10-fft-bloom-optimized-to-the-bone-in-nabla/index.md b/blog/2025/2025-01-10-fft-bloom-optimized-to-the-bone-in-nabla/index.md
index 59c08e1..2085613 100644
--- a/blog/2025/2025-01-10-fft-bloom-optimized-to-the-bone-in-nabla/index.md
+++ b/blog/2025/2025-01-10-fft-bloom-optimized-to-the-bone-in-nabla/index.md
@@ -250,7 +250,7 @@ There is no element swap after butterflies though: virtual threads read in their
 
 Even better, all memory accesses done in stages previous to running a Workgroup-sized FFT are done in the same positions for different threads. What I mean by this is that even if virtual threads access different memory locations at each of these stages, *all memory locations accessed are owned the same thread*. You can see this in the diagram above: In stage $1$ thread $0$ owns memory locations $0,2,4,6$. After writing to these positions when computing the butterflies in that stage, it still owns those positions: virtual thread $0$ will need elements at positions $0$ and $2$ to run the Workgroup-sized FFT in stage $2$.
 
-Element at position $2$ was computed by virtual thread $2$, but since that virtual thread is also emulated by thread $0$, it's the same thread that owns that memory location! In practice this means that these computations don't require any sort of barriers, syncs or used of shared memory. This allows us to employ an optimization which is to preload elements per thread - essentially reading the needed elements for each thread only once at the start and keeping them in local/private memory for the rest of the algorithm. This will be explained in more detail in the Statis Polymorphism section of this article.
+Element at position $2$ was computed by virtual thread $2$, but since that virtual thread is also emulated by thread $0$, it's the same thread that owns that memory location! In practice this means that these computations don't require any sort of barriers, syncs or use of shared memory. This allows us to employ an optimization which is to preload elements per thread - essentially reading the needed elements for each thread only once at the start and keeping them in local/private memory for the rest of the algorithm. This will be explained in more detail in the Static Polymorphism section of this article.
 
 All of this implies that such FFTs use the same amount of shared memory as a Workgroup-sized one. The downside is either increased latency or decreased occupancy, depending on whether these reads/writes happen in global memory or local/private (preloaded) memory.
 
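For context on the paragraph this patch touches, here is a minimal CUDA sketch of the per-thread preloading idea it describes. This is only an illustration under assumed parameters, not Nabla's actual code; every name in it (`bigFFTPreStages`, `WorkgroupSize`, `VirtualFactor`) is hypothetical:

```cuda
// Hypothetical sketch, NOT Nabla's implementation: one physical thread
// emulates VirtualFactor virtual threads of a larger FFT. Because every
// memory location touched in the pre-Workgroup stages is owned by the
// same physical thread, the elements can be preloaded into private
// memory once and the butterflies run with no barriers or shared memory.
#include <cuComplex.h>

constexpr int   WorkgroupSize  = 256;               // physical threads per workgroup
constexpr int   VirtualFactor  = 4;                 // virtual threads per physical thread
constexpr int   ElemsPerThread = 2 * VirtualFactor; // each radix-2 butterfly touches 2 elements
constexpr float kPi            = 3.14159265358979f;

__global__ void bigFFTPreStages(cuFloatComplex* data)
{
    const int t = threadIdx.x;

    // Preload: each element this thread owns is read from global memory
    // exactly once. Thread t owns global positions t + e * WorkgroupSize.
    cuFloatComplex local[ElemsPerThread];
    for (int e = 0; e < ElemsPerThread; ++e)
        local[e] = data[t + e * WorkgroupSize];

    // Decimation-in-frequency pre-stages, run until the remaining
    // sub-FFTs are Workgroup-sized. Every location read here was written
    // by a virtual thread emulated by this same physical thread, so no
    // __syncthreads() and no shared memory are needed.
    for (int stride = ElemsPerThread / 2; stride >= 2; stride /= 2)
        for (int lo = 0; lo < ElemsPerThread; lo += 2 * stride)
            for (int i = lo; i < lo + stride; ++i)
            {
                // The global sub-FFT length at this stage and this
                // element's offset inside its sub-FFT give the twiddle.
                const int   len = 2 * stride * WorkgroupSize;
                const int   off = t + (i - lo) * WorkgroupSize;
                const float ang = -2.0f * kPi * off / len;
                const cuFloatComplex w = make_cuFloatComplex(cosf(ang), sinf(ang));

                const cuFloatComplex a = local[i];
                const cuFloatComplex b = local[i + stride];
                local[i]          = cuCaddf(a, b);             // a + b
                local[i + stride] = cuCmulf(cuCsubf(a, b), w); // (a - b) * w
            }

    // The elements stay preloaded in `local`; from here the
    // Workgroup-sized FFT would take over, now using shared memory and
    // barriers to exchange elements between *different* threads (not shown).
}
```

A launch like `bigFFTPreStages<<<1, WorkgroupSize>>>(deviceData);` would run the pre-stages of a length-2048 FFT (2 * 256 * 4); the point of the sketch is that no `__syncthreads()` appears until the Workgroup-sized FFT takes over, exactly as the patched paragraph claims.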