In #9, I mentioned that the FFT convolution is a bit slower than a direct for loop for the template lengths being used. Below is a comparison of the convolution times (I made this with Python, so it won't be identical to the R functions, but it's the same idea).
At first I thought this was the end of the story and we'd just use the for loop, but then I realized that adept runs the convolutions on windows of length 6000 (by default), so we can reuse the FFTs behind each convolution many times.
For example, for a week of 80Hz data, adept chops it up into ~8000 windows of length 6000. The naive implementation of the FFT convolution would just run `convolve(data_window, template)` for all 30 scalings of the template on all 8000 windows, giving 240000 total calls. Each call does a forward FFT of the data window, a forward FFT of the template, and an inverse FFT at the end, for a total of 720000 [i]FFTs. However, we can share the FFT of each template across all data windows and the FFT of each data window across all templates, i.e. we can do this as 8000 data-window FFTs, 30 template FFTs, and 240000 inverse FFTs back, for a total of 248030 [i]FFTs.
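To make the sharing concrete, here is a minimal sketch of the scheme (the function name and list-of-vectors interface are made up for illustration, not adept's API): one forward FFT per template, one per data window, and a single inverse FFT per (window, template) pair.

```r
sliding_dots_shared <- function(windows, templates) {
  # windows:   list of numeric vectors, all of length L
  # templates: list of numeric vectors, each of length m <= L
  L <- length(windows[[1]])

  # One FFT per template: zero-pad to the window length and conjugate,
  # so the pointwise product below is a circular cross-correlation.
  t_ffts <- lapply(templates, function(tmpl) {
    Conj(fft(c(tmpl, rep(0, L - length(tmpl)))))
  })

  # One FFT per data window.
  w_ffts <- lapply(windows, fft)

  # One inverse FFT per (window, template) pair. Element k of each
  # result is sum(window[k + 0:(m - 1)] * template), valid for
  # k = 1, ..., L - m + 1 (entries past that wrap around circularly).
  lapply(w_ffts, function(wf) {
    lapply(t_ffts, function(tf) Re(fft(wf * tf, inverse = TRUE)) / L)
  })
}
```

With 8000 windows and 30 templates, this is exactly the 8000 + 30 + 240000 transform count above.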
In addition, we can tune the window length to be as fast as possible. For example, perhaps using ~3000 longer windows of length 16000, for a total of 93030 (3000 + 30 + 90000) [i]FFTs, would be faster. Or maybe a shorter window is faster.
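For intuition, a quick back-of-the-envelope count of transforms as a function of window length L (this only counts transforms; each one also gets more expensive as L grows, so the count alone doesn't pick the winner, hence the benchmark below):

```r
N <- 7 * 24 * 3600 * 80 # ~48.4M samples in a week of 80Hz data
n_templates <- 30
for (L in c(3000, 6000, 16000, 60000)) {
  n_windows <- ceiling(N / L)
  cat("L =", L, ":", n_windows, "windows,",
      n_windows + n_templates + n_windows * n_templates, "[i]FFTs\n")
}
# e.g. L = 6000 -> ~250000 [i]FFTs; L = 16000 -> ~94000
```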
Some crude benchmarking shows approximately a 2x speedup (you could also tune the naive FFT window length a bit, but all the variants I tried were >1.9s):
```r
set.seed(1)

## Naive FFT convolution: one full convolve() per (window, template)
## pair. 100 windows of length 6000 x 30 template scalings
## (600000 samples of data in total).
n <- 6000
x <- 1:n
ptm <- proc.time()
for (i in 1:(30 * 100)) {
  convolve(x, x / n / (n - 1))
}
proc.time() - ptm
# 1.9 s

## Shared FFTs: transform each template once, each data window once,
## then one inverse FFT per (window, template) pair. Same 600000
## samples, sliced into 6000 windows of length 100.
n <- 100 # 6000 slices
x <- 1:n
ptm <- proc.time()
for (i in 1:30) { # 30 template FFTs
  x_fft <- fft(x)
}
for (i in 1:6000) { # 6000 data-window FFTs
  x_fft2 <- Conj(fft(x))
}
for (i in 1:(30 * 6000)) { # one inverse FFT per pair
  Re(fft(x_fft * x_fft2 / n / (n - 1), inverse = TRUE))[1]
}
proc.time() - ptm
# 1.16 s

## Re-running the shared version with the slice count adjusted so each
## configuration covers the same 600000 samples:
n <- 600 # 1000 slices
# 0.99 s
n <- 6000 # 100 slices
# 1.05 s
n <- 10000 # 60 slices
# 1.06 s
n <- 60000 # 10 slices
# 1.4 s
```
If I remember correctly, the naive FFT convolution was maybe 30% slower than the for loop on some test runs (I say this with low confidence, but it definitely wasn't 2x slower), so if we can make it ~2x faster we would see a net speedup.
Two caveats:

1. Only the covariance can be written as a single convolution, as far as I can tell. The correlation requires 3, because the rolling standard deviations need rolling sums of x and x^2, each a box-filter convolution of its own (sketched after this list), and in that case I'm not sure this would help (since we'd need something like a >3x speedup).
2. It would probably require an overhaul of how the code thinks about its variables. Essentially, it might be easier to start passing data and templates around in the frequency domain rather than the time domain.
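To illustrate caveat 1, here is a sketch (my own illustration, not adept's code) of rolling correlation via three FFT products: the sliding dot product with the template, plus two box-filter convolutions for the rolling sums of x and x^2, which give the rolling mean and sd.

```r
rolling_cor_fft <- function(x, tmpl) {
  L <- length(x); m <- length(tmpl)
  pad <- function(v) c(v, rep(0, L - length(v)))
  # Forward FFTs. Fx and Fx2 are shareable across all templates;
  # Ft and Fbox are shareable across all data windows.
  Fx   <- fft(x)
  Fx2  <- fft(x^2)
  Ft   <- Conj(fft(pad(tmpl)))
  Fbox <- Conj(fft(pad(rep(1, m))))
  k <- 1:(L - m + 1) # offsets where the template fits without wrapping
  # Three inverse FFTs per (window, template) pair:
  D  <- (Re(fft(Fx  * Ft,   inverse = TRUE)) / L)[k] # sliding dot product
  S1 <- (Re(fft(Fx  * Fbox, inverse = TRUE)) / L)[k] # rolling sum of x
  S2 <- (Re(fft(Fx2 * Fbox, inverse = TRUE)) / L)[k] # rolling sum of x^2
  sx <- sqrt((S2 - S1^2 / m) / (m - 1))              # rolling sd of x
  (D - S1 * mean(tmpl)) / ((m - 1) * sx * sd(tmpl))
}
```

The forward FFTs still share nicely, but the three inverse FFTs per pair are the part that triples, which is where the >3x figure comes from.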
For now, if the current state of optimizations is sufficient, I don't think it's worth spending too much time on this (especially given caveat 2). I do think it would be a pretty interesting path to explore, though, so I wanted to bring it up as a discussion point.