Minor improvement to si.smoo. #106
Replies: 22 comments 4 replies
-
How much would it save in a concrete use case? Any measurements?
-
So the DSP code would be for the last version: smooth(s, x) = fb ~ _ with { fb(y) = s * (y - x) + x; };
@orlarey @josmithiii @rmichon what do you guys think?
-
Hi Dario and Stéphane,
This is a winner. I am strongly in favor of the change. One multiply and
two additions is fundamentally less work than two multiplies and one
addition. However, when two multiplies are available in parallel, then
(1-b) * x + b * y can be faster than x + b * (y-x) because it takes two
steps instead of three. Thus, a SIMD implementation might prefer the first
form, but Faust does not yet support SIMD as far as I know.
Ideally both forms would compile to the same assembly, but this is not the
case. Neither the Faust compiler nor the C++ compiler appears to work to
minimize multiplies relative to additions when the target architecture
warrants that.
Of course we should run benchmarks to measure the actual improvement on
each architecture, but looking at assembly can also give the answer.
I recently learned about the Compiler Explorer at godbolt.org, for
comparing assemblies on various processors, and this was my first use of it:
First, here is the Faust source I used, from Dario:
// FAUST:
import("stdfaust.lib");
smooth(coeff, x) = fb ~ _ with { fb(y) = y + (1.0 - coeff) * (x - y); };
c = 1.0 - 44.1 / ma.SR;
smooth3(s, x) = fb ~ _ with { fb(y) = s * (y - x) + x; };
process = _ <: si.smooth(c), smooth(c), smooth3(c);
Next, I compiled it at the command line with a simple "faust source.dsp"
command (no fancy options), and lifted out the compute() method to create
a standalone code snippet (note that it's no longer virtual):
// C++
#define FAUSTFLOAT float
int fSampleRate = 44100;
float fConst0 = 0.1; // linear-interpolation constant
float fConst1 = 0.9; // 1-fConst0
float fRec0[2];
float fRec1[2];
float fRec2[2];
void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) {
FAUSTFLOAT* input0 = inputs[0];
FAUSTFLOAT* output0 = outputs[0];
FAUSTFLOAT* output1 = outputs[1];
FAUSTFLOAT* output2 = outputs[2];
for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) {
float fTemp0 = float(input0[i0]);
fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0));
output0[i0] = FAUSTFLOAT(fRec0[0]);
fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1])));
output1[i0] = FAUSTFLOAT(fRec1[0]);
fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0)));
output2[i0] = FAUSTFLOAT(fRec2[0]);
fRec0[1] = fRec0[0];
fRec1[1] = fRec1[0];
fRec2[1] = fRec2[0];
}
}
This code can be pasted into the left panel of the Compiler Explorer at
godbolt.org.
Finally, choose your processor architecture and compiler on the right, and
your C++ compiler options.
Here I chose the first Intel case (more readable than ARM): x86-64 clang
(assertions trunk), -std=c++17 -O3.
Below is the assembly output with my added comments indicating where I
guessed things came from.
You can see that the fundamental computation structure is preserved all the
way down to the bottom, even with -O3 optimization.
The clear winner is smooth3, and benchmarks should verify that.
Resulting annotated ASSEMBLY, x86-64 clang (assertions trunk), -std=c++17
-O3
# =============================================
compute(int, float**, float**): # @compute(int, float**, float**)
...
.LBB0_2: # %for.body
# fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0)); // where fTemp0 = float(input0[i0]);
# output0[i0] = FAUSTFLOAT(fRec0[0]);
# 7 instructions:
movss xmm1, dword ptr [r8 + 4*rax] # xmm1 = mem[0],zero,zero,zero
mulss xmm0, dword ptr [rip + fConst1]
movss xmm2, dword ptr [rip + fConst0] # xmm2 = mem[0],zero,zero,zero
mulss xmm2, xmm1
addss xmm2, xmm0
movss dword ptr [rip + fRec0], xmm2
movss dword ptr [rcx + 4*rax], xmm2
# fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1])));
# output1[i0] = FAUSTFLOAT(fRec1[0]);
# 7 instructions:
movss xmm0, dword ptr [rip + fRec1+4] # xmm0 = mem[0],zero,zero,zero
movaps xmm2, xmm1
subss xmm2, xmm0
mulss xmm2, dword ptr [rip + fConst0]
addss xmm2, xmm0
movss dword ptr [rip + fRec1], xmm2
movss dword ptr [rsi + 4*rax], xmm2
# fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0)));
# output2[i0] = FAUSTFLOAT(fRec2[0]);
# 6 instructions:
movss xmm0, dword ptr [rip + fRec2+4] # xmm0 = mem[0],zero,zero,zero
subss xmm0, xmm1
mulss xmm0, dword ptr [rip + fConst1]
addss xmm0, xmm1
movss dword ptr [rip + fRec2], xmm0
movss dword ptr [rdx + 4*rax], xmm0
# fRec0[1] = fRec0[0];
movss xmm0, dword ptr [rip + fRec0] # xmm0 = mem[0],zero,zero,zero
movss dword ptr [rip + fRec0+4], xmm0
# fRec1[1] = fRec1[0];
movss xmm1, dword ptr [rip + fRec1] # xmm1 = mem[0],zero,zero,zero
movss dword ptr [rip + fRec1+4], xmm1
# fRec2[1] = fRec2[0];
movss xmm1, dword ptr [rip + fRec2] # xmm1 = mem[0],zero,zero,zero
movss dword ptr [rip + fRec2+4], xmm1
# i0 = i0 + 1
add rax, 1
cmp rdi, rax
jne .LBB0_2
#------------------------ end of loop ---------------------------
.LBB0_3: # %for.cond.cleanup
ret
fSampleRate:
.long 44100 # 0xac44
fConst0:
.long 0x3dcccccd # float 0.100000001
fConst1:
.long 0x3f666666 # float 0.899999976
fRec0:
.zero 8
fRec1:
.zero 8
fRec2:
.zero 8
=============================================
If you read this far, Intel wants to hire you :-)
Cheers,
Julius
--
Julius O. Smith III ***@***.***>
Professor of Music and, by courtesy, Electrical Engineering
CCRMA, Stanford University
http://ccrma.stanford.edu/~jos/
-
Since I was still the author of fi.smooth(), this is done.
Dario, I just copy/pasted your last line, and put your name next to it, but
feel free to "take it over" and write your own version with your own
copyright and documentation, etc.
However, it would need to be free on the level of an MIT or STK-4.3 license,
because otherwise I would have to rewrite the line to keep it STK-4.3,
and you would have to choose another name besides "smooth".
I hope we will continue to find performance-improvement gems like this!
This is a big one because it speeds up all controller parameter smoothing
in all the audio inner loops of all Faust modules, among many other uses.
Cheers,
- Julius
-
I meant si.smooth - still not used to the new library organization!
-
Oops, I just noticed Stéphane actually wrote the line I copy/pasted - will fix...
-
Benchmarking is indeed necessary. This is faster with the new version, tested with faustbench-llvm on an Apple M1:
process = par(i, 10, si.smoo);
But this is a bit slower with the new version, especially in scalar (= default) code model:
voice(i) = os.osc(400+i*300) : si.smoo;
process = par(i, 10, voice(i));
So I kept the old code (just in case...) in this commit 234eadc
-
Wow, that is massively unexpected. I will next study the ARM assembly
(which I have to learn). It should not be possible to make it slower on
x86 unless the compiler can behave in some new way. I assume you had -O3
on, etc.
-
Interesting! A possible explanation (pure speculation ;-)) is that with the old smooth, the two multiplications are independent and can be done in parallel by the CPU, while all operations with the new smooth have to be done in sequence. Godbolt/ICC 2021.3.0/-O3 -ffast-math. Faust code compiled with the experimental graph compiler with the -osd option to optimize 1-sample delay lines.
Old smooth (the two multiplications are independent):
..B8.4: # Preds ..B8.4 ..B8.3
[inner-loop assembly listing not preserved]
New smooth (all the operations have to be done in sequence):
..B7.4: # Preds ..B7.4 ..B7.3
[inner-loop assembly listing not preserved]
-
Should we have a way to choose one of the 3 implementations? (which is a more general library design question BTW...) with something like:
- added in platform.lib:
//---------------------------------`(pl.)smooth_type`----------------------------
// Smooth implementation type, see si.smooth
//-----------------------------------------------------------------------------
smooth_type = 0;
//smooth_type = 1;
//smooth_type = 2;
- then one of the 3 possible implementations can be selected with the appropriate smooth_type value:
smooth_imp = case {
    (0,s) => \(x).(x * (1.0 - s) : + ~ *(s));
    (1,s) => \(x).(fb ~ _ with { fb(y) = s * (y - x) + x; });
    (2,s) => \(x).(fb ~ _ with { fb(y) = y + (1.0 - s) * (x - y); });
};
smooth = smooth_imp(pl.smooth_type);
-
That sounds like a good theory and plan. In the spirit of how FFTW was
created, there could be a tool that benchmarks for any chosen architecture
to determine the best choices.
…On Mon, Nov 1, 2021 at 7:29 AM Stéphane Letz wrote:
Should we have a way to choose one of the 3 implementations ? (which is a
more general *library design* question BTW...) with something like:
- added in platform.lib:
//---------------------------------`(pl.)smooth_type`----------------------------
// Smooth implementation type, see si.smooth
//-----------------------------------------------------------------------------
smooth_type = 0;
//smooth_type = 1;
//smooth_type = 2;
- then one of the 3 possible implementations can be selected with the
appropriate smooth_type value:
smooth_imp = case {
(0,s) => \(x).(x * (1.0 - s) : + ~ *(s));
(1,s) => \(x).(fb ~ _ with { fb(y) = s * (y - x) + x; });
(2,s) => \(x).(fb ~ _ with { fb(y) = y + (1.0 - s) * (x - y); });
};
smooth = smooth_imp(pl.smooth_type);
-
We finally decided to restore the previous code (still faster), but keep the 3 versions available. The platform.lib mode is not used yet to select the version, which is hard-coded for now (since we are not sure the platform.lib mode is the way to go), see eccd83c
-
Two more ideas inspired by the "settle latch" proposed in the Matthew
Robbetts talk at ADC-21 on using C++ Expression Templates to approximate
some features of Faust:
// automatically skip multiply-add when sufficiently close:
smooth4(s, x) = fb ~ _ with {
fb(y) = select2(ymx>0, dn, up) with {
ymx = y-x;
up = select2(ymx < 1.0e-7, s * ymx + x, x);
dn = select2(ymx > -1.0e-7, s * ymx + x, x);
};
};
// Set enable e to 0 by timer after any change in s to skip multiply-add thereafter:
smooth5(s, e, x) = fb ~ _ with { fb(y) = select2(e>0, y, s * (y - x) + x);
};
- Julius
-
Not sure I follow... I guess I'll have to wait for Matthew Robbetts' talk at ADC-21 to be available on YouTube? Or can you explain more?
-
It should appear on YouTube soon.
The idea is quite simple: When a slider value changes, its smoothed output
exponentially approaches the new value, thanks to smooth(), and the
exponential keeps computing even after effectively getting to the target
value. The idea is to get rid of these unnecessary exponentials (one-pole
filterings) most of the time for parameters that normally sit still (which
is almost all of them). They only need "dezippering" when they actually
change. Furthermore, from the smoother pole, which we set, we also know
when we can turn it off, or it can sense arrival and turn off automatically.
- Julius
-
Matthew's talk is related to this ADC-16 talk:
https://www.youtube.com/watch?v=XK88ji7vpyQ
-
OK, but select is strict in Faust, see https://faustdoc.grame.fr/manual/syntax/#select2-primitive
-
> OK, but select is strict in Faust, see
> https://faustdoc.grame.fr/manual/syntax/#select2-primitive
Yes, this needs to wait for the master-with-mute branch.
-
> Even if Faust's semantics weren't strict, I wonder if a branching
> mechanism with two IFs inside an IF wouldn't still be heavier than a
> multiply-and-add calculation. I'll try to put together a little benchmark
> for that.
Yes, that is a valid concern. On some architectures it's probably better
to just let the multiply-add fly. However, fundamentally, in hardware, it
is much less work to skip it. It's software's job to reap that savings
somehow. :-)
…On Sat, Nov 20, 2021 at 11:42 PM Dario Sanfilippo wrote:
Hi, Julius.
Wouldn't that require Faust's strict semantics to be broken? I am
referring to the fact that both branches of any IF-statement in Faust are
always computed, so we'd just add three branching mechanisms on top of what
we already have:
Smooth4:
virtual void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) {
FAUSTFLOAT* input0 = inputs[0];
FAUSTFLOAT* input1 = inputs[1];
FAUSTFLOAT* output0 = outputs[0];
for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) {
float fTemp0 = float(input1[i0]);
float fTemp1 = (fRec0[1] - fTemp0);
float fTemp2 = (fTemp0 + (float(input0[i0]) * fTemp1));
float fThen2 = ((fTemp1 > -1.00000001e-07f) ? fTemp0 : fTemp2);
float fElse2 = ((fTemp1 < 1.00000001e-07f) ? fTemp0 : fTemp2);
fRec0[0] = ((fTemp1 > 0.0f) ? fElse2 : fThen2);
output0[i0] = FAUSTFLOAT(fRec0[0]);
fRec0[1] = fRec0[0];
}
}
smooth(coeff, x) = fb ~ _
with {
fb(y) = coeff * (y - x) + x;
};
compiles to:
virtual void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) {
FAUSTFLOAT* input0 = inputs[0];
FAUSTFLOAT* input1 = inputs[1];
FAUSTFLOAT* output0 = outputs[0];
for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) {
float fTemp0 = float(input1[i0]);
fRec0[0] = (fTemp0 + (float(input0[i0]) * (fRec0[1] - fTemp0)));
output0[i0] = FAUSTFLOAT(fRec0[0]);
fRec0[1] = fRec0[0];
}
}
Even if Faust's semantics weren't strict, I wonder if a branching
mechanism with two IFs inside an IF wouldn't still be heavier than a
multiply-and-add calculation. I'll try to put together a little benchmark
for that.
Cheers,
Dario
-
Here's a preliminary benchmark for the two implementations; please have a closer look in case I did something wrong.
If I use random inputs to actually challenge the branch predictor, we get these execution times (ms):
Branch = 3720.51
No branch = 1108.6
If I use an input that constantly increments by a very small value such as 1.0e-50, hence always resulting in a ymx < 1.0e-7, the IF implementation is about twice as fast. If I constantly increment it so that ymx is always > 1.0e-7, the two implementations are essentially the same.
I'm on Apple M1, compiling with clang13 -Ofast. See the C++ code below:
Ciao,
Dario
-
Wow, that's much worse than I would have predicted!
It implies no branch target prediction at all, and no ability to save the
FPU cost on branches not needing it. Completely unexpected to me.
…On Sun, Nov 21, 2021 at 1:53 AM Dario Sanfilippo wrote:
Here's a preliminary benchmark for the two implementations; please have a
closer look in case I did something wrong.
If I use random inputs to actually challenge the branch predictor, we get
these execution times (ms):
Branch = 3720.51
No branch = 1108.6
If I use an input that constantly increments a very small values such as
1.0e-50, hence always resulting in a ymx < 1.0e-7, the IF implementation is
about twice as fast. If I constantly increment it so that ymx is always >
1.0e-7, the two implementations are essentially the same.
I'm on Apple M1, compiling with clang13 -Ofast. See the C++ code below:
#include <iostream>
#include <cmath>
#include <chrono>
double smoothBranch(double b1, double& y, double x) {
double ymx = y - x;
if (ymx > 0) {
if (ymx < 1.0e-7) {
return x;
} else {
y = b1 * ymx + x;
return y;
}
} else {
if (ymx > -1.0e-7) {
return x;
} else {
y = b1 * ymx + x;
return y;
}
}
}
double smooth(double b1, double& y, double x) {
double ymx = y - x;
y = b1 * ymx + x;
return y;
}
int main(int argc, char** argv) {
double p = 0;
int n = 500000000;
double d1_avg = 0;
double d2_avg = 0;
unsigned int seed = 12345;
unsigned int mask = 4294967295u; // unsigned: 4294967295 does not fit in int,
unsigned int random = 0;         // and signed overflow in the LCG below is UB
double x = 0;
double y1 = 0;
double y2 = 0;
double b1 = .999;
for (auto tries = 0; tries < 10; ++tries) {
auto n1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < n; ++i) {
x = random / 2147483647.0;
random = (random * 1103515245 + seed) & mask;
p = smoothBranch(b1, y1, x);
}
auto n2 = std::chrono::high_resolution_clock::now();
std::cout << "smoothBranch: " << p << std::endl;
std::chrono::duration<double, std::milli> d = n2 - n1;
auto c1 = d.count();
d1_avg += d.count();
auto n3 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < n; ++i) {
x = random / 2147483647.0;
random = (random * 1103515245 + seed) & mask;
p = smooth(b1, y2, x);
}
auto n4 = std::chrono::high_resolution_clock::now();
std::cout << "smooth: " << p << std::endl;
d = n4 - n3;
auto c2 = d.count();
d2_avg += d.count();
}
std::cout << "Branch = " << d1_avg / 10.0 << std::endl;
std::cout << "No branch = " << d2_avg / 10.0 << std::endl;
}
Ciao,
Dario
-
I thought modern processors had growing caches of parallel executed
instructions on both sides of the branch, allowing zero-overhead selection
of the winning branch. Reaching the target (or determining the enable bit
to be false) should abort the multiply-in-progress. In any case, this is
definitely a situation where we will have to tell the compiler/hardware it
can stop working unnecessarily hard, if we can ever figure out how to do
it. :-)
Thanks for the benchmarks!
- Julius
…On Sun, Nov 21, 2021 at 6:43 AM Dario Sanfilippo wrote:
I'm really not strong on these low-level matters but it might also be
that, especially with nested branches, clearing the pipeline of a
mispredicted branch can sometimes be more expensive than calculating the
branch itself.
Ciao,
Dario
-
Hello, people.
This is really no big deal but we could save a multiply in si.smoo if we changed the one-pole structure from
y[n] = (1 - b) * x[n] + b * y[n - 1]
to
y[n] = y[n - 1] + (1 - b) * (x[n] - y[n - 1])
as they are identical.
However, we should also consider that the outputs are slightly different due to rounding errors, so changing the filter could affect some old Faust code that used si.smoo in a deterministic chaotic network, for example.
If we run this code
we see these differences in the outputs, which would be negligible for most cases:
Ciao,
Dario