Minor improvement to si.smoo. #106

dariosanfilippo · 2021-10-30T15:37:45Z

dariosanfilippo
Oct 30, 2021

Hello, people.

This is really no big deal but we could save a multiply in si.smoo if we changed the one-pole structure from

y[n] = (1 - b) * x[n] + b * y[n - 1]

to

y[n] = y[n - 1] + (1 - b) * (x[n] - y[n - 1])

as they are identical.

However, we should also consider that the outputs are slightly different due to rounding errors, so changing the filter could affect some old Faust code that used si.smoo in a deterministic chaotic network, for example.

If we run this code

import("stdfaust.lib");
smooth(coeff, x) = fb ~ _
    with {
        fb(y) = y + (1.0 - coeff) * (x - y);
    };
c = 1.0 - 44.1 / ma.SR;
process = 1 - 1' <: smooth(c) , si.smooth(c) : -;

we see these differences in the outputs, which would be negligible for most cases:

0
0
0
0
0
0
0
-1.16415322e-10
-1.16415322e-10
-1.16415322e-10
-1.16415322e-10
-1.16415322e-10
-1.16415322e-10
-1.16415322e-10
-1.16415322e-10
-1.16415322e-10
-1.16415322e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.32830644e-10
-2.91038305e-10
-2.91038305e-10

Ciao,
Dario

sletz · 2021-10-30T17:08:43Z

sletz
Oct 30, 2021
Maintainer

How much would it save in concrete use-case ? Any measure?

1 reply

dariosanfilippo Oct 30, 2021
Author

No, I haven't measured anything yet but we would save a multiply operation looking at the generated code. I'm sure that that's never gonna cause any bottle neck but I don't know if we could see a tiny improvement in some cases.

Also, consider that we could change si.smooth into using the form

y[n] = b * (y[n - 1] - x[n]) + x[n]

and all functions using it (all the envelopes, for example) wouldn't require changing the calculation of the coefficients.

Ciao,
Dario

sletz · 2021-10-30T21:38:13Z

sletz
Oct 30, 2021
Maintainer

So the DSP code would be for the last version: smooth(s, x) = fb ~ _ with { fb(y) = s * (y - x) + x; };

@orlarey @josmithiii @rmichon what do you guys think?

0 replies

josmithiii · 2021-10-31T09:11:23Z

josmithiii
Oct 31, 2021
Collaborator

Hi Dario and Stéphane, This is a winner. I am strongly in favor of the change. One multiply and two additions is fundamentally less work than two multiplies and one addition. However, when two multiplies are available in parallel, then (1-b) * x + b * y can be faster than x + b * (y-x) because it takes two steps instead of three. Thus, a SIMD implementation might prefer the first form, but Faust does not yet support SIMD as far as I know. Ideally both forms would compile to the same assembly, but this is not the case. Neither the Faust compiler nor the C++ compiler appear to work to minimize multiplies relative to additions when the target architecture warrants that. Of course we should run benchmarks to measure the actual improvement on each architecture, but looking at assembly can also give the answer. I recently learned about the Compiler Explorer at godbolt.org, for comparing assemblies on various processors, and this was my first use of it: First, here is the Faust source I used, from Dario: // FAUST: import("stdfaust.lib"); smooth(coeff, x) = fb ~ _ with { fb(y) = y + (1.0 - coeff) * (x - y); }; c = 1.0 - 44.1 / ma.SR smooth3(s, x) = fb ~ _ with { fb(y) = s * (y - x) + x; }; process = _ <: si.smooth(c), smooth(c), smooth3(c); Next, I compiled it at the command line with a simple "faust source.dsp" command (no fancy options), and lifted out the compute() method to create a standalone code snippet (not that it's no longer virtual): // C++ #define FAUSTFLOAT float int fSampleRate = 44100; float fConst0 = 0.1; // linear-interpolation constant float fConst1 = 0.9; // 1-fConst0 float fRec0[2]; float fRec1[2]; float fRec2[2]; void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) { FAUSTFLOAT* input0 = inputs[0]; FAUSTFLOAT* output0 = outputs[0]; FAUSTFLOAT* output1 = outputs[1]; FAUSTFLOAT* output2 = outputs[2]; for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) { float fTemp0 = float(input0[i0]); fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0)); output0[i0] = FAUSTFLOAT(fRec0[0]); fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1]))); output1[i0] = FAUSTFLOAT(fRec1[0]); fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0))); output2[i0] = FAUSTFLOAT(fRec2[0]); fRec0[1] = fRec0[0]; fRec1[1] = fRec1[0]; fRec2[1] = fRec2[0]; } } This code can be pasted into the left panel of the Compiler Explorer at godbolt.org. Finally, choose your processor architecture and compiler on the right, and your C++ compiler options. Here I chose the first Intel case (more readable than ARM): x86-64 clang (assertions trunk), -std=c++17 -O3. Below is the assembly output with my added comments indicating where I guessed things came from. You can see that the fundamental computation structure is preserved all the way down to the bottom, even with -O3 optimization. The clear winner is smooth3, and benchmarks should verify that. Resulting annotated ASSEMBLY, x86-64 clang (assertions trunk), -std=c++17 -O3 # ============================================= compute(int, float**, float**): # @compute(int, float**, float**) ... .LBB0_2: # %for.body # fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0)); // where fTemp0 = float(input0[i0]); # output0[i0] = FAUSTFLOAT(fRec0[0]); # 7 instructions: movss xmm1, dword ptr [r8 + 4*rax] # xmm1 = mem[0],zero,zero,zero *mulss* xmm0, dword ptr [rip + fConst1] movss xmm2, dword ptr [rip + fConst0] # xmm2 = mem[0],zero,zero,zero *mulss* xmm2, xmm1 *addss* xmm2, xmm0 movss dword ptr [rip + fRec0], xmm2 movss dword ptr [rcx + 4*rax], xmm2 # fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1]))); # output1[i0] = FAUSTFLOAT(fRec1[0]); # 7 instructions: movss xmm0, dword ptr [rip + fRec1+4] # xmm0 = mem[0],zero,zero,zero movaps xmm2, xmm1 *subss* xmm2, xmm0 *mulss* xmm2, dword ptr [rip + fConst0] *addss* xmm2, xmm0 movss dword ptr [rip + fRec1], xmm2 movss dword ptr [rsi + 4*rax], xmm2 # fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0))); # output2[i0] = FAUSTFLOAT(fRec2[0]); # 6 instructions: movss xmm0, dword ptr [rip + fRec2+4] # xmm0 = mem[0],zero,zero,zero *subss* xmm0, xmm1 *mulss* xmm0, dword ptr [rip + fConst1] *addss* xmm0, xmm1 movss dword ptr [rip + fRec2], xmm0 movss dword ptr [rdx + 4*rax], xmm0 # fRec0[1] = fRec0[0]; movss xmm0, dword ptr [rip + fRec0] # xmm0 = mem[0],zero,zero,zero movss dword ptr [rip + fRec0+4], xmm0 # fRec1[1] = fRec1[0]; movss xmm1, dword ptr [rip + fRec1] # xmm1 = mem[0],zero,zero,zero movss dword ptr [rip + fRec1+4], xmm1 # fRec2[1] = fRec2[0]; movss xmm1, dword ptr [rip + fRec2] # xmm1 = mem[0],zero,zero,zero movss dword ptr [rip + fRec2+4], xmm1 # i0 = i0 + 1 add rax, 1 cmp rdi, rax jne .LBB0_2 #------------------------ end of loop --------------------------- .LBB0_3: # %for.cond.cleanup ret fSampleRate: .long 44100 # 0xac44 fConst0: .long 0x3dcccccd # float 0.100000001 fConst1: .long 0x3f666666 # float 0.899999976 fRec0: .zero 8 fRec1: .zero 8 fRec2: .zero 8 ============================================= If you read this far, Intel wants to hire you :-) Cheers, Julius

…

On Sat, Oct 30, 2021 at 2:38 PM Stéphane Letz ***@***.***> wrote: So the DSP code would be for the last version : smooth(s, x) = fb ~ _ with { fb(y) = s * (y - x) + x; }; @orlarey <https://github.com/orlarey> @josmithiii <https://github.com/josmithiii> @rmichon <https://github.com/rmichon> what do you guys think? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#106 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQZKFOJULXG6YFETZHBWE3UJRXWBANCNFSM5HBGMYXQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

0 replies

josmithiii · 2021-10-31T13:53:43Z

josmithiii
Oct 31, 2021
Collaborator

Since I was still the author of fi.smooth(), this is done. Dario, I just copy/pasted your last line, and put your name next to it, but feel free to "take it over" and write your own version with your own copyright and documentation, etc. However, it would need to be free on the level of an MIT or STK-4.3 license, because otherwise I would have to rewrite the line to keep it STK-4,3, and you would have to choose another name besides "smooth". I hope we will continue to find performance-improvement gems like this! This is a big one because it speeds up all controller parameter smoothing in all the audio inner loops of all Faust modules, among many other uses. Cheers, - Julius

…

On Sun, Oct 31, 2021 at 2:10 AM Julius Smith ***@***.***> wrote: Hi Dario and Stéphane, This is a winner. I am strongly in favor of the change. One multiply and two additions is fundamentally less work than two multiplies and one addition. However, when two multiplies are available in parallel, then (1-b) * x + b * y can be faster than x + b * (y-x) because it takes two steps instead of three. Thus, a SIMD implementation might prefer the first form, but Faust does not yet support SIMD as far as I know. Ideally both forms would compile to the same assembly, but this is not the case. Neither the Faust compiler nor the C++ compiler appear to work to minimize multiplies relative to additions when the target architecture warrants that. Of course we should run benchmarks to measure the actual improvement on each architecture, but looking at assembly can also give the answer. I recently learned about the Compiler Explorer at godbolt.org, for comparing assemblies on various processors, and this was my first use of it: First, here is the Faust source I used, from Dario: // FAUST: import("stdfaust.lib"); smooth(coeff, x) = fb ~ _ with { fb(y) = y + (1.0 - coeff) * (x - y); }; c = 1.0 - 44.1 / ma.SR smooth3(s, x) = fb ~ _ with { fb(y) = s * (y - x) + x; }; process = _ <: si.smooth(c), smooth(c), smooth3(c); Next, I compiled it at the command line with a simple "faust source.dsp" command (no fancy options), and lifted out the compute() method to create a standalone code snippet (not that it's no longer virtual): // C++ #define FAUSTFLOAT float int fSampleRate = 44100; float fConst0 = 0.1; // linear-interpolation constant float fConst1 = 0.9; // 1-fConst0 float fRec0[2]; float fRec1[2]; float fRec2[2]; void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) { FAUSTFLOAT* input0 = inputs[0]; FAUSTFLOAT* output0 = outputs[0]; FAUSTFLOAT* output1 = outputs[1]; FAUSTFLOAT* output2 = outputs[2]; for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) { float fTemp0 = float(input0[i0]); fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0)); output0[i0] = FAUSTFLOAT(fRec0[0]); fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1]))); output1[i0] = FAUSTFLOAT(fRec1[0]); fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0))); output2[i0] = FAUSTFLOAT(fRec2[0]); fRec0[1] = fRec0[0]; fRec1[1] = fRec1[0]; fRec2[1] = fRec2[0]; } } This code can be pasted into the left panel of the Compiler Explorer at godbolt.org. Finally, choose your processor architecture and compiler on the right, and your C++ compiler options. Here I chose the first Intel case (more readable than ARM): x86-64 clang (assertions trunk), -std=c++17 -O3. Below is the assembly output with my added comments indicating where I guessed things came from. You can see that the fundamental computation structure is preserved all the way down to the bottom, even with -O3 optimization. The clear winner is smooth3, and benchmarks should verify that. Resulting annotated ASSEMBLY, x86-64 clang (assertions trunk), -std=c++17 -O3 # ============================================= compute(int, float**, float**): # @compute(int, float**, float**) ... .LBB0_2: # %for.body # fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0)); // where fTemp0 = float(input0[i0]); # output0[i0] = FAUSTFLOAT(fRec0[0]); # 7 instructions: movss xmm1, dword ptr [r8 + 4*rax] # xmm1 = mem[0],zero,zero,zero *mulss* xmm0, dword ptr [rip + fConst1] movss xmm2, dword ptr [rip + fConst0] # xmm2 = mem[0],zero,zero,zero *mulss* xmm2, xmm1 *addss* xmm2, xmm0 movss dword ptr [rip + fRec0], xmm2 movss dword ptr [rcx + 4*rax], xmm2 # fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1]))); # output1[i0] = FAUSTFLOAT(fRec1[0]); # 7 instructions: movss xmm0, dword ptr [rip + fRec1+4] # xmm0 = mem[0],zero,zero,zero movaps xmm2, xmm1 *subss* xmm2, xmm0 *mulss* xmm2, dword ptr [rip + fConst0] *addss* xmm2, xmm0 movss dword ptr [rip + fRec1], xmm2 movss dword ptr [rsi + 4*rax], xmm2 # fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0))); # output2[i0] = FAUSTFLOAT(fRec2[0]); # 6 instructions: movss xmm0, dword ptr [rip + fRec2+4] # xmm0 = mem[0],zero,zero,zero *subss* xmm0, xmm1 *mulss* xmm0, dword ptr [rip + fConst1] *addss* xmm0, xmm1 movss dword ptr [rip + fRec2], xmm0 movss dword ptr [rdx + 4*rax], xmm0 # fRec0[1] = fRec0[0]; movss xmm0, dword ptr [rip + fRec0] # xmm0 = mem[0],zero,zero,zero movss dword ptr [rip + fRec0+4], xmm0 # fRec1[1] = fRec1[0]; movss xmm1, dword ptr [rip + fRec1] # xmm1 = mem[0],zero,zero,zero movss dword ptr [rip + fRec1+4], xmm1 # fRec2[1] = fRec2[0]; movss xmm1, dword ptr [rip + fRec2] # xmm1 = mem[0],zero,zero,zero movss dword ptr [rip + fRec2+4], xmm1 # i0 = i0 + 1 add rax, 1 cmp rdi, rax jne .LBB0_2 #------------------------ end of loop --------------------------- .LBB0_3: # %for.cond.cleanup ret fSampleRate: .long 44100 # 0xac44 fConst0: .long 0x3dcccccd # float 0.100000001 fConst1: .long 0x3f666666 # float 0.899999976 fRec0: .zero 8 fRec1: .zero 8 fRec2: .zero 8 ============================================= If you read this far, Intel wants to hire you :-) Cheers, Julius On Sat, Oct 30, 2021 at 2:38 PM Stéphane Letz ***@***.***> wrote: > So the DSP code would be for the last version : smooth(s, x) = fb ~ _ > with { fb(y) = s * (y - x) + x; }; > > @orlarey <https://github.com/orlarey> @josmithiii > <https://github.com/josmithiii> @rmichon <https://github.com/rmichon> > what do you guys think? > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#106 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AAQZKFOJULXG6YFETZHBWE3UJRXWBANCNFSM5HBGMYXQ> > . > Triage notifications on the go with GitHub Mobile for iOS > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> > or Android > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. > > -- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

1 reply

dariosanfilippo Oct 31, 2021
Author

Dear Julius,

that's totally fine that you added the code, and indeed it was Stéphane's. :-)

Also, thank you so much for taking a look at the assembly; it does appear that it should be at least a little bit faster.

I've also found that looking at the assembly can be deceiving sometimes: you see an IF-based algorithm generating branching and jumps and you'd assume that it will be slower than, say, using the std::signbit function, but then the branch predictor on M1 chips must be really good and it's faster when measuring execution times.

It would be great to bench the two but I am in the middle of a Homebrew/Macports war at the moment and nothing works anymore. :-) I'll try the faustbench tools once I'm able to build again.

Dario

josmithiii · 2021-10-31T14:06:50Z

josmithiii
Oct 31, 2021
Collaborator

I meant si.smooth - still not used to the new library organization!

…

On Sun, Oct 31, 2021 at 6:53 AM Julius Smith ***@***.***> wrote: Since I was still the author of fi.smooth(), this is done. Dario, I just copy/pasted your last line, and put your name next to it, but feel free to "take it over" and write your own version with your own copyright and documentation, etc. However, it would need to be free on the level of an MIT or STK-4.3 license, because otherwise I would have to rewrite the line to keep it STK-4,3, and you would have to choose another name besides "smooth". I hope we will continue to find performance-improvement gems like this! This is a big one because it speeds up all controller parameter smoothing in all the audio inner loops of all Faust modules, among many other uses. Cheers, - Julius On Sun, Oct 31, 2021 at 2:10 AM Julius Smith ***@***.***> wrote: > Hi Dario and Stéphane, > > This is a winner. I am strongly in favor of the change. One multiply > and two additions is fundamentally less work than two multiplies and one > addition. However, when two multiplies are available in parallel, then > (1-b) * x + b * y can be faster than x + b * (y-x) because it takes two > steps instead of three. Thus, a SIMD implementation might prefer the first > form, but Faust does not yet support SIMD as far as I know. > > Ideally both forms would compile to the same assembly, but this is not > the case. Neither the Faust compiler nor the C++ compiler appear to work > to minimize multiplies relative to additions when the target architecture > warrants that. > > Of course we should run benchmarks to measure the actual improvement on > each architecture, but looking at assembly can also give the answer. > I recently learned about the Compiler Explorer at godbolt.org, for > comparing assemblies on various processors, and this was my first use of it: > > First, here is the Faust source I used, from Dario: > > // FAUST: > > import("stdfaust.lib"); > smooth(coeff, x) = fb ~ _ with { fb(y) = y + (1.0 - coeff) * (x - y); }; > c = 1.0 - 44.1 / ma.SR > smooth3(s, x) = fb ~ _ with { fb(y) = s * (y - x) + x; }; > process = _ <: si.smooth(c), smooth(c), smooth3(c); > > Next, I compiled it at the command line with a simple "faust source.dsp" > command (no fancy options), and > lifted out the compute() method to create a standalone code snippet (not > that it's no longer virtual): > > // C++ > > #define FAUSTFLOAT float > > int fSampleRate = 44100; > float fConst0 = 0.1; // linear-interpolation constant > float fConst1 = 0.9; // 1-fConst0 > float fRec0[2]; > float fRec1[2]; > float fRec2[2]; > > void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) { > FAUSTFLOAT* input0 = inputs[0]; > FAUSTFLOAT* output0 = outputs[0]; > FAUSTFLOAT* output1 = outputs[1]; > FAUSTFLOAT* output2 = outputs[2]; > for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) { > float fTemp0 = float(input0[i0]); > fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0)); > output0[i0] = FAUSTFLOAT(fRec0[0]); > fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1]))); > output1[i0] = FAUSTFLOAT(fRec1[0]); > fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0))); > output2[i0] = FAUSTFLOAT(fRec2[0]); > fRec0[1] = fRec0[0]; > fRec1[1] = fRec1[0]; > fRec2[1] = fRec2[0]; > } > } > > This code can be pasted into the left panel of the Compiler Explorer at > godbolt.org. > > Finally, choose your processor architecture and compiler on the right, > and your C++ compiler options. > Here I chose the first Intel case (more readable than ARM): x86-64 clang > (assertions trunk), -std=c++17 -O3. > > Below is the assembly output with my added comments indicating where I > guessed things came from. > You can see that the fundamental computation structure is preserved all > the way down to the bottom, even with -O3 optimization. > The clear winner is smooth3, and benchmarks should verify that. > > Resulting annotated ASSEMBLY, x86-64 clang (assertions trunk), -std=c++17 > -O3 > > # ============================================= > compute(int, float**, float**): # @compute(int, float**, float**) > ... > .LBB0_2: # %for.body > > # fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0)); // where > fTemp0 = float(input0[i0]); > # output0[i0] = FAUSTFLOAT(fRec0[0]); > # 7 instructions: > > movss xmm1, dword ptr [r8 + 4*rax] # xmm1 = mem[0],zero,zero,zero > *mulss* xmm0, dword ptr [rip + fConst1] > movss xmm2, dword ptr [rip + fConst0] # xmm2 = mem[0],zero,zero,zero > *mulss* xmm2, xmm1 > *addss* xmm2, xmm0 > movss dword ptr [rip + fRec0], xmm2 > movss dword ptr [rcx + 4*rax], xmm2 > > # fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1]))); > # output1[i0] = FAUSTFLOAT(fRec1[0]); > # 7 instructions: > > movss xmm0, dword ptr [rip + fRec1+4] # xmm0 = mem[0],zero,zero,zero > movaps xmm2, xmm1 > *subss* xmm2, xmm0 > *mulss* xmm2, dword ptr [rip + fConst0] > *addss* xmm2, xmm0 > movss dword ptr [rip + fRec1], xmm2 > movss dword ptr [rsi + 4*rax], xmm2 > > # fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0))); > # output2[i0] = FAUSTFLOAT(fRec2[0]); > # 6 instructions: > > movss xmm0, dword ptr [rip + fRec2+4] # xmm0 = mem[0],zero,zero,zero > *subss* xmm0, xmm1 > *mulss* xmm0, dword ptr [rip + fConst1] > *addss* xmm0, xmm1 > movss dword ptr [rip + fRec2], xmm0 > movss dword ptr [rdx + 4*rax], xmm0 > > # fRec0[1] = fRec0[0]; > movss xmm0, dword ptr [rip + fRec0] # xmm0 = mem[0],zero,zero,zero > movss dword ptr [rip + fRec0+4], xmm0 > > # fRec1[1] = fRec1[0]; > movss xmm1, dword ptr [rip + fRec1] # xmm1 = mem[0],zero,zero,zero > movss dword ptr [rip + fRec1+4], xmm1 > > # fRec2[1] = fRec2[0]; > movss xmm1, dword ptr [rip + fRec2] # xmm1 = mem[0],zero,zero,zero > movss dword ptr [rip + fRec2+4], xmm1 > > # i0 = i0 + 1 > add rax, 1 > cmp rdi, rax > jne .LBB0_2 > > #------------------------ end of loop --------------------------- > > .LBB0_3: # %for.cond.cleanup > ret > > fSampleRate: > .long 44100 # 0xac44 > > fConst0: > .long 0x3dcccccd # float 0.100000001 > > fConst1: > .long 0x3f666666 # float 0.899999976 > > fRec0: > .zero 8 > > fRec1: > .zero 8 > > fRec2: > .zero 8 > > ============================================= > > If you read this far, Intel wants to hire you :-) > > Cheers, > Julius > > > On Sat, Oct 30, 2021 at 2:38 PM Stéphane Letz ***@***.***> > wrote: > >> So the DSP code would be for the last version : smooth(s, x) = fb ~ _ >> with { fb(y) = s * (y - x) + x; }; >> >> @orlarey <https://github.com/orlarey> @josmithiii >> <https://github.com/josmithiii> @rmichon <https://github.com/rmichon> >> what do you guys think? >> >> — >> You are receiving this because you were mentioned. >> Reply to this email directly, view it on GitHub >> <#106 (comment)>, >> or unsubscribe >> <https://github.com/notifications/unsubscribe-auth/AAQZKFOJULXG6YFETZHBWE3UJRXWBANCNFSM5HBGMYXQ> >> . >> Triage notifications on the go with GitHub Mobile for iOS >> <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> >> or Android >> <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. >> >> > > > -- > Julius O. Smith III ***@***.***> > Professor of Music and, by courtesy, Electrical Engineering > CCRMA, Stanford University > http://ccrma.stanford.edu/~jos/ > -- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

0 replies

josmithiii · 2021-10-31T14:10:26Z

josmithiii
Oct 31, 2021
Collaborator

Oops, I just noticed Stéphane actually wrote the line I copy/pasted - will fix . . .

…

On Sun, Oct 31, 2021 at 7:06 AM Julius Smith ***@***.***> wrote: I meant si.smooth - still not used to the new library organization! On Sun, Oct 31, 2021 at 6:53 AM Julius Smith ***@***.***> wrote: > Since I was still the author of fi.smooth(), this is done. > Dario, I just copy/pasted your last line, and put your name next to it, > but > feel free to "take it over" and write your own version with your own > copyright and documentation, etc. > However, it would need to be free on the level of an MIT or STK-4.3 > license, > because otherwise I would have to rewrite the line to keep it STK-4,3, > and you would have to choose another name besides "smooth". > > I hope we will continue to find performance-improvement gems like this! > This is a big one because it speeds up all controller parameter smoothing > in all the audio inner loops of all Faust modules, among many other uses. > > Cheers, > - Julius > > On Sun, Oct 31, 2021 at 2:10 AM Julius Smith ***@***.***> > wrote: > >> Hi Dario and Stéphane, >> >> This is a winner. I am strongly in favor of the change. One multiply >> and two additions is fundamentally less work than two multiplies and one >> addition. However, when two multiplies are available in parallel, then >> (1-b) * x + b * y can be faster than x + b * (y-x) because it takes two >> steps instead of three. Thus, a SIMD implementation might prefer the first >> form, but Faust does not yet support SIMD as far as I know. >> >> Ideally both forms would compile to the same assembly, but this is not >> the case. Neither the Faust compiler nor the C++ compiler appear to work >> to minimize multiplies relative to additions when the target architecture >> warrants that. >> >> Of course we should run benchmarks to measure the actual improvement on >> each architecture, but looking at assembly can also give the answer. >> I recently learned about the Compiler Explorer at godbolt.org, for >> comparing assemblies on various processors, and this was my first use of it: >> >> First, here is the Faust source I used, from Dario: >> >> // FAUST: >> >> import("stdfaust.lib"); >> smooth(coeff, x) = fb ~ _ with { fb(y) = y + (1.0 - coeff) * (x - y); >> }; >> c = 1.0 - 44.1 / ma.SR >> smooth3(s, x) = fb ~ _ with { fb(y) = s * (y - x) + x; }; >> process = _ <: si.smooth(c), smooth(c), smooth3(c); >> >> Next, I compiled it at the command line with a simple "faust source.dsp" >> command (no fancy options), and >> lifted out the compute() method to create a standalone code snippet (not >> that it's no longer virtual): >> >> // C++ >> >> #define FAUSTFLOAT float >> >> int fSampleRate = 44100; >> float fConst0 = 0.1; // linear-interpolation constant >> float fConst1 = 0.9; // 1-fConst0 >> float fRec0[2]; >> float fRec1[2]; >> float fRec2[2]; >> >> void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) { >> FAUSTFLOAT* input0 = inputs[0]; >> FAUSTFLOAT* output0 = outputs[0]; >> FAUSTFLOAT* output1 = outputs[1]; >> FAUSTFLOAT* output2 = outputs[2]; >> for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) { >> float fTemp0 = float(input0[i0]); >> fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0)); >> output0[i0] = FAUSTFLOAT(fRec0[0]); >> fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1]))); >> output1[i0] = FAUSTFLOAT(fRec1[0]); >> fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0))); >> output2[i0] = FAUSTFLOAT(fRec2[0]); >> fRec0[1] = fRec0[0]; >> fRec1[1] = fRec1[0]; >> fRec2[1] = fRec2[0]; >> } >> } >> >> This code can be pasted into the left panel of the Compiler Explorer at >> godbolt.org. >> >> Finally, choose your processor architecture and compiler on the right, >> and your C++ compiler options. >> Here I chose the first Intel case (more readable than ARM): x86-64 clang >> (assertions trunk), -std=c++17 -O3. >> >> Below is the assembly output with my added comments indicating where I >> guessed things came from. >> You can see that the fundamental computation structure is preserved all >> the way down to the bottom, even with -O3 optimization. >> The clear winner is smooth3, and benchmarks should verify that. >> >> Resulting annotated ASSEMBLY, x86-64 clang (assertions trunk), >> -std=c++17 -O3 >> >> # ============================================= >> compute(int, float**, float**): # @compute(int, float**, float**) >> ... >> .LBB0_2: # %for.body >> >> # fRec0[0] = ((fConst1 * fRec0[1]) + (fConst0 * fTemp0)); // where >> fTemp0 = float(input0[i0]); >> # output0[i0] = FAUSTFLOAT(fRec0[0]); >> # 7 instructions: >> >> movss xmm1, dword ptr [r8 + 4*rax] # xmm1 = mem[0],zero,zero,zero >> *mulss* xmm0, dword ptr [rip + fConst1] >> movss xmm2, dword ptr [rip + fConst0] # xmm2 = mem[0],zero,zero,zero >> *mulss* xmm2, xmm1 >> *addss* xmm2, xmm0 >> movss dword ptr [rip + fRec0], xmm2 >> movss dword ptr [rcx + 4*rax], xmm2 >> >> # fRec1[0] = (fRec1[1] + (fConst0 * (fTemp0 - fRec1[1]))); >> # output1[i0] = FAUSTFLOAT(fRec1[0]); >> # 7 instructions: >> >> movss xmm0, dword ptr [rip + fRec1+4] # xmm0 = mem[0],zero,zero,zero >> movaps xmm2, xmm1 >> *subss* xmm2, xmm0 >> *mulss* xmm2, dword ptr [rip + fConst0] >> *addss* xmm2, xmm0 >> movss dword ptr [rip + fRec1], xmm2 >> movss dword ptr [rsi + 4*rax], xmm2 >> >> # fRec2[0] = (fTemp0 + (fConst1 * (fRec2[1] - fTemp0))); >> # output2[i0] = FAUSTFLOAT(fRec2[0]); >> # 6 instructions: >> >> movss xmm0, dword ptr [rip + fRec2+4] # xmm0 = mem[0],zero,zero,zero >> *subss* xmm0, xmm1 >> *mulss* xmm0, dword ptr [rip + fConst1] >> *addss* xmm0, xmm1 >> movss dword ptr [rip + fRec2], xmm0 >> movss dword ptr [rdx + 4*rax], xmm0 >> >> # fRec0[1] = fRec0[0]; >> movss xmm0, dword ptr [rip + fRec0] # xmm0 = mem[0],zero,zero,zero >> movss dword ptr [rip + fRec0+4], xmm0 >> >> # fRec1[1] = fRec1[0]; >> movss xmm1, dword ptr [rip + fRec1] # xmm1 = mem[0],zero,zero,zero >> movss dword ptr [rip + fRec1+4], xmm1 >> >> # fRec2[1] = fRec2[0]; >> movss xmm1, dword ptr [rip + fRec2] # xmm1 = mem[0],zero,zero,zero >> movss dword ptr [rip + fRec2+4], xmm1 >> >> # i0 = i0 + 1 >> add rax, 1 >> cmp rdi, rax >> jne .LBB0_2 >> >> #------------------------ end of loop --------------------------- >> >> .LBB0_3: # %for.cond.cleanup >> ret >> >> fSampleRate: >> .long 44100 # 0xac44 >> >> fConst0: >> .long 0x3dcccccd # float 0.100000001 >> >> fConst1: >> .long 0x3f666666 # float 0.899999976 >> >> fRec0: >> .zero 8 >> >> fRec1: >> .zero 8 >> >> fRec2: >> .zero 8 >> >> ============================================= >> >> If you read this far, Intel wants to hire you :-) >> >> Cheers, >> Julius >> >> >> On Sat, Oct 30, 2021 at 2:38 PM Stéphane Letz ***@***.***> >> wrote: >> >>> So the DSP code would be for the last version : smooth(s, x) = fb ~ _ >>> with { fb(y) = s * (y - x) + x; }; >>> >>> @orlarey <https://github.com/orlarey> @josmithiii >>> <https://github.com/josmithiii> @rmichon <https://github.com/rmichon> >>> what do you guys think? >>> >>> — >>> You are receiving this because you were mentioned. >>> Reply to this email directly, view it on GitHub >>> <#106 (comment)>, >>> or unsubscribe >>> <https://github.com/notifications/unsubscribe-auth/AAQZKFOJULXG6YFETZHBWE3UJRXWBANCNFSM5HBGMYXQ> >>> . >>> Triage notifications on the go with GitHub Mobile for iOS >>> <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> >>> or Android >>> <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. >>> >>> >> >> >> -- >> Julius O. Smith III ***@***.***> >> Professor of Music and, by courtesy, Electrical Engineering >> CCRMA, Stanford University >> http://ccrma.stanford.edu/~jos/ >> > > > -- > Julius O. Smith III ***@***.***> > Professor of Music and, by courtesy, Electrical Engineering > CCRMA, Stanford University > http://ccrma.stanford.edu/~jos/ > -- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

0 replies

sletz · 2021-10-31T14:29:42Z

sletz
Oct 31, 2021
Maintainer

Benchmarking is indeed necessary.

This is faster with the new version, tested with faustbench-llvm on an Apple M1:

process = par(i, 10, si.smoo);

But this is a bit slower with the new version, especially in scalar (= default) code model:

voice(i) = os.osc(400+i*300) : si.smoo;
process = par(i, 10, voice(i));

So I kept the old code (just in case...) in this commit 234eadc

0 replies

josmithiii · 2021-10-31T15:20:22Z

josmithiii
Oct 31, 2021
Collaborator

Wow, that is massively unexpected. I will next study the ARM assembly (which I have to learn). It should not be possible to make it slower on x86 unless the compiler can behave in some new way. I assume you had -O3 on, etc.

…

On Sun, Oct 31, 2021 at 7:29 AM Stéphane Letz ***@***.***> wrote: Benchmarking is indeed necessary. This is faster with the new version, tested with faustbench-llvm <https://github.com/grame-cncm/faust/tree/master-dev/tools/benchmark#faustbench-llvm> on an Apple M1: process = par(i, 10, si.smoo); But this is a bit slower with the new version: voice(i) = os.osc(400+i*300) : si.smoo; process = par(i, 10, voice(i)); So I kept the old code (just in case...) in, this commit 234eadc <234eadc> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#106 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQZKFN2WPQ473N5XWK2KRTUJVHGDANCNFSM5HBGMYXQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

0 replies

orlarey · 2021-11-01T13:14:30Z

orlarey
Nov 1, 2021
Maintainer

Interesting! A possible explanation (pure speculation ;-)) is that with the old smooth, the two multiplications are independent and can be done in parallel by the CPU, while all operations with the new smooth have to be done in sequence.

Godbolt/ICC 2021.3.0/-O3 -ffast-math. Faust code compiled with experimental graph compiler with -osd option to optimize 1 sample delay lines.

Old smooth (the two multiplications are independent)

..B8.4: # Preds ..B8.4 ..B8.3
movss xmm3, DWORD PTR [rdx+rsi8] #76.72
mulss xmm2, xmm1 #76.33
mulss xmm3, xmm0 #76.72
addss xmm2, xmm3 #76.72
movss DWORD PTR [rax+rsi8], xmm2 #77.17
movss xmm4, DWORD PTR [4+rdx+rsi8] #76.72
mulss xmm2, xmm1 #76.33
mulss xmm4, xmm0 #76.72
addss xmm2, xmm4 #76.72
movss DWORD PTR [4+rax+rsi8], xmm2 #77.17
inc rsi #75.9
cmp rsi, rcx #75.9
jb ..B8.4 # Prob 63% #75.9

New smooth, all the operations have to be done in sequence

..B7.4: # Preds ..B7.4 ..B7.3
movss xmm2, DWORD PTR [rdx+rsi8] #62.38
subss xmm1, xmm2 #63.47
mulss xmm1, xmm0 #63.47
addss xmm2, xmm1 #63.47
movss DWORD PTR [rax+rsi8], xmm2 #64.17
movss xmm1, DWORD PTR [4+rdx+rsi8] #62.38
subss xmm2, xmm1 #63.47
mulss xmm2, xmm0 #63.47
addss xmm1, xmm2 #63.47
movss DWORD PTR [4+rax+rsi8], xmm1 #64.17
inc rsi #61.9
cmp rsi, rcx #61.9
jb ..B7.4 # Prob 63% #61.9

0 replies

sletz · 2021-11-01T14:29:42Z

sletz
Nov 1, 2021
Maintainer

Should we have a way to choose one of the 3 implementations ? (which is a more general library design question BTW...) with something like:

added in platform.lib:


//---------------------------------`(pl.)smooth_type`----------------------------
// Smooth implementation type, see si.smooth
//-----------------------------------------------------------------------------
smooth_type = 0;
//smooth_type = 1;
//smooth_type = 2;

then one of the 3 possible implementations can be selected with the appropriate smooth_type value:

smooth_imp = case { 
  // y[n] = (1 - s) * x[n] + s * y[n - 1]
  (0,s) => \(x).(fb ~ _ with { fb(y) = (1.0 - s) * x + s * y; });
  
  // y[n] = s * (y[n - 1] - x[n]) + x[n]
  (1,s) => \(x).(fb ~ _ with { fb(y) = s * (y - x) + x; });
  
  // y[n] = y[n - 1] + (1 - s) * (x[n] - y[n - 1])
  (2,s) => \(x).(fb ~ _ with { fb(y) = y + (1.0 - s) * (x - y); });
};

smooth = smooth_imp(pl.smooth_type);

0 replies

josmithiii · 2021-11-02T08:14:57Z

josmithiii
Nov 2, 2021
Collaborator

That sounds like a good theory and plan. In the spirit of how FFTW was created, there could be a tool that benchmarks for any chosen architecture to determine the best choices.

…

On Mon, Nov 1, 2021 at 7:29 AM Stéphane Letz ***@***.***> wrote: Should we have a way to choose one of the 3 implementations ? (which is a more general *library design* question BTW...) with something like: - added in platform.lib: //---------------------------------`(pl.)smooth_type`---------------------------- // Smooth implementation type, see si.smooth //----------------------------------------------------------------------------- smooth_type = 0; //smooth_type = 1; //smooth_type = 2; - then one of the 3 possible implementations can be selected with the appropriate smooth_type value: smooth_imp = case { (0,s) => \(x).(x * (1.0 - s) : + ~ *(s)); (1,s) => \(x).(fb ~ _ with { fb(y) = s * (y - x) + x; }); (2,s) => \(x).(fb ~ _ with { fb(y) = y + (1.0 - s) * (x - y); }); }; smooth = smooth_imp(pl.smooth_type); — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#106 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQZKFISSQW4WBOOZUOFGUDUJ2P6BANCNFSM5HBGMYXQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

0 replies

sletz · 2021-11-02T12:52:37Z

sletz
Nov 2, 2021
Maintainer

We finally decided to restore the previous code (still faster), but keep the 3 versions available. The platform.lib mode is not used yet to select the version, which is hard-coded for now (since we are not sure the platform.lib mode is the way to go) , see eccd83c

0 replies

josmithiii · 2021-11-20T16:03:34Z

josmithiii
Nov 20, 2021
Collaborator

Two more ideas inspired by the "settle latch" proposed in the Matthew Robbetts talk at ADC-21 on using C++ Expression Templates to approximate some features of Faust: // automatically skip multiply-add when sufficiently close: smooth4(s, x) = fb ~ _ with { fb(y) = select2(ymx>0, dn, up) with { ymx = y-x; up = select2(ymx < 1.0e-7, s * ymx + x, x); dn = select2(ymx > -1.0e-7, s * ymx + x, x); }; }; // Set enable e to 0 by timer after any change in s to skip multiply-add thereafter: smooth5(s, e, x) = fb ~ _ with { fb(y) = select2(e>0, y, s * (y - x) + x); }; - Julius

0 replies

sletz · 2021-11-20T17:27:17Z

sletz
Nov 20, 2021
Maintainer

Not sure Io follow..., I guess I'll to wait for Matthew Robbetts talk at ADC-21 to be available on TY? Or can you explain more ?

0 replies

josmithiii · 2021-11-20T22:49:03Z

josmithiii
Nov 20, 2021
Collaborator

It should appear on YouTube soon. The idea is quite simple: When a slider value changes, its smoothed output exponentially approaches the new value, thanks to smooth(), and the exponential keeps computing even after effectively getting to the target value. The idea is to get rid of these unnecessary exponentials (one-pole filterings) most of the time for parameters that normally sit still (which is almost all of them). They only need "dezippering" when they actually change. Furthermore, from the smoother pole, which we set, we also know when we can turn it off, or it can sense arrival and turn off automatically. - Julius

…

On Sat, Nov 20, 2021 at 9:27 AM Stéphane Letz ***@***.***> wrote: Not sure Io follow..., I guess I'mme to wait for Matthew Robbetts talk at ADC-21 to be available on TY? Or can you explain more ? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#106 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQZKFPR5G5JCJKRZWGL5PDUM7LABANCNFSM5HBGMYXQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

1 reply

dariosanfilippo Nov 21, 2021
Author

Hi, Julius.

Wouldn't that require Faust's strict semantics to be broken? I am referring to the fact that both branches of any IF-statement in Faust are always computed, so we'd just add three branching mechanisms on top of what we already have:

Smooth4:

	virtual void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) {
		FAUSTFLOAT* input0 = inputs[0];
		FAUSTFLOAT* input1 = inputs[1];
		FAUSTFLOAT* output0 = outputs[0];
		for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) {
			float fTemp0 = float(input1[i0]);
			float fTemp1 = (fRec0[1] - fTemp0);
			float fTemp2 = (fTemp0 + (float(input0[i0]) * fTemp1));
			float fThen2 = ((fTemp1 > -1.00000001e-07f) ? fTemp0 : fTemp2);
			float fElse2 = ((fTemp1 < 1.00000001e-07f) ? fTemp0 : fTemp2);
			fRec0[0] = ((fTemp1 > 0.0f) ? fElse2 : fThen2);
			output0[i0] = FAUSTFLOAT(fRec0[0]);
			fRec0[1] = fRec0[0];
		}
	}

smooth(coeff, x) = fb ~ _
with {
fb(y) = coeff * (y - x) + x;
}; :

	virtual void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) {
		FAUSTFLOAT* input0 = inputs[0];
		FAUSTFLOAT* input1 = inputs[1];
		FAUSTFLOAT* output0 = outputs[0];
		for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) {
			float fTemp0 = float(input1[i0]);
			fRec0[0] = (fTemp0 + (float(input0[i0]) * (fRec0[1] - fTemp0)));
			output0[i0] = FAUSTFLOAT(fRec0[0]);
			fRec0[1] = fRec0[0];
		}
	}

Even if Faust's semantics weren't strict, I wonder if a branching mechanism with two IFs inside an IF wouldn't still be heavier than a multiply-and-add calculation. I'll try to put together a little benchmark for that.

Cheers,
Dario

josmithiii · 2021-11-20T23:26:37Z

josmithiii
Nov 20, 2021
Collaborator

Matthew's talk is related to this ADC-16 talk: https://www.youtube.com/watch?v=XK88ji7vpyQ

…

On Sat, Nov 20, 2021 at 2:48 PM Julius Smith ***@***.***> wrote: It should appear on YouTube soon. The idea is quite simple: When a slider value changes, its smoothed output exponentially approaches the new value, thanks to smooth(), and the exponential keeps computing even after effectively getting to the target value. The idea is to get rid of these unnecessary exponentials (one-pole filterings) most of the time for parameters that normally sit still (which is almost all of them). They only need "dezippering" when they actually change. Furthermore, from the smoother pole, which we set, we also know when we can turn it off, or it can sense arrival and turn off automatically. - Julius On Sat, Nov 20, 2021 at 9:27 AM Stéphane Letz ***@***.***> wrote: > Not sure Io follow..., I guess I'mme to wait for Matthew Robbetts talk at > ADC-21 to be available on TY? Or can you explain more ? > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#106 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AAQZKFPR5G5JCJKRZWGL5PDUM7LABANCNFSM5HBGMYXQ> > . > Triage notifications on the go with GitHub Mobile for iOS > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> > or Android > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. > > -- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

0 replies

sletz · 2021-11-21T07:37:31Z

sletz
Nov 21, 2021
Maintainer

OK, but select is strict in Faust, see https://faustdoc.grame.fr/manual/syntax/#select2-primitive

0 replies

josmithiii · 2021-11-21T09:42:31Z

josmithiii
Nov 21, 2021
Collaborator

OK, but select is strict in Faust, see

https://faustdoc.grame.fr/manual/syntax/#select2-primitive Yes, this needs to wait for the master-with-mute branch

…

On Sat, Nov 20, 2021 at 11:37 PM Stéphane Letz ***@***.***> wrote: OK, but select is strict in Faust, see https://faustdoc.grame.fr/manual/syntax/#select2-primitive — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#106 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQZKFKN2YJBWULNSRQ7ANDUNCOUNANCNFSM5HBGMYXQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

0 replies

josmithiii · 2021-11-21T09:48:24Z

josmithiii
Nov 21, 2021
Collaborator

Even if Faust's semantics weren't strict, I wonder if a branching

mechanism with two IFs inside an IF wouldn't still be heavier than a multiply-and-add calculation. I'll try to put together a little benchmark for that. Yes, that is a valid concern. On some architectures it's probably better to just let the multiply-add fly. However, fundamentally, in hardware, it is much less work to skip it. It's software's job to reap that savings somehow. :-)

…

On Sat, Nov 20, 2021 at 11:42 PM Dario Sanfilippo ***@***.***> wrote: Hi, Julius. Wouldn't that require Faust's strict semantics to be broken? I am referring to the fact that both branches of any IF-statement in Faust are always computed, so we'd just add three branching mechanisms on top of what we already have: Smooth4: virtual void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) { FAUSTFLOAT* input0 = inputs[0]; FAUSTFLOAT* input1 = inputs[1]; FAUSTFLOAT* output0 = outputs[0]; for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) { float fTemp0 = float(input1[i0]); float fTemp1 = (fRec0[1] - fTemp0); float fTemp2 = (fTemp0 + (float(input0[i0]) * fTemp1)); float fThen2 = ((fTemp1 > -1.00000001e-07f) ? fTemp0 : fTemp2); float fElse2 = ((fTemp1 < 1.00000001e-07f) ? fTemp0 : fTemp2); fRec0[0] = ((fTemp1 > 0.0f) ? fElse2 : fThen2); output0[i0] = FAUSTFLOAT(fRec0[0]); fRec0[1] = fRec0[0]; } } smooth(coeff, x) = fb ~ _ with { fb(y) = coeff * (y - x) + x; }; : virtual void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) { FAUSTFLOAT* input0 = inputs[0]; FAUSTFLOAT* input1 = inputs[1]; FAUSTFLOAT* output0 = outputs[0]; for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) { float fTemp0 = float(input1[i0]); fRec0[0] = (fTemp0 + (float(input0[i0]) * (fRec0[1] - fTemp0))); output0[i0] = FAUSTFLOAT(fRec0[0]); fRec0[1] = fRec0[0]; } } Even if Faust's semantics weren't strict, I wonder if a branching mechanism with two IFs inside an IF wouldn't still be heavier than a multiply-and-add calculation. I'll try to put together a little benchmark for that. Cheers, Dario — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#106 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQZKFKLLVBDTCXPMWIE673UNCPG3ANCNFSM5HBGMYXQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

0 replies

dariosanfilippo · 2021-11-21T09:53:07Z

dariosanfilippo
Nov 21, 2021
Author

Here's a preliminary benchmark for the two implementations; please have a closer look in case I did something wrong.

If I use random inputs to actually challenge the branch predictor, we get these execution times (ms):

Branch = 3720.51
No branch = 1108.6

If I use an input that constantly increments a very small values such as 1.0e-50, hence always resulting in a ymx < 1.0e-7, the IF implementation is about twice as fast. If I constantly increment it so that ymx is always > 1.0e-7, the two implementations are essentially the same.

I'm on Apple M1, compiling with clang13 -Ofast. See the C++ code below:

#include <iostream>
#include <cmath>
#include <chrono>

double smoothBranch(double b1, double& y, double x) {
    double ymx = y - x;
    if (ymx > 0) {
        if (ymx < 1.0e-7) {
            return x;
        } else {
            y = b1 * ymx + x;
            return y;
        }
    } else {
        if (ymx > -1.0e-7) {
            return x;
        } else {
            y = b1 * ymx + x;
            return y;
        }
    }
}

double smooth(double b1, double& y, double x) {
    double ymx = y - x;
    y = b1 * ymx + x;
    return y;
}

int main(int argc, char** argv) {
    double p = 0;
    int n = 500000000;
    double d1_avg = 0;
    double d2_avg = 0;
   
    int seed = 12345;
    int mask = 4294967295;
    int random = 0;
    double x = 0;
    double y1 = 0;
    double y2 = 0;
    double b1 = .999;

    for (auto tries = 0; tries < 10; ++tries) {
        auto n1 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < n; ++i) {
            x = random / 2147483647.0;
            random = (random * 1103515245 + seed) & mask;
            p = smoothBranch(b1, y1, x);
        }
        auto n2 = std::chrono::high_resolution_clock::now();
        std::cout << "smoothBranch: " << p << std::endl;
        std::chrono::duration<double, std::milli> d = n2 - n1;
        auto c1 = d.count();
        d1_avg += d.count();

        auto n3 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < n; ++i) {
            x = random / 2147483647.0;
            random = (random * 1103515245 + seed) & mask;
            p = smooth(b1, y2, x);
        }
        auto n4 = std::chrono::high_resolution_clock::now();
        std::cout << "smooth: " << p << std::endl;
        d = n4 - n3;
        auto c2 = d.count();
        d2_avg += d.count();
    }

    std::cout << "Branch = " << d1_avg / 10.0 << std::endl;
    std::cout << "No branch = " << d2_avg / 10.0 << std::endl;
}

Ciao,
Dario

0 replies

josmithiii · 2021-11-21T13:48:02Z

josmithiii
Nov 21, 2021
Collaborator

Wow, that's much worse than I would have predicted! It implies no branch target prediction at all, and no ability to save the FPU cost on branches not needing it. Completely unexpected to me.

…

On Sun, Nov 21, 2021 at 1:53 AM Dario Sanfilippo ***@***.***> wrote: Here's a preliminary benchmark for the two implementations; please have a closer look in case I did something wrong. If I use random inputs to actually challenge the branch predictor, we get these execution times (ms): Branch = 3720.51 No branch = 1108.6 If I use an input that constantly increments a very small values such as 1.0e-50, hence always resulting in a ymx < 1.0e-7, the IF implementation is about twice as fast. If I constantly increment it so that ymx is always > 1.0e-7, the two implementations are essentially the same. I'm on Apple M1, compiling with clang13 -Ofast. See the C++ code below: #include <iostream> #include <cmath> #include <chrono> double smoothBranch(double b1, double& y, double x) { double ymx = y - x; if (ymx > 0) { if (ymx < 1.0e-7) { return x; } else { y = b1 * ymx + x; return y; } } else { if (ymx > -1.0e-7) { return x; } else { y = b1 * ymx + x; return y; } } } double smooth(double b1, double& y, double x) { double ymx = y - x; y = b1 * ymx + x; return y; } int main(int argc, char** argv) { double p = 0; int n = 500000000; double d1_avg = 0; double d2_avg = 0; int seed = 12345; int mask = 4294967295; int random = 0; double x = 0; double y1 = 0; double y2 = 0; double b1 = .999; for (auto tries = 0; tries < 10; ++tries) { auto n1 = std::chrono::high_resolution_clock::now(); for (int i = 0; i < n; ++i) { x = random / 2147483647.0; random = (random * 1103515245 + seed) & mask; p = smoothBranch(b1, y1, x); } auto n2 = std::chrono::high_resolution_clock::now(); std::cout << "smoothBranch: " << p << std::endl; std::chrono::duration<double, std::milli> d = n2 - n1; auto c1 = d.count(); d1_avg += d.count(); auto n3 = std::chrono::high_resolution_clock::now(); for (int i = 0; i < n; ++i) { x = random / 2147483647.0; random = (random * 1103515245 + seed) & mask; p = smooth(b1, y2, x); } auto n4 = std::chrono::high_resolution_clock::now(); std::cout << "smooth: " << p << std::endl; d = n4 - n3; auto c2 = d.count(); d2_avg += d.count(); } std::cout << "Branch = " << d1_avg / 10.0 << std::endl; std::cout << "No branch = " << d2_avg / 10.0 << std::endl; } Ciao, Dario — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#106 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQZKFIGYKTID3QH5RIEVO3UNC6Q3ANCNFSM5HBGMYXQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

1 reply

dariosanfilippo Nov 21, 2021
Author

I'm really not strong on these low-level matters but it might also be that, especially with nested branches, clearing the pipeline of a mispredicted branch can sometimes be more expensive than calculating the branch itself.

Ciao,
Dario

josmithiii · 2021-11-21T14:58:24Z

josmithiii
Nov 21, 2021
Collaborator

I thought modern processors had growing caches of parallel executed instructions on both sides of the branch, allowing zero-overhead selection of the winning branch. Reaching the target (or determining the enable bit to be false) should abort the multiply-in-progress. In any case, this is definitely a situation where we will have to tell the compiler/hardware it can stop working unnecessarily hard, if we can ever figure out how to do it. :-) Thanks for the benchmarks! - Julius

…

On Sun, Nov 21, 2021 at 6:43 AM Dario Sanfilippo ***@***.***> wrote: I'm really not strong on these low-level matters but it might also be that, especially with nested branches, clearing the pipeline of a mispredicted branch can sometimes be more expensive than calculating the branch itself. Ciao, Dario — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#106 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQZKFLOMRR6JQLEDNZIET3UNEAPVANCNFSM5HBGMYXQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

-- Julius O. Smith III ***@***.***> Professor of Music and, by courtesy, Electrical Engineering CCRMA, Stanford University http://ccrma.stanford.edu/~jos/

0 replies

Minor improvement to si.smoo. #106

dariosanfilippo Oct 30, 2021

Replies: 22 comments · 4 replies

sletz Oct 30, 2021 Maintainer

dariosanfilippo Oct 30, 2021 Author

sletz Oct 30, 2021 Maintainer

josmithiii Oct 31, 2021 Collaborator

josmithiii Oct 31, 2021 Collaborator

dariosanfilippo Oct 31, 2021 Author

josmithiii Oct 31, 2021 Collaborator

josmithiii Oct 31, 2021 Collaborator

sletz Oct 31, 2021 Maintainer

josmithiii Oct 31, 2021 Collaborator

orlarey Nov 1, 2021 Maintainer

sletz Nov 1, 2021 Maintainer

josmithiii Nov 2, 2021 Collaborator

sletz Nov 2, 2021 Maintainer

josmithiii Nov 20, 2021 Collaborator

sletz Nov 20, 2021 Maintainer

josmithiii Nov 20, 2021 Collaborator

dariosanfilippo Nov 21, 2021 Author

josmithiii Nov 20, 2021 Collaborator

sletz Nov 21, 2021 Maintainer

josmithiii Nov 21, 2021 Collaborator

josmithiii Nov 21, 2021 Collaborator

dariosanfilippo Nov 21, 2021 Author

josmithiii Nov 21, 2021 Collaborator

dariosanfilippo Nov 21, 2021 Author

josmithiii Nov 21, 2021 Collaborator

dariosanfilippo
Oct 30, 2021

Replies: 22 comments 4 replies

sletz
Oct 30, 2021
Maintainer

dariosanfilippo Oct 30, 2021
Author

sletz
Oct 30, 2021
Maintainer

josmithiii
Oct 31, 2021
Collaborator

josmithiii
Oct 31, 2021
Collaborator

dariosanfilippo Oct 31, 2021
Author

josmithiii
Oct 31, 2021
Collaborator

josmithiii
Oct 31, 2021
Collaborator

sletz
Oct 31, 2021
Maintainer

josmithiii
Oct 31, 2021
Collaborator

orlarey
Nov 1, 2021
Maintainer

sletz
Nov 1, 2021
Maintainer

josmithiii
Nov 2, 2021
Collaborator

sletz
Nov 2, 2021
Maintainer

josmithiii
Nov 20, 2021
Collaborator

sletz
Nov 20, 2021
Maintainer

josmithiii
Nov 20, 2021
Collaborator

dariosanfilippo Nov 21, 2021
Author

josmithiii
Nov 20, 2021
Collaborator

sletz
Nov 21, 2021
Maintainer

josmithiii
Nov 21, 2021
Collaborator

josmithiii
Nov 21, 2021
Collaborator

dariosanfilippo
Nov 21, 2021
Author

josmithiii
Nov 21, 2021
Collaborator

dariosanfilippo Nov 21, 2021
Author

josmithiii
Nov 21, 2021
Collaborator