Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AUDIO_WORKLET] Optimise the copy back from wasm's heap to JS #22753

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

cwoffenden
Copy link
Contributor

@cwoffenden cwoffenden commented Oct 16, 2024

This builds on #22741 just because that's where I was at, but it's not required. The interesting changes are in audio_worklet.js and I'd appreciate some feedback from @juj before tidying this up (with sanity checks and a fallback).

Since we pass in the stack for the worklet from the caller's heap, its address shouldn't change. And since the (I'll make myself say it) render quantum size doesn't change after the audio worklet creation, the stack positions for the audio buffers should not change either. So, we can create one-time subarray views and replace the float-by-float copy with a simple set() per channel (per output).

I've thrown simple tests at it at and it works, fulfilling the garbage-free requirement and theoretically having a nice performance boost (not measured, but looping over thousands of JS Number types and shuffling them to and from floats must come at a cost). If the outputList does change, then it should only change after changes to the audio chain, which would be expensive enough that changing the subarrays wouldn't make a difference.

To be extra sure, we can move the output buffers to the first entries on the stack, then simple additional changes like input buffers won't change the address.

It wants sanity checks here and there but I'd like feedback for anything I'm missing or misunderstanding. Thanks!

@cwoffenden cwoffenden changed the title [AUDIO_WORKET] Optimise the copy back from wasm's heap to JS [AUDIO_WORKLET] Optimise the copy back from wasm's heap to JS Oct 16, 2024
@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 2 times, most recently from 38203f0 to 24a90e4 Compare October 17, 2024 13:01
@juj
Copy link
Collaborator

juj commented Oct 17, 2024

Nice idea! It looks like this PR carries the rename from the other PR. Rebase/merge should hide it?

for (/*which output*/ i of outputList) {
for (/*which channel*/ j of i) {
this.outputViews.push({
// dataPtr is the sanity check (to be implemented)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its a sanity check then perhaps put it behind if ASSERTIONS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve some rearranging to do, to make sure the addresses are fixed, and if those change from the base I’ll add assertions.

I’ll make this a draft and work on it next week, adding some robustness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its a sanity check then perhaps put it behind if ASSERTIONS?

Interestingly (unless I'm doing something hideous) but assert() isn't available in the AudioWorkletGlobalScope, and I can't add it as a dependency the same as stackAlloc(), etc. This stuff I'm very rusty with, mind.

if (!this.outputViews) {
this.outputViews = [];
k = outputDataPtr;
for (/*which output*/ i of outputList) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the length of outputList be different in each call?

Perhaps assert(this.outputViews.length == outputList.length) after this block to be clear?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can grow if other nodes are manually added. I’ll need to write contrived examples to see this, since in the real world I don’t think it will.

@cwoffenden
Copy link
Contributor Author

Nice idea!

A shower thought, as all good ideas are!

It looks like this PR carries the rename from the other PR. Rebase/merge should hide it?

Yes, this one was rebased off the earlier one so should have the same commit IDs.

We can remove the float-by-float JS copy and replace with this simple TypedArray set() calls.
Typed views are recreated if needed but otherwise are reused.
// Verify we have enough views (it doesn't matter if we have too many, any
// excess won't be accessed) then also verify the views' start address
// hasn't changed.
// TODO: allocate space for outputDataPtr before any inputs?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be ok to turn this around to allocate outputs before inputs. This way it should be possible to only assert() in debug builds that the invariant this.outputViews[0].byteOffset == k << 2 holds.

Also, the stack is already preallocated by the time the above constructor() is called, so things can be precomputed there in the ctor instead of at process time for any performance wins.

There are two things that I'd recommend:

  1. benchmark the effect of this optimization. Since the code does become more sophisticated/more dependent on things having been pre-set up, it would be good to confirm that there is actually a performance benefit to calling TypedArray.set() instead of manually looping the samples.

I think performance.now() should work here, so adding calls to performance.now() at very top of process() and at the very end of process() to measure the overall perf differential would probably work.

  1. It would be important to test two scenarios: a) there being multiple distinct audio processor classes, and b) there being different node paths in the same graph that have different number of inputs/outputs channels.

With the original code, it was reasonably "trivial" to statically prove that the code will handle any combination of classes or channels, though when state is being cached like this, there will then be the possibility that different schemes could trip up.

I wonder if a good test case might be to have a scenario where there's simultaneously
a) one stereo source clip -> simple passthrough stereo worklet -> stereo audiocontext speaker destination
b) one mono source clip -> simple passthrough mono worklet -> mono output
c) two copies of mono sources from b) above -> worklet merging (memcpying) these into stereo as Left/Right channels -> stereo audiocontext speaker destination

In such a test, there would be a full combination of "1ch input, 1ch output", "2ch input, 2ch outputs", "two 1ch inputs, 2ch output" cases that I think would cover every possibility (graph would have multiple processor classes and different shaped nodes).

The worklets themselves would be trivial memcpy input -> output implementations so no much complexity to the audio code itself, but they would then test that different shapes of the graph are preserved. Would something like that be a feasible test case to build?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be ok to turn this around to allocate outputs before inputs. This way it should be possible to only assert() in debug builds that the invariant this.outputViews[0].byteOffset == k << 2 holds.

I was unexpectedly free this afternoon so already rearranged everything.

Also, the stack is already preallocated by the time the above constructor() is called, so things can be precomputed there in the ctor instead of at process time for any performance wins.

I'll look at creating some views in the ctor up front after I've done the timings.

But you're 100% right about timing this, which I'll aim to do once it's in roughly the right shape, and same for the test cases. At work our use is mono mic, stereo out from a wasm mixer, and then combined mic and mixer, but I want to see it tested with more ins and outs like you propose.

(I can dedicate quite a bit of time to this for the next weeks)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think performance.now() should work here

It's not part of the AudioWorkletGlobalScope. Date.now() is available but I just get zeros, though I did see a 1 which brought excitement for a brief moment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the same Date.now() to the original code... and it's a lot of zeros, then some occasional ones. I think there are more ones than with the views, but that's not very scientific.

What I'll need to do is write something standalone that runs process() not in the AudioWorkletGlobalScope and then benchmark it over many runs on different browser and hardware (which is worth doing).

Lots of juggling with the various pointers, and next will be to reduce the code and move all of the output first to stop repeating some of the calculations. Some can also move to the constructor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants