-
Notifications
You must be signed in to change notification settings - Fork 167
/
convolution.html
186 lines (117 loc) · 8.74 KB
/
convolution.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Web Audio API - Convolution Architecture</title>
<link rel="stylesheet" href="http://www.w3.org/StyleSheets/TR/W3C-ED" type="text/css" />
</head>
<body>
<h3>Convolution Reverb</h3>
<p>
<em>This section is informative and may be helpful to implementors</em>
</p>
<p>
A <a href="http://en.wikipedia.org/wiki/Convolution_reverb">convolution reverb</a> can be used to simulate an acoustic space with very high quality.
It can also be used as the basis for creating a vast number of unique and interesting special effects. This technique is widely used
in modern professional audio and motion picture production, and is an excellent choice to create room effects in a game engine.
</p>
<p>
Creating a well-optimized real-time convolution engine is one of the more challenging parts of the Web Audio API implementation.
When convolving an input audio stream of unknown (or theoretically infinite) length, the <a href="http://en.wikipedia.org/wiki/Overlap-add_method">overlap-add</a> approach is used, chopping the
input stream into pieces of length L, performing the convolution on each piece, then re-constructing the output signal by delaying each result and summing.
</p>
<h4>Overlap-Add Convolution</h4>
<img src="http://upload.wikimedia.org/wikipedia/commons/7/77/Depiction_of_overlap-add_algorithm.png" alt="Depiction of overlap-add algorithm" />
<p>
Direct convolution is far too computationally expensive due to the extremely long impulse responses typically used. Therefore an approach using
<a href="http://en.wikipedia.org/wiki/FFT">FFTs</a> must be used. But naively doing a standard overlap-add FFT convolution using an FFT of size N with L=N/2, where N is chosen to be at least twice the length of the convolution kernel (zero-padding the kernel) to perform each convolution operation in the diagram above would incur a substantial input to output pipeline latency on the order of L samples. Because of the enormous audible delay, this simple method cannot be used. Aside from the enormous delay, the size N of the FFT could be extremely large. For example, with an impulse response of 10 seconds at 44.1Khz, N would equal 1048576 (2^20).
This would take a very long time to evaluate. Furthermore, such large FFTs are not practical due to substantial phase errors.
</p>
<h4>Optimizations and Tricks</h4>
<p>
There exist several clever tricks which break the impulse response into smaller pieces, performing separate convolutions, then
combining the results (exploiting the property of linearity). The best ones use a divide and conquer approach using different size FFTs and a direct
convolution for the initial (leading) portion of the impulse response to achieve a zero-latency output. There are additional optimizations which can be done exploiting the
fact that the tail of the reverb typically contains very little or no high-frequency energy. For this part, the convolution may be done at a lower sample-rate...
</p>
<p>
Performance can be quite good, easily
done in real-time without creating undo stress on modern mid-range CPUs. A multi-threaded implementation is really required if low (or zero) latency is required because of the way the buffering / processing chunking works. Achieving good performance requires a highly optimized FFT algorithm.
</p>
<h4>Multi-channel convolution</h4>
<p>
It should be noted that a convolution reverb typically involves two convolution operations, with separate impulse responses for the left and right channels in
the stereo case. For 5.1 surround, at least five separate convolution operations are necessary to generate output for each of the five channels.
</p>
<h4>Impulse Responses</h4>
<p>
Similar to other assets such as JPEG images, WAV sound files, MP4 videos, shaders, and geometry, impulse responses can be considered as multi-media assets. As with these other
assets, they require work to produce, and the high-quality ones are considered valuable. For example, a company called Audio Ease makes a fairly expensive ($500 - $1000)
product called <a href="http://www.audioease.com/Pages/Altiverb/AltiverbMain.html">Altiverb</a>
containing several nicely recorded impulse responses along with a convolution reverb engine.
</p>
<h3>Convolution Engine Implementation</h3>
<h4>FFTConvolver (short convolutions)</h4>
<p>
The <code>FFTConvolver</code> is able to do short convolutions with the FFT size N being at least twice as large as the
length of the short impulse response. It incurs a latency of N/2 sample-frames. Because of this latency and performance considerations,
it is not suitable for long convolutions. Multiple instances of this building block can be used to perform extremely long convolutions.
</p>
<img src="images/fft-convolver.png" alt="description of FFT convolver" />
<h4>ReverbConvolver (long convolutions)</h4>
The <code>ReverbConvolver</code> is able to perform extremely long real-time convolutions on a single audio channel.
It uses multiple <code>FFTConvolver</code> objects as well as an input buffer and an accumulation buffer. Note that it's
possible to get a multi-threaded implementation by exploiting the parallelism. Also note that the leading sections of the long
impulse response are processed in the real-time thread for minimum latency. In theory it's possible to get zero latency if the
very first FFTConvolver is replaced with a DirectConvolver (not using a FFT).
<img src="images/reverb-convolver.png" alt="description of reverb convolver" />
<h4>Reverb Effect (with matrixing)</h4>
<img src="images/reverb-matrixing.png" alt="description of reverb matrixing" />
<h3>Recording Impulse Responses</h3>
<img src="images/impulse-response.png" alt="impulse-response waveforms" />
<p>The most <a href="http://pcfarina.eng.unipr.it/Public/Papers/226-AES122.pdf">modern</a>
and accurate way to record the impulse response of a real acoustic space is to use
a long exponential sine sweep. The test-tone can be as long as 20 or 30 seconds, or longer.</p>
<p> Several recordings of the
test tone played through a speaker can be made with microphones placed and oriented at various positions in the room. It's important
to document speaker placement/orientation, the types of microphones, their settings, placement, and orientations for each recording taken.</p>
<p>
Post-processing is required for each of these recordings by performing an inverse-convolution with the test tone,
yielding the impulse response of the room with the corresponding microphone placement. These impulse responses are then
ready to be loaded into the convolution reverb engine to re-create the sound of being in the room.
</p>
<h3>Tools</h3>
<p>
Two command-line tools have been written:
</p>
<p>
<code>generate_testtones</code> generates an exponential sine-sweep test-tone and its inverse. Another
tool <code>convolve</code> was written for post-processing. With these tools, anybody with recording equipment can record their own impulse responses.
To test the tools in practice, several recordings were made in a warehouse space with interesting
acoustics. These were later post-processed with the command-line tools.
</p>
<pre>
% generate_testtones -h
Usage: generate_testtone
[-o /Path/To/File/To/Create] Two files will be created: .tone and .inverse
[-rate <sample rate>] sample rate of the generated test tones
[-duration <duration>] The duration, in seconds, of the generated files
[-min_freq <min_freq>] The minimum frequency, in hertz, for the sine sweep
% convolve -h
Usage: convolve input_file impulse_response_file output_file
</pre>
<h3>Recording Setup</h3>
<div>
<img src="images/recording-setup.png" alt="photograph of recording setup" />
Audio Interface: Metric Halo Mobile I/O 2882
<img src="images/microphones-speaker.png" alt="microphones and speakers used" />
<img src="images/microphone.png" alt="microphone used" />
<img src="images/speaker.png" alt="speakers used" />
Microphones: AKG 414s, Speaker: Mackie HR824
<h3>The Warehouse Space</h3>
<img src="images/warehouse.png" alt="photo of the warehouse used" />
</div>
</body>
</html>