From bc3404e208fe2b4c17ef62ae32fe60e55f868afc Mon Sep 17 00:00:00 2001 From: Chintan Shah Date: Fri, 14 Jun 2024 18:19:31 -0700 Subject: [PATCH] KHR audio graph design --- .../2.0/Khronos/KHR_audio_graph/README.md | 1279 +++++++++++++++++ 1 file changed, 1279 insertions(+) create mode 100644 extensions/2.0/Khronos/KHR_audio_graph/README.md diff --git a/extensions/2.0/Khronos/KHR_audio_graph/README.md b/extensions/2.0/Khronos/KHR_audio_graph/README.md new file mode 100644 index 0000000000..a81d464792 --- /dev/null +++ b/extensions/2.0/Khronos/KHR_audio_graph/README.md @@ -0,0 +1,1279 @@ + + + + +## KHR Audio Graph Design + + + + +## **Contributors** + + + +* Chintan Shah, Meta +* Alexey Medvedev, Meta + + +## Context + +During the recent Khronos 3D formats working group meeting held on 5/29, we reviewed the [proposal to define the KHR audio glTF specification using an audio graph framework](https://docs.google.com/presentation/d/1IrrQaE-jHyzOtFRabjtLAzeP5UOirFOEj8FRADAceqk/edit?usp=sharing). The purpose of this document is to delve deeper into that proposal, offering a comprehensive design of the KHR audio graph. This includes a detailed description of each node object within the graph along with functionality, the specific properties associated with it, and how it interacts with other nodes in the graph. The document is structured to facilitate clear understanding and to solicit feedback on the proposed design. Based on the feedback we will update and finalize the design before it is formally schematized into KHR core audio spec, extensions, animations, and interactivity. + + +## 1. Introduction + +This document provides a detailed design for managing audio routing, mixing, and processing in various applications for desktop, mobile, and wearable devices. The core idea involves an audio graph consisting of multiple interconnected audio node objects to create the final audio output. 
This design seeks to incorporate the audio capabilities found in modern web, game, and XR engines as well as processing, mixing, and filtering functions available in audio production software. Although this system is designed with diverse use cases in mind, it might not cover every specialized feature found in state-of-the-art audio tools. In such instances, users are encouraged to develop custom extensions. Nevertheless, the proposed system will support many complex audio applications by default and has been designed to facilitate future expansion with more sophisticated features. + + +## 2. Features + +The core specification should support these primary features: + + + +* Graph based audio routing for simple or complex mixing and processing architectures. +* Processing of audio data stored in a memory buffer or accessed via file paths. +* Capturing audio metadata such as encoding properties. +* Audio playback functionalities including playing, stopping, pausing, looping, and controlling playback speed. +* Spatialized audio supporting a wide range of 3D applications and immersive environments with 6DoF source/listener capabilities, panning models (equal power, HRTF), distance attenuation, and sound cones. +* Basic audio signal processing to control gain, delay, and pitch. +* Flexible handling of channels in an audio stream, allowing splitting, merging, up-mixing, or down-mixing. +* Audio mixing, reverb, and filtering with a set of low-order audio filters. +* Animation control and dynamic update of node properties. + + +## 3. Graph based audio processing + +Audio nodes are the building blocks of an audio graph for rendering audio to the audio hardware. Graph based audio routing allows arbitrary connections between different audio node objects. An audio graph can be represented by audio sources, the audio destination/sink, and intermediate processing nodes. Each node can have inputs and/or outputs. A source node has no inputs and a single output. 
A destination or sink node has one input and no outputs. In the simplest case, a single source can be routed directly to the output. + +One or more intermediate processing nodes such as filters can be placed between the source and destination nodes. Most processing nodes will have one input and one output. Each type of audio node differs in the details of how it processes or synthesizes audio. But, in general, an audio node will process its inputs (if it has any), and generate audio for its outputs (if it has any). An output may connect to one or more audio node inputs, thus fan-out is supported. An input (except source and sink) may be connected to one or more audio node outputs, thus fan-in is supported. Each input and output has one or more channels. The exact number of channels depends on the details of the specific audio node. + + +## 4. Audio source + + +### 4.1 Source node (0 input / 1 output) + +Audio sources reference audio data and define playback properties for it. An audio source node has no inputs and exactly one output, which has the same number of channels as indicated in encoding properties. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
data + object + audio or oscillator +

+See 4.2 Audio data. +

+See 4.3 Oscillator data. +

+ Only selective source node properties apply with oscillator data. +
priority + integer + Determines the priority of this audio source among all the ones that coexist in the scene (0 = most important, 256 = least important, default = 128). + + Need to persist and propagate priority with downstream processing. +
state + string + Playback state (paused, playing, stopped). + + +
auto play + boolean + Play on load. + + +
loop + boolean + Playback in a loop (false = no-loop, true = loop). If set to true, then once playback reaches the time specified by “loop end position” (or the end of the asset, whichever is first), the source node will continue playback again from a position specified by the “loop start position” property. + + +
position + number + Play position in ms. + + +
gain + number + Gain value applied to the audio. + + +
playback speed + number + Rate of playback. A value of 1.0 plays back the asset at the standard rate. A value of 2.0 plays back the asset at double the speed. + + +
loop start position + number + The starting position in ms when looping. + + +
loop end position + number + The ending position (ms) of the loop. + + +
duration + number + Length of the underlying audio data in ms. + + +
encoding properties + object + See 4.4 Encoding properties. + + Could be a part of audio data property instead. +
+ + + +### 4.2 Audio data + +Audio data objects define where audio data is located and what format the data is in. The data is either accessed via a bufferView or uri. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
buffer view + integer + The index of the buffer view that contains the audio data. The buffer represents an audio asset residing in memory, created from decoding an audio file, or from raw data. + + +
mime type + string + The audio's MIME type. Required if buffer view is defined. + + +
uri + string + The uri of the audio file. Relative paths are relative to the .gltf file. + + +
+ + + +### 4.3 Oscillator data + +This represents an audio source generating a periodic waveform. It can be set to a few commonly used waveforms. Oscillators are common foundational building blocks in audio synthesis. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
type + string + Specifies the waveform type (saw, square, triangle, sine, custom). + + +
frequency + number + The oscillator frequency, from 0 to 20 kHz. The default value is 440 Hz. + + +
pulse width + number + The amount of pulse width modulation applied when the “square” waveform is selected. A 0.5 value will produce a pure square wave, and increasing or decreasing the value will add harmonics which change the timbre of the sound. + + Applies to square waveform. +
+ + + +### 4.4 Encoding properties + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
bits per sample + integer + Number of bits per audio sample. + + +
samples + integer + Number of samples. + + +
sample rate + number + Audio sampling rate (Hz). + + +
channels + integer + Number of audio channels. + + +
properties + object + Additional user-defined properties. + + +
+ + + +## 5. Audio sink/destination + + +### 5.1 Emitter node (1 input / 0 output) + +Audio emitter of type “Global” is non-spatialized, whereas a “Spatial” emitter is used for spatial audio. A scene can contain one or more global emitters. Spatial emitter is connected to a scene node. A scene node can have one or more spatial emitters. A spatial emitter by default inherits the pose (position and orientation) properties of the scene node, hence we do not support updating these properties within the spatial emitter node. + +Using a spatial emitter, an audio stream can be spatialized or positioned in space relative to a listener node. A scene has a single listener node. Both spatial emitters and listeners have a position in 3D space. Spatial emitters have an orientation representing in which direction the sound is projecting. Additionally, they have a sound cone representing how directional the sound is. For example, the sound could be omnidirectional, in which case it would be heard anywhere regardless of its orientation, or it can be more directional and heard only if it is facing the listener. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
type + string + Emitter type (global, spatial). + + +
gain + number + Gain applied to the signal by the emitter. + + +
spatial properties + object + See 5.2 Spatial properties. + + +
+ + + +### 5.2 Spatial properties + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
spatialization model + string + Determines which spatialization model will be used to position the audio in 3D space (equal power, HRTF, Custom). +

+equalpower: Represents the equal-power panning algorithm, generally regarded as simple and efficient. equalpower is the default value. +

+HRTF: Renders a stereo output of higher quality than equalpower — it uses a convolution with measured impulse responses from human subjects. +

+Custom: User-defined panning algorithm. +

+ +
attenuation + object + See 5.3 Attenuation properties. + + +
shape + object + See 5.4 Shape properties. + + +
+ + + +### 5.3 Attenuation properties + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
distance model + string + Specifies the distance model for the audio emitter (linear, inverse, exponential, custom). + + +
ref distance + number + A reference distance for reducing volume as the emitter moves further from the listener. + + +
max distance + number + The maximum distance between the emitter and listener, beyond which the audio cannot be heard. + + +
rolloff factor + number + Describes how quickly the volume is reduced as the emitter moves away from the listener. + + +
+ + + +### 5.4 Shape properties + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
type + string + Shape in which emitter emits audio (cone, omnidirectional, custom). + + +
cone inner angle + number + The angular diameter of a cone inside of which there will be no angular volume reduction. + + +
cone outer angle + number + The angle, in degrees, of a cone outside of which the volume is reduced to the constant value given by “cone outer gain”. Applies to directional audio sources. + + +
cone outer gain + number + A parameter for directional audio sources that is the gain outside of the cone outer angle. + + +
+ + + +### 5.5 Audio listener node (0 input / 0 output) + +Describes the position and other physical characteristics of a listener from which the audio output of spatial emitter nodes is heard when spatial audio processing is used. A listener node is typically attached to the main camera and by default inherits camera pose (position and orientation) properties. Hence, we do not support updating these properties within the listener node. + + +## 6. Audio processors + + +### 6.1 Gain node (1 input / 1 output) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
gain + number + The gain to apply. Once set, the actual gain applied will transition from its current setting to the new one over “duration” milliseconds. + + +
interpolation + string + The curve to apply when changing gains (linear, custom). + + +
duration + number + When changing gain, this parameter controls how long, in ms, to spend interpolating from the previously set gain value to the one that is specified. + + +
+ + + +### 6.2 Delay node (1 input / 1 output) + +A node that delays the incoming audio before propagating it to the output. A delay node always has exactly one input and one output, both with the same number of channels. + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
delay time + number + The amount of delay to apply, specified in ms. + + +
+ + + +### 6.3 Pitch shifter node (1 input / 1 output) + +Use the node to make the pitch of an audio deeper or higher. + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
semitone adjustment + number + Pitch shift in musical semitones. A value of -12 halves the pitch, while 12 doubles the pitch. A value of 0 will not change the pitch of the audio source. + + +
+ + + +### 6.4 Channel splitter node (1 input / N outputs) + +Node for accessing the individual channels of an audio stream in the routing graph. It has a single input, and a number of outputs which equals the number of channels in the input audio stream. For example, if a stereo input is connected to this node then the number of active outputs will be two (one from the left channel and one from the right). + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
input channels + integer + Number of channels in input audio. + + +
channel interpretation + string + speakers, discrete +

+Channel ordering for the speakers channel interpretation is captured here. When the number of channels does not match any of the basic speaker layouts, use discrete, which maps channels to outputs. +

+ +
+ + + +### 6.5 Channel merger node (N inputs / 1 output) + +Node for combining channels from multiple audio streams into a single audio stream. It has a variable number of inputs, and a single output whose audio stream has a number of channels equal to the number of inputs. To merge multiple inputs into one stream, each input should be a single channel audio stream. + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
output channels + integer + Number of channels in output audio. + + +
channel interpretation + string + speakers, discrete +

+Channel ordering for the speakers channel interpretation is captured here. When the number of channels does not match any of the basic speaker layouts, use discrete, which maps inputs to channels. +

+ +
+ + + +### 6.6 Channel mixer node (1 input / 1 output) + +Up-mixing refers to the process of taking a stream with a smaller number of channels and converting it to a stream with a larger number of channels. Down-mixing refers to the process of taking a stream with a larger number of channels and converting it to a stream with a smaller number of channels. Channel mixer node should ideally use these [mixing rules](https://webaudio.github.io/web-audio-api/#mixing-rules). + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
input channels + integer + Number of channels in input audio. + + +
output channels + integer + Number of channels in output audio. + + +
channel interpretation + string + speakers, discrete +

+"Speakers" uses up-mix / down-mix equations. In cases where the number of channels does not match any of the basic speaker layouts, use "discrete". "Discrete" up-mixes by filling channels until the input channels run out, then zeroing out the remaining channels; it down-mixes by filling as many channels as possible, then dropping the remaining channels. +

+ +
+ + + +### 6.7 Audio mixer node (N inputs / 1 output) + +Use the audio mixer node to combine the output from multiple audio sources. The number of channels should be the same across all inputs. + +[no properties for this node] + + +### 6.8 Filter node (1 input / 1 output) + +Use the filter node to apply a low-order filter, such as a lowpass or highpass, to an audio stream. A filter node always has exactly one input and one output. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
type + string + Defines the kind of filtering algorithm the node implements (lowpass, highpass, bandpass, lowshelf, highshelf, peaking, notch, allpass, custom). + + +
frequency + number + Frequency in the current filtering algorithm measured in Hz. + + +
quality factor + number + The lower the Quality, the broader the bandwidth of frequencies cut or boosted. Value range: 0 to 100. + + +
gain + number + Gain applied to the signal by the filter. + + +
bypass + boolean + Disables this processor while still allowing unprocessed audio signals to pass. + + +
+ + + +### 6.9 Reverb node (1 input / 1 output) + +Reverberation is the persistence of sound in an enclosure after a sound source has been stopped. This is a result of the multiple reflections of sound waves throughout the room arriving at the ear so closely spaced that they are indistinguishable from one another and are heard as a gradual decay of sound. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Type + Description + Required + Notes +
mix + number + Blend between the source signal ('dry') and the reverb effect. A value of 0 will not add any reverb. A value of 50 will mix the signal 50% dry and 50% reverberation. A value of 100 will result in only reverberation, without any of the source signal. + + +
early reflections gain + number + Loudness control for the early reflections of the reverberation. + + +
diffusion gain + number + Loudness control for the reverb decay as it returns to silence. + + +
room size + number + Approximates the size of the room you want to simulate in meters from wall to wall. + + +
reflectivity + number + Defines how much of the audio is reflected at each bounce on a wall. Value range: 0 to 100. Low values will simulate softer sounding materials like carpet or curtains. High values will simulate harder materials like wood, glass or metal. A value of 100 will result in self-oscillation and is not recommended. + + +
reflectivity high + number + Separate value for the reflectivity of high frequencies. + + +
reflectivity low + number + Separate value for the reflectivity of low frequencies. + + +
early reflections + number + The number of early reflections of reverberation. The value range is 0 to 32. + + +
min distance + number + The distance from the center point within which the reverb has full effect. + + +
max distance + number + The distance from the center point beyond which the reverb has no effect. + + +
reflection delay + number + Initial reflection delay time in ms. + + +
reverb delay + number + Late reverberation delay time relative to initial reflection. + + +
custom properties + object + Application-specific data. + + +
bypass + boolean + Disables this processor while still allowing unprocessed audio signals to pass. + + +
+ + + +## 7. Audio graph rules + + + +* Multiple audio source nodes can reference the same audio data. +* An output of a source node can serve as input to multiple processor and emitter nodes. +* An output of a processor node can serve as input to multiple processor and emitter nodes. +* One audio emitter node can have only one input. +* A scene can have multiple emitters. +* A scene node can have multiple spatial emitters. +* A scene can have only one audio listener.
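
The rules above lend themselves to a mechanical check. The following is a minimal Python sketch of one, not part of the proposal: the `AudioGraph` class, the node kind names, and the `validate` method are all hypothetical, since the design has not yet been schematized. It checks only the connectivity rules (emitter fan-in, the single listener, and the input/output counts stated in the node headings).

```python
from dataclasses import dataclass, field

# Hypothetical node kinds, chosen for illustration only.
SOURCE, PROCESSOR, EMITTER, LISTENER = "source", "processor", "emitter", "listener"


@dataclass
class AudioGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node kind
    edges: list = field(default_factory=list)  # (from id, to id) pairs

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def connect(self, src, dst):
        self.edges.append((src, dst))

    def validate(self):
        """Return a list of rule violations; an empty list means none were found."""
        errors = []

        # A scene can have only one audio listener.
        listeners = [n for n, kind in self.nodes.items() if kind == LISTENER]
        if len(listeners) > 1:
            errors.append("a scene can have only one audio listener")

        in_count = {n: 0 for n in self.nodes}
        for src, dst in self.edges:
            if src not in self.nodes or dst not in self.nodes:
                errors.append(f"edge {src} -> {dst} references an unknown node")
                continue
            # Emitters (1 input / 0 output) and listeners (0 / 0) have no outputs.
            if self.nodes[src] in (EMITTER, LISTENER):
                errors.append(f"{src}: {self.nodes[src]} nodes have no outputs")
            # Sources (0 input / 1 output) and listeners have no inputs.
            if self.nodes[dst] in (SOURCE, LISTENER):
                errors.append(f"{dst}: {self.nodes[dst]} nodes have no inputs")
            in_count[dst] += 1

        # One audio emitter node can have only one input.
        for n, kind in self.nodes.items():
            if kind == EMITTER and in_count[n] > 1:
                errors.append(f"{n}: an emitter can have only one input")
        return errors


graph = AudioGraph()
graph.add_node("music", SOURCE)
graph.add_node("gain", PROCESSOR)
graph.add_node("emitter_a", EMITTER)
graph.add_node("emitter_b", EMITTER)
graph.add_node("ears", LISTENER)
graph.connect("music", "gain")       # source -> processor
graph.connect("gain", "emitter_a")   # processor -> emitter
graph.connect("music", "emitter_b")  # fan-out from the same source
print(graph.validate())
```

Connecting a second output to `emitter_b`, for example `graph.connect("gain", "emitter_b")`, would make `validate` report that emitter, because it would then have two inputs. A mixer node placed in front of the emitter is the way to combine several streams under these rules.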