talk abt gan
shuklabhay committed Oct 25, 2024
1 parent 53997ba commit c93cb89

To simplify the task at hand, this work represents audio as an image of frequency bins by time steps, with each pixel's intensity representing magnitude. This spectrogram-like representation of audio contains almost all of the information in the pure waveform, with the benefit of lower dimensionality and potentially more effective capture of temporal dependencies. Utilizing this spectrogram-like representation of audio also eliminates the need for recurrent architectures, but the semi-invertible nature of Fourier transforms introduces an avenue for potentially significant information loss. Each audio sample is first converted into a two-channel array using a standard 44.1 kHz sample rate. If necessary, single-channel audio is duplicated. The audio sample is then normalized to a standard length and passed into a Short-time Fourier Transform (STFT).

The STFT utilizes a window size and hop length determined by the audio sample length and constant sample rate so that each resulting data point is 256 frequency bins by 256 time frames. The transform utilizes a Kaiser window with a beta value of 12, a value determined by processing pure sine signals at random frequencies with the intent of preserving as much information from the signal as possible. Next, to preserve higher-frequency information, the STFT's resulting magnitude information is converted to a decibel scale, and the range of the loudness information is scaled down to a range of -1 to 1. Scaling down to this interval further standardizes training audio and matches the output of the Generator, which uses a hyperbolic tangent activation. Both channels of the input audio are processed separately and concatenated to create a two-channel data point, with each channel containing normalized loudness information at each of its 256 frequency bins and 256 time steps.
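As a hedged illustration of this forward transform, the per-channel processing can be sketched with SciPy. The FFT size, hop length, dB floor, and clipping range below are assumptions chosen so the output has 256 frequency bins, not values taken from the paper's code:

```python
import numpy as np
from scipy.signal import stft

SAMPLE_RATE = 44100  # Hz

def audio_to_norm_db(channel, n_fft=510, hop=172):
    """Convert one audio channel to a normalized dB spectrogram.

    n_fft=510 yields 510 // 2 + 1 = 256 frequency bins; the hop length is
    an assumption, chosen so a ~1 s clip produces roughly 256 time frames.
    """
    window = np.kaiser(n_fft, beta=12)  # Kaiser window, beta = 12
    _, _, Z = stft(channel, fs=SAMPLE_RATE, window=window,
                   nperseg=n_fft, noverlap=n_fft - hop)
    db = 20 * np.log10(np.abs(Z) + 1e-8)  # magnitude -> decibel scale
    db = np.clip(db, -40, 40)             # assumed loudness range
    return db / 40.0                      # scale down to [-1, 1]
```

Both channels would be passed through this function separately and then stacked into a two-channel array.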

When converting generated audio representations back to audio, this process occurs in reverse. Each channel's generated normalized loudness information is scaled up to a range of -40 to 40, a loudness range similar to the minimum and maximum of training examples before normalization to [-1, 1]. A noise gate is then applied and the decibel values are converted to magnitudes. Magnitude information is passed into 10 iterations of a momentum-driven Griffin-Lim reconstruction with noise gating at each iteration, resulting in effectively recreated audio.
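A minimal sketch of this inverse path, assuming the same STFT parameters as the forward transform; the gate threshold is hypothetical, and the momentum update follows the fast Griffin-Lim formulation rather than the paper's exact implementation (librosa's `griffinlim` exposes a comparable `momentum` parameter):

```python
import numpy as np
from scipy.signal import stft, istft

def norm_db_to_audio(norm_db, n_fft=510, hop=172, n_iter=10,
                     momentum=0.9, gate_db=-35.0):
    """Invert a normalized dB spectrogram via momentum Griffin-Lim."""
    db = norm_db * 40.0                    # [-1, 1] -> [-40, 40] dB
    db[db < gate_db] = -100.0              # noise gate (assumed threshold)
    mag = 10.0 ** (db / 20.0)              # dB -> linear magnitude
    window = np.kaiser(n_fft, beta=12)
    kw = dict(window=window, nperseg=n_fft, noverlap=n_fft - hop)
    # Start from random phase, then refine with momentum-accelerated updates
    S = mag * np.exp(2j * np.pi * np.random.rand(*mag.shape))
    prev = np.zeros_like(S)
    for _ in range(n_iter):
        _, y = istft(S, **kw)
        _, _, R = stft(y, **kw)
        R = R[:, :mag.shape[1]]            # trim to the original frame count
        if R.shape[1] < mag.shape[1]:
            R = np.pad(R, ((0, 0), (0, mag.shape[1] - R.shape[1])))
        T = R + momentum * (R - prev)      # momentum-accelerated estimate
        prev = R
        S = mag * np.exp(1j * np.angle(T)) # keep magnitudes, update phase
    _, y = istft(S, **kw)
    return y
```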

## 4. Model Implementation

### 4.1. Architecture

This work utilizes a GAN architecture to create high-fidelity audio, exploiting adversarial loss to promote realism and detail within generated audio. To address the training instability of GANs, this work utilizes a Wasserstein GAN with gradient penalty (WGAN-GP). The Wasserstein distance provides a more stable measure of divergence between real and generated audio distributions than typical GAN loss functions, and minimizing this distance through the WGAN-GP framework empirically improves training stability and promotes convergence. In this work, the switch from a standard GAN to a WGAN architecture was instrumental in creating a model that could consistently converge to one that generated actual audio rather than noise.
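The gradient penalty term at the heart of WGAN-GP can be sketched as follows. This is a generic PyTorch implementation of the standard technique, not the paper's code:

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP penalty: push the critic's gradient norm toward 1 on
    random interpolates between real and generated samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    grads = grads.reshape(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```

The critic loss would then be `fake_scores.mean() - real_scores.mean() + lambda_gp * gradient_penalty(critic, real, fake)`, where `lambda_gp` (commonly 10) is an assumed hyperparameter.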

The final generator passes 128 latent dimensions into six transpose convolution blocks, each of the first five consisting of a 2D transpose convolution and batch normalization followed by a Leaky ReLU activation and a dropout layer. The final block contains a 2D transpose convolution and hyperbolic tangent activation, creating a 256 by 256 representation of audio with values between -1 and 1.
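A hedged PyTorch sketch of such a generator; the channel widths, kernel sizes, and dropout rate are assumptions chosen so that six transpose convolution blocks map a 128-dimensional latent vector to a two-channel 256 by 256 output:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        chs = [512, 256, 128, 64, 32]  # assumed channel widths
        # First block projects the latent vector to an 8x8 feature map
        blocks = [self._block(latent_dim, chs[0], kernel=8, stride=1, pad=0)]
        for i in range(4):  # each block doubles the spatial size
            blocks.append(self._block(chs[i], chs[i + 1], 4, 2, 1))
        # Final block: transpose conv + tanh -> 2 x 256 x 256 in [-1, 1]
        blocks.append(nn.Sequential(nn.ConvTranspose2d(chs[-1], 2, 4, 2, 1),
                                    nn.Tanh()))
        self.net = nn.Sequential(*blocks)

    @staticmethod
    def _block(in_ch, out_ch, kernel, stride, pad, dropout=0.3):
        return nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel, stride, pad),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
            nn.Dropout2d(dropout),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))
```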

The Critic consists of six convolution blocks, converting a 256 by 256 representation of audio to a single value: an approximation of the Wasserstein distance. Each block applies a 2D convolution with spectral normalization to stabilize training, batch normalization, a Leaky ReLU activation, and a dropout layer, with two exceptions: the first block does not utilize batch normalization, and the third block includes a Linear Attention mechanism to assist the model in understanding contextual relationships in feature maps and prevent the checkerboard artifacts that often plague audio generation. After these operations, a final 2D convolution with spectral normalization is applied and the result is flattened, returning a single-value Wasserstein distance approximation.
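A sketch of such a critic under assumed channel widths; the Linear Attention block here follows the efficient-attention formulation (softmax over channels for queries, over positions for keys), which may differ from the paper's exact mechanism:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class LinearAttention(nn.Module):
    """O(n) attention over feature-map positions (assumed formulation)."""
    def __init__(self, ch):
        super().__init__()
        self.qkv = nn.Conv2d(ch, ch * 3, 1)
        self.out = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).reshape(b, 3, c, h * w).unbind(dim=1)
        q = q.softmax(dim=1)                        # normalize over channels
        k = k.softmax(dim=2)                        # normalize over positions
        ctx = torch.einsum('bcn,bdn->bcd', k, v)    # (b, c, c) context
        att = torch.einsum('bcd,bcn->bdn', ctx, q)  # apply context to queries
        return x + self.out(att.reshape(b, c, h, w))

def critic_block(in_ch, out_ch, use_bn=True, attention=False):
    layers = [spectral_norm(nn.Conv2d(in_ch, out_ch, 4, 2, 1))]
    if use_bn:
        layers.append(nn.BatchNorm2d(out_ch))
    layers += [nn.LeakyReLU(0.2), nn.Dropout2d(0.3)]
    if attention:
        layers.append(LinearAttention(out_ch))
    return nn.Sequential(*layers)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        chs = [2, 32, 64, 128, 256, 512, 512]       # assumed widths
        blocks = [critic_block(chs[i], chs[i + 1],
                               use_bn=(i != 0),     # first block: no batch norm
                               attention=(i == 2))  # third block: attention
                  for i in range(6)]                # 256 -> 4 spatially
        # Final spectrally normalized conv collapses to one value
        blocks.append(spectral_norm(nn.Conv2d(chs[-1], 1, 4, 1, 0)))
        self.net = nn.Sequential(*blocks)

    def forward(self, x):
        return self.net(x).flatten(1)
```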

\*\*\* FIND A CITATION FOR CHECKERBOARD ISSUE

### 4.2. Training

This work uses 80% of each dataset as training data and 20% as validation, with all data split into batches of 16. The Generator and Critic are initialized with RMSprop optimizers, where the Critic is given a slightly higher learning rate. Since the model tends to learn audio representation patterns in relatively few epochs, training is smoothed by initializing the RMSprop optimizers with relatively high weight decay and applying exponential learning rate decay. The Generator only takes a step every five Critic steps, validation occurs every epoch, and early exit is based on validation Wasserstein distance improvement over epochs.
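The optimizer setup and the five-to-one critic/generator step ratio can be sketched as follows; all learning rates, the decay factor, and the weight-decay value here are assumptions, not the paper's hyperparameters:

```python
import torch

def make_optimizers(gen, critic, g_lr=1e-4, c_lr=2e-4,
                    weight_decay=1e-4, lr_gamma=0.95):
    # Critic gets a slightly higher learning rate; relatively high weight
    # decay plus exponential LR decay smooths the fast early learning.
    opt_g = torch.optim.RMSprop(gen.parameters(), lr=g_lr,
                                weight_decay=weight_decay)
    opt_c = torch.optim.RMSprop(critic.parameters(), lr=c_lr,
                                weight_decay=weight_decay)
    sched_g = torch.optim.lr_scheduler.ExponentialLR(opt_g, gamma=lr_gamma)
    sched_c = torch.optim.lr_scheduler.ExponentialLR(opt_c, gamma=lr_gamma)
    return opt_g, opt_c, sched_g, sched_c

N_CRITIC = 5  # the generator steps only once per five critic steps

def should_step_generator(batch_idx):
    return (batch_idx + 1) % N_CRITIC == 0
```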
