lil note

shuklabhay · shuklabhay · Jan 24, 2025 · Oct 6, 2024 · Oct 8, 2024 · Oct 9, 2024
commit 8aebdbde89cf9a7c18aff7e07ead6a723856479b
@@ -52,6 +52,8 @@ When converting generated audio representations to audio, this process occurs in
 
 This work utilizes a GAN architecture to create high-fidelity audio, exploiting adversarial loss to promote realism and detail within generated audio. To address the training instability of GANs, this work utilizes a Wasserstein GAN with gradient penalty (WGAN-GP). The Wasserstein distance provides a more stable measure of divergence between real and generated audio distributions than typical GAN loss functions, and minimizing this distance through the WGAN-GP framework empirically improves training stability and promotes convergence. In this work, the switch from a standard GAN to a WGAN architecture was instrumental in creating a model that could consistently converge to generating actual audio rather than noise.
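
For reference, the standard WGAN-GP critic objective can be written as below. The notation is an assumption rather than something taken from this work: $D$ is the critic, $P_r$ and $P_g$ are the real and generated audio distributions, $\hat{x}$ is a random interpolation between a real and a generated sample, and $\lambda$ is the gradient penalty weight, whose value is not stated here.

```latex
L_{\text{critic}} =
  \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right]
  - \mathbb{E}_{x \sim P_r}\left[ D(x) \right]
  + \lambda \, \mathbb{E}_{\hat{x}}\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The first two terms estimate the (negated) Wasserstein distance, and the last term pushes the critic's gradient norm toward 1 along the interpolated samples. The generator in this formulation minimizes $-\mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})]$.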
 
+VISION TRANSFORMER
+
 The final generator passes 128 latent dimensions into six transpose convolution blocks. Each of the first five consists of a 2D transpose convolution and batch normalization followed by a Leaky ReLU activation and a dropout layer. The final block contains a 2D transpose convolution and a hyperbolic tangent activation, creating a 256 by 256 representation of audio with values between -1 and 1.
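
As a rough check on how six transpose convolution blocks can reach a 256 by 256 output, the sketch below applies the standard transpose convolution output-size formula. The kernel size, stride, padding, and the 4 by 4 starting map (128 latent values reshaped to 8 channels of 4 by 4) are hypothetical values chosen for illustration; the text does not state them.

```python
def conv_transpose2d_out(size, kernel, stride, padding):
    """Output side length of a 2D transpose convolution
    (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Assumption: the 128 latent values are reshaped to an 8-channel 4x4 map.
START_SIZE = 4
# Assumption: every block uses kernel 4, stride 2, padding 1,
# which doubles the spatial size each time.
BLOCKS = [(4, 2, 1)] * 6


def generator_sizes(start=START_SIZE, blocks=BLOCKS):
    """Return the spatial side length before and after each block."""
    sizes = [start]
    for kernel, stride, padding in blocks:
        sizes.append(conv_transpose2d_out(sizes[-1], kernel, stride, padding))
    return sizes


print(generator_sizes())  # [4, 8, 16, 32, 64, 128, 256]
```

Other combinations (for example a larger first kernel starting from a 1 by 1 map) also reach 256, so this is only one configuration consistent with the description.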
 
 The critic consists of six convolution blocks, converting a 256 by 256 representation of audio to a single value, an approximation of the Wasserstein distance. The critic utilizes seven 2D convolution blocks, each with spectral normalization to stabilize training, batch normalization, a Leaky ReLU activation, and a dropout layer. There are two exceptions: the first layer does not utilize batch normalization, and the third layer includes a Linear Attention mechanism to help the model understand contextual relationships in feature maps and prevent the checkerboard artifacts that audio generation is often plagued with. After these operations, a final 2D convolution with spectral normalization is applied and the result is flattened, returning a single-value approximation of the Wasserstein distance for each input.
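
The reduction from 256 by 256 to a single value can be sketched with the standard convolution output-size formula. The kernel size, stride, and padding below are hypothetical, since the text does not state them, and the number of downsampling blocks is left as a parameter.

```python
def conv2d_out(size, kernel, stride, padding):
    """Output side length of a 2D convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def critic_sizes(start=256, n_blocks=6):
    """Spatial side length after each downsampling block and the final conv.

    Assumption: each block uses kernel 4, stride 2, padding 1 (halving the
    size), and the final convolution's kernel covers the whole remaining map.
    """
    sizes = [start]
    for _ in range(n_blocks):
        sizes.append(conv2d_out(sizes[-1], kernel=4, stride=2, padding=1))
    # Final convolution collapses the remaining map to 1x1.
    sizes.append(conv2d_out(sizes[-1], kernel=sizes[-1], stride=1, padding=0))
    return sizes


print(critic_sizes())            # [256, 128, 64, 32, 16, 8, 4, 1]
print(critic_sizes(n_blocks=7))  # [256, 128, 64, 32, 16, 8, 4, 2, 1]
```

Flattening the final 1 by 1 output gives one scalar per input, which is the critic's score for that sample.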