
starting work
shuklabhay committed Aug 19, 2024
1 parent 1435a58 commit 3490335
Showing 3 changed files with 95 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -4,6 +4,7 @@ Implementing Deep Convolution to generate audio using a generative network

## Directories

- `paper`: Research paper and static images
- `model`: Trained model and generated audio
- `src`: Model source code
- `utils`: Model and data utilities
60 changes: 60 additions & 0 deletions paper/main.md
@@ -0,0 +1,60 @@
# Kick it Out: Audio Generation With a Deep Convolution Generative Network

Abhay Shukla\
[email protected]\
Continuation of UCLA COSMOS 2024 Research

## Abstract

Generative adversarial networks have been used to much success for generating images.

## Introduction

## Background

## Methodology

## Results

## Discussion

## Conclusion

## References

<a id="1">[1]</a> DCGAN paper methodology structure etc
https://arxiv.org/abs/1511.06434

Reports results similar to mine: CNN-based generative networks that rely on up-convolutions fail to reproduce the spectral distributions of real data (see the sketch below for a quick check).
https://openaccess.thecvf.com/content_CVPR_2020/papers/Durall_Watch_Your_Up-Convolution_CNN_Based_Generative_Deep_Neural_Networks_Are_CVPR_2020_paper.pdf
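
A possible way to check for this (illustrative sketch only, not code from this repo; the file paths, FFT size, and `soundfile` loader are assumptions): compare the average log-magnitude spectrum of real and generated samples and look for a gap in the upper frequency bins.

```python
import numpy as np
import soundfile as sf  # assumed available; any audio loader works


def mean_log_spectrum(paths, n_fft=1024):
    """Average log-magnitude spectrum over a list of audio files."""
    spectra = []
    for path in paths:
        audio, _ = sf.read(path)
        if audio.ndim > 1:  # mix stereo down to mono
            audio = audio.mean(axis=1)
        mag = np.abs(np.fft.rfft(audio[:n_fft], n=n_fft))
        spectra.append(np.log1p(mag))
    return np.mean(spectra, axis=0)


# Placeholder paths; point these at real and generated samples.
real = mean_log_spectrum(["data/real_kick_01.wav"])
fake = mean_log_spectrum(["outputs/generated_kick_01.wav"])
# A large gap in the upper bins suggests the high-frequency artifacts
# described in the paper above.
print("mean high-frequency gap:", np.abs(real[-64:] - fake[-64:]).mean())
```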

Also discuss WaveNet-style models as an alternative modeling idea.
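
For context, a minimal sketch of the dilated causal convolution stack that WaveNet-style models build on (illustrative only, not code from this repo; the channel count and dilation schedule are arbitrary):

```python
import torch
import torch.nn as nn


class DilatedCausalStack(nn.Module):
    """A few dilated causal Conv1d layers; output length equals input length."""

    def __init__(self, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        blocks = []
        for d in dilations:
            blocks += [
                nn.ConstantPad1d((d, 0), 0.0),  # left-pad so no future samples are seen
                nn.Conv1d(channels, channels, kernel_size=2, dilation=d),
                nn.ReLU(inplace=True),
            ]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):  # x: (batch, channels, samples)
        return self.net(x)


x = torch.randn(1, 32, 16000)  # one second at 16 kHz, placeholder input
print(DilatedCausalStack()(x).shape)  # torch.Size([1, 32, 16000])
```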

I have to be doing something wrong; this has to be doable. Quickly check everything and make sure there is nothing more I can do, because I'm sure it's possible and the remaining problems are just limitations here; I don't know what else I can do to improve the model. There has to be some way to improve, or at least get better; all the changes I made should be making it better.

- Go through the code and clean up variables / make naming consistent (mostly the helpers)
- See if there is anything else that can be improved or any possible source of error (probably not, but there has to be something; with the changes I made the output should be better, not "worse", i.e. back to noise)
- Spend at most today doing this, but that's it. The paper has to happen now.

STRUCTURE OF A PAPER (Claude-generated)

1. Abstract: A brief summary of your paper, including the problem, methods, key results, and conclusions.
2. Introduction: Present the research problem, its importance, and your objectives.
3. Background/Literature Review: Provide context on deep convolution and its applications in audio generation. Review relevant previous work.
4. Methodology: Describe your approach, including:

- Neural network architecture
- Dataset description
- Training process
- Evaluation metrics

5. Results: Present your findings, including:

- Performance metrics
- Audio samples (if possible)
- Comparisons with other methods

6. Discussion: Interpret your results, discuss limitations, and suggest future work.
7. Conclusion: Summarize your key findings and their implications.
8. References: List all sources cited in your paper.
34 changes: 34 additions & 0 deletions src/architecture.py
@@ -12,6 +12,33 @@


# Model Components
class PhaseShuffleLayer(nn.Module):
    """Randomly shifts each (batch, channel) slice along the frame (time) axis."""

    def __init__(self, max_shift=2):
        super(PhaseShuffleLayer, self).__init__()
        self.max_shift = max_shift

    def forward(self, x):
        batch_size, channels, frames, freq = x.size()
        # One random shift in [-max_shift, max_shift] per (batch, channel) pair
        shifts = torch.randint(
            -self.max_shift,
            self.max_shift + 1,
            (batch_size, channels, 1, 1),
            device=x.device,
        )
        shifts = shifts.expand(-1, -1, frames, freq)

        # Frame indices, broadcastable to (batch, channels, frames, freq)
        idx = (
            torch.arange(frames, device=x.device)
            .unsqueeze(0)
            .unsqueeze(0)
            .unsqueeze(-1)
        )
        idx_shifted = idx + shifts
        # Clamp so shifted indices stay in range (border frames are repeated)
        idx_shifted = torch.clamp(idx_shifted, 0, frames - 1)

        return x.gather(2, idx_shifted)


class Generator(nn.Module):
def __init__(self):
super(Generator, self).__init__()
@@ -22,24 +49,31 @@ def __init__(self):
nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),
nn.BatchNorm2d(256),
nn.ReLU(inplace=True),
PhaseShuffleLayer(max_shift=2),
nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(inplace=True),
PhaseShuffleLayer(max_shift=2),
nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(inplace=True),
PhaseShuffleLayer(max_shift=2),
nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(inplace=True),
PhaseShuffleLayer(max_shift=2),
nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),
nn.BatchNorm2d(16),
nn.ReLU(inplace=True),
PhaseShuffleLayer(max_shift=2),
nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1),
nn.BatchNorm2d(8),
nn.ReLU(inplace=True),
PhaseShuffleLayer(max_shift=2),
nn.ConvTranspose2d(8, 4, kernel_size=4, stride=2, padding=1),
nn.BatchNorm2d(4),
nn.ReLU(inplace=True),
PhaseShuffleLayer(max_shift=2),
nn.ConvTranspose2d(4, N_CHANNELS, kernel_size=4, stride=2, padding=1),
nn.Upsample(
size=(N_FRAMES, N_FREQ_BINS), mode="bilinear", align_corners=False
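
A quick, hypothetical sanity check for the `PhaseShuffleLayer` added above (tensor sizes here are placeholders, not the model's actual constants):

```python
import torch

layer = PhaseShuffleLayer(max_shift=2)
x = torch.randn(4, 16, 64, 32)  # (batch, channels, frames, freq_bins), placeholder sizes
y = layer(x)
print(y.shape)  # torch.Size([4, 16, 64, 32]); the shape is unchanged
# Each (batch, channel) slice is shifted along the frame axis by a random
# offset in [-2, 2]; border frames repeat because the indices are clamped.
```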
