
Commit

paper static deps (#20)
shuklabhay authored Oct 6, 2024
1 parent d94763b commit 0db5702
Showing 22 changed files with 130 additions and 77 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -165,7 +165,7 @@ cython_debug/
data/
.vscode/
.DS_Store
outputs/training_progress
outputs/spectrogram_images
outputs/generated_audio.wav
outputs/generated_audio_[0-9]*
test.wav
24 changes: 15 additions & 9 deletions README.md
@@ -2,13 +2,13 @@

[![On Push](https://github.com/shuklabhay/stereo-sample-gan/actions/workflows/push.yml/badge.svg)](https://github.com/shuklabhay/stereo-sample-gan/actions/workflows/push.yml/badge.svg)

StereoSampleGAN: A lightweight approach to high fidelity stereo audio sample generation.
StereoSampleGAN: A computationally inexpensive approach to high fidelity stereo audio sample generation.

## Model Usage

### 1. Prereqs

- Optional but highly recommended: Set up a [Python virtual environment.](https://www.youtube.com/watch?v=e5GL1obY_sI)
- Optional but highly recommended: Set up a [Python virtual environment.](https://docs.python.org/3/library/venv.html)
- Audio loader package `librosa` requires an outdated version of NumPy
- Install requirements by running `pip3 install -r requirements.txt`

@@ -18,7 +18,7 @@ Specify sample count to generate, output, etc. in `usage_params.py`

- Generate audio from the Curated Kick model by running `python3 src/run_pretrained/generate_curated_kick.py`
- Generate audio from the Diverse Kick model by running `python3 src/run_pretrained/generate_diverse_kick.py`
- Generate audio from the One Shot model by running `python3 src/run_pretrained/generate_one_shot.py`
- Generate audio from the Instrument One Shot model by running `python3 src/run_pretrained/generate_instrument_one_shot.py`

### 3. Train model

@@ -30,7 +30,7 @@ Specify training data parameters in `usage_params.py`
- Train model by running `python3 src/stereo_sample_gan.py`
- Generate audio (based on current `usage_params.py`) by running `python3 src/generate.py`

Training progress visualization (Diverse Kick Drum Model):
Training progress visualization (while training the Diverse Kick Drum model):

<img src="static/diverse_kick_training_progress.gif" alt="Diverse kick training progress" width="400">

@@ -40,25 +40,31 @@ Training progress visualization (Diverse Kick Drum Model):

Kick drum generation model trained on ~8000 essentially random kick drums.

- More variation between each generated sample; audio is occasionally inconsistent and contains some artifacts.
- More variation between each generated sample; audio is occasionally inconsistent and noisy.

<img src="static/diverse_kick_generated_examples.png" alt="Diverse kick model generated examples" width="800">

### Curated Kick Drum

Kick drum generation model trained on ~4400 slightly more rigorously but still essentially randomly chosen kick drums.
Kick drum generation model trained on ~4400 kick drums with more closely matching overall characteristics.

- Less variation between each drum sample's decay and auditory tone.

- Less variation between each drum sample's tone; performs slightly better on an auditory test.
<img src="static/curated_kick_generated_examples.png" alt="Curated kick model generated examples" width="800">

### Instrument One Shot

Instrument one shot generation model, trained on ~3000 semi-curated instrument one shots.

- Demonstrates model's capability to generate longer audio, yet fails to generate coherent, usable instrument one shots.
- Demonstrates the model's capability to generate longer audio, yet fails to generate coherent and usable instrument one shots.

<img src="static/instrument_one_shot_generated_examples.png" alt="Instrument one shot model generated examples" width="800">

## Directories

- `outputs`: Trained model and generated audio
- `paper`: Research paper / model writeup
- `static`: Static images and gifs
- `static`: Static resources
- `src`: Model source code
- `utils`: Model and data utilities
- `data_processing`: Training data processing scripts
12 changes: 0 additions & 12 deletions paper/main.md
@@ -31,18 +31,6 @@ This model aims to focus on generating a category of audio and wholicsticly lear

### 3.1. Collection

Training data is primarily sourced from digital production “sample packs.” For kick drums, the main "case study" for this paper, the training data is a compilation of 7856 kick drum impulses with different characteristics and use cases (analog, electronic, pop, hip-hop, beatbox, heavy, punchy, etc.), overall providing a diverse range of potential drum sounds to generate. A metric to watch for model validation is how well the model is able to generate the following set of "defining" kick drum characteristics.

A kick drum's "defining" characteristics include:

1. A transient: The “click” at the beginning of the generated audio incorporating most of the frequency spectrum
2. A fundamental: The sustained, decaying low frequency "rumble" after the transient
3. An overall "decaying" nature (spectral centroid shifts downwards)
4. Ample variability between decay times for each sample

<img alt='Features of a Kick Drum' src="static/kick-drum-features.png" width="350">
<p><b>Fig 1:</b> <i>Visualization of key features of a kick drum.</i></p>
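
A minimal illustration of one such validation check is sketched below; it assumes `librosa` is installed, the file path is hypothetical, and it is not the paper's actual validation procedure. It tests characteristic 3, the downward shift of the spectral centroid:

```python
import librosa
import numpy as np

# Hypothetical path to a generated kick drum sample
audio, sr = librosa.load("outputs/generated_audio_1.wav", sr=44100, mono=True)

# Frame-wise spectral centroid in Hz
centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)[0]

# Compare the mean centroid of the first and last quarters of the sample:
# a kick drum should decay from a broadband transient toward a low fundamental.
n = len(centroid)
start_mean = np.mean(centroid[: n // 4])
end_mean = np.mean(centroid[-(n // 4):])
print(f"Centroid start: {start_mean:.1f} Hz, end: {end_mean:.1f} Hz")
print("Spectral centroid decays:", end_mean < start_mean)
```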

### 3.2. Feature Extraction/Encoding

The encoding step specifies the target audio shape, then finds an ideal hop length and frame size. The data shape is then cut down to remove an edge artifact at the end of the sample (the STFT output is generated slightly larger than the desired shape and includes an artifact only on those extra frames, so the crop fixes both problems).
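
A minimal sketch of this encoding step is given below. It is an illustration only, assuming `librosa`; the helper name, target shape, and exact hop/frame arithmetic are hypothetical rather than the repository's actual implementation:

```python
import librosa
import numpy as np

def encode_sample(path, sample_length=1.5, sr=44100, n_frames=256, n_freq_bins=256):
    """Illustrative STFT encoding: pick hop length and frame size for a target
    spectrogram shape, then crop the trailing edge artifact."""
    audio, _ = librosa.load(path, sr=sr, mono=False, duration=sample_length)
    channels = np.atleast_2d(audio)  # (2, samples) for stereo, (1, samples) for mono

    # Choose hop length and FFT size so the STFT roughly matches the target shape
    hop_length = int(sr * sample_length) // n_frames
    n_fft = 2 * (n_freq_bins - 1)

    spectrograms = []
    for channel in channels:
        stft = librosa.stft(channel, n_fft=n_fft, hop_length=hop_length)
        loudness = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
        # The STFT comes out slightly larger than the target shape, and the extra
        # trailing frames carry an edge artifact, so cropping fixes both problems.
        spectrograms.append(loudness[:n_freq_bins, :n_frames].T)

    return np.stack(spectrograms)  # (channels, frames, freq bins)
```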
45 changes: 45 additions & 0 deletions paper/paper.md
@@ -0,0 +1,45 @@
# StereoSampleGAN: A Computationally Inexpensive Approach to High Fidelity Stereo Audio Generation

Abhay Shukla\
[email protected]\
Continuation of UCLA COSMOS 2024 Research

## 1. Abstract

Existing convolutional approaches to audio generation are often limited to producing low-fidelity, single-channel, monophonic audio, while demanding significant computational resources for both training and inference. To address these challenges, this work introduces StereoSampleGAN, a novel audio generation architecture that combines a Deep Convolutional Wasserstein GAN (WGAN), attention mechanisms, and loss optimization techniques. StereoSampleGAN allows high-fidelity, stereo audio sample generation while remaining computationally efficient. Training on three distinct sample datasets with varying spectral overlap (two of kick drums and one of tonal one shots), StereoSampleGAN demonstrates promising results in generating high quality, simple stereo sounds. While it successfully learns to generate the "shape" of the required audio, it displays notable limitations in achieving the correct "tone," in some cases even generating incoherent noise. These results indicate finite limitations and areas for improvement in this approach to audio generation.

## 2. Introduction

## 3. Data Manipulation

### 3.1. Datasets

This paper utilizes three distinct datasets engineered to measure the model's resilience to variation in spectral content.

1. Curated Kick Drum Set: Kick drum impulses with primarily short decay profiles.

2. Diverse Kick Drum Set: Kick drum impulses with greater variation in decay profile and overall harmonic content.

3. Instrument One Shot Set: Single note impulses capturing the tonal qualities and spectral characteristics of varying synthesizer and instrument sounds.

These datasets provide a robust framework for determining the model's response to scaled variation within training data. Most audio is sourced from online "digital audio production sample packs," which compile sounds for a wide variety of genres and use cases.

### 3.2. Feature Extraction and Encoding

## 4. Model Implementation

### 4.1. Architecture

### 4.2. Training

## 5. Results and Discussion

### 5.1. Evaluation

The model generated 44.1 kHz, technically high quality audio, but not audio of high perceptual quality (an important distinction). Generated samples capture the "shape" of the target audio but not its "tone" (the fundamental is often missing entirely). This outcome is consistent with limitations of the Fourier-transform representation and with training on the shape of the spectrogram image rather than on the audio itself.

### 5.2. Contributions

## 6. Conclusion

## 7. References
13 changes: 8 additions & 5 deletions src/data_processing/audio_processing_validation.py
@@ -4,21 +4,24 @@
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import random
from usage_params import compiled_data_path, training_sample_length
from usage_params import UsageParams
from utils.signal_helpers import (
stft_and_istft,
)

# Initialize sample selection
params = UsageParams()


def choose_random_sample():
audio_files = [
f
for f in os.listdir(compiled_data_path)
if os.path.isfile(os.path.join(compiled_data_path, f))
for f in os.listdir(params.compiled_data_path)
if os.path.isfile(os.path.join(params.compiled_data_path, f))
]
if audio_files:
sample_name = random.choice(audio_files)
sample_path = os.path.join(compiled_data_path, sample_name)
sample_path = os.path.join(params.compiled_data_path, sample_name)
return sample_path, sample_name
else:
return None, None
@@ -27,4 +30,4 @@ def choose_random_sample():
# Analyze fourier transform audio degradation
sample_path, sample_name = choose_random_sample()

stft_and_istft(sample_path, "test", training_sample_length)
stft_and_istft(sample_path, "test", params.training_sample_length)
9 changes: 5 additions & 4 deletions src/data_processing/encode_audio_data.py
@@ -8,18 +8,19 @@
load_loudness_data,
)
from utils.signal_helpers import encode_sample_directory
from usage_params import training_audio_dir, compiled_data_path
from usage_params import UsageParams

# Encode audio samples
params = UsageParams()
if len(sys.argv) > 1:
visualize = sys.argv[1].lower() == "visualize"
else:
visualize = False


encode_sample_directory(training_audio_dir, compiled_data_path, visualize)
encode_sample_directory(params.training_audio_dir, params.compiled_data_path, visualize)

real_data = load_loudness_data(
compiled_data_path
params.compiled_data_path
) # datapts, channels, frames, freq bins
print(f"{training_audio_dir} data shape: {str(real_data.shape)}")
print(f"{params.training_audio_dir} data shape: {str(real_data.shape)}")
8 changes: 3 additions & 5 deletions src/generate.py
@@ -1,9 +1,7 @@
from utils.generation_helpers import generate_audio
from usage_params import (
model_to_generate_with,
training_sample_length,
)
from usage_params import UsageParams


# Generate based on usage_params
generate_audio(model_to_generate_with, training_sample_length)
params = UsageParams()
generate_audio(params.model_to_generate_with, params.training_sample_length, True)
7 changes: 5 additions & 2 deletions src/stereo_sample_gan.py
@@ -10,20 +10,23 @@
load_loudness_data,
)

from usage_params import compiled_data_path
from usage_params import UsageParams

# Constants
LR_G = 0.003
LR_C = 0.004

# Load data
all_spectrograms = load_loudness_data(compiled_data_path)
params = UsageParams()
all_spectrograms = load_loudness_data(params.compiled_data_path)
all_spectrograms = torch.FloatTensor(all_spectrograms)

train_size = int(0.8 * len(all_spectrograms))
val_size = len(all_spectrograms) - train_size
train_dataset, val_dataset = random_split(
TensorDataset(all_spectrograms), [train_size, val_size]
)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

30 changes: 18 additions & 12 deletions src/usage_params.py
@@ -1,16 +1,22 @@
# Main params
audio_generation_count = 2 # Audio examples to generate
class UsageParams:
def __init__(self):
self.audio_generation_count = 2 # Audio examples to generate

# Training params
training_sample_length = 1.5 # seconds
outputs_dir = "outputs" # Where to save your generated audio & model
# Training params
self.training_sample_length = 1.5 # seconds
self.outputs_dir = "outputs" # Where to save your generated audio & model

model_save_name = "StereoSampleGAN-InstrumentOneShot" # What to name your model save
training_audio_dir = "data/one_shots" # Your training data path
compiled_data_path = "data/compiled_data.npy" # Your compiled data/output path
model_save_path = f"{outputs_dir}/{model_save_name}.pth"
self.model_save_name = (
"StereoSampleGAN-InstrumentOneShot" # What to name your model save
)
self.training_audio_dir = "data/one_shots" # Your training data path
self.compiled_data_path = (
"data/compiled_data.npy" # Your compiled data/output path
)
self.model_save_path = f"{self.outputs_dir}/{self.model_save_name}.pth"

# Generating audio
model_to_generate_with = model_save_path # Generation model path
generated_audio_name = "generated_audio" # Output file name
visualize_generated = True # Show generated audio spectrograms
# Generating audio
self.model_to_generate_with = self.model_save_path # Generation model path
self.generated_audio_name = "generated_audio" # Output file name
self.visualize_generated = True # Show generated audio spectrograms
7 changes: 4 additions & 3 deletions src/utils/file_helpers.py
@@ -3,9 +3,10 @@
import torch
import soundfile as sf

from usage_params import model_save_path
from usage_params import UsageParams

# Constants
params = UsageParams()
GLOBAL_SR = 44100


@@ -26,9 +27,9 @@ def save_model(model):
# Save model
torch.save(
model.state_dict(),
model_save_path,
params.model_save_path,
)
print(f"Model saved at {model_save_path}")
print(f"Model saved at {params.model_save_path}")


def get_device():
26 changes: 14 additions & 12 deletions src/utils/generation_helpers.py
@@ -1,18 +1,16 @@
import os
import torch
from architecture import Generator, LATENT_DIM
from usage_params import (
outputs_dir,
generated_audio_name,
audio_generation_count,
visualize_generated,
)
from usage_params import UsageParams
from utils.file_helpers import get_device, save_audio
from utils.signal_helpers import audio_to_norm_db, graph_spectrogram, norm_db_to_audio


# Generation function
def generate_audio(generation_model_save, len_audio_in):
params = UsageParams()


def generate_audio(generation_model_save, len_audio_in, save_images=False):
device = get_device()

generator = Generator()
Expand All @@ -26,24 +24,28 @@ def generate_audio(generation_model_save, len_audio_in):
generator.eval()

# Generate audio
z = torch.randn(audio_generation_count, LATENT_DIM, 1, 1)
z = torch.randn(params.audio_generation_count, LATENT_DIM, 1, 1)
with torch.no_grad():
generated_output = generator(z)

generated_output = generated_output.squeeze().numpy()
print("Generated output shape:", generated_output.shape)

# Visualize and save audio
for i in range(audio_generation_count):
for i in range(params.audio_generation_count):
current_sample = generated_output[i]

audio_info = norm_db_to_audio(current_sample, len_audio_in)
audio_save_path = os.path.join(
outputs_dir, f"{generated_audio_name}_{i + 1}.wav"
params.outputs_dir, f"{params.generated_audio_name}_{i + 1}.wav"
)

save_audio(audio_save_path, audio_info)

if visualize_generated is True:
if params.visualize_generated is True:
vis_signal_after_istft = audio_to_norm_db(audio_info)
graph_spectrogram(vis_signal_after_istft, "generated audio (after istft)")
graph_spectrogram(
vis_signal_after_istft,
f"{params.generated_audio_name}_{i + 1}",
save_images,
)