
Commit

paper static deps (#20)
shuklabhay authored Oct 6, 2024
1 parent d94763b commit 0db5702
Showing 22 changed files with 130 additions and 77 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -165,7 +165,7 @@ cython_debug/
data/
.vscode/
.DS_Store
outputs/training_progress
outputs/spectrogram_images
outputs/generated_audio.wav
outputs/generated_audio_[0-9]*
test.wav
24 changes: 15 additions & 9 deletions README.md
@@ -2,13 +2,13 @@

[![On Push](https://github.com/shuklabhay/stereo-sample-gan/actions/workflows/push.yml/badge.svg)](https://github.com/shuklabhay/stereo-sample-gan/actions/workflows/push.yml/badge.svg)

StereoSampleGAN: A lightweight approach to high fidelity stereo audio sample generation.
StereoSampleGAN: A computationally inexpensive approach to high fidelity stereo audio sample generation.

## Model Usage

### 1. Prereqs

- Optional but highly recommended: Set up a [Python virtual environment.](https://www.youtube.com/watch?v=e5GL1obY_sI)
- Optional but highly recommended: Set up a [Python virtual environment.](https://docs.python.org/3/library/venv.html)
- Audio loader package `librosa` requires an outdated version of NumPy
- Install requirements by running `pip3 install -r requirements.txt`

@@ -18,7 +18,7 @@ Specify sample count to generate, output, etc. in `usage_params.py`

- Generate audio from the Curated Kick model by running `python3 src/run_pretrained/generate_curated_kick.py`
- Generate audio from the Diverse Kick model by running `python3 src/run_pretrained/generate_diverse_kick.py`
- Generate audio from the One Shot model by running `python3 src/run_pretrained/generate_one_shot.py`
- Generate audio from the Instrument One Shot model by running `python3 src/run_pretrained/generate_instrument_one_shot.py`

### 3. Train model

@@ -30,7 +30,7 @@ Specify training data parameters in `usage_params.py`
- Train model by running `python3 src/stereo_sample_gan.py`
- Generate audio (based on current `usage_params.py`) by running `python3 src/generate.py`

Training progress visualization (Diverse Kick Drum Model):
Training progress visualization (while training the Diverse Kick Drum model):

<img src="static/diverse_kick_training_progress.gif" alt="Diverse kick training progress" width="400">

@@ -40,25 +40,31 @@ Training progress visualization (Diverse Kick Drum Model):

Kick drum generation model trained on ~8000 essentially random kick drums.

- More variation between each generated sample; audio is occasionally inconsistent and contains some artifacts.
- More variation between each generated sample; audio is occasionally inconsistent and noisy.

<img src="static/diverse_kick_generated_examples.png" alt="Diverse kick model generated examples" width="800">

### Curated Kick Drum

Kick drum generation model trained on ~4400 slightly more rigorously but still essentially randomly chosen kick drums.
Kick drum generation model trained on ~4400 kick drums with more closely matching overall characteristics.

- Less variation between each drum sample's decay and auditory tone.

- Less variation between each drum sample's tone; performs slightly better on an auditory test.
<img src="static/curated_kick_generated_examples.png" alt="Curated kick model generated examples" width="800">

### Instrument One Shot

Instrument one shot generation model, trained on ~3000 semi-curated instrument one shots.

- Demonstrates model's capability to generate longer audio, yet fails to generate coherent, usable instrument one shots.
- Demonstrates the model's capability to generate longer audio, yet fails to generate coherent and usable instrument one shots.

<img src="static/instrument_one_shot_generated_examples.png" alt="Instrument one shot model generated examples" width="800">

## Directories

- `outputs`: Trained model and generated audio
- `paper`: Research paper / model writeup
- `static`: Static images and gifs
- `static`: Static resources
- `src`: Model source code
- `utils`: Model and data utilities
- `data_processing`: Training data processing scripts
12 changes: 0 additions & 12 deletions paper/main.md
@@ -31,18 +31,6 @@ This model aims to focus on generating a category of audio and wholicsticly lear

### 3.1. Collection

Training data is primarily sourced from digital production “sample packs.” For kick drums, the main "case study" for this paper, the training data is a compilation of 7856 kick drum impulses with different characteristics and use cases (analog, electronic, pop, hip-hop, beatbox, heavy, punchy, etc.), overall providing a diverse range of potential drum sounds to generate. A metric to watch for model validation is how well the model is able to generate the following set of "defining" kick drum characteristics.

A kick drum's "defining" characteristics include:

1. A transient: The “click” at the beginning of the generated audio incorporating most of the frequency spectrum
2. A fundamental: The sustained, decaying low frequency "rumble" after the transient
3. An overall "decaying" nature (spectral centroid shifts downwards)
4. Ample variability between decay times for each sample

<img alt='Features of a Kick Drum' src="static/kick-drum-features.png" width="350">
<p><b>Fig 1:</b> <i>Visualization of key features of a kick drum.</i></p>
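
A minimal illustration of one such validation check is sketched below; it assumes `librosa` is installed, the file path is hypothetical, and it is not the paper's actual validation procedure. It tests characteristic 3, the downward shift of the spectral centroid:

```python
import librosa
import numpy as np

# Hypothetical path to a generated kick drum sample
audio, sr = librosa.load("outputs/generated_audio_1.wav", sr=44100, mono=True)

# Frame-wise spectral centroid in Hz
centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)[0]

# Compare the mean centroid of the first and last quarters of the sample:
# a kick drum should decay from a broadband transient toward a low fundamental.
n = len(centroid)
start_mean = np.mean(centroid[: n // 4])
end_mean = np.mean(centroid[-(n // 4):])
print(f"Centroid start: {start_mean:.1f} Hz, end: {end_mean:.1f} Hz")
print("Spectral centroid decays:", end_mean < start_mean)
```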

### 3.2. Feature Extraction/Encoding

The encoding step specifies the target audio shape, then finds an ideal hop length and frame size. The data shape is then cut down to remove an edge artifact at the end of the sample (the STFT output is generated slightly larger than the desired shape and includes an artifact only on those extra frames, so the crop fixes both problems).
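
A minimal sketch of this encoding step is given below. It is an illustration only, assuming `librosa`; the helper name, target shape, and exact hop/frame arithmetic are hypothetical rather than the repository's actual implementation:

```python
import librosa
import numpy as np

def encode_sample(path, sample_length=1.5, sr=44100, n_frames=256, n_freq_bins=256):
    """Illustrative STFT encoding: pick hop length and frame size for a target
    spectrogram shape, then crop the trailing edge artifact."""
    audio, _ = librosa.load(path, sr=sr, mono=False, duration=sample_length)
    channels = np.atleast_2d(audio)  # (2, samples) for stereo, (1, samples) for mono

    # Choose hop length and FFT size so the STFT roughly matches the target shape
    hop_length = int(sr * sample_length) // n_frames
    n_fft = 2 * (n_freq_bins - 1)

    spectrograms = []
    for channel in channels:
        stft = librosa.stft(channel, n_fft=n_fft, hop_length=hop_length)
        loudness = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
        # The STFT comes out slightly larger than the target shape, and the extra
        # trailing frames carry an edge artifact, so cropping fixes both problems.
        spectrograms.append(loudness[:n_freq_bins, :n_frames].T)

    return np.stack(spectrograms)  # (channels, frames, freq bins)
```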
45 changes: 45 additions & 0 deletions paper/paper.md
@@ -0,0 +1,45 @@
# StereoSampleGAN: A Computationally Inexpensive Approach to High Fidelity Stereo Audio Generation

Abhay Shukla\
[email protected]\
Continuation of UCLA COSMOS 2024 Research

## 1. Abstract

Existing convolutional approaches to audio generation are often limited to producing low-fidelity, single-channel, monophonic audio, while demanding significant computational resources for both training and inference. To address these challenges, this work introduces StereoSampleGAN, a novel audio generation architecture that combines a Deep Convolutional Wasserstein GAN (WGAN), attention mechanisms, and loss optimization techniques. StereoSampleGAN allows high-fidelity, stereo audio sample generation while remaining computationally efficient. Training on three distinct sample datasets with varying spectral overlap (two of kick drums and one of tonal one shots), StereoSampleGAN demonstrates promising results in generating high quality, simple stereo sounds. While it successfully learns to generate the "shape" of the required audio, it displays notable limitations in achieving the correct "tone," in some cases even generating incoherent noise. These results indicate finite limitations and areas for improvement in this approach to audio generation.

## 2. Introduction

## 3. Data Manipulation

### 3.1. Datasets

This paper utilizes three distinct datasets engineered to measure the model's resilience to variation in spectral content.

1. Curated Kick Drum Set: Kick drum impulses with primarily short decay profiles.

2. Diverse Kick Drum Set: Kick drum impulses with greater variation in decay profile and overall harmonic content.

3. Instrument One Shot Set: Single note impulses capturing the tonal qualities and spectral characteristics of varying synthesizer and instrument sounds.

These datasets provide a robust framework for determining the model's response to scaled variation within training data. Most audio is sourced from online "digital audio production sample packs," which compile sounds for a wide variety of genres and use cases.

### 3.2. Feature Extraction and Encoding

## 4. Model Implementation

### 4.1. Architecture

### 4.2. Training

## 5. Results and Discussion

### 5.1. Evaluation

The model generated 44.1 kHz, technically high quality audio, but not audio of high perceptual quality (an important distinction). Generated samples capture the "shape" of the target audio but not its "tone" (the fundamental is often missing entirely). This outcome is consistent with limitations of the Fourier-transform representation and with training on the shape of the spectrogram image rather than on the audio itself.

### 5.2. Contributions

## 6. Conclusion

## 7. References
13 changes: 8 additions & 5 deletions src/data_processing/audio_processing_validation.py
@@ -4,21 +4,24 @@
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import random
from usage_params import compiled_data_path, training_sample_length
from usage_params import UsageParams
from utils.signal_helpers import (
stft_and_istft,
)

# Initialize sample selection
params = UsageParams()


def choose_random_sample():
audio_files = [
f
for f in os.listdir(compiled_data_path)
if os.path.isfile(os.path.join(compiled_data_path, f))
for f in os.listdir(params.compiled_data_path)
if os.path.isfile(os.path.join(params.compiled_data_path, f))
]
if audio_files:
sample_name = random.choice(audio_files)
sample_path = os.path.join(compiled_data_path, sample_name)
sample_path = os.path.join(params.compiled_data_path, sample_name)
return sample_path, sample_name
else:
return None, None
@@ -27,4 +30,4 @@ def choose_random_sample():
# Analyze fourier transform audio degradation
sample_path, sample_name = choose_random_sample()

stft_and_istft(sample_path, "test", training_sample_length)
stft_and_istft(sample_path, "test", params.training_sample_length)
9 changes: 5 additions & 4 deletions src/data_processing/encode_audio_data.py
@@ -8,18 +8,19 @@
load_loudness_data,
)
from utils.signal_helpers import encode_sample_directory
from usage_params import training_audio_dir, compiled_data_path
from usage_params import UsageParams

# Encode audio samples
params = UsageParams()
if len(sys.argv) > 1:
visualize = sys.argv[1].lower() == "visualize"
else:
visualize = False


encode_sample_directory(training_audio_dir, compiled_data_path, visualize)
encode_sample_directory(params.training_audio_dir, params.compiled_data_path, visualize)

real_data = load_loudness_data(
compiled_data_path
params.compiled_data_path
) # datapts, channels, frames, freq bins
print(f"{training_audio_dir} data shape: {str(real_data.shape)}")
print(f"{params.training_audio_dir} data shape: {str(real_data.shape)}")
8 changes: 3 additions & 5 deletions src/generate.py
@@ -1,9 +1,7 @@
from utils.generation_helpers import generate_audio
from usage_params import (
model_to_generate_with,
training_sample_length,
)
from usage_params import UsageParams


# Generate based on usage_params
generate_audio(model_to_generate_with, training_sample_length)
params = UsageParams()
generate_audio(params.model_to_generate_with, params.training_sample_length, True)
7 changes: 5 additions & 2 deletions src/stereo_sample_gan.py
@@ -10,20 +10,23 @@
load_loudness_data,
)

from usage_params import compiled_data_path
from usage_params import UsageParams

# Constants
LR_G = 0.003
LR_C = 0.004

# Load data
all_spectrograms = load_loudness_data(compiled_data_path)
params = UsageParams()
all_spectrograms = load_loudness_data(params.compiled_data_path)
all_spectrograms = torch.FloatTensor(all_spectrograms)

train_size = int(0.8 * len(all_spectrograms))
val_size = len(all_spectrograms) - train_size
train_dataset, val_dataset = random_split(
TensorDataset(all_spectrograms), [train_size, val_size]
)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

30 changes: 18 additions & 12 deletions src/usage_params.py
@@ -1,16 +1,22 @@
# Main params
audio_generation_count = 2 # Audio examples to generate
class UsageParams:
def __init__(self):
self.audio_generation_count = 2 # Audio examples to generate

# Training params
training_sample_length = 1.5 # seconds
outputs_dir = "outputs" # Where to save your generated audio & model
# Training params
self.training_sample_length = 1.5 # seconds
self.outputs_dir = "outputs" # Where to save your generated audio & model

model_save_name = "StereoSampleGAN-InstrumentOneShot" # What to name your model save
training_audio_dir = "data/one_shots" # Your training data path
compiled_data_path = "data/compiled_data.npy" # Your compiled data/output path
model_save_path = f"{outputs_dir}/{model_save_name}.pth"
self.model_save_name = (
"StereoSampleGAN-InstrumentOneShot" # What to name your model save
)
self.training_audio_dir = "data/one_shots" # Your training data path
self.compiled_data_path = (
"data/compiled_data.npy" # Your compiled data/output path
)
self.model_save_path = f"{self.outputs_dir}/{self.model_save_name}.pth"

# Generating audio
model_to_generate_with = model_save_path # Generation model path
generated_audio_name = "generated_audio" # Output file name
visualize_generated = True # Show generated audio spectrograms
# Generating audio
self.model_to_generate_with = self.model_save_path # Generation model path
self.generated_audio_name = "generated_audio" # Output file name
self.visualize_generated = True # Show generated audio spectrograms
7 changes: 4 additions & 3 deletions src/utils/file_helpers.py
@@ -3,9 +3,10 @@
import torch
import soundfile as sf

from usage_params import model_save_path
from usage_params import UsageParams

# Constants
params = UsageParams()
GLOBAL_SR = 44100


@@ -26,9 +27,9 @@ def save_model(model):
# Save model
torch.save(
model.state_dict(),
model_save_path,
params.model_save_path,
)
print(f"Model saved at {model_save_path}")
print(f"Model saved at {params.model_save_path}")


def get_device():
26 changes: 14 additions & 12 deletions src/utils/generation_helpers.py
@@ -1,18 +1,16 @@
import os
import torch
from architecture import Generator, LATENT_DIM
from usage_params import (
outputs_dir,
generated_audio_name,
audio_generation_count,
visualize_generated,
)
from usage_params import UsageParams
from utils.file_helpers import get_device, save_audio
from utils.signal_helpers import audio_to_norm_db, graph_spectrogram, norm_db_to_audio


# Generation function
def generate_audio(generation_model_save, len_audio_in):
params = UsageParams()


def generate_audio(generation_model_save, len_audio_in, save_images=False):
device = get_device()

generator = Generator()
Expand All @@ -26,24 +24,28 @@ def generate_audio(generation_model_save, len_audio_in):
generator.eval()

# Generate audio
z = torch.randn(audio_generation_count, LATENT_DIM, 1, 1)
z = torch.randn(params.audio_generation_count, LATENT_DIM, 1, 1)
with torch.no_grad():
generated_output = generator(z)

generated_output = generated_output.squeeze().numpy()
print("Generated output shape:", generated_output.shape)

# Visualize and save audio
for i in range(audio_generation_count):
for i in range(params.audio_generation_count):
current_sample = generated_output[i]

audio_info = norm_db_to_audio(current_sample, len_audio_in)
audio_save_path = os.path.join(
outputs_dir, f"{generated_audio_name}_{i + 1}.wav"
params.outputs_dir, f"{params.generated_audio_name}_{i + 1}.wav"
)

save_audio(audio_save_path, audio_info)

if visualize_generated is True:
if params.visualize_generated is True:
vis_signal_after_istft = audio_to_norm_db(audio_info)
graph_spectrogram(vis_signal_after_istft, "generated audio (after istft)")
graph_spectrogram(
vis_signal_after_istft,
f"{params.generated_audio_name}_{i + 1}",
save_images,
)