# Kick it Out: Two Channel Kick Drum Generation With a Deep Convolution Generative Architecture

Abhay Shukla\ | ||
[email protected]\ | ||
Continuation of UCLA COSMOS 2024 Research | ||

## Introduction

Since their introduction, convolutional neural network based Generative Adversarial Networks (DCGANs) have vastly increased the capabilities of machine learning models, allowing high-fidelity synthetic image generation [1]. Despite these capabilities, audio generation is a more complicated problem for DCGANs, as a model must capture and replicate sophisticated temporal relationships and spectral characteristics. To make this task easier, audio generation models often reduce the sophistication of their data, collapsing multi-channel signals into one and lowering sampling rates, leading to a loss of audio quality. This work aims to generate high-quality, multi-channel kick drum audio representations while not straying far from a Deep Convolutional Generative Adversarial Network architecture.

This investigation primarily seeks to determine how feasible it is to use a pure DCGAN architecture to recognize and replicate the spatial and temporal patterns of an image representation of a kick drum. We also experiment with generating pure sine waves as a means of validation.

Kick drums, snare drums, full drum loops, and synth impulses were all considered as sounds to generate, but this work attempts to generate kick drums because they best meet the criteria: they contain some temporal patterning, are not overly complex sounds, and are quick impulses. Kick drums are simple sounds with some, but not an unbounded, amount of possible variance. They are also an integral part of digital audio production and a foundational element of almost every song and drum set. Due to this importance, finding a large quantity of high-quality, unique kick drum samples is a persistent problem in the digital audio production environment.

## Data Manipulation

Training data is first sourced from digital production “sample packs” compiled…

### Feature Extraction/Encoding

The training data used is a compilation of 7856 audio samples. A simple DCGAN cannot learn the time-series component of audio, so this feature extraction process must flatten the time-series component into a static form of data. This is achieved by representing audio in the time-frequency domain. Each sample is first converted into a raw audio array using a standard 44,100 Hz sampling rate, preserving the two-channel character of the data. The audio is then normalized to a length of 500 milliseconds and passed into a Short-time Fourier Transform with a window of 512 and a hop size of 128, returning a representation of each kick drum as an array of amplitudes with 2 channels, 176 frames of audio, and 257 frequency bins. The parameters of the Short-time Fourier Transform are partially determined by hardware constraints.

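A minimal sketch of this encoding step, assuming SciPy's `scipy.signal.stft`; the function name `encode_clip` is illustrative, and the exact frame count depends on how the transform pads the signal boundaries.

```python
import numpy as np
from scipy.signal import stft

SAMPLE_RATE = 44100   # standard sampling rate, stereo channels preserved
CLIP_SECONDS = 0.5    # every sample normalized to 500 ms
N_FFT = 512           # STFT window -> 512 // 2 + 1 = 257 frequency bins
HOP = 128             # hop size between successive windows

def encode_clip(audio: np.ndarray) -> np.ndarray:
    """Encode a (2, num_samples) stereo clip as a (2, frames, 257) magnitude array."""
    target_len = int(SAMPLE_RATE * CLIP_SECONDS)
    # Pad with silence or truncate so every clip is exactly 500 ms.
    if audio.shape[1] < target_len:
        audio = np.pad(audio, ((0, 0), (0, target_len - audio.shape[1])))
    else:
        audio = audio[:, :target_len]
    # SciPy applies the transform along the last axis, once per channel.
    _, _, spec = stft(audio, fs=SAMPLE_RATE, nperseg=N_FFT, noverlap=N_FFT - HOP)
    magnitudes = np.abs(spec)             # shape (2, 257, frames)
    return magnitudes.transpose(0, 2, 1)  # shape (2, frames, 257)
```
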
While amplitude data (the output of the Fourier transform) is important, it is by nature skewed towards lower frequencies, which contain more intensity. To mitigate the effect this has on training, a feature extraction step equalizes the representation of frequencies in the data. First, after channel amplitudes are extracted, the tensor of data is scaled to lie between 0 and 100. The data is then passed through a noise threshold where all values under 10e-10 are set to zero. This normalized, noise-gated amplitude information is then converted to a logarithmic decibel scale, which describes perceived loudness instead of intensity and displays audio information more uniformly across the frequency spectrum. Finally, the data is scaled to lie between -1 and 1, matching the output the model creates with its hyperbolic tangent activation function.

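The chain described above can be sketched as follows; the constants come from the text, while the small epsilon inside the logarithm is an added assumption to keep `log10` defined at zero.

```python
import numpy as np

NOISE_FLOOR = 10e-10  # threshold from the text; values below it are zeroed
EPS = 1e-12           # assumed epsilon so log10 never receives zero

def amplitudes_to_normalized_db(mag: np.ndarray) -> np.ndarray:
    """Map raw STFT magnitudes into the [-1, 1] range of a tanh generator output."""
    scaled = 100.0 * mag / mag.max()        # scale amplitudes to [0, 100]
    scaled[scaled < NOISE_FLOOR] = 0.0      # noise gate
    db = 20.0 * np.log10(scaled + EPS)      # intensity -> decibel (perceived loudness)
    db_min, db_max = db.min(), db.max()
    return 2.0 * (db - db_min) / (db_max - db_min) - 1.0  # rescale to [-1, 1]
```
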
[show amp data and loudness spectrograms]

The model itself is a standard DCGAN model [1] with two slight modifications…

This work uses 80% of the dataset as training data and 20% as validation, with all data split into batches of 16. The loss function is Binary Cross Entropy with Logit Loss, and both the generator and discriminator use the Adam optimizer with separate learning rates. Due to hardware limitations, the model is trained over ten epochs. Validation occurs every 5 epochs, and label smoothing is applied to prevent discriminator overconfidence; a sketch of this setup appears below.

[TODO: talk about custom loss metrics]

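A minimal PyTorch sketch of this training setup, under stated assumptions: the tiny `netG`/`netD` modules are placeholders for the actual DCGAN, the learning rates are illustrative (the text only says they differ), and label smoothing is implemented as one-sided smoothing of the real label.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in dataset shaped like the encoded spectrograms: (N, 2 channels, 176 frames, 257 bins).
data = TensorDataset(torch.randn(100, 2, 176, 257))
n_train = int(0.8 * len(data))                        # 80/20 train/validation split
train_set, val_set = random_split(data, [n_train, len(data) - n_train])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)

criterion = nn.BCEWithLogitsLoss()                    # BCE with logit loss
netG = nn.Sequential(nn.ConvTranspose2d(100, 2, 4))   # placeholder generator
netD = nn.Sequential(nn.Conv2d(2, 1, 4))              # placeholder discriminator
optG = torch.optim.Adam(netG.parameters(), lr=2e-4)   # separate, assumed learning rates
optD = torch.optim.Adam(netD.parameters(), lr=1e-4)

REAL_LABEL = 0.9  # label smoothing: real targets below 1.0 curb overconfidence

for (real,) in train_loader:
    logits = netD(real).flatten(1).mean(dim=1)        # collapse to one logit per sample
    loss_real = criterion(logits, torch.full_like(logits, REAL_LABEL))
    # ...fake pass through netG, backward passes, and optimizer steps go here.
    break
```
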
![Average Data Point](static/average-data-point.png)

## Results

### Kick Drum Generation
The commit also updates the sample-encoding script, shown here in its post-commit form:

```python
from helpers import (
    audio_data_dir,
    average_spectrogram_path,
    compiled_data_path,
    compute_average_spectrogram,
    encode_sample_directory,
    graph_spectrogram,
    load_npy_data,
    normalized_db_to_wav,
)

# Encode every sample in the audio directory, then compute the
# dataset-wide average spectrogram.
encode_sample_directory(audio_data_dir, silent=True)
compute_average_spectrogram()

real_data = load_npy_data(compiled_data_path)  # (datapoints, channels, frames, freq bins)
average_data = load_npy_data(average_spectrogram_path)  # (channels, frames, freq bins)
print("Data " + str(real_data.shape))
print("Average " + str(average_data.shape))

graph_spectrogram(average_data, "Average Data Point")
```