
Commit

optim model
shuklabhay committed Aug 22, 2024
1 parent 65afeb4 commit 00365c5
Showing 13 changed files with 67 additions and 55 deletions.
Binary file modified model/DCGAN_final_model.pth
22 changes: 11 additions & 11 deletions paper/main.md
@@ -1,6 +1,4 @@
# Kick it Out: Two Channel Kick Drum Generation With a Deep Convolution Generative Architecture.

use attention, use special loss calcs, if I get this working then BOOM make it about kick generation using super optimized deep conv
# Kick it Out: Multi-Channel Kick Drum Generation With a Deep Convolution Generative Architecture.

Abhay Shukla\
[email protected]\
@@ -31,21 +29,25 @@ Training data is first sourced from digital production “sample packs” compil

### Feature Extraction/Encoding

The training data used is a compilation of 7856 audio samples. A simple DCGAN cannot learn the time-series component of audio, so this feature extraction process must flatten the time-series component into a static form of data. This is achieved by representing audio in the time-frequency domain. Each sample is first converted into a raw audio array using a standard 44100 Hz sampling rate while preserving the two-channel characteristic of the data. The audio sample is then normalized to a length of 500 milliseconds and passed into a Short-time Fourier Transform with a window of 512 and hop size of 128, returning a representation of a kick drum as an array of amplitudes across 2 channels, 176 frames of audio, and 257 frequency bins. The parameters for the Short-time Fourier Transform are partially determined by hardware constraints.
The training data used is a compilation of 7856 audio samples. A simple DCGAN cannot learn the time-series component of audio, so this feature extraction process must flatten the time-series component into a static form of data. This is achieved by representing audio in the time-frequency domain. Each sample is first converted into a raw audio array using a standard 44100 Hz sampling rate while preserving the two-channel characteristic of the data. The audio sample is then normalized to a length of 500 milliseconds and passed into a Short-time Fourier Transform with a Kaiser (β = 14) window of length 512 and hop size of 128, returning a representation of a kick drum as an array of amplitudes across 2 channels, 176 frames of audio, and 257 frequency bins. The parameters for the Short-time Fourier Transform are partially determined by hardware constraints.

[TODO: discuss the FFT parameters specifically: the window size, and using a larger number of frames then cutting them down so the retained information is more detailed without the useless content; also discuss doing the opposite when generating the audio file back.]
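
A minimal sketch of this encoding step, mirroring the constants in `src/utils/helpers.py`; `librosa` is assumed here only for file loading, and the exact frame count depends on padding:

```python
import numpy as np
import scipy.signal
import librosa  # assumed here only for loading; any loader that preserves both channels works

GLOBAL_SR = 44100          # 44.1 kHz sampling rate
GLOBAL_WIN = 2**9          # 512-sample Kaiser window
GLOBAL_HOP = 2**7          # 128-sample hop (the updated helpers move this to 2**6)
AUDIO_SAMPLE_LENGTH = 0.5  # 500 ms per kick drum

win = scipy.signal.windows.kaiser(GLOBAL_WIN, beta=14)
STFT = scipy.signal.ShortTimeFFT(win=win, hop=GLOBAL_HOP, fs=GLOBAL_SR, scale_to="magnitude")


def encode_kick(path):
    # Load audio at 44.1 kHz, keeping both channels (assumes a stereo file: shape 2 x samples)
    y, _ = librosa.load(path, sr=GLOBAL_SR, mono=False)
    target = int(AUDIO_SAMPLE_LENGTH * GLOBAL_SR)
    # Trim or zero-pad each channel to exactly 500 ms
    y = y[:, :target] if y.shape[1] >= target else np.pad(y, ((0, 0), (0, target - y.shape[1])))
    # One magnitude spectrogram per channel, transposed to (frames, 257 frequency bins)
    mags = [np.abs(STFT.stft(channel)).T for channel in y]
    return np.array(mags)  # shape: (2, frames, 257)
```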

While amplitude data (the output of the Fourier transform) is important, this data is by nature skewed towards lower frequencies, which contain more intensity. To remove this effect, a feature extraction process equalizes the representation of frequencies in the data. The tensor of amplitude data is scaled to be between 0 and 100 and then passed through a noise threshold where all values under 10e-10 are set to zero. This normalized, noise-gated amplitude information is then converted into a logarithmic, decibel scale, which represents audio as loudness, a more uniform measure across the entire frequency spectrum. This data is finally scaled to be between -1 and 1, matching the output the model creates with its hyperbolic tangent activation function.
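
A sketch of this magnitude-to-loudness normalization, assuming a simple min-max rescale as a stand-in for the repository's `scale_data_to_range` helper:

```python
import numpy as np


def scale_to_range(data, new_min, new_max):
    # Min-max rescale; a stand-in for the repo's scale_data_to_range helper
    old_min, old_max = data.min(), data.max()
    return (data - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def magnitudes_to_normalized_loudness(magnitudes, threshold=10e-10):
    mags = scale_to_range(magnitudes, 0, 100)  # equalize overall level between samples
    mags[np.abs(mags) < threshold] = 0         # noise gate: zero out near-silent bins
    loudness = 20 * np.log10(mags + 1e-12)     # decibel-style loudness; epsilon keeps log defined at gated zeros
    return scale_to_range(loudness, -1, 1)     # match the generator's tanh output range
```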

[show amp data and loudness spectrograms]
![Magnitude information of a Kick Drum](static/magnitude.png)
![Loudness information of the same Kick Drum](static/loudness.png)
Note that both example graphs show the same audio information; the magnitude plot appears mostly empty because most magnitude values map to the same null color.

Generated audio representations are a tensor of the same shape with values between -1 and 1. This data is scaled to be between -120 and 40, then passed into an exponential function converting the data back to "amplitudes", and finally noise-gated. This amplitude information is then passed into a Griffin-Lim phase reconstruction algorithm[3] and converted to playable audio.
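
A sketch of this decoding direction, assuming `librosa.griffinlim` for the phase reconstruction step (the repository wraps its own Griffin-Lim based iSTFT helper):

```python
import numpy as np
import librosa  # assumed here for its Griffin-Lim implementation


def normalized_db_to_audio(channel_loudness, n_fft=512, hop=128):
    # channel_loudness: (frames, freq_bins) generator output in [-1, 1]
    db = channel_loudness * 80.0 - 40.0  # map [-1, 1] -> [-120, 40] dB
    mags = 10.0 ** (db / 20.0)           # exponential: dB back to linear magnitudes
    mags[mags < 10e-10] = 0.0            # noise gate very low-level energy
    # Griffin-Lim phase reconstruction expects a (freq_bins, frames) magnitude spectrogram
    return librosa.griffinlim(
        mags.T, n_iter=32, n_fft=n_fft,
        hop_length=hop, win_length=n_fft, window=("kaiser", 14),
    )
```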

## Implementation

The model itself is a standard DCGAN[1] with two slight modifications: upsampling and spectral normalization. The Generator takes in 100 latent dimensions and passes them into 9 convolution transpose blocks, each consisting of a convolution transpose layer, a batch normalization layer, and a ReLU activation. After convolving, the Generator upsamples the output from a two-channel 256 by 256 output to a two-channel output of frames by frequency bins and applies a hyperbolic tangent activation function. The Discriminator upscales audio from frames by frequency bins to 256 by 256, then passes it through 9 convolution blocks, each consisting of a convolution layer with spectral normalization to prevent model collapse, a batch normalization layer, and a Leaky ReLU activation. After convolution, the probability of an audio clip being real is returned using a sigmoid activation.
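
A reduced, runnable sketch of the Generator pattern described here (four blocks instead of nine, with compressed channel counts; the full definition lives in `src/architecture.py`):

```python
import torch
import torch.nn as nn

LATENT_DIM, N_CHANNELS = 100, 2
N_FRAMES, N_FREQ_BINS = 176, 257


class GeneratorSketch(nn.Module):
    """Reduced sketch: fewer blocks than the paper's nine, but the same overall pattern."""

    def __init__(self):
        super().__init__()
        self.deconv_blocks = nn.Sequential(
            # Each block: transposed convolution -> batch norm -> ReLU
            nn.ConvTranspose2d(LATENT_DIM, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 16, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, N_CHANNELS, kernel_size=4, stride=2, padding=1),
            # Resize the spatial output to (frames, frequency bins), then squash to [-1, 1]
            nn.Upsample(size=(N_FRAMES, N_FREQ_BINS), mode="bilinear", align_corners=False),
            nn.Tanh(),
        )

    def forward(self, z):
        # z: (batch, LATENT_DIM, 1, 1) -> (batch, 2, N_FRAMES, N_FREQ_BINS)
        return self.deconv_blocks(z)


# Usage sketch:
# fake = GeneratorSketch()(torch.randn(16, LATENT_DIM, 1, 1))
```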

This work uses 80% of the dataset as training data and 20% as validation, with all data split into batches of 16. The Generator and Discriminator use Binary Cross Entropy with Logits loss functions and Adam optimizers. Generator loss is also modified to encourage a decaying sound [explain how] and penalize periodic noise patterns [explain how]. Due to hardware limitations, the model is trained over ten epochs. Validation occurs every 5 epochs, and label smoothing is applied to prevent overconfidence.
This work uses 80% of the dataset as training data and 20% as validation, with all data split into batches of 16. The Generator and Discriminator use Binary Cross Entropy with Logits loss functions and Adam optimizers. Generator loss is also modified to encourage a decaying sound [explain how]. Due to hardware limitations, the model is trained over ten epochs, with validation every 5 epochs. Overconfidence is prevented using label smoothing.
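
A sketch of the generator objective under these settings; the label-smoothing form and the decay penalty shown here are illustrative stand-ins, not the exact implementations in `src/train.py`:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()


def smooth_labels(labels, amount=0.1):
    # Illustrative one-sided label smoothing: keep "real" targets a little below 1.0
    return labels - amount * torch.rand_like(labels)


def decay_penalty(fake_audio, weight=0.1):
    # Illustrative decay penalty (the repo's exact form lives in calculate_decay_penalty):
    # penalize frames whose mean level rises over time, nudging outputs toward a decaying envelope
    frame_level = fake_audio.mean(dim=(1, 3))  # (batch, frames), assuming (batch, 2, frames, bins)
    rises = torch.relu(frame_level[:, 1:] - frame_level[:, :-1])
    return weight * rises.mean()


# Generator step (sketch): adversarial loss plus the decay penalty
# fake = generator(z)
# real_labels = smooth_labels(torch.ones(batch_size, 1, device=device))
# g_loss = criterion(discriminator(fake), real_labels) + decay_penalty(fake)
# g_loss.backward(); optimizer_G.step()
```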

![Average Data Point](static/average-data-point.png)
![Average Kick Drum](static/average-kick-drum.png)

## Results

@@ -56,8 +58,6 @@ In nearly every training loop, generator and discriminator loss tends to
When analyzing generated audio, it is apparent that the model is creating some periodic noise pattern with some sort of sound in the middle of the frequency spectrum. Each generated output also appears to contain little to no difference from the others.
![Output spectrogram](static/model-output.png)

intuition/reasoning idea for why it doesn't work (maybe?)

- decaying shape exists but details of the shape vary; some samples decay longer, some decay snappier
- subtle complexities stop the GAN from perfect replication (one sample with a super long decay makes it question all the other short-decay samples?? figure out if this is actually the case)
- the discriminator could be focusing on
@@ -73,7 +73,7 @@ As a result, a discriminator could be fooled into believing random periodic noise

audio waveforms are very periodic, need to do something so it doesn't learn to just generate fake lines

proposed model collapse fix only makes it worse
compare with WaveGAN??? probably too much work; frame it as a limitation

for it to work, need to optimize for this kind of data; can't just use image generation approaches

@@ -87,7 +87,7 @@ talk about sine validation, also how even halving the data to only the middle frequencies still

### Model Shortcomings

### iSTFT Shortcomings
### STFT and iSTFT Losses

### Contributions

Binary file removed paper/static/average-data-point.png
Binary file removed paper/static/average-data-point.wav
Binary file added paper/static/average-kick-drum.png
Binary file added paper/static/loudness.png
Binary file added paper/static/magnitude.png
16 changes: 6 additions & 10 deletions src/architecture.py
@@ -16,6 +16,7 @@
class Generator(nn.Module):
def __init__(self):
super(Generator, self).__init__()

self.deconv_blocks = nn.Sequential(
nn.ConvTranspose2d(LATENT_DIM, 256, kernel_size=4, stride=2, padding=1),
nn.BatchNorm2d(256),
@@ -39,22 +40,19 @@ def __init__(self):
nn.BatchNorm2d(4),
nn.ReLU(inplace=True),
nn.ConvTranspose2d(4, N_CHANNELS, kernel_size=4, stride=2, padding=1),
nn.Upsample(
size=(N_FRAMES, N_FREQ_BINS), mode="bilinear", align_corners=False
),
nn.Tanh(),
)
) # Output: 2, 256, 256
self.tanh = nn.Tanh()

def forward(self, z):
x = self.deconv_blocks(z)
return x

return self.tanh(x)


class Discriminator(nn.Module):
def __init__(self):
super(Discriminator, self).__init__()
self.conv_blocks = nn.Sequential(
nn.Upsample(size=(256, 256), mode="bilinear", align_corners=False),
self.conv_blocks = nn.Sequential( # input: 2, 256, 256
spectral_norm(nn.Conv2d(N_CHANNELS, 4, kernel_size=4, stride=2, padding=1)),
nn.LeakyReLU(0.2, inplace=True),
spectral_norm(nn.Conv2d(4, 8, kernel_size=4, stride=2, padding=1)),
@@ -82,6 +80,4 @@ def __init__(self):

def forward(self, x):
x = self.conv_blocks(x)
x = torch.squeeze(x)
x = torch.unsqueeze(x, 1)
return x
6 changes: 6 additions & 0 deletions src/dcgan.py
@@ -18,6 +18,7 @@
# Constants
LR_G = 0.002
LR_D = 0.001
LR_DECAY = 0.97

# Load data
audio_data = load_npy_data(compiled_data_path)
@@ -36,6 +37,9 @@
criterion = nn.BCEWithLogitsLoss()
optimizer_G = optim.Adam(generator.parameters(), lr=LR_G, betas=(0.5, 0.999)) # type: ignore
optimizer_D = optim.Adam(discriminator.parameters(), lr=LR_D, betas=(0.5, 0.999)) # type: ignore
scheduler_G = optim.lr_scheduler.ExponentialLR(optimizer_G, gamma=LR_DECAY)
scheduler_D = optim.lr_scheduler.ExponentialLR(optimizer_D, gamma=LR_DECAY)


device = get_device()
generator.to(device)
@@ -50,5 +54,7 @@
criterion,
optimizer_G,
optimizer_D,
scheduler_G,
scheduler_D,
device,
)
37 changes: 10 additions & 27 deletions src/train.py
@@ -10,7 +10,7 @@
scale_data_to_range,
)


# Constants
N_EPOCHS = 10
VALIDATION_INTERVAL = int(N_EPOCHS / 2)
SAVE_INTERVAL = int(N_EPOCHS / 1)
Expand All @@ -27,25 +27,6 @@ def calculate_decay_penalty(audio_data):
return decay_penalty


def calculate_periodicity_penalty(audio_data):
current_batch_size = audio_data.shape[0]
reshaped_audio = audio_data.reshape(current_batch_size, -1, N_FRAMES)
autocorr = torch.tensor([]).to(audio_data.device)
for i in range(current_batch_size):
sample = reshaped_audio[i]
sample_autocorr = F.conv1d(
sample.unsqueeze(0),
sample.flip(-1).unsqueeze(0),
padding=sample.shape[-1] - 1,
)
autocorr = torch.cat((autocorr, sample_autocorr), dim=0)

autocorr = autocorr / autocorr.max(dim=2, keepdim=True)[0]
periodicity_penalty = autocorr[:, :, 50:].mean()

return periodicity_penalty


# Training
def train_epoch(
generator,
@@ -54,10 +35,11 @@
criterion,
optimizer_G,
optimizer_D,
scheduler_G,
scheduler_D,
device,
):
decay_penalty_weight = 0.1
periodicity_penalty_weight = 0.1

generator.train()
discriminator.train()
@@ -79,17 +61,13 @@ def smooth_labels(tensor, amount=0.1):
fake_audio_data = generator(z)
g_adv_loss = criterion(discriminator(fake_audio_data), real_labels)
decay_penalty = calculate_decay_penalty(fake_audio_data)
# periodicity_penalty = calculate_periodicity_penalty(fake_audio_data)

# Combine losses
g_loss = (
g_adv_loss
+ decay_penalty * decay_penalty_weight
# + periodicity_penalty * periodicity_penalty_weight # ignore for now
)
g_loss = g_adv_loss + decay_penalty * decay_penalty_weight

g_loss.backward()
optimizer_G.step()
scheduler_G.step()
total_g_loss += g_loss.item()

# Train discriminator
@@ -100,6 +78,7 @@ def smooth_labels(tensor, amount=0.1):
d_loss = (real_loss + fake_loss) / 2
d_loss.backward()
optimizer_D.step()
scheduler_D.step()
total_d_loss += d_loss.item()

return total_g_loss / len(dataloader), total_d_loss / len(dataloader)
@@ -139,6 +118,8 @@ def training_loop(
criterion,
optimizer_G,
optimizer_D,
scheduler_G,
scheduler_D,
device,
):
for epoch in range(N_EPOCHS):
@@ -149,6 +130,8 @@
criterion,
optimizer_G,
optimizer_D,
scheduler_G,
scheduler_D,
device,
)

2 changes: 1 addition & 1 deletion src/utils/audio_processing_validation.py
@@ -1,4 +1,4 @@
from utils.helpers import (
from helpers import (
normalized_db_to_wav,
encode_sample,
graph_spectrogram,
3 changes: 1 addition & 2 deletions src/utils/encode_audio_data.py
@@ -6,7 +6,6 @@
encode_sample_directory,
graph_spectrogram,
load_npy_data,
normalized_db_to_wav,
)

# Encode samples
@@ -18,4 +17,4 @@
print("Data " + str(real_data.shape))
print("Average " + str(average_data.shape))

graph_spectrogram(average_data, "Average Data Point")
graph_spectrogram(average_data, "Average Kick Drum")
36 changes: 32 additions & 4 deletions src/utils/helpers.py
@@ -18,12 +18,12 @@
AUDIO_SAMPLE_LENGTH = 0.5 # 500 ms
GLOBAL_SR = 44100
N_CHANNELS = 2 # Left, right
N_FRAMES = 176
N_FRAMES = 352
N_FREQ_BINS = 257

# Initialize STFT Object
GLOBAL_WIN = 2**9
GLOBAL_HOP = 2**7
GLOBAL_HOP = 2**6
win = scipy.signal.windows.kaiser(GLOBAL_WIN, beta=14)
STFT = scipy.signal.ShortTimeFFT(
win=win, hop=GLOBAL_HOP, fs=GLOBAL_SR, scale_to="magnitude"
@@ -111,12 +111,36 @@ def normalize_sample_length(audio_file_path):
y = y[:, : int(target_length * sr)]
else:
padding = int((target_length - actual_length) * sr)
y = np.pad(y, ((0, 0), (0, padding)), mode="constant")
y = np.pad(y, ((0, 0), (0, padding)), mode="linear_ramp")

return y


def noise_thresh(data, threshold=10e-10):
def resize_spectrogram(channel_spectrogram):
# Remove topmost frequency bin, flatten frames to 256
channel_spectrogram = channel_spectrogram[:, :-1]
channel_spectrogram = channel_spectrogram[:256, :]

frames_to_fade = 50
fade_out_weights = np.linspace(1, 0, frames_to_fade) # linear
channel_spectrogram[256 - frames_to_fade :, :] *= fade_out_weights[:, np.newaxis]

return channel_spectrogram


def resize_generated_audio(channel_loudness):
# Add topmost frequency bin, add blank frames until N_FRAMES
new_freq_bin = np.zeros((channel_loudness.shape[0], 1))
channel_loudness = np.hstack((new_freq_bin, channel_loudness))

num_frames_to_add = N_FRAMES - channel_loudness.shape[0]
empty_frames = np.zeros((num_frames_to_add, channel_loudness.shape[1]))
channel_loudness = np.vstack((channel_loudness, empty_frames))

return channel_loudness


def noise_thresh(data, threshold=10e-12):
data[np.abs(data) < threshold] = 0
return data

@@ -166,11 +190,13 @@ def graph_spectrogram(audio_data, sample_name, graphScale=10):
# Encoding audio
def extract_sample_magnitudes(audio_data):
sample_as_magnitudes = []

for channel in audio_data:
channel_mean = np.mean(channel)
channel -= channel_mean
stft = STFT.stft(channel)
magnitudes = np.abs(stft).T
magnitudes = resize_spectrogram(magnitudes)
sample_as_magnitudes.append(magnitudes)

sample_as_magnitudes = np.array(sample_as_magnitudes)
@@ -251,8 +277,10 @@ def normalized_db_to_wav(loudness_data, name):
channel_db_loudnes = scale_data_to_range(channel_loudness, -120, 40)
audio_channel_loudness_info.append(channel_db_loudnes)

channel_loudness = resize_generated_audio(channel_loudness)
channel_magnitudes = scale_normalized_db_to_magnitudes(channel_loudness)
audio_signal = istft_with_griffin_lim_reconstruction(channel_magnitudes)

audio_reconstruction.append(audio_signal)

graph_spectrogram(audio_channel_loudness_info, "Generated Audio Loudness (db)", 10)
