
Commit

optim model
shuklabhay committed Aug 22, 2024
1 parent 65afeb4 commit 00365c5
Showing 13 changed files with 67 additions and 55 deletions.
Binary file modified model/DCGAN_final_model.pth
22 changes: 11 additions & 11 deletions paper/main.md
@@ -1,6 +1,4 @@
# Kick it Out: Two Channel Kick Drum Generation With a Deep Convolution Generative Architecture.

use attention, use special loss calcs, if I get this working then BOOM make it about kick generation using super optimized deep conv
# Kick it Out: Multi-Channel Kick Drum Generation With a Deep Convolution Generative Architecture.

Abhay Shukla\
[email protected]\
@@ -31,21 +29,25 @@ Training data is first sourced from digital production “sample packs” compil

### Feature Extraction/Encoding

The training data used is a compilation of 7856 audio samples. A simple DCGAN cannot learn the time-series component of audio, so this feature extraction process must flatten the time-series component into a static form of data. This is achieved by representing audio in the time-frequency domain. Each sample is first converted into a raw audio array using a standard 44100 Hz sampling rate while preserving the two-channel characteristic of the data. The audio sample is then normalized to a length of 500 milliseconds and passed into a Short-time Fourier Transform with a window of 512 and hop size of 128, returning a representation of a kick drum as an array of amplitudes across 2 channels, 176 frames of audio, and 257 frequency bins. The parameters for the Short-time Fourier Transform are partially determined by hardware constraints.
The training data used is a compilation of 7856 audio samples. A simple DCGAN cannot learn the time-series component of audio, so this feature extraction process must flatten the time-series component into a static form of data. This is achieved by representing audio in the time-frequency domain. Each sample is first converted into a raw audio array using a standard 44100 Hz sampling rate while preserving the two-channel characteristic of the data. The audio sample is then normalized to a length of 500 milliseconds and passed into a Short-time Fourier Transform with a Kaiser (β = 14) window of length 512 and hop size of 128, returning a representation of a kick drum as an array of amplitudes across 2 channels, 176 frames of audio, and 257 frequency bins. The parameters for the Short-time Fourier Transform are partially determined by hardware constraints.

[TODO: discuss the FFT parameters specifically: the window size, and using a larger number of frames then cutting them down so the retained information is more detailed without the useless content; also discuss doing the opposite when generating the audio file back.]
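
A minimal sketch of this encoding step, mirroring the constants in `src/utils/helpers.py`; `librosa` is assumed here only for file loading, and the exact frame count depends on padding:

```python
import numpy as np
import scipy.signal
import librosa  # assumed here only for loading; any loader that preserves both channels works

GLOBAL_SR = 44100          # 44.1 kHz sampling rate
GLOBAL_WIN = 2**9          # 512-sample Kaiser window
GLOBAL_HOP = 2**7          # 128-sample hop (the updated helpers move this to 2**6)
AUDIO_SAMPLE_LENGTH = 0.5  # 500 ms per kick drum

win = scipy.signal.windows.kaiser(GLOBAL_WIN, beta=14)
STFT = scipy.signal.ShortTimeFFT(win=win, hop=GLOBAL_HOP, fs=GLOBAL_SR, scale_to="magnitude")


def encode_kick(path):
    # Load audio at 44.1 kHz, keeping both channels (assumes a stereo file: shape 2 x samples)
    y, _ = librosa.load(path, sr=GLOBAL_SR, mono=False)
    target = int(AUDIO_SAMPLE_LENGTH * GLOBAL_SR)
    # Trim or zero-pad each channel to exactly 500 ms
    y = y[:, :target] if y.shape[1] >= target else np.pad(y, ((0, 0), (0, target - y.shape[1])))
    # One magnitude spectrogram per channel, transposed to (frames, 257 frequency bins)
    mags = [np.abs(STFT.stft(channel)).T for channel in y]
    return np.array(mags)  # shape: (2, frames, 257)
```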

While amplitude data (the output of the Fourier transform) is important, this data is by nature skewed towards lower frequencies, which contain more intensity. To remove this effect, a feature extraction process equalizes the representation of frequencies in the data. The tensor of amplitude data is scaled to be between 0 and 100 and then passed through a noise threshold where all values under 10e-10 are set to zero. This normalized, noise-gated amplitude information is then converted into a logarithmic, decibel scale, which represents audio as loudness, a more uniform measure across the entire frequency spectrum. This data is finally scaled to be between -1 and 1, matching the output the model creates with its hyperbolic tangent activation function.
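
A sketch of this magnitude-to-loudness normalization, assuming a simple min-max rescale as a stand-in for the repository's `scale_data_to_range` helper:

```python
import numpy as np


def scale_to_range(data, new_min, new_max):
    # Min-max rescale; a stand-in for the repo's scale_data_to_range helper
    old_min, old_max = data.min(), data.max()
    return (data - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def magnitudes_to_normalized_loudness(magnitudes, threshold=10e-10):
    mags = scale_to_range(magnitudes, 0, 100)  # equalize overall level between samples
    mags[np.abs(mags) < threshold] = 0         # noise gate: zero out near-silent bins
    loudness = 20 * np.log10(mags + 1e-12)     # decibel-style loudness; epsilon keeps log defined at gated zeros
    return scale_to_range(loudness, -1, 1)     # match the generator's tanh output range
```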

[show amp data and loudness spectrograms]
![Magnitude information of a Kick Drum](static/magnitude.png)
![Loudness information of the same Kick Drum](static/loudness.png)
Note that both example graphs show the same audio information; the magnitude plot appears mostly empty because most magnitude values map to the same null color.

Generated audio representations are a tensor of the same shape with values between -1 and 1. This data is scaled to be between -120 and 40, then passed into an exponential function converting the data back to "amplitudes", and finally noise-gated. This amplitude information is then passed into a Griffin-Lim phase reconstruction algorithm[3] and converted to playable audio.
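
A sketch of this decoding direction, assuming `librosa.griffinlim` for the phase reconstruction step (the repository wraps its own Griffin-Lim based iSTFT helper):

```python
import numpy as np
import librosa  # assumed here for its Griffin-Lim implementation


def normalized_db_to_audio(channel_loudness, n_fft=512, hop=128):
    # channel_loudness: (frames, freq_bins) generator output in [-1, 1]
    db = channel_loudness * 80.0 - 40.0  # map [-1, 1] -> [-120, 40] dB
    mags = 10.0 ** (db / 20.0)           # exponential: dB back to linear magnitudes
    mags[mags < 10e-10] = 0.0            # noise gate very low-level energy
    # Griffin-Lim phase reconstruction expects a (freq_bins, frames) magnitude spectrogram
    return librosa.griffinlim(
        mags.T, n_iter=32, n_fft=n_fft,
        hop_length=hop, win_length=n_fft, window=("kaiser", 14),
    )
```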

## Implementation

The model itself is a standard DCGAN[1] with two slight modifications: upsampling and spectral normalization. The Generator takes in 100 latent dimensions and passes them into 9 convolution transpose blocks, each consisting of a convolution transpose layer, a batch normalization layer, and a ReLU activation. After convolving, the Generator upsamples the output from a two-channel 256 by 256 output to a two-channel output of frames by frequency bins and applies a hyperbolic tangent activation function. The Discriminator upscales audio from frames by frequency bins to 256 by 256, then passes it through 9 convolution blocks, each consisting of a convolution layer with spectral normalization to prevent model collapse, a batch normalization layer, and a Leaky ReLU activation. After convolution, the probability of an audio clip being real is returned using a sigmoid activation.
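
A reduced, runnable sketch of the Generator pattern described here (four blocks instead of nine, with compressed channel counts; the full definition lives in `src/architecture.py`):

```python
import torch
import torch.nn as nn

LATENT_DIM, N_CHANNELS = 100, 2
N_FRAMES, N_FREQ_BINS = 176, 257


class GeneratorSketch(nn.Module):
    """Reduced sketch: fewer blocks than the paper's nine, but the same overall pattern."""

    def __init__(self):
        super().__init__()
        self.deconv_blocks = nn.Sequential(
            # Each block: transposed convolution -> batch norm -> ReLU
            nn.ConvTranspose2d(LATENT_DIM, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 16, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, N_CHANNELS, kernel_size=4, stride=2, padding=1),
            # Resize the spatial output to (frames, frequency bins), then squash to [-1, 1]
            nn.Upsample(size=(N_FRAMES, N_FREQ_BINS), mode="bilinear", align_corners=False),
            nn.Tanh(),
        )

    def forward(self, z):
        # z: (batch, LATENT_DIM, 1, 1) -> (batch, 2, N_FRAMES, N_FREQ_BINS)
        return self.deconv_blocks(z)


# Usage sketch:
# fake = GeneratorSketch()(torch.randn(16, LATENT_DIM, 1, 1))
```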

This work uses 80% of the dataset as training data and 20% as validation, with all data split into batches of 16. The Generator and Discriminator use Binary Cross Entropy with Logits loss functions and Adam optimizers. Generator loss is also modified to encourage a decaying sound [explain how] and penalize periodic noise patterns [explain how]. Due to hardware limitations, the model is trained over ten epochs. Validation occurs every 5 epochs, and label smoothing is applied to prevent overconfidence.
This work uses 80% of the dataset as training data and 20% as validation, with all data split into batches of 16. The Generator and Discriminator use Binary Cross Entropy with Logits loss functions and Adam optimizers. Generator loss is also modified to encourage a decaying sound [explain how]. Due to hardware limitations, the model is trained over ten epochs, with validation every 5 epochs. Overconfidence is prevented using label smoothing.
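
A sketch of the generator objective under these settings; the label-smoothing form and the decay penalty shown here are illustrative stand-ins, not the exact implementations in `src/train.py`:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()


def smooth_labels(labels, amount=0.1):
    # Illustrative one-sided label smoothing: keep "real" targets a little below 1.0
    return labels - amount * torch.rand_like(labels)


def decay_penalty(fake_audio, weight=0.1):
    # Illustrative decay penalty (the repo's exact form lives in calculate_decay_penalty):
    # penalize frames whose mean level rises over time, nudging outputs toward a decaying envelope
    frame_level = fake_audio.mean(dim=(1, 3))  # (batch, frames), assuming (batch, 2, frames, bins)
    rises = torch.relu(frame_level[:, 1:] - frame_level[:, :-1])
    return weight * rises.mean()


# Generator step (sketch): adversarial loss plus the decay penalty
# fake = generator(z)
# real_labels = smooth_labels(torch.ones(batch_size, 1, device=device))
# g_loss = criterion(discriminator(fake), real_labels) + decay_penalty(fake)
# g_loss.backward(); optimizer_G.step()
```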

![Average Data Point](static/average-data-point.png)
![Average Kick Drum](static/average-kick-drum.png)

## Results

@@ -56,8 +58,6 @@ In nearly every training loop, generator and discriminator loss tends to
When analyzing generated audio, it is apparent that the model is creating some periodic noise pattern with some sort of sound in the middle of the frequency spectrum. Each generated output also appears to contain little to no difference from the others.
![Output spectrogram](static/model-output.png)

intuition/reasoning idea for why it doesn't work (maybe?)

- decaying shape exists but details of the shape vary; some samples decay longer, some decay snappier
- subtle complexities stop the GAN from perfect replication (one sample with a super long decay makes it question all the other short-decay samples?? figure out if this is actually the case)
- the discriminator could be focusing on
@@ -73,7 +73,7 @@ As a result, a discriminator could be fooled into believing random periodic noise

audio waveforms are very periodic, need to do something so it doesn't learn to just generate fake lines

proposed model collapse fix only makes it worse
compare with WaveGAN??? probably too much work; frame it as a limitation

for it to work, need to optimize for this kind of data; can't just use image generation approaches

@@ -87,7 +87,7 @@ talk about sine validation, also how even halving the data to only the middle frequencies still

### Model Shortcomings

### iSTFT Shortcomings
### STFT and iSTFT Losses

### Contributions

Binary file removed paper/static/average-data-point.png
Binary file removed paper/static/average-data-point.wav
Binary file added paper/static/average-kick-drum.png
Binary file added paper/static/loudness.png
Binary file added paper/static/magnitude.png
16 changes: 6 additions & 10 deletions src/architecture.py
@@ -16,6 +16,7 @@
class Generator(nn.Module):
def __init__(self):
super(Generator, self).__init__()

self.deconv_blocks = nn.Sequential(
nn.ConvTranspose2d(LATENT_DIM, 256, kernel_size=4, stride=2, padding=1),
nn.BatchNorm2d(256),
@@ -39,22 +40,19 @@ def __init__(self):
nn.BatchNorm2d(4),
nn.ReLU(inplace=True),
nn.ConvTranspose2d(4, N_CHANNELS, kernel_size=4, stride=2, padding=1),
nn.Upsample(
size=(N_FRAMES, N_FREQ_BINS), mode="bilinear", align_corners=False
),
nn.Tanh(),
)
) # Output: 2, 256, 256
self.tanh = nn.Tanh()

def forward(self, z):
x = self.deconv_blocks(z)
return x

return self.tanh(x)


class Discriminator(nn.Module):
def __init__(self):
super(Discriminator, self).__init__()
self.conv_blocks = nn.Sequential(
nn.Upsample(size=(256, 256), mode="bilinear", align_corners=False),
self.conv_blocks = nn.Sequential( # input: 2, 256, 256
spectral_norm(nn.Conv2d(N_CHANNELS, 4, kernel_size=4, stride=2, padding=1)),
nn.LeakyReLU(0.2, inplace=True),
spectral_norm(nn.Conv2d(4, 8, kernel_size=4, stride=2, padding=1)),
@@ -82,6 +80,4 @@ def __init__(self):

def forward(self, x):
x = self.conv_blocks(x)
x = torch.squeeze(x)
x = torch.unsqueeze(x, 1)
return x
6 changes: 6 additions & 0 deletions src/dcgan.py
@@ -18,6 +18,7 @@
# Constants
LR_G = 0.002
LR_D = 0.001
LR_DECAY = 0.97

# Load data
audio_data = load_npy_data(compiled_data_path)
@@ -36,6 +37,9 @@
criterion = nn.BCEWithLogitsLoss()
optimizer_G = optim.Adam(generator.parameters(), lr=LR_G, betas=(0.5, 0.999)) # type: ignore
optimizer_D = optim.Adam(discriminator.parameters(), lr=LR_D, betas=(0.5, 0.999)) # type: ignore
scheduler_G = optim.lr_scheduler.ExponentialLR(optimizer_G, gamma=LR_DECAY)
scheduler_D = optim.lr_scheduler.ExponentialLR(optimizer_D, gamma=LR_DECAY)


device = get_device()
generator.to(device)
@@ -50,5 +54,7 @@
criterion,
optimizer_G,
optimizer_D,
scheduler_G,
scheduler_D,
device,
)
37 changes: 10 additions & 27 deletions src/train.py
@@ -10,7 +10,7 @@
scale_data_to_range,
)


# Constants
N_EPOCHS = 10
VALIDATION_INTERVAL = int(N_EPOCHS / 2)
SAVE_INTERVAL = int(N_EPOCHS / 1)
Expand All @@ -27,25 +27,6 @@ def calculate_decay_penalty(audio_data):
return decay_penalty


def calculate_periodicity_penalty(audio_data):
current_batch_size = audio_data.shape[0]
reshaped_audio = audio_data.reshape(current_batch_size, -1, N_FRAMES)
autocorr = torch.tensor([]).to(audio_data.device)
for i in range(current_batch_size):
sample = reshaped_audio[i]
sample_autocorr = F.conv1d(
sample.unsqueeze(0),
sample.flip(-1).unsqueeze(0),
padding=sample.shape[-1] - 1,
)
autocorr = torch.cat((autocorr, sample_autocorr), dim=0)

autocorr = autocorr / autocorr.max(dim=2, keepdim=True)[0]
periodicity_penalty = autocorr[:, :, 50:].mean()

return periodicity_penalty


# Training
def train_epoch(
generator,
@@ -54,10 +35,11 @@
criterion,
optimizer_G,
optimizer_D,
scheduler_G,
scheduler_D,
device,
):
decay_penalty_weight = 0.1
periodicity_penalty_weight = 0.1

generator.train()
discriminator.train()
@@ -79,17 +61,13 @@ def smooth_labels(tensor, amount=0.1):
fake_audio_data = generator(z)
g_adv_loss = criterion(discriminator(fake_audio_data), real_labels)
decay_penalty = calculate_decay_penalty(fake_audio_data)
# periodicity_penalty = calculate_periodicity_penalty(fake_audio_data)

# Combine losses
g_loss = (
g_adv_loss
+ decay_penalty * decay_penalty_weight
# + periodicity_penalty * periodicity_penalty_weight # ignore for now
)
g_loss = g_adv_loss + decay_penalty * decay_penalty_weight

g_loss.backward()
optimizer_G.step()
scheduler_G.step()
total_g_loss += g_loss.item()

# Train discriminator
@@ -100,6 +78,7 @@ def smooth_labels(tensor, amount=0.1):
d_loss = (real_loss + fake_loss) / 2
d_loss.backward()
optimizer_D.step()
scheduler_D.step()
total_d_loss += d_loss.item()

return total_g_loss / len(dataloader), total_d_loss / len(dataloader)
@@ -139,6 +118,8 @@ def training_loop(
criterion,
optimizer_G,
optimizer_D,
scheduler_G,
scheduler_D,
device,
):
for epoch in range(N_EPOCHS):
@@ -149,6 +130,8 @@
criterion,
optimizer_G,
optimizer_D,
scheduler_G,
scheduler_D,
device,
)

2 changes: 1 addition & 1 deletion src/utils/audio_processing_validation.py
@@ -1,4 +1,4 @@
from utils.helpers import (
from helpers import (
normalized_db_to_wav,
encode_sample,
graph_spectrogram,
3 changes: 1 addition & 2 deletions src/utils/encode_audio_data.py
@@ -6,7 +6,6 @@
encode_sample_directory,
graph_spectrogram,
load_npy_data,
normalized_db_to_wav,
)

# Encode samples
@@ -18,4 +17,4 @@
print("Data " + str(real_data.shape))
print("Average " + str(average_data.shape))

graph_spectrogram(average_data, "Average Data Point")
graph_spectrogram(average_data, "Average Kick Drum")
36 changes: 32 additions & 4 deletions src/utils/helpers.py
@@ -18,12 +18,12 @@
AUDIO_SAMPLE_LENGTH = 0.5 # 500 ms
GLOBAL_SR = 44100
N_CHANNELS = 2 # Left, right
N_FRAMES = 176
N_FRAMES = 352
N_FREQ_BINS = 257

# Initialize STFT Object
GLOBAL_WIN = 2**9
GLOBAL_HOP = 2**7
GLOBAL_HOP = 2**6
win = scipy.signal.windows.kaiser(GLOBAL_WIN, beta=14)
STFT = scipy.signal.ShortTimeFFT(
win=win, hop=GLOBAL_HOP, fs=GLOBAL_SR, scale_to="magnitude"
@@ -111,12 +111,36 @@ def normalize_sample_length(audio_file_path):
y = y[:, : int(target_length * sr)]
else:
padding = int((target_length - actual_length) * sr)
y = np.pad(y, ((0, 0), (0, padding)), mode="constant")
y = np.pad(y, ((0, 0), (0, padding)), mode="linear_ramp")

return y


def noise_thresh(data, threshold=10e-10):
def resize_spectrogram(channel_spectrogram):
# Remove topmost frequency bin, flatten frames to 256
channel_spectrogram = channel_spectrogram[:, :-1]
channel_spectrogram = channel_spectrogram[:256, :]

frames_to_fade = 50
fade_out_weights = np.linspace(1, 0, frames_to_fade) # linear
channel_spectrogram[256 - frames_to_fade :, :] *= fade_out_weights[:, np.newaxis]

return channel_spectrogram


def resize_generated_audio(channel_loudness):
# Add topmost frequency bin, add blank frames until N_FRAMES
new_freq_bin = np.zeros((channel_loudness.shape[0], 1))
channel_loudness = np.hstack((new_freq_bin, channel_loudness))

num_frames_to_add = N_FRAMES - channel_loudness.shape[0]
empty_frames = np.zeros((num_frames_to_add, channel_loudness.shape[1]))
channel_loudness = np.vstack((channel_loudness, empty_frames))

return channel_loudness


def noise_thresh(data, threshold=10e-12):
data[np.abs(data) < threshold] = 0
return data

@@ -166,11 +190,13 @@ def graph_spectrogram(audio_data, sample_name, graphScale=10):
# Encoding audio
def extract_sample_magnitudes(audio_data):
sample_as_magnitudes = []

for channel in audio_data:
channel_mean = np.mean(channel)
channel -= channel_mean
stft = STFT.stft(channel)
magnitudes = np.abs(stft).T
magnitudes = resize_spectrogram(magnitudes)
sample_as_magnitudes.append(magnitudes)

sample_as_magnitudes = np.array(sample_as_magnitudes)
@@ -251,8 +277,10 @@ def normalized_db_to_wav(loudness_data, name):
channel_db_loudnes = scale_data_to_range(channel_loudness, -120, 40)
audio_channel_loudness_info.append(channel_db_loudnes)

channel_loudness = resize_generated_audio(channel_loudness)
channel_magnitudes = scale_normalized_db_to_magnitudes(channel_loudness)
audio_signal = istft_with_griffin_lim_reconstruction(channel_magnitudes)

audio_reconstruction.append(audio_signal)

graph_spectrogram(audio_channel_loudness_info, "Generated Audio Loudness (db)", 10)
