improvements #21

Merged
merged 46 commits into from
Jan 24, 2025
Commits
b976705
wording
shuklabhay Oct 6, 2024
9794e83
feature engineering
shuklabhay Oct 8, 2024
2931c95
istft stuff
shuklabhay Oct 9, 2024
7a1285a
model architecutre
shuklabhay Oct 9, 2024
197f963
fix stuff
shuklabhay Oct 9, 2024
8e3d993
training loop
shuklabhay Oct 12, 2024
8f6bc12
more outlining
shuklabhay Oct 15, 2024
aca7360
wording
shuklabhay Oct 16, 2024
da74f31
results conclusion
shuklabhay Oct 16, 2024
b51f1f7
update wording and stuff
shuklabhay Oct 18, 2024
418d49c
outline nitro
shuklabhay Oct 19, 2024
b498a41
checkkerboard reference
shuklabhay Oct 21, 2024
e6fc7b7
intro, other wording
shuklabhay Oct 21, 2024
b45e567
change org :(
shuklabhay Oct 21, 2024
946c4d6
organize headers
shuklabhay Oct 22, 2024
7b2a680
related works
shuklabhay Oct 22, 2024
1ee197c
wavenet related works
shuklabhay Oct 23, 2024
cdb0cb3
wavegan explaination
shuklabhay Oct 24, 2024
15304c0
update description
shuklabhay Oct 24, 2024
53997ba
update results and stuff
shuklabhay Oct 24, 2024
c93cb89
talk abt gan
shuklabhay Oct 25, 2024
6be57be
wording stuff
shuklabhay Oct 25, 2024
69d9f66
mel based spec representation
shuklabhay Oct 27, 2024
099209a
remove 1k artifact
shuklabhay Oct 27, 2024
e210fba
CURATED KICK MODEL
shuklabhay Oct 28, 2024
a9b00d9
massive cleanup
shuklabhay Oct 29, 2024
305850f
lint
shuklabhay Oct 29, 2024
a7df899
resolve helper imports
shuklabhay Oct 29, 2024
ab70108
implement resize conv
shuklabhay Oct 30, 2024
0e35f58
wording
shuklabhay Nov 5, 2024
5fe2d29
Snare model prereqs & Abstract
shuklabhay Nov 8, 2024
b5fc84f
snare model
shuklabhay Nov 8, 2024
ce80a85
dataset description
shuklabhay Nov 8, 2024
aa54d44
feature eng part 1
shuklabhay Nov 13, 2024
8aebdbd
lil note
shuklabhay Nov 14, 2024
224aff6
wording stuff
shuklabhay Nov 18, 2024
7974f30
remove old
shuklabhay Nov 18, 2024
87ee183
feature eng
shuklabhay Nov 18, 2024
0f091e4
visualize bool
shuklabhay Nov 22, 2024
4c393d6
cleanup and write stuff
shuklabhay Nov 22, 2024
37b90f0
architecture and stuff
shuklabhay Nov 22, 2024
654b1dd
update lint action
shuklabhay Nov 24, 2024
5f77126
wording
shuklabhay Nov 25, 2024
78a2c1a
spellig
shuklabhay Nov 25, 2024
9e719be
cleanup ish
shuklabhay Nov 25, 2024
94e3bd5
wording
shuklabhay Nov 25, 2024
istft stuff
shuklabhay committed Oct 9, 2024

commit 2931c953d0bdf14e5ced3a9f08f5379565d86183
8 changes: 5 additions & 3 deletions paper/paper.md
@@ -12,7 +12,7 @@ Existing convolutional aproaches to audio generation often are limited to produc

## 3. Data Manipulation

## 3.1 Datasets
### 3.1 Datasets

This paper utilizes three distinct datasets engineered to measure the model's resilience to variation in spectral content.

@@ -24,11 +24,13 @@ This paper utilizes three distinct data sets engineered to measure the model's r

These datasets provide a robust framework for determining how the model responds to different amounts of variation within the training data. Most audio is sourced from online "digital audio production sample packs," which compile sounds for a wide variety of genres and use cases.

## 3.2 Feature Engineering
### 3.2 Feature Engineering

To simplify the task at hand, this work represents audio as an image of frequency bins by time steps, with each pixel's intensity representing magnitude. Utilizing this spectrogram-like representation of audio eliminates the need for recurrent architectures. Each audio sample is first converted into a two-channel array using a standard 44,100 Hz sampling rate. If necessary, single-channel audio is duplicated. The audio sample is then normalized to a standard length and passed into a Short-time Fourier Transform (STFT).
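The channel and length handling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name `to_stereo_fixed_length` is hypothetical, the audio is assumed to already be sampled at 44,100 Hz, and zero-padding or truncating is only one plausible reading of "normalized to a standard length".

```python
import numpy as np

SAMPLE_RATE = 44100  # Hz, the sampling rate stated in the text


def to_stereo_fixed_length(audio, num_samples):
    """Return a (2, num_samples) float32 array from mono or stereo input.

    Hypothetical helper: pads with zeros or truncates to reach the target
    length, which is an assumption about how the length is standardized.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 1:
        # Single-channel audio is duplicated into two identical channels.
        audio = np.stack([audio, audio])
    elif audio.ndim != 2 or audio.shape[0] != 2:
        raise ValueError("expected shape (samples,) or (2, samples)")

    current = audio.shape[1]
    if current < num_samples:
        # Pad the end of each channel with silence.
        audio = np.pad(audio, ((0, 0), (0, num_samples - current)))
    else:
        audio = audio[:, :num_samples]
    return audio
```

For example, `to_stereo_fixed_length(mono_clip, int(0.5 * SAMPLE_RATE))` would yield a two-channel, half-second array; the half-second length here is purely illustrative.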

The STFT uses a window size and hop length determined by the audio sample length and the constant sample rate so that each resulting data point is 256 frequency bins by 256 time frames. When validating the processing with pure sine signals at random frequencies, audio information was best preserved by a Kaiser window with a beta value of 12. Next, to preserve higher-frequency information, the STFT's magnitude information is converted to a decibel scale and the loudness values are scaled down to the interval (-1, 1). Scaling to this interval further standardizes the training audio and matches the output of the Generator, which uses a hyperbolic tangent activation. Both channels of the input audio are processed separately and concatenated to create a two-channel data point, with each channel containing normalized loudness information for 256 frequency bins by 256 time steps.
The STFT uses a window size and hop length determined by the audio sample length and the constant sample rate so that each resulting data point is 256 frequency bins by 256 time frames. When validating the processing with pure sine signals at random frequencies, audio information was best preserved by a Kaiser window with a beta value of 12. Next, to preserve higher-frequency information, the STFT's magnitude information is converted to a decibel scale and the loudness values are scaled down to a range of -1 to 1. Scaling to this interval further standardizes the training audio and matches the output of the Generator, which uses a hyperbolic tangent activation. Both channels of the input audio are processed separately and concatenated to create a two-channel data point, with each channel containing normalized loudness information for 256 frequency bins by 256 time steps.
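A rough sketch of this forward transform, using only NumPy, might look like the following. Several details are assumptions rather than the paper's choices: the window size is fixed at 510 samples (so a real FFT yields exactly 256 bins) and only the hop is derived from the signal length, the decibel values are clipped to an assumed symmetric range of -40 to 40 before scaling, and the function names are hypothetical.

```python
import numpy as np

N_FFT = 510         # assumed window size: rfft gives 510 // 2 + 1 = 256 bins
N_FRAMES = 256      # target number of time frames
KAISER_BETA = 12    # beta value stated in the text
DB_LIMIT = 40.0     # assumed symmetric decibel range used for scaling


def channel_to_normalized_db(signal):
    """Map one audio channel to a (256, 256) array with values in [-1, 1]."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < N_FFT + N_FRAMES - 1:
        raise ValueError("signal too short for 256 frames")

    # Choose the hop so that exactly N_FRAMES windows fit in the signal.
    hop = (signal.size - N_FFT) // (N_FRAMES - 1)
    window = np.kaiser(N_FFT, KAISER_BETA)
    starts = hop * np.arange(N_FRAMES)
    frames = np.stack([signal[s:s + N_FFT] for s in starts]) * window

    # Rows become frequency bins, columns become time frames.
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    db = 20.0 * np.log10(np.maximum(magnitude, 1e-10))
    return np.clip(db, -DB_LIMIT, DB_LIMIT) / DB_LIMIT


def stereo_to_data_point(stereo):
    """Process each channel separately and stack into shape (2, 256, 256)."""
    return np.stack([channel_to_normalized_db(ch) for ch in stereo])
```

Dividing by a fixed decibel limit keeps the mapping invertible with a single constant, which is convenient when the generated output has to be scaled back up later.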

When converting generated representations back to audio, this process runs in reverse. Each channel's generated normalized loudness information is scaled up to a range of -40 to 40. A noise gate is then applied and the decibel values are converted to magnitudes. The magnitude information is passed into a momentum-driven Griffin-Lim reconstruction, with noise gating at each iteration, which estimates the missing phase and recovers the audio.
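The reverse path can be sketched in NumPy under some loudly stated assumptions. The gate threshold `GATE_DB`, the hop length `HOP`, the window size, and all function names are hypothetical. The phase update follows the general fast Griffin-Lim pattern with a momentum term, which may differ from the exact variant used here, and the per-iteration noise gating is left out because its form is not described.

```python
import numpy as np

N_FFT = 510        # assumed window size (256 frequency bins)
HOP = 101          # assumed hop length in samples
DB_LIMIT = 40.0    # range stated in the text: -40 to 40
GATE_DB = -35.0    # hypothetical noise gate threshold


def normalized_db_to_magnitude(normalized, gate_db=GATE_DB):
    """Scale [-1, 1] values to decibels, gate quiet bins, return magnitudes."""
    db = np.asarray(normalized, dtype=np.float64) * DB_LIMIT
    magnitude = 10.0 ** (db / 20.0)
    magnitude[db < gate_db] = 0.0  # silence everything below the gate
    return magnitude


def _stft(signal, window, hop):
    n_frames = 1 + (signal.size - window.size) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(window.size)[None, :]
    return np.fft.rfft(signal[idx] * window, axis=1).T  # (bins, frames)


def _istft(spec, window, hop, length):
    frames = np.fft.irfft(spec.T, n=window.size, axis=1) * window
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):  # weighted overlap-add
        start = i * hop
        out[start:start + window.size] += frame
        norm[start:start + window.size] += window ** 2
    return out / np.maximum(norm, 1e-8)


def griffin_lim(magnitude, n_iter=32, momentum=0.99, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    window = np.kaiser(N_FFT, 12)
    length = HOP * (magnitude.shape[1] - 1) + N_FFT
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))
    previous = np.zeros_like(angles)
    for _ in range(n_iter):
        signal = _istft(magnitude * angles, window, HOP, length)
        rebuilt = _stft(signal, window, HOP)
        # Momentum step: move away from the previous projection.
        angles = rebuilt - (momentum / (1.0 + momentum)) * previous
        angles /= np.abs(angles) + 1e-16  # keep only the phase
        previous = rebuilt
    return _istft(magnitude * angles, window, HOP, length)
```

Each channel of a generated data point would be passed through `normalized_db_to_magnitude` and then `griffin_lim` separately, and the two waveforms stacked to form stereo audio.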

## 4. Model Implementation