Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SUGGESTIONS] floating point, dequantize and more #18

Open
MarcoRavich opened this issue Jul 29, 2024 · 11 comments
Open

[SUGGESTIONS] floating point, dequantize and more #18

MarcoRavich opened this issue Jul 29, 2024 · 11 comments
Assignees

Comments

@MarcoRavich
Copy link

MarcoRavich commented Jul 29, 2024

Hi there,
since I've some experiences in this field (audio delossify/upscale) I'd like to share what I have learned:

  • During a lossy audio treating, the best approach is to carefully decode and preserve it as much as possible: that's why a floating point internal computing and output grants better fidelity;
  • Operations like normalization (that alters the level of the original signal) should be optional and not part of the process;
  • Since your software relies on FFMPEG, I strongly suggest you to always use the -drc_scale 0 parameter when decoding lossy sources (even if it's AC3-specific feature);
  • Always seek for new scientific papers/code - such as @zawi01's Audio Dequantization Using (Co)Sparse (Non)Convex Methods - that may help to achieve better results;
  • While you've probably already done this, I recommend you to check how Izotope' Spectral Recovery and Stereotool's Delossifier functions works, for possible inspirations.

Last but not least, I've added your project to HyMPS project one (under AUDIO section \ Treatments page \ Enhancing subsection) so check out other interesting open projects to collaborate with.

Hope that helps/inspires !

EDIT
I've also opened a 3ad @ hydrogenaudio about Fat Llama, enjoy.

@MarcoRavich MarcoRavich changed the title [SUGGESTIONS] fp32, dequantize and more [SUGGESTIONS] floating point, dequantize and more Jul 29, 2024
@bkraad47
Copy link
Owner

Much appreciated will get to it when I can

@MarcoRavich
Copy link
Author

MarcoRavich commented Jul 30, 2024

I also suggest you to deeply check this paper (that describes many interesting audio restoration techniques) and, of course, try to collaborate with @abreuwallace that unofficially implemented it here.

@bkraad47 bkraad47 self-assigned this Aug 1, 2024
@bkraad47
Copy link
Owner

bkraad47 commented Aug 1, 2024

Changes made to v-1.1.0 to address some of the topics...cheers

@MarcoRavich
Copy link
Author

MarcoRavich commented Aug 1, 2024

Bump.

I recalled an old - but still valid - comparison about MP3 decoders quality:

mp3_quality

...and this explains better why fp matters (google-translated from here):

There is another solution for decoding mp3 files that has been around for less than a year and whose technical specifications are quite impressive, and which should offer a quality that is significantly higher than that of MAD. This software is foobar2000, and uses the mpglib library as a decoder, largely modified by Peter Pawlowski. This decoding solution has the disadvantage of not supporting freeformat encodings, but as advantages, both theoretical and practical, we find the following characteristics:

  • decoding performed in 64 floating bits
  • integration of a disengageable dithering algorithm, with three noise shaping modes to limit its negative impact
  • full consideration of offsets, allowing gapless chaining of lame encodings (automatic) and any other mp3 encoder (manual)
  • full support for replaygain
  • management of several tag formats (ID3v1, ID3v2.x and APEv2)

All of these qualities are applicable to formats other than mp3, and - this is the most important aspect - are perfectly reproducible in the form of a test PCM file thanks to the complete diskwriter available with the software. Integrating foobar2000 should allow us to measure the contribution of the noise shaping technique, an essential complement to dithering that MAD strangely ignores. This technique should push back the nuisances linked to dithering (background noise) while preserving its contribution (higher resolution). From a theoretical point of view at least, foobar2000 should offer qualitatively superior decoding to that of MAD.

It's unfortunally no longer used/developed (a 2006-source code version shared from Peter here), but someone like @buhadram (that recently forked mpglibdll) can be interested in implement such optimizations.

I also suggest to check project like Audio Delossifier by @kroll-software that may help even more.

Last but not least, the approach you're appying seems very similar to the well-known Spectral band replication but without the encoded-datas for recontruction.
This could be similar to what stereotool does in its Delossifier function, but you have to investigate better/deeper.
In this field somone like @Abhipray - that some time ago developed an audio codec with SBR logic - may explain you better if it's feasible.

Anyway I believe that these types of alterations shouldn't be mandatory but, just like in stereotool, optionally activated by users.

Hope that helps.

@bkraad47
Copy link
Owner

bkraad47 commented Aug 2, 2024

Hi the code was updated in version 1.1.0 to use fp 64 cheers

@MarcoRavich
Copy link
Author

Hi the code was updated in version 1.1.0 to use fp 64 cheers

Nice.

Please join the 3ad I've started @ hydrogenaudio 'cause others are asking me things that I sincerely don't know (note that there are many audio and codecs experts there).

Thanks.

@MarcoRavich
Copy link
Author

MarcoRavich commented Aug 6, 2024

Here are some other interesting projects to collaborate with:

It would be also interesting to adopt better quality measurement method, such as the well-known Perceptual Evaluation of Audio Quality.

You can evaluate to integrate it in fatllama too: https://github.com/search?q=PEAQ+audio&type=repositories&s=updated

Last but not least, what about a "Jupyter Notebook" to let anyone test fatllama easily ?
Check out Google Colab/HF Spaces/Paperspace/Kaggle/Jupiter/Deepnote/etc.

Hope that inspires !

@MarcoRavich
Copy link
Author

Look and you shall find

Another interesting - even if speech-oriented - project with many cool resources (= papers) linked in: EnglishSpeechUpsampler by @jhetherly.

@ivandustin
Copy link

You can experiment with using transformer models to reconstruct frequencies for upscaling audio. For example, breathing sounds that are audible in CD quality but lost in MP3 could potentially be restored using a transformer model. While the model might introduce sounds that weren't present in the original recording, it can artificially enhance the details of the music. However, this approach requires significant effort and time to develop and fine-tune.

@MarcoRavich
Copy link
Author

Thanks for your knowledge sharing @ivandustin !

Seems that well-known @IAHispano choosed @haoheliu's AudioSR for their Audio Upscaler:

At the moment the best approach seems to be the Audio Delossifier by @kroll-software one, which is trainable.

Anyway I would love to see a "scientific" (= spectral ?) quality comparison between all contenders...

...so I've decided to reorganize (and expand) the HyMPS' collection: AUDIO section \ AI-based category \ Enhancers page \ Upscalers

Enjoy !

@MarcoRavich
Copy link
Author

MarcoRavich commented Aug 20, 2024

Well the HA discussion is going on and - as always - something interesting comes out:

DeltaWave Audio Null Comparator: a free experimental software built to allow detailed comparison of two different captures of the same audio signal.

It performs a wide range of measurements and comparisons, which could be very useful in testing the effectiveness and accuracy of fat llama: please check it out.

EDIT
Here's a DeltaWave report - with some graphs - on experiment files you shared:

DeltaWave v2.0.13, 2024-08-20T12:23:57.2062764+02:00
Reference: test_original_flac.flac[L] 7278467 samples 192000Hz 24bits, stereo, MD5=00
Comparison: test_converted_flac.flac[L] 8491546 samples 224000Hz 24bits, stereo, MD5=00
Settings:
Gain:True, Remove DC:True
Non-linear Gain EQ:False Non-linear Phase EQ: False
EQ FFT Size:65536, EQ Frequency Cut: 0Hz - 0Hz, EQ Threshold: -500dB
Correct Non-linearity: False
Correct Drift:True, Precision:30, Subsample Align:True
Non-Linear drift Correction:False
Upsample:True, Window:Kaiser
Spectrum Window:Kaiser, Spectrum Size:32768
Spectrogram Window:Hann, Spectrogram Size:4096, Spectrogram Steps:2048
Filter Type:FIR, window:Kaiser, taps:262144, minimum phase=False
Dither:False bits=0
Trim Silence:True
Enable Simple Waveform Measurement: False
Resampled Reference to 224000Hz
Discarding Reference: Start=0s, End=0s
Discarding Comparison: Start=0s, End=0s
Initial peak values Reference: -3,358dB Comparison: 0dB
Initial RMS values Reference: -20,25dB Comparison: -16,675dB
Null Depth=101,926dB
Trimming 0 samples at start and 0 samples at the end that are below -90,31dB level
X-Correlation offset: -3 samples
Trimming 0 samples at start and 0 samples at the end that are below -90,31dB level
Drift computation quality, #1: Excellent (0,3μs)
Trimmed 171244 samples ( 764,482143ms) front, 224860 samples ( 1003,839286ms end)
Final peak values Reference: -3,358dB Comparison: -3,801dB
Final RMS values Reference: -20,505dB Comparison: -20,765dB
Gain= 3,822dB (1,5528x) DC=0,00001 Phase offset=-0,013447ms (-3,012 samples)
Difference (rms) = -32,77dB [-39,68dBA]
Correlated Null Depth=42,37dB [41,89dBA]
Clock drift: -0,01 ppm
Files are NOT a bit-perfect match (match=0,13%) at 16 bits
Files are NOT a bit-perfect match (match=0%) at 24 bits
Files match @ 49,998% when reduced to 6,5 bits
---- Phase difference (full bandwidth): 143,874834932871°
0-10kHz: 23,92°
0-20kHz: 56,12°
0-24kHz: 66,07°
Timing error (rms jitter): 514ns
PK Metric (step=400ms, overlap=50%):
RMS=-37,5dBFS
Median=-46,7
Max=-29,4
99%: -30,89
75%: -34,7
50%: -46,7
25%: -50,74
1%: -55,33
gn=0,644017646514286, dc=1,40872942458951E-05, dr=-8,96573621136896E-09, of=-3,01221162486093
DONE!
Signature: 85634bda22fde7ebef2c2e8a0c038e7a
RMS of the difference of spectra: -78,1160326779584dB
DF Metric (step=400ms, overlap=0%):
Median=-18,9dB
Max=-10,6dB Min=-30,8dB
1% > -30,78dB
10% > -27,18dB
25% > -24,67dB
50% > -18,88dB
75% > -13,32dB
90% > -12,28dB
99% > -2,19dB
Linearity 2,1bits @ 0.5dB error


As already asked, a jupiter notebook (that let users run tests on Colab, for example) would be great for those - like me too - who don't own CUDA-enabled GPUs.

Last but not least, I've also found this interesting repo by @franciscorafart that may help too.

Hope that helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants