Example for raw audio #21

Open
mm3509 opened this issue Dec 3, 2023 · 4 comments


mm3509 commented Dec 3, 2023

Hello, and thanks for the code! I want to replicate the audio results from the paper, but the DeepMind repo does not have a VQ-VAE example for audio (see google-deepmind/sonnet#141), and the audio setup described in the paper seems quite different from the one for CIFAR:

> We train a VQ-VAE where the encoder has 6 strided convolutions with stride 2 and window-size 4. This yields a latent space 64x smaller than the original waveform. The latents consist of one feature map and the discrete space is 512-dimensional.

Could you please include an example of using your code for audio?
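
For reference, the downsampling arithmetic in the quoted passage can be checked with a short sketch. It only computes sequence lengths and is not the paper's or this repo's implementation. The padding of 1 per layer and the 16,000-sample input are assumptions chosen so that each layer halves the length exactly; the quote specifies only 6 layers, stride 2 and window size 4. Reading "512-dimensional" as a codebook of 512 entries is also an interpretation.

```python
import math


def conv1d_output_length(length, kernel_size=4, stride=2, padding=1):
    """Output length of one 1-D convolution (standard formula, dilation 1).

    padding=1 is an assumption; the quoted passage does not state it.
    """
    return (length + 2 * padding - kernel_size) // stride + 1


def encoder_output_length(num_samples, num_layers=6):
    """Length of the latent sequence after `num_layers` strided convolutions."""
    length = num_samples
    for _ in range(num_layers):
        length = conv1d_output_length(length)
    return length


# Hypothetical input: one second of audio at 16 kHz (assumed, not from the quote).
num_samples = 16000
latent_length = encoder_output_length(num_samples)
compression = num_samples / latent_length

# Assumed reading of "512-dimensional": a codebook with 512 entries,
# so each latent index carries log2(512) bits.
bits_per_latent = math.log2(512)

print(latent_length, compression, bits_per_latent)
```

With these assumptions, six stride-2 layers give a reduction of 2^6 = 64 in length, which matches the "64x smaller" figure in the quote.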

@UkiTenzai

Why not take a look at AudioDec and Descript-Audio-Codec? They are open source.


mm3509 commented Mar 14, 2025

Thank you, @UkiTenzai. I checked the GitHub pages for both (https://github.com/facebookresearch/AudioDec and https://github.com/descriptinc/descript-audio-codec) and neither seems to do vocal cloning, i.e. neural voice transfer, right? That's what I would like to do with the VQ-VAE.


UkiTenzai commented Mar 15, 2025

> Thank you, @UkiTenzai. I checked the GitHub pages for both (https://github.com/facebookresearch/AudioDec and https://github.com/descriptinc/descript-audio-codec) and neither seems to do vocal cloning, i.e. neural voice transfer, right? That's what I would like to do with the VQ-VAE.

Sorry, AudioDec and DAC are for compression. You can try SpeechTokenizer (https://github.com/ZhangXInFD/SpeechTokenizer/), which uses a VQ-VAE and can be used for zero-shot voice conversion (VC). Many similar VQ-VAE models now surpass it, but they are all essentially improvements on it, so it is worth learning this one first.


mm3509 commented Mar 15, 2025

Thank you. I checked the repo and it doesn't mention vocal cloning either, and an online search for SpeechTokenizer and vocal cloning did not turn up any applications, so I wouldn't know where to start. Could you please point me to an application or sample code that uses SpeechTokenizer for neural voice transfer?
