-
Either of those two formats is fine. A few formats are supported and a few just don't work. Rather than enumerate all of the ones that work, I just stated fp32.
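To illustrate why the two formats can be interchangeable, here is a minimal sketch of the usual mapping from signed 16-bit PCM to floating point. It assumes the loader normalizes integer samples to floats in [-1.0, 1.0), which is common for audio libraries but is not confirmed in this thread; `pcm16_to_float` is a hypothetical helper, not part of tortoise-tts.

```python
import array
from typing import List


def pcm16_to_float(samples: array.array) -> List[float]:
    """Map signed 16-bit PCM samples onto floats in [-1.0, 1.0).

    Hypothetical helper for illustration only; dividing by 32768
    (2**15) is the conventional scale for 16-bit audio.
    """
    return [s / 32768.0 for s in samples]


# Four example samples: silence, half scale, and the two extremes.
pcm = array.array("h", [0, 16384, -32768, 32767])
print(pcm16_to_float(pcm))
```

Under that assumption, a 16-bit clip and a 32-bit float clip of the same recording end up as nearly identical values once loaded.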
-
Is it possible to train / output using 48 kHz audio?
-
This is something I wondered as well. Might be my lack of knowledge about audio processing, but what is the reasoning for inputting at 22,050 Hz and outputting at 24 kHz? Awesome library btw, thanks for the hard work putting this out there.
-
Because midway through figuring out how to get this to work, I decided I wanted to use a vocoder. The best vocoder I could find was univnet, and its pretrained models use a 24khz sampling rate. I was unwilling to retrain my AR model to use this sample rate and I didn't have any interest in retraining univnet either, so I trained the diffusion model to bridge the gap. If the system were retrained from scratch, I would probably just train it at 24khz.
-
The README.md section on "Adding a new voice" states:-
Save the clips as a WAV file with floating point format and a 22,050 sample rate
However, the samples in:-
https://github.com/neonbjb/tortoise-tts/tree/main/tortoise/voices/....
seem to be in the more common and standard pcm_s16le (PCM signed 16-bit little-endian) format rather than a floating point format, e.g. pcm_f32le (PCM 32-bit floating point little-endian).
So which is it?
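For anyone who wants to check which of these formats a given clip uses, here is a rough sketch that reads the `fmt ` chunk of a WAV header using only the Python standard library. `describe_wav` and `FORMAT_NAMES` are hypothetical names for illustration, not part of tortoise-tts, and the demo clip is generated in memory rather than taken from the repository.

```python
import io
import struct
import wave

# Format tags from the WAV spec: 1 = integer PCM, 3 = IEEE float.
# Other tags (e.g. 0xFFFE, "extensible") are reported by number.
FORMAT_NAMES = {1: "integer PCM", 3: "IEEE float"}


def describe_wav(stream):
    """Return (format name, bits per sample, sample rate) for a WAV stream."""
    riff, _, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no 'fmt ' chunk found")
        chunk_id, size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fields = struct.unpack("<HHIIHH", stream.read(size)[:16])
            tag, _channels, rate, _byte_rate, _block_align, bits = fields
            return FORMAT_NAMES.get(tag, f"format tag {tag}"), bits, rate
        # Skip other chunks; chunk bodies are padded to an even length.
        stream.seek(size + (size & 1), io.SEEK_CUR)


# Demo: write a tiny 16-bit mono clip at 22,050 Hz into memory, then inspect it.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(22050)
    w.writeframes(b"\x00\x00" * 10)
buf.seek(0)
print(describe_wav(buf))
```

To inspect a file on disk, the same function can be called as `describe_wav(open(path, "rb"))`. A pcm_s16le clip would report integer PCM at 16 bits, while a pcm_f32le clip would report IEEE float at 32 bits.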