Alignment data should be exposed as one of the outputs #70
Comments
I am currently working on this and found the following things:
My current implementation would output alignment data for sentences in CSV:
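(The original CSV sample did not survive extraction. As a purely illustrative sketch, per-sentence alignment rows might look like the following; the column names and values are hypothetical, not the actual output of the implementation mentioned above.)

```python
import csv

# Hypothetical per-sentence alignment rows: (sentence_index, start_sec, end_sec, text).
# Columns and values are illustrative only.
rows = [
    (0, 0.00, 1.82, "Hello there."),
    (1, 1.82, 4.10, "This is the second sentence."),
]

with open("alignment.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence", "start_sec", "end_sec", "text"])
    writer.writerows(rows)
```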
I found a solution to work around the problem:
It's so annoying, but it works.
By the way, I read the piper source code. The text is first converted to phonemes by piper-phonemize, which gets the phoneme ids via espeak-ng. Those phoneme ids are then passed to the ONNX model. ONNX acts as a black box: it takes the phoneme ids as input and produces audio data as output. There is no way to get the alignment data...
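For illustration, here is a minimal onnxruntime sketch of that last step, assuming the input/output names commonly seen in Piper voice exports (`input`, `input_lengths`, `scales`); the phonemization step and the model filename are placeholders. The point is that the session only returns audio samples, with no duration or attention tensors:

```python
import numpy as np
import onnxruntime

# Phoneme ids as produced by piper-phonemize / espeak-ng (step omitted; values are placeholders).
phoneme_ids = np.array([[1, 14, 29, 53, 14, 2]], dtype=np.int64)

# Placeholder path to a Piper voice model.
sess = onnxruntime.InferenceSession("en_US-voice.onnx")
audio = sess.run(
    None,
    {
        "input": phoneme_ids,
        "input_lengths": np.array([phoneme_ids.shape[1]], dtype=np.int64),
        # noise_scale, length_scale, noise_w -- typical Piper scale parameters.
        "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),
    },
)[0]

# `audio` is just the raw waveform; the exported graph exposes no per-phoneme timing.
```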
My assumption is that the models could be trained with time-stamped data for word boundaries, so that they would learn to output alignment data, but I don't know enough about this yet.
Alignment data is obtainable from the original PyTorch models, but not from the ONNX models currently. Exposing it would require re-exporting all the voice models (making them incompatible with existing Piper) as well as adjusting Piper's code.
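As a rough sketch of what such a re-export could look like, assuming a VITS-style generator whose `infer()` can also return per-phoneme durations (the class and argument names here are illustrative, not Piper's actual training code):

```python
import torch

class VoiceWithAlignment(torch.nn.Module):
    """Wraps a VITS-style generator so the exported graph also emits durations."""

    def __init__(self, generator):
        super().__init__()
        self.generator = generator

    def forward(self, phoneme_ids, lengths, scales):
        # Assumption: the underlying infer() returns (audio, per_phoneme_durations).
        audio, durations = self.generator.infer(phoneme_ids, lengths, scales)
        return audio, durations

# Hypothetical export with a second output carrying the alignment data:
# model = VoiceWithAlignment(loaded_pytorch_generator)
# torch.onnx.export(
#     model,
#     (phoneme_ids, lengths, scales),
#     "voice-with-alignment.onnx",
#     input_names=["input", "input_lengths", "scales"],
#     output_names=["output", "durations"],
#     dynamic_axes={"input": {0: "batch", 1: "phonemes"}, "output": {0: "batch", 1: "time"}},
# )
```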
We created a workaround for rough alignment data in #407
I created a straightforward approximate alignment for the audio. I developed a set of timing coefficients, where consonants are short and stressed vowels are long, and then stretched the sum of these coefficients for each phoneme to fit the length of the synthesized audio. Thankfully, this method works well even without an extra recognition step. In fact, aligning the Whisper-recognized text (which included timestamps) to the phonemes was quite a challenge. It involved several steps: first, matching words with position penalties to split the word sequence, and then breaking it down by sentence endings. After that, there was a less confident phase followed by DTW alignment. Surprisingly, the simple algorithm using the coefficients produced results that were almost as good; see the sketch below.
Phoneme relative durations:
This repo's issues:
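A minimal sketch of that coefficient-based approach, with made-up coefficients (the actual table from the workaround is not reproduced here): assign each phoneme a relative weight, then scale the weights so their sum matches the synthesized audio duration.

```python
# Illustrative relative durations; a real table would cover many more phoneme classes.
PHONEME_WEIGHTS = {
    "t": 0.4, "k": 0.4, "s": 0.6, "n": 0.5,   # consonants: short
    "a": 1.0, "i": 0.9, "o": 1.0,             # plain vowels
    "'a": 1.6, "'i": 1.5, "'o": 1.6,          # stressed vowels: long
}

def approximate_alignment(phonemes, audio_duration_sec, default_weight=0.8):
    """Return (phoneme, start_sec, end_sec) tuples by stretching relative weights."""
    weights = [PHONEME_WEIGHTS.get(p, default_weight) for p in phonemes]
    scale = audio_duration_sec / sum(weights)
    alignment, t = [], 0.0
    for phoneme, w in zip(phonemes, weights):
        alignment.append((phoneme, t, t + w * scale))
        t += w * scale
    return alignment

# Example: three phonemes stretched over half a second of audio.
print(approximate_alignment(["k", "'a", "t"], audio_duration_sec=0.5))
```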
This is useful to determine, e.g., the word boundaries in the output waveform.