
Alignment data should be exposed as one of the outputs #70

Open
shaunren opened this issue May 11, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@shaunren

This would be useful for determining, e.g., the word boundaries in the output waveform.

@synesthesiam synesthesiam added the enhancement New feature or request label May 12, 2023
@orgarten

I am currently working on this and found the following:

  • Word boundaries are not directly obtainable, because sentences are synthesized as a whole.
  • Synthesizing individual words and accumulating their lengths (including silent bytes) to get per-word alignment data is possible, but it takes much longer and is anything but accurate.
  • Synthesized sentences/words come out with a different length on each run.

My current implementation would output alignment data for sentences as CSV:

timestamp, word, start_index
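
For illustration, a minimal sketch of this per-word accumulation idea; `synthesize` is a hypothetical helper that returns raw mono samples for one word (it is not part of Piper's API), and the sample rate is assumed to be 22050 Hz:

```python
import csv

SAMPLE_RATE = 22050  # assumed Piper output rate

def align_words(words, synthesize, out_path="alignment.csv"):
    """Synthesize each word separately and accumulate sample offsets."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "word", "start_index"])
        cursor = 0  # running offset in samples
        for word in words:
            samples = synthesize(word)  # hypothetical per-word TTS call
            writer.writerow([cursor / SAMPLE_RATE, word, cursor])
            cursor += len(samples)
```

As noted above, per-word synthesis changes prosody, so offsets computed this way drift from the full-sentence audio.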

@luispater

luispater commented Jan 7, 2025

I found a workaround:

  1. Use an HTTP request to a Piper Python web server to create a WAV file with a sample rate of 22050 Hz.
  2. Downsample the WAV file to 16000 Hz using librosa.
  3. Post the WAV data to the whisper.cpp web server for text recognition and alignment data.
  4. Convert the WAV data to MP3 format and embed the alignment data in the ID3 lyrics tag.
  5. Send the MP3 file to the client.
  6. Have the client parse the alignment data from the ID3 tag.

It's annoying, but it works.
On my Mac mini M4, the whole process takes about 400 ms. The text recognition isn't always accurate, but that doesn't matter: I only need the alignment data, and that is accurate because I already have the original text.
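
A minimal sketch of steps 2 to 4 of this pipeline, assuming the whisper.cpp example server's `/inference` endpoint, `mutagen` for the ID3 lyrics (USLT) frame, and `pydub` (which needs ffmpeg) for the MP3 conversion; file names and endpoint details are placeholders:

```python
import json

import librosa
import requests
import soundfile as sf
from mutagen.id3 import ID3, ID3NoHeaderError, USLT
from pydub import AudioSegment

# Step 2: downsample Piper's 22050 Hz output to 16000 Hz for Whisper.
audio, _ = librosa.load("piper_output.wav", sr=16000)
sf.write("whisper_input.wav", audio, 16000)

# Step 3: post the audio to the whisper.cpp server for timestamps.
with open("whisper_input.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/inference",  # assumed endpoint
        files={"file": f},
        data={"response_format": "verbose_json"},
    )
alignment = resp.json()

# Step 4: convert to MP3 and embed the alignment as an
# unsynchronized lyrics (USLT) frame.
AudioSegment.from_wav("piper_output.wav").export("out.mp3", format="mp3")
try:
    tags = ID3("out.mp3")
except ID3NoHeaderError:
    tags = ID3()
tags.add(USLT(encoding=3, lang="eng", desc="alignment",
              text=json.dumps(alignment)))
tags.save("out.mp3")
```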

@luispater

By the way, I read the Piper source code. The process converts the text to phonemes with piper-phonemize (which uses espeak-ng) and maps those phonemes to ids. The ids are then passed to the ONNX model, which acts as a black box: it takes the phoneme ids as input and produces audio data as output. There is no way to get the alignment data...
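
For illustration, a minimal sketch of that flow, based on the piper-phonemize Python bindings and the input names Piper's exported models use (`input`, `input_lengths`, `scales`); the model path is a placeholder:

```python
import numpy as np
import onnxruntime
from piper_phonemize import phoneme_ids_espeak, phonemize_espeak

text = "Hello world"
phonemes = phonemize_espeak(text, "en-us")[0]  # first sentence's phonemes
ids = phoneme_ids_espeak(phonemes)

session = onnxruntime.InferenceSession("voice.onnx")  # placeholder path
audio = session.run(
    None,
    {
        "input": np.array([ids], dtype=np.int64),
        "input_lengths": np.array([len(ids)], dtype=np.int64),
        # noise_scale, length_scale, noise_w (Piper's defaults)
        "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),
    },
)[0].squeeze()

# `audio` is only raw waveform samples; no per-phoneme timing comes
# back, which is the black-box problem described above.
```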

@eeejay

eeejay commented Jan 10, 2025

My assumption is that the models could be trained with time-stamped word-boundary data and would then learn to output alignment data, but I don't know enough about this yet.

@synesthesiam
Contributor

Alignment data is obtainable from the original PyTorch models, but not currently from the ONNX models. Exposing it would require re-exporting all of the voice models (which would be incompatible with existing Piper) as well as adjusting Piper's code.
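
For illustration, assuming a VITS-style PyTorch model (which Piper's voices are based on), the per-phoneme frame durations (`w_ceil` in the VITS reference code) can be turned into timestamps; a sketch, with hop length and sample rate as assumed constants:

```python
HOP_LENGTH = 256     # assumed VITS frame hop
SAMPLE_RATE = 22050  # assumed Piper medium-quality rate

def phoneme_timings(w_ceil, phonemes):
    """Map per-phoneme frame counts to (phoneme, start_s, end_s)."""
    durations = w_ceil.squeeze().tolist()  # frames predicted per phoneme
    timings, t = [], 0.0
    for phoneme, frames in zip(phonemes, durations):
        dt = frames * HOP_LENGTH / SAMPLE_RATE
        timings.append((phoneme, t, t + dt))
        t += dt
    return timings
```

Getting `w_ceil` out means instrumenting the model's inference path, which neither stock Piper nor its ONNX exports expose.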

@orgarten

We created a workaround for rough alignment data in #407

@Boorj

Boorj commented Jan 17, 2025

I created a straightforward approximate alignment for the audio. I built a set of timing coefficients (consonants short, stressed vowels long) and then scaled each phoneme's coefficient so that their sum matched the length of the synthesized audio. Thankfully, this method works well even without an extra recognition step.

In fact, aligning the Whisper-recognized text (which included timestamps) to the phonemes was quite a challenge. It involved several steps: first matching words with position penalties to split the word sequence, then breaking it down by sentence endings; after that came a less reliable phase, followed by DTW alignment. Surprisingly, the simple coefficient-based algorithm produced results that were almost as good.

Phoneme relative durations:
https://github.com/OpenVoiceOS/ovos-classifiers/blob/dev/ovos_classifiers/heuristics/phonemizer.py
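
A minimal sketch of this coefficient-stretching heuristic; the relative-duration values below are invented placeholders (see the linked phonemizer.py for real ones):

```python
# Invented placeholder coefficients: consonants short, stressed
# vowels long; see the linked phonemizer.py for a real table.
REL_DURATION = {"p": 0.5, "t": 0.5, "a": 1.0, "ˈa": 1.6}

def stretch_alignment(phonemes, audio_seconds, default=1.0):
    """Scale relative phoneme durations so they fill the audio."""
    weights = [REL_DURATION.get(p, default) for p in phonemes]
    scale = audio_seconds / sum(weights)
    spans, t = [], 0.0
    for phoneme, weight in zip(phonemes, weights):
        spans.append((phoneme, t, t + weight * scale))
        t += weight * scale
    return spans
```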
