Alignment data should be exposed as one of the outputs #70

shaunren · 2023-05-11T16:29:49Z

This is useful to determine e.g. the word boundaries in the output waveform.

orgarten · 2023-12-19T06:46:04Z

I am currently working on this and found the following things:

The word boundaries are not obtainable, because the sentences are synthesized as a whole
Synthesizing singular words and accumulating the length (including silent bytes) to get alignment data for individual words is possible but takes much longer and also is anything but accurate.
Synthesized sentences/words are of different length with each run.

My current implementation would output alignment data for sentences in CSV:

timestamp, word, start_index

luispater · 2025-01-07T17:52:25Z

I found a solution to work around the problem:

Use an HTTP request with a Piper Python web server to create a wave file with a sample rate of 20500.
Downsample the wave file to 16000 using librosa.
Post the wave data to the whisper.cpp web server for text recognition and alignment data.
Convert the wave data to MP3 format and embed the alignment data in the ID3 lyrics tag.
Output the MP3 file for the client.
Client parses the alignment data from the ID3 tag.

It's so annoying, but it works.
On my mac mini 4, the process used about 400ms, sometimes text recognition isn't very accurate, but I don't care, I only need the alignment data, it's accurate. because I already have the original text.

luispater · 2025-01-07T18:06:33Z

By the way, I read the piper source code. The process involves converting text to piper-phonemize, which retrieves ids from espeak-ng. Then, the phonetic symbols are passed to ONNX. ONNX acts as a black box: it takes the phonetic symbols as input and produces audio data as output. There is no way to get the alignment data...

eeejay · 2025-01-10T18:06:15Z

My assumption is that the models can be trained with time stamped data for word boundaries and then they would know to output alignment data, but I don't know enough about this yet.

synesthesiam · 2025-01-11T02:24:15Z

Alignment data is obtainable from the original PyTorch models, but not the Onnx models currently. This would require re-exporting all the voice models (incompatible with existing Piper) as well as adjusting Piper's code.

orgarten · 2025-01-12T10:27:00Z

We created a workaround for rough alignment data in #407

Boorj · 2025-01-17T11:24:33Z

I created a straightforward approximate alignment for the audio. I developed a set of timing coefficients, where consonants are short and stressed vowels are long, and then stretched the sum of these coefficients for each phoneme to fit the length of the synthesized audio. Thankfully, this method works well even without an extra recognition step.

In fact, aligning the Whisper-recognized text (which included timestamps) to the phonemes was quite a challenge. It involved several steps: first, matching words with position penalties to split the word sequence, and then breaking it down by sentence endings. After that, there was a less confident phase followed by DTW alignment. Surprisingly, the simple algorithm using the coefficients produced results that were almost as good.

Phoneme relative durations:
https://github.com/OpenVoiceOS/ovos-classifiers/blob/dev/ovos_classifiers/heuristics/phonemizer.py

This repo's issues:

synesthesiam added the enhancement New feature or request label May 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alignment data should be exposed as one of the outputs #70

Alignment data should be exposed as one of the outputs #70

shaunren commented May 11, 2023

orgarten commented Dec 19, 2023

luispater commented Jan 7, 2025 •

edited

Loading

luispater commented Jan 7, 2025

eeejay commented Jan 10, 2025

synesthesiam commented Jan 11, 2025

orgarten commented Jan 12, 2025

Boorj commented Jan 17, 2025 •

edited

Loading

Alignment data should be exposed as one of the outputs #70

Alignment data should be exposed as one of the outputs #70

Comments

shaunren commented May 11, 2023

orgarten commented Dec 19, 2023

luispater commented Jan 7, 2025 • edited Loading

luispater commented Jan 7, 2025

eeejay commented Jan 10, 2025

synesthesiam commented Jan 11, 2025

orgarten commented Jan 12, 2025

Boorj commented Jan 17, 2025 • edited Loading

luispater commented Jan 7, 2025 •

edited

Loading

Boorj commented Jan 17, 2025 •

edited

Loading