Audio Length Limitation and FlashAttention Warning in Parler TTS #126

suman819 · 2024-08-30T10:28:22Z

I have been working with Parler TTS and encountered an issue where I am unable to generate audio longer than 20 seconds. Despite trying various methods, such as streaming and splitting the text into chunks, the audio output is still truncated to around 15-20 seconds.

Additionally, I received a warning stating that FlashAttention is not installed. Could this be the cause of the issue? I would appreciate any guidance or suggestions on how to handle longer input text effectively.

dhaivat1729 · 2024-09-01T20:48:56Z

I have the same issue. Audio length is truncated.

kunci115 · 2024-09-02T02:52:25Z

The default training configuration in parler-tts caps audio at 30 seconds and text length at 600:
https://github.com/huggingface/parler-tts/blob/main/training/README.md#3-training

Either fine-tune it on longer data, or split the input: if your text would produce more than 30 seconds of audio or its length exceeds 600, split it at punctuation (.,) and send the pieces separately.
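A minimal sketch of that splitting step, in plain Python. The helper name `split_text` is hypothetical, and the limit is treated as 600 characters, which is one reading of "text length 600" (the training config may count tokens instead), so adjust `max_chars` to taste.

```python
import re

def split_text(text, max_chars=600):
    """Split text at '.' or ',' into chunks of at most max_chars characters."""
    # Split on whitespace that follows '.' or ',' so each mark stays with its piece.
    pieces = [p.strip() for p in re.split(r"(?<=[.,])\s+", text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_chars or not current:
            # A single piece longer than max_chars is kept whole, not cut mid-word.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to the model on its own. Pieces are packed greedily, so chunks end at punctuation and stay as long as the limit allows.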

suman819 · 2024-09-03T13:15:41Z

I have already applied the suggested method of splitting the text if it exceeds 30 seconds or 600 characters by using punctuation (.,). However, when I combine the audio segments, there is an inconsistency in the voice tone, even when a specific voice prompt is set.

b-feldmann · 2024-09-09T17:09:47Z

I could get it to work with this PR: #110

The main idea is to generate once with a short prompt such as "This is my prefix prompt." and store the encoded result.
Afterward, generate as many sentences as you need in the form below, passing the encoded result as decoder_input_ids:

"This is my prefix prompt. This is my first real sentence."
"This is my prefix prompt. This is another sentence that sounds the same."
"This is my prefix prompt. And again here"
...

You then need to remove the prefix's encoded audio from each output to get consistent results without the prefix prompt.
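The trimming step could look roughly like the sketch below. It is not the code from the PR: `strip_prefix` is a hypothetical helper, and it assumes every output begins with exactly `prefix_len` items (audio frames or samples) belonging to the prefix, which you would need to measure from the stored prefix result.

```python
def strip_prefix(segments, prefix_len):
    """Drop the first prefix_len items of every segment and join the rest."""
    combined = []
    for segment in segments:
        # Everything before prefix_len is assumed to be the shared prefix audio.
        combined.extend(segment[prefix_len:])
    return combined
```

The same slicing works on NumPy arrays or tensors if the outputs are not plain lists, with the concatenation swapped for the matching library call.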

cesinsingapore · 2024-09-13T04:22:17Z

Is anyone else finding that Parler sounds oddly disfluent when the input mixes numbers and letters? For example: "my id card is 5o613123jkl"

Guppy16 · 2024-09-25T18:12:44Z

You could also experiment with the min_new_tokens parameter. I believe that in ParlerTTS a single audio token represents ~12 ms of audio (about 86 tokens per second), so generating 20 seconds would take 20 × 86 = 1720 tokens.

model.generate(min_new_tokens=1720, **generation_kwargs)
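To get a token count for other durations, the arithmetic can be wrapped in a small helper. This is only a sketch: `min_new_tokens_for` is a hypothetical name, and the default of 86 tokens per second is an assumption inferred from the 1720-tokens-for-20-seconds figure in this comment, so check it against the audio encoder your checkpoint actually uses.

```python
import math

def min_new_tokens_for(seconds, tokens_per_second=86):
    """Number of audio tokens needed to cover the requested duration."""
    # Round up so the generated audio is never shorter than requested.
    return math.ceil(seconds * tokens_per_second)
```

For example, `min_new_tokens_for(20)` gives the 1720 used above, and the result can be passed as `min_new_tokens` in the same `model.generate` call.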
