Bugfix: Delay pattern mask is applied twice #110

Guppy16 · 2024-08-16T16:15:46Z

There are a few bugfixes / contributions:

(Bugfix) In ParlerTTSForConditionalGeneration::generate(), the delay pattern mask is built and applied to input_ids before calling _sample(). However, we do not want to apply the mask until we're inside the _sample() function. This bug doesn't affect the current inference setup because the input_ids returned happen to be the same as what's passed in.
(Contribution) I've provided an example of how to do audio enrolment to improve the consistency of audio generation (helpers/voice_enrolment/enrol.ipynb). I believe it helps, but I'm not sure it's always better. I mainly provide it as an example of how the bugfix helps: we can pass the enrolment as prefix tokens via the decoder_input_ids argument when calling model.generate(). Without the bugfix, the audio sounds "crackly", because the mask has effectively been applied twice to the prefix.
(Bugfix) When performing deterministic greedy decoding (by passing in do_sample=False), there is a bug where logits_warper is not passed in. I believe this should just be None(?), which I've committed in this PR. Related to this, I also want to raise an issue: deterministic decoding with do_sample=False, or near-deterministic sampling with temperature=0.1, tends to generate random noise.
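
To make the "applied twice" effect concrete, here is a toy sketch in plain Python. It is not the Parler-TTS implementation: toy_delay, the BOS value, and the example codes are all made up for illustration. It only assumes that the delay pattern shifts codebook k right by k steps and fills the gap with a start token.

```python
BOS = 1025  # hypothetical start-token id, chosen only for this example


def toy_delay(codes):
    """Shift row k right by k steps, fill the gap with BOS, keep the original width."""
    seq_len = len(codes[0])
    return [([BOS] * k + row)[:seq_len] for k, row in enumerate(codes)]


# Default inference: the decoder starts from a single column of start tokens.
bos_only = [[BOS], [BOS], [BOS]]
bos_after_delay = toy_delay(bos_only)  # identical to bos_only

# Enrolment: a multi-token prefix, one row per codebook (made-up values).
prefix = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
]
once = toy_delay(prefix)   # row 1 shifted by 1, row 2 shifted by 2
twice = toy_delay(once)    # row 1 shifted by 2, row 2 pushed out entirely

for name, grid in (("once", once), ("twice", twice)):
    print(name)
    for row in grid:
        print("  ", row)
```

With a single start-token column the shift is a no-op, which is consistent with default inference being unaffected. With a longer prefix, the second shift misaligns the higher codebooks relative to the first one.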


Guppy16 · 2024-08-19T09:12:51Z

Perhaps the bugfixes also need to be applied in ParlerTTSForCausalLM? (I haven't touched this class, so I'm not sure about its intended use.)

Guppy16 · 2024-08-21T17:07:05Z

@ylacombe Would it be possible for you to review this?

ylacombe · 2024-09-02T11:21:06Z

Hey @Guppy16, thanks for opening this! I'll take a look in the coming days!

ylacombe · 2024-09-05T09:51:19Z

(Bugfix) In ParlerTTSForConditionalGeneration::generate(), the delay pattern mask is built and applied to input_ids before calling _sample(). However, we do not want to apply the mask until we're inside the _sample() function. The above bug doesn't affect the current inference setup because the input_ids returned happens to be the same as what's passed in.

So it's actually a redundant operation that changes nothing, right? I just want to make sure it's not a bug. When I experiment, it doesn't seem to change anything.

ylacombe

Hey @Guppy16, so even though the modification you're proposing makes sense, I'm not sure that your other additions should be merged:

the logits warper modification is an error
I'd rather have an issue opened that explains how to do your voice enrolment thing than a notebook! Would you like to write a guide about this and add it to the issues? I can ping it afterwards

ylacombe · 2024-09-05T12:21:11Z

parler_tts/modeling_parler_tts.py

@@ -3442,6 +3443,7 @@ def generate(
                generation_config=generation_config,
                synced_gpus=synced_gpus,
                streamer=streamer,
+                logits_warper=None,


You should keep the logits_warper, I'm not sure why you removed it!

Guppy16 · 2024-09-26

I didn't remove it! Originally, logits_warper wasn't being passed in, so this part of the code was failing. I believe logits_warper=None should be set when doing greedy search. Could you please double-check this?
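
As a rough illustration of why a warper is optional for greedy search, the sketch below uses a hypothetical pick_next_token helper (it is not a transformers or Parler-TTS function). It assumes a warper is just a callable that rescales logits before sampling; greedy decoding takes the argmax and never calls it. The rule that sampling must receive a warper is a choice made for this sketch only.

```python
import math
import random


def temperature_warper(temperature):
    """Return a toy warper that divides every logit by the temperature."""
    def warp(logits):
        return [x / temperature for x in logits]
    return warp


def pick_next_token(logits, do_sample, logits_warper=None, rng=None):
    """Choose the next token id from a list of logits (toy version)."""
    if not do_sample:
        # Greedy search: argmax of the raw logits, the warper is never used.
        return max(range(len(logits)), key=lambda i: logits[i])
    if logits_warper is None:
        raise ValueError("sampling needs a logits_warper in this sketch")
    warped = logits_warper(logits)
    peak = max(warped)
    weights = [math.exp(x - peak) for x in warped]  # unnormalised softmax
    rng = rng or random.Random()
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]
greedy_token = pick_next_token(logits, do_sample=False, logits_warper=None)
sampled_token = pick_next_token(
    logits,
    do_sample=True,
    logits_warper=temperature_warper(0.1),
    rng=random.Random(0),
)
print(greedy_token, sampled_token)
```

A very low temperature concentrates almost all probability on the argmax, so in this toy setting temperature=0.1 behaves nearly like greedy search.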

ylacombe · 2024-09-05T12:22:35Z

parler_tts/modeling_parler_tts.py

+        # but don't overwrite the input_ids tensor with the delay pattern mask. We perform that later
+        _, decoder_delay_pattern_mask = self.decoder.build_delay_pattern_mask(


As pointed out, this is a redundant operation that has no impact on the results!

Guppy16 · 2024-09-25

Hmm, I think this line does change the results when using enrolled tokens. Perhaps your setup works because it is slightly different, as you've described below. I'll try this and get back to you.

Guppy16 · 2024-09-26

OK, so my testing shows that this fix is required to get the right audio when doing the enrolment. Here is an example audio file generated with and without the fix:
audio.zip

ylacombe · 2024-09-05T12:26:46Z

helpers/voice_enrolment_demo/enrol.ipynb

I've been able to test this with the following code, which also requires a small modification of the DAC code (adding main_input_name = "input_values" as a class attribute of DACModel):

    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer, set_seed, AutoFeatureExtractor
    import soundfile as sf
    import torchaudio

    device = "cuda" if torch.cuda.is_available() else "cpu"

    model_id = "parler-tts/parler-tts-mini-v1"

    model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

    init_audio, init_sr = torchaudio.load(PATH_TO_SPECIFY)
    init_audio = torchaudio.functional.resample(init_audio, init_sr, model.config.sampling_rate)
    init_audio = init_audio.mean(0)
    init_prompt = "Here, write the transcript of the init audio"

    prompt = "Is it really working ?"
    description = "A man speaker speaks quickly with a low-pitched voice. The recording is of very high quality, with the speaker's voice sounding clear and very close up."  # TODO: adapt the prompt to describe the input audio

    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(init_prompt + " " + prompt, return_tensors="pt").input_ids.to(device)
    input_values = feature_extractor(init_audio, sampling_rate=model.config.sampling_rate, return_tensors="pt").input_values.to(device)

    set_seed(2)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, input_values=input_values)
    audio_arr = generation.cpu().numpy().squeeze()
    sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

I found that Parler has difficulty generalizing to unseen speakers (meaning a speaker that was not seen during training or whose audio was not generated by Parler), so there's no real advantage in using it for voice cloning. However, from my experiments, it works quite well with Parler-generated audio!
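
The main_input_name change mentioned above can be pictured with a toy dispatcher. Everything here (ToyAudioEncoder, ToyTextEncoder, toy_generate) is hypothetical and only sketches the idea that a generate-style method finds a module's primary input through the keyword named in its main_input_name class attribute; the real lookup in transformers and Parler-TTS is more involved.

```python
class ToyAudioEncoder:
    # Stand-in for an audio codec wrapper; the attribute names the keyword
    # argument that carries this module's primary input.
    main_input_name = "input_values"

    def encode(self, values):
        # Fake "audio codes": 1 for positive samples, 0 otherwise.
        return [int(v > 0) for v in values]


class ToyTextEncoder:
    # A module that keeps a different primary input name.
    main_input_name = "input_ids"

    def encode(self, values):
        return list(values)


def toy_generate(encoder, **model_kwargs):
    """Encode the primary input if the caller supplied it, else return None."""
    name = encoder.main_input_name
    if name not in model_kwargs:
        return None  # nothing to encode under the expected keyword
    return encoder.encode(model_kwargs[name])


found = toy_generate(ToyAudioEncoder(), input_values=[0.5, -0.2, 0.1])
missed = toy_generate(ToyTextEncoder(), input_values=[0.5, -0.2, 0.1])
print(found, missed)
```

In this toy, the audio passed as input_values is only picked up when the encoder declares that name, which is the kind of mismatch the class attribute is meant to avoid.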

kaushal-gawri9899 · 2024-09-24

Hey @ylacombe, I tried the above code sample with both the mini and large models, but the generated audio file is noisy and inconsistent. I used input audio generated by ParlerTTS itself.

Guppy16 · 2024-09-26

This is a clean snippet! When calling model.generate(...), is there a preference for using input_values=input_values? I was originally doing something along the lines of decoder_input_ids=input_values.squeeze().long().

Credits:
1. ylacombe - Add input_values to DACModel - dac_wrapper/modeling_dac.py - huggingface#110 (comment)
2. stg2015 - Delay mask adjustment for input_values - modeling_parler_tts.py - huggingface#81 (comment)

Guppy16 · 2024-09-25T18:19:25Z

Thanks a lot for reviewing this, as well as your great suggestions! I'll work on this in the coming few days.

Guppy16 · 2024-09-27T13:27:56Z

I'd rather have an issue opened that explains how to do your voice enrolment thing than a notebook! Would you like to write a guide about this and add it to the issues? I can ping it afterwards

Looks like @apresence has made a start on this! I've added a modified version of your snippet there (#139).

Akash Gupta added 3 commits August 16, 2024 15:52

fix: don't apply delay mask on input ids before sampling

12e455a

fix: set logits warper none for deterministic greedy generation

6124a46

feat: example notebook for providing a voice enrolment to make voice more consistent

cc3b190

This was referenced Aug 16, 2024

Bug in generation code #85

Open

Poor Audio Quality with input_values Input in Parler_TTS #81

Open

Need the abillity to save/re-use a generated voice #14

Open

Guppy16 mentioned this pull request Aug 21, 2024

Speaker voice is not consistent across different generation #112

Open

ylacombe reviewed Sep 5, 2024

b-feldmann mentioned this pull request Sep 9, 2024

Audio Length Limitation and FlashAttention Warning in Parler TTS #126

Open

apresence mentioned this pull request Sep 24, 2024

Prep for Voice Steering feature #141

Open

feat: add main_input_name input_values to DACModel, and remove notebook

fbaf621

Guppy16 mentioned this pull request Sep 27, 2024

Voice Consistency Working Pretty Well -- Plus Zero-Shot Cloning! #139

Open
