Poor Audio Quality with input_values Input in Parler_TTS #81

LiuZH-19 · 2024-07-02T12:32:27Z

I am using the Parler_TTS model with a reference audio (input_values) during inference, similar to MusicGen, to perform continuation tasks.

model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, input_values=input_values)

While the model continues in the style of the reference audio, the resulting audio quality is poor and contains a lot of noise.
Why does the audio quality degrade when using a reference audio input, and how can this be improved?

Thank you!


ylacombe · 2024-08-01T15:56:27Z

We should remove input_values, as it's not used in this model. It's an artifact left over from when I took inspiration from the MusicGen architecture.

stg1205 · 2024-08-14T08:11:43Z

Change the code that follows the comment "# revert the pattern delay mask by filtering the eos and bos token ids from the delay pattern mask" to:

if "input_values" in model_kwargs:
    mask = (output_ids != generation_config.bos_token_id) & (output_ids != generation_config.pad_token_id)
else:
    _, mask = self.decoder.build_delay_pattern_mask(
        input_ids,
        bos_token_id=generation_config.bos_token_id,
        pad_token_id=generation_config.pad_token_id,
        max_length=output_ids.shape[1],
    )
    mask = (mask != generation_config.bos_token_id) & (mask != generation_config.pad_token_id)
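
For readers unfamiliar with the delay pattern, here is a minimal toy sketch in plain Python of what "reverting the delay pattern by filtering BOS and PAD ids" means. It is not the Parler-TTS implementation: the special ids (1025, 1024), the exact layout, and the helper names `apply_delay`, `revert_by_value`, and `revert_by_position` are all assumptions made for illustration. It also shows one way a mask built for a slightly different layout could produce misaligned codes, whereas filtering on the values of the ids themselves cannot.

```python
# Toy illustration only. BOS/PAD values and the layout are assumptions,
# not taken from the Parler-TTS source.
BOS, PAD = 1025, 1024  # hypothetical special ids outside the codebook range


def apply_delay(codes):
    """Shift codebook k right by k + 1 steps: BOS fills the left, PAD the right."""
    num_codebooks = len(codes)
    return [
        [BOS] * (k + 1) + list(row) + [PAD] * (num_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]


def revert_by_value(delayed):
    """Keep every entry that is neither BOS nor PAD, row by row."""
    return [[tok for tok in row if tok not in (BOS, PAD)] for row in delayed]


def revert_by_position(delayed, mask):
    """Keep the entries selected by a precomputed boolean mask."""
    return [
        [tok for tok, keep in zip(row, mask_row) if keep]
        for row, mask_row in zip(delayed, mask)
    ]


codes = [[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]]
delayed = apply_delay(codes)

# Filtering on the ids themselves recovers the original codes.
restored = revert_by_value(delayed)

# A positional mask that assumes a delay of k instead of k + 1 is off by one:
# it keeps a BOS entry and drops the last real code in every row.
num_frames = len(codes[0])
stale_mask = [
    [False] * k + [True] * num_frames + [False] * (len(codes) - k)
    for k in range(len(codes))
]
misaligned = revert_by_position(delayed, stale_mask)

print(restored == codes)    # True
print(misaligned == codes)  # False
```

Whether this is exactly the mismatch that occurs when input_values is passed is not established in this thread; the sketch only illustrates why a value-based mask is robust to layout differences.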

I haven't looked into the details, but for now it works. I found this bug by comparing output_ids with the original input_ids encoded by DAC: output_ids contains some wrong delays.
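
The comparison described above can be sketched as a small check that the generated codes still begin with the prompt codes in every codebook. This is only an illustration on plain Python lists: `first_mismatch` is a hypothetical helper, and in practice the two inputs would be the DAC-encoded prompt and the undelayed generated codes, converted to nested lists.

```python
def first_mismatch(prompt_codes, output_codes):
    """Return (codebook, frame) of the first prompt code that the output
    fails to reproduce, or None if the prompt is preserved as a prefix."""
    for k, (prompt_row, output_row) in enumerate(zip(prompt_codes, output_codes)):
        for t, expected in enumerate(prompt_row):
            if t >= len(output_row) or output_row[t] != expected:
                return (k, t)
    return None


prompt = [[11, 12], [21, 22]]
aligned = [[11, 12, 13], [21, 22, 23]]
shifted = [[11, 12, 13], [22, 23, 24]]  # second codebook is off by one frame

print(first_mismatch(prompt, aligned))  # None
print(first_mismatch(prompt, shifted))  # (1, 0)
```

A mismatch that appears only in the higher codebooks, as in the `shifted` case, would be consistent with a delay being reverted incorrectly.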

Guppy16 · 2024-08-17T14:23:42Z

#110

Does this help? I have an example notebook doing continuation as well. You need to use the decode_input_ids argument and also fix a bug similar to the one @stg1205 showed above.

Credits:
1. ylacombe - Add input_values to DACModel - dac_wrapper/modeling_dac.py - huggingface#110 (comment)
2. stg1205 - Delay mask adjustment for input_values - modeling_parler_tts.py - huggingface#81 (comment)

apresence mentioned this issue Sep 24, 2024

Prep for Voice Steering feature #141

Open
