
autoregression_samples and diffusion_iterations are not implemented in read_fast. #794

Open
TheMechanicX32 opened this issue Jun 20, 2024 · 0 comments

TheMechanicX32 commented Jun 20, 2024

After a pretty hefty dive into the code, I have discovered that diffusion_iterations is a useless parameter: it never makes its way to the decoder in any way. Additionally, if you use the read_fast API, the num_autoregressive_samples parameter is also useless. read_fast has (probably unintentionally) been designed to generate the lowest-quality audio with no room for customization.

As you can see in read_fast.py:

for j, text in enumerate(texts):
    if regenerate is not None and j not in regenerate:
        all_parts.append(load_audio(os.path.join(voice_outpath, f'{j}.wav'), 24000))
        continue
    start_time = time()
    gen = tts.tts(text, voice_samples=voice_samples, use_deterministic_seed=seed)
    end_time = time()
    audio_ = gen.squeeze(0).cpu()
    print("Time taken to generate the audio: ", end_time - start_time, "seconds")
    print("RTF: ", (end_time - start_time) / (audio_.shape[1] / 24000))
    torchaudio.save(os.path.join(voice_outpath, f'{j}.wav'), audio_, 24000)
    all_parts.append(audio_)
full_audio = torch.cat(all_parts, dim=-1)
torchaudio.save(os.path.join(voice_outpath, f"{outname}.wav"), full_audio, 24000)

Notice the line gen = tts.tts(text, voice_samples=voice_samples, use_deterministic_seed=seed). Keep this line in mind as we examine the next section of code.
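To make the consequence of that call concrete: since read_fast supplies only text, voice_samples, and use_deterministic_seed, every other keyword in the tts signature silently binds to its default. A minimal standalone sketch (hypothetical names, not Tortoise code) showing the mechanism:

```python
# Toy illustration: a caller that omits keyword arguments can never
# override them, so the quality knobs are stuck at their defaults.
def tts(text, voice_samples=None, use_deterministic_seed=None,
        num_autoregressive_samples=512, diffusion_iterations=400):
    return locals()  # snapshot of what each parameter actually resolved to

# This mirrors the read_fast call shape: only three arguments are passed.
resolved = tts("hello", voice_samples=["clip.wav"], use_deterministic_seed=42)
print(resolved["num_autoregressive_samples"])  # 512: the caller had no say
print(resolved["diffusion_iterations"])        # 400
```

Even if these two parameters were wired into the decoder, read_fast as written would still always use the defaults.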

Taking a look at the last large function in api_fast.py, we see the tts method (the method invoked on the line above) defined:

def tts(self, text, voice_samples=None, k=1, verbose=True, 
        use_deterministic_seed=None,
        # autoregressive generation parameters follow
        num_autoregressive_samples=512, 
        diffusion_iterations=400,
        temperature=.8, 
        length_penalty=1, 
        repetition_penalty=2.0, 
        top_p=.8, 
        max_mel_tokens=500,
        # CVVP parameters follow
        cvvp_amount=.0,
        **hf_generate_kwargs):
    
    
    """
    Produces an audio clip of the given text being spoken with the given reference voice.
    :param text: Text to be spoken.
    :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data.
    :param conditioning_latents: A tuple of (autoregressive_conditioning_latent, diffusion_conditioning_latent), which
                                 can be provided in lieu of voice_samples. This is ignored unless voice_samples=None.
                                 Conditioning latents can be retrieved via get_conditioning_latents().
    :param k: The number of returned clips. The most likely (as determined by Tortoise's CLVP model) clips are returned.
    :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true.
    ~~AUTOREGRESSIVE KNOBS~~
    :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP.
           As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great".
    :param temperature: The softmax temperature of the autoregressive model.
    :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs.
    :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence
                               of long silences or "uhhhhhhs", etc.
    :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs.
    :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second.
    ~~DIFFUSION KNOBS~~
    :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine
                                 the output, which should theoretically mean a higher quality output. Generally a value above 250 is not noticeably better,
                                 however.
    :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for
                      each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output
                      of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and
                      dramatically improves realism.
    :param cond_free_k: Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf].
                        As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
                        Formula is: output=cond_present_output*(cond_free_k+1)-cond_absent_output*cond_free_k
    :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
                                  are the "mean" prediction of the diffusion network and will sound bland and smeared.
    ~~OTHER STUFF~~
    :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer.
                               Extra keyword args fed to this function get forwarded directly to that API. Documentation
                               here: https://huggingface.co/docs/transformers/internal/generation_utils
    :return: Generated audio clip(s) as a torch tensor. Shape 1,S if k=1 else, (k,1,S) where S is the sample length.
             Sample rate is 24kHz.
    """

    deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)

    text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device)
    text_tokens = F.pad(text_tokens, (0, 1))  # This may not be necessary.
    assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.'
    if voice_samples is not None:
        auto_conditioning = self.get_conditioning_latents(voice_samples, return_mels=False)
    else:
        auto_conditioning  = self.get_random_conditioning_latents()
    auto_conditioning = auto_conditioning.to(self.device)

    with torch.no_grad():
        calm_token = 83  # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
        if verbose:
            print("Generating autoregressive samples..")
        with torch.autocast(
                device_type="cpu", dtype=torch.float16, enabled=self.half
            ):
            codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens,
                                                        top_k=50,
                                                        top_p=top_p,
                                                        temperature=temperature,
                                                        do_sample=True,
                                                        num_beams=1,
                                                        num_return_sequences=1,
                                                        length_penalty=float(length_penalty),
                                                        repetition_penalty=float(repetition_penalty),
                                                        output_attentions=False,
                                                        output_hidden_states=True,
                                                        **hf_generate_kwargs)
            gpt_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
                            torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), codes,
                            torch.tensor([codes.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
                            return_latent=True, clip_inputs=False)
        if verbose:
            print("generating audio..")
        wav_gen = self.hifi_decoder.inference(gpt_latents.to(self.device), auto_conditioning)
        return wav_gen

Please use ctrl+f to search for autoregressive. Notice that num_autoregressive_samples is listed in the documentation as a possible parameter, but at no point is it ever used in the function body. Same goes for diffusion_iterations: a documented parameter that has no effect on the generated audio. (In fact, the body ends by calling self.hifi_decoder.inference, so no diffusion decoder is invoked at all in this path.)
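The claim can also be checked mechanically instead of by ctrl+f. A standalone sketch (not part of Tortoise; SRC is a condensed stand-in for api_fast.tts, not the verbatim source) that parses a function with Python's ast module and reports parameters whose names never appear in the body:

```python
import ast

# Condensed stand-in for api_fast.tts: the parameters the real body uses
# are used here too; the two dead knobs are left untouched, as in the real code.
SRC = '''
def tts(self, text, voice_samples=None, k=1, verbose=True,
        use_deterministic_seed=None,
        num_autoregressive_samples=512,
        diffusion_iterations=400,
        temperature=.8, top_p=.8,
        **hf_generate_kwargs):
    seed = deterministic_state(use_deterministic_seed)
    cond = get_latents(voice_samples)
    if verbose:
        print("Generating autoregressive samples..")
    codes = inference_speech(cond, text, top_p=top_p,
                             temperature=temperature, **hf_generate_kwargs)
    latents = autoregressive(cond.repeat(k, 1), codes)
    return hifi_decoder(latents)
'''

def unused_params(source: str) -> list[str]:
    """Return parameter names that never appear as a name in the function body."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    args = func.args
    names = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
    for extra in (args.vararg, args.kwarg):
        if extra is not None:
            names.append(extra.arg)
    used = {n.id for stmt in func.body for n in ast.walk(stmt)
            if isinstance(n, ast.Name)}
    return [p for p in names if p not in used and p != 'self']

print(unused_params(SRC))  # ['num_autoregressive_samples', 'diffusion_iterations']
```

Run against the full source of api_fast.tts, the same check should flag exactly these two parameters.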

I really enjoy how fast read_fast works, but it generates very shoddy audio. I would like the option to bump up quality in exchange for some generation speed.
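One low-risk first step would be for read_fast to forward extra keyword arguments to tts.tts (whose signature already accepts them), so the knobs at least reach the method and take effect as soon as api_fast wires them up. A minimal sketch of the idea; FakeTTS is a stub standing in for the real TextToSpeech object, and only the **gen_kwargs forwarding is the suggested change:

```python
# Sketch: a read_fast-style loop that forwards quality kwargs instead of
# dropping them. FakeTTS only records what arrives, to show the plumbing.
class FakeTTS:
    def tts(self, text, voice_samples=None, use_deterministic_seed=None,
            **gen_kwargs):
        # The real method would pass these on to the sampler/decoder;
        # here we just capture them to prove they were not swallowed.
        self.received = gen_kwargs
        return f"audio({text})"

def generate(tts, texts, voice_samples=None, seed=None, **gen_kwargs):
    parts = []
    for text in texts:
        # The one-line change: **gen_kwargs is forwarded to tts.tts.
        parts.append(tts.tts(text, voice_samples=voice_samples,
                             use_deterministic_seed=seed, **gen_kwargs))
    return parts

tts = FakeTTS()
parts = generate(tts, ["hello"], num_autoregressive_samples=96,
                 temperature=0.6)
print(tts.received)  # {'num_autoregressive_samples': 96, 'temperature': 0.6}
```

This does not by itself fix the dead parameters inside api_fast.tts, but it removes the second layer of the problem: read_fast hard-coding the lowest-quality call.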
