
autoregression_samples and diffusion_iterations are not implemented in read_fast. #794

Open
TheMechanicX32 opened this issue Jun 20, 2024 · 0 comments

TheMechanicX32 commented Jun 20, 2024

After a pretty hefty dive into the code, I have discovered that diffusion_iterations is a useless parameter: it never makes its way to the decoder in any way. Additionally, if you use the read_fast API, the num_autoregressive_samples parameter is also useless. read_fast has (probably unintentionally) been designed to generate the lowest-quality audio with no room for customization.

As you can see in read_fast.py:

for j, text in enumerate(texts):
    if regenerate is not None and j not in regenerate:
        all_parts.append(load_audio(os.path.join(voice_outpath, f'{j}.wav'), 24000))
        continue
    start_time = time()
    gen = tts.tts(text, voice_samples=voice_samples, use_deterministic_seed=seed)
    end_time = time()
    audio_ = gen.squeeze(0).cpu()
    print("Time taken to generate the audio: ", end_time - start_time, "seconds")
    print("RTF: ", (end_time - start_time) / (audio_.shape[1] / 24000))
    torchaudio.save(os.path.join(voice_outpath, f'{j}.wav'), audio_, 24000)
    all_parts.append(audio_)
full_audio = torch.cat(all_parts, dim=-1)
torchaudio.save(os.path.join(voice_outpath, f"{outname}.wav"), full_audio, 24000)

Notice the line gen = tts.tts(text, voice_samples=voice_samples, use_deterministic_seed=seed). Keep this line in mind as we examine the next section of code.
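To make the consequence of that call concrete: since read_fast supplies only text, voice_samples, and use_deterministic_seed, every other keyword in the tts signature silently binds to its default. A minimal standalone sketch (hypothetical names, not Tortoise code) showing the mechanism:

```python
# Toy illustration: a caller that omits keyword arguments can never
# override them, so the quality knobs are stuck at their defaults.
def tts(text, voice_samples=None, use_deterministic_seed=None,
        num_autoregressive_samples=512, diffusion_iterations=400):
    return locals()  # snapshot of what each parameter actually resolved to

# This mirrors the read_fast call shape: only three arguments are passed.
resolved = tts("hello", voice_samples=["clip.wav"], use_deterministic_seed=42)
print(resolved["num_autoregressive_samples"])  # 512: the caller had no say
print(resolved["diffusion_iterations"])        # 400
```

Even if these two parameters were wired into the decoder, read_fast as written would still always use the defaults.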

Taking a look at the last large function in api_fast.py, we see the tts method (the method invoked on the line above) defined:

def tts(self, text, voice_samples=None, k=1, verbose=True, 
        use_deterministic_seed=None,
        # autoregressive generation parameters follow
        num_autoregressive_samples=512, 
        diffusion_iterations=400,
        temperature=.8, 
        length_penalty=1, 
        repetition_penalty=2.0, 
        top_p=.8, 
        max_mel_tokens=500,
        # CVVP parameters follow
        cvvp_amount=.0,
        **hf_generate_kwargs):
    
    
    """
    Produces an audio clip of the given text being spoken with the given reference voice.
    :param text: Text to be spoken.
    :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data.
    :param conditioning_latents: A tuple of (autoregressive_conditioning_latent, diffusion_conditioning_latent), which
                                 can be provided in lieu of voice_samples. This is ignored unless voice_samples=None.
                                 Conditioning latents can be retrieved via get_conditioning_latents().
    :param k: The number of returned clips. The most likely (as determined by Tortoise's CLVP model) clips are returned.
    :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true.
    ~~AUTOREGRESSIVE KNOBS~~
    :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP.
           As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great".
    :param temperature: The softmax temperature of the autoregressive model.
    :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs.
    :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence
                               of long silences or "uhhhhhhs", etc.
    :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs.
    :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second.
    ~~DIFFUSION KNOBS~~
    :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine
                                 the output, which should theoretically mean a higher quality output. Generally a value above 250 is not noticeably better,
                                 however.
    :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for
                      each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output
                      of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and
                      dramatically improves realism.
    :param cond_free_k: Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf].
                        As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
                        Formula is: output=cond_present_output*(cond_free_k+1)-cond_absent_output*cond_free_k
    :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
                                  are the "mean" prediction of the diffusion network and will sound bland and smeared.
    ~~OTHER STUFF~~
    :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer.
                               Extra keyword args fed to this function get forwarded directly to that API. Documentation
                               here: https://huggingface.co/docs/transformers/internal/generation_utils
    :return: Generated audio clip(s) as a torch tensor. Shape 1,S if k=1 else, (k,1,S) where S is the sample length.
             Sample rate is 24kHz.
    """

    deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)

    text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device)
    text_tokens = F.pad(text_tokens, (0, 1))  # This may not be necessary.
    assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.'
    if voice_samples is not None:
        auto_conditioning = self.get_conditioning_latents(voice_samples, return_mels=False)
    else:
        auto_conditioning  = self.get_random_conditioning_latents()
    auto_conditioning = auto_conditioning.to(self.device)

    with torch.no_grad():
        calm_token = 83  # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
        if verbose:
            print("Generating autoregressive samples..")
        with torch.autocast(
                device_type="cpu", dtype=torch.float16, enabled=self.half
            ):
            codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens,
                                                        top_k=50,
                                                        top_p=top_p,
                                                        temperature=temperature,
                                                        do_sample=True,
                                                        num_beams=1,
                                                        num_return_sequences=1,
                                                        length_penalty=float(length_penalty),
                                                        repetition_penalty=float(repetition_penalty),
                                                        output_attentions=False,
                                                        output_hidden_states=True,
                                                        **hf_generate_kwargs)
            gpt_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
                            torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), codes,
                            torch.tensor([codes.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
                            return_latent=True, clip_inputs=False)
        if verbose:
            print("generating audio..")
        wav_gen = self.hifi_decoder.inference(gpt_latents.to(self.device), auto_conditioning)
        return wav_gen

Please use ctrl+f to search for autoregressive. Notice that num_autoregressive_samples is listed in the documentation as a possible parameter, but at no point is it ever used in the function body. Same goes for diffusion_iterations: a documented parameter that has no effect on the generated audio. (In fact, the body ends by calling self.hifi_decoder.inference, so no diffusion decoder is invoked at all in this path.)
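The claim can also be checked mechanically instead of by ctrl+f. A standalone sketch (not part of Tortoise; SRC is a condensed stand-in for api_fast.tts, not the verbatim source) that parses a function with Python's ast module and reports parameters whose names never appear in the body:

```python
import ast

# Condensed stand-in for api_fast.tts: the parameters the real body uses
# are used here too; the two dead knobs are left untouched, as in the real code.
SRC = '''
def tts(self, text, voice_samples=None, k=1, verbose=True,
        use_deterministic_seed=None,
        num_autoregressive_samples=512,
        diffusion_iterations=400,
        temperature=.8, top_p=.8,
        **hf_generate_kwargs):
    seed = deterministic_state(use_deterministic_seed)
    cond = get_latents(voice_samples)
    if verbose:
        print("Generating autoregressive samples..")
    codes = inference_speech(cond, text, top_p=top_p,
                             temperature=temperature, **hf_generate_kwargs)
    latents = autoregressive(cond.repeat(k, 1), codes)
    return hifi_decoder(latents)
'''

def unused_params(source: str) -> list[str]:
    """Return parameter names that never appear as a name in the function body."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    args = func.args
    names = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
    for extra in (args.vararg, args.kwarg):
        if extra is not None:
            names.append(extra.arg)
    used = {n.id for stmt in func.body for n in ast.walk(stmt)
            if isinstance(n, ast.Name)}
    return [p for p in names if p not in used and p != 'self']

print(unused_params(SRC))  # ['num_autoregressive_samples', 'diffusion_iterations']
```

Run against the full source of api_fast.tts, the same check should flag exactly these two parameters.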

I really enjoy how fast read_fast works, but it generates very shoddy audio. I would like the option to bump up quality in exchange for some generation speed.
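One low-risk first step would be for read_fast to forward extra keyword arguments to tts.tts (whose signature already accepts them), so the knobs at least reach the method and take effect as soon as api_fast wires them up. A minimal sketch of the idea; FakeTTS is a stub standing in for the real TextToSpeech object, and only the **gen_kwargs forwarding is the suggested change:

```python
# Sketch: a read_fast-style loop that forwards quality kwargs instead of
# dropping them. FakeTTS only records what arrives, to show the plumbing.
class FakeTTS:
    def tts(self, text, voice_samples=None, use_deterministic_seed=None,
            **gen_kwargs):
        # The real method would pass these on to the sampler/decoder;
        # here we just capture them to prove they were not swallowed.
        self.received = gen_kwargs
        return f"audio({text})"

def generate(tts, texts, voice_samples=None, seed=None, **gen_kwargs):
    parts = []
    for text in texts:
        # The one-line change: **gen_kwargs is forwarded to tts.tts.
        parts.append(tts.tts(text, voice_samples=voice_samples,
                             use_deterministic_seed=seed, **gen_kwargs))
    return parts

tts = FakeTTS()
parts = generate(tts, ["hello"], num_autoregressive_samples=96,
                 temperature=0.6)
print(tts.received)  # {'num_autoregressive_samples': 96, 'temperature': 0.6}
```

This does not by itself fix the dead parameters inside api_fast.tts, but it removes the second layer of the problem: read_fast hard-coding the lowest-quality call.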
