After a fairly deep dive into the code, I have discovered that `diffusion_iterations` is a useless parameter: it never makes its way to the decoder in any way. Additionally, if you use the `read_fast` API, the `num_autoregressive_samples` parameter is also useless. `read_fast` has (probably unintentionally) been designed to generate the lowest-quality audio, with no room for customization.
As you can see in `read_fast.py`:
```python
for j, text in enumerate(texts):
    if regenerate is not None and j not in regenerate:
        all_parts.append(load_audio(os.path.join(voice_outpath, f'{j}.wav'), 24000))
        continue
    start_time = time()
    gen = tts.tts(text, voice_samples=voice_samples, use_deterministic_seed=seed)
    end_time = time()
    audio_ = gen.squeeze(0).cpu()
    print("Time taken to generate the audio: ", end_time - start_time, "seconds")
    print("RTF: ", (end_time - start_time) / (audio_.shape[1] / 24000))
    torchaudio.save(os.path.join(voice_outpath, f'{j}.wav'), audio_, 24000)
    all_parts.append(audio_)
full_audio = torch.cat(all_parts, dim=-1)
torchaudio.save(os.path.join(voice_outpath, f"{outname}.wav"), full_audio, 24000)
```
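As an aside, the RTF ("real-time factor") printed in that loop is wall-clock generation time divided by the duration of the produced audio, so values below 1.0 mean faster than real time. A minimal standalone version of that computation (the helper name is mine, not from the repo):

```python
def rtf(wall_seconds: float, num_samples: int, sample_rate: int = 24000) -> float:
    """Real-time factor: generation time divided by the duration of the audio."""
    audio_seconds = num_samples / sample_rate
    return wall_seconds / audio_seconds

# 2 s of compute for 4 s of 24 kHz audio -> RTF of 0.5 (twice real time)
print(rtf(2.0, 4 * 24000))  # → 0.5
```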
Notice the line `gen = tts.tts(text, voice_samples=voice_samples, use_deterministic_seed=seed)`. Keep it in mind as we examine the next section of code.
Looking at the last large function in `api_fast.py`, we find the definition of the `tts` method (the method called to produce `gen`):
```python
def tts(self, text, voice_samples=None, k=1, verbose=True,
        use_deterministic_seed=None,
        # autoregressive generation parameters follow
        num_autoregressive_samples=512,
        diffusion_iterations=400,
        temperature=.8,
        length_penalty=1,
        repetition_penalty=2.0,
        top_p=.8,
        max_mel_tokens=500,
        # CVVP parameters follow
        cvvp_amount=.0,
        **hf_generate_kwargs):
    """
    Produces an audio clip of the given text being spoken with the given reference voice.
    :param text: Text to be spoken.
    :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data.
    :param conditioning_latents: A tuple of (autoregressive_conditioning_latent, diffusion_conditioning_latent), which
                                 can be provided in lieu of voice_samples. This is ignored unless voice_samples=None.
                                 Conditioning latents can be retrieved via get_conditioning_latents().
    :param k: The number of returned clips. The most likely (as determined by Tortoise's CLVP model) clips are returned.
    :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true.
    ~~AUTOREGRESSIVE KNOBS~~
    :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP.
                                       As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great".
    :param temperature: The softmax temperature of the autoregressive model.
    :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs.
    :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce
                               the incidence of long silences or "uhhhhhhs", etc.
    :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs.
    :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second.
    ~~DIFFUSION KNOBS~~
    :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively
                                 refine the output, which should theoretically mean a higher quality output. Generally a value above 250 is
                                 not noticeably better, however.
    :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for
                      each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output
                      of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and
                      dramatically improves realism.
    :param cond_free_k: Knob that determines how to balance the conditioning-free signal with the conditioning-present signal. [0,inf].
                        As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
                        Formula is: output=cond_present_output*(cond_free_k+1)-cond_absent_output*cond_free_k
    :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
                                  are the "mean" prediction of the diffusion network and will sound bland and smeared.
    ~~OTHER STUFF~~
    :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer.
                               Extra keyword args fed to this function get forwarded directly to that API. Documentation
                               here: https://huggingface.co/docs/transformers/internal/generation_utils
    :return: Generated audio clip(s) as a torch tensor. Shape 1,S if k=1 else, (k,1,S) where S is the sample length.
             Sample rate is 24kHz.
    """
    deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)
    text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device)
    text_tokens = F.pad(text_tokens, (0, 1))  # This may not be necessary.
    assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.'
    if voice_samples is not None:
        auto_conditioning = self.get_conditioning_latents(voice_samples, return_mels=False)
    else:
        auto_conditioning = self.get_random_conditioning_latents()
    auto_conditioning = auto_conditioning.to(self.device)
    with torch.no_grad():
        calm_token = 83  # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
        if verbose:
            print("Generating autoregressive samples..")
        with torch.autocast(
            device_type="cpu", dtype=torch.float16, enabled=self.half
        ):
            codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens,
                                                         top_k=50,
                                                         top_p=top_p,
                                                         temperature=temperature,
                                                         do_sample=True,
                                                         num_beams=1,
                                                         num_return_sequences=1,
                                                         length_penalty=float(length_penalty),
                                                         repetition_penalty=float(repetition_penalty),
                                                         output_attentions=False,
                                                         output_hidden_states=True,
                                                         **hf_generate_kwargs)
        gpt_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
                                          torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), codes,
                                          torch.tensor([codes.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
                                          return_latent=True, clip_inputs=False)
        if verbose:
            print("generating audio..")
        wav_gen = self.hifi_decoder.inference(gpt_latents.to(self.device), auto_conditioning)
        return wav_gen
```
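The bug pattern is easy to reproduce in isolation: a keyword argument that is accepted (and documented) but never read in the body cannot affect the result. A toy illustration, using a hypothetical stub in place of the real model, of why passing these knobs to this `tts` signature is a no-op:

```python
def tts_stub(text, diffusion_iterations=400, num_autoregressive_samples=512):
    """Mimics the api_fast.tts signature: both knobs are accepted but never used."""
    # The body only ever touches `text`, so the two "quality" parameters are dead.
    return f"audio({text})"

# Any value for the quality knobs yields identical output.
low = tts_stub("hello", diffusion_iterations=1, num_autoregressive_samples=1)
high = tts_stub("hello", diffusion_iterations=4000, num_autoregressive_samples=512)
assert low == high
```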
Search the function (Ctrl+F) for "autoregressive". Notice that `num_autoregressive_samples` is listed in the documentation as a parameter, but at no point is it ever used. The same goes for `diffusion_iterations`: a documented parameter that has no effect on the generated audio.
I really enjoy how fast `read_fast` is, but it generates very shoddy audio. I would like the option to trade some generation speed for higher quality.
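One possible shape for such a fix (purely illustrative, with stubs standing in for the real model; the names `tts` and `read_fast` mirror the repo but the bodies are hypothetical): `read_fast` would accept the quality knobs and forward them into `tts()`, which in turn would have to actually pass them on to the sampling and decoding stages rather than drop them.

```python
def tts(text, num_autoregressive_samples=512, **kwargs):
    # In a real fix this value must reach inference_speech / the decoder;
    # here the stub just proves the value survives the forwarding chain.
    return {"text": text, "samples_used": num_autoregressive_samples}

def read_fast(texts, **quality_kwargs):
    # Forward caller-supplied quality knobs instead of hard-coding defaults.
    return [tts(t, **quality_kwargs) for t in texts]

out = read_fast(["part one", "part two"], num_autoregressive_samples=64)
assert all(part["samples_used"] == 64 for part in out)
```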