I read the paper and noticed that when training the AR model, the speaker condition is another clip of the same person speaking, while when training the diffusion model, the speaker condition appears to be a clip taken from the target speech itself. Why the different design? What would happen if we used the target speech itself as the speaker condition when training the AR model, or another sample from the same speaker as the speaker condition when training the diffusion model? What is the reasoning behind this? Thanks.
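To make sure I understand the distinction, here is a minimal sketch of the two conditioning strategies as I read them from the paper. All names (`dataset`, `utterances`, `waveform`, etc.) are hypothetical, not from the actual training code:

```python
import random

# Hypothetical illustration of the two speaker-conditioning strategies
# described in the paper; dataset/field names are made up for clarity.

def ar_speaker_cond(dataset, speaker_id, target_utt_id):
    """AR model training: the speaker condition is a DIFFERENT utterance
    by the same speaker (cross-utterance reference)."""
    others = [u for u in dataset.utterances(speaker_id) if u.id != target_utt_id]
    return random.choice(others).waveform

def diffusion_speaker_cond(target_waveform, cond_len):
    """Diffusion model training: the speaker condition is a segment cut
    from the target utterance itself (same-utterance reference)."""
    start = random.randint(0, max(0, len(target_waveform) - cond_len))
    return target_waveform[start:start + cond_len]
```

My question is essentially: why not swap these two strategies, or use the same one for both stages?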