Hi everyone,

I am currently experimenting with streaming audio into the model. The idea is to improve latency by ingesting parts of the audio before the user has finished their utterance. Since we already run computation while the user is speaking, the model can start answering sooner.

This is what I came up with:
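Roughly, it looks like the sketch below. The encoder and model are stubbed out, and the chunk size, frame rate, and embedding dimension are just placeholders, since the exact interfaces depend on the setup:

```python
import numpy as np

SAMPLE_RATE = 16_000        # assumed capture rate
CHUNK_SECONDS = 1.0         # hand audio over to the model in 1 s pieces


class DummyEncoder:
    """Stand-in for the speech encoder (not the real API)."""

    def encode(self, chunk: np.ndarray) -> np.ndarray:
        frames = max(1, len(chunk) // 320)          # ~20 ms frames at 16 kHz
        return np.zeros((frames, 1024), dtype=np.float32)


class DummyModel:
    """Stand-in for the LLM backend (not the real API)."""

    def __init__(self):
        self.segments = []

    def prefill(self, audio_embeds: np.ndarray) -> None:
        self.segments.append(audio_embeds)          # pretend to extend the KV cache

    def generate(self) -> str:
        return f"<response after {len(self.segments)} audio segments>"


def stream_utterance(mic_chunks, encoder, model):
    """Encode and ingest each chunk while the user is still speaking,
    then ask for the answer as soon as the utterance ends."""
    for chunk in mic_chunks:                        # chunks arrive in real time
        audio_embeds = encoder.encode(chunk)        # each chunk encoded separately
        model.prefill(audio_embeds)                 # ingested as its own segment
    return model.generate()


if __name__ == "__main__":
    chunks = [np.zeros(int(SAMPLE_RATE * CHUNK_SECONDS), dtype=np.float32)
              for _ in range(3)]                    # stand-in for a 3 s utterance
    print(stream_utterance(chunks, DummyEncoder(), DummyModel()))
```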
Is there any more efficient approach to this?

Thank you for your help.

Best regards,
Felix

That's one way to do it, but it's not quite right, since the audio segments are encoded separately. On top of that, they are treated as separate speaker turns, so you end up wasting compute on inference over the intermediate segments. The latency incurred by these additional inference steps would be significantly higher than that of the speech encoding itself.

We experimented with block-wise unidirectional encoding for the speech encoder. You can find a config here:
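The idea behind block-wise unidirectional encoding is that each block of encoder frames attends only to itself and to earlier blocks, never to future ones, so incoming audio can be encoded incrementally without re-encoding what came before. A minimal sketch of such a block-causal attention mask (the block size and shapes are illustrative and not taken from the linked config):

```python
import numpy as np

def block_causal_mask(num_frames: int, block_size: int) -> np.ndarray:
    """Allow each frame to attend to its own block and to earlier blocks,
    but never to frames in future blocks (illustrative sizes only)."""
    block_ids = np.arange(num_frames) // block_size
    return block_ids[:, None] >= block_ids[None, :]   # True = attention allowed

# Example: 12 frames in blocks of 4. Frames 0-3 see only block 0,
# frames 4-7 see blocks 0-1, frames 8-11 see blocks 0-2.
print(block_causal_mask(12, 4).astype(int))
```

Because no block ever looks ahead, a new chunk of audio only requires encoding its own block on top of the cached earlier ones, and the model can keep seeing one growing speech segment rather than a series of separate turns.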