Goal
Currently, if you interrupt ichigo, it will just start a new turn (at best).
Description
Using the empirical results from this paper (https://arxiv.org/abs/2406.15718), you can turn the LLM into a non-turn-based model, which means that if you interrupt it, it will continue the previous conversation with new ideas for you instead of starting a new turn.
This approach requires specialized datasets, which have been scarce so far. However, the Duplex-UltraChat dataset could solve this issue. The model will be costly and difficult to train, but once it is trained, implementation should be easier.
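For intuition, here is a rough sketch of what a time-sliced duplex sample could look like, following the time-division idea from the referenced paper. The `<idle>` marker, role names, and slice boundaries are illustrative assumptions, not the actual Duplex-UltraChat schema:

```python
# Illustrative only: a duplex conversation flattened into alternating time
# slices, where the assistant emits "<idle>" while the user is still talking.
# The "<idle>" token name and slice granularity are assumptions.
sample = [
    {"role": "user",      "slice": "Hey, can you explain how"},
    {"role": "assistant", "slice": "<idle>"},                    # user still speaking
    {"role": "user",      "slice": "transformers handle long inputs?"},
    {"role": "assistant", "slice": "Sure. Transformers use attention to"},
    {"role": "user",      "slice": "Actually, keep it short."},  # interruption mid-answer
    {"role": "assistant", "slice": "In short: every token attends to every other."},
]
```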
System-level duplex dedicates separate models to listening and generation. The listening model acts as an advanced VAD plus turn-taking predictor. An earlier concept from Google for an ASR-only listening model also exists (https://arxiv.org/pdf/2208.13321), which can take acoustic cues into account. However, with an LLM at the wheel, the linguistic context can be very useful for making better predictions.
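A minimal sketch of what this system-level setup could look like, assuming hypothetical interfaces for the mic, ASR, listening model, generator, and TTS (none of these are real ichigo APIs):

```python
# System-level duplex: a dedicated listening model gates a separate generator.
# All collaborator interfaces and the threshold value are assumptions.
def system_level_duplex(mic, asr, listener, generator, tts, eot_threshold=0.8):
    transcript = []
    while mic.is_open():
        chunk = mic.read(seconds=0.5)              # small frames for low latency
        transcript.append(asr.transcribe(chunk))   # running linguistic context
        # The listening model sees the transcript, not just acoustics, and
        # predicts whether the user's turn has ended (a single EOS-style call).
        if listener.end_of_turn_prob(transcript) > eot_threshold:
            tts.play(generator.respond(transcript))  # separate generation model
            transcript.clear()
```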
Evaluation
Approach 2 allows us to train multiple smaller specialized models and use logic to control the overall system, which might produce a more explainable system than Approach 1, where the model makes all the decisions.
Another underappreciated downside of Approach 1 models is that the model needs to constantly predict silence tokens for the entire time a user is speaking, whereas in Approach 2 the listening model only needs to predict a single EOS token. Additionally, the extra silence-token predictions can introduce "phantom" latency in the response.
That said, all of these downsides could very well disappear with sufficiently high-quality data and enough training. Duplex-UltraChat is a text-only dataset, so we would be training a model to make turn-taking predictions based only on linguistic cues.
Proposal
With the T2S approach of ichigo, we can try training on Duplex-UltraChat with minimal changes. If it works, the changes in implementation we would need to make are as follows (a rough sketch follows these steps):
Once the mic has been turned on, send WhisperVQ tokens to ichigo 2 seconds at a time.
Based on what ichigo generates, intercept any "idle" token tags to prevent them from being sent to the TTS.
When ichigo generates a new non-idle token while the TTS is still speaking, stop the current TTS playback and start playing the new response.
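Here is a hedged sketch of that loop. `whisper_vq.encode`, `ichigo.stream_generate`, the `tts` interface, and the `<|idle|>` tag are all illustrative placeholders; the real idle-token tag would be whatever the model is trained to emit:

```python
# A sketch of the proposed streaming loop; all names here are hypothetical.
IDLE_TAG = "<|idle|>"  # assumed idle-token tag

def run_duplex(mic, whisper_vq, ichigo, tts):
    while mic.is_open():
        audio = mic.read(seconds=2)              # step 1: 2 seconds at a time
        sound_tokens = whisper_vq.encode(audio)  # WhisperVQ tokens for ichigo
        for token in ichigo.stream_generate(sound_tokens):
            if token == IDLE_TAG:
                continue                         # step 2: idle tokens never reach the TTS
            if tts.is_speaking():
                tts.stop()                       # step 3: barge-in cuts the old response
            tts.feed(token)                      # stream the new response to the TTS
```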
Resources
https://arxiv.org/abs/2406.15718