
epic: Introducing naturalness for interruption #100

Open
tikikun opened this issue Oct 25, 2024 · 1 comment


tikikun commented Oct 25, 2024

Goal

Currently, if you interrupt ichigo it will just start a new turn (at best).

Description

Based on the empirical results from the paper below, you can turn the LLM into a non-turn-based model, meaning that if you interrupt it, it will continue its previous reply while folding in your new input, instead of starting a new turn.

Resources

https://arxiv.org/abs/2406.15718


@tikikun tikikun added the type: epic A major feature or initiative label Oct 25, 2024
@tikikun tikikun added this to the Ichigo v0.4 milestone Oct 25, 2024

PodsAreAllYouNeed commented Nov 6, 2024

There have been a number of duplex models released recently, with two approaches to duplex:

Approach 1: Model-level duplex (1 model with 2 input streams)
Moshi: https://arxiv.org/abs/2410.00037
Hertz-dev: https://github.com/Standard-Intelligence/hertz-dev

This approach requires specialized datasets, which have been scarce so far. However, the Duplex-UltraChat dataset could solve this issue. It will be costly and difficult to train, but once it is trained, implementation should be easier.

Approach 2: System-level duplex (2 models with 2 input streams)
VITA: https://arxiv.org/pdf/2408.05211

System-level duplex dedicates separate models to listening and generating. The listening model acts as an advanced VAD + turn-taking predictor. An earlier concept by Google for an ASR-only listening model also exists (https://arxiv.org/pdf/2208.13321), which can take acoustic cues into account. However, with an LLM at the wheel, the linguistic context can be very useful for making better predictions.
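A minimal sketch of how this system-level wiring could look, assuming a small listener model that emits turn-taking decisions and a separate generator model; all class and method names here are hypothetical and are not taken from VITA or ichigo:

```python
from dataclasses import dataclass

@dataclass
class ListenerDecision:
    should_respond: bool    # listener predicts the user has yielded the turn
    should_interrupt: bool  # user barged in while the generator is speaking

class DuplexController:
    """Hypothetical controller gluing a listener, a generator, and a TTS."""

    def __init__(self, listener, generator, tts):
        self.listener = listener    # advanced VAD + turn-taking predictor
        self.generator = generator  # main LLM that produces the reply
        self.tts = tts              # downstream speech synthesizer

    def on_audio_chunk(self, chunk):
        decision: ListenerDecision = self.listener.step(chunk)
        if decision.should_interrupt and self.tts.is_speaking():
            # Barge-in: stop playback so the generator can continue from
            # the existing context instead of starting a fresh turn.
            self.tts.stop()
        if decision.should_respond:
            reply = self.generator.continue_dialogue(chunk)
            self.tts.speak(reply)
```

Because the turn-taking decision lives in explicit controller logic rather than inside a single end-to-end model, this split is also what makes the system easier to inspect and tune.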

Evaluation

Approach 2 allows us to train multiple smaller specialized models and use explicit logic to control the overall system, which might produce a more explainable system than Approach 1, where the model makes all the decisions.
Another underappreciated downside of Approach 1 models is that the model needs to constantly predict silence tokens for the entire time a user is speaking, whereas for Approach 2 the listening model only needs to predict a single EOS token. Additionally, extra silence-token predictions can lead to "phantom" latency in the response.
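As a rough back-of-envelope on the "phantom" latency point (the 12.5 Hz token rate is an assumed codec frame rate, not a measured number for any of the models above):

```python
# Each spurious silence token emitted before the first real speech token
# adds one codec frame of dead air to the start of the reply.
token_rate_hz = 12.5           # assumed codec frame rate (tokens per second)
spurious_silence_tokens = 5    # silence tokens leaking into the response
phantom_latency_s = spurious_silence_tokens / token_rate_hz
print(f"{phantom_latency_s * 1000:.0f} ms of dead air")  # -> 400 ms
```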

That said, all these downsides could very well disappear with sufficient-quality data and enough training. Duplex-UltraChat is a text-only dataset, so we would be training a model to make turn-taking predictions based only on linguistic cues.

Proposal

With the T2S approach of ichigo, we can try training on Duplex-UltraChat with minimal changes. If it works, the implementation changes we would need to make are as follows (a rough sketch follows the list):

  1. Once the mic has been turned on, send WhisperVQ tokens to ichigo 2 seconds at a time.
  2. Based on what is generated by ichigo, intercept any "idle" token tags to prevent them from being sent to the TTS.
  3. When a new non-idle token is generated by ichigo and the TTS is still speaking, stop the current TTS and start playing the new response.
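A minimal sketch of what that client-side loop could look like, assuming ichigo exposes a streaming generate call and emits a dedicated "idle" token tag; every name below (`whisper_vq_encode`, `ichigo.stream_generate`, `IDLE_TOKEN`, the `tts` interface) is an assumption for illustration, not the project's actual API:

```python
IDLE_TOKEN = "<|idle|>"   # assumed tag the model emits to mean "keep listening"
CHUNK_SECONDS = 2.0       # step 1: send WhisperVQ tokens ~2 seconds at a time

def duplex_loop(mic, whisper_vq_encode, ichigo, tts):
    """Hypothetical loop implementing steps 1-3 above."""
    while mic.is_open():
        audio = mic.read(seconds=CHUNK_SECONDS)       # step 1
        sound_tokens = whisper_vq_encode(audio)
        for token in ichigo.stream_generate(sound_tokens):
            if token == IDLE_TOKEN:
                continue                              # step 2: never forward idle tags to TTS
            if tts.is_speaking():
                tts.stop()                            # step 3: cut off the stale response
            tts.feed(token)                           # start playing the new response
```

In a real client this loop would run asynchronously so that reading from the mic, generation, and TTS playback can overlap.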
