
epic: Introducing naturalness for interruption #100

Open
tikikun opened this issue Oct 25, 2024 · 1 comment


tikikun commented Oct 25, 2024

Goal

Currently, if you interrupt ichigo it will just start a new turn (at best).

Description

Based on the empirical results from the paper below, you can turn the LLM into a non-turn-based model, meaning that if you interrupt it, it will continue its previous reply while folding in your new input, instead of starting a new turn.

Resources

https://arxiv.org/abs/2406.15718


@tikikun tikikun added the type: epic A major feature or initiative label Oct 25, 2024
@tikikun tikikun added this to the Ichigo v0.4 milestone Oct 25, 2024

PodsAreAllYouNeed commented Nov 6, 2024

There have been a number of duplex models released recently, with two approaches to duplex:

Approach 1: Model-level duplex (1 model with 2 input streams)
Moshi: https://arxiv.org/abs/2410.00037
Hertz-dev: https://github.com/Standard-Intelligence/hertz-dev

This approach requires specialized datasets, which have been scarce so far. However, the Duplex-UltraChat dataset could solve this issue. It will be costly and difficult to train, but once it is trained, implementation should be easier.

Approach 2: System-level duplex (2 models with 2 input streams)
VITA: https://arxiv.org/pdf/2408.05211

System-level duplex dedicates separate models to listening and generating. The listening model acts as an advanced VAD + turn-taking predictor. An earlier concept by Google for an ASR-only listening model also exists (https://arxiv.org/pdf/2208.13321), which can take acoustic cues into account. However, with an LLM at the wheel, the linguistic context can be very useful for making better predictions.
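A minimal sketch of how this system-level wiring could look, assuming a small listener model that emits turn-taking decisions and a separate generator model; all class and method names here are hypothetical and are not taken from VITA or ichigo:

```python
from dataclasses import dataclass

@dataclass
class ListenerDecision:
    should_respond: bool    # listener predicts the user has yielded the turn
    should_interrupt: bool  # user barged in while the generator is speaking

class DuplexController:
    """Hypothetical controller gluing a listener, a generator, and a TTS."""

    def __init__(self, listener, generator, tts):
        self.listener = listener    # advanced VAD + turn-taking predictor
        self.generator = generator  # main LLM that produces the reply
        self.tts = tts              # downstream speech synthesizer

    def on_audio_chunk(self, chunk):
        decision: ListenerDecision = self.listener.step(chunk)
        if decision.should_interrupt and self.tts.is_speaking():
            # Barge-in: stop playback so the generator can continue from
            # the existing context instead of starting a fresh turn.
            self.tts.stop()
        if decision.should_respond:
            reply = self.generator.continue_dialogue(chunk)
            self.tts.speak(reply)
```

Because the turn-taking decision lives in explicit controller logic rather than inside a single end-to-end model, this split is also what makes the system easier to inspect and tune.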

Evaluation

Approach 2 allows us to train multiple smaller specialized models and use explicit logic to control the overall system, which might produce a more explainable system than Approach 1, where the model makes all the decisions.
Another underappreciated downside of Approach 1 models is that the model needs to constantly predict silence tokens for the entire time a user is speaking, whereas for Approach 2 the listening model only needs to predict a single EOS token. Additionally, extra silence-token predictions can lead to "phantom" latency in the response.
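As a rough back-of-envelope on the "phantom" latency point (the 12.5 Hz token rate is an assumed codec frame rate, not a measured number for any of the models above):

```python
# Each spurious silence token emitted before the first real speech token
# adds one codec frame of dead air to the start of the reply.
token_rate_hz = 12.5           # assumed codec frame rate (tokens per second)
spurious_silence_tokens = 5    # silence tokens leaking into the response
phantom_latency_s = spurious_silence_tokens / token_rate_hz
print(f"{phantom_latency_s * 1000:.0f} ms of dead air")  # -> 400 ms
```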

That said, all these downsides could very well disappear with sufficient-quality data and enough training. Duplex-UltraChat is a text-only dataset, so we would be training a model to make turn-taking predictions based only on linguistic cues.

Proposal

With the T2S approach of ichigo, we can try training on Duplex-UltraChat with minimal changes. If it works, the implementation changes we would need to make are as follows (a rough sketch follows the list):

  1. Once the mic has been turned on, send WhisperVQ tokens to ichigo 2 seconds at a time.
  2. Based on what is generated by ichigo, intercept any "idle" token tags to prevent them from being sent to the TTS.
  3. When a new non-idle token is generated by ichigo and the TTS is still speaking, stop the current TTS and start playing the new response.
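A minimal sketch of what that client-side loop could look like, assuming ichigo exposes a streaming generate call and emits a dedicated "idle" token tag; every name below (`whisper_vq_encode`, `ichigo.stream_generate`, `IDLE_TOKEN`, the `tts` interface) is an assumption for illustration, not the project's actual API:

```python
IDLE_TOKEN = "<|idle|>"   # assumed tag the model emits to mean "keep listening"
CHUNK_SECONDS = 2.0       # step 1: send WhisperVQ tokens ~2 seconds at a time

def duplex_loop(mic, whisper_vq_encode, ichigo, tts):
    """Hypothetical loop implementing steps 1-3 above."""
    while mic.is_open():
        audio = mic.read(seconds=CHUNK_SECONDS)       # step 1
        sound_tokens = whisper_vq_encode(audio)
        for token in ichigo.stream_generate(sound_tokens):
            if token == IDLE_TOKEN:
                continue                              # step 2: never forward idle tags to TTS
            if tts.is_speaking():
                tts.stop()                            # step 3: cut off the stale response
            tts.feed(token)                           # start playing the new response
```

In a real client this loop would run asynchronously so that reading from the mic, generation, and TTS playback can overlap.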
