
Possibility of change of emotion in single speaker and text? #820

Open
Tortoise17 opened this issue Sep 23, 2024 · 0 comments

@Tortoise17

@neonbjb Is there any possibility that this problem could be solved in the new version?
Kindly guide, if you can.

I was actually going to ask for this by a different mechanism.

You say elsewhere that you have a plan for moving between two voices based on their latents.

I had intended to just make a collection of voices by reading some source text in varying emotional cadences.

If you were to start by giving us the ability to smoothly shift between voices based on a token, then the question of figuring out when to do it could be pushed downstream.

Consider the case of using this to generate video game dialogue, based on the way a character moves through a dialogue tree, or to generate repeated variations for a character in a game like Civ, based on the relations between the two nations (awe, disgust, fear, colloquial, hatred, distrust, etc.). At that point, emotional inference can come from whatever is happening in the game, and need not, indeed should not, come from the text at all.

I'm not saying that emotion should never come from the text, but I am saying that I think they're separate problems.
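To make the game-driven case concrete, here's a toy sketch of picking the voice label from game state rather than from the dialogue text. Every name in it (`STANCE_TO_VOICE`, `voice_for`, the labels) is invented for illustration:

```python
# Toy sketch: in the game scenario above, the emotional label comes from
# the diplomatic stance between two nations, not from the dialogue text.
# All names here are invented for illustration.
STANCE_TO_VOICE = {
    "awe": "leaderAwe",
    "disgust": "leaderDisgust",
    "fear": "leaderFear",
    "hatred": "leaderHatred",
    "distrust": "leaderDistrust",
}

def voice_for(stance: str) -> str:
    # Stances without a recorded voice fall back to a neutral delivery.
    return STANCE_TO_VOICE.get(stance, "leaderNeutral")
```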

If you would help us create text that can, itself, indicate when it's time to shift from voice 1 to voice 2, I think that's a big step in the right direction. We don't need anything fancy like easing; a linear interpolation between located tags would be enough.

There's no need to figure out what the "right" emotional cues are; they'll vary from character to character. Just let us have string labels and we can figure it out.
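Since the proposal is just string labels plus linear interpolation, the core could be as small as the sketch below. It assumes each labelled voice has a precomputed conditioning latent of a common shape (the random tensors are stand-ins for real ones), and `voices` and `tween` are invented names, not an existing API:

```python
import torch

# Minimal sketch of "string labels + linear interpolation between latents".
# The random tensors stand in for real per-voice conditioning latents;
# every name here is invented for illustration.
voices = {
    "newscasterNeutral": torch.randn(1, 1024),
    "newscasterExcited": torch.randn(1, 1024),
    "newscasterVeryAngry": torch.randn(1, 1024),
}

def tween(src_label: str, dst_label: str, alpha: float) -> torch.Tensor:
    """Linearly interpolate between two labelled voices; alpha in [0, 1]."""
    return torch.lerp(voices[src_label], voices[dst_label], alpha)

# alpha=0.0 is pure src, alpha=1.0 is pure dst, anything between is a blend.
halfway = tween("newscasterNeutral", "newscasterExcited", 0.5)
```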

Here's one quick example of how it could be done. It's maybe a little counter-intuitive, but I think it'd work really well.

Add some tag (here I'm using [voice: ... ] but whatever works) which effectively means "at this tag, start transitioning from the voice I'm in to the voice I'm naming here; the transition finishes at the next tag."

So, notice that in the command below I repeat newscasterExcited. That's because the voice is neutral for the first five words; at the first newscasterExcited tag it starts tweening from the neutral it started in to the excited that's being requested; it then tweens from excited to excited, meaning it's not actually tweening, but just staying excited there. We do the same thing at the end for newscasterVeryAngry. That notation also means you can double-state for an immediate switch, since the tween phase happens over a zero-length band. (Alternately, you could have a more complex parser and start adding flags, but it's unnecessary, and I'd advise against it.)

```sh
python do_tts.py --text "I'm going to speak this [voice:newscasterExcited] and \
it's going to go great, [voice:newscasterExcited] like super great, \
[voice:newscasterSad] but it will get replaced with better \
[voice:newscasterAngry] and that person will feel my [voice:newscasterVeryAngry] \
ultimate indignant wrath [voice:newscasterVeryAngry] and my vengeance will be \
done!" --voice newscasterNeutral
```

Originally posted by @StoneCypher in #10 (comment)
