Great, the voices make speech. How to make them TALK? #331
TryToRemember
started this conversation in
General
Replies: 1 comment
-
Hop off the meth for a while
-
Sorry, this is a book. I'd recommend skimming it; I am all over the place because I don't know the terminology.
Look, this is you guys' world, not mine. Right now I'm still an end user d/l'ing all the hundreds of gigabytes (think I've crossed into terabytes this week) of training models and voices and code, oh my.
But the more voices I train and the more hours I put into this, the more it seems - more basic than expected. I really like the tortoise/elevenlabs guy here, or maybe it's that reading jbetker's .py code "clicks" and seems sane. I'm just a guy who finds it very interesting, has access to a bunch of CUDA (TensorFlow) machines and wants to really ruin someone else's electric bill. :-)
What I have an issue with is that ALL of them seem to have the same issue as 40 damn years ago... with S.A.M. on the Commodore. "I AM SAM, SOFTWARE AUTOMATIC MOUTH." Flat, deadpan, generic cadence. It was amazing that it could be done!
But here we are 40 years later, and no matter how realistic a voice you create sounds, I cannot easily make it scream in rage "get out of my house!" with one voice, this one a man, let's say one who caught his wife cheating on their anniversary, so not only murderous rage but rage tinged with sadness/depression... or "get out of my house!" squeaked out through clenched teeth, almost above human hearing, by a woman so scared that her adrenaline and her untreated hypertension bring on a massive myocardial infarction right after uttering it, and she dies on the spot within minutes.
(Note: or whatever horrific thing you need to shock you. I'm trying to make a point about the many thousands-upon-thousands of ways a "trained voice" needs to be able to emit sound VERY DIFFERENTLY, not to justify triggering people by mentioning cheating/violence/death.)
Instead it's like a fancier version of a PBX voicemail voice or a better sounding Siri.
Cadence, pacing, excitement, pitch, sotto/whisper/under-your-breath, screaming excited, screaming fear, screaming rage... how many thousands (millions?) of variants in verbal speech from one person, in just one sentence, are possible? I'm not a linguistics major so I have no damn idea what the terminology is; my guess is the proper name is something like the articulation of voicing, articulatory phonetics, phonetic intonation of voicing maybe?
For example, let's take one that any person over 40 in America knows...
"I'm just a bill, yes I'm only a bill. And I'm sitting here on Capitol Hill."
Even if you want to "speak" this with a well-trained Darth Vader voice file and vocoder, you know you need to sing it not speak it, draw out some sounds and shorten others, while constantly adjusting pitch/timbre since in that one little phrase it goes all over the place!
How could I spec that, in a line of text between quotes?
Phonetically spell it out, or maybe mixing phoneticspeak with American punctuation?
"IIIM justa. bill... yes, I'm own-lee-a bill."
etc. If we can't get the AI to do it by putting (fear) or (enraged) in brackets, maybe just force the speech, the diction, whatever, directly, explicitly?
Feel like I'm rambling. Does this explain what I'm implying well enough, without knowing the terminology?
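To make the (fear) / (enraged) idea a bit more concrete, here is a minimal sketch of how a line annotated that way could be split into per-emotion chunks before being handed to a synthesizer. The parenthesised-tag notation and the `split_by_emotion` helper are made up for illustration; this is not the input format of Tortoise, ElevenLabs, or any other engine, and it says nothing about how an engine would actually render each emotion.

```python
import re
from typing import List, Tuple

# Hypothetical notation: "(emotion)" applies to the text that follows it,
# up to the next tag. Not the syntax of any real TTS engine.
TAG = re.compile(r"\(([a-z]+)\)")


def split_by_emotion(line: str, default: str = "neutral") -> List[Tuple[str, str]]:
    """Split an annotated line into (emotion, text) segments."""
    segments: List[Tuple[str, str]] = []
    emotion = default
    pos = 0
    for match in TAG.finditer(line):
        # Text before this tag still belongs to the previous emotion.
        text = line[pos:match.start()].strip()
        if text:
            segments.append((emotion, text))
        emotion = match.group(1)
        pos = match.end()
    # Whatever follows the last tag belongs to the last emotion seen.
    tail = line[pos:].strip()
    if tail:
        segments.append((emotion, tail))
    return segments


if __name__ == "__main__":
    line = "(enraged) Get out of my house! (fear) Get out of my house!"
    for emotion, text in split_by_emotion(line):
        print(f"{emotion}: {text}")
```

Each segment could then be synthesized separately with whatever per-emotion conditioning a given engine offers, if it offers any. That second step is the hard part, and this sketch does not attempt it.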
Another example... take the bombastic House of Commons ex-Speaker, John Bercow: "ORDER!" - most of the world knows this voice, especially adults who watched news clips online during Brexit / COVID lockdowns.
But no matter how many billions of samples you might use to make a johnbercow-trained model, vocoded to a resolution beyond human vocal cords and hearing, a model of perfection, it still will N E V E R utter the word correctly as implied below without some kind of defining or situational clarity.
Order! - John fresh out of bed, perky and wanting coffee not tea.
Order. - John after getting bad news while sitting in the loo 5 minutes before the chamber is in session.
ORDER! - John orgasming during the best sex of his life (so far).
That's just one damned word. One.
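For what it's worth, one existing way to at least write such differences down is SSML (the W3C Speech Synthesis Markup Language), whose `prosody` element takes `rate`, `pitch` and `volume` attributes. The sketch below wraps the same word in three different prosody settings. The delivery names and the specific values are guesses for illustration only, whether a given engine accepts SSML at all is an assumption, and three knobs clearly do not capture everything that separates those readings.

```python
import xml.etree.ElementTree as ET

# Illustrative guesses at prosody settings for each reading; the values are
# assumptions, not measurements of any real speaker.
DELIVERIES = {
    "perky": {"rate": "fast", "pitch": "+2st", "volume": "medium"},
    "deflated": {"rate": "slow", "pitch": "-3st", "volume": "soft"},
    "shouted": {"rate": "medium", "pitch": "+6st", "volume": "x-loud"},
}


def to_ssml(text: str, delivery: str) -> str:
    """Wrap text in an SSML <prosody> element for the named delivery.

    This is a fragment: a complete SSML document also needs the version
    and xmlns attributes on <speak>.
    """
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", DELIVERIES[delivery])
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    for word, delivery in [("Order!", "perky"), ("Order.", "deflated"), ("ORDER!", "shouted")]:
        print(to_ssml(word, delivery))
```

Even with markup like this, the "why" behind each reading (fresh out of bed versus bad news) still has to be translated into numbers by hand, which is pretty much the complaint here.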
The reason I spelled it out like that: can we control it in a DALL-E kind of way? Do it just as I typed it?
"have Mickey Mouse say "Only you can prevent forest fires" in the style of 1950 Smokey The Bear spokesman"
Or, a bit more complex, let's take a gigglepus... when you're flirting or just drunk or even just animated and happy, you might constantly laugh or giggle mid-word or mid-sentence... but in normal speaking you would not. And if you are super-giggly, you might stop and repeat or even re-start the entire story or sentence you were speaking. (Howie Mandel - Wait, wait, wait!!) How could a trained model accommodate that, especially when it's intermixed in the same sentence with normal speech?
I love all the A.I. voice things people are doing with game mods, using the thousands of lines of dialogue to nearly duplicate the voice, allowing extending a game's content with the characters that are already in it - that's pretty cool. Probably not to a human paid VO actor, but that's the way of the world. Bands used to be able to sell these things called "CDs" too.
But the more I dive into this, the more people just seem to care that they can make a realistic-sounding voice, not that the voice can speak like a realistic human.
It's got to be me; this could not be so limited and still have people this excited about it. Because to me it appears it's more about HOW it picks what to say, like an AI chatbot, and no more about the actual voicing and intonation of sound than it was 40 years ago.
Hopefully I'm explaining myself enough to be understood, another problem with language! :-)
Thanks for reading my insane thoughts.