Add voices from Super Dialogue Audio Pack #425

Open: wants to merge 1 commit into base: main

@n8bot (Contributor) commented Apr 30, 2023

Source: https://dillonbecker.itch.io/sdap

License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/

These voices perform very well. The clips can be arranged in different ways to evoke certain emotions; for example, one of the voices is arranged to sound angry.

@n8bot (Contributor, Author) commented Apr 30, 2023

Attached is a sample of the output of some of the voices.

Newest samples:

Here is a demonstration of all the voices, as well as a failure of the "angry" voice to achieve consistency. I am puzzled, because earlier the same "angry" voice was quite consistent like the rest.

sdap12.zip

sdap12_angry_failure.zip

Old samples:

Example of a voice:

Ideal for normal speech — sdap1.zip

Example of a voice including non-speech vocalizations:

Not ideal, includes clips with weird vocalizations — sdap2.zip

Example of a curated "angry" voice:

Good demonstration of the ability to mix the clips in different ways to evoke emotions — sdap3.zip

@neonbjb (Owner) left a comment

Hey - thanks for putting this together. Normally I wouldn't recommend more than 3 conditioning clips per voice. Do you find the model performs better with all these clips? Would you mind cutting it down a bit to keep the repo size in check?

n8bot commented Apr 30, 2023

I did not do extensive testing with reduced numbers of these particular clips. They immediately worked so well, particularly the speech-only ones, that I left it as is with all the clips. The performance does not seem to be harmed by so many clips.

When I compare the output from these voices to other voices I have put together myself with fewer clips, and even some of the training voices, these voices perform much more consistently. The quality is high, for sure, but the consistency is what is nice.

For the PR, I could remove the _all voices, as the added non-speech vocalizations don't help with normal TTS use. However, they can sometimes be useful when crafting specific emotive voices from the clips.

Each variation of the voice (_all, _speech and the one _angry) has redundant copies of files. Rectifying this would reduce the added repo size significantly.
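One way to locate such redundant copies is to group files by a hash of their contents. The following is a minimal sketch rather than anything used in this PR: `find_duplicate_files` is a hypothetical helper, and the directory layout it is run against is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root):
    """Group files under `root` by the SHA-256 digest of their contents.

    Hypothetical helper: returns a list of groups, where each group is a
    list of paths whose bytes are identical (only groups of 2+ files).
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists byte-identical files, so all but one copy in a group could be dropped or moved to a shared folder.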

Maybe what I will do is put the "extra" audio clips in a separate folder alongside/within the voices folder, so users can use them if they wish.

Does the script scan subfolders for audio clips, too? I could have the extra clips in a subfolder with a text file with instructions.
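Whether the project's own loader descends into subfolders is not answered in this thread. The sketch below only illustrates the difference between a top-level scan and a recursive scan; `list_clips` and the `AUDIO_SUFFIXES` set are assumptions for illustration, not the project's code.

```python
from pathlib import Path

# Assumed set of clip formats, for illustration only.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac"}


def list_clips(voice_dir, recursive=False):
    """Return the audio files in `voice_dir`, sorted by path.

    With recursive=False only the top level is scanned, so clips kept in a
    subfolder (for example an "extra" folder) are ignored.
    """
    voice_dir = Path(voice_dir)
    candidates = voice_dir.rglob("*") if recursive else voice_dir.iterdir()
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )
```

If the loader behaves like the non-recursive case, extra clips placed in a subfolder would stay out of the voice until a user copies them up a level.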

The reason for the redundant license files was in case someone shared just a single voice, it would retain the license info.

Let me know if you have any thoughts. I'll look into the subfolder idea.

n8bot commented Apr 30, 2023

I removed all binary file redundancy.

Now there are only redundant markdown files with license and instructions for constructing subset voices.
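Constructing a subset voice amounts to copying a chosen list of clips into a new voice folder. The following is a minimal sketch under that assumption; `build_subset_voice`, the folder names, and the clip names are hypothetical and are not taken from the instructions in the PR.

```python
import shutil
from pathlib import Path


def build_subset_voice(source_dir, target_dir, clip_names):
    """Copy the named clips from `source_dir` into a new voice folder.

    Hypothetical helper: raises FileNotFoundError if any clip is missing,
    otherwise returns the list of copied destination paths.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    missing = [n for n in clip_names if not (source_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"clips not found in {source_dir}: {missing}")
    target_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(source_dir / n, target_dir / n)) for n in clip_names]
```

Copying rather than moving leaves the full clip set intact, so several subset voices can be built from the same source folder.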

n8bot commented May 1, 2023

I just did a test with an "angry" subset voice, and the results are way less consistent. The voice completely changes.

So it does appear that the sheer volume of clips contributes to consistency.

n8bot commented May 1, 2023

By the way, I have no problem if you decide not to include this in your repo. It requires no maintenance, so on my end I can easily patch it on top of any changes you make. It's open source and I had already done all the work, so I wanted to share it with anyone who might want it and give you the choice of whether to include it.

After a side-by-side test of these new voices against the tortoise default training voices, I must concede that the training voices are overall much better. So the purpose of these massive voices is not very clear.

n8bot commented May 2, 2023

Interestingly, when I use the "angry" subset of clips for a voice, consistency is much lower overall: random female voices and other voices pop up in the output. However, when the prompt consists of words that an angry person might plausibly say, consistency is much higher and all candidate results are consistent.

@G-force78

> It's interesting, when I use the "angry" subset of clips for a voice, the consistency is much lower overall: random female voices and other voices pop up in clips. However, when the prompt is actually words that seem like something an angry person would say, the consistency is much greater and all candidate results are consistent.

I noticed something in your samples that I got too: a strange yelp-groan at the end of the passage. I don't know if it's misinterpreted emotional emphasis or what.

3 participants