Prep for Voice Steering feature #141
Credits:
1. ylacombe - Add input_values to DACModel - dac_wrapper/modeling_dac.py - huggingface#110 (comment)
2. stg2015 - Delay mask adjustment for input_values - modeling_parler_tts.py - huggingface#81 (comment)
Heya, I am the author of #110 (comment). It seems like we have similar results, but our implementations are slightly different. It looks like you are following #81 (comment) to adjust the delay pattern mask after the tokens have been generated by the Decoder, whereas my implementation modifies the mask before the tokens are injected into the Decoder. Are our approaches equivalent, or is there a difference?
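For readers unfamiliar with the term, here is a toy sketch of what a delay pattern mask does. It is not the Parler-TTS code: the names `PLACEHOLDER`, `BOS`, `build_delay_mask`, and `apply_delay_mask` are hypothetical, and the convention that -1 marks a position still to be generated is an assumption made for this illustration. Codebook k is shifted right by k steps, known prompt tokens are pinned, and everything else is left for the decoder to fill. The sketch does not settle whether adjusting the mask before or after decoding is equivalent in general.

```python
PLACEHOLDER = -1  # assumed marker: position still to be filled by the decoder
BOS = 1024        # hypothetical start/padding token id for this toy


def build_delay_mask(prompt_codes, total_len, bos=BOS):
    """Build one mask row per codebook.

    Row k is shifted right by k steps (padded with `bos`), followed by the
    known prompt tokens, then PLACEHOLDER up to `total_len`.
    """
    mask = []
    for k, codes in enumerate(prompt_codes):
        row = [bos] * k + list(codes)
        row = row[:total_len] + [PLACEHOLDER] * max(0, total_len - len(row))
        mask.append(row)
    return mask


def apply_delay_mask(generated, mask):
    """Keep generated tokens only where the mask is PLACEHOLDER."""
    return [
        [g if m == PLACEHOLDER else m for g, m in zip(gen_row, mask_row)]
        for gen_row, mask_row in zip(generated, mask)
    ]


# Two codebooks, each with a two-token prompt, sequence length 5.
mask = build_delay_mask([[10, 11], [20, 21]], total_len=5)
generated = [[7] * 5, [8] * 5]
result = apply_delay_mask(generated, mask)
```

In this toy, the pinned prompt tokens always win over whatever the decoder produced at those positions, which is why the point at which the mask is adjusted can matter.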
Howdy!
Thanks for the optimization.
The power has been out here for over 6 hours due to Helene. Just got the genny fired up. After I take care of getting the fam all situated, I'll test your recommended change.
Once I've got the code done, I can do a tutorial video if there is enough interest.
Thanks!
(In reply to Akash Gupta's message of Friday, September 27, 2024, 8:44 AM.)
I modified my change with your suggestion, and got exactly the same output. Specifically, I used the same inputs with and without your suggested change and compared a hash of the WAV file outputs and they are the same. I've adjusted my PR based on your suggestion. Thanks!
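A minimal sketch of that kind of check, for anyone who wants to repeat it. The comment does not say which hash or tooling was used, so SHA-256 and the helper names `file_sha256` and `same_audio_file` are assumptions. A byte-for-byte match also presumes generation is deterministic (same inputs, same seed) across the two runs.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_audio_file(path_a, path_b):
    """True if the two files are byte-for-byte identical (by hash)."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Usage would look like `same_audio_file("before.wav", "after.wav")`, where both file names are placeholders for the outputs of the two runs.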
Hey there, thanks for opening the PR!
Happy to be corrected here, but the modification proposed here seems incorrect. As highlighted in #110, I believe the modeling code is already correct and can be used even if you pass input_values.
Could you provide a code snippet that reproduces your original result?
In my comment (#110 (comment)), I don't have the issue you've shared, except that main_input_name = "input_values" indeed needs to be added as a class attribute of DACModel.
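To illustrate why such a class attribute can matter, here is a toy sketch that does not use the real library. `ToyBaseModel`, `ToyDACModel`, and `select_main_input` are hypothetical stand-ins; the only thing taken from the thread is the attribute name `main_input_name` and the value "input_values". The assumption is that shared utilities look up the model's primary input by that name.

```python
class ToyBaseModel:
    # Hypothetical stand-in for a base class whose shared utilities look up
    # the primary input by name; the default here is an assumption.
    main_input_name = "input_ids"

    def select_main_input(self, **model_kwargs):
        """Return the keyword argument named by `main_input_name`."""
        name = self.main_input_name
        if name not in model_kwargs:
            raise KeyError(
                f"expected main input '{name}', got {sorted(model_kwargs)}"
            )
        return model_kwargs[name]


class ToyDACModel(ToyBaseModel):
    # Overriding the class attribute redirects the lookup to raw audio input.
    main_input_name = "input_values"


audio = [0.0, 0.1, -0.1]
picked = ToyDACModel().select_main_input(input_values=audio)
```

Without the override, the same call on the base class would fail because it would look for "input_ids" instead.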
Hmm. Here's the code snippet: #139 (comment)
To clarify, if you try this code snippet on the current version of ParlerTTS, it will produce a crackling-like sound. With either this MR or #110, the crackling noise is fixed.
Hey @apresence, sorry for the long delay! Turns out you were right. Merging to fix this! Thanks for the work.
Does this PR enable inference on longer text, so the voice doesn't change with each chunk? If so, is there any documentation of how to actually do that? Or am I misunderstanding the purpose of this patch?
Yes. And I am working on it, just got tied up with other things.
(In reply to Adam J. Kessel's message of Friday, November 15, 2024, 4:33 PM.)