-
>>> naurto It is not an easy problem. You can get some insights from this paper: [Archived Post]
-
>>> joshua.eisenberg
[May 7, 2019, 8:26pm]
Hey all!
Thankfully I have been able to get the pre-trained model up and running,
and producing great synthesized speech.
Some context: I want to animate a face / mouth to speak while the
synthesized audio is playing. In order to do this I need the start and
stop time of each phoneme in the synthesized speech.
I am wondering if it is possible to use the attention map to extract the
timings of the synthesized words? Once I have this, I would like to
extract the timings of each phoneme...
I would like to analyze the attention map to do this. I know I could use
an acoustic model to calculate the timings, but that seems like overkill,
and I thought it would be better to find a solution that's already in
the TTS library.
I originally posted on the GitHub repo, and
erogol suggested looking at the
attention maps. I'm also wondering if there is a way to get the
image / data structure that contains the attention map of a synthesized
phrase, and analyze it to get the proper timings.
Thanks for any help!
😄
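For anyone landing here later, the attention-based approach can be sketched roughly as follows. The alignment matrix has shape (decoder frames, input tokens); taking the argmax over tokens for each frame and converting frame indices to seconds gives approximate start/stop times per token. The `hop_length` and `sample_rate` values below are assumptions (typical Tacotron-style mel settings, and ignoring any decoder reduction factor `r`) — substitute the values from your model's audio config:

```python
import numpy as np

def token_timings(attention, hop_length=256, sample_rate=22050):
    """Estimate (start, end) times in seconds for each input token from an
    attention map of shape (decoder_frames, input_tokens).

    hop_length and sample_rate are placeholders; use your model's
    audio config. Assumes a roughly monotonic alignment.
    """
    frame_dur = hop_length / sample_rate       # seconds per decoder frame
    best = attention.argmax(axis=1)            # most-attended token per frame
    timings = {}
    for frame, tok in enumerate(best):
        start = frame * frame_dur
        end = start + frame_dur
        if tok not in timings:
            timings[tok] = [start, end]        # first frame for this token
        else:
            timings[tok][1] = end              # extend the token's span
    return timings

# Toy monotonic alignment: 6 decoder frames over 3 input tokens
att = np.array([[0.90, 0.05, 0.05],
                [0.80, 0.15, 0.05],
                [0.10, 0.80, 0.10],
                [0.05, 0.90, 0.05],
                [0.05, 0.15, 0.80],
                [0.00, 0.10, 0.90]])
print(token_timings(att))
```

If the model operates on characters rather than phonemes, you'd still need a grapheme-to-phoneme step to split each token's span further; this sketch only gives per-input-token timing.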
[This is an archived TTS discussion thread from discourse.mozilla.org/t/extract-timing-of-phonemes-and-words-from-attention-map]