[Deepgram whisper] invalid characters in Non-latin alphabets (Arabic, Hebrew) #196
-
Which Deepgram product are you using?
Deepgram API

Details
When transcribing pretty much any Arabic file with the Deepgram Whisper API, we get some invalid characters. Here is an example:

I've double-checked this is present in the API webhook that we receive without any processing on our side, but here you have a link to the result in our platform so you can see it: https://www.happyscribe.com/transcriptions/52ad8c008a044db3be501638f1aee329/view?organization_id=1

In my experience with Whisper (we have our own fork of Whisper), tokens in some non-Latin alphabets are actually smaller than a character (meaning you need several tokens to get one character). The issue we are seeing usually happens when you decode tokens separately instead of all at once.

If you are making a request to the Deepgram API, what is the full Deepgram URL you are making a request to?
https://api.deepgram.com/v1/listen?model?whisper-large (see the rest of query parameters in the code I pasted below)

If you are making a request to the Deepgram API and have a request ID, please paste it below:
16e68ce0-449e-45b0-8920-7aae3be34638

If possible, please attach your code or paste it into the text box.

If possible, please attach an example audio file to reproduce the issue.
deepgram_whisper-arabic_test.zip

You can see the result here: https://www.happyscribe.com/transcriptions/52ad8c008a044db3be501638f1aee329/view?organization_id=1
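
To illustrate the failure mode described above, here is a minimal sketch in Python. It assumes a byte-level tokenizer, where a token can end in the middle of a multi-byte UTF-8 character. The chunk boundaries and function names are hypothetical, chosen only for the demonstration; this is not Deepgram's or Whisper's actual decoding code.

```python
import codecs

text = "مرحبا"                      # five Arabic letters
data = text.encode("utf-8")         # each letter is 2 bytes, 10 bytes total

# Hypothetical token boundaries that cut through characters.
chunks = [data[0:1], data[1:3], data[3:6], data[6:10]]


def decode_separately(pieces):
    """Decode each token's bytes on its own; split characters become U+FFFD."""
    return "".join(p.decode("utf-8", errors="replace") for p in pieces)


def decode_together(pieces):
    """Concatenate all bytes first, then decode once."""
    return b"".join(pieces).decode("utf-8", errors="replace")


def decode_streaming(pieces):
    """Decode piece by piece while buffering incomplete byte sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = [decoder.decode(p) for p in pieces]
    out.append(decoder.decode(b"", final=True))  # flush any leftover bytes
    return "".join(out)


if __name__ == "__main__":
    print(repr(decode_separately(chunks)))  # contains replacement characters
    print(repr(decode_together(chunks)))    # the original text
    print(repr(decode_streaming(chunks)))   # the original text
```

If per-token output is needed (for example, for word-level results), an incremental decoder like the one in `decode_streaming` is one way to avoid emitting partial characters, since it holds back incomplete byte sequences until the following bytes arrive.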
Replies: 2 comments 1 reply
-
Hi @yoelcabo, thanks for the detailed info! This is extremely helpful. I'll pass this along to our engineering team and see what we can do.
-
Hey @yoelcabo, this should be fixed now. Can you rerun the request and test if it's working for you?