[Deepgram whisper] invalid characters in Non-latin alphabets (Arabic, Hebrew) #196

yoelcabo · 2023-06-08T11:50:39Z


Which Deepgram product are you using?

Deepgram API

Details

When transcribing almost any Arabic file with the Deepgram Whisper API, the transcript contains invalid characters (rendered as the replacement character �). Here is an example:
"الوضع ونتابع سرعة خفا� مستوى المياه و"

I've double-checked that this is already present in the webhook payload we receive from the API, before any processing on our side. Here is a link to the result on our platform so you can see it: https://www.happyscribe.com/transcriptions/52ad8c008a044db3be501638f1aee329/view?organization_id=1
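For anyone who wants to run the same check on their own webhook payloads, here is a minimal sketch. The `results -> channels -> alternatives -> transcript` layout and the helper name `transcripts_with_invalid_chars` are assumptions made for illustration; adjust the keys to whatever your payload actually contains.

```ruby
require 'json'

# U+FFFD is what decoders emit in place of bytes that are not valid UTF-8.
REPLACEMENT_CHAR = "\uFFFD"

# Returns every transcript in a parsed payload that contains U+FFFD.
# The key path used here is an assumption about the payload layout.
def transcripts_with_invalid_chars(payload)
  channels = payload.dig('results', 'channels') || []
  channels
    .flat_map { |channel| channel.fetch('alternatives', []) }
    .map { |alternative| alternative['transcript'].to_s }
    .select { |transcript| transcript.include?(REPLACEMENT_CHAR) }
end

# Hypothetical payload, hand-written for this example (not real API output).
sample_json = '{"results":{"channels":[{"alternatives":[' \
              '{"transcript":"ok"},{"transcript":"bad \ufffd text"}]}]}}'
sample = JSON.parse(sample_json)

affected = transcripts_with_invalid_chars(sample)
puts "#{affected.size} transcript(s) contain U+FFFD"
```

Running this on the raw request body, before any other processing, shows whether the character is already in the payload or is introduced later.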

In my experience with Whisper (we have our own fork of Whisper), tokens in some non-Latin alphabets can be smaller than a character, meaning several tokens are needed to produce one character. The issue we are seeing usually happens when tokens are decoded separately instead of all at once: a token that holds only part of a character's bytes is not valid UTF-8 on its own, so it decodes to �.
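To illustrate the failure mode, here is a small sketch in plain Ruby. It does not use Whisper's tokenizer: the "tokens" are just hand-made byte fragments, split mid-character on purpose, and `decode_each` / `decode_all` are hypothetical helper names. I am not claiming this is what Deepgram's pipeline does, only that it reproduces the same symptom.

```ruby
# The Arabic word "مياه", written with escapes so the bytes are unambiguous.
# Each of its four letters takes two bytes in UTF-8 (8 bytes total).
text = "\u0645\u064A\u0627\u0647"
bytes = text.b # binary copy of the same bytes

# Hypothetical token boundary after byte 3, i.e. in the middle of the
# second letter. Real tokenizers choose their own boundaries.
tokens = [bytes.byteslice(0, 3), bytes.byteslice(3, 5)]

# Decode every token on its own: fragments of a character are invalid
# UTF-8 by themselves and get replaced with U+FFFD.
def decode_each(tokens)
  tokens.map { |t| t.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD") }.join
end

# Concatenate the bytes first, then decode once: the split character is
# whole again before decoding, so nothing is lost.
def decode_all(tokens)
  tokens.join.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

puts decode_each(tokens).include?("\uFFFD") # per-token decoding is corrupted
puts decode_all(tokens) == text             # joint decoding round-trips
```

The only difference between the two helpers is whether the bytes are joined before or after decoding, which is why the problem tends to show up only in scripts whose characters span more than one token.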

If you are making a request to the Deepgram API, what is the full Deepgram URL you are making a request to?

https://api.deepgram.com/v1/listen?model=whisper-large (see the rest of the query parameters in the code I pasted below)

If you are making a request to the Deepgram API and have a request ID, please paste it below:

16e68ce0-449e-45b0-8920-7aae3be34638

If possible, please attach your code or paste it into the text box.

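        # Excerpt from a class; assumes `require 'httparty'` and `require 'json'`.
        REQUEST_URL = "https://api.deepgram.com/v1/listen".freeze

        # Submits the transcription request and returns Deepgram's request ID.
        def submit_request
          response = HTTParty.post(
            REQUEST_URL,
            headers: {
              'Authorization' => "Token #{AUTH_TOKEN}",
              'content-type' => 'application/json',
            },
            body: request_body.to_json,
            query: request_query,
            timeout: 1800 # seconds
          )
          if response.success?
            JSON.parse(response.body)['request_id']
          else
            raise StandardError, "Deepgram request failed with status #{response.code} and body #{response.body}"
          end
        end

        # Query parameters sent with the request.
        def request_query
          {
            model: 'whisper-large',
            language: @asr_job.language.split("-").first, # keep only the part before the first "-"
            punctuate: true,
            numerals: true,
            smart_format: true,
            measurements: true,
            diarize: true,
            paragraphs: true,
            callback: webhook_url, # results are delivered to this webhook
          }
        end
      end # closes the enclosing class (its opening line is not part of this excerpt)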

If possible, please attach an example audio file to reproduce the issue.

deepgram_whisper-arabic_test.zip

You can see the result here: https://www.happyscribe.com/transcriptions/52ad8c008a044db3be501638f1aee329/view?organization_id=1


jjmaldonis · 2023-06-08T14:54:14Z

Maintainer

Hi @yoelcabo, thanks for the detailed info! This is extremely helpful.

I'll pass this along to our engineering team and see what we can do.


jjmaldonis · 2023-06-14T14:28:39Z

Maintainer

Hey @yoelcabo, this should be fixed now. Can you rerun the request and check whether it's working for you?


yoelcabo Jul 1, 2023
Author

It's working well indeed! Thanks ☺️
