Speech To Text processing timeout #1032

Hadymohammed · 2024-12-22T11:56:41Z

Hadymohammed
Dec 22, 2024

Seeking Support for Optimizing Long Audio Transcriptions with Whisper Large Model

Hello everyone,

We are working on a project to transcribe our internally produced video content, which currently amounts to about 40 hours per month. Our ultimate goal is to scale this system for external users, which could substantially increase our transcription volume.

At present, we’re using the Whisper Large model, primarily because of its support for Arabic transcription and speaker diarization — both are critical features for our workflow. However, we’re running into some significant challenges due to processing constraints:

Maximum Processing Time: Requests have a maximum processing time of 20 minutes, and exceeding it often results in timeouts.
Processing Speed: The model processes approximately 1 second of audio every 15 seconds, limiting it to less than 2 hours of audio per hour. Many of our audio and video files exceed these limits, making it difficult to use the current setup efficiently.

What We’re Looking For

We’re seeking guidance or suggestions on the following:

Optimizing Long Audio Processing: Are there optimizations or configurations available to process longer audio files without timeouts?
Improving Processing Speed: Can the processing speed be improved to handle a higher volume of audio files in a shorter timeframe?
Maintaining Feature Support: Are there alternative approaches or solutions within the Whisper ecosystem that maintain Arabic transcription and speaker diarization support?

Any insights, workarounds, or suggestions would be greatly appreciated.

Thank you!

Answered by jkroll-deepgram

Dec 27, 2024

2024-12-22T11:56:43Z

deepgram-community[bot]
bot Dec 22, 2024

Thanks for asking your question. Please be sure to reply with as much detail as possible so the community can assist you efficiently.
Consider joining our Discord community for more opportunity to engage with your fellow Deepgram users. You can earn points which can be redeemed for cool stuff by being active in our communities!

0 replies

2024-12-22T11:56:53Z

deepgram-community[bot]
bot Dec 22, 2024

Hey there! It looks like you haven't connected your GitHub account to your Deepgram account. You can do this at https://community.deepgram.com - being verified through this process will allow our team to help you in a much more streamlined fashion.

0 replies

2024-12-22T11:56:55Z

deepgram-community[bot]
bot Dec 22, 2024

It looks like we're missing some important information to help debug your issue. Would you mind providing us with the following details in a reply?

The programming language you are working in (e.g. JavaScript, Python).
A request ID that triggered your error or issue.

0 replies

jkroll-deepgram · 2024-12-23T20:03:14Z

jkroll-deepgram
Dec 23, 2024
Collaborator

Hi @Hadymohammed, while we offer a managed Whisper service for broader language coverage, unfortunately the Whisper models are significantly less efficient than Deepgram's own. However, our models don't support Arabic, so you'll need to use Whisper.

I would recommend submitting requests at off-peak hours, when we have lower traffic load and can better serve Whisper requests. That would be US nights and weekends.

How long are the audio files you are sending? I would recommend sending shorter audio to Whisper (under 60 minutes, or under 30 minutes if possible). I wouldn't recommend sending audio much over an hour in length, as it is more likely to time out.
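A minimal sketch of that advice, assuming the audio is cut at fixed offsets: it only plans the (start, end) boundaries so that no chunk exceeds 30 minutes. The function name `plan_chunks` is hypothetical, and the actual cutting (for example with an audio tool of your choice) is left out. Fixed offsets can split a word in half, and speaker labels from diarization may not line up across separately transcribed chunks, so treat this as a starting point.

```python
from typing import List, Tuple


def plan_chunks(total_seconds: float,
                max_chunk_seconds: float = 30 * 60) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole file,
    with no chunk longer than max_chunk_seconds."""
    if total_seconds <= 0 or max_chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks


# A 2-hour file becomes four 30-minute chunks.
two_hour_file = plan_chunks(2 * 60 * 60)
print(len(two_hour_file))  # 4
```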

Ultimately, there is a limit to how far you can scale using Deepgram's managed Whisper. Processing times are slow, and we have a rate limit of 5 concurrent Whisper requests.

2 replies

Hadymohammed Dec 24, 2024
Author

Hi @jkroll-deepgram ,

Thank you for your response.

Currently, we transcribe around 40 hours of audio per month, with individual files averaging 2 hours and occasionally reaching up to 4 hours in length. Given these durations, chunking seems like a plausible approach to manage the longer files. However, this would likely exceed the 5 concurrent Whisper request limit, creating a bottleneck in processing time.

On another note, I was curious if there are any plans to support Arabic in Deepgram's models in the near future. Arabic transcription is a critical feature for us as it’s the primary language of our content. Considering the growing demand for Arabic content across various domains such as media, education, and business, adding support for Arabic could significantly enhance the usability of your platform for users in the Middle East and other Arabic-speaking regions.

Any updates on potential Arabic support or recommendations for efficiently handling chunked audio within the current rate limits would be greatly appreciated.

jkroll-deepgram Dec 27, 2024
Collaborator

Hi @Hadymohammed, unfortunately Deepgram doesn't currently have a roadmap for supporting right-to-left languages, including Arabic.

Is the 40 hours per month transcribed in a single time-bound batch, or can you develop a strategy for doing more, smaller batches? Given 40 hours of audio, in files of 2-4 hours duration, that should amount to 10-20 total API requests. With Deepgram's rate limit of 5 concurrent Whisper requests, that could be done in about 2-4 batches of 5 requests. For instance, if the 40 hours were split into weekly batches, then you could likely even do all of a week's requests in a single batch.
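The batch arithmetic above can be checked with a few lines. This is only a sketch: `batches_needed` is a hypothetical helper, it assumes every file has the same duration, and the weekly figure assumes the 40 hours are spread evenly over 4 weeks.

```python
import math


def batches_needed(total_hours: float, file_hours: float,
                   concurrent_limit: int = 5):
    """Return (number of requests, number of batches of concurrent_limit)."""
    requests = math.ceil(total_hours / file_hours)
    return requests, math.ceil(requests / concurrent_limit)


print(batches_needed(40, 4))  # (10, 2): 4-hour files, whole month
print(batches_needed(40, 2))  # (20, 4): 2-hour files, whole month
print(batches_needed(10, 2))  # (5, 1): one week's share in 2-hour files
```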

I'll add that your initial estimate of 1 second of audio processed in 15 seconds sounds very slow, and processing speed will vary by time of day. I just ran a test and found that a 35-minute Arabic file was transcribed on Deepgram Whisper in 43 seconds. I'd suggest testing further to see what processing times you get on a few different occasions. If you scale to larger volumes, that may require more adjustments, but 40 hours a month on Whisper should still be quite feasible with Deepgram.
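To put the two speed figures side by side, the sketch below converts each into a "seconds of audio per second of processing" factor. The projection for a 2-hour file assumes processing time scales linearly with duration, which this thread does not confirm, so read it as a rough estimate only.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    return audio_seconds / processing_seconds


reported = speed_factor(1, 15)        # original poster's estimate
measured = speed_factor(35 * 60, 43)  # the 35-minute test file above

print(round(measured, 1))             # 48.8 times faster than real time
# Audio that fits in a 20-minute processing window at the reported rate:
print(round(20 * 60 * reported))      # 80 seconds
# Rough processing time for a 2-hour file at the measured rate:
print(round(2 * 3600 / measured, 1))  # 147.4 seconds
```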

Answer selected by deepgram-community
