Maximum allowed value of endpointing parameter, if any? #177

3adel · 2023-06-01T09:57:54Z

I am using model=enhanced and language=de. Is there a maximum value for the endpointing parameter? I only see examples of it being set to 500 ms, but when I try to set it to a larger value (e.g., 2000 ms), it does not appear to work. Could you advise on the maximum allowed value of endpointing? I find 500 ms too short to detect the end of speech, especially for streaming applications.

Answered by nikolawhallon

nikolawhallon · 2023-06-01T14:15:08Z

Collaborator

The Deepgram endpointing algorithm is audio-based, so if you set endpointing=2000 it will trigger a speech_final message only once it detects 2000 ms of silence in the audio. For very clear audio signals this is no problem, but I find that with even slightly noisy signals, like phone calls, it is rare to have 2000 ms of silence: some noise, blip, bump, or other sound is likely to occur within that time period.

Because of this, the longer endpointing is set to, the more likely it is that some noise will occur, effectively barring endpointing/speech_final from triggering. For this reason, endpointing works best with shorter times (< 1000 ms is my rule of thumb, though it depends on how noisy the audio source is). By default endpointing is 10 ms, which is resilient to noise but not great for use cases where you really want to make sure a speaker has finished their utterance and isn't just taking a brief pause between words.
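To make the noise argument concrete, here is a toy model in Python. It is only a sketch under stated assumptions: audio is pre-split into fixed-size frames already labeled silent or not, and any non-silent frame resets the silence timer. The function name `endpoint_fires` and the framing are hypothetical and are not Deepgram's actual algorithm.

```python
def endpoint_fires(frames_are_silent, frame_ms, endpointing_ms):
    """Return the index of the frame where the current run of silence
    first reaches endpointing_ms, or None if it never does.

    Toy model only: assumes one non-silent frame resets the timer.
    """
    run_ms = 0
    for i, silent in enumerate(frames_are_silent):
        # Extend the silence run, or reset it on any noise.
        run_ms = run_ms + frame_ms if silent else 0
        if run_ms >= endpointing_ms:
            return i
    return None


# About 2 s of "silence" in 100 ms frames, with a short blip every 800 ms,
# as might happen on a noisy phone line (made-up data).
noisy = [True] * 8 + [False] + [True] * 8 + [False] + [True] * 3

short = endpoint_fires(noisy, frame_ms=100, endpointing_ms=500)
long_ = endpoint_fires(noisy, frame_ms=100, endpointing_ms=2000)
```

In this made-up signal the 500 ms threshold is reached within the first run of silence, while the 2000 ms threshold is never reached because each blip restarts the count.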

For times > 1000 ms, there is another useful feature, which I use for my phone apps, called utterance_end_ms. This feature looks at the transcripts being produced by Deepgram, not the audio signal, and triggers when it detects X ms of silence between spoken words, based on word timestamps. This means it cannot trigger until Deepgram produces a transcript, which happens every 3-5 seconds on average for is_final results. However, if interim_results is turned on, Deepgram will produce some transcript every 1 second on average. Because of this, using utterance_end_ms with values > 1000 ms can work quite nicely.
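The transcript-based idea can be sketched client-side as follows. This is an illustration of the concept, not Deepgram's implementation: the class `UtteranceEndDetector`, the helper `result`, and the rule "fire when the end of the latest result is at least utterance_end_ms past the last word's end timestamp" are all assumptions. The result shape (start, duration, and a words list with end times) follows the streaming responses in this thread.

```python
class UtteranceEndDetector:
    """Toy word-timestamp gap detector (hypothetical, for illustration)."""

    def __init__(self, utterance_end_ms):
        self.gap_s = utterance_end_ms / 1000.0
        self.last_word_end = None  # end time (s) of the most recent word

    def feed(self, res):
        """Feed one transcript result; return True if an utterance ended."""
        words = res["channel"]["alternatives"][0]["words"]
        if words:
            self.last_word_end = words[-1]["end"]
        # Audio time covered so far by the transcripts received.
        cursor = res["start"] + res["duration"]
        if (self.last_word_end is not None
                and cursor - self.last_word_end >= self.gap_s):
            self.last_word_end = None  # fire once per utterance
            return True
        return False


def result(start, duration, word_ends):
    """Build a minimal result dict with only the fields used above."""
    words = [{"end": e} for e in word_ends]
    return {"start": start, "duration": duration,
            "channel": {"alternatives": [{"words": words}]}}


det = UtteranceEndDetector(utterance_end_ms=1000)
first = det.feed(result(0.0, 3.34, [3.13]))   # word ended 0.21 s before cursor
second = det.feed(result(3.34, 1.16, []))     # no words; gap is now 1.37 s
```

The detector can only react when a result arrives, which is why more frequent interim results matter for this approach.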

To use this feature, try setting utterance_end_ms=2000&interim_results=true. After 2000 ms of silence based on word timings (not the audio signal), you will receive a message from Deepgram of the form {"type":"UtteranceEnd"}, which you can use to trigger your logic.
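As a small sketch, the query string can be assembled with Python's standard library rather than by hand. The helper name `build_listen_url` is hypothetical; the endpoint and parameter names are the ones used in this thread, and no other parameters are assumed to exist.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(utterance_end_ms=2000, interim_results=True, **extra):
    """Return the streaming URL with the utterance-end settings applied.

    Any extra keyword arguments are appended as query parameters as given.
    """
    params = {
        "utterance_end_ms": utterance_end_ms,
        # Booleans are sent as lowercase strings in the query string.
        "interim_results": str(interim_results).lower(),
    }
    params.update(extra)
    return BASE_URL + "?" + urlencode(params)


url = build_listen_url()
```

Opening the websocket and sending audio is left out here; the URL is the only part this sketch covers.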

Apologies for the long-winded explanation. This is a tricky thing, and finding the best solution requires some fine-tuning for sure! I've been using utterance_end_ms=2000&interim_results=true myself in several of my applications; hopefully it does the trick for you.

4 replies

3adel Jun 4, 2023
Author

@nikolawhallon thanks, but I can't find the utterance_end_ms feature in the Deepgram API.

nikolawhallon Jun 4, 2023
Collaborator

Yup, we haven't documented it yet, but will very soon!

3adel Jun 5, 2023
Author

I did implement it as you recommended, but I haven't received {"type":"UtteranceEnd"} in the response. I'm using language=de and the enhanced model.

nikolawhallon Jun 5, 2023
Collaborator

I just tried with the following URL and query parameters:

wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000&utterance_end_ms=1000&interim_results=true

and I'm getting responses such as:

{"channel_index":[0,1],"duration":3.34,"start":0.0,"is_final":true,"speech_final":true,"channel":{"alternatives":[{"transcript":"hello","confidence":0.9975586,"words":[{"word":"hello","start":2.735357,"end":3.1317856,"confidence":0.9975586}]}]},"metadata":{"request_id":"4cff4492-d4a4-4f97-8223-7d986bc2e090","model_info":{"name":"general","version":"2023-02-22.3","arch":"base"},"model_uuid":"96a295ec-6336-43d5-b1cb-1e48b5e6d9a4"}}

{"channel_index":[0,1],"duration":1.1600001,"start":3.34,"is_final":false,"speech_final":false,"channel":{"alternatives":[{"transcript":"","confidence":0.0,"words":[]}]},"metadata":{"request_id":"4cff4492-d4a4-4f97-8223-7d986bc2e090","model_info":{"name":"general","version":"2023-02-22.3","arch":"base"},"model_uuid":"96a295ec-6336-43d5-b1cb-1e48b5e6d9a4"}}

{"type":"UtteranceEnd"}

Note that the UtteranceEnd message is a separate message and not part of an ASR message (the above were 3 websocket messages I got in a row).
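Since UtteranceEnd arrives as its own websocket message, a client has to tell it apart from transcript messages. A minimal sketch, assuming each message is a JSON text frame shaped like the ones above (the function name `classify` and the returned labels are hypothetical):

```python
import json


def classify(raw):
    """Label one raw websocket message from the stream."""
    msg = json.loads(raw)
    # UtteranceEnd has a "type" field and no transcript payload.
    if msg.get("type") == "UtteranceEnd":
        return "utterance_end"
    # Transcript messages carry is_final / speech_final flags instead.
    if msg.get("speech_final"):
        return "speech_final"
    if msg.get("is_final"):
        return "final"
    return "interim"


# Trimmed-down versions of the three messages shown above.
messages = [
    '{"is_final": true, "speech_final": true}',
    '{"is_final": false, "speech_final": false}',
    '{"type": "UtteranceEnd"}',
]
labels = [classify(m) for m in messages]
```

Checking the type field first avoids looking for transcript fields that an UtteranceEnd message does not have.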
