|
1 | 1 | ---
|
2 |
| -title: "Turn detection" |
3 |
| -description: "Intelligent turn detection with Streaming Speech-to-Text" |
| 2 | +title: "Vapi" |
| 3 | +description: "Vapi voice agent integration" |
4 | 4 | ---
|
5 | 5 |
|
6 |
| -### Overview |
| 6 | +## Overview |
7 | 7 |
|
8 |
| -AssemblyAI's end-of-turn detection functionality is integrated into our Streaming STT model, leveraging both acoustic and semantic features, and is coupled with a traditional silence-based heuristic approach. Both mechanisms work jointly and either can trigger end-of-turn detection throughout the audio stream. This joint approach significantly enhances the speed and accuracy of end-of-turn detection while allowing this functionality to fall back to the traditional method when the model makes a misprediction. |
| 8 | +Vapi is a developer platform for building voice AI agents. It handles the complex backend of voice agents so you can focus on creating great voice experiences. In this guide, we'll show you how to integrate AssemblyAI's streaming speech-to-text model into your Vapi voice agent. |
9 | 9 |
|
10 |
| -<Note> |
11 |
| - End-of-turn and end-of-utterances refer to the same thing and may be used |
12 |
| - interchangeably in these docs. |
13 |
| -</Note> |
| 10 | +<Card |
| 11 | + title="Vapi" |
| 12 | + icon={<img src="https://assemblyaiassets.com/images/Vapi.svg" alt="Vapi logo"/>} |
| 13 | + href="https://docs.vapi.ai/providers/transcriber/assembly-ai" |
| 14 | +> |
| 15 | + View Vapi's AssemblyAI STT provider documentation. |
| 16 | +</Card> |
14 | 17 |
|
15 |
| -### Model-based detection |
| 18 | +## Quick start |
16 | 19 |
|
17 |
| -Triggers when **all** conditions are met: |
| 20 | +<Steps> |
| 21 | + **Head to the "Assistants" tab in your Vapi dashboard.** |
18 | 22 |
|
19 |
| -#### EOT token predicted |
| 23 | + <Frame> |
| 24 | + <img src="/assets/img/vapi/Vapi-Step1.png" /> |
| 25 | + </Frame> |
20 | 26 |
|
21 |
| -- Model predicts semantic end-of-turn with a probability greater than `end_of_turn_confidence_threshold` |
22 |
| -- Default: 0.5 (user configurable) |
| 27 | + **Click on your assistant and then the "Transcriber" tab.** |
23 | 28 |
|
24 |
| -#### Minimum silence duration has passed |
| 29 | + <Frame> |
| 30 | + <img src="/assets/img/vapi/Vapi-Step2.png" /> |
| 31 | + </Frame> |
25 | 32 |
|
26 |
| -- After the last non-silence word token, `min_end_of_turn_silence_when_confident` milliseconds must pass |
27 |
| -- Default: 2400ms (user configurable) |
| 33 | + **Select "assembly-ai" from the Provider dropdown.** |
28 | 34 |
|
29 |
| -#### Minimum speech duration spoken |
| 35 | + <Frame> |
| 36 | + <img src="/assets/img/vapi/Vapi-Step3.png" /> |
| 37 | + </Frame> |
| 38 | +</Steps> |
30 | 39 |
|
31 |
| -- The user must speak for at least 80ms since the last end-of-turn (ensures at least one word) |
32 |
| -- Set to 80 ms (internal) |
| 40 | +Your voice agent now uses **AssemblyAI** for speech-to-text (STT) processing. |
33 | 41 |
|
34 |
| -#### Word finalized |
| 42 | +<Info> |
| 43 | +New to Vapi? Visit the [Quickstart Guide](https://docs.vapi.ai/quickstart/introduction) to explore example voice agent workflows. For the easiest way to test a voice agent, follow this [simple phone-based guide](https://docs.vapi.ai/quickstart/phone). |
| 44 | +</Info> |
35 | 45 |
|
36 |
| -- Last word in `turn.words` has been finalized |
37 |
| -- Internal configuration |
38 |
| - |
39 |
| -### Silence-based detection |
40 |
| - |
41 |
| -Triggers when **all** conditions are met: |
42 |
| - |
43 |
| -#### Minimum speech duration spoken |
44 |
| - |
45 |
| -- The user must speak for at least 80ms since the last end-of-turn (ensures at least one word) |
46 |
| -- Set to 80 ms (internal) |
47 |
| - |
48 |
| -#### Maximum silence duration has passed |
49 |
| - |
50 |
| -- After the last non-silence word token, `max_turn_silence` milliseconds must pass |
51 |
| -- Default: 2400ms (user configurable) |
52 |
| - |
53 |
| -### Important notes |
54 |
| - |
55 |
| -- Silence-based detection can override model-based detection even with high EOT confidence thresholds |
56 |
| -- Word finalization always takes precedence — endpointing won't occur until the last word is finalized |
57 |
| -- We define end-of-turn detection as the process of detecting the end of sustained speech activity, often called end-pointing in the Voice Agents context |
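
The model-based and silence-based conditions listed in the removed section can be read as a single decision function. The following is a minimal sketch of that reading, not AssemblyAI's implementation: the `TurnState` and `TurnConfig` types and the function names are hypothetical, the parameter names and default values are copied from the text above, and the 80 ms minimum speech duration is treated as a fixed internal constant.

```python
from dataclasses import dataclass

# Internal (non-configurable) minimum speech duration, per the text above.
MIN_SPEECH_MS = 80


@dataclass
class TurnState:
    """Hypothetical snapshot of the stream at one point in time."""
    eot_probability: float      # model's end-of-turn probability
    silence_ms: int             # silence since the last non-silence word token
    speech_ms: int              # speech since the last end-of-turn
    last_word_finalized: bool   # whether the last word in turn.words is final


@dataclass
class TurnConfig:
    """User-configurable parameters, with the defaults quoted in the text."""
    end_of_turn_confidence_threshold: float = 0.5
    min_end_of_turn_silence_when_confident: int = 2400
    max_turn_silence: int = 2400


def model_based(state: TurnState, cfg: TurnConfig) -> bool:
    # Triggers only when ALL four model-based conditions hold.
    return (
        state.eot_probability > cfg.end_of_turn_confidence_threshold
        and state.silence_ms >= cfg.min_end_of_turn_silence_when_confident
        and state.speech_ms >= MIN_SPEECH_MS
        and state.last_word_finalized
    )


def silence_based(state: TurnState, cfg: TurnConfig) -> bool:
    # Triggers only when BOTH silence-based conditions hold.
    return (
        state.speech_ms >= MIN_SPEECH_MS
        and state.silence_ms >= cfg.max_turn_silence
    )


def end_of_turn(state: TurnState, cfg: TurnConfig) -> bool:
    # Word finalization takes precedence over both mechanisms.
    if not state.last_word_finalized:
        return False
    # Either mechanism can trigger end-of-turn.
    return model_based(state, cfg) or silence_based(state, cfg)
```

For example, with `min_end_of_turn_silence_when_confident=400`, a state with a high end-of-turn probability and 500 ms of silence ends the turn through the model-based path, while a low-probability state only ends the turn once `max_turn_silence` has elapsed.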
58 | 46 |
|
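
The dashboard steps in the Quick start select a transcriber provider for an assistant. As a rough sketch, the same choice could be expressed as a configuration fragment. This assumes that Vapi's assistant object accepts a `transcriber` object with a `provider` field; only the `"assembly-ai"` value comes from the guide, and `transcriber_patch` is a hypothetical helper that builds the payload without calling any API. Check Vapi's API reference for the actual request shape before using it.

```python
import json


def transcriber_patch(provider: str = "assembly-ai") -> dict:
    """Build a hypothetical assistant-update payload that sets the STT provider.

    The field names are an assumption about Vapi's assistant schema; the
    provider value matches the entry shown in the dashboard dropdown.
    """
    return {"transcriber": {"provider": provider}}


if __name__ == "__main__":
    # Print the payload so it can be inspected or pasted into a request body.
    print(json.dumps(transcriber_patch(), indent=2))
```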