diff --git a/content/posts/2024-09-05-odsc-europe-voice-agents-part-1.md b/content/posts/2024-09-05-odsc-europe-voice-agents-part-1.md
new file mode 100644
index 00000000..073df92f
--- /dev/null
+++ b/content/posts/2024-09-05-odsc-europe-voice-agents-part-1.md
@@ -0,0 +1,221 @@
+---
+title: "Building Reliable Voice Bots with Open Source Tools - Part 1"
+date: 2024-09-19
+author: "ZanSara"
+featuredImage: "/posts/2024-09-05-odsc-europe-voice-agents/cover.png"
+---
+
+*This is part one of the write-up of my talk at [ODSC Europe 2024](/talks/2024-09-05-odsc-europe-voice-agents/).*
+
+---
+
+In the last few years, the world of voice agents has seen dramatic leaps forward in the state of the art of all its basic components. Thanks mostly to OpenAI, bots can now understand human speech almost as well as a human would, speak back with natural-sounding voices, and hold a free-flowing conversation that feels remarkably natural.
+
+But building voice bots is far from a solved problem. These improved capabilities are raising the bar, and even users accustomed to the simpler capabilities of old bots now expect a whole new level of quality when it comes to interacting with them.
+
+In this post we're going to focus mostly on **the challenges**: we'll discuss the basic structure of most voice bots today, their shortcomings, and the main issues that you may face on your journey to improve the quality of the conversation.
+
+In [Part 2](/posts/2024-09-05-odsc-europe-voice-agents-part-2/) we are going to focus on **the solutions** that are available today, and we are going to build our own voice bot using [Pipecat](https://www.pipecat.ai), a recently released open-source library that makes building these bots a lot simpler.
+
+# Outline
+
+- [What is a voice agent?](#what-is-a-voice-agent)
+  - [Speech-to-text (STT)](#speech-to-text-stt)
+  - [Text-to-speech (TTS)](#text-to-speech-tts)
+  - [Logic engine](#logic-engine)
+    - [Tree-based](#tree-based)
+    - [Intent-based](#intent-based)
+    - [LLM-based](#llm-based)
+- [New challenges](#new-challenges)
+  - [Real speech is not turn-based](#real-speech-is-not-turn-based)
+  - [Real conversation flows are not predictable](#real-conversation-flows-are-not-predictable)
+  - [LLMs bring their own problems](#llms-bring-their-own-problems)
+  - [The context window](#the-context-window)
+  - [Working in real time](#working-in-real-time)
+
+_Continues in [Part 2](/posts/2024-09-05-odsc-europe-voice-agents-part-2/)._
+
+
+# What is a voice agent?
+
+As the name says, voice agents are programs that can carry out a task and/or take actions and decisions on behalf of a user ("software agents") using voice as their primary means of communication (as opposed to the much more common text chat format). Voice agents are inherently harder to build than their text-based counterparts: computers operate primarily with text, and the art of making machines understand human voices has been an elusive problem for decades.
+
+Today, the basic architecture of a modern voice agent can be decomposed into three fundamental building blocks:
+
+- a **speech-to-text (STT)** component, tasked with translating an audio stream into readable text,
+- the agent's **logic engine**, which works entirely with text only,
+- a **text-to-speech (TTS)** component, which converts the bot's text responses back into an audio stream of synthetic speech.
+
+![](/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot.png)
+
+Let's see the details of each.
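+
+To make this three-block structure concrete before diving into each component, here is a minimal sketch of the loop that ties them together. `transcribe`, `generate_reply` and `synthesize` are placeholders for whichever STT, logic and TTS components you end up choosing; only the overall flow matters here.
+
+```python
+from dataclasses import dataclass, field
+
+# Placeholder components: swap in a real STT engine, logic engine and TTS engine.
+def transcribe(audio_chunk: bytes) -> str: ...
+def generate_reply(history: list[str]) -> str: ...
+def synthesize(text: str) -> bytes: ...
+
+@dataclass
+class VoiceAgent:
+    history: list[str] = field(default_factory=list)
+
+    def handle_turn(self, audio_chunk: bytes) -> bytes:
+        user_text = transcribe(audio_chunk)        # 1. speech-to-text: audio in, text out
+        self.history.append(f"user: {user_text}")
+        reply_text = generate_reply(self.history)  # 2. logic engine: works on text only
+        self.history.append(f"bot: {reply_text}")
+        return synthesize(reply_text)              # 3. text-to-speech: text in, audio out
+```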
+
+## Speech-to-text (STT)
+
+Speech-to-text software takes the audio stream of a person speaking and produces a transcription of what they said. Speech-to-text engines have a [long history](https://en.wikipedia.org/wiki/Speech_recognition#History), but their limitations have always been quite severe: they used to require fine-tuning on each individual speaker, had a rather high word error rate (WER), and worked mainly with native speakers of major languages, failing hard on foreign and uncommon accents and on native speakers of less mainstream languages. These issues limited the adoption of this technology to little more than niche software and research applications.
+
+With the [first release of OpenAI's Whisper models](https://openai.com/index/whisper/) in late 2022, the state of the art improved dramatically. Whisper enabled transcription (and even direct translation) of speech from many languages with an impressively low WER, finally comparable to the performance of a human, all with relatively low resources, higher-than-realtime speed, and no fine-tuning required. On top of that, the model was free to use, as OpenAI [open-sourced it](https://huggingface.co/openai) together with a [Python SDK](https://github.com/openai/whisper), and the details of its architecture were [published](https://cdn.openai.com/papers/whisper.pdf), allowing the scientific community to improve on it.
+
+![](/posts/2024-09-05-odsc-europe-voice-agents/whisper-wer.png)
+
+_The WER (word error rate) of Whisper was extremely impressive at the time of its publication (see the full diagram [here](https://github.com/openai/whisper/assets/266841/f4619d66-1058-4005-8f67-a9d811b77c62))._
+
+Since then, speech-to-text models have kept improving at a steady pace. Nowadays the Whisper family of models sees some competition for the title of best STT model from companies such as [Deepgram](https://deepgram.com/), but it remains one of the best open-source options.
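+
+As a reference, this is roughly all it takes to transcribe an audio file with the open-source `whisper` package; the model size and file name below are just placeholders for this sketch.
+
+```python
+# pip install openai-whisper
+import whisper
+
+model = whisper.load_model("base")             # larger checkpoints lower the WER further
+result = model.transcribe("user_message.wav")  # the spoken language is detected automatically
+print(result["text"])
+```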
+
+## Text-to-speech (TTS)
+
+Text-to-speech models perform the exact opposite task of speech-to-text models: their goal is to convert written text into an audio stream of synthetic speech. Text-to-speech has [historically been an easier feat](https://en.wikipedia.org/wiki/Speech_synthesis#History) than speech-to-text, but it too has recently seen drastic improvements in the quality of the synthetic voices, to the point that it could nearly be considered a solved problem in its most basic form.
+
+Today many companies (such as OpenAI, [Cartesia](https://cartesia.ai/sonic), [ElevenLabs](https://elevenlabs.io/), Azure and many others) offer TTS software with voices that sound nearly indistinguishable from a human. They can also clone a specific human voice with remarkably little training data (just a few seconds of speech) and tune accents, inflections, tone and even emotion.
+
+{{< raw >}}
+<audio controls src="/posts/2024-09-05-odsc-europe-voice-agents/sonic-tts-sample.wav"></audio>
+{{< /raw >}}
+
+_[Cartesia's Sonic](https://cartesia.ai/sonic) TTS example of a gaming NPC. Note how the model subtly reproduces the breathing in between sentences._
+
+TTS is still improving by the day, but given the already very high quality of the output, competition now tends to focus on price and performance.
+
+## Logic engine
+
+Advancements in the agent's ability to talk to users go hand in hand with the progress of natural language understanding (NLU), another field with a [long and complicated history](https://en.wikipedia.org/wiki/Natural_language_understanding#History). Until recently, the bot's ability to understand the user's request was severely limited and often available only for major languages.
+
+Based on the way their logic is implemented, the bots you may come across today fall into three main categories.
+
+### Tree-based
+
+Tree-based (or rule-based) logic is one of the earliest methods of implementing a chatbot's logic, and it is still very popular today for its simplicity. Tree-based bots don't really try to understand what the user is saying, but rather listen for a keyword or key sentence that will trigger the next step. For example, a customer support chatbot may look for the keyword "refund" to give the user information about how to perform a refund, or the name of a discount campaign to explain to the user how to take advantage of it.
+
+Tree-based logic, while somewhat functional, doesn't really resemble a conversation and can become very frustrating to the user when the conversation tree was not designed with care, because it's difficult for the end user to understand which option or keyword they should use to achieve the desired outcome. It is also unsuitable for handling open-ended questions and requests like a human would.
+
+One of its most effective use cases is as a first-line screening tool to triage incoming messages.
+
+![](/posts/2024-09-05-odsc-europe-voice-agents/tree-based-logic.png)
+
+_Example of a very simple decision tree for a chatbot. While rather minimal, this bot already has several flaws: there's no way to correct the information you entered at a previous step, and it has no ability to recognize synonyms ("I want to buy an item" would trigger the fallback route)._
+
+### Intent-based
+
+In intent-based bots, **intents** are defined roughly as "actions the user may want to perform". Compared to a strict, keyword-based tree structure, intent-based bots can switch from one intent to another much more easily (because they lack a strict tree-based routing) and may use advanced AI techniques to understand what the user is actually trying to accomplish and perform the required action.
+
+Advanced voice assistants such as Siri and Alexa use variations of this intent-based system. However, as their owners know well, interacting with an intent-based bot doesn't always feel natural, especially when the available intents don't match the user's expectations and the bot ends up triggering an unexpected action. In the long run, users end up carefully second-guessing which words and sentence structures activate the response they need, which eventually leads to a sort of "magical incantation" style of prompting the agent: the user has to learn the exact "magical sentence" that the bot will recognize to perform a specific intent without misunderstandings.
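+
+To make the idea of intents more tangible, here is a toy sketch of intent matching; the intent names and keyword lists are made up for illustration, and production systems typically rely on trained classifiers or embeddings rather than bare keyword lookups.
+
+```python
+# Toy intent matcher: map a user utterance to a hypothetical intent by keyword.
+INTENTS = {
+    "request_refund": {"refund", "money back", "return"},
+    "track_shipment": {"shipment", "tracking", "package"},
+    "find_repair_shop": {"repair", "broken", "fix"},
+}
+
+def detect_intent(utterance: str) -> str:
+    text = utterance.lower()
+    for intent, keywords in INTENTS.items():
+        if any(keyword in text for keyword in keywords):
+            return intent
+    return "fallback"  # nothing matched: route to a human or a generic reply
+
+print(detect_intent("My package never arrived, where is it?"))  # -> track_shipment
+```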
+
+![](/posts/2024-09-05-odsc-europe-voice-agents/amazon-echo.webp)
+
+_Modern voice assistants like Alexa and Siri are often built on the concept of intent (image from Amazon)._
+
+### LLM-based
+
+The introduction of instruction-tuned GPT models like ChatGPT revolutionized the field of natural language understanding and, with it, the way bots can be built today. LLMs are naturally good at conversation and can formulate natural replies to any sort of question, making the conversation feel much more natural than with any technique available before.
+
+However, LLMs tend to be harder to control. Their very ability to generate natural-sounding responses to anything makes them behave in ways that are often unexpected to the developer of the chatbot: for example, users can get an LLM-based bot to promise them anything they ask for, or convince it to say something incorrect, or even to occasionally lie.
+
+The burden of controlling the conversation, one that traditionally fell on the user, is now entirely on the shoulders of the developers and can easily backfire.
+
+![](/posts/2024-09-05-odsc-europe-voice-agents/chatgpt-takesies-backsies.png)
+
+_In a rather [famous instance](https://x.com/ChrisJBakke/status/1736533308849443121), a user managed to convince a Chevrolet dealership chatbot to promise to sell him a Chevy Tahoe for a single dollar._
+
+# New challenges
+
+Thanks to all these recent improvements, it would seem that making natural-sounding, smart bots is getting easier and easier. It is indeed much simpler to make a simple bot sound better, understand more and respond appropriately, but there's still a long way to go before users can interact with these new bots as they would with a human.
+
+The issue lies in the fact that **users' expectations grow** with the quality of the bot. It's not enough for the bot to have a voice that sounds human: users want to be able to interact with it in a way that feels human too, which is far richer and more interactive than what the rigid technology of earlier chatbots allowed so far.
+
+What does this mean in practice? What are the expectations that users might have of our bots?
+
+## Real speech is not turn-based
+
+Traditional bots can only handle turn-based conversations: the user talks, then the bot replies, then the user talks some more, and so on. A conversation with another human, however, has no such limitation: people may talk over each other, give audible feedback without interrupting, and more.
+
+Here are some examples of this richer interaction style:
+
+- **Interruptions**. Interruptions occur when a person is talking and another one starts talking at the same time. It is expected that the first person stops talking, at least for a few seconds, to understand what the interruption was about, while the second person continues to talk.
+
+- **Back-channeling**. Back-channeling is the practice of saying "ok", "sure", "right" while the other person is explaining something, to give them feedback and let them know we're paying attention to what is being said. The person who is talking is not supposed to stop: the aim of this sort of feedback is to let them know they are being heard.
+
+- **Pinging**. This is the natural reaction to a long silence, especially over a voice-only medium such as a phone call. When one of the two parties is supposed to speak but instead stays silent, the last one that talked might "ping" the silent party by asking "Are you there?", "Did you hear?", or even just "Hello?" to test whether they're being heard. This behavior is especially difficult to handle for voice agents that have a significant delay, because it may trigger an ugly vicious cycle of repetitions and delayed replies.
+
+- **Buying time**. When one of the parties knows that they will stay silent for a while, a natural reaction is to notify the other party in advance by saying something like "Hold on...", "Wait a second...", "Let me check..." and so on. This message has the benefit of preventing the "pinging" behavior we've seen before and can be very useful for voice bots that may need to carry out background work during the conversation, such as looking up information.
+
+- **Audible clues**. Not everything can be transcribed by a speech-to-text model, but audio carries a lot of nuance that is often used by humans to communicate. A simple example is pitch: humans can often tell whether they're talking to a child, a woman or a man by the pitch of their voice, but STT engines don't transcribe that information. So if a child picks up the phone when your bot asks for their mother or father, the model won't pick up on the obvious audible clue and will assume it is talking to the right person. Similar considerations apply to tone (to detect mood, sarcasm, etc.) and to other sounds like laughter, sobs, and more.
+
+## Real conversation flows are not predictable
+
+Tree-based bots, and to some degree intent-based ones too, work on the implicit assumption that conversation flows are largely predictable. Once the user has said something and the bot has replied accordingly, the user can only follow up with a fixed set of replies and nothing else.
+
+This is often a flawed assumption and the primary reason why talking to chatbots tends to be so frustrating.
+
+In reality, natural conversations are largely unpredictable. For example, they may feature:
+
+- **Sudden changes of topic**. Maybe the user and the bot were talking about making a refund, but then the user changes their mind and decides to ask for assistance finding a repair center for the product. Well-designed intent-based bots can deal with that, but most bots are in practice unable to do so in a way that feels natural to the user.
+
+- **Unexpected, erratic phrasing**. This is common when users are nervous or in a bad mood for any reason. Erratic, convoluted phrasing, long sentences and rambling are all very natural ways for people to express themselves, but such outbursts very often confuse bots completely.
+
+- **Non-native speakers**. Due to the nature of language learning, non-native speakers may have trouble pronouncing words correctly, may use highly unusual synonyms, or may structure sentences in complicated ways. This is also difficult for bots to handle, because understanding the sentence is harder and transcription issues are far more likely.
+
+- _**Non sequitur**_. _Non sequitur_ is an umbrella term for a sequence of sentences that bear no relation to each other in a conversation. A simple example is the user asking the bot "What's the capital of France?" and the bot replying "It's raining now". When done by the bot, this is often due to a severe transcription issue or a very flawed conversation design. When done by the user, it's often a malicious attempt to break the bot's logic, so it should be handled with some care.
+
+## LLMs bring their own problems
+
+It may seem that some of these issues, especially the ones related to conversation flow, could be easily solved with an LLM. These models, however, bring their own set of issues too:
+
+- **Hallucinations**. This is the technical term for the fact that LLMs can occasionally misremember information, or straight up lie. The problem is that they're also very confident about their statements, sometimes to the point of trying to gaslight their users. Hallucinations are a major problem for all LLMs: although they may seem to become more manageable with larger and smarter models, the problem only gets more subtle and harder to spot.
+
+- **Misunderstandings**. While LLMs are great at understanding what the user is trying to say, they're not immune to misunderstandings. Unlike a human, though, LLMs rarely suspect a misunderstanding: instead of asking for clarification, they make assumptions, resulting in surprising replies and behavior reminiscent of intent-based bots.
+
+- **Lack of assertiveness**. LLMs are trained to listen to the user and do their best to be helpful. This means that LLMs are also not very good at taking the lead in the conversation when we would need them to, and are easily misled and distracted by a motivated user. Preventing your model from giving your users a literary analysis of their unpublished poetry may sound silly, but it's a lot harder than many suspect.
+
+- **Prompt hacking**. Often done with malicious intent by experienced users, prompt hacking is the practice of convincing an LLM to reveal its initial instructions, ignore them, and perform actions it is explicitly forbidden from performing. This is especially dangerous and, while a lot of work has gone into this field, it is far from a solved problem.
+
+## The context window
+
+LLMs need to keep track of the whole conversation, or at least most of it, to be effective. However, there is a limit to the amount of text they can keep in mind at any given time: this limit is called the **context window**, and for many models it is still relatively low, at about 2000 tokens **(roughly 1500-1800 words)**.
+
+The problem is that this window also needs to include all the instructions your bot needs for the conversation. This initial set of instructions is called the **system prompt**, and it is kept slightly distinct from the other messages in the conversation to make the LLM understand that it is not part of the dialogue, but rather a set of instructions about how to handle the conversation.
+
+For example, a system prompt for a customer support bot may look like this:
+
+```
+You're a friendly customer support bot named VirtualAssistant.
+You are always kind to the customer and you must do your best
+to make them feel at ease and helped.
+
+You may receive a set of different requests. If the user asks
+you to do anything that is not in the list below, kindly refuse
+to do so.
+
+# Handle refunds
+
+If the user asks you to handle a refund, perform these actions:
+- Ask for their shipping code
+- Ask for their last name
+- Use the tool `get_shipping_info` to verify the shipping exists
+...
+```
+and so on.
+
+Although very effective, system prompts have a tendency to become huge in terms of tokens. Adding information to them makes the LLM behave much more as you expect (although it's not infallible), hallucinate less, and can even shape its personality to some degree. But if the system prompt becomes too long (more than 1000 words), the bot will only be able to exchange about 800 words' worth of messages with the user before it starts to **forget** either its instructions or the first messages of the conversation. For example, the bot will easily forget its own name and role, or it will forget the user's name and initial demands, which can make the conversation drift completely.
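+
+To get a feel for how quickly the window fills up, you can count tokens directly. Below is a small sketch using the `tiktoken` library; the encoding name, the file name and the 2000-token budget are assumptions made for the example.
+
+```python
+# pip install tiktoken
+import tiktoken
+
+# cl100k_base is the tokenizer used by many recent OpenAI chat models.
+encoding = tiktoken.get_encoding("cl100k_base")
+
+system_prompt = open("system_prompt.txt").read()  # hypothetical file with the bot's instructions
+conversation = [
+    "user: Hi, I'd like a refund for my last order.",
+    "bot: Sure, can you give me your shipping code?",
+]
+
+used = len(encoding.encode(system_prompt)) + sum(len(encoding.encode(m)) for m in conversation)
+print(f"{used} tokens used out of a hypothetical 2000-token context window")
+```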
+
+## Working in real time
+
+If all these issues weren't enough, there's also a fundamental issue related to voice interaction: **latency**. Voice bots interact with their users in real time: this means that the whole pipeline of transcribing the audio, understanding it, formulating a reply and synthesizing it back must be very fast.
+
+How fast? On average, people expect a reply from another person to arrive within **300-500ms** to sound natural. They can normally wait for about 1-2 seconds. Any longer and they'll likely ping the bot, breaking the flow.
+
+This means that, even if we had solutions to all of the above problems (and we do have some), these solutions need to operate at blazing-fast speed. Considering that the LLM's reply alone can take the better part of a second to even start being generated, latency is often one of the major issues that voice bots face when deployed at scale.
+
+![](/posts/2024-09-05-odsc-europe-voice-agents/ttft.jpg)
+
+_Time to First Token (TTFT) stats for several LLM inference providers running Llama 2 70B chat. From the [LLMPerf leaderboard](https://github.com/ray-project/llmperf-leaderboard). You can see how the time it takes for a reply to even start being produced is highly variable, going up to more than one second in some scenarios._
+
+
+# To be continued...
+
+_Interested? Stay tuned for Part 2!_


diff --git a/content/posts/2024-09-05-odsc-europe-voice-agents-part-2.md b/content/posts/2024-09-05-odsc-europe-voice-agents-part-2.md
new file mode 100644
index 00000000..85f12f45
--- /dev/null
+++ b/content/posts/2024-09-05-odsc-europe-voice-agents-part-2.md
@@ -0,0 +1,118 @@
+---
+title: "Building Reliable Voice Bots with Open Source Tools - Part 2"
+date: 2024-09-20
+author: "ZanSara"
+featuredImage: "/posts/2024-09-05-odsc-europe-voice-agents/cover.png"
+draft: true
+---
+
+*This is part two of the write-up of my talk at [ODSC Europe 2024](/talks/2024-09-05-odsc-europe-voice-agents/).*
+
+---
+
+In the last few years, the world of voice agents has seen dramatic leaps forward in the state of the art of all its basic components. Thanks mostly to OpenAI, bots can now understand human speech almost as well as a human would, speak back with natural-sounding voices, and hold a free-flowing conversation that feels remarkably natural.
+
+But building voice bots is far from a solved problem. These improved capabilities are raising the bar, and even users accustomed to the simpler capabilities of old bots now expect a whole new level of quality when it comes to interacting with them.
+
+In [Part 1](/posts/2024-09-05-odsc-europe-voice-agents-part-1/) we focused mostly on **the challenges** of building such a bot: we discussed the basic structure of most voice bots today, their shortcomings, and the main issues that you may face on your journey to improve the quality of the conversation.
+
+In this post, instead, we will focus on **the solutions** that are available today, and we are going to build our own voice bot using [Pipecat](https://www.pipecat.ai), a recently released open-source library that makes building these bots a lot simpler.
+
+# Outline
+
+_Start from [Part 1](/posts/2024-09-05-odsc-europe-voice-agents-part-1/)._
+
+- [Let's build a voice bot](#lets-build-a-voice-bot)
+  - [Voice Activity Detection (VAD)](#voice-activity-detection-vad)
+  - [Blend intent's control with LLM's fluency](#blend-intents-control-with-llms-fluency)
+    - [Intent detection](#intent-detection)
+    - [Prompt building](#prompt-building)
+    - [Reply generation](#reply-generation)
+  - [What about latency?](#what-about-latency)
+- [The code](#the-code)
+- [Looking forward](#looking-forward)
+
+
+# Let's build a voice bot
+
+At this point we have a comprehensive view of the issues that we need to solve to create a reliable, usable and natural-sounding voice agent. How can we actually build one?
+
+First of all, let's take a look at the structure we defined earlier and see how we can improve on it.
+
+![](/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot.png)
+
+## Voice Activity Detection (VAD)
+
+One of the simplest improvements to this basic pipeline is the addition of a robust Voice Activity Detection (VAD) model. VAD gives the bot the ability to hear interruptions from the user and react to them accordingly, helping to break the classic, rigid turn-based interactions of old-style bots.
+
+![](/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot-vad.png)
+
+However, on their own VAD models are not enough. To make a bot truly interruptible we also need the rest of the pipeline to be aware of the possibility of an interruption and be ready to handle it: the speech-to-text model needs to start transcribing, and the text-to-speech component needs to stop speaking, as soon as the VAD picks up speech.
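+
+To get an idea of what the VAD block does, here is a minimal sketch that runs Silero's open-source VAD (mentioned below) on a pre-recorded file; the file name and sampling rate are placeholders, and a real voice bot would run the same detection on the live audio stream instead.
+
+```python
+# pip install torch torchaudio
+import torch
+
+# Load the pre-trained Silero VAD model and its helper functions from torch.hub.
+model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
+get_speech_timestamps, _, read_audio, _, _ = utils
+
+# Read a (placeholder) audio file at 16 kHz and find the speech segments in it.
+wav = read_audio("user_audio.wav", sampling_rate=16000)
+speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
+
+for segment in speech_timestamps:
+    print(f"speech from sample {segment['start']} to sample {segment['end']}")
+```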
+
+The logic engine also needs to handle a half-spoken reply in a graceful way: it can't just assume that the whole reply it planned to deliver was received, and neither can it drop the whole reply as if it had never started. LLMs can handle this scenario; however, implementing it in practice is often not straightforward.
+
+The quality of your VAD model matters a lot, and so does tuning its parameters appropriately. You don't want the bot to interrupt itself at every ambient sound it detects, but you also want the interruption to happen promptly.
+
+Some of the best and most widely used models out there are [Silero](https://github.com/snakers4/silero-vad)'s VAD models, or alternatively [Picovoice](https://picovoice.ai/)'s [Cobra](https://picovoice.ai/platform/cobra/) models.
+
+## Blend intent's control with LLM's fluency
+
+Despite the distinctions we made at the start, the logic of voice bots is often implemented as a blend of more than one style. Intent-based bots may contain small decision trees, as well as LLM prompts. These blends often deliver the best results, taking the strengths of each approach to compensate for the weaknesses of the others.
+
+One of the most effective approaches is to use intent detection to help control the flow of an LLM conversation. Let's see how.
+
+![](/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot-intent.png)
+
+Suppose we're building a general-purpose customer support bot.
+
+A bot like this needs to be able to handle a huge variety of requests: helping the user renew subscriptions or buy and return items, updating them on the status of a shipment, giving the opening hours of the certified repair shop closest to their home, explaining the advantages of a promotion, and more.
+
+If we decide to implement this chatbot with intents, many intents may end up looking so similar that the bot will have trouble deciding which one best suits a specific request: think, for example, of a user who wants to know whether there's a repair shop within an hour's drive from their home, because otherwise they'll return the item.
+
+However, if we decide to implement this chatbot with an LLM, it becomes really hard to check its replies and make sure that the bot is not lying, because the amount of information it needs to handle is huge. The bot may also perform actions that it is not supposed to, like letting users return an item they no longer have a warranty on.
+
+There is an intermediate solution: **first try to detect the intent, then leverage the LLM**.
+
+### Intent detection
+
+Step one is detecting the intention of the user. Given that this is a hybrid approach, we don't need to micromanage the model here and we can safely stick to macro-categories. No need to specify "opening hours of certified repair shops in New York": "information on certified repair shops" in general will suffice.
+
+This first step drastically narrows down the information the LLM needs to handle, and it can be repeated at every message to make sure the user is still talking about the same topic and didn't change subject completely.
+
+Intent detection can be performed with several tools, but it can be done with an LLM as well. Large models like GPT-4o are especially good at this sort of classification, even when queried with a simple prompt like the following:
+
+```
+Given the following conversation, select what the user is currently talking about,
+picking an item from the "topics" list. Output only the topic name.
+
+Topics:
+
+[list of all the expected topics + catchall topics]
+
+Conversation:
+
+[conversation history]
+```
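+
+In practice, the classification step can be as small as a single LLM call. Here is a hedged sketch using the OpenAI Python client; the model name, the topic list and the `detect_topic` helper are assumptions made for illustration, not part of the talk's code.
+
+```python
+# pip install openai
+from openai import OpenAI
+
+client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
+
+TOPICS = ["refunds", "shipment tracking", "certified repair shops", "promotions", "other"]
+
+def detect_topic(conversation_history: str) -> str:
+    """Ask the LLM to pick the topic the user is currently talking about."""
+    prompt = (
+        "Given the following conversation, select what the user is currently talking about, "
+        'picking an item from the "topics" list. Output only the topic name.\n\n'
+        f"Topics:\n{TOPICS}\n\nConversation:\n{conversation_history}"
+    )
+    response = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": prompt}],
+    )
+    return response.choices[0].message.content.strip()
+
+print(detect_topic("user: hi, my blender broke, is there anywhere I can get it fixed?"))
+```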
+
+### Prompt building
+
+Once we know more or less what the request is about, it's time to build the real prompt that will give us a reply for the user.
+
+With the general intent identified, we can equip the LLM strictly with the tools and information that it needs to proceed. If the user is asking about repair shops in their area, we can provide the LLM with a tool to search repair shops by zip code, a tool that would be useless if the user were asking about a shipment or a promotional campaign. The same goes for the background information: we don't need to tell the LLM that "you're a customer support bot", but we can narrow down its personality and background knowledge to make it focus a lot more on the task at hand, which is to help the user locate a suitable repair shop. And so on.
+
+This can be done by mapping each expected intent to a specific system prompt, pre-compiled to match the intent. At the prompt building stage we simply pick from our library of prompts and **replace the system prompt** with the one we just selected.
+
+### Reply generation
+
+With the new, more appropriate system prompt in place at the head of the conversation, we can finally prompt the LLM again to generate a reply for the user. At this point the LLM, following the updated prompt, has an easier time sticking to its instructions (because they're simpler and more focused) and generates better-quality answers for both the users and the developers.
+
+## What about latency?
+
+# The code
+
+
+
+# Looking forward


diff --git a/content/posts/2024-09-10-odsc-europe-voice-agents.md b/content/posts/2024-09-10-odsc-europe-voice-agents.md deleted file mode 100644 index f95dae48..00000000 --- a/content/posts/2024-09-10-odsc-europe-voice-agents.md +++ /dev/null @@ -1,139 +0,0 @@ ---- -title: "Building Reliable Voice Agents with Open Source Tools" -date: 2024-09-13 -author: "ZanSara" -featuredImage: "/posts/2024-09-13-odsc-europe-voice-agents/cover.png" -draft: true ---- - -*This is a write-up of my talk at [ODSC Europe 2024](/talks/2024-09-05-odsc-europe-voice-agents/).* - ---- - -In the last few years, the world of voice agents saw dramatic leaps forward in the state of the art of all its most basic components. Thanks mostly to OpenAI, bots are now able to understand human speech almost like a human would, they're able to speak back with completely naturally sounding voices, and are able to hold a free conversation that feels extremely natural. - -But building voice bots is far from a solved problem. These improved capabilities are raising the quality bar, and even users accustomed to the simpler capabilities of old bots now expect a whole new level of quality when it comes to interacting with them. - -In this post we're going to discuss how building a voice agent today looks like, from the very basics up to advanced prompting strategies to keep your LLM-based bots under control, and we're also going to see how to build such bots in practice from the ground up using a newly released library, Pipecat. - -# Outline - -- [What is a voice agent?](#what-is-a-voice-agent) - - [Speech-to-text (STT)](#speech-to-text-stt) - - [Text-to-speech (TTS)](#text-to-speech-tts) - - [Logic engine](#logic-engine) - - [Tree-based](#tree-based) - - [Intent-based](#intent-based) - - [LLM-based](#llm-based) -- [New challenges](#new-challenges) -- [Let's build a voice bot](#) -- [Looking forward](#) - - -# What is a voice agent? - -As the name says, voice agents are programs that are able to carry on a task and/or take actions and decisions on behalf of a user ("software agents") by using voice as their primary mean of communication (as poopsed to the much more common text chat format). Voice agents are inherently harder to build than their text based counterparts: computers operate primarily with text, and the art of making machines understand human voices has been an elusive problem for decades. - -Today, the basic architecture of a modern voice agent can be decomposed into three main fundamental building blocks: - -- a **speech-to-text (STT)** component, tasked to translate an audio stream into readable text, -- the agent's **logic engine**, which works entirely with text only, -- a **text-to-speech (TTS)** component, which converts the bot's text responses back into an audio stream of synthetic speech. - -![](/posts/2024-09-13-odsc-europe-voice-agents/structure-of-a-voice-bot.png) - -Let's see the details of each. - -## Speech to text (STT) - -Speech-to-text software is able to convert the audio stream of a person saying something and produce a transcription of what the person said. Speech-to-text engines have a [long history](https://en.wikipedia.org/wiki/Speech_recognition#History), but their limitations have always been quite severe: they used to require fine-tuning on each individual speaker, have a rather high word error rate (WER) and they mainly worked strictly with native speakers of major languages, failing hard on foreign and uncommon accents and native speakers of less mainstream languages. 
These issues limited the adoption of this technology for anything else than niche software and research applications. - -With the [first release of OpenAI's Whisper models](https://openai.com/index/whisper/) in late 2022, the state of the art improved dramatically. Whisper enabled transcription (and even direct translation) of speech from many languages with an impressively low WER, finally comparable to the performance of a human, all with relatively low resources, higher then realtime speed, and no finetuning required. Not only, but the model was free to use, as OpenAI [open-sourced it](https://huggingface.co/openai) together with a [Python SDK](https://github.com/openai/whisper), and the details of its architecture were [published](https://cdn.openai.com/papers/whisper.pdf), allowing the scientific community to improve on it. - -![](/posts/2024-09-13-odsc-europe-voice-agents/whisper-wer.png) - -_The WER (word error rate) of Whisper was extremely impressive at the time of its publication (see the full diagram [here](https://github.com/openai/whisper/assets/266841/f4619d66-1058-4005-8f67-a9d811b77c62))._ - - -Since then, speech-to-text models kept improving at a steady pace. Nowadays the Whisper's family of models sees some competition for the title of best STT model from companies such as [Deepgram](https://deepgram.com/), but it's still one of the best options in terms of open-source models. - -## Text-to-speech (TTS) - -Text-to-speech model perform the exact opposite task than speech-to-text models: their goal is to convert written text into an audio stream of synthetic speech. Text-to-speech has [historically been an easier feat](https://en.wikipedia.org/wiki/Speech_synthesis#History) than speech-to-text, but it also recently saw drastic improvements in the quality of the synthetic voices, to the point that it could nearly be considered a solved problem in its most basic form. - -Today many companies (such as OpenAI, [Cartesia](https://cartesia.ai/sonic), [ElevenLabs](https://elevenlabs.io/), Azure and many others) offer TTS software with voices that sound nearly indistinguishable to a human. They also have the capability to clone a specific human voice with remarkably little training data (just a few seconds of speech) and to tune accents, inflections, tone and even emotion. - -{{< raw >}} -
- -
-{{< /raw >}} - -_[Cartesia's Sonic](https://cartesia.ai/sonic) TTS example of a gaming NPC. Note how the model subtly reproduces the breathing in between sentences._ - -TTS is still improving in quality by the day, but due to the incredibly high quality of the output competition now tends to focus on price and performance. - -## Logic engine - -Advancements in the agent's ability to talk to users goes hand in hand with the progress of natural language understanding (NLU), another field with a [long and complicated history](https://en.wikipedia.org/wiki/Natural_language_understanding#History). Until recently, the bot's ability to understand the user's request has been severely limited and often available only for major languages. - -Based on the way their logic is implemented, today you may come across bots that rely on three different categories. - -### Tree-based - -Tree-based (or rule-based) logic is one of the earliest method of implementing chatbot's logic, still very popular today for its simplicity. Tree-based bots don't really try to understand what the user is saying, but listen to the user looking for a keyword or key sentence that will trigger the next step. For example, a customer support chatbot may look for the keyword "refund" to give the user any infomation about how to perform a refund, or the name of a discount campaign to explain the user how to take advantage of that. - -Tree-based logic, while somewhat functional, doesn't really resemble a conversation and can become very frustrating to the user when the conversation tree was not designed with care, because it's difficult for the end user to understand which option or keyword they should use to achieve the desired outcome. It is also unsuitable to handle real questions and requests like a human would. - -One of its most effective usecases is as a first-line screening to triage incoming messages. - -![](/posts/2024-09-13-odsc-europe-voice-agents/tree-based-logic.png) - -_Example of a very simple decision tree for a chatbot. While rather minimal, this bot already has several flaws: there's no way to correct the information you entered at a previous step, and it has no ability to recognize synonyms ("I want to buy an item" would trigger the fallback route.)_ - -### Intent-based - -In intent-based bots, **intents** are defined roughtly as "actions the users may want to do". With respect to a strict, keyword-based tree structure, intent-based bots may switch from an intent to another much more easily (because they lack a strict tree-based routing) and may use advanced AI techniques to understand what the user is actually trying to accomplish and perform the required action. - -Advanced voice assistants such as Siri and Alexa use variations of this intent-based system. However, as their owners are usually familiar with, interacting with an intent-based bot doesn't always feel natural, especially when the available intents don't match the user's expectation and the bot ends up triggering an unexpected action. In the long run, this ends with users carefully second-guessing what words and sentence structures activate the response they need and eventually leads to a sort of "magical incantation" style of prompting the agent, where the user has to learn what is the "magical sentence" that the bot will recognize to perform a specific intent without misunderstandings. 
- -![](/posts/2024-09-13-odsc-europe-voice-agents/amazon-echo.webp) - -_Modern voice assistants like Alexa and Siri are often built on the concept of intent (image from Amazon)._ - -### LLM-based - -The introduction of instruction-tuned GPT models like ChatGPT revolutionized the field of natural languahe understanding and, with it, the way bots can be built today. LLMs are naturally good at conversation and can formulate natural replies to any sort of question, making the conversation feel much more natural than with any technique that was ever available earlier. - -However, LLMs tend to be harder to control. Their very ability of generating naturally souding responses for anything makes them behave in ways that are often unexpected to the developer of the chatbot: for example, users can get the LLM-based bot to promise them anything they ask for, or they can be convinced to say something incorrect or even occasionally lie. - -The problem of controlling the conversation, one that traditionally was always on the user's side, is now entirely on the shoulders of the developers and can easily backfire. - -![](/posts/2024-09-13-odsc-europe-voice-agents/chatgpt-takesies-backsies.png) - -_In a rather [famous instance](https://x.com/ChrisJBakke/status/1736533308849443121), a user managed to convince a Chevrolet dealership chatbot to promise selling him a Chevy Tahoe for a single dollar._ - - -# New challenges - -Thanks to all these recent improvements, it would seem that making natural-sounding, smart voice bots is getting easier and easier. It is indeed much simpler to make a simple bot sound better, understand more and respond appropriately, but there's still a long way to go before users can interact with these new bots as they would with a human. - -The issue lays in the fact that users expectations grow with the quality of the bot. It's not enough for the bot to have a voice that shoulds human: users want to be able to interact with it in a way that it feels human too, which is far more rich and interactive than what the rigid tech of earlier chatbots made us assume. - -What does this mean in practice? What are the expectations that users might have from our bots? - -## Real speech is not turn-based - -Traditional bots can only handle turn-based conversations: the user talks, then the bot talks as well, then the user talks some more, and so on. A conversation with another human, however, has no such limitation: people may talk over each other, interrupt each other and give audible feedback in several ways. - -Here are some examples of this richer interaction style: - -- **Interruptions**. Interruptions occur when a person is talking and another one starts talking at the same time. It is expected that the first person stops talking, at least for a few seconds, to understand what the interruption was about, while the second person continue to talk. - -- **Back-channeling**: Back-channeling is the practice of saying "ok", "sure", "right" while the other p - - - - - -


diff --git a/content/talks/2024-09-05-odsc-europe-voice-agents.md b/content/talks/2024-09-05-odsc-europe-voice-agents.md index 83aca55f..8c4643a7 100644 --- a/content/talks/2024-09-05-odsc-europe-voice-agents.md +++ b/content/talks/2024-09-05-odsc-europe-voice-agents.md @@ -10,6 +10,7 @@ featuredImage: "/talks/2024-09-06-odsc-europe-voice-agents.png" [notebook](https://colab.research.google.com/drive/1NCAAs8RB2FuqMChFKMIVWV0RiJr9O3IJ?usp=sharing). All resources can also be found on ODSC's website and in [my archive](https://drive.google.com/drive/folders/1rrXMTbfTZVuq9pMzneC8j-5GKdRQ6l2i?usp=sharing). +Did you miss the talk? Check out the write-up [here](/posts/2024-09-05-odsc-europe-voice-agents-part-1/). --- diff --git a/static/posts/2024-09-13-odsc-europe-voice-agents/amazon-echo.webp b/static/posts/2024-09-05-odsc-europe-voice-agents/amazon-echo.webp similarity index 100% rename from static/posts/2024-09-13-odsc-europe-voice-agents/amazon-echo.webp rename to static/posts/2024-09-05-odsc-europe-voice-agents/amazon-echo.webp diff --git a/static/posts/2024-09-13-odsc-europe-voice-agents/chatgpt-takesies-backsies.png b/static/posts/2024-09-05-odsc-europe-voice-agents/chatgpt-takesies-backsies.png similarity index 100% rename from static/posts/2024-09-13-odsc-europe-voice-agents/chatgpt-takesies-backsies.png rename to static/posts/2024-09-05-odsc-europe-voice-agents/chatgpt-takesies-backsies.png diff --git a/static/posts/2024-09-13-odsc-europe-voice-agents/cover.png b/static/posts/2024-09-05-odsc-europe-voice-agents/cover.png similarity index 100% rename from static/posts/2024-09-13-odsc-europe-voice-agents/cover.png rename to static/posts/2024-09-05-odsc-europe-voice-agents/cover.png diff --git a/static/posts/2024-09-13-odsc-europe-voice-agents/sonic-tts-sample.wav b/static/posts/2024-09-05-odsc-europe-voice-agents/sonic-tts-sample.wav similarity index 100% rename from static/posts/2024-09-13-odsc-europe-voice-agents/sonic-tts-sample.wav rename to static/posts/2024-09-05-odsc-europe-voice-agents/sonic-tts-sample.wav diff --git a/static/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot-intent.png b/static/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot-intent.png new file mode 100644 index 00000000..381245c4 Binary files /dev/null and b/static/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot-intent.png differ diff --git a/static/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot-vad.png b/static/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot-vad.png new file mode 100644 index 00000000..092f30d9 Binary files /dev/null and b/static/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot-vad.png differ diff --git a/static/posts/2024-09-13-odsc-europe-voice-agents/structure-of-a-voice-bot.png b/static/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot.png similarity index 100% rename from static/posts/2024-09-13-odsc-europe-voice-agents/structure-of-a-voice-bot.png rename to static/posts/2024-09-05-odsc-europe-voice-agents/structure-of-a-voice-bot.png diff --git a/static/posts/2024-09-13-odsc-europe-voice-agents/tree-based-logic.png b/static/posts/2024-09-05-odsc-europe-voice-agents/tree-based-logic.png similarity index 100% rename from static/posts/2024-09-13-odsc-europe-voice-agents/tree-based-logic.png rename to static/posts/2024-09-05-odsc-europe-voice-agents/tree-based-logic.png diff --git a/static/posts/2024-09-05-odsc-europe-voice-agents/ttft.jpg 
b/static/posts/2024-09-05-odsc-europe-voice-agents/ttft.jpg new file mode 100644 index 00000000..25fd25fe Binary files /dev/null and b/static/posts/2024-09-05-odsc-europe-voice-agents/ttft.jpg differ diff --git a/static/posts/2024-09-13-odsc-europe-voice-agents/whisper-wer.png b/static/posts/2024-09-05-odsc-europe-voice-agents/whisper-wer.png similarity index 100% rename from static/posts/2024-09-13-odsc-europe-voice-agents/whisper-wer.png rename to static/posts/2024-09-05-odsc-europe-voice-agents/whisper-wer.png