This project provides the backend service for the open-source smart-hardware project xiaozhi-esp32. It is implemented in Python, based on the XiaoZhi Communication Protocol.

This project is designed to be used together with ESP32 hardware devices. If you have already purchased an ESP32 device, successfully connected it to the backend service deployed by XieGe, and now wish to set up your own xiaozhi-esp32 backend service, this project is for you.
Want to see it in action? Check out the videos 🎥
- Hardware: A set of devices compatible with xiaozhi-esp32 (for specific models, please refer to this link).
- Server: A computer with at least a 4-core CPU and 8 GB of memory.
- Firmware Compilation: Update the backend service API endpoint in the xiaozhi-esp32 project, then recompile the firmware and flash it to your device.
This project is relatively new and has not yet undergone a network security evaluation. Do not use it in a production environment. If you deploy this project on a public network for learning purposes, be sure to enable protection in the configuration file `config.yaml`:

```yaml
server:
  auth:
    # Enable protection
    enabled: true
```
Once protection is enabled, you will need to validate the machine's token or MAC address based on your actual situation. Please refer to the configuration documentation for details.
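As a hedged illustration of what such validation might look like (the key names `allowed_devices` and `tokens` below are assumptions; check the configuration documentation for the actual field names):

```yaml
server:
  auth:
    enabled: true
    # Key names below are illustrative assumptions, not guaranteed
    # to match the real config schema.
    allowed_devices:
      - "24:0A:C4:xx:xx:xx"   # device MAC address
    tokens:
      - "your-device-token"
```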
- Communication Protocol: Based on the xiaozhi-esp32 protocol, with data exchange implemented over WebSocket.
- Dialogue Interaction: Supports wake-word dialogues, manual conversations, and real-time interruption. Automatically enters sleep mode after long periods of inactivity.
- Multilingual Recognition: Supports Mandarin, Cantonese, English, Japanese, and Korean (FunASR by default).
- LLM Module: Allows flexible switching of LLM modules. The default is ChatGLMLLM, with options to use AliLLM, DeepSeek, Ollama, and others.
- TTS Module: Supports multiple TTS interfaces, including EdgeTTS (default) and Volcano Engine Doubao TTS, to meet speech-synthesis requirements.
- Conversation Memory Feature
- Multiple Mood Modes
- Smart Control Panel Web UI
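The "flexible switching" above boils down to selecting a provider implementation by the name given in the configuration. A simplified registry sketch of that pattern (this is an illustration, not the project's actual code; `EdgeTTSStub` is a placeholder):

```python
# Simplified sketch of pluggable provider switching: implementations
# register under a name, and the configured name picks one at startup.
from abc import ABC, abstractmethod

class TTSProvider(ABC):
    @abstractmethod
    def synthesize(self, text: str) -> bytes: ...

_REGISTRY: dict[str, type[TTSProvider]] = {}

def register(name: str):
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

@register("EdgeTTS")
class EdgeTTSStub(TTSProvider):
    # Placeholder: a real provider would call the TTS backend here.
    def synthesize(self, text: str) -> bytes:
        return b"<audio for: " + text.encode() + b">"

def create_tts(name: str) -> TTSProvider:
    # `name` would come from config.yaml, e.g. "EdgeTTS" or "DoubaoTTS".
    return _REGISTRY[name]()
```

Swapping TTS (or LLM, ASR, VAD) engines then only requires changing one string in the configuration, never touching call sites.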
| Type | Platform Name | Usage Method | Pricing Model | Remarks |
|---|---|---|---|---|
| LLM | AliLLM (阿里百炼) | OpenAI API call | Token consumption | Click to apply for an API key |
| LLM | DeepSeekLLM (深度求索) | OpenAI API call | Token consumption | Click to apply for an API key |
| LLM | ChatGLMLLM (智谱) | OpenAI API call | Free | Although free, you still need to click to apply for an API key |
| LLM | OllamaLLM | Ollama API call | Free/Custom | Requires pre-downloading the model (`ollama pull `); service URL: http://localhost:11434 |
| LLM | DifyLLM | Dify API call | Token consumption | For local deployment; note that the prompt must be configured in the Dify console |
| LLM | GeminiLLM | Gemini API call | Free | Click to apply for an API key |
| LLM | CozeLLM | Coze API call | Token consumption | Requires providing bot_id, user_id, and a personal token |
| LLM | Home Assistant | Home Assistant voice assistant API call | Free | Requires providing a Home Assistant token |
In fact, any LLM that supports OpenAI API calls can be integrated.
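Integrating such a provider usually means nothing more than pointing an OpenAI-style request at a different base URL. A minimal stdlib-only sketch of the request shape these providers accept (URL path and field names follow the OpenAI chat-completions convention; the model name is an example):

```python
import json

def build_chat_request(base_url: str, model: str, user_text: str):
    """Build an OpenAI-compatible chat-completion request.

    Any provider exposing this API shape can be targeted just by
    changing base_url and model in the configuration.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    })
    return url, payload

# Example: an Ollama server exposing its OpenAI-compatible endpoint.
url, payload = build_chat_request("http://localhost:11434/v1", "qwen2.5", "hello")
```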
| Type | Platform Name | Usage Method | Pricing Model | Remarks |
|---|---|---|---|---|
| TTS | EdgeTTS | API call | Free | Default TTS, based on Microsoft's speech-synthesis technology |
| TTS | DoubaoTTS (火山引擎豆包 TTS) | API call | Token consumption | Click to create an API key; the paid version is recommended for higher concurrency |
| TTS | CosyVoiceSiliconflow | API call | Token consumption | Requires applying for a Siliconflow API key; output format is WAV |
| TTS | CozeCnTTS | API call | Token consumption | Requires providing a Coze API key; output format is WAV |
| TTS | FishSpeech | API call | Free/Custom | Starts a local TTS service; see the configuration file for startup instructions |
| TTS | GPT_SOVITS_V2 | API call | Free/Custom | Starts a local TTS service, suitable for personalized speech-synthesis scenarios |
| Type | Platform Name | Usage Method | Pricing Model | Remarks |
|---|---|---|---|---|
| VAD | SileroVAD | Local | Free | |

| Type | Platform Name | Usage Method | Pricing Model | Remarks |
|---|---|---|---|---|
| ASR | FunASR | Local | Free | |
| ASR | DoubaoASR | API call | Paid | |
This project supports three deployment methods. Choose the one that best fits your needs.
The documentation provided here is a written tutorial. If you prefer a video tutorial, you can refer to this expert's hands-on guide.
Combining both the written and video tutorials can help you get started more quickly.
- Docker Quick Deployment: suitable for general users who want a quick experience without extensive environment configuration. The only downside is that pulling the image can be a bit slow.
- Deployment Using a Docker Environment: ideal for software engineers who already have Docker installed and wish to customize the code.
- Running from Local Source Code: suitable for users familiar with the Conda environment or those who wish to build the runtime environment from scratch.

For scenarios requiring faster responses, running from local source code is recommended, as it reduces additional overhead.
Click here for a detailed guide on firmware compilation.
After successful compilation and network connection, wake up XiaoZhi using the wake-up word and monitor the server console for output.
Suggestion:
If EdgeTTS frequently fails, first check whether you are using a proxy (VPN); if so, disable the proxy and try again. If you are using Volcano Engine Doubao TTS and it often fails, the paid version is recommended, since the trial supports only 2 concurrent requests.
Suggestion:
Set the LLM to HomeAssistant in the configuration file and use the HomeAssistant API to perform the relevant controls.
Suggestion:
Locate the following section in the configuration file and increase the value of `min_silence_duration_ms` (for example, change it to `1000`):

```yaml
VAD:
  SileroVAD:
    threshold: 0.5
    model_dir: models/snakers4_silero-vad
    min_silence_duration_ms: 700  # If your pauses are longer, increase this value
```
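To get a feel for what the value means: the VAD only treats a pause as end-of-utterance after that many milliseconds of continuous silence. A quick conversion to audio samples (the 16 kHz sample rate here is an assumption for illustration):

```python
def silence_ms_to_samples(min_silence_duration_ms: int, sample_rate: int = 16000) -> int:
    """Number of consecutive silent samples the VAD must observe before
    treating the pause as end-of-utterance. 16 kHz is an assumed rate."""
    return min_silence_duration_ms * sample_rate // 1000

# Raising the setting from 700 ms to 1000 ms lengthens the required pause:
print(silence_ms_to_samples(700))   # 11200 samples
print(silence_ms_to_samples(1000))  # 16000 samples
```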
Suggestion:
Check whether the `model.pt` file exists in the `models/SenseVoiceSmall` directory. If it does not, download it; see Download ASR Model Files for details.
Suggestion:
Verify that you have correctly installed the `libopus` and `ffmpeg` libraries using `conda`. If not, install them with:

```bash
conda install conda-forge::libopus
conda install conda-forge::ffmpeg
```
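A quick way to confirm both dependencies are visible to the running environment (a small stdlib helper sketch, not part of the project):

```python
import ctypes.util
import shutil

def check_audio_deps() -> dict:
    """Report whether ffmpeg (as a CLI tool on PATH) and libopus
    (as a loadable shared library) can be found."""
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "libopus": ctypes.util.find_library("opus") is not None,
    }

print(check_audio_deps())
```

If either entry is `False` inside your conda environment, re-run the install commands above.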
The default configuration of this project is designed to be cost-effective. It is recommended that beginners first use the default free models to ensure that the system runs smoothly, then optimize for faster response times.
To improve response speed, you can try replacing individual components. Below are the response time test results for each component (for reference only, not a guarantee):
LLM Performance Ranking:

| Module Name | Average First-Token Time | Average Total Response Time |
|---|---|---|
| AliLLM | 0.547s | 1.485s |
| ChatGLMLLM | 0.677s | 3.057s |
| OllamaLLM | 0.003s | 0.003s |
TTS Performance Ranking:

| Module Name | Average Synthesis Time |
|---|---|
| EdgeTTS | 1.019s |
| DoubaoTTS | 0.503s |
| CosyVoiceSiliconflow | 3.732s |
Recommended Configuration Combinations (Overall Response Speed):

| Combination Scheme | Overall Score | LLM First Token | TTS Synthesis |
|---|---|---|---|
| AliLLM + DoubaoTTS | 0.539 | 0.547s | 0.503s |
| AliLLM + EdgeTTS | 0.642 | 0.547s | 1.019s |
| ChatGLMLLM + DoubaoTTS | 0.642 | 0.677s | 0.503s |
| ChatGLMLLM + EdgeTTS | 0.745 | 0.677s | 1.019s |
| AliLLM + CosyVoiceSiliconflow | 1.184 | 0.547s | 3.732s |
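The formula behind the Overall Score column is not stated. As a rough sanity check (an illustration only, not the scoring actually used above), ranking combinations by the plain sum of first-token and synthesis latency picks the same winner:

```python
# Latencies (seconds) copied from the performance tables above.
llm_first_token = {"AliLLM": 0.547, "ChatGLMLLM": 0.677}
tts_synthesis = {"DoubaoTTS": 0.503, "EdgeTTS": 1.019, "CosyVoiceSiliconflow": 3.732}

def rank_combos():
    """Rank (LLM, TTS) pairs by summed latency; a simple proxy for the
    table's (unspecified) overall score, not its actual formula."""
    combos = {
        (llm, tts): lt + tt
        for llm, lt in llm_first_token.items()
        for tts, tt in tts_synthesis.items()
    }
    return sorted(combos.items(), key=lambda kv: kv[1])

best, total = rank_combos()[0]
print(best, round(total, 3))  # ('AliLLM', 'DoubaoTTS') 1.05
```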
Conclusion 🔍
As of February 19, 2025, if my computer were located in Haizhu District, Guangzhou, Guangdong Province, and connected via China Unicom, I would prioritize:

- LLM: AliLLM
- TTS: DoubaoTTS
- This project was inspired by the Bailing Voice Dialogue Robot and implemented based on it.
- Many thanks to Tenclass for providing detailed documentation support for the XiaoZhi communication protocol.