This project provides the backend service for the open-source smart-hardware project xiaozhi-esp32. It is implemented in Python, based on the XiaoZhi Communication Protocol.

This project is designed to be used together with ESP32 hardware devices. If you have already purchased an ESP32 device, successfully connected it to the backend service deployed by XieGe, and now wish to set up your own xiaozhi-esp32 backend service, this project is for you.
Want to see it in action? Check out the videos 🎥
- Hardware: A set of devices compatible with xiaozhi-esp32 (for specific models, please refer to this link).
- Server: A computer with at least a 4-core CPU and 8 GB of memory.
- Firmware Compilation: Update the backend service API endpoint in the xiaozhi-esp32 project, then recompile the firmware and flash it to your device.
This project is relatively new and has not yet undergone a network security evaluation. Do not use it in a production environment. If you deploy this project on a public network for learning purposes, be sure to enable protection in the configuration file `config.yaml`:

```yaml
server:
  auth:
    # Enable protection
    enabled: true
```
Once protection is enabled, you will need to validate the machine's token or MAC address based on your actual situation. Please refer to the configuration documentation for details.
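As a hedged illustration of what such validation might look like (the key names `allowed_devices` and `tokens` below are assumptions; check the configuration documentation for the actual field names):

```yaml
server:
  auth:
    enabled: true
    # Key names below are illustrative assumptions, not guaranteed
    # to match the real config schema.
    allowed_devices:
      - "24:0A:C4:xx:xx:xx"   # device MAC address
    tokens:
      - "your-device-token"
```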
- Communication Protocol: Based on the xiaozhi-esp32 protocol, with data exchange implemented over WebSocket.
- Dialogue Interaction: Supports wake-word dialogues, manual conversations, and real-time interruption. Automatically enters sleep mode after long periods of inactivity.
- Multilingual Recognition: Supports Mandarin, Cantonese, English, Japanese, and Korean (FunASR by default).
- LLM Module: Allows flexible switching of LLM modules. The default is ChatGLMLLM, with options to use AliLLM, DeepSeek, Ollama, and others.
- TTS Module: Supports multiple TTS interfaces, including EdgeTTS (default) and Volcano Engine Doubao TTS, to meet speech-synthesis requirements.
- Conversation Memory Feature
- Multiple Mood Modes
- Smart Control Panel Web UI
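The "flexible switching" above boils down to selecting a provider implementation by the name given in the configuration. A simplified registry sketch of that pattern (this is an illustration, not the project's actual code; `EdgeTTSStub` is a placeholder):

```python
# Simplified sketch of pluggable provider switching: implementations
# register under a name, and the configured name picks one at startup.
from abc import ABC, abstractmethod

class TTSProvider(ABC):
    @abstractmethod
    def synthesize(self, text: str) -> bytes: ...

_REGISTRY: dict[str, type[TTSProvider]] = {}

def register(name: str):
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

@register("EdgeTTS")
class EdgeTTSStub(TTSProvider):
    # Placeholder: a real provider would call the TTS backend here.
    def synthesize(self, text: str) -> bytes:
        return b"<audio for: " + text.encode() + b">"

def create_tts(name: str) -> TTSProvider:
    # `name` would come from config.yaml, e.g. "EdgeTTS" or "DoubaoTTS".
    return _REGISTRY[name]()
```

Swapping TTS (or LLM, ASR, VAD) engines then only requires changing one string in the configuration, never touching call sites.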
| Type | Platform Name | Usage Method | Pricing Model | Remarks |
|---|---|---|---|---|
| LLM | AliLLM (阿里百炼) | OpenAI API call | Token consumption | Click to apply for an API key |
| LLM | DeepSeekLLM (深度求索) | OpenAI API call | Token consumption | Click to apply for an API key |
| LLM | ChatGLMLLM (智谱) | OpenAI API call | Free | Although free, you still need to click to apply for an API key |
| LLM | OllamaLLM | Ollama API call | Free/Custom | Requires pre-downloading the model (`ollama pull `); service URL: http://localhost:11434 |
| LLM | DifyLLM | Dify API call | Token consumption | For local deployment; note that the prompt must be configured in the Dify console |
| LLM | GeminiLLM | Gemini API call | Free | Click to apply for an API key |
| LLM | CozeLLM | Coze API call | Token consumption | Requires providing bot_id, user_id, and a personal token |
| LLM | Home Assistant | Home Assistant voice assistant API call | Free | Requires providing a Home Assistant token |
In fact, any LLM that supports OpenAI API calls can be integrated.
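Integrating such a provider usually means nothing more than pointing an OpenAI-style request at a different base URL. A minimal stdlib-only sketch of the request shape these providers accept (URL path and field names follow the OpenAI chat-completions convention; the model name is an example):

```python
import json

def build_chat_request(base_url: str, model: str, user_text: str):
    """Build an OpenAI-compatible chat-completion request.

    Any provider exposing this API shape can be targeted just by
    changing base_url and model in the configuration.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    })
    return url, payload

# Example: an Ollama server exposing its OpenAI-compatible endpoint.
url, payload = build_chat_request("http://localhost:11434/v1", "qwen2.5", "hello")
```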
| Type | Platform Name | Usage Method | Pricing Model | Remarks |
|---|---|---|---|---|
| TTS | EdgeTTS | API call | Free | Default TTS, based on Microsoft's speech-synthesis technology |
| TTS | DoubaoTTS (火山引擎豆包 TTS) | API call | Token consumption | Click to create an API key; the paid version is recommended for higher concurrency |
| TTS | CosyVoiceSiliconflow | API call | Token consumption | Requires applying for a Siliconflow API key; output format is WAV |
| TTS | CozeCnTTS | API call | Token consumption | Requires providing a Coze API key; output format is WAV |
| TTS | FishSpeech | API call | Free/Custom | Starts a local TTS service; see the configuration file for startup instructions |
| TTS | GPT_SOVITS_V2 | API call | Free/Custom | Starts a local TTS service, suitable for personalized speech-synthesis scenarios |
| Type | Platform Name | Usage Method | Pricing Model | Remarks |
|---|---|---|---|---|
| VAD | SileroVAD | Local | Free | |

| Type | Platform Name | Usage Method | Pricing Model | Remarks |
|---|---|---|---|---|
| ASR | FunASR | Local | Free | |
| ASR | DoubaoASR | API call | Paid | |
This project supports three deployment methods. Choose the one that best fits your needs.
The documentation provided here is a written tutorial. If you prefer a video tutorial, you can refer to this expert's hands-on guide.
Combining both the written and video tutorials can help you get started more quickly.
- Docker Quick Deployment: suitable for general users who want a quick experience without extensive environment configuration. The only downside is that pulling the image can be a bit slow.
- Deployment Using a Docker Environment: ideal for software engineers who already have Docker installed and wish to customize the code.
- Running from Local Source Code: suitable for users familiar with the Conda environment or those who wish to build the runtime environment from scratch.

For scenarios requiring faster responses, running from local source code is recommended, as it reduces additional overhead.
Click here for a detailed guide on firmware compilation.
After successful compilation and network connection, wake up XiaoZhi using the wake-up word and monitor the server console for output.
Suggestion:
If EdgeTTS frequently fails, first check whether you are using a proxy (VPN); if so, disable the proxy and try again. If you are using Volcano Engine Doubao TTS and it often fails, the paid version is recommended, since the trial supports only 2 concurrent requests.
Suggestion:
Set the LLM to HomeAssistant in the configuration file and use the HomeAssistant API to perform the relevant controls.
Suggestion:
Locate the following section in the configuration file and increase the value of `min_silence_duration_ms` (for example, change it to `1000`):

```yaml
VAD:
  SileroVAD:
    threshold: 0.5
    model_dir: models/snakers4_silero-vad
    min_silence_duration_ms: 700  # If your pauses are longer, increase this value
```
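To get a feel for what the value means: the VAD only treats a pause as end-of-utterance after that many milliseconds of continuous silence. A quick conversion to audio samples (the 16 kHz sample rate here is an assumption for illustration):

```python
def silence_ms_to_samples(min_silence_duration_ms: int, sample_rate: int = 16000) -> int:
    """Number of consecutive silent samples the VAD must observe before
    treating the pause as end-of-utterance. 16 kHz is an assumed rate."""
    return min_silence_duration_ms * sample_rate // 1000

# Raising the setting from 700 ms to 1000 ms lengthens the required pause:
print(silence_ms_to_samples(700))   # 11200 samples
print(silence_ms_to_samples(1000))  # 16000 samples
```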
Suggestion:
Check whether the `model.pt` file exists in the `models/SenseVoiceSmall` directory. If it does not, download it; see Download ASR Model Files for details.
Suggestion:
Verify that you have correctly installed the `libopus` and `ffmpeg` libraries using `conda`. If not, install them with:

```bash
conda install conda-forge::libopus
conda install conda-forge::ffmpeg
```
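A quick way to confirm both dependencies are visible to the running environment (a small stdlib helper sketch, not part of the project):

```python
import ctypes.util
import shutil

def check_audio_deps() -> dict:
    """Report whether ffmpeg (as a CLI tool on PATH) and libopus
    (as a loadable shared library) can be found."""
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "libopus": ctypes.util.find_library("opus") is not None,
    }

print(check_audio_deps())
```

If either entry is `False` inside your conda environment, re-run the install commands above.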
The default configuration of this project is designed to be cost-effective. It is recommended that beginners first use the default free models to ensure that the system runs smoothly, then optimize for faster response times.
To improve response speed, you can try replacing individual components. Below are the response time test results for each component (for reference only, not a guarantee):
LLM Performance Ranking:

| Module Name | Average First-Token Time | Average Total Response Time |
|---|---|---|
| AliLLM | 0.547s | 1.485s |
| ChatGLMLLM | 0.677s | 3.057s |
| OllamaLLM | 0.003s | 0.003s |
TTS Performance Ranking:

| Module Name | Average Synthesis Time |
|---|---|
| EdgeTTS | 1.019s |
| DoubaoTTS | 0.503s |
| CosyVoiceSiliconflow | 3.732s |
Recommended Configuration Combinations (Overall Response Speed):

| Combination Scheme | Overall Score | LLM First Token | TTS Synthesis |
|---|---|---|---|
| AliLLM + DoubaoTTS | 0.539 | 0.547s | 0.503s |
| AliLLM + EdgeTTS | 0.642 | 0.547s | 1.019s |
| ChatGLMLLM + DoubaoTTS | 0.642 | 0.677s | 0.503s |
| ChatGLMLLM + EdgeTTS | 0.745 | 0.677s | 1.019s |
| AliLLM + CosyVoiceSiliconflow | 1.184 | 0.547s | 3.732s |
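The formula behind the Overall Score column is not stated. As a rough sanity check (an illustration only, not the scoring actually used above), ranking combinations by the plain sum of first-token and synthesis latency picks the same winner:

```python
# Latencies (seconds) copied from the performance tables above.
llm_first_token = {"AliLLM": 0.547, "ChatGLMLLM": 0.677}
tts_synthesis = {"DoubaoTTS": 0.503, "EdgeTTS": 1.019, "CosyVoiceSiliconflow": 3.732}

def rank_combos():
    """Rank (LLM, TTS) pairs by summed latency; a simple proxy for the
    table's (unspecified) overall score, not its actual formula."""
    combos = {
        (llm, tts): lt + tt
        for llm, lt in llm_first_token.items()
        for tts, tt in tts_synthesis.items()
    }
    return sorted(combos.items(), key=lambda kv: kv[1])

best, total = rank_combos()[0]
print(best, round(total, 3))  # ('AliLLM', 'DoubaoTTS') 1.05
```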
Conclusion 🔍
As of February 19, 2025, if my computer were located in Haizhu District, Guangzhou, Guangdong Province, and connected via China Unicom, I would prioritize:

- LLM: AliLLM
- TTS: DoubaoTTS
- This project was inspired by the Bailing Voice Dialogue Robot and implemented based on it.
- Many thanks to Tenclass for providing detailed documentation support for the XiaoZhi communication protocol.