This is an Unreal Engine plugin for accurate speech recognition that does not require an internet connection.
- Offline Speech Recognition
- Table of contents
- High level overview
- Project settings
- Test your microphone
- Where to download languages and how to test them
- Using built-in language server (USE THIS)
- Running language server as external process
- Running server process and game process at the same time
- Passing SoundWave as input, instead of microphone
- Platforms supported
- Links
Since this is a speech-to-text (STT) plugin, the first thing you need is a way to record your voice (any recording device will do). The recorded voice is passed to the speech recognizer, which returns your speech in textual form. The speech recognizer works with one language at a time. Each language is a downloadable folder with files.
To ship your game or app to end users, you need to package each language model with it, along with the language server itself (the server is optional, since your game itself can act as the server).
To make the microphone work, add the following lines to the project's DefaultEngine.ini.
[Voice]
bEnabled=true
To avoid losing pauses between words, you will probably want to adjust the silence detection threshold, voice.SilenceDetectionThreshold. A value of 0.01 works well.
This also goes in DefaultEngine.ini.
[SystemSettings]
voice.SilenceDetectionThreshold=0.01
Starting from engine version 4.25, also add
voice.MicNoiseGateThreshold=0.01
Other voice-related variables worth experimenting with:
voice.MicNoiseGateThreshold
voice.MicInputGain
voice.MicStereoBias
voice.MicNoiseAttackTime
voice.MicNoiseReleaseTime
voice.SilenceDetectionAttackTime
voice.SilenceDetectionReleaseTime
To find the available settings, type voice. in the editor console, and an autocompletion widget will pop up.
Console variables can also be modified at runtime, for example with the Execute Console Command Blueprint node.
The values above may need adjusting depending on your microphone's characteristics.
To debug your microphone input, you can convert the output sound buffer to an Unreal sound wave and play it.
Another thing to keep in mind: if the component is connected to a server, by default it will try to send voice data during microphone capture. If you don't want this behavior, you can disable it.
Disable it for push-to-talk style recognition (when you record the whole phrase first and then send it to the server).
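The push-to-talk flow described above (record the whole phrase, then send it in one go) can be sketched in plain C++. This is only an illustration of the buffering idea: the class PushToTalkBuffer and its methods Press, OnCapturedChunk and Release are hypothetical names, not part of the plugin's API, and the sketch assumes captured audio arrives as chunks of raw bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not plugin API): accumulates captured audio while a
// push-to-talk key is held, and hands back the whole phrase on release.
class PushToTalkBuffer {
public:
    // Key pressed: start a fresh phrase.
    void Press() {
        // Assumed contract for this sketch: Press is not called twice in a row.
        assert(!recording_ && "Press called again before Release");
        recording_ = true;
        phrase_.clear();
    }

    // Called for every chunk the microphone delivers.
    // Chunks that arrive while the key is not held are ignored.
    void OnCapturedChunk(const std::vector<std::uint8_t>& chunk) {
        if (recording_) {
            phrase_.insert(phrase_.end(), chunk.begin(), chunk.end());
        }
    }

    // Key released: stop recording and return everything captured so far.
    // The caller would then send these bytes to the language server.
    std::vector<std::uint8_t> Release() {
        recording_ = false;
        std::vector<std::uint8_t> phrase = std::move(phrase_);
        phrase_.clear();  // leave the moved-from buffer in a known empty state
        return phrase;
    }

private:
    bool recording_ = false;
    std::vector<std::uint8_t> phrase_;
};
```

In a real project the bytes returned by Release would be passed to whatever node or function sends voice data to the server. Nothing is sent during capture, which is the point of disabling the default streaming behavior.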
All available language models can be found here
To test how a specific language behaves, you can use the external language server app
This method is preferable for simple scenarios where you don't need to separate your game from the language server. It spares you the hassle of managing an external process and communicating with the server via WebSockets.
For both automatic and push-to-talk style recognition, start by adding a SpeechRecognizer component to your actor.
Then load a language into it. (This is a non-blocking function, and you know exactly when the model is fully loaded into memory by connecting to the Finished output pin.)
The Feed Voice Data node can handle any amount of pre-recorded speech, see this section
In more complex cases, this method is preferable to the built-in server. You can have a single language server running in the cloud or on a local server, and since it is multithreaded, it can serve multiple clients at the same time.
- Download the latest version here
- Run vls.exe, which is a user interface for asr_server.exe
  NOTE: asr_server.exe is the actual server, and you can run it without the GUI
- Go to main menu -> File -> Download models
- You will be redirected to a web page where you will find all available models (languages)
- To start using a language, first download one of the models
- Enter the path to the downloaded model in the server UI and press the start button
  !NOTE!: Depending on the model size, you need to wait until the model is loaded into memory before you start feeding the server with voice data. For example, if the model size is ~2GB, it can take ~10-30 seconds. This is a one-time event, though: you can load your language into memory once at OS startup.
- Open Unreal
- Create an actor blueprint
- Add the Vosk component in the Components panel
- On Begin Play
- Start talking
- Check that the Partial Result Received event gets executed
The plugin offers the following nodes:
Build Server Parameters - helper method that simplifies passing arguments to the Create Process node
Create Process - runs an external program. This node is generic, so you can use it to run any external program
NOTE: When you ship your game, you need to include the language server as well. Put the language server files in your game's bin folder (GAME/Binaries/Win64/**) and use the "GetProcessExecutablePath" node to build the path to asr_server.exe
Kill Process - the equivalent of Alt+F4. It shuts down an external process based on its process ID (the process ID serves as the process handle). Save the output of the Create Process node to a variable and use it later to terminate the process.
Default use case:
- Create an Actor responsible for voice recognition
- Start the language server on the Begin Play event
- Add the Vosk actor component and initialize it in Begin Play
- Begin capturing voice data
- Bind to message receive events
- Uninitialize the Vosk component and terminate the server process on End Play
NOTE: Uninitialize will stop voice capture if it is active
To pass a SoundWave as input instead of the microphone, the plugin offers a node called "Decompress Sound" that converts a sound into an array of bytes. You can then pass the output of the Decompress Sound node to the "Send Voice Data to Language Server" node and expect the partial and final result events to be invoked later, when the server finishes recognition.
NOTE: Do not call BeginCapture and FinishCapture in this case, since we don't want to use audio from the microphone
The "Send Voice Data to Language Server" node takes the sound bytes as its first argument and the packet size as its second. It splits the bytes into packets of the given size and sends them one after another to the language server, emulating microphone capture behavior. If the packet size is greater than the size of the voice data, the data will not be sent. A packet size of 4096 works relatively fast and is suitable for short phrases. Note that with a small packet size it takes more time to deliver the entire recording to the server, and the server performs more iterations accordingly. You should experiment with the packet size for your specific case.
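The packet-splitting behavior described above can be sketched in plain C++ as follows. This is not the plugin's implementation: SplitIntoPackets is a hypothetical helper, and the handling of the trailing remainder (sent here as a shorter final packet) is an assumption, since the text does not say what happens to leftover bytes. The rule that nothing is sent when the packet size exceeds the data size is taken from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not plugin API): splits raw voice bytes into packets
// of at most packetSize bytes, in the order they would be sent to the server.
std::vector<std::vector<std::uint8_t>> SplitIntoPackets(
    const std::vector<std::uint8_t>& voiceData, std::size_t packetSize)
{
    std::vector<std::vector<std::uint8_t>> packets;

    // Documented rule: if the packet size is greater than the data size,
    // nothing is sent. A zero packet size is also rejected to avoid looping forever.
    if (packetSize == 0 || packetSize > voiceData.size()) {
        return packets;
    }

    for (std::size_t offset = 0; offset < voiceData.size(); offset += packetSize) {
        // Assumption: the last packet may be shorter than packetSize.
        const std::size_t count = std::min(packetSize, voiceData.size() - offset);
        const auto first = voiceData.begin() + static_cast<std::ptrdiff_t>(offset);
        packets.emplace_back(first, first + static_cast<std::ptrdiff_t>(count));
    }

    // Once the guard above passes, at least one packet is always produced.
    assert(!packets.empty());
    return packets;
}
```

With 10000 bytes of audio and a packet size of 4096, this yields two full packets and one final packet of 1808 bytes. Halving the packet size roughly doubles the number of sends, which is the trade-off mentioned above.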
Tested on Windows
Find out more in documentation