
Human Robot Interface via Voice

RAI provides two ROS 2 enabled agents for speech-to-speech communication.

Automatic Speech Recognition Agent

See examples/s2s/asr.py for an example usage.

The agent requires configuration of the sounddevice and ros2 connectors, a voice activity detection model (e.g. SileroVAD), and a transcription model (e.g. LocalWhisper). Optionally, additional models can be configured to decide whether transcription should start (e.g. OpenWakeWord).

The agent publishes information on two topics:

/from_human: rai_interfaces/msg/HRIMessages - containing transcriptions of the recorded speech

/voice_commands: std_msgs/msg/String - containing control commands that inform the consumer whether speech is currently detected ({"data": "pause"}), whether speech was detected and has now stopped ({"data": "play"}), and whether speech was transcribed ({"data": "stop"}).
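A downstream consumer (such as a playback node) can track these commands as a small state machine. The sketch below is illustrative only and is not part of RAI; the VoiceCommandTracker class and its attribute names are hypothetical, and each command string is assumed to arrive in the data field of a std_msgs/msg/String message:

```python
class VoiceCommandTracker:
    """Illustrative consumer state for /voice_commands messages.

    Each incoming std_msgs/msg/String is assumed to carry one
    command in its `data` field: "pause", "play", or "stop".
    """

    def __init__(self):
        self.speech_detected = False
        self.transcription_ready = False

    def handle(self, command: str) -> None:
        if command == "pause":    # speech is currently being detected
            self.speech_detected = True
        elif command == "play":   # speech was detected and has now stopped
            self.speech_detected = False
        elif command == "stop":   # the recorded speech was transcribed
            self.speech_detected = False
            self.transcription_ready = True


tracker = VoiceCommandTracker()
tracker.handle("pause")   # microphone picked up speech
tracker.handle("play")    # speech ended
tracker.handle("stop")    # transcription is available on /from_human
```

In a real node this logic would live inside a ROS 2 subscription callback on /voice_commands.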

The agent utilises the sounddevice module to access the user's microphone; by default, the "default" sound device is used. To get information about the available sound devices, run:

python -c "import sounddevice; print(sounddevice.query_devices())"

The device can be identified by name and passed to the configuration.
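Picking a device out of that listing can be done by matching on the name field. The helper below is a sketch, not part of RAI; find_device is a hypothetical name, and it assumes only that sounddevice.query_devices() yields mapping-like entries with a "name" key (which it does):

```python
def find_device(devices, name_fragment):
    """Return the first device entry whose name contains `name_fragment`.

    `devices` is the sequence returned by sounddevice.query_devices();
    each entry exposes (among other keys) a "name".
    """
    for device in devices:
        if name_fragment.lower() in device["name"].lower():
            return device
    raise ValueError(f"no sound device matching {name_fragment!r}")


# Hand-written stand-in for sounddevice.query_devices() output:
devices = [
    {"name": "HDA Intel PCH: ALC287 Analog"},
    {"name": "USB Headset"},
]
print(find_device(devices, "usb")["name"])  # -> USB Headset
```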

TextToSpeechAgent

See examples/s2s/tts.py for an example usage.

The agent requires configuration of the sounddevice and ros2 connectors, as well as a TextToSpeech model (e.g. OpenTTS). The agent listens for information on two topics:

/to_human: rai_interfaces/msg/HRIMessages - containing responses to be played to the human. These responses are synthesised into audio and put into the playback queue.

/voice_commands: std_msgs/msg/String - containing control commands to pause the current playback ({"data": "pause"}), start/continue playback ({"data": "play"}), or stop playback and drop the current playback queue ({"data": "stop"}).
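The pause/play/stop semantics on the playback side can be sketched as a small queue wrapper. This is an illustration, not RAI's actual implementation; the PlaybackQueue class and its method names are hypothetical:

```python
from collections import deque


class PlaybackQueue:
    """Illustrative playback queue honouring the /voice_commands semantics."""

    def __init__(self):
        self.queue = deque()   # pending audio chunks
        self.playing = False

    def enqueue(self, audio_chunk) -> None:
        # Synthesised responses from /to_human land here.
        self.queue.append(audio_chunk)

    def handle(self, command: str) -> None:
        if command == "pause":    # pause the current playback
            self.playing = False
        elif command == "play":   # start or continue playback
            self.playing = True
        elif command == "stop":   # stop and drop the pending queue
            self.playing = False
            self.queue.clear()
```

Note that "stop" both halts playback and discards everything still queued, whereas "pause" leaves the queue intact so "play" can resume it.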

The agent utilises the sounddevice module to access the user's speaker; by default, the "default" sound device is used. To get a list of the names of available sound devices, run:

python -c 'import sounddevice as sd; print([x["name"] for x in list(sd.query_devices())])'

The device can be identified by name and passed to the configuration.

OpenTTS

To run OpenTTS (and the example), a Docker container serving the model must be running. To start it, run:

docker run -it -p 5500:5500 synesthesiam/opentts:en --no-espeak
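Once the container is up, it serves speech over a local HTTP API; to my understanding, OpenTTS exposes a GET /api/tts endpoint that returns WAV audio for a given voice and text (the list of installed voices can be queried from the running server). The sketch below only builds the request URL; tts_url is a hypothetical helper and the voice name is just an example:

```python
from urllib.parse import urlencode


def tts_url(text: str, voice: str, host: str = "http://localhost:5500") -> str:
    """Build an OpenTTS synthesis request URL (illustrative helper)."""
    query = urlencode({"voice": voice, "text": text})
    return f"{host}/api/tts?{query}"


url = tts_url("Hello, robot!", "flite:cmu_us_slt")
print(url)
# The WAV bytes could then be fetched with e.g.
#   urllib.request.urlopen(url).read()
# provided the container from the docker command above is running.
```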

Running example

To run the provided example of an S2S configuration with a minimal LLM-based agent, run the following in 4 separate terminals:

$ docker run -it -p 5500:5500 synesthesiam/opentts:en --no-espeak
$ python ./examples/s2s/asr.py
$ python ./examples/s2s/tts.py
$ python ./examples/s2s/conversational.py