
Real-time Audio Interaction with OpenAI WebSocket API

This sample code demonstrates real-time audio interaction using OpenAI's WebSocket API for GPT-4o's real-time audio streaming preview. The system sends an input audio file to the OpenAI server and plays the audio response in real time using Node.js.

Overview

  • Input Audio: The system reads an input audio file (e.g., gettysburg.wav), encodes it into base64 PCM16 format, and sends it to the OpenAI server.
  • Real-time Response: The OpenAI server responds with real-time audio chunks, which are played directly from memory using the speaker library.
  • Real-time Playback: The audio is played as it is received, with no need to save it to a file first.
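The encoding step in the first bullet can be sketched as below. This is a minimal sketch, not the exact code in app.mjs; it assumes the decoded samples arrive as a Float32Array (the channel data an audio-decode AudioBuffer exposes):

```javascript
// Convert decoded Float32 samples ([-1, 1]) into little-endian PCM16 bytes,
// then base64-encode them for the WebSocket payload.
function floatTo16BitPCM(float32Array) {
    const buffer = Buffer.alloc(float32Array.length * 2);
    for (let i = 0; i < float32Array.length; i++) {
        const s = Math.max(-1, Math.min(1, float32Array[i])); // clamp
        buffer.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
    }
    return buffer;
}

function base64EncodeAudio(float32Array) {
    return floatTo16BitPCM(float32Array).toString('base64');
}
```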

WebSocket Connection and Audio Setup

First, configure the WebSocket connection to OpenAI's API and set up the speaker to play the real-time audio response.

const url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01";
const ws = new WebSocket(url, {
    headers: {
        "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
        "OpenAI-Beta": "realtime=v1",
    },
});
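Once the connection opens, the session can optionally be configured with a session.update event. The field names below follow the realtime beta protocol and should be treated as an assumption; verify them against the current API reference:

```javascript
// Sketch: build a session.update event pinning the audio formats the
// rest of this README assumes (mono PCM16 at 24 kHz in and out).
function buildSessionUpdate() {
    return {
        type: "session.update",
        session: {
            modalities: ["audio", "text"],
            voice: "alloy",                  // assumed default voice
            input_audio_format: "pcm16",
            output_audio_format: "pcm16",
        },
    };
}
```

It would be sent over the open socket with `ws.send(JSON.stringify(buildSessionUpdate()))`.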

Setting up the live playback

The speaker library is used to play the real-time audio response. The audio chunks received from the OpenAI server are appended to the speaker buffer for live playback.

const speaker = new Speaker({
    channels: numChannels,          // 1 channel (mono)
    bitDepth: 16,                   // 16-bit samples
    sampleRate: sampleRate          // 24,000 Hz sample rate
});
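One possible shape for the appendBase64AudioToSpeaker helper used in the message handler below; this is a hedged sketch (not the exact code in app.mjs), with the output stream passed in explicitly — in the app it could simply close over the speaker instance:

```javascript
// Decode a base64 audio delta and write the raw PCM16 bytes straight
// into a writable stream (the Speaker instance, in this project).
function appendBase64AudioToSpeaker(base64Audio, out) {
    const audioBuffer = Buffer.from(base64Audio, 'base64');
    console.log(`Appending ${audioBuffer.length} bytes to speaker...`);
    out.write(audioBuffer);      // Speaker is a writable stream
    return audioBuffer.length;   // byte count, handy for logging
}
```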

On-Message Event Handler

The message event handler processes the WebSocket messages received from the OpenAI server. Audio chunks are appended to the speaker buffer for real-time playback.

ws.on("message", function incoming(message) {
    const parsedMessage = JSON.parse(message.toString());
    console.log("Message received from server:", parsedMessage);

    // Handle audio delta events
    if (parsedMessage.type === 'response.audio.delta' && parsedMessage.delta) {
        const base64Audio = parsedMessage.delta;
        appendBase64AudioToSpeaker(base64Audio); // Play audio chunk in real-time
    }

    // Handle the response.audio.done event
    if (parsedMessage.type === 'response.audio.done') {
        console.log("Audio generation done.");
        speaker.end(); // End the speaker stream
    }

    // If the message contains content, print it in detail
    if (parsedMessage.item && parsedMessage.item.content) {
        console.log("Message content:", JSON.stringify(parsedMessage.item.content, null, 2));
    }
});
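To actually submit the input audio, the client sends the encoded buffer followed by a response request once the socket opens. A sketch of those client events (the event names follow the realtime beta protocol and are an assumption — check them against app.mjs and the API reference):

```javascript
// Build the three client events that upload the input audio, commit the
// buffer, and ask the model for a response.
function buildAudioEvents(base64Audio) {
    return [
        { type: "input_audio_buffer.append", audio: base64Audio }, // upload chunk
        { type: "input_audio_buffer.commit" },                     // finalize buffer
        { type: "response.create" },                               // request a response
    ];
}
```

On the socket's open event these would be sent with `buildAudioEvents(base64Audio).forEach(e => ws.send(JSON.stringify(e)));`.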

Prerequisites

  • Node.js (v14+)
  • OpenAI API Key
  • An audio file in .wav format (mono, 16-bit PCM, sampled at 24,000 Hz)

Installation

  1. Clone the repository or download the project files:

    git clone [email protected]:sajithamma/openai-realtime-nodejs.git
    cd openai-realtime-nodejs
  2. Install the dependencies:

    npm install

    This will install the following dependencies:

    • dotenv: To manage environment variables.
    • speaker: To play real-time audio from PCM16 data.
    • audio-decode: To decode the input audio file.
    • ws: WebSocket client for OpenAI's WebSocket API.
  3. Add your OpenAI API key to a .env file:

    touch .env

    Add the following line to the .env file:

    OPENAI_API_KEY=your-openai-api-key-here

Running the Project

  1. Place your input .wav audio file in the project directory. Ensure the file is mono, 16-bit PCM, sampled at 24,000 Hz. For example, gettysburg.wav.

  2. Run the project:

    node app.mjs
  3. The program will:

    • Read and encode the input audio file.
    • Send the input audio to OpenAI’s WebSocket API.
    • Play the response audio in real-time using your system's speaker.

Project Structure

.
├── app.mjs               # Main entry point of the app
├── package.json         # Project dependencies and scripts
├── .env                 # OpenAI API key (not included in the repo)
└── gettysburg.wav       # Input audio file (add your own)

Expected Output

  1. Console Output: As the program runs, you will see WebSocket events being printed in the console. Example:

    Connected to server.
    Message received from server: { type: 'response.audio.delta', ... }
    Appending 12345 bytes to speaker...
    Message received from server: { type: 'response.audio.done', ... }
    Audio generation done.
    
  2. Real-time Audio Playback: You will hear the real-time response generated by the OpenAI server through your speakers as audio chunks are received.

Troubleshooting

  1. Slow Audio: If the audio sounds slowed down, ensure that the sample rate of the input file is 24,000 Hz. The response audio is played assuming this rate.
  2. No Audio: Make sure your system's audio is working and the correct audio device is selected. Also, ensure the input audio file is in the correct format.
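To verify the input file's format (troubleshooting item 1), the fmt fields can be read directly from a canonical 44-byte RIFF/WAVE header. This sketch assumes no extra chunks sit before the fmt chunk; real-world files may need a proper WAV parser:

```javascript
// Read channel count, sample rate, and bit depth from a canonical
// RIFF/WAVE header: numChannels at byte 22, sampleRate at 24,
// bitsPerSample at 34.
function readWavFormat(buffer) {
    return {
        channels: buffer.readUInt16LE(22),   // expect 1 (mono)
        sampleRate: buffer.readUInt32LE(24), // expect 24000
        bitDepth: buffer.readUInt16LE(34),   // expect 16
    };
}
```

In this project it could be run against the input file with the buffer returned by fs.readFileSync("gettysburg.wav").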

About

OpenAI realtime API using NodeJS
