WebSockets in the Vonage Voice API

This guide explains how WebSockets integrate with the Vonage Voice API and help you build sophisticated real-time applications, such as AI-powered voice bots or live transcription services.

What Are WebSockets?

WebSockets are a communication protocol that provides a persistent, full-duplex connection between an application and the Vonage Voice platform.
Unlike HTTP, which requires a separate request for each exchange, a WebSocket maintains a single open connection over which messages can be sent in both directions at any time.

This is ideal for scenarios requiring low-latency, streaming data, such as sending and receiving audio packets in real time.

Why Are WebSockets Important for AI Connectors?

In AI voice applications, you often need to:

  • Receive live audio from a caller.
  • Transcribe that live audio.
  • Send synthesized audio responses back to the caller.
  • Send metadata for the session.
  • Exchange control signals dynamically.

WebSockets enable this by:

  • Streaming audio as binary packets.
  • Exchanging text messages as control commands or events.
  • Sending metadata at the beginning of the session.
  • Allowing near-instant interaction with AI services, such as speech recognition, NLU, or TTS engines.

Setting Up a WebSocket Server in Your Application

To connect Vonage to your WebSocket server (your application), you must:

  1. Deploy a WebSocket endpoint accessible over a secure (wss://) URL.
  2. Handle incoming connections initiated by Vonage.
  3. Process both binary messages (audio) and text messages (JSON commands/events).
  4. Optionally implement authentication or authorization logic.

Example (Node.js, ws):

const WebSocket = require('ws');
const server = new WebSocket.Server({ port: 8080 });

server.on('connection', socket => {
  console.log('Vonage connected');

  // In ws v8+, both frame types arrive as Buffers; use the isBinary flag
  // to tell audio apart from JSON control messages.
  socket.on('message', (message, isBinary) => {
    if (isBinary) {
      // Handle binary audio
      console.log('Binary audio packet received');
    } else {
      // Handle JSON commands or events
      console.log('Text message:', message.toString());
    }
  });

  socket.on('close', () => {
    console.log('WebSocket disconnected');
  });
});

Using an NCCO to Establish the WebSocket Connection

To instruct Vonage to stream audio to your WebSocket server, configure an NCCO (Nexmo Call Control Object) action of type connect:

Example NCCO:

[
  {
    "action": "connect",
    "endpoint": [
      {
        "type": "websocket",
        "uri": "wss://your-server.example.com",
        "content-type": "audio/l16;rate=16000",
        "headers": {
          "custom-header": "value"
        }
      }
    ]
  }
]

Audio Format:

You control the audio format via content-type:

  • audio/l16;rate=24000: 16-bit linear PCM, 24kHz.
  • audio/l16;rate=16000: 16-bit linear PCM, 16kHz (recommended for speech recognition).
  • audio/l16;rate=8000: 16-bit linear PCM, 8kHz.
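When sizing buffers, it helps to know the raw data rate each format implies. A quick calculation for 16-bit (2 bytes per sample) mono PCM:

```javascript
// 16-bit linear PCM: 2 bytes per sample, mono.
const bytesPerSample = 2;

function bytesPerSecond(sampleRate) {
  return sampleRate * bytesPerSample;
}

// A typical 20 ms packet at 16 kHz (integer math to stay exact):
const packetBytes = (bytesPerSecond(16000) * 20) / 1000;

console.log(bytesPerSecond(16000)); // 32000 bytes per second at 16 kHz
console.log(packetBytes);           // 640 bytes per 20 ms packet
```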

Bidirectional Interaction: Audio and Text Messages

From Vonage to Your Application:

  • Binary messages: Audio chunks captured from the caller.
  • Text messages: JSON events (e.g., connection opened, cleared, notifications).

From Your Application to Vonage:

  • Binary messages: Audio to play to the caller.
  • Text messages: Commands to control playback or request notifications.

This bidirectional flow enables:

  • Real-time transcription.
  • Playback of synthesized speech.
  • Control over playback buffers.
  • Event-driven interactions.

Parsing Vonage Packets (Binary vs JSON)

When your WebSocket server receives a message:

  • If it arrived as a binary frame (typically surfaced as a Buffer or ArrayBuffer):
    • It’s audio data (raw PCM).
  • If it arrived as a text frame (surfaced as a string by many libraries):
    • It’s a JSON-formatted control message.

Example JSON event:

{
    "event":"websocket:connected",
    "content-type":"audio/l16;rate=16000",
    "prop1": "value1",
    "prop2": "value2"
}

Always inspect the message type to route processing logic correctly.
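Note that the ws library (version 8 and later) delivers both frame types as Buffers, so checking for a string is not enough; route on the isBinary flag it passes alongside the data. A sketch, where the handler names are placeholders:

```javascript
// Route a ws message: binary frames carry PCM audio, text frames carry JSON.
function routeMessage(data, isBinary, handlers) {
  if (isBinary) {
    handlers.onAudio(data); // Buffer of raw PCM samples
  } else {
    handlers.onEvent(JSON.parse(data.toString())); // JSON control message
  }
}

// Example: count what arrives.
const seen = { audio: 0, events: [] };
const handlers = {
  onAudio: () => seen.audio++,
  onEvent: (e) => seen.events.push(e)
};
routeMessage(Buffer.from([0, 0, 0, 0]), true, handlers);
routeMessage(Buffer.from('{"event":"websocket:connected"}'), false, handlers);
console.log(seen.audio, seen.events[0].event); // 1 websocket:connected
```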

Handling Incoming Binary Audio Packets

Binary messages contain raw PCM audio captured from the caller.

Key characteristics:

  • 16-bit signed little-endian PCM.
  • Sample rate defined by content-type (e.g., 16,000 Hz).
  • Each packet represents a short slice of audio (~20ms).

Typical processing:

  • Feed the audio into a speech recognition engine.
  • Buffer for later playback.
  • Save to disk for analysis.
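Before feeding audio to downstream tools you may need the individual sample values. Since the stream is 16-bit signed little-endian PCM, each pair of bytes decodes with readInt16LE; a small sketch:

```javascript
// Decode a raw PCM buffer (16-bit signed little-endian) into sample values.
function decodePcm16(buffer) {
  const samples = new Int16Array(buffer.length / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = buffer.readInt16LE(i * 2);
  }
  return samples;
}

// Two samples: 1000 and -1000.
const packet = Buffer.alloc(4);
packet.writeInt16LE(1000, 0);
packet.writeInt16LE(-1000, 2);
const samples = decodePcm16(packet);
console.log(Array.from(samples)); // [ 1000, -1000 ]
```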

Sending Binary Audio Packets to Vonage

To play audio back to the caller:

  1. Encode your audio as raw PCM.
  2. Match the sample rate and format specified in the NCCO.
  3. Send the audio data as binary WebSocket messages.

Important:
Vonage buffers incoming audio and plays it in order. This lets you queue audio without gaps, but it requires buffer management, described in the sections below.
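A common pattern is to slice the synthesized PCM into packet-sized frames, e.g. 640 bytes for 20 ms at 16 kHz, and send each as its own binary message. A sketch with a stubbed socket (in practice, socket is your ws connection):

```javascript
// Slice a PCM buffer into fixed-size frames and send each as a binary message.
const FRAME_BYTES = 640; // 20 ms at 16 kHz, 16-bit mono

function sendAudio(socket, pcm) {
  for (let offset = 0; offset < pcm.length; offset += FRAME_BYTES) {
    socket.send(pcm.subarray(offset, offset + FRAME_BYTES));
  }
}

// Stub socket that records the size of each frame it would send.
const sent = [];
const fakeSocket = { send: (frame) => sent.push(frame.length) };
sendAudio(fakeSocket, Buffer.alloc(1600)); // 50 ms of audio
console.log(sent); // [ 640, 640, 320 ] — the last frame is a partial packet
```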

How Audio Buffering Works

When you send binary audio packets:

  • Vonage buffers them internally.
  • The WebSocket buffer holds up to 3072 packets, which is enough for roughly 60 seconds of audio.
  • Playback starts automatically.
  • Subsequent packets are queued.
  • You cannot interrupt playback mid-buffer without a control command.

This design ensures consistent playback without audio gaps.
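The quoted capacity follows from the packet duration: 3072 packets of roughly 20 ms each works out to about a minute of queued audio:

```javascript
// Buffer capacity implied by the packet count and per-packet duration above.
const BUFFER_PACKETS = 3072;
const PACKET_MS = 20;

const bufferSeconds = (BUFFER_PACKETS * PACKET_MS) / 1000;
console.log(bufferSeconds); // 61.44 seconds of queued audio
```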

Clearing the Audio Buffer (clear Command)

To immediately stop playback of buffered audio, send the clear command.

Outbound Command (from your application):

{
  "action": "clear"
}

Effect:

  • All queued audio is discarded.
  • Playback stops immediately.

Inbound Acknowledgement (from Vonage platform):

{
  "event": "websocket:cleared"
}

Usage Scenario:
You need to interrupt playback to respond dynamically to the caller (e.g., after barge-in detection).
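A small sketch of sending the clear command and matching the acknowledgement. The socket is stubbed here; in practice it is your ws connection:

```javascript
// Send the clear command as a text (JSON) message.
function clearAudioBuffer(socket) {
  socket.send(JSON.stringify({ action: 'clear' }));
}

// Match the acknowledgement Vonage sends back.
function isClearedEvent(jsonText) {
  return JSON.parse(jsonText).event === 'websocket:cleared';
}

// Stub socket capturing the outbound frame.
const outbound = [];
clearAudioBuffer({ send: (msg) => outbound.push(msg) });
console.log(outbound[0]); // {"action":"clear"}
console.log(isClearedEvent('{"event":"websocket:cleared"}')); // true
```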

Notifying When Audio Ends (notify Command)

To get notified when the current audio buffer has finished playing, use the notify command.

Outbound Command (send it immediately after the audio payload whose playback you want to track):

{
  "action": "notify",
  "payload": {
    "customKey": "customValue"
  }
}

Behavior:

  • If audio is playing, Vonage sends an inbound notification to your application after playback ends.
  • If no audio is playing, the inbound notification is sent back to your application immediately.

Inbound Notification:

{
  "event": "websocket:notify",
  "payload": {
    "customKey": "customValue"
  }
}

Usage Scenario:
Synchronize your application logic (e.g., start recording or play a new prompt after the previous finishes).
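A sketch of the notify round trip: send the command after your last audio frame, then match the echoed payload when the notification arrives. The socket is stubbed and the payload key is a placeholder:

```javascript
// Send a notify command; Vonage echoes the payload back after playback ends.
function requestPlaybackNotification(socket, payload) {
  socket.send(JSON.stringify({ action: 'notify', payload }));
}

// Extract the echoed payload from an inbound notification, or null otherwise.
function notifyPayload(jsonText) {
  const msg = JSON.parse(jsonText);
  return msg.event === 'websocket:notify' ? msg.payload : null;
}

// Stub socket capturing outbound frames.
const frames = [];
requestPlaybackNotification({ send: (m) => frames.push(m) }, { promptId: 'greeting' });

const echoed = notifyPayload('{"event":"websocket:notify","payload":{"promptId":"greeting"}}');
console.log(echoed); // { promptId: 'greeting' }
```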

Listening to a Specific Participant in a Multiparty Conversation

When your application participates in a conversation with multiple call legs—such as a customer and an agent—you may want your WebSocket connection to only receive audio from a specific participant rather than the entire mixed audio.

This is called selective audio control, and it is achieved using the canHear and canSpeak properties of the NCCO conversation action.


Why Would You Use This?

  • Speech analytics: Capture only what the customer says, ignoring the agent.
  • Real-time transcription: Record the customer side for compliance.
  • Whisper prompts: Send audio only to the agent without the customer hearing.

How to Configure Selective Listening

To set this up:

  1. Create a named conversation (e.g., "customer_support").
  2. Connect the customer and agent call legs to the conversation.
  3. Add your WebSocket connection to the same conversation, specifying canHear and canSpeak as needed.

Example: WebSocket Listening to Customer Only

Below is an NCCO example where:

  • The customer is joined to the conversation.
  • The agent is joined to the conversation.
  • The WebSocket connects but only receives the customer’s audio.

Customer Leg NCCO:

[
  {
    "action": "conversation",
    "name": "support_room"
  }
]

Agent Leg NCCO:

[
  {
    "action": "conversation",
    "name": "support_room"
  }
]

WebSocket Leg NCCO:

[
  {
    "action": "conversation",
    "name": "support_room",
    "canHear": ["6a4d6af0-55a6-4667-be90-8614e4c8e83c"], // Customer leg ID
    "canSpeak": []
  }
]

How this works:

  • The WebSocket only hears the customer participant.
  • It does not send any audio back into the conversation (canSpeak is empty).
  • To inject audio (e.g., AI prompts) that only a designated participant hears, include that participant's call (leg) ID in canSpeak.
  • To inject audio that all participants hear, omit the canSpeak parameter entirely.
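Since the leg ID is only known at call time, you would typically build this NCCO programmatically from your call records. A sketch, where the conversation name and UUID are the placeholder values from the example above:

```javascript
// Build a conversation NCCO that hears only the given legs and speaks to none.
function listenOnlyNcco(conversationName, hearLegIds) {
  return [
    {
      action: 'conversation',
      name: conversationName,
      canHear: hearLegIds, // leg IDs this connection can hear
      canSpeak: []         // empty: inject no audio into the conversation
    }
  ];
}

const ncco = listenOnlyNcco('support_room', ['6a4d6af0-55a6-4667-be90-8614e4c8e83c']);
console.log(JSON.stringify(ncco, null, 2));
```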

Handling WebSocket Disconnections and Fallback Options

When you use WebSockets with the Vonage Voice API, you don’t just rely on the WebSocket itself to know what’s happening.
Vonage also sends event callbacks to your eventUrl webhook. These HTTP POST requests provide authoritative information about call status and enable fallback behavior if the WebSocket connection fails.

This is important because simply observing the WebSocket closing doesn’t tell you why it closed. You need the event webhook to determine whether the disconnection was intentional or caused by an error.

Why Does This Matter?

When building production voice experiences—especially AI-powered or real-time ones—connections may drop unpredictably (e.g., server crashes, network timeouts).
To provide a graceful experience to the caller, you can implement fallback strategies such as playing a prompt, transferring to a human agent, or ending the call politely.

Webhook events give you a reliable mechanism to detect these situations and act accordingly.

How Vonage Notifies You of WebSocket Events

Whenever there is a significant change in the WebSocket connection status, Vonage sends an event webhook to your eventUrl.

Examples of relevant statuses:

  • unanswered: Vonage could not establish the WebSocket connection.
  • failed: The connection attempt failed.
  • disconnected: The WebSocket connection was dropped after being established.

Each event includes:

  • The uuid, identifying the call.
  • Timestamps.
  • Any custom headers you specified in the NCCO connect action.
  • The status field describing what happened.

Example Disconnected Event Payload

This event is sent when the WebSocket is disconnected after being established, whether due to an error or because your application closed it:

{
  "from": "442079460000",
  "to": "wss://example.com/socket",
  "uuid": "aaaaaaaa-bbbb-cccc-dddd-0123456789ab",
  "conversation_uuid": "CON-aaaaaaaa-bbbb-cccc-dddd-0123456789ab",
  "status": "disconnected",
  "timestamp": "2020-03-31T12:00:00.000Z",
  "headers": {
    "caller-id": "447700900123"
  }
}

How to Distinguish Intentional vs. Unintentional Disconnects

Important to understand:

  • Any disconnection, whether your application intentionally closed the WebSocket or it was dropped due to an error, raises a disconnected event.
  • If you want to intentionally terminate the WebSocket without triggering a fallback, Vonage recommends terminating the call leg via the Vonage Voice API rather than just closing the WebSocket connection.
    • This way, no disconnected webhook is sent, and you can rely on receiving disconnected only for unintentional failures.

Handling Failed Connections During Setup

Sometimes the WebSocket connection cannot be established in the first place (for example, if your server is offline).
You can configure your connect action to use synchronous event handling:

Example NCCO with synchronous eventType:

[
  {
    "action": "connect",
    "eventType": "synchronous",
    "eventUrl": [
      "https://example.com/events"
    ],
    "from": "447700900000",
    "endpoint": [
      {
        "type": "websocket",
        "uri": "wss://example.com/socket",
        "content-type": "audio/l16;rate=16000",
        "headers": {
          "caller-id": "447700900123"
        }
      }
    ]
  }
]

How it works:

  • If the connection attempt fails, Vonage immediately POSTs an event to your eventUrl.
  • The event status will be unanswered or failed.
  • You can respond with a new NCCO describing fallback behavior, such as playing a message or redirecting the call.

Example Failed Connection Event Payload

{
  "from": "442079460000",
  "to": "wss://example.com/socket",
  "uuid": "aaaaaaaa-bbbb-cccc-dddd-0123456789ab",
  "conversation_uuid": "CON-aaaaaaaa-bbbb-cccc-dddd-0123456789ab",
  "status": "unanswered",
  "timestamp": "2020-03-31T12:00:00.000Z",
  "headers": {
    "caller-id": "447700900123"
  }
}

Implementing Fallback Strategies

When you receive a webhook with status: disconnected, failed, or unanswered, you can:

  • Return a new NCCO in your webhook response to handle the fallback (e.g., play a prompt).
  • Allow the original NCCO to continue, if there are additional actions.
  • End the call, if no further action is specified.

Example fallback NCCO:

[
  {
    "action": "talk",
    "text": "We are unable to connect you at the moment. Please try again later."
  }
]
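The branching above can be sketched as an event-webhook handler that returns a fallback NCCO only for failure statuses; the statuses and prompt text follow the examples in this section, and everything else is an assumption:

```javascript
// Statuses that indicate the WebSocket leg failed or dropped.
const FAILURE_STATUSES = new Set(['disconnected', 'failed', 'unanswered']);

// Return a fallback NCCO for failure statuses, or null to let the call proceed.
function handleEvent(event) {
  if (!FAILURE_STATUSES.has(event.status)) {
    return null; // e.g. 'answered' or 'completed': nothing to do
  }
  return [
    {
      action: 'talk',
      text: 'We are unable to connect you at the moment. Please try again later.'
    }
  ];
}

console.log(handleEvent({ status: 'unanswered' })[0].action); // talk
console.log(handleEvent({ status: 'answered' }));             // null
```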

Connecting to AI Engines

Find Vonage sample applications to connect to popular AI engines below: