Audio Connector

Audio Connector lets you send raw audio streams (PCM, 16 kHz, 16-bit) from a live Vonage Video session to external services such as AWS, GCP, and Azure, through your own servers, for further processing and analysis.

Using Audio Connector, you can send audio streams individually or mixed. To identify the speaker, send the audio streams individually by opening multiple WebSocket connections.

Processing the audio streams in real time or offline lets you build capabilities such as captions, transcriptions, translations, search and indexing, content moderation, media intelligence, Electronic Health Records, and sentiment analysis.

You can also use an Audio Connector WebSocket connection to publish audio to a session.

Audio Connector is enabled by default for all projects, and it is a usage-based product. Audio Connector usage is charged based on the number of audio streams of participants (or stream IDs) that are sent to the WebSocket server. The Audio Connector feature is only supported in routed sessions (sessions that use the Media Router). You can send up to 50 audio streams from a single session at a time.

Important: If a connection to your WebSocket server is not established within 6 seconds, the Connect API call will fail.

Starting a WebSocket connection

To start an Audio Connector WebSocket connection, use the REST API.

You can also start an Audio Connector WebSocket connection using the Server SDKs:

Java — See the vonage.video.connectToWebsocket() method.
Node — See the vonage.video.connectToWebsocket() method.
PHP — See the vonage->video->connectAudio() method.
Python — See the vonage.video.start_audio_connector() method.
Ruby — See the vonage.video.web_socket.connect() method
.NET — See the vonage.video.Broadcast.StartBroadcast() method

Make an HTTPS POST request to the following URL:

https://video.api.vonage.com/v2/project/:application_id/connect

Replace application_id with your Application ID.

Authenticate the request with a Bearer token. Create the token as a JWT signed with your private key (the non-interactive OAuth 2.0 generation method). For more information, see the REST API call authentication documentation.

Set the body of the request to JSON data of the following format:

{
  "sessionId": "session ID",
  "token": "A valid token",
  "websocket": {
    "uri": "wss://service.com/ws-endpoint",
    "streams": [
      "streamId-1",
      "streamId-2"
    ],
    "headers": {
      "headerKey": "headerValue"
    },
    "audioRate": 8000,
    "bidirectional": false
  }
}

The JSON object includes the following properties:

sessionId (required): The ID of the session that contains the streams you want to include in the WebSocket stream.
token (required): The Vonage video token to be used for the Audio Connector connection to the session. You can add token data to identify the connection as the Audio Connector endpoint or to carry other identifying data. (The client libraries include properties for inspecting the connection data for a client connected to a session.) See the Token Creation developer guide.
websocket (required): Includes details for the WebSocket:
- uri (required): A publicly reachable WebSocket URI to be used as the destination of the audio stream (such as "wss://service.com/ws-endpoint").
- streams (optional): An array of stream IDs for the streams you want to include in the WebSocket stream. If you omit this property, all streams in the session are included.
- headers (optional): An object of key-value pairs of headers to be sent to your WebSocket server with each message, with a maximum length of 512 bytes.
- audioRate (optional): A number representing the audio sampling rate in Hz. Accepted values are 8000 and 16000 (the default).
- bidirectional (optional): A Boolean. Set it to true to also publish audio sent over the WebSocket to the session (see Publishing audio to a session via the WebSocket).
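
As a rough illustration, the request described above could be assembled in Python using only the standard library. This is a sketch, not official client code: the helper name build_connect_request and its parameters are hypothetical, the JWT is assumed to have been generated separately, and the request is built but not sent.

```python
import json
import urllib.request


def build_connect_request(application_id, jwt, session_id, token, uri,
                          streams=None, headers=None, audio_rate=16000,
                          bidirectional=False):
    """Build (but do not send) the POST request that starts Audio Connector.

    Hypothetical helper: `jwt` is assumed to be a Bearer token created
    separately with your private key.
    """
    websocket = {"uri": uri, "audioRate": audio_rate,
                 "bidirectional": bidirectional}
    if streams:  # omit the property to include every stream in the session
        websocket["streams"] = list(streams)
    if headers:
        websocket["headers"] = dict(headers)
    body = {"sessionId": session_id, "token": token, "websocket": websocket}
    return urllib.request.Request(
        url=f"https://video.api.vonage.com/v2/project/{application_id}/connect",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {jwt}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() would send the request.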

A successful call results in an HTTP 200 response, with details included in the JSON response data:

{
  "id": "b0a5a8c7-dc38-459f-a48d-a7f2008da853",
  "connectionId": "e9f8c166-6c67-440d-994a-04fb6dfed007"
}

The JSON response data includes the following properties:

id — A unique ID identifying the Audio Connector WebSocket connection.
connectionId — The connection ID for the Audio Connector WebSocket connection in the session.

For more details, see the Audio Connector REST API documentation.

Custom headers (HTTP and WebSocket)

When you start the WebSocket connection via REST or the server SDKs, an HTTP connection request, to be upgraded to WebSocket, is sent to your WebSocket server.

During the initial HTTP Upgrade, the HTTP request headers to your WebSocket server will include:

x-opentok-ws-conferenceid: Set to the conference ID;
x-opentok-ws-connectionid: Set to the connection ID;
x-opentok-ws-sessionid: Set to the session ID.

Additionally, you may include a headers JSON object within the websocket section of the REST API body or SDK request body. These headers behave differently depending on the phase of the connection:

During the initial HTTP Upgrade: All provided headers are sent as HTTP request headers to your WebSocket server.
After the WebSocket is established: Your headers are included in every text-based WebSocket control message as a text block. This includes:
- The initial websocket:connected message
- websocket:media:update messages (when audio becomes active or inactive)
- The final websocket:disconnected message
Exception: The buffer-clear confirmation ({"event":"websocket:cleared"}) does not include custom headers. This will be fixed in an upcoming release of Audio Connector.
Binary audio frames: These contain only audio data and do not include headers.
Special handling for x-opentok-ws* headers: Any header keys prefixed with x-opentok-ws are only used in the HTTP handshake and are removed from the JSON payload echoed in text messages.

The CUSTOM-HEADER-* properties shown in the message examples below originate from the headers property provided when starting the connection.

Custom headers are limited to 512 bytes, calculated based on the serialized JSON length of the headers websocket object. If the serialized size exceeds 512 bytes, the connection request will fail.
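
The header rules above can be checked before you make the request. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the size is measured on compact UTF-8 JSON (the exact serialization that the service measures is not specified here), and the x-opentok-ws prefix is matched case-insensitively.

```python
import json

MAX_HEADERS_BYTES = 512          # limit stated in this guide
RESERVED_PREFIX = "x-opentok-ws"


def headers_size(headers):
    """Approximate serialized size of the headers object, in bytes.

    Assumption: compact JSON encoded as UTF-8.
    """
    return len(json.dumps(headers, separators=(",", ":")).encode("utf-8"))


def check_headers(headers):
    """Raise ValueError if the headers object looks too large to be accepted."""
    size = headers_size(headers)
    if size > MAX_HEADERS_BYTES:
        raise ValueError(f"headers are {size} bytes; limit is {MAX_HEADERS_BYTES}")
    return size


def echoed_headers(headers):
    """Headers expected in text messages: x-opentok-ws* keys are dropped.

    Assumption: the prefix match is case-insensitive.
    """
    return {key: value for key, value in headers.items()
            if not key.lower().startswith(RESERVED_PREFIX)}
```

Because the real limit may be computed on a slightly different serialization, it is safer to stay comfortably below 512 bytes than to rely on this estimate at the boundary.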

WebSocket messages

First message

The initial message sent on an established WebSocket connection is text-based and contains a JSON payload. Its event field is set to websocket:connected, and its content-type field describes the audio format. The payload also contains any metadata that you put in the headers property of the body of the POST request. The headers property itself is not present in the JSON payload; its properties appear at the top level of the JSON. For example:

{
    "content-type":"audio/l16;rate=16000",
    "event": "websocket:connected",
    "CUSTOM-HEADER-1": "value-1",
    "CUSTOM-HEADER-2": "value-2"
}
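
A WebSocket server might read the sample rate and the custom headers from this first message. The sketch below shows one way to do it; the function name parse_connected is hypothetical, and it assumes that every top-level key other than event and content-type is a custom header.

```python
import json


def parse_connected(text):
    """Split a websocket:connected payload into media type, rate, and custom headers."""
    payload = json.loads(text)
    if payload.get("event") != "websocket:connected":
        raise ValueError("not a websocket:connected message")
    # content-type looks like "audio/l16;rate=16000"
    media_type, _, params = payload["content-type"].partition(";")
    rate = None
    for param in params.split(";"):
        name, _, value = param.strip().partition("=")
        if name == "rate":
            rate = int(value)
    # Assumption: all remaining top-level keys are custom headers.
    custom = {key: value for key, value in payload.items()
              if key not in ("event", "content-type")}
    return media_type.strip(), rate, custom
```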

Binary audio messages

Binary messages carry the audio of the call. The audio format on the WebSocket interface is 16-bit linear PCM, with a 16 kHz sample rate by default (see the audioRate property). At 16 kHz, each message contains one 640-byte frame of data (20 ms of audio), and messages are sent at 50 frames per second.
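
The 640-byte figure follows from the format: 16,000 samples per second × 0.020 seconds = 320 samples, and 320 samples × 2 bytes = 640 bytes. The same arithmetic as a small sketch (the function names are illustrative; a single audio channel is assumed because that is what the 640-byte figure implies):

```python
BYTES_PER_SAMPLE = 2   # 16-bit linear PCM
FRAME_MS = 20          # each binary message carries 20 ms of audio


def frame_bytes(sample_rate_hz, channels=1):
    """Size in bytes of one 20 ms frame (one channel assumed by default)."""
    samples = sample_rate_hz * FRAME_MS // 1000
    return samples * BYTES_PER_SAMPLE * channels


def frames_per_second():
    """Number of 20 ms frames in one second."""
    return 1000 // FRAME_MS
```

If the 20 ms framing also applies when audioRate is 8000, which is an assumption rather than something stated here, the same calculation gives 320 bytes per frame.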

Audio active/inactive messages

When audio in the streams included in the WebSocket is muted, a text message is sent with the following JSON payload (with active set to false):

{
    "content-type":"audio/l16;rate=16000",
    "method": "update",
    "event": "websocket:media:update",
    "active": false,
    "CUSTOM-HEADER-1": "value-1",
    "CUSTOM-HEADER-2": "value-2"
}

(The CUSTOM-HEADER properties in this example represent metadata that you include in the headers property of the body in the POST request to start the WebSocket connection.)

Audio may be muted because all clients stop publishing audio or as a result of a force mute moderation event.

When audio of one of the streams resumes, a text message is sent with the following JSON payload (with active set to true):

{
    "content-type":"audio/l16;rate=16000",
    "method": "update",
    "event": "websocket:media:update",
    "active": true,
    "CUSTOM-HEADER-1": "value-1",
    "CUSTOM-HEADER-2": "value-2"
}
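
One way for a server to keep track of these messages is to treat binary messages as audio and text messages as JSON events. The class below is a minimal sketch; its name and attributes are hypothetical, and it assumes that audio counts as active until an update message says otherwise.

```python
import json


class AudioConnectorState:
    """Tracks audio activity from messages received on the WebSocket."""

    def __init__(self):
        self.active = True    # assumption: active until told otherwise
        self.audio_bytes = 0  # total audio payload received so far

    def handle(self, message):
        """Process one message; return the event name, or "audio" for binary."""
        if isinstance(message, (bytes, bytearray)):
            self.audio_bytes += len(message)
            return "audio"
        payload = json.loads(message)
        event = payload.get("event")
        if event == "websocket:media:update":
            self.active = bool(payload["active"])
        return event
```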

Flush buffer (CLEAR) message

Your WebSocket server can optionally send a text-based control message to instruct the Audio Connector to immediately discard any audio frames that are currently buffered but not yet delivered. This is useful for real-time use cases such as barge-in, interrupting TTS playback or resetting a conversation turn.

To flush buffered audio, send the following JSON message over the WebSocket:

{
    "action": "CLEAR"
}

When the Audio Connector receives this message, it discards all pending buffered audio frames, continues streaming new incoming audio without interruption, and returns a confirmation message:

{
    "event": "websocket:cleared"
}

This control message is optional. If you do not send "action": "CLEAR", audio streaming proceeds normally.
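
On the server side, the CLEAR exchange comes down to sending one text message and recognizing one reply. A small sketch, with hypothetical helper names:

```python
import json


def clear_message():
    """Text payload that asks Audio Connector to drop buffered audio."""
    return json.dumps({"action": "CLEAR"})


def is_clear_confirmation(text):
    """Return True if a received text message is the websocket:cleared confirmation."""
    try:
        payload = json.loads(text)
    except ValueError:  # not valid JSON
        return False
    return isinstance(payload, dict) and payload.get("event") == "websocket:cleared"
```

Send the string returned by clear_message() as a text message, not a binary one, using whichever WebSocket library your server is built on.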

Disconnected message

When the Audio Connector WebSocket stops because of a call to the force disconnect REST method or because the 6-hour time limit is reached (see Stopping a WebSocket connection), a text message is sent with the following JSON payload:

{
    "content-type":"audio/l16;rate=16000",
    "method": "delete",
    "event": "websocket:disconnected",
    "CUSTOM-HEADER-1": "value-1",
    "CUSTOM-HEADER-2": "value-2"
}

This message marks the termination of the WebSocket connection.

(The CUSTOM-HEADER properties in this example represent metadata that you include in the headers property of the body in the POST request to start the WebSocket connection.)

Stopping a WebSocket connection

When your WebSocket server closes the connection, the Vonage video connection for the call also ends. In each client connected to the session, the client-side SDK dispatches events indicating the connection ended (as it would when other clients disconnect from the session).

You can disconnect the Audio Connector WebSocket connection using the force disconnect REST method. Use the connection ID of the Audio Connector WebSocket connection with this method.

As a security measure, the WebSocket will be closed automatically after 6 hours.

Automatic reconnections

Audio Connector will make a few attempts to re-establish a WebSocket connection that closes unexpectedly (that is, when the close does not result from a call to the force disconnect REST method).

Publishing audio to a session via the WebSocket

You can use the Audio Connector WebSocket connection to send audio data from the WebSocket connection to a stream published in a Vonage session (in addition to having the WebSocket connection receive audio from the session). To do so, set the bidirectional property to true in the data you send with the REST API method to start the Audio Connector.

See Binary audio messages for details on the format of the audio data to send via the WebSocket connection.
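
Audio to be published has to be cut into frames before it is sent. The generator below is a minimal sketch of that step. It assumes 640-byte frames (20 ms at 16 kHz), and padding the last partial frame with silence is a choice made for this sketch, not documented behavior. The name iter_frames is hypothetical.

```python
def iter_frames(pcm, frame_size=640):
    """Yield fixed-size frames from raw 16-bit PCM bytes.

    The final partial frame is padded with zero bytes (silence); that
    padding is a choice made for this sketch.
    """
    for start in range(0, len(pcm), frame_size):
        frame = pcm[start:start + frame_size]
        if len(frame) < frame_size:
            frame = frame + bytes(frame_size - len(frame))
        yield frame
```

Each yielded frame would then be sent as one binary WebSocket message. Sending one frame every 20 ms to match real-time playback is likewise an assumption.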

When creating the token used by the Audio Connector, you can add token data to identify the Audio Connector stream. (The Vonage client libraries include methods for inspecting the connection data for the connection of a stream in a session.)

Sample application

See the Bidirectional-Audio-Connector project for a sample Node application that uses Audio Connector bidirectionally.