Audio Connector Server SDK
Overview
The Vonage Audio Connector Server SDK is a Python library for building server-side WebSocket endpoints that send and receive real-time PCM audio from Vonage Video API sessions. It is built on top of the Audio Connector, which lets you extract raw audio streams from a live session and route them to an external WebSocket server.
The SDK abstracts the low-level WebSocket protocol, connection lifecycle, audio frame buffering, and timing management so that you can focus on processing audio and integrating AI services.
How It Works
When a Vonage Video session uses the Audio Connector, the session's media router opens a WebSocket connection to your server and begins streaming PCM audio. The Audio Connector Server SDK handles that connection through an event-driven model:
- The SDK starts a WebSocket server listening on a configurable host and port.
- When Audio Connector opens a connection, the SDK fires an `on_connect` event and passes a client handle to your application code.
- Your application registers handlers on the client handle to receive audio frames (`on_message`), detect disconnection (`on_disconnect`), and handle errors (`on_error`).
- Your application processes the audio (for example, by forwarding it to a speech-to-text service) and sends processed audio or control messages back to the session via the same client handle.
The SDK manages audio buffering and frame timing internally, ensuring smooth playback synchronization when you send audio back into the session.
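The event flow described above can be sketched with a minimal stand-in. The class name `AudioConnectorClient`, the `on(...)` registration style, and the event payloads below are illustrative assumptions, not the SDK's actual API; the sketch only shows the shape of the handler-registration pattern.

```python
import asyncio

# Stand-in for the client handle the SDK passes to on_connect.
# All names here are illustrative assumptions, not the real SDK API.
class AudioConnectorClient:
    """Dispatches connection events to registered async handlers."""

    def __init__(self):
        self._handlers = {}

    def on(self, event, handler):
        # Register an async handler for "message", "disconnect", or "error".
        self._handlers[event] = handler

    async def dispatch(self, event, payload=None):
        handler = self._handlers.get(event)
        if handler:
            await handler(payload)

received = []

async def handle_message(frame):
    # A real server would forward the PCM frame to a speech-to-text
    # service here, and could send processed audio back via the client.
    received.append(frame)

async def handle_disconnect(_):
    received.append("disconnected")

async def main():
    client = AudioConnectorClient()
    # Handlers are registered when the connect event fires.
    client.on("message", handle_message)
    client.on("disconnect", handle_disconnect)

    # Simulate two incoming PCM frames followed by a disconnect.
    await client.dispatch("message", b"\x00\x01")
    await client.dispatch("message", b"\x02\x03")
    await client.dispatch("disconnect")

asyncio.run(main())
print(received)
```

The async-callback shape is the point here: your handlers stay focused on audio processing while connection management lives elsewhere.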
Key Capabilities
- Event-driven architecture: Server lifecycle (start, stop) and connection events (connect, disconnect, message, error) are handled through async callbacks, keeping your application logic decoupled from connection management.
- Bi-directional real-time audio: Receive raw PCM audio from the session and send processed PCM audio back, with configurable sample rates (8 kHz, 16 kHz, 24 kHz).
- Multiple concurrent connections: Handle multiple Audio Connector sessions simultaneously, making it suitable for multi-tenant or scaled AI workflows.
- SSL/TLS support: Secure WebSocket connections with a provided SSL context for production deployments.
- Audio frame management: Built-in buffering and timing control synchronize outgoing audio frames, so you don't have to implement pacing logic yourself.
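The pacing the SDK handles internally comes down to simple PCM arithmetic. The sketch below assumes 16-bit (2-byte) mono samples and a 20 ms frame duration; the SDK's actual internal frame size is not specified in this overview.

```python
# Back-of-the-envelope frame sizing for 16-bit (2-byte) mono linear PCM.
# The 20 ms frame duration is an illustrative assumption.

BYTES_PER_SAMPLE = 2  # 16-bit linear PCM
FRAME_MS = 20

def frame_size_bytes(sample_rate_hz: int, frame_ms: int = FRAME_MS) -> int:
    """Bytes in one mono PCM frame of the given duration."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * BYTES_PER_SAMPLE

# The three sample rates listed above.
for rate in (8000, 16000, 24000):
    print(rate, frame_size_bytes(rate))
```

Sending frames faster or slower than one per frame duration causes audible glitches, which is why the SDK's built-in pacing matters.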
When to Use This SDK
Use the Audio Connector Server SDK when you need to connect live Vonage Video session audio to a server-side processing pipeline. Common scenarios include:
- Conversational AI assistants: Build voice bots using a speech-to-text → LLM → text-to-speech pipeline directly within a Video session.
- Live transcription and translation: Stream audio to a transcription service and return captions or translated speech in real time.
- Sentiment and tone analysis: Detect emotion or compliance signals during live calls.
- Voice biometrics: Identify or authenticate speakers from their audio stream.
- Real-time coaching: Provide AI-generated feedback to agents during customer calls.
- Automated note-taking: Generate summaries, transcripts, and action items from session audio.
- Content moderation: Flag inappropriate or non-compliant speech as it happens.
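The conversational AI scenario above chains three stages. This skeleton shows that shape only; every function here is a placeholder (real deployments would call actual STT, LLM, and TTS services), and none of these names come from the SDK.

```python
import asyncio

# Skeleton of a speech-to-text -> LLM -> text-to-speech pipeline.
# Every stage is a placeholder standing in for a real service call.

async def speech_to_text(pcm_frame: bytes) -> str:
    return "hello"          # placeholder transcription

async def run_llm(text: str) -> str:
    return f"echo: {text}"  # placeholder model response

async def text_to_speech(text: str) -> bytes:
    return text.encode()    # placeholder synthesized PCM

async def handle_audio(pcm_frame: bytes) -> bytes:
    # Incoming session audio flows through the three stages in order;
    # the returned bytes would be sent back into the session.
    transcript = await speech_to_text(pcm_frame)
    reply = await run_llm(transcript)
    return await text_to_speech(reply)

result = asyncio.run(handle_audio(b"\x00\x00"))
print(result)
```

In practice each stage would stream rather than await a full result, but the handler-per-frame structure is the same.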
If your use case involves video processing or video avatars in addition to audio, consider the Video Connector or the Video Connector Pipecat Integration instead.
If you want to connect to a pre-built Pipecat AI framework pipeline rather than implement your own audio processing, see the Vonage Audio Serializer for Pipecat.
See Also
- Set up the Audio Connector Server SDK — Step-by-step installation and configuration
- Audio Connector — How the underlying Audio Connector feature streams audio from a session to a WebSocket
- Vonage Audio Serializer for Pipecat — Connecting Pipecat pipelines to Vonage sessions