Vonage AI Tools to Build Real-Time AI Agents on Vonage Video and Voice APIs

Many AI workflows, such as speech-to-text, LLM-driven analysis, voice synthesis, and multimodal perception, depend on real-time audio and video. Developers working with the Vonage Voice and Video APIs have long asked for a simple, reliable way to receive media from an active session, process it with AI, and send responses back.

Building this infrastructure from scratch (managing WebSocket servers, binary audio frames, sample rates, WebRTC connections, and stateful sessions) is complex and error-prone. It slows down experimentation, proof-of-concept development, and production deployments.

Vonage solves this with two complementary toolsets that remove this friction and let developers focus on what they want to build. These toolsets support a wide range of real-time AI experiences, including:

  • Speech-to-text transcription
  • LLM-based meeting assistants
  • Sentiment or intent analysis in live calls
  • Interactive voice bots
  • Real-time language translation
  • Automated note-taking or summarization
  • Audio moderation and compliance detection

Vonage provides connectors and tools for integrating AI into Voice and Video API sessions through the Vonage AI Connectors. Whether they power transcription services, conversational agents, real-time translation, or sentiment analysis, modern applications increasingly need access to raw audio and video in motion, not just at the end of a recording or after a file upload. To support these applications, Vonage offers two approaches for integrating AI into Voice and Video API sessions: the Vonage AI Connector SDKs, for developers building their own AI middleware, and the Vonage Pipecat Integrations, for developers who want a flexible, open-source agent framework with mix-and-match AI vendor support.

Vonage AI Connector SDKs

The Vonage AI Connector SDKs are Python libraries that simplify how developers connect Vonage Voice and Video API sessions to their own AI endpoints. These SDKs handle the media conditioning and API interfacing, so developers can focus entirely on their AI logic rather than infrastructure.

There are two Vonage AI Connector SDKs, each designed for a different transport and use case:

                            | Audio Connector Server SDK              | Video Connector Server SDK
API Compatibility           | Vonage Video API & Vonage Voice API     | Vonage Video API only
Transport                   | WebSocket                               | WebRTC
Media                       | Audio only                              | Audio + Video
Use Case                    | Bridge audio to AI via WebSocket server | Connect AI as a Video session participant
Availability (PyPI package) | vonage-audio-connector-server           | vonage-video-connector

Audio Connector Server SDK

The Audio Connector Server SDK is a Python WebSocket server library that bridges audio between Vonage sessions and AI endpoints. It works with both the Video API (via Audio Connector) and the Voice API (via Voice WebSockets), making it the right choice for any audio-first AI use case across either API.

Key capabilities include:

  • Event-driven WebSocket server for receiving and sending PCM audio
  • Support for 8 kHz, 16 kHz, and 24 kHz sample rates with automatic frame handling
  • Clean async callbacks for connect, disconnect, message, and error events
  • Built-in buffering and timing control for smooth playback
  • Multiple concurrent connections for multi-agent or multi-participant workflows
  • TLS support for secure production deployments
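To make the frame handling above concrete, here is a minimal sketch of the arithmetic involved. It does not use the SDK's API. It assumes 16-bit linear PCM, mono audio, and 20 ms frames, none of which are stated above, and the function names are hypothetical.

```python
from typing import List, Tuple

# Assumptions for illustration only (not taken from the SDK):
BYTES_PER_SAMPLE = 2   # 16-bit linear PCM
FRAME_MS = 20          # duration of one audio frame


def frame_size_bytes(sample_rate_hz: int, channels: int = 1) -> int:
    """Return the number of bytes in one FRAME_MS frame of 16-bit PCM."""
    samples_per_frame = sample_rate_hz * FRAME_MS // 1000
    return samples_per_frame * channels * BYTES_PER_SAMPLE


def split_into_frames(pcm: bytes, sample_rate_hz: int) -> Tuple[List[bytes], bytes]:
    """Split a mono PCM buffer into whole frames.

    Returns (frames, leftover), where leftover holds the trailing bytes
    that do not yet fill a complete frame.
    """
    size = frame_size_bytes(sample_rate_hz)
    whole = len(pcm) - len(pcm) % size
    frames = [pcm[i:i + size] for i in range(0, whole, size)]
    return frames, pcm[whole:]


# Under these assumptions: 8 kHz -> 320 bytes, 16 kHz -> 640, 24 kHz -> 960.
for rate in (8000, 16000, 24000):
    print(rate, frame_size_bytes(rate))
```

Keeping the leftover bytes and prepending them to the next chunk is one simple way to keep playback timing steady when an AI endpoint returns audio in arbitrary chunk sizes.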

Video Connector Server SDK

The Video Connector Server SDK is a Python WebRTC client library for Linux that connects directly to Vonage Video Sessions. Unlike the Audio Connector Server SDK, it supports both audio and video streams. That makes it the right choice for Vonage Video sessions, especially when your AI workflow needs to process or generate video, needs higher audio fidelity (up to 48 kHz, in mono or stereo), or benefits from the lower latency of a WebRTC connection.

Key capabilities include:

  • WebRTC-based connection to Vonage Video Sessions
  • Audio and video stream access to and from the Video Session
  • Support for audio sample rates up to 48 kHz and 1 (mono) or 2 (stereo) audio channels, with automatic frame handling
  • Support for video up to Full HD (1080p), with controls over resolution and frame rate
  • Support for receiving Live Captions data from the session
  • Python-friendly interface for AI endpoint integration
  • Designed for Linux-based AI server deployments
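Stereo audio is useful for fidelity, but some AI endpoints accept only mono input. As an illustration, the sketch below downmixes interleaved stereo PCM to mono using only the Python standard library. It is not part of the SDK: the 16-bit little-endian interleaved layout is an assumption, and stereo_to_mono is a hypothetical helper name.

```python
import struct


def stereo_to_mono(pcm_stereo: bytes) -> bytes:
    """Downmix interleaved 16-bit little-endian stereo PCM to mono.

    Each left/right pair is averaged into one sample. Trailing bytes
    that do not form a complete stereo pair are dropped.
    """
    n = len(pcm_stereo) // 2   # number of 16-bit samples in the buffer
    n -= n % 2                 # keep only complete left/right pairs
    samples = struct.unpack(f"<{n}h", pcm_stereo[: n * 2])
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, n, 2)]
    return struct.pack(f"<{len(mono)}h", *mono)


# Two stereo sample pairs become two mono samples.
stereo = struct.pack("<4h", 100, 200, -100, -300)
print(struct.unpack("<2h", stereo_to_mono(stereo)))
```

Averaging the two channels keeps every result inside the 16-bit range, so no clipping step is needed.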

Vonage Pipecat Integrations

Pipecat is an open-source Python framework for orchestrating complex AI agent workflows across audio, video, images, and text. It provides a modular, vendor-neutral pipeline where developers can mix and match STT, LLM, and TTS providers — such as OpenAI, Deepgram, ElevenLabs, or AWS Nova Sonic — without writing media translation code.

Vonage offers two integrations with Pipecat, each using a different transport:

                  | Vonage Audio Serializer for Pipecat | Vonage Video Transport for Pipecat
API Compatibility | Vonage Video API & Vonage Voice API | Vonage Video API only
Transport         | WebSocket | WebRTC
Media             | Audio only | Audio + Video
Availability      | Included in the Pipecat distribution, with samples for both Voice and Video APIs | Available now on the Vonage developer portal, with broader distribution coming soon
Best For          | Most audio AI use cases (audio-first, broad API compatibility) | Video API AI use cases, lower latency requirements

Vonage Audio Serializer for Pipecat

The Vonage Audio Serializer for Pipecat bridges audio between Vonage Voice and Video sessions and a Pipecat processing pipeline via WebSocket. It handles audio frame conversion, sample rate alignment, and DTMF metadata, so developers can connect directly to Pipecat's growing library of AI nodes without writing any media translation code. The Vonage Audio Serializer for Pipecat is already built into the Pipecat distribution and includes samples for both the Vonage Voice and Vonage Video APIs.
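To illustrate the kind of work a serializer takes off your hands, the sketch below separates binary audio frames from JSON text events such as DTMF on a WebSocket connection. It is not the Pipecat serializer's code. The event name "websocket:dtmf" and the "digit" field are assumptions modeled on Vonage WebSocket events, and classify_message is a hypothetical helper.

```python
import json
from typing import Tuple, Union


def classify_message(message: Union[bytes, bytearray, str]) -> Tuple[str, object]:
    """Classify one incoming WebSocket message.

    Returns ("audio", bytes) for binary frames, ("dtmf", digit) for DTMF
    events, and ("event", dict) for any other JSON text message.
    """
    if isinstance(message, (bytes, bytearray)):
        return "audio", bytes(message)

    payload = json.loads(message)
    # Assumed event shape: {"event": "websocket:dtmf", "digit": "5", ...}
    if payload.get("event") == "websocket:dtmf":
        return "dtmf", payload.get("digit")
    return "event", payload


print(classify_message(b"\x00\x01")[0])
print(classify_message('{"event": "websocket:dtmf", "digit": "5"}'))
```

In a pipeline, the audio branch would feed speech-to-text while the DTMF branch could drive menu logic, which is why keeping the two apart early is convenient.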

Vonage Video Transport for Pipecat

The Vonage Video Transport for Pipecat connects AI agents to Vonage Video Sessions via WebRTC, offering improved latency over WebSocket-based implementations and full support for both audio and video streams. It is the right choice for video AI use cases or any scenario where latency is a priority. The Vonage Video Transport for Pipecat is available now on the Vonage developer portal and works with the Video API.

Which Path is Right for You?

Choose the AI Connector SDKs if you want full control over your AI middleware and prefer to build and own your own interface to AI endpoints using Python.

Choose the Pipecat Integrations if you want a flexible, open-source agent framework with mix-and-match STT, LLM, and TTS vendors — and want to benefit from community optimizations and a growing AI ecosystem.

Both paths are fully supported by Vonage, so you can choose whichever fits your architecture.
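The comparison tables and guidance above can be condensed into a small decision helper. This is a simplified sketch: it considers only media type, API, and framework preference, and it ignores latency and audio fidelity requirements. The function name and parameters are hypothetical.

```python
def suggest_tool(needs_video: bool, uses_voice_api: bool, wants_pipecat: bool) -> str:
    """Suggest a Vonage AI tool from three yes/no requirements.

    Video media is available only over WebRTC with the Video API, so a
    Voice API call that needs video has no matching option.
    """
    if needs_video and uses_voice_api:
        raise ValueError("Video media requires the Vonage Video API")

    if wants_pipecat:
        if needs_video:
            return "Vonage Video Transport for Pipecat"
        return "Vonage Audio Serializer for Pipecat"

    if needs_video:
        return "Video Connector Server SDK"
    return "Audio Connector Server SDK"


# An audio-only voice bot on the Voice API, built on your own middleware:
print(suggest_tool(needs_video=False, uses_voice_api=True, wants_pipecat=False))
```

An audio-only workload on the Video API could also use the WebRTC-based options when lower latency matters, which this helper does not model.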

Pricing

The Vonage AI Connectors are libraries that enable connectivity to AI. The libraries themselves carry no charge; usage is billed at the rate of the underlying connection to the Video session or Voice call.

                     | Video API            | Voice API
AI Connectors        | No charge            | No charge
WebRTC connection    | Per participant      | N/A
WebSocket connection | Audio Connector rate | Per WebSocket duration

Conclusion: Deploy Your First AI Agent

Using the Vonage AI Connectors, developers have a clean, modern, Python-friendly path to building real-time AI agents, without needing to develop media infrastructure from scratch. Whether you want to build a voice bot, integrate speech-to-text with an LLM, generate real-time synthesized responses, or build a fully multimodal video AI experience, Vonage provides the foundations you need.

The following resources can help you get started: