Michael Vernick, Customer Solutions Architect

Introducing Video Connector SDK & Pipecat Transport for AI Video Apps

Published on May 28, 2026

#video-api

#pipecat

#ai

Time to read: 7 minutes

Introduction

Real-time AI applications are transforming how developers build video experiences. Whether powering AI voice bots, video avatars, real-time transcription, emotion detection, or live language translation, modern applications increasingly require real-time access to raw audio and video streams, not just recordings after the fact.

Until now, integrating AI workflows into a live Vonage Video session required deep expertise in C++ and low-level media handling. That barrier is gone. Vonage has introduced two complementary Python-based tools designed specifically for developers building intelligent, media-aware applications: the Video Connector Server SDK and the Vonage Video Transport for Pipecat.

Together, they make it dramatically easier to stream audio and video between Vonage Video sessions and AI frameworks such as OpenAI, Deepgram, AWS Nova Sonic, HeyGen, and more. This blog post gives an overview of these tools, explains how they fit together, and provides references for deploying your first AI-powered video agent.

Why Developers Need Direct Access to Real-Time Video and Audio

Many AI workflows, including speech-to-text, LLM-driven analysis, voice synthesis, facial expression tracking, and multimodal perception, depend on real-time media. Developers working with the Vonage Video API have long asked for a simple, reliable way to receive audio and video from an active session, process it with AI, and send responses back.

Previously, the only server-side option for accessing raw media from a Vonage Video session was the Linux C++ SDK. While powerful, its low-level nature created a steep learning curve that slowed innovation and limited adoption, particularly among Python developers, who make up the majority of the AI/ML community.

The Vonage Video Connector SDK removes this friction.

Example Use Cases

The toolchain supports a wide range of real-time AI experiences, including:

Voice and video AI agents (bots): interactive assistants that see and hear participants
Real-time transcription and captions: live speech-to-text for accessibility and comprehension
Meeting summaries and notes: automated note-taking with speaker identification
Language translation: real-time audio translation across participants
Emotion and facial expression detection: sentiment analysis from video frames
Patient and exam monitoring: remote monitoring using live video feeds
Video avatars: AI-generated video responses synchronized with voice
Content moderation: detecting inappropriate audio or visual content in real time

What the Video Connector SDK Provides

The Video Connector Server SDK is a Python package (available on PyPI) that acts as a server-side WebRTC client for Vonage Video sessions. It is a Python wrapper around the Vonage Linux C++ SDK, enabling headless, cloud-deployable applications without requiring any C++ expertise.

Figure: Topology of the Video Connector Server SDK, showing how it is integrated into an end-to-end video session.

Key Capabilities

Server-side WebRTC participation: join a Vonage Video session as a server-side client
Bidirectional audio and video: publish and subscribe to real-time audio and video streams
High-quality media formats: audio delivered as PCM 16-bit (up to 48 kHz); video as 8-bit raw frames up to FHD 1080p30
Automated media continuity: intelligently handles gaps in media delivery
Captions subscription (beta): receive auto-generated captions from the session
Individual audio stream identification (beta): differentiate and process audio packets per participant
Event-driven architecture: rich async callbacks for session and media events
Cloud and headless deployment: designed for server-side use in containerized environments

This lets developers focus entirely on what they want to build—AI pipelines, analysis tools, video bots—without needing to write any WebRTC or media infrastructure code.
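To get a feel for what these media formats mean for a server application, the sketch below estimates raw, uncompressed data rates. The capabilities list gives only the sample format (PCM 16-bit, up to 48 kHz) and the frame limit (8-bit frames, up to 1080p at 30 fps). The channel count and the bytes-per-pixel values used here are assumptions for illustration, since the exact pixel layout is not stated above.

```python
def pcm_bytes_per_second(sample_rate_hz: int, channels: int, bits_per_sample: int = 16) -> int:
    """Raw PCM data rate: samples per second x channels x bytes per sample."""
    return sample_rate_hz * channels * (bits_per_sample // 8)


def video_bytes_per_second(width: int, height: int, fps: int, bytes_per_pixel: float) -> int:
    """Raw video data rate for uncompressed frames."""
    return int(width * height * bytes_per_pixel * fps)


if __name__ == "__main__":
    # Assumption: mono audio at the maximum stated sample rate.
    audio = pcm_bytes_per_second(48_000, channels=1)
    # Assumption: 1.5 bytes per pixel (a planar 4:2:0 layout) vs 3 bytes per pixel (packed RGB).
    video_420 = video_bytes_per_second(1920, 1080, 30, bytes_per_pixel=1.5)
    video_rgb = video_bytes_per_second(1920, 1080, 30, bytes_per_pixel=3)
    print(f"audio: {audio / 1e3:.0f} kB/s")
    print(f"video (1.5 B/px): {video_420 / 1e6:.1f} MB/s")
    print(f"video (3 B/px): {video_rgb / 1e6:.1f} MB/s")
```

Under these assumptions, raw video is roughly three orders of magnitude heavier than raw audio, which is why many vision workloads sample or downscale frames before passing them to a model.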

Getting Started with the Video Connector SDK

The SDK can be installed from the Python Package Index, and is designed around an event-driven workflow: connect to a session, subscribe to participant streams, receive audio and video frames via callbacks, process them with your AI pipeline, and publish responses back into the session.
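The event-driven workflow described above can be sketched in plain Python. This is not the Video Connector SDK API: `FakeSession`, `AudioFrame`, `on_audio`, `publish_audio`, and `deliver` are hypothetical names invented for this illustration, and the real class and callback names should be taken from the developer documentation. The sketch only shows the shape of the loop: register a callback, receive frames, process them, and publish a response.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Optional


@dataclass
class AudioFrame:
    stream_id: str  # which participant stream the audio came from
    pcm: bytes      # raw 16-bit PCM samples


@dataclass
class FakeSession:
    """Hypothetical stand-in for a session client; not the real SDK."""
    on_audio: Optional[Callable[[AudioFrame], Awaitable[None]]] = None
    published: List[AudioFrame] = field(default_factory=list)

    async def publish_audio(self, frame: AudioFrame) -> None:
        # A real client would send this into the session; here it is only recorded.
        self.published.append(frame)

    async def deliver(self, frames: List[AudioFrame]) -> None:
        # Simulates the media layer invoking the registered callback once per frame.
        for frame in frames:
            if self.on_audio is not None:
                await self.on_audio(frame)


def process(pcm: bytes) -> bytes:
    # Placeholder for an AI pipeline step; an echo returns the audio unchanged.
    return pcm


async def run_echo(incoming: List[AudioFrame]) -> List[AudioFrame]:
    session = FakeSession()

    async def handle_audio(frame: AudioFrame) -> None:
        # Process the received frame and publish the result back as the bot's audio.
        await session.publish_audio(AudioFrame("bot", process(frame.pcm)))

    session.on_audio = handle_audio
    await session.deliver(incoming)
    return session.published


if __name__ == "__main__":
    frames = [AudioFrame("alice", b"\x01\x00"), AudioFrame("bob", b"\x02\x00")]
    echoed = asyncio.run(run_echo(frames))
    print(len(echoed), "frames published back")
```

Swapping `process` for a speech-to-text or synthesis step changes what the agent does without changing the callback structure.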

Reference

The Video Connector developer documentation provides a full reference for configuring sessions, setting up media handlers, and publishing audio and video back into a session.

Vonage Video Transport for Pipecat

Pipecat is an open-source Python framework for orchestrating complex AI workflows across audio, video, images, and text. It provides a modular, vendor-neutral platform for building real-time AI pipelines that connect speech-to-text, LLMs, text-to-speech, video avatars, and more with minimal code.

For video-focused applications, the new Vonage Video Transport for Pipecat acts as a bridge between a live Vonage Video session and a Pipecat processing pipeline. Unlike a serializer (which handles audio-only format conversion), a transport enables full bidirectional audio and video to flow between a Vonage WebRTC session and the Pipecat pipeline.

Figure: A Vonage Video session connects via WebRTC to a customer application server, which processes raw audio and video and links to AI engines.

The Vonage Transport for Pipecat

Connects a Vonage Video session to a Pipecat pipeline via the Video Connector SDK
Supports bidirectional audio and video streams
Inherits from Pipecat's BaseTransport, BaseInputTransport, and BaseOutputTransport abstract classes
Initializes using a Vonage session ID, token, and an optional list of stream IDs to subscribe
Enables access to the full Pipecat ecosystem of AI services

This means developers can use Pipecat's growing list of AI integrations—OpenAI Realtime, AWS Nova Sonic, Deepgram, ElevenLabs, HeyGen, Tavus, Simli, and more—without writing any media translation or WebRTC code.
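The role a transport plays can be illustrated with a toy pipeline: frames enter from the session, pass through a chain of processing stages, and the result is sent back out. The code below is a conceptual stand-in written in plain Python, not Pipecat's API and not the Vonage transport. The stage functions `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical placeholders for real AI services, and the dictionary frame format is an assumption made for this sketch.

```python
from typing import Callable, Dict, Iterable, List, Optional

Frame = Dict[str, object]
Stage = Callable[[Frame], Optional[Frame]]


def run_pipeline(frames: Iterable[Frame], stages: List[Stage]) -> List[Frame]:
    """Push each input frame through every stage in order; a stage may drop a frame."""
    output: List[Frame] = []
    for frame in frames:
        current: Optional[Frame] = frame
        for stage in stages:
            current = stage(current)
            if current is None:
                break  # the stage consumed the frame, nothing to send back
        if current is not None:
            output.append(current)
    return output


def fake_stt(frame: Frame) -> Optional[Frame]:
    # Stand-in for speech-to-text: treats the audio payload as already-decoded text.
    if frame["type"] != "audio":
        return None
    return {"type": "text", "text": frame["payload"].decode("utf-8")}


def fake_llm(frame: Frame) -> Frame:
    # Stand-in for an LLM: produces a canned reply.
    return {"type": "text", "text": f"You said: {frame['text']}"}


def fake_tts(frame: Frame) -> Frame:
    # Stand-in for text-to-speech: turns the reply back into an audio payload.
    return {"type": "audio", "payload": frame["text"].encode("utf-8")}


if __name__ == "__main__":
    incoming = [{"type": "audio", "payload": b"hello"}, {"type": "video", "payload": b""}]
    print(run_pipeline(incoming, [fake_stt, fake_llm, fake_tts]))
```

In a real deployment the transport supplies the input and output ends of this chain, so the stages only ever deal with frames and never with WebRTC.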

Pipecat AI Service Integrations

Pipecat supports a rich and growing set of AI services out of the box:

Category	Supported Services
Speech-to-Text	Deepgram, OpenAI Whisper, AssemblyAI, Azure, Google, AWS Transcribe, and more
LLM	OpenAI, Anthropic, Gemini, Grok, Bedrock, Ollama, and more
Text-to-Speech	ElevenLabs, Cartesia, OpenAI, AWS Polly, Google, and more
Speech-to-Speech	AWS Nova Sonic, OpenAI Realtime, Gemini Live
Video Avatars	HeyGen, Simli, Tavus

For the latest list, see the Pipecat supported services page.

Reference

The Vonage Video Connector Pipecat Transport developer documentation provides a full reference for setting up the Vonage Transport and building your first Pipecat-powered video agent.

Sample Applications

To help you get started quickly, Vonage provides sample applications demonstrating real-world use cases:

Echo Server: a simple app that echoes audio and video back to the session, useful for validating your setup
Video Avatar with Captions (Pipecat): a full pipeline using AWS Nova Sonic for speech-to-speech and HeyGen for AI-generated video avatar responses, with live captions
Audio Description of Video (Pipecat): uses Moondream AI for video recognition and text-to-speech to describe what is happening in the video stream in real time

Sample code is available in the releases repository. Getting started is straightforward:

Download and extract the SDK tarball from the releases repo
Install Docker
Open the README.md in the main directory and build the Docker image
Create a Vonage Video session and a session.json file with your session credentials
Run the echo server or Pipecat examples following the README.md in each example directory

How the Tools Fit Together

Vonage now offers a complete Python toolchain for integrating AI into both Voice and Video sessions:

Tool	Transport	Capabilities
Audio Connector SDK	WebSocket	Audio only (Voice API and Video API sessions)
Pipecat Serializer	WebSocket	Audio only (Voice API and Video API sessions)
Video Connector SDK	WebRTC	Audio + Video (Video API sessions)
Pipecat Transport	WebRTC	Audio + Video (Video API sessions)

If your use case is audio-only, the Audio Connector SDK and Pipecat Serializer are the right starting point. If you need full audio and video access—for avatars, emotion detection, visual AI, or multimodal agents—the Video Connector SDK and Pipecat Transport are the tools for you.
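The selection rule from the table can be written as a small helper. The function name `choose_tools` and its parameters are hypothetical and exist only to restate the table; the tool names and transports come from the table itself.

```python
from typing import Dict


def choose_tools(needs_video: bool, session_api: str = "video") -> Dict[str, str]:
    """Return the SDK, Pipecat component, and transport suggested by the table."""
    if session_api not in ("voice", "video"):
        raise ValueError("session_api must be 'voice' or 'video'")
    if needs_video:
        # Audio + video access is listed only for Video API sessions.
        if session_api != "video":
            raise ValueError("Audio + video access requires a Video API session")
        return {
            "sdk": "Video Connector SDK",
            "pipecat": "Pipecat Transport",
            "transport": "WebRTC",
        }
    # Audio-only tools are listed for both Voice API and Video API sessions.
    return {
        "sdk": "Audio Connector SDK",
        "pipecat": "Pipecat Serializer",
        "transport": "WebSocket",
    }


if __name__ == "__main__":
    print(choose_tools(needs_video=True))
    print(choose_tools(needs_video=False, session_api="voice"))
```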

Conclusion

The Vonage Video Connector SDK and Pipecat Transport offer a streamlined, modern, and Python-centric approach for creating real-time audio and video agents. These tools eliminate the need for developers to manage complex WebRTC internals, write C++ code, or manually construct media pipelines.

Whether you’re building a video avatar bot, integrating speech-to-text with an LLM, analyzing facial expressions in real time, or creating an AI-powered meeting assistant, these tools provide the foundations you need.

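If you are ready to begin, explore the Video Connector developer documentation, the Vonage Video Connector Pipecat Transport developer documentation, and the sample applications in the releases repository.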

You can now deploy your first AI video agent in minutes and build confidently toward fully intelligent, media-aware applications on the Vonage platform.
