Understanding Voice Automation
Introduction
Voice automation enables organizations to manage inbound phone calls without needing a human agent to answer each call. It covers a range of solutions, from simple menu systems that direct callers to the appropriate department to fully conversational AI agents that understand natural language and retain context. All of these solutions are built on programmable call flows that react to caller input, whether spoken or entered on the keypad.
This guide explains the key concepts of voice automation with the Vonage Voice API, outlines three implementation approaches, and helps you select the right one for your needs.
Simple IVR
An Interactive Voice Response (IVR) system automates phone interactions by providing callers with a menu of options. When a caller dials a number, they hear a prompt such as "Please enter a digit or say something." The system then responds based on their input.
Traditional IVRs rely on keypad input, known as DTMF (dual-tone multi-frequency): the tones generated when a caller presses keys on their phone. Modern IVR implementations can accept spoken responses in addition to keypad input.
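As a rough illustration, a Simple IVR menu can be expressed as an NCCO, which is a JSON array of actions. The sketch below builds one in Python: a `talk` action plays the prompt and an `input` action collects either a keypress or speech. The helper name `build_menu_ncco`, the example event URL, and the chosen option values are assumptions for illustration only; consult the NCCO reference for the authoritative list of fields.

```python
import json


def build_menu_ncco(event_url: str) -> list[dict]:
    """Return an NCCO that plays a prompt, then collects DTMF or speech."""
    return [
        {
            "action": "talk",
            "text": "Please enter a digit or say something.",
            # Let the caller respond before the prompt has finished playing.
            "bargeIn": True,
        },
        {
            "action": "input",
            "type": ["dtmf", "speech"],
            "dtmf": {"maxDigits": 1},
            "speech": {"language": "en-US"},
            # Hypothetical endpoint in your application that receives the result.
            "eventUrl": [event_url],
        },
    ]


if __name__ == "__main__":
    ncco = build_menu_ncco("https://example.com/webhooks/input")
    print(json.dumps(ncco, indent=2))
```

Your application would return this JSON from the webhook that handles the incoming call.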
Advanced IVR / Voice Bot
An Advanced IVR / Voice Bot can support natural language understanding when you integrate an NLU/LLM in your application. For example, a caller can say "Why is the sky blue?" and your application can interpret the intent, ask follow-up questions, and either resolve the request or direct the caller to the appropriate team, all while maintaining the conversation's context. This approach typically uses webhooks to control the call flow.
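Because conversation context lives in your application rather than in the Voice API, one way to picture it is a per-call history that is passed to the NLU/LLM on every turn. The following is a minimal in-memory sketch under that assumption. `ConversationStore`, `handle_turn`, and `echo_interpreter` are hypothetical names, and `echo_interpreter` merely stands in for a real NLU/LLM call.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical interpreter signature: receives the conversation so far and
# returns the text to speak next. A real one would call your NLU/LLM provider.
Interpreter = Callable[[list[dict]], str]


class ConversationStore:
    """Keeps per-call message history in memory, keyed by a call identifier."""

    def __init__(self) -> None:
        self._history: dict[str, list[dict]] = defaultdict(list)

    def handle_turn(self, call_id: str, caller_text: str, interpret: Interpreter) -> str:
        """Record the caller's utterance, get a reply, and record the reply."""
        history = self._history[call_id]
        history.append({"role": "caller", "text": caller_text})
        reply = interpret(history)
        history.append({"role": "bot", "text": reply})
        return reply

    def end_call(self, call_id: str) -> None:
        """Drop the history once the call has finished."""
        self._history.pop(call_id, None)


def echo_interpreter(history: list[dict]) -> str:
    """Stand-in for an NLU/LLM: reports how many caller turns it has seen."""
    caller_turns = [m for m in history if m["role"] == "caller"]
    return f"Turn {len(caller_turns)}: you said '{caller_turns[-1]['text']}'."


if __name__ == "__main__":
    store = ConversationStore()
    print(store.handle_turn("call-1", "Why is the sky blue?", echo_interpreter))
    print(store.handle_turn("call-1", "Can you explain more?", echo_interpreter))
```

A production deployment would likely replace the in-memory dictionary with shared storage so that any application instance can serve the next webhook for a call.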
AI Voice Agent
An AI Voice Agent is an assistant that handles phone calls in real time. When you integrate the relevant services in your application, it transcribes the caller's speech with Automatic Speech Recognition (ASR), processes the request with a Large Language Model (LLM), and replies with natural-sounding text-to-speech. With the Vonage Voice API, this is commonly implemented using WebSocket audio streaming for low latency, which can help you implement experiences such as barge-in, where the caller interrupts the agent while it is still speaking.
HTTP Webhooks vs. WebSocket Streaming
These approaches are commonly implemented using one of two patterns: HTTP webhooks or WebSocket streaming.
HTTP Webhooks: the Vonage Voice API sends HTTP requests to your application as the call progresses. Your application responds with a Call Control Object (NCCO) that tells Vonage what to do next. The Simple IVR and Advanced IVR / Voice Bot guides use this pattern.
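To make the request and response cycle concrete, the sketch below maps the parsed JSON body of an input event to the NCCO that the application would return. The payload shape used here (`dtmf.digits` and `speech.results[].text`) and the menu options are assumptions for illustration; verify field names against the Voice API webhook reference. `next_ncco` is a hypothetical helper, not part of any SDK.

```python
def next_ncco(event: dict) -> list[dict]:
    """Map a parsed input-event body to the NCCO returned in the HTTP response."""
    # Assumed payload shape: {"dtmf": {"digits": "1"}, "speech": {"results": [{"text": "..."}]}}
    digits = (event.get("dtmf") or {}).get("digits", "")
    results = (event.get("speech") or {}).get("results") or []
    spoken = results[0].get("text", "").lower() if results else ""

    if digits == "1" or "sales" in spoken:
        text = "You selected sales."
    elif digits == "2" or "support" in spoken:
        text = "You selected support."
    else:
        text = "Sorry, I did not understand that."

    # A real flow would usually follow this with a connect or another input action.
    return [{"action": "talk", "text": text}]


if __name__ == "__main__":
    print(next_ncco({"dtmf": {"digits": "1"}}))
    print(next_ncco({"speech": {"results": [{"text": "Support please"}]}}))
```

Keeping this mapping in a pure function makes the call flow easy to unit test independently of whichever web framework serves the webhook.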
WebSocket Streaming: your application and the Vonage Voice platform exchange call audio over a persistent, full-duplex connection. The AI Voice Agent guide uses this pattern for low-latency implementations, which can help you implement experiences such as barge-in. For details, see WebSockets in the Vonage Voice API.
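When streaming audio, your application typically sends and receives small fixed-size frames rather than whole recordings. The sketch below assumes 16-bit linear PCM delivered in 20 ms frames, which works out to 320 bytes at 8 kHz and 640 bytes at 16 kHz; confirm the exact format in the WebSockets guide before relying on it. `frame_size` and `split_frames` are hypothetical helper names.

```python
BYTES_PER_SAMPLE = 2  # 16-bit linear PCM (assumed format)
FRAME_MS = 20         # assumed frame duration in milliseconds


def frame_size(sample_rate_hz: int) -> int:
    """Number of bytes in one frame at the given sample rate."""
    return sample_rate_hz * FRAME_MS // 1000 * BYTES_PER_SAMPLE


def split_frames(pcm: bytes, sample_rate_hz: int) -> list[bytes]:
    """Split raw PCM into equal frames, zero-padding (silence) the last one."""
    size = frame_size(sample_rate_hz)
    frames = [pcm[i:i + size] for i in range(0, len(pcm), size)]
    if frames and len(frames[-1]) < size:
        frames[-1] = frames[-1].ljust(size, b"\x00")
    return frames


if __name__ == "__main__":
    audio = b"\x01" * 700  # placeholder bytes standing in for synthesized speech
    for frame in split_frames(audio, 16000):
        print(len(frame))
```

Sending audio in small frames is also what makes barge-in feasible: the application can stop sending the remaining frames as soon as it detects that the caller has started speaking.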
You can also combine both patterns in a single solution.
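One common way to combine the two patterns is for the HTTP webhook that answers the call to return an NCCO whose `connect` action hands the audio over to a WebSocket endpoint. The sketch below shows such an NCCO built in Python. The function name `build_stream_ncco`, the greeting text, and the example URI are assumptions for illustration; check the NCCO reference for the supported `connect` options.

```python
import json


def build_stream_ncco(ws_uri: str, sample_rate_hz: int = 16000) -> list[dict]:
    """Return an NCCO that greets the caller, then streams audio to a WebSocket."""
    return [
        {"action": "talk", "text": "Connecting you to our assistant."},
        {
            "action": "connect",
            "endpoint": [
                {
                    "type": "websocket",
                    # Hypothetical WebSocket server run by your application.
                    "uri": ws_uri,
                    "content-type": f"audio/l16;rate={sample_rate_hz}",
                }
            ],
        },
    ]


if __name__ == "__main__":
    print(json.dumps(build_stream_ncco("wss://example.com/socket"), indent=2))
```

In this arrangement the webhook handles call setup, while the WebSocket carries the latency-sensitive audio for the rest of the call.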
Choosing the Approach
The capabilities below describe what a typical implementation can provide when the Voice API is combined with your application and your chosen AI providers. They are not built-in Voice API features.
| | Simple IVR | Advanced IVR / Voice Bot | AI Voice Agent |
|---|---|---|---|
| Best for | High-volume, predictable interactions | Complex, multi-turn conversations | Real-time, latency-sensitive experiences |
| Input type | Keypad (DTMF) + speech input | Natural language speech | Natural language speech |
| Can support natural language (with NLU/LLM) | |||
| Can maintain conversation context (in your application) | |||
| Response latency | Standard (HTTP webhook) | Standard (HTTP webhook) | Low (WebSocket streaming) |
| Example implementation uses | | OpenAI | Deepgram's Voice Agent platform |
Further Reading
Explore the following how-to guides to learn how to implement the solutions discussed in this guide:
- Simple IVR: Create a programmable call flow that captures both keypad and speech input, forming the foundation for any voice automation solution.
- Advanced IVR / Voice Bot: Develop a conversational voice bot powered by OpenAI. It handles natural language, maintains conversation context, and transfers to a human agent when needed.
- AI Voice Agent: Build a real-time AI voice agent using WebSocket streaming and Deepgram's Voice Agent platform.