This article was written in collaboration with Mark Berkeland
Introduction
Automatic Speech Recognition (ASR) is a critical component of modern voice-driven applications, enabling machines to convert spoken language into text for processing and interaction. ASR plays a pivotal role in contact centers, virtual assistants, and other voice-enabled services by facilitating real-time communication between users and AI agents. While Vonage AI Studio offers a built-in ASR solution, there are scenarios where using a custom ASR engine might be preferable. Custom ASR can be tailored to specific industry jargon, languages, or regional accents that the built-in ASR may not handle as effectively.
Additionally, integrating an external ASR engine, like Deepgram, can offer more control over accuracy, transcription speed, and custom vocabulary, providing a higher degree of flexibility in specialized applications.
This article uses Deepgram, an ASR Engine service provider. We will use the provided ASR Connector to integrate the sample AI Agent (or your own) with the Deepgram ASR Engine. Since the ASR Connector manages the telephony part, we will use an HTTP-type AI Agent instead of a Telephony-type AI Agent; the AI Agent and the ASR Connector communicate via HTTP.
TL;DR: You can find the fully built-out integration on GitHub
Prerequisites
All the links and corresponding versions for the prerequisites are available in the code repository:
A Vonage API account
Vonage API Account
To complete this tutorial, you will need a Vonage API account. If you don’t have one already, you can sign up today and start building with free credit. Once you have an account, you can find your API Key and API Secret at the top of the Vonage API Dashboard.
Deepgram ASR and Vonage AI Studio Integration Overview
In this process, a customer initiates a phone call through the Public Switched Telephone Network (PSTN), which is handled by the ASR Connector. The ASR Connector bridges the customer’s voice call (PSTN 1) to a WebSocket connection, allowing the system to process the call in real-time. Audio from the customer is passed through this WebSocket to the ASR Engine, which transcribes the spoken words into text. This transcription is then forwarded by the ASR Connector to the AI Studio Agent.
Once the AI Studio Agent receives the transcribed text, it processes the customer's input and generates a suitable response. This response is sent back to the ASR Connector, where it is converted into Text-to-Speech (TTS) and played back to the customer. Throughout the interaction, the AI continuously listens and responds to the customer, guiding the conversation. At any point, if the AI determines that further assistance is needed, the system can transfer the call to a live human agent (PSTN 2), signaling that the AI session has ended.
This setup ensures seamless communication between the customer and the AI through automated voice recognition and response. The ASR Connector serves as the intermediary, managing both the transcription and the interaction with the AI, while the ASR Engine enables real-time processing of spoken words. The system is designed to handle calls efficiently but can escalate to a live agent if more complex input is required.
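To make the PSTN-to-WebSocket bridge concrete, here is a minimal sketch, assuming an Express server and the Vonage Voice API's NCCO "connect" action. This is illustrative rather than the actual connector code; the /socket path and the ngrok hostname are placeholders:

```javascript
// Minimal sketch of an answer webhook that bridges a PSTN call to a WebSocket.
const express = require('express');
const app = express();

app.get('/answer', (req, res) => {
  // An NCCO 'connect' action with a 'websocket' endpoint streams the caller's
  // audio (linear PCM) to our server, where it can be relayed to the ASR engine.
  res.json([
    {
      action: 'connect',
      endpoint: [
        {
          type: 'websocket',
          uri: 'wss://yyyyyyyy.ngrok.io/socket',
          'content-type': 'audio/l16;rate=16000', // 16 kHz linear PCM
          headers: { peer_uuid: req.query.uuid }  // correlate the socket with the call leg
        }
      ]
    }
  ]);
});

app.listen(8000);
```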
How Does the ASR Connector Work?
The ASR Connector is a single executable program that manages the flow of voice calls between a customer, the ASR engine, and the AI Agent. Although the program can function as a unified code file, it is conceptually divided into two parts to simplify understanding and deployment. Part 2 manages the phone calls (PSTN 1 and PSTN 2) and the interaction with the AI Agent, while Part 1 handles the audio processing by relaying the audio from the customer to the ASR engine and returning the transcriptions to Part 2. This split allows for easier understanding of the flow and can also be implemented on separate servers if needed.
When a call is initiated (either inbound or outbound), Part 2 sets up a session with the AI Agent and uses a WebSocket to transmit the customer’s audio to Part 1. This audio is then forwarded to the ASR engine for transcription. The ASR engine returns the transcribed text to Part 2, which sends it to the AI Agent as part of an AI "step." The AI Agent processes this transcription and returns a response, which is played back to the customer through Text-to-Speech (TTS) over the PSTN 1 leg.
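As a sketch of the Part 1 relay, here is one way the audio could be streamed to Deepgram's live-transcription WebSocket endpoint. The query parameters and the DEEPGRAM_API_KEY variable name are assumptions, so defer to the repository for the actual implementation:

```javascript
// Sketch: relay audio frames from the Vonage WebSocket leg to Deepgram's
// live-streaming endpoint and hand final transcripts back to Part 2.
const WebSocket = require('ws');

function startDeepgramRelay(onTranscript) {
  const dg = new WebSocket(
    'wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000',
    { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } }
  );

  dg.on('message', (data) => {
    const result = JSON.parse(data);
    const transcript = result.channel?.alternatives?.[0]?.transcript;
    if (result.is_final && transcript) onTranscript(transcript); // forward to Part 2
  });

  // The caller feeds raw 16-bit PCM frames from the Vonage WebSocket here.
  return (audioFrame) => {
    if (dg.readyState === WebSocket.OPEN) dg.send(audioFrame);
  };
}
```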
The interaction continues in this way, with each transcription being sent as an AI step and responses played to the customer. If the AI Agent determines that the customer needs to be transferred to a live agent, it sends a request to Part 2 to terminate the AI session and initiate the transfer. Part 2 then bridges the customer (PSTN 1) with the live agent (PSTN 2), terminating the WebSocket connection and the ASR session. From this point, the audio flows directly between the customer and the live agent, completing the transfer.
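The bridging step could be done by transferring the customer's leg with the Vonage Voice API's call-update endpoint. A hedged sketch follows, with JWT generation omitted; note that api.nexmo.com may need to be swapped for a regional host depending on your application's region (requires Node 18+ for the global fetch):

```javascript
// Sketch: bridge the customer (PSTN 1) to the live agent (PSTN 2) by
// transferring the customer's leg to a new NCCO.
async function transferToLiveAgent(callUuid, liveAgentNumber, jwt) {
  await fetch(`https://api.nexmo.com/v1/calls/${callUuid}`, {
    method: 'PUT',
    headers: {
      Authorization: `Bearer ${jwt}`, // application JWT (generation omitted)
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      action: 'transfer',
      destination: {
        type: 'ncco',
        ncco: [
          { action: 'connect', endpoint: [{ type: 'phone', number: liveAgentNumber }] }
        ]
      }
    })
  });
}
```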
How Does the AI Agent Work?
The sample AI Studio Agent operates in both inbound and outbound call modes, and the direction is determined by the ASR Connector, which initiates the AI Agent's flow. The main function of the agent is to prompt the user for their name and then engage them in a loop, where they can either hear a joke or be transferred to a live agent. When the agent session is initialized, it receives key parameters such as the call direction, a unique call ID (UUID), and a webhook address for handling call transfers.
The agent starts by checking the direction parameter and greets the user accordingly. It then prompts the user for their name using a "Collect Input" node, which stores the response. All communication between the AI Agent and the user flows through the ASR Connector, which acts as the intermediary for transmitting messages and user inputs between the two. The AI Agent doesn’t interact directly with the user but rather through API calls, where the ASR Connector informs the AI Agent of user inputs and waits for the agent's response.
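In code, that exchange might look like the following sketch. The /http/init and /http/{session_id}/step paths and the body fields are assumptions based on the AI Studio HTTP agent API; check the AI Studio documentation for the exact shapes:

```javascript
// Sketch of the connector driving an HTTP-type AI Studio agent (Node 18+).
const VG_AI_HOST = process.env.VG_AI_HOST; // e.g. studio-api-us.ai.vonage.com

async function initAgentSession(params) {
  // Start a session, passing call direction, UUID, and transfer webhook address.
  const res = await fetch(`https://${VG_AI_HOST}/http/init`, {
    method: 'POST',
    headers: { 'X-Vgai-Key': process.env.X_VGAI_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ agent_id: process.env.AGENT_ID, ...params })
  });
  return (await res.json()).session_id;
}

async function sendStep(sessionId, userInput) {
  // Each final transcription is sent to the agent as one "step"; the agent's
  // reply is then played back to the caller as TTS.
  const res = await fetch(`https://${VG_AI_HOST}/http/${sessionId}/step`, {
    method: 'POST',
    headers: { 'X-Vgai-Key': process.env.X_VGAI_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: userInput })
  });
  return res.json();
}
```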
The agent continues by asking if the user wants to hear a joke or transfer to a live agent. To deliver the jokes, the agent calls a Webhook that retrieves jokes from an external website (like a "Dad Jokes" API) and then reads the joke to the user via a "Send Message" node. The user is prompted again to decide if they want to hear another joke or be transferred. Based on the user’s response, a Classification node determines whether the user wants another joke or wishes to connect with a live agent.
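As an illustration, this is the kind of call such a Webhook node could make, using icanhazdadjoke.com as one possible public joke API; the sample agent's actual joke source may differ:

```javascript
// Sketch: fetch a joke from a public "Dad Jokes" API (Node 18+ global fetch).
async function fetchJoke() {
  const res = await fetch('https://icanhazdadjoke.com/', {
    headers: { Accept: 'application/json' } // ask for JSON instead of HTML
  });
  const { joke } = await res.json();
  return joke; // e.g. read to the user via a "Send Message" node
}
```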
When the user chooses to transfer to a live agent, the agent uses another Webhook to notify the application controlling the telephony side. The webhook passes essential information, such as an announcement for the user, the UUID to uniquely identify the call, and the phone number of the live agent. Once this webhook is triggered, the agent has completed its task, and the flow ends as the call is transferred to the live agent.
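On the connector side, the corresponding endpoint might look like this sketch, reusing `app` and `transferToLiveAgent` from the earlier sketches; the /transfer route and the announcement/uuid/callee field names are illustrative:

```javascript
// Sketch: the Part 2 endpoint invoked by the agent's transfer Webhook.
app.use(require('express').json());

app.post('/transfer', async (req, res) => {
  const { announcement, uuid, callee } = req.body;
  // Play `announcement` to the caller (e.g., via a TTS "talk" action),
  // end the AI/ASR session for this uuid, then bridge PSTN 1 to PSTN 2.
  const appJwt = '...'; // application JWT (generation omitted in this sketch)
  await transferToLiveAgent(uuid, callee, appJwt);
  res.sendStatus(200);
});
```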
How to Set Up Your Application
Create a Deepgram Project and retrieve your Deepgram API Key
Create an ngrok tunnel listening on port 8000
Take note of the ngrok endpoint URL, as it will be needed in the next sections. It looks like: https://yyyyyyyy.ngrok.io
For more help, see testing with ngrok
Log in to your Vonage Developer Dashboard, create a new application, and do the following:
Enable Voice capabilities
You will find this under the Capabilities section; click Edit if you don’t see it.
Under Answer URL, leave HTTP GET, and enter https://yyyyyyyy.ngrok.io/answer (replace yyyyyyyy.ngrok.io with the public hostname, and if necessary the public port, of the server where this sample application is running)
Under Event URL, select HTTP POST, and enter https://yyyyyyyy.ngrok.io/event (replace yyyyyyyy.ngrok.io with the public hostname, and if necessary the public port, of the server where this sample application is running)
Take note of your application’s region
Click on “Generate public and private key”
Save the private key file in this application folder as .private.key
IMPORTANT: Do not forget to click on [Save changes] at the bottom of the screen if you have created a new key set.
Link a phone number to this application if none is linked yet.
Take note of a few values that you will need later in the tutorial:
Your Vonage account API key, referred to hereafter as environment variable API_KEY
Your Vonage account API secret, referred to hereafter as environment variable API_SECRET
Your application ID, referred to hereafter as environment variable APP_ID
Your selected Region, referred to hereafter as environment variable API_REGION
The phone number linked to your application, referred to hereafter as environment variable SERVICE_PHONE_NUMBER
How to Set Up Your HTTP Agent
Go to the AI Studio Dashboard
Import the sample AI Agent BlogAgent.zip from the GitHub repo
You may change the Agent Name
Click on [Import Agent].
Open your new agent
Set the Live Agent phone number
Go to Properties (left column icons) > Parameters > Custom Parameters > callee
Change the value to your desired recipient number
It must be a phone number in E.164 format without the leading '+' sign (e.g., 12015550123)
Click on “Close”
Take note of a few more values that you will need later in the tutorial:
Your Agent ID, referred to hereafter as environment variable AGENT_ID
Your Vonage AI API key, referred to hereafter as environment variable X_VGAI_KEY
Do not confuse your Vonage AI API key (X-Vgai-Key) with your Vonage account API key
How to Run the ASR Application Locally on Your Computer
Copy or rename .env-example to .env
Update all the parameters in the .env file as described in the previous sections (a sample sketch follows this list). The value of the environment variable VG_AI_HOST should be consistent with the value of the environment variable API_REGION.
Install node modules with the command: npm install
Launch the application in a different terminal tab than your ngrok tunnel: node asr-connector
The application runs by default on port 8000. Remember to have your ngrok tunnel running on the same port as the app.
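For reference, here is what a filled-in .env might look like, assuming the variable names called out in the previous sections; the values are placeholders, and the Deepgram key variable name is a guess, so defer to the repository's .env-example:

```
API_KEY=your_vonage_api_key
API_SECRET=your_vonage_api_secret
APP_ID=your_application_id
API_REGION=your_application_region
SERVICE_PHONE_NUMBER=12015550123
AGENT_ID=your_agent_id
X_VGAI_KEY=your_vonage_ai_api_key
# Must be consistent with API_REGION, e.g. studio-api-us.ai.vonage.com
VG_AI_HOST=studio-api-us.ai.vonage.com
DEEPGRAM_API_KEY=your_deepgram_api_key
```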
How to Test the Sample AI Agent and ASR Connector
Whether the first call is outbound or inbound, the user (e.g., the customer) is asked for their name. Jokes are then played instead of music-on-hold until the user says "no" to more jokes, after which the call is transferred to the other user (e.g., the live agent), whose phone number is the one set when deploying the AI Agent.
How to Test Outbound PSTN Calls
You may trigger an outbound call by opening the following web address, substituting your own ngrok URL and callee number: https://yyyyyyyy.ngrok.io/startcall?callee=12995551515
How to Test Inbound PSTN Calls
Call the phone number linked to your Vonage API account.
Conclusion
In this tutorial, you’ve learned how to extend Vonage AI Studio to integrate with an alternate ASR Engine using an HTTP-type AI Agent and the ASR Connector server application, which handles the telephony part (via the Vonage Voice API) and manages the connection to the ASR Engine.
This integration efficiently handles PSTN-type calls, but it can also be extended to support additional communication channels. One potential enhancement is to expand this setup for SIP calls, which would allow you to leverage IP-based telephony, providing even more flexibility for businesses using private networks. Additionally, you could adapt the integration to work with WebRTC clients, enabling real-time communication directly through web browsers, making it a suitable solution for customer support on websites without requiring traditional telephony services.
If you have questions, join our Community Slack or message us on X.
Additional Resources
I owe a considerable debt of gratitude to Mark Berkeland, whose invaluable assistance and stunning brilliance turned this otherwise standard-issue blog into a world-class work of technical art.
Customer Solutions Engineer at Vonage. With a background in product management, network and systems operations, customer support, quality assurance, and software development team management, Tony has worked in the telecommunications industry, formerly in France and now in the US. He helps large and smaller global companies develop solutions using programmable voice, messaging, video, and multi-factor authentication services.