Dynamic Speech Generation with Third Party Providers in AI Studio
Published on September 3, 2024

AI Studio offers a wide variety of integrated Text-to-Speech (TTS) languages and voice styles, enhanced by Speech Synthesis Markup Language (SSML) for creating human-like utterances. There are many more TTS options available, though, and AI Studio gives you the flexibility to connect to any third-party provider with an accessible REST API endpoint. In this blog, we will demonstrate how to use dynamically generated Deepgram synthetic voices with AI Studio.

An engaging, humanlike TTS experience is critical for voice agents because it fosters a natural and relatable interaction, making users feel understood and valued. This level of engagement enhances overall customer satisfaction and loyalty, as it reduces frustration and improves communication efficiency. This blog post will cover the use of dynamically generated speech audio files for complete customization of the Agent <> Human interaction.

A previous blog post covered statically generated speech audio files.

Project Overview

In this blog post, we’ll discuss how to use Vonage AI Studio with a third-party speech synthesis provider to dynamically generate speech files based on customized parameters used throughout an AI Studio flow. We will use a toy Electronic Health Record (EHR) application to demonstrate an inbound call from a patient to a physician’s office. The Studio agent identifies the caller by the Calling Line ID (CLID), the phone number of the inbound user, and then uses a webhook to collect information about the patient, such as the patient’s first and last name, the patient identification number, and whether there are any scheduled appointments. The caller will be greeted using the Deepgram Aura TTS model.
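For orientation, the patient-lookup webhook in this toy EHR could return a payload along these lines. The field names are illustrative for this demo back-end only and are not part of any Vonage or Deepgram API; each field would then be mapped to a Studio parameter (for example, first_name to $F_NAME).

{
  "patient_id": "P-10342",
  "first_name": "Maria",
  "last_name": "Gonzalez",
  "has_upcoming_appointment": true,
  "appointment_date": "September 12",
  "appointment_time": "10:30 AM"
}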

Prerequisites

Vonage API Account

To complete this tutorial, you will need a Vonage API account. If you don’t have one already, you can sign up today and start building with free credit. Once you have an account, you can find your API Key and API Secret at the top of the Vonage API Dashboard.

How to Create a Voice Agent

Step 1: Create Your Agent

From the AI Studio Dashboard, select Create Agent. Then, since this is a voice use case, select “Telephony”.

Select Agent Type

Step 2: Configure Your Agent 

Agent configuration is a required first step and more information can be found here.  In our case, we are providing some basic agent constructs including localization, assignment of the agent to a specific API key or subaccount, and the language for the agent.

Agent Details

Please note that the Voice/Telephony agent does require that you choose a voice. However, this setting is not relevant here because we will be using third-party (Deepgram) voices. Your account will not be charged for any Vonage TTS usage in this case, as long as you use the approaches described below.

Step 3: Choose Your Template

Choose “Caller Identification”.  We will be using the inbound Calling Line ID (CLID) of the caller to provide a customized greeting. This template gives you a great starting point.  For more information on this template, you can view this article.

Choose a Template

Step 4: Choose Inbound Call

Choose the Inbound Call option. Our voice agent will respond to an incoming call, but this does not constrain the agent to inbound calling only. For example, if you want to send a follow-up email after the engagement with the agent is complete, you can accomplish this as part of the Inbound Call flow. Learn more about AI Studio Conversation Events.

Choose Inbound Call

Voilà, you are ready to start building your Voice agent with Deepgram’s customized voices.  Let’s go! 

How to Integrate Dynamically Generated Text-To-Speech Audio Files with AI Studio

It’s important to understand that in a telephony agent, two nodes can be used for customized TTS.  You can read more about the Speak node and the Collect Input node.  We will focus on the Collect Input node.

Using this approach, conversational designers can create speech audio files with any provider’s API endpoint and use the returned file in a Collect Input node.

In this use case, you might find it important to provide a fully customized UX.  This means that you would potentially want to use attributes about the user, collected throughout the call, as part of the generated TTS stream.  As an example, in my Existing Patient workflows, I want to personally greet and communicate with each user by first name, and I want the agent to read out some of those attributes as part of my flows.  Such examples include:

  • Hello,  F_NAME.  It’s nice to have you back with us today.  I have your appointment currently set for APPT_DATE, at APPT_TIME.  Do you need any help with that?  

  • Good news, F_NAME, I have your rescheduled appointment set to NEW_APPT_DATE at NEW_APPT_TIME.  Would you like me to send a text to you for confirmation? 

  • Great, I have your mobile number as PATIENT_MOBILE.  Is this still correct?  If so, I’ll send you a text reminder now and the day before the appointment.  If not, just let me know that we need to change your contact information and I will help you with that.  

Let’s dive into this approach. It’s a little more complex to create dynamic speech files from the collected user information stored in various parameters. Most endpoint-driven TTS providers do not offer a cloud storage facility; instead, they either return the audio binary directly in the API response or deliver it to a webhook event notifier. This is the case for Deepgram, where you can choose to receive your audio on a webhook or as part of a 200 response to your API call. So you will build a small back-end service with ephemeral storage that handles the intermediate speech file: it receives (or requests) the audio and hosts it at a temporary URL. Depending on your approach (a route handler versus a webhook server), you could use the Vonage Cloud Runtime Assets provider, or any other blob/unstructured storage bucket (AWS S3, etc.). The idea is to keep the file storage and usage temporary, so the files are cleaned up after they are used, enhancing the security and privacy controls in your application. Now, let’s explore the solution.
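To make this concrete, here is a minimal sketch of such an ephemeral storage service, written in Python with Flask and the requests library. It assumes Deepgram’s /v1/speak REST endpoint returns the MP3 bytes in the 200 response (check Deepgram’s current documentation for the exact parameters and models); the route names, environment variable, and local file cache are illustrative, and in production you would swap the local directory for Vonage Cloud Runtime Assets, S3, or similar, and schedule cleanup of the generated files.

import os
import uuid
import requests
from flask import Flask, request, jsonify, send_from_directory

app = Flask(__name__)

AUDIO_DIR = "/tmp/tts-cache"   # ephemeral storage; swap for VCR Assets, S3, etc.
os.makedirs(AUDIO_DIR, exist_ok=True)

DEEPGRAM_URL = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"
DEEPGRAM_KEY = os.environ["DEEPGRAM_API_KEY"]

@app.route("/tts", methods=["POST"])
def tts():
    # The Studio webhook node posts the text to speak, e.g. {"text": "Hello, Maria..."}
    text = request.json["text"]

    # Ask Deepgram for the synthesized audio; the MP3 bytes come back in the 200 response.
    dg = requests.post(
        DEEPGRAM_URL,
        headers={"Authorization": f"Token {DEEPGRAM_KEY}",
                 "Content-Type": "application/json"},
        json={"text": text},
        timeout=15,
    )
    dg.raise_for_status()

    # Persist the audio under a throwaway name ending in .mp3 (Studio looks for that extension).
    filename = f"{uuid.uuid4()}.mp3"
    with open(os.path.join(AUDIO_DIR, filename), "wb") as f:
        f.write(dg.content)

    # Return the hosting URL so the Studio Response Mapping can store it in $SYNTHETIC_VOICE.
    return jsonify({"url": f"{request.host_url}audio/{filename}"})

@app.route("/audio/<filename>")
def audio(filename):
    # Serve the stored file; in production, delete files after use (e.g. via a scheduled cleanup job).
    return send_from_directory(AUDIO_DIR, filename, mimetype="audio/mpeg")

In this arrangement, one way to wire things up is to point the Studio webhook node at the /tts route rather than at Deepgram directly; the url field it returns is what the Response Mapping step below stores in $SYNTHETIC_VOICE.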

In my case, because my application is an Electronic Health Record (EHR) system that handles multiple use cases such as appointment scheduling, rescheduling, payments, and prescription requests, I am creating a general-purpose webhook node (Deepgram_Node) that I can use across all of my flows. I have also added a webhook node that recognizes the patient by their inbound caller ID.

  1. Once my application’s preliminary webhook node has fetched the inbound caller’s first name, I map that JSON attribute into a Studio parameter called $F_NAME.  I will use the value stored in that parameter in my TTS request to Deepgram.  The Deepgram webhook node is configured like this: Deepgram Node Query Params

  2. Note that the JSON-formatted body configuration contains my desired generic text for the flow being serviced, as well as the customized $F_NAME stored parameter (a worked example of this body and the mapped response appears after this list): Deepgram Webhook Node JSON Body

  3. The Response Mapping in the webhook node is critical because I will take the 2xx response and map its attributes to other Studio parameters for later use: Deepgram Node Response Mapping

  4. Note that I am mapping the returned URL attribute to a new parameter called $SYNTHETIC_VOICE.  This url attribute is the hosting URL for the audio file.  Depending on the storage provider you are using (S3, etc.), it is important that this url parameter maps to a structure similar to this: https://storagefile.com/filename.mp3.  The Studio parser looks explicitly for the .mp3 file extension at the end of the URL. This parameter will be used in the next node to play out the full synthesized speech utterance.  See the screenshot below: Using the stored URL in the SYNTHETIC_VOICE parameter

  5. This Collect Input node is now ready to be configured with other settings as you see fit, such as:

    1. Retry Prompt (which can be the same $SYNTHETIC_VOICE parameter or another parameter).

    2. Additionally, you may want to set up Caller Response Input (Speech, DTMF), and Silence Detection.

    3. In addition, based on your specific needs, you might set up Context Keywords and Entity Disambiguation to ensure your agent is interacting optimally with humans.

  6. It’s a good practice after using the Collect Input node to add a Classification node if your agent’s response is dependent upon the human user input.  You can learn more about this node here.  In this node, you create the knowledge and training for the agent to be able to handle the user’s intent.
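To tie the pieces together, here is roughly what the webhook node’s JSON body and the back-end’s 200 response could look like for the existing-patient greeting. The shapes are illustrative and depend on how you build your back-end (they match the hypothetical /tts route sketched earlier); the only detail Studio relies on is that the mapped url value ends in .mp3.

Request body configured in the Deepgram webhook node ($F_NAME is expanded by Studio before the request is sent):

{
  "text": "Hello, $F_NAME. It's nice to have you back with us today."
}

Example 200 response from the back-end service, with its url attribute mapped to $SYNTHETIC_VOICE in the Response Mapping:

{
  "url": "https://your-storage.example.com/audio/3f2c9b1e.mp3"
}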

Conclusion

Today, many TTS providers incorporate LLMs into their speech synthesis pipelines, allowing for more natural and expressive speech. Previously, TTS was primarily driven by methods like concatenative synthesis, which stitched together pre-recorded speech snippets, and parametric synthesis, which used statistical models such as Hidden Markov Models (HMMs). These earlier techniques often resulted in more robotic and less nuanced voices.

LLMs enhance speech synthesis by leveraging vast amounts of data and advanced neural network architectures to understand and generate human-like speech patterns. They analyze text for context, sentiment, and natural speech rhythms, allowing for more accurate intonation, stress, and emotion in the generated speech. This results in voices that are not only clearer and more pleasant to listen to but also capable of conveying subtle emotions and natural conversational dynamics, significantly improving the overall user experience.

In this blog post, we explored how to use third-party Text-to-Speech (TTS) providers in conjunction with AI Studio to fully customize the user experience.  We always welcome community involvement.  Please feel free to join us on GitHub and the Vonage Community Slack.

Tim Dentry, Healthcare Customer Solutions Architect

Tim is a Healthcare Customer Solutions Architect and a passionate AI/ML enthusiast, particularly in the realm of Natural Language Processing/Understanding and Knowledge Graphs.  Outside of work, Tim enjoys global travel and AI research, and is a competitive ballroom dancer.
