
Guillaume Faas, Senior .Net Developer Advocate

Guillaume is a Senior .Net Developer Advocate for Vonage. He has been working in .Net for almost 15 years while focusing on advocating Software Craftsmanship in the last few years. His favorite topics include code quality, test automation, mobbing, and code katas. Outside work, he enjoys spending time with his wife & daughter, working out, or gaming.

How I Became A Millionaire With Vonage APIs

Published on November 12, 2024

Time to read: 7 minutes

Spoiler: I'm not actually a millionaire.

We’ve all dreamt about it — sitting in the hot seat on 'Who Wants to Be a Millionaire', staring down that final question, the host waiting for your answer. But what if, in a world of multiverses and alternate realities, I walked away with a fat check in hand?

What if I told you that I figured out a way to win? Every. Single. Time.

In this post, we'll be stepping into this alternate universe for a moment — where Vonage’s APIs made me a millionaire, and no one suspected a thing.

The Perfect Heist

In this universe, I'm not just a contestant on a game show - I'm the mastermind behind the perfect heist. One final question separates me from the grand prize. I have carefully saved my last lifeline, "Call a friend", for this exact moment. The stakes couldn't be higher, but I'm cool. I'm in control.

Why? Because this isn't your ordinary phone call.

I've got an ace up my sleeve - a friend who knows everything. Instead of calling an actual person, I've set up something a little... unconventional. The number I gave the show is a virtual phone number I purchased from Vonage.

The second the host dials the number, my meticulously planned operation begins. No one in the audience, not even the host, has any idea what's happening behind the scenes. A chain of automated processes, all powered by Vonage APIs, springs to life to provide me with the correct answer.

The perfect heist isn't just about getting away with it - it's about pulling it off so smoothly that no one even realizes anything unusual happened.

Let me guide you through how I pulled it off.

Insert 'Evil Laugh' here

Step 1: We Need A Number

Every heist starts with the right tools; in this case, we need a number — a virtual one. Why a virtual number? Because it’s the key to controlling the flow of information, allowing us to intercept the call and manipulate it on our terms. It supports both inbound and outbound calls and messages. No device required, no manual intervention — just a virtual number and a webhook.

A virtual number lets you receive calls and route them anywhere — even to an automated system. In this case, the call will trigger a series of interactions without anyone realizing it’s anything other than a regular phone call. You can read more about virtual numbers here.

More specifically with Vonage, the process is relatively straightforward. You can purchase a number from the Vonage dashboard or through the Numbers API, then link it to a Vonage application with Voice capabilities whose answer URL points to a webhook of your choice. In this case, that's the webhook that will handle the call.

Once the host dials the number, the call will automatically be routed to our webhook. From there, the real fun begins.
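
For the curious, here's a minimal sketch of what that setup could look like over plain HTTP. It assumes the Numbers API's `number/buy` and `number/update` endpoints on `rest.nexmo.com` with Basic authentication. `NumberSetupSketch` and its method names are made up for illustration, and the requests are only built, never sent, so check the endpoint and parameter names against the Numbers API reference before relying on them.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper: builds (but does not send) the two setup requests.
public static class NumberSetupSketch
{
    // Assumption: base URL of the Numbers API.
    private const string BaseUrl = "https://rest.nexmo.com";

    // Request to rent a virtual number, e.g. country "GB".
    public static HttpRequestMessage BuyNumber(
        string apiKey, string apiSecret, string country, string msisdn) =>
        Build("/number/buy", apiKey, apiSecret, new Dictionary<string, string>
        {
            ["country"] = country,
            ["msisdn"] = msisdn,
        });

    // Request to link the number to the application that owns the answer webhook.
    public static HttpRequestMessage LinkToApplication(
        string apiKey, string apiSecret, string country, string msisdn, string applicationId) =>
        Build("/number/update", apiKey, apiSecret, new Dictionary<string, string>
        {
            ["country"] = country,
            ["msisdn"] = msisdn,
            ["app_id"] = applicationId,
        });

    private static HttpRequestMessage Build(
        string path, string apiKey, string apiSecret, Dictionary<string, string> form)
    {
        var request = new HttpRequestMessage(HttpMethod.Post, BaseUrl + path)
        {
            // Parameters travel as a form-encoded body.
            Content = new FormUrlEncodedContent(form),
        };
        var credentials = Convert.ToBase64String(Encoding.UTF8.GetBytes($"{apiKey}:{apiSecret}"));
        request.Headers.Authorization = new AuthenticationHeaderValue("Basic", credentials);
        return request;
    }
}
```

Passing either request to `HttpClient.SendAsync` would perform the call. The dashboard does the same job in a couple of clicks, which is probably the easier route for a one-off heist.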

Step 2: Our Inside Man

When the host dials the number, someone has to pick up the phone, right? But it’s not just anyone on the other end of the line. No, this is where we introduce our inside man: Adam. He’s a very knowledgeable 40-year-old friend living in London, ready to help with the most challenging questions.

But here’s the twist — Adam doesn’t actually exist. He’s completely synthetic, brought to life by the Text-to-Speech (TTS) feature. When the host calls, Adam’s voice is generated in real-time, and it’s so smooth and convincing that no one would ever suspect they're not talking to a real person.

He has the full package: a relaxed, natural-sounding voice and a proper London accent, making him the perfect accomplice.

// Our webhook
[HttpPost("answer")]
public IActionResult Answer()
{
    // Creating a Talk action with Adam's greeting message
    var talkAction = VoiceAdapter.MakeAdamTalk(GreetingsMessage);
    ...
    // Return a call control object to define the call flow 
    return this.Ok(new Ncco(talkAction, ...));
}

public static TalkAction MakeAdamTalk(string text) =>
    new TalkAction
    {
        Text = text,
        Language = "en-GB",
        Style = 6, 
        Premium = true, // Provides a more natural voice    
    };

private const string GreetingsMessage = "Adam here. What can I do for you?";

As shown in the code snippet, we return a Call Control Object (NCCO) with a Talk action to make Adam speak. When the call comes through, he will greet the host with a message, ensuring the conversation feels authentic from the very first word.
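
Under the hood, that Talk action ends up as JSON in the webhook response. The sketch below builds the JSON by hand to show its shape. `TalkNccoSketch` is a hypothetical helper, not part of the Vonage SDK, which takes care of the serialization for you.

```csharp
using System.Text.Json;

// Hypothetical helper: hand-built NCCO containing a single Talk action.
public static class TalkNccoSketch
{
    public static string Build(string text)
    {
        // An NCCO is a JSON array of actions executed in order.
        var ncco = new object[]
        {
            new
            {
                action = "talk",
                text,
                language = "en-GB",
                style = 6,
                premium = true,
            },
        };
        return JsonSerializer.Serialize(ncco);
    }
}
```

Calling `TalkNccoSketch.Build("Adam here. What can I do for you?")` returns a one-element JSON array whose object carries the same `language`, `style` and `premium` values as `MakeAdamTalk`.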

Step 3: Text Transcription

We've been welcomed by our inside man; it's time to share our final question with him. The next step in our heist is critical: converting my spoken question into text. For this, we'll use the Speech-to-Text functionality of the Vonage Voice API.

In our webhook, after Adam's greeting, we need to add a Speech input action to capture my question.

[HttpPost("answer")]
public IActionResult Answer()
{
    var talkAction = VoiceAdapter.MakeAdamTalk(GreetingsMessage);
    var inputAction = new MultiInputAction
    {
        Type = [NccoInputType.Speech],
        // Url where the transcription will be sent
        EventUrl = [apiUrl + "/Webhooks/asr"],
        Speech = new SpeechSettings
        {
            Language = "en-GB",
            MaxDuration = 20,
            EndOnSilence = 2,
        },
    };
    return this.Ok(new Ncco(talkAction, inputAction));
}

// Webhook that will receive the text transcription
[HttpPost("asr")]
public async Task<IActionResult> Speech(MultiInput speechResponse) 
{
    ...
}

Our Call Control Object now contains two actions that will execute in sequence: Adam will greet the host, and then the system will listen to my question. Once the system detects that my speech has ended, either by a period of silence or reaching the maximum duration, the transcription will be sent to the webhook we’ve set up.
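
To give an idea of what lands on that webhook, here's a small sketch that reads the raw JSON instead of the SDK's `MultiInput` type. It assumes the payload exposes a `speech.results` array whose entries carry `text` and `confidence` fields, with the confidence sent either as a number or as a string. `SpeechPayloadSketch` is a hypothetical name, and picking the most confident result rather than the first one is a choice of this sketch.

```csharp
using System.Globalization;
using System.Linq;
using System.Text.Json;

// Hypothetical helper: extracts the most confident transcription from the webhook body.
public static class SpeechPayloadSketch
{
    // Returns null when the payload contains no transcription at all.
    public static string? BestTranscript(string json)
    {
        using var document = JsonDocument.Parse(json);
        if (!document.RootElement.TryGetProperty("speech", out var speech) ||
            !speech.TryGetProperty("results", out var results) ||
            results.ValueKind != JsonValueKind.Array)
        {
            return null;
        }

        return results.EnumerateArray()
            .Select(result => new
            {
                Text = result.TryGetProperty("text", out var text) ? text.GetString() : null,
                Confidence = ReadConfidence(result),
            })
            .OrderByDescending(candidate => candidate.Confidence)
            .Select(candidate => candidate.Text)
            .FirstOrDefault();
    }

    // Accepts the confidence as a JSON number or a numeric string; anything else counts as 0.
    private static double ReadConfidence(JsonElement result)
    {
        if (!result.TryGetProperty("confidence", out var confidence))
        {
            return 0;
        }

        return confidence.ValueKind switch
        {
            JsonValueKind.Number => confidence.GetDouble(),
            JsonValueKind.String when double.TryParse(
                confidence.GetString(), NumberStyles.Float, CultureInfo.InvariantCulture, out var parsed) => parsed,
            _ => 0,
        };
    }
}
```

A `null` result is the cue to have Adam politely ask for the question again instead of guessing.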

Here's the transcription of my question:

Hey Adam! Look, I'm currently in the game "Who wants to be a millionaire" with Jeremy Clarkson, and I'm calling you as part of my last joker "Call a friend". I need your help with the final question.
Here it is: "During the Cold War, the US government built a bunker to house Congress under what golf resort?"
A: The Breakers B: The Greenbrier C: Pinehurst D: The Broadmoor

This step is absolutely essential. Text is data — and in this heist, data is everything. It can be manipulated, analyzed, and, most importantly, passed along to the next phase of our plan.

Step 4: The Oracle

Now that we have the text transcription of the question, it’s time to consult our all-knowing entity: GPT-4. In this heist, it is our oracle, the one who holds all the answers.

Once the transcription reaches our webhook, we send the text, along with a custom prompt, to GPT-4. This prompt makes the AI answer in character as Adam and deliver the response we need to win.

Context:
We're in the game "Who wants to be a millionaire", and I'm calling you as part of my last joker "Call a friend".
You are Adam, a 40 year old male using en-GB language.
Your answer must look like you genuinely picked up the phone to give me an answer, and pretend to think about the answer.
Your answer will be read aloud through text-to-speech so avoid anything a person wouldn't say.
No need for greetings, and wish me luck for the game at the end of your reply.

[HttpPost("asr")]
public async Task<IActionResult> Speech(MultiInput speechResponse) =>
    await FetchQuestion(speechResponse.Speech.SpeechResults)
        .MapAsync(aiAdapter.AskAsync)
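        .Map(VoiceAdapter.MakeAdamTalk)
        // Wrap the Talk action in an NCCO, as the answer webhook does
        .Map(talkAction => this.Ok(new Ncco(talkAction)))
        .IfNone(this.Ok(new Ncco(VoiceAdapter.MakeAdamTalk(FailedToUnderstandQuestion))));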

private static Maybe<string> FetchQuestion(SpeechRecognitionResult[] results) =>
    results.Length != 0 ? results.First().Text : Maybe<string>.None;

public Task<string> AskAsync(string question)
{
    var client = new ChatGpt(
        configuration["openai-key"] ?? throw new InvalidOperationException("Missing OpenAI key."),
        new ChatGptOptions { Model = "gpt-4-turbo" });
    return client.Ask(configuration["prompt"] + "\n" + question);
}

Thanks to our custom prompt, the Oracle will generate a text response where Adam appears to be thoughtfully considering the answer. Whether it’s a complex historical fact or a tricky bit of trivia, it will produce the perfect response, ready to be delivered in Adam’s voice.
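
If you'd rather call OpenAI's Chat Completions endpoint directly, one option is to keep the prompt and the question in separate messages instead of concatenating them. The sketch below only builds the request body and is not the code used above. `OracleRequestSketch` is a hypothetical name, and sending the request with your API key is left out.

```csharp
using System.Text.Json;

// Hypothetical helper: builds a Chat Completions request body.
public static class OracleRequestSketch
{
    public static string BuildBody(string prompt, string question) =>
        JsonSerializer.Serialize(new
        {
            model = "gpt-4-turbo",
            messages = new[]
            {
                // The persona and rules go in the system message...
                new { role = "system", content = prompt },
                // ...while the transcribed question is the user message.
                new { role = "user", content = question },
            },
        });
}
```

Keeping the two apart means the transcription can never be mistaken for part of Adam's instructions, which is handy when the question itself contains quotes or odd phrasing.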

Here's the response we get to my question:

Hmm... that’s a tough one. Let me think...
I remember reading about this a while ago.
I don’t think it’s The Breakers or Pinehurst, those don’t seem right.
I’m leaning towards The Greenbrier, if I’m not mistaken.
Yeah, I’m pretty sure it’s B: The Greenbrier.
That’s the one they used for Congress during the Cold War.
Good luck, hope you nail it!

However, there’s one caveat: these things don’t happen as fast as snapping your fingers. Let’s consider our scenario:

I deliver my question.
After two seconds of silence, the voice feed is transcribed into text.
The question is sent to GPT, which generates the answer.

We’re looking at potential latency — anywhere from five to ten seconds before we receive the answer. That’s precious time ticking away, and it could jeopardize my whole plan.

A potential technical solution would be to rely on OpenAI's Realtime API, which was recently released. But unfortunately, that's a luxury of time I don't have.

Instead, I’ve thought of a clever workaround: small talk. I know Jeremy Clarkson loves his Ford GT40, and if I casually bring it up, I can get him talking. His enthusiasm for cars will buy me the few extra seconds I need for the AI to generate the answer.

And once it’s ready? We respond to the webhook with a Talk action carrying the answer, and Adam will confidently deliver it to the host.
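
Should the oracle take longer than the small talk lasts, a safety net could cap the wait and fall back to a canned line. This isn't part of my plan as described, just a sketch of one way to do it with `Task.WhenAny`. `LatencyGuardSketch`, the time budget and the fallback text are all assumptions.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: returns the answer if it arrives in time, otherwise a fallback line.
public static class LatencyGuardSketch
{
    public static async Task<string> AnswerOrFallback(
        Task<string> answer, TimeSpan budget, string fallback)
    {
        // Whichever finishes first wins: the answer or the timer.
        var winner = await Task.WhenAny(answer, Task.Delay(budget));
        return winner == answer ? await answer : fallback;
    }
}
```

In the webhook, that could look like `await LatencyGuardSketch.AnswerOrFallback(aiAdapter.AskAsync(question), TimeSpan.FromSeconds(8), FailedToUnderstandQuestion)`, with the eight seconds picked arbitrarily.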

Step 5: The Win

With the answer in hand, I repeat Adam’s response — "B: The Greenbrier" — back to the host, calm as ever. And just like that, I’ve won!

From the outside, no one suspects a thing. To the audience, it looks like I’ve called a living, breathing human who just happens to know every answer. Little do they know, it was all powered by Vonage APIs, a bit of AI, and — most importantly — creativity.

The perfect heist wasn’t about the tools but how I used them.

Wrap Up

Back to reality, I’m obviously not a millionaire — nor a criminal mastermind.

In truth, my plan has plenty of holes. For one, I’d have to make sure the host doesn’t ask Adam any follow-up questions, or worse, call the number before the final moment. The workflow I’ve set up is also pretty rigid: Greeting → Question → Answer. If we deviate from that, the entire plan falls apart.

And, of course, there’s the small matter of making it to the final question! I don’t have the knowledge to answer 14 questions on "Who Wants to Be a Millionaire" — maybe one of my multiverse variants does, but certainly not me.

On the technical side, speech-to-text and text-to-speech may not be new technologies, but they remain incredibly useful. However, the real takeaway from this post is automation: plugging together platforms like Vonage and OpenAI to create a specific workflow. It's not just about using cutting-edge tools but about how you combine them to create something unique.

As usual, feel free to hit me up on my LinkedIn or join us on the Vonage Developer Slack. You can also message us at @VonageDev on X.

Happy coding, and I'll catch you later!