Introducing Voice API Speech Recognition
Published on May 25, 2021

We recently announced a new Automatic Speech Recognition (ASR) feature which enables your application to understand what humans are saying when they speak. This feature allows you to create a full range of voice interactions from simple IVRs with voice navigation, to sophisticated voice bots and assistants.

Using ASR you can provide customers with the fastest service possible, easily enable speech-based self-serve operations, whilst delivering a superior user experience, and reducing operational costs. In this post, we’ll show you how to build a simple IVR app a user can navigate using only their voice.

Vonage API Account

To complete this tutorial, you will need a Vonage API account. If you don’t have one already, you can sign up today and start building with free credit. Once you have an account, you can find your API Key and API Secret at the top of the Vonage API Dashboard.

Before You Begin

To start, please be sure you have a Vonage API account and that you have created a voice application.

We will use Node.js for this example, as well as the Express web application framework and the body-parser packages. For ease you can use the NPM command below to install them into your project:

npm install express body-parser

Although this example uses Node.js, it is possible to recreate the same functionality using your preferred code language/framework by using the same NCCO as we show below.

Writing the Code

Speech recognition is activated by the NCCO input command, which is also suitable for capturing DTMF tones. Assuming you have a number assigned to your application already, create a new file called index.js and start by implementing the answer webhook as shown in the code below:

'use strict'
 
const express = require('express')
const bodyParser = require('body-parser')
const app = express()
const http = require('http')
 
app.use(bodyParser.json())
 
app.get('/webhooks/answer', (request, response) => {
 
  const ncco = [{
      action: 'talk',
      text: 'Thank you for calling Example Inc.! Please tell us how we can help you today. Say Sales to talk to the sales department, Support to get technical support.',
      bargeIn: true
    },
    {
      action: 'input',
      eventUrl: [
        `${request.protocol}://${request.get('host')}/webhooks/asr`
      ],
      speech: {
        uuid: [request.query.uuid],
        context: ["Sales", "Support"]
      }
    }
  ]
 
  response.json(ncco)
})

In the code snippet above:

uuid is the call (leg) identifier and is a required parameter for this action. You can get this UUID from the answer webhook query params.

bargeIn: true in the talk action allows the user to start speaking at any moment while the Text to Speech message is being played, which might be suitable if the user has already heard this message on a previous call.

context in the input action increases the accuracy of speech recognition and is suitable for IVR-style cases.

When the user says the department name and the word is recognized by Vonage, you’ll get a webhook callback to the event_url you specified in the input action. The request body for this callback contains speech recognition results and looks like this:

{
  "speech": {
    "timeout_reason": "end_on_silence_timeout",
    "results": [
      {
        "confidence": "0.9320692",
        "text": "sales"
      }
    ]
  },
  "dtmf": {
    "digits": null,
    "timed_out": false
  },
  "from": "15557654321",
  "to": "15551234567",
  "uuid": "abfd679701d7f810a0a9a44f8e298b33",
  "conversation_uuid": "CON-64e6c8ef-91a9-4a21-b664-b00a1f41340f",
  "timestamp": "2020-04-17T17:31:53.638Z"
}

Next, you need to implement a webhook that decides what to do with the information returned by ASR:

app.post('/webhooks/asr', (request, response) => {
 
  console.log(request.body)
 
  var department = ""
 
  if (request.body.speech.results)
    department = request.body.speech.results[0].text
 
  var departmentNumber = ""
 
  switch (department) {
    case 'sales':
      departmentNumber = "15551234561"
      break;
    case 'support':
      departmentNumber = "15551234562"
      break;
    default:
      break;
  }
 
  var ncco = ""
 
  if (departmentNumber != "") {
    ncco = [{
      "action": "connect",
      "from": "15551234563",
      "endpoint": [{
        "type": "phone",
        "number": departmentNumber
      }]
    }]
  } else {
    ncco = [{
      action: 'talk',
      text: `Sorry, we didn't understand your message. Please try again.`
    }, {
      action: 'input',
      eventUrl: [
        `${request.protocol}://${request.get('host')}/webhooks/asr`
      ],
      speech: {
        uuid: [request.body.uuid],
        context: ["Sales", "Support"]
      }
    }]
 
  }
 
  response.json(ncco)
})

In the snippet above, you should replace the departmentNumber values with some other phone numbers so you can receive a call to, and from number to one of your Nexmo account numbers.

Finally, create your Node.js server:

const port = 3000
app.listen(port, () => console.log(`Listening on port ${port}`))

Testing Your Application

To begin testing locally you will need to expose your local server to the rest of the world so that your answer and event_url webhooks can be reached. You can use Ngrok to do this by following the Testing with Ngrok guide in our documentation.

With your app running call the number associated with the application you created in the dashboard. You will hear the greeting message with IVR options and be able to connect to one of your numbers by saying the department name.

Try to add other options and different words to capture, for example, instead of announcing the options in the greeting, keep just the question quite generic and then try to analyze the user’s answer by searching for the words "sales", "support" or even "buy" to convert the IVR to a smart assistant.

What’s Next?

Should you have any questions or feedback let us know on our Community Slack or by getting in touch on support@nexmo.com

Victor ShisterovVoice API Product Manager

Victor is a product manager for Vonage Voice API with seven years of experience in the telecom industry, and a software developer since his childhood. He is passionate about making technically complex things easy to understand and to use, by keeping powerful API self-descriptive and consistent. When not inventing and coding, he builds scale models and plays folk musical instruments.

Ready to start building?

Experience seamless connectivity, real-time messaging, and crystal-clear voice and video calls-all at your fingertips.

Subscribe to Our Developer Newsletter

Subscribe to our monthly newsletter to receive our latest updates on tutorials, releases, and events. No spam.