How to Build a Simple IVR with Speech and Touch-Tone Input
An Interactive Voice Response (IVR) allows you to automate phone interactions by providing callers with a menu of options. While traditional IVRs rely on keypad (DTMF) input, modern systems often include Speech-to-Text (ASR) for a more natural user experience. You can refer to the Advanced IVR guide for more details.
In this guide, you will build a Node.js application that answers a call and prompts the user to either press a key or speak. The application will then repeat that input back to the caller.
Prerequisites
- A Vonage API account. Sign up for free.
- Node.js installed on your machine.
- ngrok installed on your machine.
Initialize Your Project Folder
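Before configuring your Vonage resources, create a home for your code. This ensures you have a destination for your security keys later. Create a folder named simple-ivr, run npm init -y inside it, and install the two packages this guide's code requires with npm install express body-parser.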
Expose Your Local Server
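Vonage needs to send webhooks to your local machine. Use ngrok to expose port 3000, the port the application in this guide listens on, by running ngrok http 3000 in a separate terminal.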
Note: Keep this terminal open and copy your ngrok URL. You'll need it in the next steps.
Provision Your Vonage Resources
Configure your environment using the Vonage API Dashboard.
Create a Voice Application
- Navigate to Applications > Create a new application.
- Give your application a name (e.g., Simple-IVR-Speech-DTMF).
- Click Generate public and private key.
- Move the downloaded private.key file into your simple-ivr project folder.
- Under Capabilities, enable Voice.
- Set the Answer URL to your ngrok URL with /webhooks/answer appended. Example: https://{random-id}.ngrok.app/webhooks/answer. Set the method to GET.
- Set the Event URL to your ngrok URL with /webhooks/events appended. Example: https://{random-id}.ngrok.app/webhooks/events. Set the method to POST.
- Click Generate new application at the bottom.
Link a Virtual Number
- Navigate to Phone Numbers > Buy Numbers and rent a number with Voice capabilities.
- Go back to Applications, select your IVR application, and link the new number to it.
Handle the Inbound Call
When a user calls your number, Vonage requests an NCCO (Call Control Object) from your Answer URL. Create a file named index.js and add the following code:
const express = require('express');
const bodyParser = require('body-parser');

const app = express();
app.use(bodyParser.json());

// 1. Initial greeting and input request
app.get('/webhooks/answer', (req, res) => {
  const ncco = [
    {
      action: 'talk',
      text: 'Hello. Please enter a digit or say something.',
      bargeIn: true
    },
    {
      action: 'input',
      type: ['dtmf', 'speech'],
      dtmf: { maxDigits: 1 },
      speech: { language: 'en-us' },
      eventUrl: [`${req.protocol}://${req.get('host')}/webhooks/input`]
    }
  ];
  res.json(ncco);
});
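The NCCO above derives its eventUrl from the incoming request. As a minimal sketch of an alternative, the same NCCO can be built by a plain function that takes the public base URL explicitly, which makes it easy to inspect without placing a call. The function name buildAnswerNcco, the BASE_URL environment variable, the placeholder URL, and the timeOut and endOnSilence values are assumptions for illustration, not part of the original application; check the option names and ranges against the NCCO reference before relying on them.

```javascript
// Sketch: build the answer NCCO from an explicit base URL.
// buildAnswerNcco, BASE_URL, and the tuning values are illustrative assumptions.
function buildAnswerNcco(baseUrl) {
  return [
    {
      action: 'talk',
      text: 'Hello. Please enter a digit or say something.',
      bargeIn: true // let the caller interrupt the prompt with input
    },
    {
      action: 'input',
      type: ['dtmf', 'speech'],
      // Assumed tuning: wait up to 5 seconds for a single digit.
      dtmf: { maxDigits: 1, timeOut: 5 },
      // Assumed tuning: treat 1.5 seconds of silence as the end of speech.
      speech: { language: 'en-us', endOnSilence: 1.5 },
      eventUrl: [`${baseUrl}/webhooks/input`]
    }
  ];
}

// Placeholder URL; replace it with your own ngrok URL.
const baseUrl = process.env.BASE_URL || 'https://example.ngrok.app';
console.log(JSON.stringify(buildAnswerNcco(baseUrl), null, 2));
```

Inside the answer route, res.json(buildAnswerNcco(baseUrl)) would return the same structure as the inline array.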
Process Speech and DTMF Input
Add the input handler to your index.js to process the payload Vonage sends back once the user interacts with the menu.
// 2. Handle the user's response
app.post('/webhooks/input', (req, res) => {
  let responseText = "I'm sorry, I didn't catch that.";
  const speechResults = req.body.speech && req.body.speech.results;

  if (req.body.dtmf && req.body.dtmf.digits) {
    responseText = `You pressed ${req.body.dtmf.digits}.`;
  } else if (Array.isArray(speechResults) && speechResults.length > 0) {
    // Guard against an empty results array before reading the first transcript
    const transcript = speechResults[0].text;
    responseText = `You said: ${transcript}.`;
  }

  res.json([{ action: 'talk', text: responseText }]);
});

// 3. Log call events
app.post('/webhooks/events', (req, res) => {
  console.log(req.body);
  res.sendStatus(200);
});

app.listen(3000, () => console.log('IVR server running on port 3000'));
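To check the branching in the input handler without placing a call, the decision logic can be pulled into a pure function and run against sample payloads. This is a sketch only: buildResponseText is a hypothetical helper, and the sample objects are hand-written approximations that use just the fields the handler reads (dtmf.digits and speech.results[0].text), not captured Vonage payloads.

```javascript
// Sketch: the input handler's decision logic as a pure function.
// buildResponseText is a hypothetical helper, not part of the original code.
function buildResponseText(body) {
  const fallback = "I'm sorry, I didn't catch that.";
  if (!body) return fallback;

  // Keypad input takes priority when digits are present.
  const digits = body.dtmf && body.dtmf.digits;
  if (digits) return `You pressed ${digits}.`;

  // Otherwise use the first speech result, if there is one.
  const results = body.speech && body.speech.results;
  if (Array.isArray(results) && results.length > 0 && results[0].text) {
    return `You said: ${results[0].text}.`;
  }
  return fallback;
}

// Hand-written sample payloads (assumed shapes, for illustration only).
const samples = [
  { dtmf: { digits: '1' }, speech: { results: [] } },
  { dtmf: { digits: '' }, speech: { results: [{ text: 'hello' }] } },
  { dtmf: { digits: '' }, speech: { results: [] } }
];

for (const sample of samples) {
  console.log(buildResponseText(sample));
}
// Prints:
// You pressed 1.
// You said: hello.
// I'm sorry, I didn't catch that.
```

The route body would then reduce to res.json([{ action: 'talk', text: buildResponseText(req.body) }]).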
Test the IVR
- Start your server: node index.js.
- Call your Vonage number:
  - Keypad: Press 1. The IVR should say, You pressed 1.
  - Speech: Say Hello. The IVR should say, You said: Hello.
Next Steps
- Call Transfer: Add a connect action to connect the caller to the relevant department based on user input, via phone number, SIP endpoint, or your web application using the Client SDK.
- Call Recording: Record and transcribe the call with the record action.
- Voice AI: Pass user input to your AI agent to provide the caller with valuable information.
- Customization: Change the text-to-speech voice or use the stream action in your NCCO to play pre-recorded MP3 files instead of text-to-speech.
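As a rough sketch of the Call Transfer idea, the function below maps a pressed digit to a department and returns an NCCO that ends with a connect action to a phone endpoint. Everything specific here is an assumption for illustration: the buildTransferNcco name, the department list, and all phone numbers are placeholders.

```javascript
// Sketch: route a pressed digit to a department with a connect action.
// The directory and every phone number below are placeholders.
const departments = new Map([
  ['1', { name: 'sales', number: '14155550101' }],
  ['2', { name: 'support', number: '14155550102' }]
]);

function buildTransferNcco(digit, fromNumber) {
  const department = departments.get(digit);
  if (!department) {
    // Unknown option: tell the caller instead of connecting anywhere.
    return [{ action: 'talk', text: 'Sorry, that is not a valid option.' }];
  }
  return [
    { action: 'talk', text: `Connecting you to ${department.name}.` },
    {
      action: 'connect',
      from: fromNumber, // your Vonage virtual number
      endpoint: [{ type: 'phone', number: department.number }]
    }
  ];
}

console.log(JSON.stringify(buildTransferNcco('1', '14155550100'), null, 2));
```

In the input handler, the digits from the webhook payload would be passed as the first argument and the returned array sent back with res.json.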