Speech to Text

Automatic Speech Recognition (ASR) lets apps accept voice input for use cases such as IVR, identification, and various kinds of voice bots and assistants. With this feature, when the app expects the user to answer a question by speaking rather than entering digits (DTMF), it receives the user's speech transcribed as text. The app can then continue the call flow according to its business logic, based on what the user said.

Note: This is a chargeable feature. Exact rates can be found on the Voice API Pricing page under 'Programmable Features'.

How it works

Typically, ASR is used together with an audio message played to the user. The message can be an audio file, Text-to-Speech, or a combination of both played sequentially. To activate ASR, the NCCO must contain an input action with the speech parameters specified.

NCCO Example

[
  {
    "action": "talk",
    "text": "Hello!"
  },
  {
    "action": "talk",
    "text": "Please tell us, how can we help you today?",
    "bargeIn": true
  },
  {
    "eventUrl": ["https://api.example.com/callbacks/events"],
    "eventMethod": "POST",
    "action": "input",
    "type": ["speech"],
    "speech": {
      "provider": "google",
      "providerOptions": {
        "language_code": "en-US",
        "speech_contexts": [
          {
            "phrases": ["hello", "world", "speech recognition"],
            "boost": 10.0
          }
        ]
      }
    }
  }
]

The NCCO Reference Guide describes all the parameters that can be used with the ASR input NCCO action.
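As a rough illustration, an application could assemble the same kind of NCCO in code before returning it as JSON. The sketch below uses only the parameters shown in the example above. The function name build_asr_ncco and its arguments are hypothetical and not part of any Vonage SDK.

```python
import json


def build_asr_ncco(prompt: str, event_url: str, language_code: str = "en-US") -> list:
    """Return an NCCO that plays a prompt and then listens for speech."""
    return [
        # The prompt allows barge-in so the caller can answer early.
        {"action": "talk", "text": prompt, "bargeIn": True},
        {
            "action": "input",
            "type": ["speech"],
            "eventUrl": [event_url],
            "eventMethod": "POST",
            "speech": {
                "provider": "google",
                "providerOptions": {"language_code": language_code},
            },
        },
    ]


ncco = build_asr_ncco(
    "Please tell us, how can we help you today?",
    "https://api.example.com/callbacks/events",
)
print(json.dumps(ncco, indent=2))
```

The printed JSON is what the application would send back as the NCCO for the call.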

Input Type

Set type to [ "speech" ] for speech input only, or to [ "dtmf", "speech" ] to accept either speech or DTMF.

Provider and Provider Options

When you use speech input, you can explicitly choose which Speech-to-Text provider Vonage uses by setting provider. The only supported value is google. If you do not set provider, Vonage uses the default behavior.

Example speech object with providerOptions:

{
  "speech": {
    "provider": "google",
    "providerOptions": {
      "language_code": "en-US",
      "speech_contexts": [
        {
          "phrases": ["hello", "world", "speech recognition"],
          "boost": 10.0
        }
      ]
    }
  }
}

Use providerOptions to pass provider-specific Speech-to-Text configuration. The keys and values in providerOptions are determined by the chosen provider. For details on how to specify them, refer to the documentation for Google.

If you omit provider, you can omit providerOptions as well. If you set provider, include providerOptions (it can be an empty object {} or null).
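The rule above can be sketched as a small helper that always emits providerOptions whenever provider is set. The helper name speech_settings is hypothetical. Ignoring provider_options when no provider is given is a choice made for this sketch, not documented behavior.

```python
from typing import Optional


def speech_settings(provider: Optional[str] = None,
                    provider_options: Optional[dict] = None) -> dict:
    """Build the provider-related part of the speech object."""
    if provider is None:
        # No provider: omit both keys and rely on the default behavior.
        return {}
    if provider != "google":
        # "google" is the only supported provider value.
        raise ValueError(f"unsupported provider: {provider!r}")
    # providerOptions accompanies provider; an empty object is acceptable.
    options = provider_options if provider_options is not None else {}
    return {"provider": provider, "providerOptions": options}


print(speech_settings())
print(speech_settings("google"))
print(speech_settings("google", {"language_code": "en-GB"}))
```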

Important: Vonage limits what you can set in providerOptions. Currently, the model option accepts only certain values. If you do not specify a model, Vonage uses a default one.

Model support

Vonage supports the following model values in providerOptions:

Provider	Supported `model` values	Default `model`
Google	`latest_long`, `latest_short`, `command_and_search`, `phone_call`, `default`, `medical_conversation`, `medical_dictation`, `telephony`, `telephony_short`	`default`
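An application might check the model value against this table before sending the NCCO. The following is a minimal sketch: the set of values is copied from the table above, while the helper name google_provider_options and the decision to raise ValueError are assumptions made for illustration.

```python
from typing import Optional

# Model values listed for Google in the table above.
GOOGLE_MODELS = {
    "latest_long", "latest_short", "command_and_search", "phone_call",
    "default", "medical_conversation", "medical_dictation",
    "telephony", "telephony_short",
}


def google_provider_options(language_code: str, model: Optional[str] = None) -> dict:
    """Build providerOptions, rejecting model values not in the table."""
    options = {"language_code": language_code}
    if model is not None:
        if model not in GOOGLE_MODELS:
            raise ValueError(f"unsupported model: {model!r}")
        options["model"] = model
    # When model is omitted, the key is left out so the default model applies.
    return options


print(google_provider_options("en-US", "telephony"))
print(google_provider_options("en-US"))
```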

Language (Legacy)

Specify the expected language of the user's speech with the language parameter (en-US by default).

When using provider and providerOptions, the supported languages and language formats are determined by the chosen provider. For details, refer to the documentation for Google (language_code) on how to specify them inside providerOptions.

Supported languages

Language	Code
Afrikaans (South Africa)	`af-ZA`
Albanian (Albania)	`sq-AL`
Amharic (Ethiopia)	`am-ET`
Arabic (Algeria)	`ar-DZ`
Arabic (Bahrain)	`ar-BH`
Arabic (Egypt)	`ar-EG`
Arabic (Iraq)	`ar-IQ`
Arabic (Israel)	`ar-IL`
Arabic (Jordan)	`ar-JO`
Arabic (Kuwait)	`ar-KW`
Arabic (Lebanon)	`ar-LB`
Arabic (Morocco)	`ar-MA`
Arabic (Oman)	`ar-OM`
Arabic (Qatar)	`ar-QA`
Arabic (Saudi Arabia)	`ar-SA`
Arabic (State of Palestine)	`ar-PS`
Arabic (Tunisia)	`ar-TN`
Arabic (United Arab Emirates)	`ar-AE`
Armenian (Armenia)	`hy-AM`
Azerbaijani (Azerbaijan)	`az-AZ`
Basque (Spain)	`eu-ES`
Bengali (Bangladesh)	`bn-BD`
Bengali (India)	`bn-IN`
Bulgarian (Bulgaria)	`bg-BG`
Burmese (Myanmar)	`my-MM`
Catalan (Spain)	`ca-ES`
Chinese, Cantonese (Traditional, Hong Kong)	`yue-hant-HK`
Chinese, Mandarin (Simplified, China)	`zh` (`cmn-hans-cn`)
Chinese, Mandarin (Traditional, Taiwan)	`zh-TW` (`cmn-hant-TW`)
Croatian (Croatia)	`hr-HR`
Czech (Czech Republic)	`cs-CZ`
Danish (Denmark)	`da-DK`
Dutch (Belgium)	`nl-BE`
Dutch (Netherlands)	`nl-NL`
English (Australia)	`en-AU`
English (Canada)	`en-CA`
English (Ghana)	`en-GH`
English (India)	`en-IN`
English (Ireland)	`en-IE`
English (Kenya)	`en-KE`
English (New Zealand)	`en-NZ`
English (Nigeria)	`en-NG`
English (Philippines)	`en-PH`
English (Singapore)	`en-SG`
English (South Africa)	`en-ZA`
English (Tanzania)	`en-TZ`
English (United Kingdom)	`en-GB`
English (United States)	`en-US`
Estonian (Estonia)	`et-EE`
Filipino	`fil-PH`
Finnish (Finland)	`fi-FI`
French (Canada)	`fr-CA`
French (France)	`fr-FR`
Galician (Spain)	`gl-ES`
Georgian (Georgia)	`ka-GE`
German (Germany)	`de-DE`
Greek (Greece)	`el-GR`
Gujarati (India)	`gu-IN`
Hebrew (Israel)	`he-IL`
Hindi (India)	`hi-IN`
Hungarian (Hungary)	`hu-HU`
Icelandic (Iceland)	`is-IS`
Indonesian (Indonesia)	`id-ID`
Italian (Italy)	`it-IT`
Japanese (Japan)	`ja-JP`
Javanese (Indonesia)	`jv-ID`
Kannada (India)	`kn-IN`
Khmer (Cambodia)	`km-KH`
Korean (South Korea)	`ko-KR`
Lao (Laos)	`lo-LA`
Latvian (Latvia)	`lv-LV`
Lithuanian (Lithuania)	`lt-LT`
Macedonian (North Macedonia)	`mk-MK`
Malay (Malaysia)	`ms-MY`
Malayalam (India)	`ml-IN`
Marathi (India)	`mr-IN`
Mongolian (Mongolia)	`mn-MN`
Nepali (Nepal)	`ne-NP`
Norwegian Bokmål (Norway)	`nb-NO`
Persian (Iran)	`fa-IR`
Polish (Poland)	`pl-PL`
Portuguese (Brazil)	`pt-BR`
Portuguese (Portugal)	`pt-PT`
Punjabi (Gurmukhi, India)	`pa-guru-IN`
Romanian (Romania)	`ro-RO`
Russian (Russia)	`ru-RU`
Serbian (Serbia)	`sr-RS`
Sinhala (Sri Lanka)	`si-LK`
Slovak (Slovakia)	`sk-SK`
Slovenian (Slovenia)	`sl-SI`
Spanish (Argentina)	`es-AR`
Spanish (Bolivia)	`es-BO`
Spanish (Chile)	`es-CL`
Spanish (Colombia)	`es-CO`
Spanish (Costa Rica)	`es-CR`
Spanish (Dominican Republic)	`es-DO`
Spanish (Ecuador)	`es-EC`
Spanish (El Salvador)	`es-SV`
Spanish (Guatemala)	`es-GT`
Spanish (Honduras)	`es-HN`
Spanish (Mexico)	`es-MX`
Spanish (Nicaragua)	`es-NI`
Spanish (Panama)	`es-PA`
Spanish (Paraguay)	`es-PY`
Spanish (Peru)	`es-PE`
Spanish (Puerto Rico)	`es-PR`
Spanish (Spain)	`es-ES`
Spanish (United States)	`es-US`
Spanish (Uruguay)	`es-UY`
Spanish (Venezuela)	`es-VE`
Sundanese (Indonesia)	`su-ID`
Swahili (Kenya)	`sw-KE`
Swahili (Tanzania)	`sw-TZ`
Swedish (Sweden)	`sv-SE`
Tamil (India)	`ta-IN`
Tamil (Malaysia)	`ta-MY`
Tamil (Singapore)	`ta-SG`
Tamil (Sri Lanka)	`ta-LK`
Telugu (India)	`te-IN`
Thai (Thailand)	`th-TH`
Turkish (Turkey)	`tr-TR`
Ukrainian (Ukraine)	`uk-UA`
Urdu (India)	`ur-IN`
Urdu (Pakistan)	`ur-PK`
Uzbek (Uzbekistan)	`uz-UZ`
Vietnamese (Vietnam)	`vi-VN`
Zulu (South Africa)	`zu-ZA`

Context (Legacy)

You can provide hints through the context array parameter to improve recognition quality. The values are words or phrases expected from the user, for example, ['one', 'two', 'three'] or ['support', 'order starship'].

When using provider and providerOptions, the configuration of this parameter is determined by the chosen provider. For details, refer to the documentation for Google (speech_contexts[]) on how to specify it inside providerOptions.
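To make the two styles concrete, the sketch below builds a speech object from the same list of hint phrases in either the legacy form (language and context) or the provider form (language_code and speech_contexts inside providerOptions). The helper name speech_with_hints is hypothetical, and placing language and context directly inside the speech object is an assumption based on the parameter names above.

```python
def speech_with_hints(phrases, use_provider=False, language="en-US"):
    """Return a speech object carrying recognition hints."""
    if use_provider:
        # Provider form: hints go into Google's speech_contexts list.
        return {
            "provider": "google",
            "providerOptions": {
                "language_code": language,
                "speech_contexts": [{"phrases": list(phrases)}],
            },
        }
    # Legacy form: hints go into the context array.
    return {"language": language, "context": list(phrases)}


hints = ["support", "order starship"]
print(speech_with_hints(hints))
print(speech_with_hints(hints, use_provider=True))
```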

Barge In

A returning caller may already know the question and start speaking before the audio message finishes. To support this, enable the bargeIn parameter of the TTS action (or stream, whichever action plays the message).

We recommend making the first TTS/audio message a short greeting with bargeIn disabled, to improve the user experience. If bargeIn is enabled for the initial greeting, the user may inadvertently interrupt it and never hear the prompt, because the application may interpret background noise in those first moments as active interaction.
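This recommendation could be expressed as an NCCO in which only the second message enables bargeIn. The function name greeting_then_prompt and the sample texts are made up for illustration.

```python
def greeting_then_prompt(greeting: str, prompt: str, event_url: str) -> list:
    """Short greeting without barge-in, then an interruptible prompt."""
    return [
        # Greeting: bargeIn is left off so background noise cannot cut it short.
        {"action": "talk", "text": greeting},
        # Prompt: bargeIn lets returning callers answer before it finishes.
        {"action": "talk", "text": prompt, "bargeIn": True},
        {
            "action": "input",
            "type": ["speech"],
            "eventUrl": [event_url],
            "speech": {
                "provider": "google",
                "providerOptions": {"language_code": "en-US"},
            },
        },
    ]


flow = greeting_then_prompt(
    "Hello!",
    "Please tell us, how can we help you today?",
    "https://api.example.com/callbacks/events",
)
for action in flow:
    print(action["action"], action.get("bargeIn", False))
```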

Save Audio

The speech audio can optionally be stored. To get the recording, set the saveAudio parameter; the callback request payload then includes recording_url. This can be useful for comparing the transcription with the original audio. Use the Download a recording code snippet to download the audio.
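A minimal sketch of both sides of this option follows. It assumes saveAudio is a boolean flag inside the speech object (the text above only says the parameter should be set) and that recording_url arrives inside the speech part of the callback, as in the event payload example on this page. The helper name recording_url_from_callback is hypothetical.

```python
from typing import Optional

# Speech settings requesting that the caller's audio be stored (assumed shape).
speech_with_recording = {
    "provider": "google",
    "providerOptions": {"language_code": "en-US"},
    "saveAudio": True,
}


def recording_url_from_callback(payload: dict) -> Optional[str]:
    """Return the recording URL from an input callback, or None if absent."""
    speech = payload.get("speech") or {}
    return speech.get("recording_url")


sample_callback = {
    "speech": {
        "recording_url": "https://api-us.nexmo.com/v1/files/ee94a327-ab6b-4bef-86bf-cadca33343e9",
        "results": [],
    }
}
print(recording_url_from_callback(sample_callback))
```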

Event Payload Example

Once the NCCO input action completes, the input callback is sent:

{
  "speech": {
    "recording_url": "https://api-us.nexmo.com/v1/files/ee94a327-ab6b-4bef-86bf-cadca33343e9",
    "timeout_reason": "end_on_silence_timeout",
    "results": [
      {
        "confidence": "1.0",
        "text": "one two three four five six seven eight nine ten"
      }
    ]
  },
  "dtmf": {
    "digits": null,
    "timed_out": false
  },
  "uuid": "529aa8f0-0ad8-42b0-ad61-b76dca18bb52",
  "conversation_uuid": "CON-7347b08d-5f51-478f-bfb2-e8c3fd39509f",
  "timestamp": "2020-02-07T11:42:40.933Z"
}

The response body is expected to contain a new NCCO with the next call flow actions, based on the application logic and the user input (the text field of each entry in speech.results). See also the Webhook Reference for details.
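As an illustration of handling this callback, the sketch below picks the highest-confidence entry from speech.results and returns a follow-up NCCO. The function name next_ncco and the spoken replies are hypothetical; a real application would apply its own business logic and return the result as the JSON response body.

```python
def next_ncco(payload: dict) -> list:
    """Choose the next call flow actions from an input callback payload."""
    speech = payload.get("speech") or {}
    results = speech.get("results") or []
    if not results:
        # Nothing was recognized (for example, the caller stayed silent).
        return [{"action": "talk", "text": "Sorry, I did not catch that."}]
    # confidence arrives as a string in the example payload, so convert it.
    best = max(results, key=lambda r: float(r.get("confidence", 0)))
    return [{"action": "talk", "text": f"You said: {best['text']}"}]


callback = {
    "speech": {
        "timeout_reason": "end_on_silence_timeout",
        "results": [
            {"confidence": "1.0", "text": "one two three four five six seven eight nine ten"}
        ],
    },
    "dtmf": {"digits": None, "timed_out": False},
}
print(next_ncco(callback))
```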