https://d226lax1qjow5r.cloudfront.net/blog/blogposts/extending-nexmo-streaming-transcription-dr/E_Extending-Nexmo_Steaming-Transcript_1200x600.jpg

Kelly J AndrewsFormer Team Member

Kelly J Andrews is a developer advocate for Nexmo and has been tinkering with computers for over 30 years, using BASIC for the first time at the age of 5.

It wasn't until building his first webpage in 1997, and trying out JavaScript for the first time that he found a true calling. Kelly now fights for JavaScript, testable code, and fast delivery.

You can find him singing karaoke, performing magic, or cheering for the Cubs and Fighting Irish.

Extending Nexmo: Streaming Transcription Comparison

Published on May 10, 2021

#streaming-transcription

Time to read: 6 minutes

In the world of communication, words are vital, and getting the right ones is incredibly important. Transcriptions of voice calls can help businesses understand emerging trends in content from sales and support phone calls.

Transcription is the process of taking audio conversations and turning them into written words. In the past, this was handled by humans listening to a recording and typing out the words manually. This practice still exists today. There are now automated services that have emerged, allowing software developers to send audio files to services for a more immediate result.

This work alone can provide incredible insights, but I'm focused specifically on streaming transcriptions where real-time information can provide another level of detail to your sales and support team.

Overview

This post is a comparison and overview of the four leading cloud provider's streaming transcription services: Amazon Transcribe, Azure Cognitive Speech Service, Google Cloud Speech-to-Text, and IBM Watson Speech-to-Text.

Using the Nexmo Voice API and WebSockets, I connected a voice call from my cell phone and streamed the audio to test the service. During the process, I took a few notes regarding things like features, ease of use, accuracy, and costs.

Accuracy

One of the more critical factors when selecting a transcription service is its accuracy. Having incorrect information would cause any data to be untrusted.

I tested each service using an 8kHz sample rate using my cell phone. You can also use 16kHz, but in some cases, it's actually less accurate. Using various phrases, I spoke at a relatively average rate of speech at a moderate volume. The room was also quiet enough to remove background noises.

Every service transcribed the phrases I used with no issues, but some of it depended on how well I spoke. Unfortunately, any service struggles with lack of enunciation.

Otherwise, you can feel reasonably confident with any one of these services to handle your transcriptions.

"Stand clear of the doors."

IBM Watson Speech-to-Text - 90% (It couldn't get "stand" no matter how clear I spoke)
Azure Cognitive Speech Service - 100%
Google Cloud Speech-to-Text - 100%
Amazon Transcribe - 100%

"Shake well before serving"

IBM Watson Speech-to-Text - 100%
Azure Cognitive Speech Service - 100%
Google Cloud Speech-to-Text - 100%
Amazon Transcribe - 100%

"Peter Piper picked a peck of pickled peppers."

IBM Watson Speech-to-Text - 100%
Azure Cognitive Speech Service - 100%
Google Cloud Speech-to-Text - 100%
Amazon Transcribe - 100%

Features

The feature sets for most of the services are very similar:

Uses Machine Learning for Accuracy
Custom models for optimization
Audio files and streaming
Contextual formatting for proper nouns
Punctuation support

Language Support

Google Cloud Speech-to-Text has far-and-away the highest number of supported languages at 120 and can auto-detect the language in most cases. The next closest language support is Azure Cognitive Speech Service with 30 and IBM Watson Speech-to-Text with 15. Amazon Transcribe is most restrictive with only 5 languages available for streaming transcription (more available for audio files).

Ease of Use

Each provider has their unique path to success. All of them provide getting started guides in various formats. However, they tend to either be for sending an audio file or capturing microphone input. While microphone input is the streaming method we need to use with the Nexmo WebSocket, it comes with some additional front end code that was unnecessary.

Google Cloud Speech-to-Text

The Google SDK was simple to implement by creating a SpeechClient with the provided SDK and the Google credentials file provided during the service account user.

The SpeechClient provides a method called streamingRecognize which provides events.

const client = new speech.SpeechClient();

  let request ={
    config: {
      encoding: 'LINEAR16',
      sampleRateHertz: 8000,
      languageCode: 'en-US'
    },
    interimResults: false
  };

  const recognizeStream = client
  .streamingRecognize(request)
  .on('error', console.error)
  .on('data', data => {
    console.dir(data, {depth: null});
  });

Google's simplicity gets an advantage over other competitors. You can get up and running with little effort, and for me, that's a huge win.

Google Cloud Speech-to-Text Documentation

Azure/IBM

Both IBM and Azure provide language-specific SDKs and plenty of getting started documentation to get you going, but the API direct call is straightforward to implement. These services can both be used with just connecting to an API route with the wss:// protocol and providing a key.

Azure Cognitive Speech Service

wss://region.stt.speech.microsoft.com/speech/recognition/interactive/cognitiveservices/v1?format=simple&language=LANG_CODE

IBM Watson Speech-to-Text

wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize?model=langage_model

There are some additional headers to be provided, but you can find the documentation for each service to be a beneficial resource.

Amazon Transcribe

I have to admit, AWS was the hardest for me to implement. The documentation had very few examples, and the SDK didn't seem to support streaming services at all. When I first set out to build the Amazon Transcribe integration, HTTP/2 was the only protocol available and required signing every request sent to the server. Being a bit of a novice with AWS in general, this proved to be quite tricky for me.

Where this story makes a great turn around, is when Developer Relations lead, Brandon West stepped in, and wrote some sample code using a WebSocket connection and saved me. The sample code made things much more accessible, but the general authorization process and call signing make Amazon Transcribe a bit harder to implement in general.

Here I used the WebSocket Node package and created a signed URL to create the connection.

  let url = v4.createPresignedURL(
    'GET',
    `transcribestreaming.${process.env.AWS_REGION}.amazonaws.com:8443`,
    '/stream-transcription-websocket',
    'transcribe',
    crypto.createHash('sha256').update('', 'utf8').digest('hex'), {
        'key': process.env.AWS_ACCESS_KEY_ID,
        'secret': process.env.AWS_SECRET_ACCESS_KEY,
        'protocol': 'wss',
        'expires': 15,
        'region': process.env.AWS_REGION,
        'query': `language-code=${process.env.LANG_CODE}&media-encoding=pcm&sample-rate=${process.env.SAMPLE_RATE}`
    }
  );

  let socket = new WebSocket(url);

Amazon Transcribe is an excellent service, but if it's not something you are accustomed to, the learning curve can be a challenge.

Amazon Transcribe Documentation

Cost

All of the service providers offer a lite or free tier to get you started. The base cost is on the amount of audio time transcribed. The providers use different increments of time, so I normalized them in the table below to help you understand precisely the difference in costs.

Provider	Free Tier	Cost	Normalized Cost
Amazon Transcribe	60 min/month for 12 months	$0.006/~10 seconds	$0.036/minute
Azure Cognitive Speech Service	5 audio hours free per month	$1/audio hour	$0.016/minute
Google Cloud Speech-to-Text	60 minutes free	$0.006/15 seconds	$0.024/minute
IBM Watson Speech-to-Text	500 Minutes per Month	$0.02/minute	$0.02/minute

Recap

Streaming transcription is a great way to provide insight in real-time. Each of the four leading cloud providers has great offerings with reliable, accurate speech-to-text services. Using Nexmo as the audio stream is a great extension that allows for more in-depth communication experiences for your customers.

I highly recommend Google Speech-to-text as a solid pick. The price is competitive, and the service is robust and reliable. It's easy to get set up and start using it immediately, which is a bonus.

If you would like to try out any or all of these services, the Nexmo Extend team has created example code to help you get started.

Receive Voice Call Events with Ruby on Rails

August 20, 2025

Time to read: 3 minutes

#ruby

#voice-api

#ruby-on-rails

Receive Voice Call Events with Ruby on Rails

Learn how to receive voice call events in Ruby on Rails using Vonage webhooks. Track delivery status, handle webhooks, and update your database in real-time.

How to Create a Virtual Green Screen Using the Vonage Video API

August 19, 2025

Time to read: 9 minutes

#nodejs

#html

How to Create a Virtual Green Screen Using the Vonage Video API

Overlay presenters onto shared content using background replacement and HTML5 canvas with the Vonage Video API.

How to Make Outbound Phone Calls with Ruby on Rails

August 14, 2025

Time to read: 6 minutes

#ruby

#voice-api

#ruby-on-rails

How to Make Outbound Phone Calls with Ruby on Rails

Learn how to integrate the Vonage Voice API in Ruby on Rails to make text-to-speech calls programmatically.

Kelly J AndrewsFormer Team Member

Kelly J Andrews is a developer advocate for Nexmo and has been tinkering with computers for over 30 years, using BASIC for the first time at the age of 5.

It wasn't until building his first webpage in 1997, and trying out JavaScript for the first time that he found a true calling. Kelly now fights for JavaScript, testable code, and fast delivery.

You can find him singing karaoke, performing magic, or cheering for the Cubs and Fighting Irish.

August 20, 2025

Time to read: 3 minutes

#ruby

#voice-api

#ruby-on-rails

Receive Voice Call Events with Ruby on Rails

Learn how to receive voice call events in Ruby on Rails using Vonage webhooks. Track delivery status, handle webhooks, and update your database in real-time.

August 19, 2025

Time to read: 9 minutes

#nodejs

#html