Live Captions

Use the Live Captions API to transcribe audio streams and generate real-time captions for your application.

The Vonage Video Live Captions API lets you show live captions to end users in a Vonage Video session, using a transcription service; AWS Transcribe is the transcription provider. Because Live Captions captures audio from the Media Router, it can also provide captions for the audio of SIP dial-in participants.

Live Captions is enabled by default for all projects, and it is a usage-based product. Live Captions usage is charged based on the number of audio streams of participants (or stream IDs) that are sent to the transcription service. For more information, see Live Captions API pricing rules.

The Live Captions feature is only supported in routed sessions (sessions that use the Media Router). You can send up to 50 audio streams from a single Vonage session at a time to the transcription service for captions.

Live caption illustration

Steps to enable Live Captions

Use the REST API to enable captioning for a session.
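For example, a sketch of the Start Captions request (the endpoint shape, header, and body fields follow the Vonage Video REST API; all values below are placeholders):

```javascript
// Sketch of a Start Captions REST call. The apiKey, sessionId, and token
// values are placeholders -- substitute your own project credentials.
const apiKey = "12345678"; // hypothetical project API key
const url = `https://api.opentok.com/v2/project/${apiKey}/captions`;

const startCaptionsBody = {
  sessionId: "<sessionId>",   // a routed session's ID
  token: "<token>",           // a valid token for the session
  languageCode: "en-US",      // see Supported languages
  maxDuration: 14400,         // seconds; the default maximum is 4 hours
  partialCaptions: true,      // deliver partial (in-progress) captions
};

// The request would be sent with a project JWT, e.g.:
// fetch(url, {
//   method: "POST",
//   headers: { "X-OPENTOK-AUTH": jwt, "Content-Type": "application/json" },
//   body: JSON.stringify(startCaptionsBody),
// });
```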

In publishing clients, use the client SDK method to publish audio to the captions service. See Implementing live captions.

In subscribing clients, call the respective client SDK method for a subscriber to subscribe to captions for a stream.

Upon starting live captioning, the platform securely streams audio to a third-party audio transcription service (Amazon Transcribe).

Use the captioning API in the client SDKs to enable or disable receiving live captions in your application:

Starting or stopping receiving live captions in one client does not affect the captions received by other clients connected to the session.

Supported languages

Live Captions supports 11 languages, including three dialects of English. Pass the desired language as the languageCode option when enabling live captions with the REST API:

  • "en-US" — English, US
  • "en-AU" — English, Australia
  • "en-GB" — English, UK
  • "es-US" — Spanish, US
  • "zh-CN" — Chinese, Simplified
  • "fr-FR" — French
  • "fr-CA" — French, Canadian
  • "de-DE" — German
  • "hi-IN" — Hindi, Indian
  • "it-IT" — Italian
  • "ja-JP" — Japanese
  • "ko-KR" — Korean
  • "pt-BR" — Portuguese, Brazilian
  • "th-TH" — Thai
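If you accept a language choice from your users, you can validate it against the supported codes before calling the REST API; a minimal sketch (the helper names are illustrative, not part of the SDK):

```javascript
// The supported languageCode values listed above, as a checked lookup.
const SUPPORTED_CAPTION_LANGUAGES = new Set([
  "en-US", "en-AU", "en-GB", "es-US", "zh-CN", "fr-FR", "fr-CA",
  "de-DE", "hi-IN", "it-IT", "ja-JP", "ko-KR", "pt-BR", "th-TH",
]);

// Returns true when the code can be passed as the languageCode option.
function isSupportedCaptionLanguage(code) {
  return SUPPORTED_CAPTION_LANGUAGES.has(code);
}
```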

Use cases

Live captions can improve an application's user experience and user engagement. Captioning improves the accessibility of your application, enabling participation by individuals with hearing disabilities. Some laws worldwide require applications to provide captioning.

Captioning can also improve speaker comprehension in noisy or uncontrolled surroundings, thereby improving user engagement.

Live captions are only available for routed sessions (sessions that use the Media Router).

Upon enabling the Live Captions feature:

  • Use the client SDK audio captioning API to start audio captioning for each published stream.
  • The audio stream is sent to a third-party audio transcription service (AWS Transcribe).
  • Use the client SDK audio captioning API to subscribe to the live captions for each published stream.
  • An individual subscriber choosing not to receive captions does not affect the captions received by other subscribers in other clients connected to the session.
  • When the session is over (when all clients have stopped publishing streams to the session), you can explicitly stop captioning using the Stop Captions API. Otherwise, audio captioning automatically stops after the maximum duration (specified when calling the Start Captions API) has expired.

Notes


The default maximum allowed captioning duration for each session is 4 hours. You can set this to another maximum duration when you call the Start Captions API. Upon expiration, the audio captioning will stop without any effect on the ongoing session.

Note that in the current phase, this feature is available only through the REST API and the supported client SDKs.

Live caption status updates

You can set up a webhook to receive events when live captions start, stop, and fail for a session.

  1. Go to your Video API account and select the project from the list of projects in the left-hand menu.
  2. Under Project settings, find Live Captions Monitoring and click Configure.
  3. Submit the URL for callbacks to be sent to.

Secure callbacks: set a signature secret to have webhook callback requests signed with it. See Secure callbacks.

When the status of live captions changes, an HTTP POST request is delivered to the configured callback URL. If no callback URL is configured, no status update is delivered. The raw data of the HTTP request is a JSON-encoded message of the following form:

{
  "captionsId": "<captionsId>",
  "projectId": "<apiKey>",
  "sessionId": "<sessionId>",
  "status": "stopped",
  "createdAt": 1651253477,
  "updatedAt": 1651253837,
  "duration": 360,
  "languageCode": "en-US",
  "reason": "Maximum duration exceeds.",
  "provider": "aws-transcribe",
  "group": "captions"
}

The JSON object includes the following properties:

  • captionsId The unique ID for the audio captioning session.
  • projectId The API key of the project.
  • sessionId The ID of the session for which audio captioning started.
  • status Current status of the live captions:
    • "started" The Vonage Video API platform has successfully allocated the necessary resources to send audio streams for captioning.
    • "transcribing" The transcription service has started (and captioning is in progress).
    • "stopped" Captioning has stopped and all the resources have been deleted.
    • "failed" Captioning failed to allocate the necessary resources or failed to send streams for captioning.
  • createdAt The Unix timestamp (Epoch) at which audio captioning started.
  • updatedAt The Unix timestamp (Epoch) at which audio captioning was last updated. If the status is "stopped", updatedAt indicates the time at which captioning stopped.
  • duration The total duration of the audio captioning, in seconds.
  • languageCode The BCP-47 language code used.
  • reason Additional information about the status change, such as an error message.
  • provider The third-party service provider used for the audio captioning:
    • "aws-transcribe" Amazon Transcribe
  • group The type of the event, which is always set to "captions" for audio caption API events.
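A callback handler can dispatch on these fields; a framework-agnostic sketch (the function name and return strings are illustrative):

```javascript
// Sketch of handling a Live Captions status callback payload.
// The payload fields are those documented above.
function handleCaptionsStatus(payload) {
  if (payload.group !== "captions") {
    return "ignored"; // not a Live Captions event
  }
  switch (payload.status) {
    case "started":
      return `captioning resources allocated for session ${payload.sessionId}`;
    case "transcribing":
      return "captioning in progress";
    case "stopped":
      return `captioning stopped after ${payload.duration}s (${payload.reason})`;
    case "failed":
      return `captioning failed: ${payload.reason}`;
    default:
      return `unknown status: ${payload.status}`;
  }
}
```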

Implementing live captions

Use the live captions API to enable real-time audio captioning of the publishers and subscribers connected to a session.

Live captioning must be enabled at the session level via the REST API.

Live captioning is only supported in routed sessions.

Publishing live captions

To enable live captions, initialize the `OTPublisher` component with the optional boolean `publishCaptions` property of the `properties` prop set to true:
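For example, a sketch assuming the opentok-react-native `OTPublisher` component (the helper function is illustrative):

```javascript
// Sketch of the properties passed to OTPublisher; publishCaptions is the
// only captions-specific field (it defaults to false).
// JSX usage (illustrative):
//   <OTPublisher properties={makePublisherProperties(true)} />
function makePublisherProperties(captionsEnabled) {
  return {
    publishAudio: true,
    publishVideo: true,
    publishCaptions: captionsEnabled,
  };
}
```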

This setting is false by default.

You can dynamically change this property (based on a React state change) to toggle captions on and off for the published stream.

Subscribing to live captions

To start receiving captions, set the `subscribeToCaptions` property of the `properties` prop of the `OTSubscriber` component:
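A sketch assuming the opentok-react-native `OTSubscriber` component, with a `captionReceived` event handler that stores the latest caption (the variable names are illustrative):

```javascript
// Sketch of OTSubscriber props: subscribeToCaptions opts in to captions,
// and the captionReceived handler delivers each caption event.
// JSX usage (illustrative):
//   <OTSubscriber properties={subscriberProperties}
//                 eventHandlers={subscriberEventHandlers} />
const subscriberProperties = {
  subscribeToCaptions: true,
};

let latestCaption = null;
const subscriberEventHandlers = {
  captionReceived: (event) => {
    latestCaption = event; // { text, isFinal }
  },
};
```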

You can set the subscribeToCaptions property to true regardless of whether the client publishing the stream is currently publishing live captions. The subscriber will start receiving captions data once the publisher begins publishing captions.

Subscribers receive captions via the captionReceived event handler.

The captionReceived event object has two properties:

  • text — The text of the caption (a string)
  • isFinal — Whether the caption text is final for a phrase (true) or partial (false).

The React Native SDK does not render the caption text itself. You can create your own UI element to render the caption text based on caption events.
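A minimal sketch of accumulating captions for display based on isFinal (the state shape and helper names are illustrative): partial captions overwrite the current line, while final captions are committed to a transcript.

```javascript
// Sketch of caption accumulation for a custom captions UI.
function createCaptionState() {
  return { current: "", transcript: [] };
}

// Apply one captionReceived event ({ text, isFinal }) to the state.
function applyCaption(state, event) {
  if (event.isFinal) {
    state.transcript.push(event.text); // commit the finished phrase
    state.current = "";                // clear the in-progress line
  } else {
    state.current = event.text;        // partial caption replaces the line
  }
  return state;
}
```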

Receiving your own live captions

The Vonage client SDK does not support a publisher receiving events for its own captions. To render captions for a stream published by the local client, create a hidden subscriber to the local publisher's stream and listen for its caption events: set the subscribeToSelf property of the OTSubscriber to true, do not render the subscriber's video (set its width and height to 0), and do not subscribe to its audio (set subscribeToAudio to false, to avoid echo).

You can add the captions to the UI as you would for other streams' captions. See Custom rendering of subscribers.
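A sketch of the hidden self-subscriber's properties (the field names follow the description above; treat exact prop support as SDK-version dependent):

```javascript
// Hidden subscriber used only to receive the local publisher's captions:
// no audio (avoids echo), zero-size video (nothing rendered).
const selfCaptionSubscriberProperties = {
  subscribeToSelf: true,     // subscribe to the local publisher's stream
  subscribeToAudio: false,   // avoid echo
  subscribeToCaptions: true, // the only thing we want from this subscriber
  width: 0,
  height: 0,                 // do not render the video
};
```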

Known issues

  • When a participant is muted for more than 15 seconds, the connection to the third-party transcription provider is closed to reduce costs. After the participant unmutes, it might take 2-5 seconds to reconnect and for captions to resume.

More information

See this Vonage API Support article for more technical specifications and FAQs.