Post-call Transcriptions

Public Beta

The Public Beta program is available for all partners to preview and evaluate the feature and provide feedback on the implementation. Please be aware that it may be necessary to modify code written during the Public Beta phase after the product is made generally available.

Pricing

Transcription is charged at $0.04510 (USD) or €0.04100 (EUR) per minute.

Feature overview

Post-call transcription can help with improved record keeping, improved customer service, increased productivity, and better data analysis. Vonage Video API servers generate post-call transcriptions using artificial intelligence and other state-of-the-art technology.

You enable transcriptions when you start an archive using the REST API.

After the archive recording completes, the transcription will be available as a JSON file.

Enabling transcription when starting an archive

When you use the Vonage Video REST API to start an archive, set the hasAudio and hasTranscription properties to true in the JSON data you send to the start archive REST method:

application_id="12345abc" json_web_token="jwt_string" # replace with a JSON web token data='{ "sessionId": "1_MX40NzY0MDA1MX5-fn4", "hasAudio": true, "hasVideo": true, "hasTranscription": true, "name": "archive_test", "outputMode": "individual" }' curl \ -i \ -H "Content-Type:application/json" \ -X POST \ -H "X-OPENTOK-AUTH:$json_web_token" \ -d "$data" \ https://video.api.vonage.com/v2/project/$application_id/archive

Set outputMode (in the POST data) to "individual". Transcriptions are available for individual stream archives only.

Set the value for application_id to your Application ID. Set the value for json_web_token to a JSON web token (see the REST API Authentication documentation).

For other archive options, see the documentation for the start archive REST method.

The response for a call to the start archive REST method will include hasTranscription and transcription properties in addition to the other documented properties of the response:
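As a minimal sketch, the response might resemble the following. The values are illustrative and the response is trimmed to the relevant fields; the remaining properties are as documented for the start archive method:

{
  "id": "b40ef09b-3811-4726-b508-e41a0f96c68f",
  "status": "started",
  "name": "archive_test",
  "sessionId": "1_MX40NzY0MDA1MX5-fn4",
  "hasAudio": true,
  "hasVideo": true,
  "outputMode": "individual",
  "hasTranscription": true,
  "transcription": {
    "status": "requested"
  }
}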

See Getting transcription status for information on dynamically getting the transcription details.

In an automatically archived session, the transcription won't be started automatically. You should start a second archive, using the multiArchiveTag option, for the transcription (see Simultaneous archives).
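For example, the second start archive request might use a request body along these lines (a minimal sketch; the multiArchiveTag value is illustrative):

{
  "sessionId": "1_MX40NzY0MDA1MX5-fn4",
  "hasAudio": true,
  "hasTranscription": true,
  "outputMode": "individual",
  "multiArchiveTag": "transcription-archive"
}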

Support for transcriptions is currently available with the Vonage Video REST API. It is not supported in the Vonage Video server SDKs.

Getting transcription status

The response for the REST methods for listing archives and retrieving archive information will include hasTranscription and transcription properties:
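For example, once a transcription is ready for download, the transcription-related portion of the archive information might look similar to this sketch (the URL and values are illustrative):

{
  "hasTranscription": true,
  "transcription": {
    "status": "available",
    "url": "https://example.com/transcriptions/b40ef09b-3811-4726-b508-e41a0f96c68f/transcription.zip"
  }
}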

The hasTranscription property is a Boolean, indicating whether transcription is enabled for the archive.

The transcription property is an object with the following properties:

  • status (String) — The status of the transcription, which can be set to one of these:

    • "requested" — The hasTranscription property was set to true during the start archive call, but transcription has not started.
    • "failed" — The transcription failed. Check the reason property for more information.
    • "started" — The transcription is in progress.
    • "available" — The transcription is available for download from Vonage. Check the url property.
    • "uploaded" — The transcription is available for download from the S3 bucket or Azure container you specified in your Video API account. Look for a transcription.zip in the archive ID folder in your archive storage target. See Archive storage.
  • url (String) — The URL for downloading the transcription, if the status is set to "available".

  • reason (String) — The reason for transcription failure, if the status is set to "failed".

You can also set an archive status callback for your Video API account. See Archive status changes. The callback data will also include hasTranscription and transcription properties.
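As a rough sketch, assuming the callback body mirrors the archive object returned by the REST methods above, the transcription-related portion of the callback data might look like this (all values illustrative):

{
  "id": "b40ef09b-3811-4726-b508-e41a0f96c68f",
  "status": "uploaded",
  "hasTranscription": true,
  "transcription": {
    "status": "uploaded"
  }
}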

Transcription format

The transcription is provided as a compressed ZIP file. The uncompressed file is a text file with JSON data.

The transcription includes individual segments of text. Each segment corresponds to an individual audio channel (from one of the audio streams in the session).

The JSON has the following top-level properties:

  • job_id — A unique ID for the transcription.

  • timestamp — An ISO 8601 date string for when the transcription file was created.

  • number_of_channels — The number of individual audio channels in the archive included in the transcription.

  • reliability — An object with one property: score. The score is a number indicating the estimated overall reliability of the transcription (from 0 to 1.0).

  • confidence — An object with two properties: overall and channels. The overall property is the estimated confidence of the entire transcription (from 0 to 1.0). The channels property is an array listing the estimated confidence of each channel in the transcription.

  • channels_metadata — An array of objects defining each audio channel. Each object has an id property, which is the video stream ID. You can add identifying connection data when you create a client token for each user. You can use session monitoring callbacks to get the stream IDs and the connection data for each stream's connection. You can then use these to identify the stream's user in the transcription.

  • segments — An object containing individual segments in the transcript. Each segment object has the following properties:

    • text — The transcribed text of the segment.

    • formatted — The formatted text (with punctuation) of the segment.

    • confidence — A number, from 0 to 1.0, representing the estimated confidence of the segment's transcription.

    • channel — The integer identifying the audio channel for the segment.

  • raw_data — An array of objects, one for each word in the transcription segment. Each object includes the following properties:

    • word — The word.

    • confidence — A number, from 0 to 1.0, representing the estimated confidence of the transcribed word.

    • start_ms — The offset of the start of the word from the start of the transcription, in milliseconds.

    • end_ms — The offset of the end of the word from the start of the transcription, in milliseconds.

The following is an example of the output of a transcription JSON file.
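This is a minimal sketch with illustrative values. The field names follow the descriptions above; raw_data is shown nested within each segment here, matching its per-word timing description, and a real file will contain many more segments:

{
  "job_id": "0b98ae0a-1f1e-4c39-95da-6a9ec2b7a1f0",
  "timestamp": "2024-01-15T10:32:05Z",
  "number_of_channels": 2,
  "reliability": { "score": 0.91 },
  "confidence": {
    "overall": 0.89,
    "channels": [0.92, 0.86]
  },
  "channels_metadata": [
    { "id": "f3f2a6b1-5a31-4bcd-90d1-3a5c1e7b9d10" },
    { "id": "c7d8e9f0-2b42-4cde-81e2-6b4d2f8c0a21" }
  ],
  "segments": [
    {
      "text": "hello everyone thanks for joining",
      "formatted": "Hello everyone, thanks for joining.",
      "confidence": 0.93,
      "channel": 0,
      "raw_data": [
        { "word": "hello", "confidence": 0.95, "start_ms": 120, "end_ms": 480 },
        { "word": "everyone", "confidence": 0.94, "start_ms": 500, "end_ms": 980 }
      ]
    }
  ]
}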

Limitations/Known Issues

  • Transcriptions are only available for individual stream archives, not for composed archives.

  • Transcriptions are not compatible with encrypted archives.

  • This feature is currently supported with the Vonage Video REST API, not with the Vonage Video server SDKs.

  • Only English language transcriptions are supported.

  • The maximum length of a transcription is 120 minutes.

  • Post-call transcription is not fully compliant with all Regional Media Zones (see below).

Regional Media Zone | Available during Beta | Available when in GA
USA                 | Yes                   | Yes
EU                  | Yes                   | Yes
Canada              | No                    | Based on requirement
Germany             | No                    | Based on requirement
Australia           | No                    | Based on requirement
Japan               | No                    | Based on requirement
South Korea         | No                    | Based on requirement
Singapore           | No                    | Based on requirement

Frequently Asked Questions

  • How many streams can be analyzed from a single session?
    • Up to 50 streams with a maximum of 120 transcribed minutes.
  • Does Post-call Transcription work with both Routed and Relayed Sessions?
    • The Post-Call Transcriptions feature is intended for Routed sessions that use the Vonage Media servers.
  • If the transcription upload to the customer-configured S3 bucket fails, does the retry or fallback mechanism work similarly to the archive upload?
    • Yes, the retry mechanism for post-call transcriptions operates exactly the same as for regular archive uploads.
  • In cases where the transcription falls back and is uploaded to the Vonage cloud, will the customer need to use an HTTP GET request to obtain the download link for the transcription?
    • When the transcription status changes, the customer should receive a callback that includes the download URL. If no callback is registered, the download link can only be retrieved through an HTTP GET request.
  • Once the transcription download link is received, it allows for a direct download. Are there any plans to introduce authentication for downloading the transcription?
    • There are no plans to introduce authentication for the link. The download link has a short expiration window. If not accessed within that timeframe, a new request must be made to obtain a fresh link.
  • Even though multiple users are joined in the session, the transcription file is a single JSON file. How do we differentiate between the users?
    • Each transcription entry in the file is associated with a specific channel number, assigned to each stream. The file also includes channels_metadata, which provides stream ID information corresponding to each channel ID.