Getting started with AI transcription

Compared to last year, Limecraft customers are processing 448% more material using AI transcription services. When asked for feedback, journalists and archive producers who have successfully adopted AI transcription at the core of their workspace say they never go back to manual transcription. AI transcription is faster, it is accurate enough for professional use once post-edited, it drastically improves the visibility of your content, and it thus eases collaboration within global video production teams.

In this post, we will discuss the main differences between conventional and AI transcription, explain why it is important to optimise for high-quality results, break down the main use cases, and describe in great detail how to successfully deploy AI transcription as part of your workspace.

Not familiar with Limecraft yet? Feel free to sign up for a free trial, or request a personalised demo.

What is AI transcription?

AI transcription is a method of automatically turning spoken words into timed text using artificial intelligence that has been trained to do so. Training involves learning language models by processing millions of web pages, and learning acoustic models by listening to labelled data sets of real human speech.

Good AI transcription services include accurate speaker segmentation (sometimes referred to as “diarisation”) and proper punctuation, including exclamation and question marks. For professional use, it is important that the results of automated transcription are editable in an interface that highlights confidence scores and allows the user to modify words, speakers or timing. This is called the “Expert in the loop” workflow.

More info on editing transcripts on the knowledge base.

AI transcription and the “Expert in the loop” interface

The key differences between manual and AI Transcription

AI transcription takes a fraction of the time needed to create a transcript manually. AI transcription as offered by Limecraft runs 4 times faster than real time, whereas manual transcription typically takes 6 to 8 times the length of the clip. That boils down to a difference in turnaround time of at least a factor of 24.

The quality of automated transcription is often expressed as the word error rate (WER): the share of words that the engine substitutes, deletes or inserts compared to a reference transcript. An automatic speech recognition (ASR) engine is only as smart as its training. Depending on the audio quality and the type of ASR engine, expect a WER of 2%-20%, compared to a manual transcript, which is assumed to be completely accurate. While a small WER might be acceptable for indexing purposes, it may be prohibitive when the results are intended for publication. Hence it is sometimes better to optimise for accuracy rather than for the lowest price per hour, and you may want to look for a complete solution that includes a proper editor for reviewing and correcting the automatically created transcript.
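To make the WER figures concrete, the sketch below computes a word error rate as the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript. This is the textbook definition; the function name `word_error_rate` and the simple lowercase-and-split normalisation are assumptions for illustration, not the way Limecraft or any specific ASR vendor scores its engines.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumped over the lazy dog today"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

One wrong word out of ten gives a WER of 10%, which sits in the middle of the 2%-20% range mentioned above. Production scoring tools usually also normalise punctuation and numbers before comparing.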

Limecraft offers customised ASR solutions with a WER that is significantly lower than that of the standard solutions offered by Microsoft Video Indexer, Google Speech or Amazon Transcribe. Given a WER of 1%-2%, the post-editing time of an automated transcript is roughly twice the length of the clip, which is still 3 to 4 times faster than conventional transcription. So while it may seem that post-editing requires extra work, combining ASR and manual post-editing (the “expert in the loop” workflow) still saves you 60%-80% of the time compared to conventional transcription.
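The turnaround figures quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the text (4 times faster than real time, post-editing at roughly twice the clip length, manual transcription at 6 to 8 times the clip length); the function names are made up for illustration.

```python
ASR_SPEEDUP = 4          # AI transcription runs 4x faster than real time
POST_EDIT_FACTOR = 2     # post-editing takes roughly 2x the clip length
MANUAL_FACTORS = (6, 8)  # manual transcription takes 6x to 8x the clip length


def machine_minutes(clip_minutes: float) -> float:
    """Time the ASR engine needs for a clip."""
    return clip_minutes / ASR_SPEEDUP


def expert_in_the_loop_minutes(clip_minutes: float) -> float:
    """Machine transcription followed by manual post-editing."""
    return machine_minutes(clip_minutes) + clip_minutes * POST_EDIT_FACTOR


def manual_minutes(clip_minutes: float, factor: float) -> float:
    """Fully manual transcription."""
    return clip_minutes * factor


clip = 60.0
for factor in MANUAL_FACTORS:
    manual = manual_minutes(clip, factor)
    speedup = manual / machine_minutes(clip)
    saving = 1 - expert_in_the_loop_minutes(clip) / manual
    print(f"manual at {factor}x: machine-only is {speedup:.0f}x faster, "
          f"expert in the loop saves {saving:.1%}")
```

For a 60-minute clip, the machine-only step comes out 24 to 32 times faster than manual transcription, and the complete workflow including post-editing saves roughly 62% to 72% of the time.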

But the most important characteristic of AI transcription is that each word has a timecode. For you as a video professional, accurate timecodes are your lifeblood. Time-coded text is essential for efficiently searching and retrieving the right content fragments, for using the transcript to create sync pulls, and for automatic subtitling.
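As an illustration of why word-level timecodes matter, the sketch below stores each word with its start and end time and looks up a phrase to get in and out points, which is the essence of a sync pull. The `Word` structure and the `find_phrase` helper are hypothetical and do not reflect Limecraft's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the start of the clip
    end: float


def find_phrase(words, phrase):
    """Return an (in, out) pair in seconds for every occurrence of the phrase."""
    target = phrase.lower().split()
    if not target:
        return []
    # Compare on lowercase tokens with surrounding punctuation stripped.
    tokens = [w.text.lower().strip(".,!?") for w in words]
    hits = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append((words[i].start, words[i + len(target) - 1].end))
    return hits


transcript = [
    Word("We", 12.0, 12.2),
    Word("signed", 12.2, 12.6),
    Word("the", 12.6, 12.7),
    Word("treaty", 12.7, 13.3),
    Word("yesterday.", 13.3, 14.0),
]
print(find_phrase(transcript, "the treaty"))
```

Because every word carries its own timing, the match translates directly into markers on a timeline without anyone having to scrub through the footage.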

💡 AI transcription is not perfect. There may be errors, and it’s important to proofread the transcript before using it. To do this, Limecraft allows you to edit each word and label speakers as appropriate.

💡 Using custom dictionaries, you can further reduce the WER by 50%

Why good audio transcripts are vital for you as a Video Producer

After a video shoot wraps up and the rushes are offloaded into storage, it’s common to then enrich and organise the media so that it is easy to find and to use in post-production (“sync pulls”).

The problem is that, unlike text, video pixels and audio waveforms are not self-descriptive. To allow searching for content, someone or something needs to index it. Indexing is the process of identifying individual shots, and describing the content of the audio and video in words whereby those descriptions are linked to the timecode. When you then search for a particular shot or fragment using words, the search engine will point you directly to the right shots.

Properly indexed content allows the search engine to point you directly to the right shots and fragments
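A minimal way to picture such an index is an inverted index that maps each spoken word to the clips and timecodes where it occurs. The sketch below is an assumption about the general technique, with made-up clip names and a hypothetical `build_index` helper; it does not describe how the Limecraft search engine is built.

```python
from collections import defaultdict


def build_index(clips):
    """Map each lowercase word to a list of (clip_id, start_seconds) positions.

    `clips` is a dict of clip_id -> list of (word, start_seconds) pairs.
    """
    index = defaultdict(list)
    for clip_id, words in clips.items():
        for word, start in words:
            index[word.lower()].append((clip_id, start))
    return index


clips = {
    "interview_01": [("The", 3.0), ("election", 3.2), ("results", 3.8)],
    "street_vox_02": [("Election", 41.5), ("day", 42.1)],
}
index = build_index(clips)
print(index["election"])
```

Searching for "election" returns both clips together with the position of the word, so the player can jump straight to the right fragment.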

Arguably the most important aspect of enrichment is audio transcription, the process of turning spoken audio into timed text. Research conducted by VRT medialab, presented back in 2008 at the FIAT/IFTA conference, showed that 80% of the tags applied by archivists to index content, and 80% of the words used by journalists to search in archives, refer to the audio. Surprisingly, and especially for news and documentary production, the majority of the meaning is in the audio rather than in the pixels. This is why, if you are considering automated processing of audiovisual material, you should prioritise AI transcription over other services.

Now, while audio transcription is most visible in the form of closed captions or subtitles, transcripts serve a raft of other purposes, and this is why AI transcription gives you a competitive edge.

The Use Cases for AI Transcription in Media

1. Improve Collaboration and allow repurposing of Content

With every spoken word time-stamped, content is indexed and easily searchable without further manual logging. By doing so, archive stock can be easily repurposed, and original footage can be easily spotted, shared and exchanged among colleagues regardless of their physical location.

2. Accelerate the Editing Process

By splitting the transcript per shot and exporting the parts as labels or markers, editors can efficiently spot the exact shot where a word or a phrase is used. Cutting an interview or creating highlight reels becomes much more efficient.

A good transcript editor serves yet another creative purpose. Highlighting a section of the transcript in the Limecraft UI automatically creates in/out markers, and allows for time-based comments. When exporting the edit project to Avid Media Composer or Adobe Premiere Pro, time-based comments will also appear as markers, keeping a consistent view of all sorts of data all the way down to post-production.

When exporting a collection to Adobe Premiere as an edit project, the transcript fragments are displayed on the timeline

3. Translating audio into your Language

Transcribing video content with audio in another language can be tedious. However, by using AI transcription services, you can easily and quickly translate your video content into multiple languages for international audiences. Limecraft currently supports transcription and translation in over 140 languages.

Limecraft allows you to translate transcripts of audio in foreign languages into your language of choice

4. Accessibility

You can take advantage of the transcription to automatically create subtitles or closed captions for your content. Starting from a post-edited transcript, using Natural Language Processing and taking into account your specific timing or spotting rules as well as the scene changes, it is perfectly possible to create subtitles automatically and according to broadcast standards.

Using Limecraft, subtitles are automatically spotted using one or more configurable style guides or sets of timing rules, customisable according to the specific requirements of the distribution platform. By doing so, your material is accessible not only to the deaf and the hard of hearing, but also to viewers on social or other media where the audio is muted by default.
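To give a feel for what a timing rule does, here is a deliberately simplified spotting sketch that groups timed words into subtitle cues, starting a new cue when the text would exceed a maximum length or the cue would last too long. The limits of 42 characters and 6 seconds are illustrative assumptions, and the function names are hypothetical; real spotting as described above also takes scene changes and linguistic structure into account.

```python
def make_cue(group):
    """Turn a group of (text, start, end) words into a (start, end, text) cue."""
    return (group[0][1], group[-1][2], " ".join(text for text, _, _ in group))


def spot_subtitles(words, max_chars=42, max_duration=6.0):
    """Greedily group timed words into cues that respect both limits."""
    cues, current = [], []
    for text, start, end in words:
        if current:
            current_len = len(" ".join(w[0] for w in current))
            too_long = current_len + 1 + len(text) > max_chars
            too_slow = end - current[0][1] > max_duration
            if too_long or too_slow:
                cues.append(make_cue(current))
                current = []
        current.append((text, start, end))
    if current:
        cues.append(make_cue(current))
    return cues


words = [
    ("Good", 0.0, 0.3), ("evening", 0.3, 0.8), ("and", 0.9, 1.0),
    ("welcome", 1.0, 1.5), ("to", 1.5, 1.6), ("the", 1.6, 1.7),
    ("news", 1.7, 2.2),
]
for start, end, text in spot_subtitles(words, max_chars=20):
    print(f"{start:.1f} -> {end:.1f}  {text}")
```

Changing `max_chars` or `max_duration` mimics switching between style guides for different distribution platforms.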

5. Search Engine Optimisation (SEO)

AI transcription can help you improve your media SEO by generating transcripts of your media content. This will help your videos or audio content, such as podcasts, rank higher in search engines.

The keys to successfully integrating AI transcription as part of your workspace

Artificial Intelligence needs to be handled with great care. AI transcription in particular can be prone to errors when processing colloquial language. When it is not correctly implemented, these errors will surface during production or, worse, during playout, and lead to frustration and a lack of trust. When it is properly implemented and wrapped in a good user interface that helps producers with post-editing and quality control, it will be a huge time saver.

Here are some tips and tricks to ensure a successful implementation.

Choose your ASR engine carefully. For professional use, optimise for accuracy and not for cost. If cost is an issue, make sure to consider the total cost (including the post-editing time), and not just the cost of the machine transcription. A word error rate that is only marginally higher may have a prohibitive impact on the post-editing time.
AI transcription is more than audio in and text out. In a professional context, you need an “expert in the loop”, so look for a solution that offers a proper editor to modify words, punctuation and speaker changes. You don’t want to do the post-editing in a text document, as you will lose the timecodes.
As you will probably want to optimise for as low a word error rate as possible, look for a solution that supports custom dictionaries. It will reduce the WER by 50% and allow post-editing in real time.
If you work on content in other languages, look for a solution that embeds machine translation. Similarly, in case you plan to use the transcripts for subtitling, look for a solution that has built-in capabilities to create broadcast-grade subtitles. Avoid copying and pasting between different applications, as it will break the timecodes.
Transcription always serves a purpose, which can be either indexing, pre-cutting, subtitling or any combination thereof. To integrate AI transcription as part of your workspace, look for APIs and methods that allow you to seamlessly transfer the results to Adobe, Avid and/or subtitle editors like OOONA without having to copy and paste between applications.
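The first tip, looking at total cost rather than the machine price alone, can be made tangible with a small calculation. In the sketch below, the $0.25 per minute and the post-editing factor of 2 come from this article; the cheaper engine at $0.10 per minute with a post-editing factor of 4 and the editor rate of $40 per hour are purely hypothetical values chosen to show the effect.

```python
def total_cost_per_clip_hour(asr_price_per_minute, post_edit_factor,
                             editor_rate_per_hour):
    """Cost of transcribing and post-editing one hour of material."""
    machine = asr_price_per_minute * 60              # machine transcription
    human = post_edit_factor * editor_rate_per_hour  # post-editing labour
    return machine + human


EDITOR_RATE = 40.0  # hypothetical hourly rate of the person post-editing

# Accurate engine: higher machine price, post-editing at 2x the clip length.
accurate = total_cost_per_clip_hour(0.25, 2, EDITOR_RATE)
# Hypothetical cheaper engine: lower price, but twice the post-editing effort.
cheap = total_cost_per_clip_hour(0.10, 4, EDITOR_RATE)

print(f"accurate engine: ${accurate:.2f} per hour of material")
print(f"cheaper engine:  ${cheap:.2f} per hour of material")
```

Under these assumptions the machine price is a small part of the bill, and the engine with the lower price per minute ends up costing more once the post-editing time is included.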

💡 Limecraft is integrated out of the box with all major software vendors for video subtitle editing, allowing you to transfer transcript data along with the video in a single click.

How much does it cost?

There is no one-size-fits-all answer to this question. The cost of speech-to-text AI transcription will vary depending on a number of factors, including the service provider, the length and complexity of the video, the quality of the audio, and the desired accuracy of the end result.

As a starting point, expect around $0.25 per minute of video for AI transcription and subtitling, including the use of the editor and the creation of subtitles. If you prefer to use third-party AI transcription services other than those provided out of the box by Limecraft, you can always connect your service of choice using custom workflows.

Ready to give it a try? Feel free to sign up for a free trial, or to request a personalised demo by leaving your contact details.

How to get started using AI Transcription

Before you start

Make sure that you have a registered account, that you have dragged and dropped one or more clips into the production, and that your account is provisioned with sufficient credits. A free trial account includes 15 minutes of transcription.

More info on the knowledge base.

From the Library view

Hover over the thumbnail and select “Go to transcript”. Select the primary language from the drop-down and start the transcription.

More info on using transcription via the UI on the knowledge base.

Alternatively, you can transcribe content systematically as part of the ingest process

Using the API, or using ‘limecraft-tools’ to access the API on your behalf, you can easily automate the transcription process as a step in the ingest process. Please leave a message on our contact page in case you are interested in integrating transcription services as part of your workspace.