Media Intelligence, blending AI with the right Human Input

You would think that AI would herald a boom for the media and entertainment business. Instead, entertainment professionals are facing incredible friction as they attempt to experiment with generative AI or AI in general.

💡 If you are new to using AI in media production, please refer to this excellent overview of AI Services for Video and Audio Post Production by Jonny Elwyn.

So why are the biggest brands in entertainment not aggressively embracing AI? And how can you turn that into an opportunity?

What is the problem with AI?

Most turnkey AI solutions available on the market have not been developed with professional purposes in mind and may show a poor signal-to-noise ratio. In layman’s terms: the number of falsely recognised images and words is too high for the technology to be used in professional production. Post-editing to correct the results sometimes takes more time than transcribing or logging the material manually in the first place.

A naïve approach to implementing AI is to use off-the-shelf algorithms for image recognition or automatic speech transcription as they come. Even the best Automatic Speech Recognition (ASR) engines available today make errors, for two reasons: natural language is constantly evolving (when new concepts and words are born, engines need to be retrained before they recognise them), and natural language is full of synonyms and homonyms. Image recognition tends to make things worse and more complicated.

More specifically, the problem with most turnkey AI solutions today is that they deliver results that are both inaccurate and incomplete. This issue is particularly painful for journalists and documentary makers, who require high precision, because the output frequently lacks the necessary detail and accuracy. A poor signal-to-noise ratio means that the desired output is obscured by a significant number of false positives, leading to an overload of irrelevant search results and poor search efficiency, if any relevant results show up in the first place. As a result, journalists and edit producers (in a documentary context) have to spend hours manually searching for the right shots.
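To make 'inaccurate and incomplete' measurable, it can help to express them as precision (how much of the output is correct) and recall (how much of what is really there was found). The snippet below is only a minimal sketch: the function name and the sample names are invented for illustration and do not come from any particular product.

```python
def precision_recall(predicted: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for predicted labels against the ground truth."""
    true_positives = len(predicted & relevant)
    # Precision: share of the reported labels that are actually correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the correct labels that were actually reported.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical clip: people reported by an engine vs. people actually on screen.
predicted = {"Anna Berg", "Tom Hale", "Lea Novak", "Sam Ortiz"}
relevant = {"Anna Berg", "Tom Hale", "Maria Lind"}

precision, recall = precision_recall(predicted, relevant)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this made-up case half of the reported names are false positives (inaccurate), and one person who is on screen is missed altogether (incomplete).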

The root cause of this problem lies in the fact that AI systems were not specifically designed with the rigorous demands of media production in mind. Consequently, they lack the robustness and fine-tuned accuracy needed for such specialised tasks. While more and better training of AI models is certainly necessary to improve performance, this is not a complete solution. AI will remain inherently prone to errors regardless of advancements in training techniques, due to the complex and unpredictable nature of real-world data. A more intelligent approach, one that includes highly specific training, contextual understanding, and error correction mechanisms, is essential to achieve the reliability and precision needed in high-stakes environments.

What is ‘Media Intelligence’?

When trying to solve the problems associated with AI, the best advice is ‘don’t try to boil the ocean’. The bigger the set of potential targets (images or words), the more likely you are to find false positives in the result set. It’s therefore better to restrict the set of potential targets (when using face recognition, animal detection by classification, language recognition, and so on). Start from the smallest possible set of targets by using any pre-existing data from production, e.g. lists of actors or people that you can extract from the screenplay, production briefings or call sheets.

💡 To manage lists of data or taxonomies consistently across your workspaces, you can maintain thesauri at account level and use them to populate metadata fields. This ensures the highest possible accuracy and, as such, improves overall search efficiency.

When it comes to image recognition, one should consider that any (!) media production process will contain lots of valuable and rich information in the form of production documents coming from planning or pre-production, logging data, or production reports. Aligning these with the images to create an accurate representation of who is in the image is cheaper and more accurate than using AI to retrofit these data.

If you implement a process where you recycle any available data from (pre-)production, map it onto your data model as one or more lists of allowed values or thesauri, and use this data set as input for speech or image recognition, the recognition rate or completeness of the result will be much higher. By subsequently weeding out items with lower confidence scores, you bring the number of false positives close to zero.
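The two filters described here can be sketched in a few lines. This is an illustration under stated assumptions, not the way any specific engine works: the `Detection` record, the `filter_detections` helper, the 0.8 threshold and the sample names are all hypothetical, and the recognition engine is treated as a black box whose output is filtered after the fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One hypothetical recognition result returned by an engine."""
    label: str
    confidence: float  # engine score between 0.0 and 1.0
    start_seconds: float


def filter_detections(detections, allowed_values, min_confidence=0.8):
    """Keep detections whose label is on the allowed list and whose score is high enough."""
    allowed = {value.casefold() for value in allowed_values}
    return [
        d for d in detections
        if d.label.casefold() in allowed and d.confidence >= min_confidence
    ]


# Allowed values recycled from pre-production, e.g. names taken from a call sheet.
call_sheet_names = ["Anna Berg", "Tom Hale"]

raw_detections = [
    Detection("Anna Berg", 0.93, 12.0),
    Detection("Tom Hale", 0.41, 15.5),   # on the list, but low confidence
    Detection("Lea Novak", 0.88, 20.0),  # confident, but not on the call sheet
]

kept = filter_detections(raw_detections, call_sheet_names)
print([d.label for d in kept])
```

The threshold is a trade-off to tune per production: raising it removes more false positives at the cost of dropping some correct detections.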

As a last step, we propose reconciling the results on a single timeline and setting up multi-modal indexing. Rather than just indexing a selection of words, make sure to distinguish speakers, faces, places, subject references, etc. and organise these on a single timeline lined up according to scene changes or shot cuts. The result is a search engine that supports the editorial decision-making process with unprecedented efficiency.
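As a rough sketch of what such a multi-modal index could look like, the example below assigns time-stamped annotations from different recognisers to shots and then searches across modalities. The data layout, the function names and the sample values are assumptions made for illustration only.

```python
from bisect import bisect_right
from collections import defaultdict


def build_shot_index(shot_starts, annotations):
    """Group (time, modality, value) annotations by the shot they fall into.

    shot_starts is a sorted list of shot start times in seconds.
    """
    index = defaultdict(lambda: defaultdict(set))
    for time, modality, value in annotations:
        shot = bisect_right(shot_starts, time) - 1
        if shot < 0:
            continue  # annotation lies before the first shot
        index[shot][modality].add(value)
    return index


def search(index, **criteria):
    """Return the shots that match every modality=value criterion."""
    return sorted(
        shot for shot, modalities in index.items()
        if all(value in modalities.get(modality, set())
               for modality, value in criteria.items())
    )


shot_starts = [0.0, 4.2, 9.8]  # hypothetical shot cuts, in seconds
annotations = [
    (1.0, "speaker", "Anna Berg"),
    (1.5, "face", "Anna Berg"),
    (5.0, "face", "Tom Hale"),
    (6.1, "place", "Harbour"),
    (10.3, "face", "Anna Berg"),
    (11.0, "place", "Harbour"),
]

index = build_shot_index(shot_starts, annotations)
print(search(index, face="Anna Berg"))                   # shots showing Anna Berg
print(search(index, face="Anna Berg", place="Harbour"))  # ...and set at the harbour
```

Because every modality shares the same shot numbering, a query such as 'this person at that location' becomes a simple intersection rather than a manual scrub through the material.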

How does it work practically?

Every media production environment contains plenty of freely accessible information that can be used to improve the quality of an AI system’s output. Think of production documents (scripts or screenplays, call sheets, production briefings, production reports, etc.) that contain key names and phrases. In scripted entertainment, including film, TV series and continuing drama, the scripts, by definition, contain all important references. In an appropriate implementation of AI, this pre-existing information is used as input for the AI system, and it will preferentially look for said key names and phrases. Moreover, it is key to consider the user’s intentions, and to use this information to filter and reconcile the output of the AI system. As a result, the output can be tuned to be 100% complete and accurate at the same time.
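Many speech engines accept such lists of key names directly as a custom vocabulary. Where that is not available, a similar effect can be approximated after transcription. The sketch below does so with fuzzy matching from Python's standard `difflib` module, which is a swapped-in technique for illustration and not how an engine weighs vocabulary internally. The `correct_names` helper, the cutoff of 0.8 and the sample transcript are assumptions.

```python
from difflib import get_close_matches


def correct_names(transcript: str, names: list[str], cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a known name with its canonical spelling.

    Only single-word names are handled, and an ordinary word that happens to
    resemble a name would be replaced as well, so the cutoff needs care.
    """
    canonical = {name.casefold(): name for name in names}
    corrected = []
    for word in transcript.split():
        match = get_close_matches(word.casefold(), list(canonical), n=1, cutoff=cutoff)
        corrected.append(canonical[match[0]] if match else word)
    return " ".join(corrected)


# Names taken from a production document, e.g. a team sheet (hypothetical input).
names = ["Haaland", "Foden", "Rodri"]
result = correct_names("halland passes to fodden", names)
print(result)  # Haaland passes to Foden
```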

This approach lets you set up an iterative method to systematically increase the relevance and usability of your results. Here is where the concept of thesauri plays a pivotal role. By fine-tuning your naming conventions (in your capacity as archivists, edit producers, subtitlers, or similar) you are creating a mind map or an ontology, which is expressed in its simplest form as a taxonomy or a thesaurus. This in turn is used to filter the output produced by AI, to drive the search engine (e.g. using labels to define an advanced search query), to enable navigation of content in the user interface, and/or to group content in your library.
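In its simplest form, a thesaurus maps each preferred term to the variants people (or engines) actually use. The sketch below shows how such a mapping could normalise raw AI labels before they reach the search index. The thesaurus content and the helper names are invented for illustration.

```python
# Hypothetical thesaurus: preferred term -> accepted variants.
THESAURUS = {
    "Football": ["soccer", "footie"],
    "Press conference": ["presser", "media briefing"],
}


def build_lookup(thesaurus):
    """Map every preferred term and variant (case-insensitively) to the preferred term."""
    lookup = {}
    for preferred, variants in thesaurus.items():
        lookup[preferred.casefold()] = preferred
        for variant in variants:
            lookup[variant.casefold()] = preferred
    return lookup


def normalise(labels, lookup):
    """Translate raw labels to preferred terms, dropping unknown labels and duplicates."""
    result = []
    for label in labels:
        preferred = lookup.get(label.strip().casefold())
        if preferred is not None and preferred not in result:
            result.append(preferred)
    return result


lookup = build_lookup(THESAURUS)
labels = normalise(["Soccer", "presser", "weather", "football"], lookup)
print(labels)  # ['Football', 'Press conference']
```

Each time an editor adds a variant or a new preferred term, the same raw AI output yields better labels, which is what makes the method iterative.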

What are the benefits?

Appropriate use of AI for Media Production (or ‘Media Intelligence’) has two key advantages.

It enables producers of original content (e.g. journalists, edit producers) to process more material faster and at the same cost, as opposed to conventional methods, where costs escalate as soon as you try to process very large volumes of raw material.
Process automation helps broadcasters or distributors create multiple versions at marginal cost, i.e. Media Intelligence takes over the role of the edit assistant in reproducing several editorial instances of the same master.

How will this impact knowledge workers?

Rather than rendering their roles obsolete, appropriate use of AI actually gives knowledge workers a palpable competitive edge. This assumes a trustworthy implementation, reliable output, and overall usability of the results.

As an example, several newsdesks all over the world rely on Limecraft to automatically create subtitles for news items. These are usually short-form pieces of content, and a typical newsdesk processes up to 10,000 items per month. From a logistical point of view, it is simply not possible to rely on a subtitling company for that kind of volume. Now, as factual content is relatively easy to use as input for AI captioning with almost no post-editing required, these newsdesks are freeing up 20 to 30 full-time employees who can now create more content rather than spending their time manually creating subtitles.

Real world examples

Arrow International Media have integrated GrayMeta Curio (acquired by Wasabi in January 2024) for image recognition and Speechmatics in their Limecraft production workspace, which enables them to ingest and index over 2500 hours of raw material per month.
IMG, part of the Endeavor network, are using custom dictionaries to ensure correct spelling of football players when using AI subtitling.
The Associated Press automated editorial shot lists by fine-tuning image recognition and combining it with the right speech-to-text technologies.
SVT, the Swedish public service broadcaster, relies on Limecraft for automatically subtitling 7500 items of short form content per month.