The introduction of Audio Description (AD) marks a significant step towards making video content more accessible. AD provides a spoken narrative of important visual elements that the audio track alone does not convey. However, producing accurate AD is resource-intensive, requiring specialized expertise, equipment, and a significant time investment, so automating AD production would greatly enhance the accessibility of videos for individuals with visual impairments. A key challenge in automating AD, though, is generating sentences of the right length to fit the varying temporal gaps between actor dialogue.
Recently, large multimodal models (LMMs) have gained prominence in artificial intelligence, integrating various data types, including text, images, audio, and video, to become more reliable and capable. For example, GPT-4V extends the large language model GPT-4 with vision capabilities. Building on this, a method called MM-VID pioneered the use of GPT-4V for AD generation via a two-step process: synthesizing condensed frame captions, then refining the final AD output with GPT-4. However, these methods lack an explicit process for character recognition.
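To make this two-step idea concrete, here is a minimal sketch of a caption-then-refine loop using the OpenAI Python client. The model names, prompts, and function names are illustrative assumptions, not MM-VID's actual implementation:

```python
# Minimal caption-then-refine sketch (illustrative, not MM-VID's exact code).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_frame(path: str) -> str:
    """Base64-encode a sampled video frame for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def caption_frames(frame_paths: list[str]) -> str:
    """Step 1: condense sampled frames into a short visual caption."""
    content = [{"type": "text",
                "text": "Describe the key visual events in these movie frames in 2-3 sentences."}]
    for p in frame_paths:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(p)}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model; the paper used GPT-4V
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

def refine_to_ad(caption: str, max_words: int = 10) -> str:
    """Step 2: rewrite the condensed caption into a final AD sentence."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Rewrite this as one audio description of at most "
                              f"{max_words} words, present tense:\n{caption}"}],
    )
    return resp.choices[0].message.content
```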
A team from Microsoft introduced an automated pipeline that utilizes GPT-4V(ision) to generate accurate AD for videos. It takes a movie clip and its title information as input and exploits the multimodal capabilities of GPT-4V, integrating visual signals from video frames with textual context to produce the AD. By including AD production guidelines in the prompt, stating in simple, natural language how long the sentence should be, the method can adjust the length of the AD to fit the available speech gap and adapt to different kinds of videos.
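As a rough illustration of how such length control could be prompted, the sketch below derives a word budget from the available speech gap and embeds it in the textual instruction. The ~2.5 words-per-second narration rate is an assumed constant, not a figure from the paper:

```python
# Illustrative length-controlled prompt construction (assumptions noted inline).

WORDS_PER_SECOND = 2.5  # assumed average AD narration rate; tune to your narrator

def word_budget(gap_seconds: float) -> int:
    """Convert a dialogue gap into an approximate word budget."""
    return max(1, int(gap_seconds * WORDS_PER_SECOND))

def build_ad_prompt(movie_title: str, context_ads: list[str], gap_seconds: float) -> str:
    """Compose the textual part of a GPT-4V prompt: title, prior ADs, length guideline."""
    budget = word_budget(gap_seconds)
    context = "\n".join(f"- {ad}" for ad in context_ads)
    return (
        f"You are writing audio description for the movie '{movie_title}'.\n"
        f"Previous descriptions:\n{context}\n"
        f"Based on the attached frames, write ONE sentence of about {budget} words "
        f"describing the key visual action. Use simple, natural language."
    )

print(build_ad_prompt("Les Misérables", ["Valjean lifts the cart."], 4.0))
```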
The proposed method is evaluated on the MAD dataset, a rich collection of over 264,000 audio descriptions from 488 movies. A simplified multiple-person tracker generates person tracklets that capture every character appearing in the input movie clip. TransNetV2 is used to detect shot boundaries and split clips that contain multiple shots; after tracklet generation, square patches are extracted around each person in the frames. Face detection is then performed on these patches with a YOLOv7 model, and the detected faces are cropped and aligned to a standard size of 112 × 112 pixels.
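A simplified view of the face-patch extraction stage might look like the following. The `face_detector` callable is a hypothetical stand-in for the YOLOv7 face model, while the square-patch crop and 112 × 112 alignment follow the description above:

```python
# Sketch of face-patch extraction (the detector is a hypothetical stand-in).
import cv2
import numpy as np

FACE_SIZE = 112  # standard aligned face size used by the pipeline

def square_patch(frame: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Crop a square patch centered on a person bounding box (x, y, w, h)."""
    x, y, w, h = box
    side = max(w, h)
    cx, cy = x + w // 2, y + h // 2
    x0, y0 = max(0, cx - side // 2), max(0, cy - side // 2)
    return frame[y0:y0 + side, x0:x0 + side]

def extract_aligned_faces(frame: np.ndarray, person_boxes, face_detector) -> list[np.ndarray]:
    """Crop person patches, detect faces inside them, and resize faces to 112x112."""
    faces = []
    for box in person_boxes:
        patch = square_patch(frame, box)
        for (fx, fy, fw, fh) in face_detector(patch):  # e.g., a YOLOv7 face model
            face = patch[fy:fy + fh, fx:fx + fw]
            faces.append(cv2.resize(face, (FACE_SIZE, FACE_SIZE)))
    return faces
```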
GPT-4V was instructed to generate AD at fixed word counts of 6, 10, and 20 words, and the performance outcomes were compared. These limits reflect the AudioVault dataset statistics: 80% of ADs contain ten words or fewer, 99% contain at most 20 words, and 6 words matches the dataset's average length. Among the three settings, the 10-word prompt achieved the highest ROUGE-L and CIDEr scores. The proposed method outperforms AutoAD-II, establishing new state-of-the-art performance with CIDEr and ROUGE-L scores of 20.5 (vs. 19.5) and 13.5 (vs. 13.4), respectively.
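For reference, CIDEr and ROUGE-L in this setting are typically computed with the COCO caption toolkit; a minimal sketch, assuming the pycocoevalcap package and placeholder data, follows:

```python
# Minimal CIDEr / ROUGE-L evaluation sketch (pip install pycocoevalcap).
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Placeholder data: map each clip id to reference AD(s) and one generated AD.
references = {
    "clip_001": ["He walks slowly across the dark hallway."],
    "clip_002": ["She opens the letter with trembling hands."],
}
candidates = {
    "clip_001": ["He strides down the dim corridor."],
    "clip_002": ["She unfolds the letter, hands shaking."],
}

cider_score, _ = Cider().compute_score(references, candidates)
rouge_score, _ = Rouge().compute_score(references, candidates)

print(f"CIDEr:   {100 * cider_score:.1f}")   # paper-style scores are scaled by 100
print(f"ROUGE-L: {100 * rouge_score:.1f}")
```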
In conclusion, a team from Microsoft proposed an automated pipeline that utilizes GPT-4V(ision) to generate accurate video AD. The method outperforms prior approaches such as AutoAD-II, with CIDEr and ROUGE-L scores of 20.5 (vs. 19.5) and 13.5 (vs. 13.4), respectively. However, it lacks a mechanism for determining the most suitable moments within a film to insert AD and for estimating the corresponding word count. Future work could also improve the quality of the generated AD, e.g., by fine-tuning a lightweight language-rewriting model on available AD data to polish the LLM's output.
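Such a lightweight rewriting step could, for instance, be prototyped with a small seq2seq model from Hugging Face. The sketch below uses t5-small purely as an illustrative stand-in; a real system would first fine-tune it on ground-truth AD pairs:

```python
# Illustrative rewriting pass with a small seq2seq model (t5-small is a stand-in;
# fine-tuning on AD data is assumed to happen before deployment).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def rewrite_ad(draft: str, max_words: int = 10) -> str:
    """Rewrite a draft AD sentence into a tighter, AD-style sentence."""
    prompt = f"rewrite to at most {max_words} words: {draft}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(rewrite_ad("A man wearing a long coat walks slowly through the rain at night."))
```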
Check out the Paper. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. A tech enthusiast, he delves into the practical applications of AI, focusing on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.