In recent years, multimodal large language models (MLLMs) have revolutionized vision-language tasks, enhancing capabilities such as image captioning and object detection. However, even state-of-the-art models face significant challenges when a task involves multiple text-rich images. Understanding and reasoning over such images is crucial for real-world applications like processing presentation slides, scanned documents, and webpage snapshots. Existing MLLMs, such as LLaVAR and mPlug-DocOwl-1.5, often fall short on these tasks for two main reasons: a lack of high-quality instruction-tuning datasets built specifically for multi-image scenarios, and the difficulty of balancing image resolution against visual sequence length. Addressing both problems is vital to advancing use cases where text-rich content plays a central role.
Researchers from the University of Notre Dame, Tencent AI Seattle Lab, and the University of Illinois Urbana-Champaign (UIUC) have introduced Leopard, an MLLM designed specifically for vision-language tasks involving multiple text-rich images. Leopard aims to fill the gap left by current models by improving performance in scenarios where understanding the relationships and logical flow across multiple images is key. To that end, the team curated a dataset of about one million high-quality multimodal instruction-tuning instances tailored to text-rich, multi-image scenarios. The dataset covers domains such as multi-page documents, tables and charts, and web snapshots, helping Leopard handle complex visual relationships that span multiple images. Additionally, Leopard incorporates an adaptive high-resolution multi-image encoding module, which dynamically allocates visual sequence length based on the original aspect ratios and resolutions of the input images.
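To make the idea of allocating a shared visual budget across several images concrete, here is a minimal sketch. It is not Leopard's actual implementation: the function name `allocate_tiles`, the tile size, the tile budget, and the shrink-to-fit rule are all assumptions chosen for illustration. Each image first gets a grid of tiles that follows its own resolution and aspect ratio; if the combined grid exceeds the budget, every grid is scaled down by the same factor.

```python
import math

def allocate_tiles(sizes, tile=512, budget=12):
    """Hypothetical sketch: give each image a (rows, cols) tile grid.

    sizes  -- list of (width, height) in pixels
    tile   -- side length of one square tile (illustrative value)
    budget -- maximum total number of tiles across all images (illustrative)
    """
    # Start from the grid each image would get at its native resolution.
    grids = [(max(1, math.ceil(h / tile)), max(1, math.ceil(w / tile)))
             for (w, h) in sizes]
    total = sum(r * c for r, c in grids)
    if total <= budget:
        return grids

    # Scale every grid by the same linear factor so aspect ratios are
    # roughly preserved while the total area shrinks toward the budget.
    scale = math.sqrt(budget / total)
    grids = [(max(1, math.floor(r * scale)), max(1, math.floor(c * scale)))
             for r, c in grids]

    # Clamping to at least one tile can overshoot the budget, so trim the
    # largest grid along its longer side until the budget is met (or every
    # image is already down to a single tile).
    while sum(r * c for r, c in grids) > budget:
        i = max(range(len(grids)), key=lambda k: grids[k][0] * grids[k][1])
        r, c = grids[i]
        if r * c == 1:
            break
        grids[i] = (r - 1, c) if r >= c else (r, c - 1)
    return grids

if __name__ == "__main__":
    # A wide slide, a tall scanned page, and a small screenshot.
    pages = [(1920, 1080), (800, 2400), (640, 480)]
    print(allocate_tiles(pages, tile=512, budget=12))
```

In this sketch a wide slide keeps a wide grid and a tall page keeps a tall one, so the available tokens go where each image actually has content instead of every image being forced into the same square shape.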
Leopard introduces several advancements that set it apart from other MLLMs. One of its most noteworthy features is the adaptive high-resolution multi-image encoding module. This module lets Leopard retain high-resolution detail while keeping sequence lengths manageable, avoiding the information loss that occurs when visual features are compressed too aggressively. Instead of uniformly reducing resolution to fit model constraints, the adaptive encoding optimizes each image's share of the sequence budget, preserving crucial details even when many images are processed together. This allows Leopard to handle text-rich images, such as pages of scientific reports, without losing accuracy to poor image resolution. Leopard also applies pixel shuffling, which compresses long visual feature sequences into shorter ones without discarding information, improving its ability to handle complex visual input while retaining visual detail.
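The pixel-shuffling step can be illustrated with a small, dependency-free sketch. This is an assumption-laden toy rather than Leopard's code: the function names `merge_blocks` and `split_blocks` and the block size `r=2` are hypothetical, and a real model would do this on tensors. The point it shows is that folding each r x r block of neighbouring features into one wider feature shortens the sequence by a factor of r*r while remaining fully reversible.

```python
def merge_blocks(grid, r=2):
    """Fold each r x r block of feature vectors into one longer vector.

    grid -- H x W nested list; each cell is a list of C channel values.
    Returns a flat list of (H/r * W/r) tokens, each of length r*r*C.
    """
    h, w = len(grid), len(grid[0])
    if h % r or w % r:
        raise ValueError("grid height and width must be divisible by r")
    tokens = []
    for i in range(0, h, r):
        for j in range(0, w, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            tokens.append(merged)
    return tokens

def split_blocks(tokens, h, w, r=2):
    """Inverse of merge_blocks: rebuild the original H x W grid."""
    c = len(tokens[0]) // (r * r)
    grid = [[None] * w for _ in range(h)]
    k = 0
    for i in range(0, h, r):
        for j in range(0, w, r):
            merged = tokens[k]
            k += 1
            for di in range(r):
                for dj in range(r):
                    start = (di * r + dj) * c
                    grid[i + di][j + dj] = merged[start:start + c]
    return grid

if __name__ == "__main__":
    # A 4 x 4 grid of 3-channel features: 16 tokens of width 3.
    grid = [[[y, x, y * 4 + x] for x in range(4)] for y in range(4)]
    tokens = merge_blocks(grid, r=2)
    print(len(tokens), len(tokens[0]))
    print(split_blocks(tokens, 4, 4, r=2) == grid)
```

Because the operation only rearranges values, the sequence gets shorter without any feature being averaged away or dropped; the cost is that each remaining token is wider.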
The importance of Leopard becomes even more evident when considering the practical use cases it addresses. In scenarios involving multiple text-rich images, Leopard substantially outperforms previous models like OpenFlamingo, VILA, and Idefics2, which struggled to generalize across interrelated visual-textual inputs. In benchmark evaluations, Leopard surpassed competitors by a large margin, with an average improvement of 9.61 points on key text-rich, multi-image benchmarks. For instance, on tasks like SlideVQA and Multi-page DocVQA, which require reasoning over multiple interconnected visual elements, Leopard generated correct answers in cases where other models failed. This capability is valuable in real-world applications such as understanding multi-page documents or analyzing presentations, which are common in business, education, and research settings.
Leopard represents a significant step forward for multimodal AI, particularly for tasks involving multiple text-rich images. By addressing the challenges of limited instruction-tuning data and balancing image resolution with sequence length, Leopard offers a robust solution that can process complex, interconnected visual information. Its superior performance across various benchmarks, combined with its innovative approach to adaptive high-resolution encoding, underscores its potential impact on numerous real-world applications. As Leopard continues to evolve, it sets a promising precedent for developing future MLLMs that can better understand, interpret, and reason across diverse multimodal inputs.
Check out the Paper and Leopard Instruct Dataset on HuggingFace. All credit for this research goes to the researchers of this project.