Large Language Models (LLMs) have made significant strides in recent years, prompting researchers to develop Large Vision Language Models (LVLMs) that integrate visual and textual information processing. However, current open-source LVLMs struggle to match the versatility of proprietary models such as GPT-4, Gemini Pro, and Claude 3. The primary obstacles are limited diversity in training data and difficulty handling long-context input and output. Researchers are therefore working to broaden open-source LVLMs' ability to perform a wide range of vision-language comprehension and composition tasks, closing the gap with leading closed-source models in both versatility and benchmark performance.
Researchers have made significant efforts to address these challenges, including text-image conversation models, high-resolution image analysis techniques, and video understanding methods. For text-image conversation, most existing LVLMs focus on single-image multi-round interactions, with some extending to multi-image inputs. High-resolution image analysis has followed two main strategies: high-resolution visual encoders and image patchification. Video understanding in LVLMs has relied on techniques such as sparse sampling, temporal pooling, compressed video tokens, and memory banks.
Researchers have also explored webpage generation, moving from simple UI-to-code transformations to more complex tasks using large vision-language models trained on synthetic datasets, although these approaches often lack diversity and real-world applicability. To align model outputs with human preferences, techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been adapted to multimodal LVLMs, with a focus on reducing hallucinations and improving response quality.
Researchers from Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, SenseTime Group, and Tsinghua University have introduced InternLM-XComposer-2.5 (IXC-2.5), representing a significant advancement in LVLMs, offering versatility and long-context capabilities. This model excels in comprehension and composition tasks, including free-form text-image conversations, OCR, video understanding, article composition, and webpage crafting. IXC-2.5 supports a 24K interleaved image-text context window, extendable to 96K, enabling long-term human-AI interaction and content creation.
The model introduces three key comprehension upgrades: ultra-high resolution understanding, fine-grained video analysis, and multi-turn multi-image dialogue support. For composition tasks, IXC-2.5 incorporates additional LoRA parameters, enabling webpage creation and high-quality text-image article composition. The latter benefits from Chain-of-Thought and Direct Preference Optimization techniques to enhance content quality.
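For readers unfamiliar with DPO, the sketch below shows the standard DPO objective computed from per-sequence log-probabilities under the policy and a frozen reference model. It is a minimal illustration of the general technique used to improve composition quality, not IXC-2.5's actual training code; the function name and the beta value are assumptions.

```python
# Minimal sketch of the standard Direct Preference Optimization (DPO) loss.
# Tensor names and the beta hyperparameter are illustrative, not taken from IXC-2.5.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from per-sequence log-probs under the policy and reference models."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to rank the preferred (chosen) response above the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

The appeal of DPO here is that preference data over generated articles can be used directly, without training a separate reward model as in full RLHF.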
IXC-2.5 enhances its predecessors’ architecture with a ViT-L/14 Vision Encoder, InternLM2-7B Language Model, and Partial LoRA. It handles diverse inputs through a Unified Dynamic Image Partition strategy, processing images at 560×560 resolution with 400 tokens per sub-image. The model employs a scaled identity strategy for high-resolution images and treats videos as concatenated frames. Multi-image inputs are handled with interleaved formatting. IXC-2.5 also supports audio input/output using Whisper for transcription and MeloTTS for speech synthesis. This versatile architecture enables effective processing of various input types and complex tasks.
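To make the dynamic partition idea concrete, here is a minimal sketch, assuming a simple aspect-ratio-preserving grid of 560×560 tiles plus a global thumbnail. The function name, the max_tiles limit, and the grid-selection heuristic are illustrative assumptions, not the paper's exact Unified Dynamic Image Partition algorithm.

```python
# Hedged sketch of a dynamic image-partition scheme in the spirit of IXC-2.5:
# tile the image into 560x560 sub-images plus a global view, each later encoded
# into a fixed budget of visual tokens. Names and heuristics are illustrative.
import math
from PIL import Image

TILE = 560            # sub-image resolution used by IXC-2.5
TOKENS_PER_TILE = 400  # visual tokens per sub-image, as described in the paper

def partition_image(img: Image.Image, max_tiles: int = 16):
    """Split an image into a grid of TILE x TILE crops plus a global thumbnail."""
    w, h = img.size
    # Pick a grid that roughly preserves aspect ratio without exceeding max_tiles.
    cols = max(1, min(max_tiles, math.ceil(w / TILE)))
    rows = max(1, min(max_tiles // cols, math.ceil(h / TILE)))
    resized = img.resize((cols * TILE, rows * TILE))

    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    global_view = img.resize((TILE, TILE))  # coarse view of the whole image
    token_budget = (len(tiles) + 1) * TOKENS_PER_TILE
    return [global_view] + tiles, token_budget
```

Because each sub-image costs 400 tokens, the visual token budget scales with image resolution, which is what lets ultra-high-resolution inputs fit within the 24K interleaved context window.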
IXC-2.5 demonstrates exceptional performance across various benchmarks. In video understanding, it outperforms open-source models in 4 out of 5 benchmarks, matching closed-source APIs. For structural high-resolution tasks, IXC-2.5 competes with larger models, excelling in form and table understanding. It significantly improves multi-image multi-turn comprehension, outperforming previous models by 13.8% on the MMDU benchmark. In general visual QA tasks, IXC-2.5 matches or surpasses both open-source and closed-source models, notably outperforming GPT-4V and Gemini-Pro on some challenges. For screenshot-to-code translation, IXC-2.5 even surpasses GPT-4V in average performance, showcasing its versatility and effectiveness across diverse multimodal tasks.
IXC-2.5 represents a significant advancement in Large Vision-Language Models, offering long-contextual input and output capabilities. This model excels in ultra-high resolution image analysis, fine-grained video comprehension, multi-turn multi-image dialogues, webpage generation, and article composition. Despite utilizing a modest 7B Large Language Model backend, IXC-2.5 demonstrates competitive performance across various benchmarks. This achievement paves the way for future research into more contextual multi-modal environments, potentially extending to long-context video understanding and interaction history analysis. Such advancements promise to enhance AI’s capacity to assist humans in diverse real-world applications, marking a crucial step forward in multimodal AI technology.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.