Browsing: AI News
Artificial intelligence, particularly in training large multimodal models (LMMs), relies heavily on vast datasets that include sequences of images and text. These datasets enable the development…
Video large language models (LLMs) have emerged as powerful tools for processing video inputs and generating contextually relevant responses to user commands. However, these models face…
A significant challenge in the field of visual question answering (VQA) is the task of Multi-Image Visual Question Answering (MIQA). This involves generating relevant and grounded…
Effectively evaluating document instruction data for training large language models (LLMs) and multimodal large language models (MLLMs) in document visual question answering (VQA) presents a significant…
Recently, diffusion models have become powerful tools in various fields, like image and 3D object generation. Their success comes from their ability to handle denoising tasks…
Large Language Models (LLMs) and their multi-modal counterparts (MLLMs) have made significant strides in advancing artificial general intelligence (AGI) across various domains. However, these models face…
Document understanding (DU) focuses on the automatic interpretation and processing of documents, encompassing complex layout structures and multi-modal elements such as text, tables, charts, and images.…
Robustness is crucial for deploying deep learning models in real-world applications. Vision Transformers (ViTs) have shown strong robustness and state-of-the-art performance in various vision tasks since…
Whole-body pose estimation is a key component for improving the capabilities of human-centric AI systems. It is useful in human-computer interaction, virtual avatar animation, and the…
A fundamental topic in computer vision for nearly half a century, stereo matching involves calculating dense disparity maps from two corrected pictures. It plays a critical…