In the early days of computer vision, researchers were not content to treat images as flat 2D arrays of “patterns”; they sought to understand them as projections of 3D scenes. To that end, they defined a number of intermediate tasks: estimating optical properties such as reflectance, recovering three-dimensional primitives via multi-view reasoning, geometric reasoning through depth estimation, visual correspondence, recognition, keypoint grounding for affordances, and intrinsic images for forensics. In the present era of large language models (LLMs), newly constructed tasks are largely articulated in natural language, emphasizing the vision-language relationship learned by multimodal LLMs rather than such perceptual abilities. One likely reason is the inherent imprecision of language, which makes it difficult to mediate many conventional computer vision tasks through words (for example, pinpointing a spatial keypoint through language is hard).
This study, a collaborative effort by researchers from the University of Pennsylvania, the University of Washington, the Allen Institute for AI, the University of California, and Columbia University, examines crucial yet overlooked aspects of visual perception in the evaluation of multimodal LLMs. Although existing benchmarks are widely used to evaluate seminal models like GPT-4V and Gemini-Pro, many of them conflate perception with linguistic understanding and reasoning. The work shows that a ‘blind’ GPT-4 performs well on these ‘multimodal’ tasks when a task-agnostic dense caption is substituted for the picture.
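To illustrate that caption-substitution baseline, the sketch below queries a text-only model with a dense caption in place of the image. It is a minimal sketch assuming the OpenAI Python client; the caption text and prompt wording are invented for the example and are not the authors’ actual protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical task-agnostic dense caption standing in for the image itself.
dense_caption = (
    "A kitchen counter with a red mug on the left, a kettle behind it, "
    "and a window in the background; the mug partially occludes a spoon."
)
question = (
    "Which object is closest to the camera? "
    "(A) the mug (B) the kettle (C) the window"
)

# A 'blind' text-only model sees only the caption, never the pixels.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Caption of an image: {dense_caption}\n\n{question}\n"
                   "Answer with the letter of the best option.",
    }],
)
print(response.choices[0].message.content)
```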
The study introduces Blink, a novel benchmark for multimodal large language models (LLMs) that focuses on core visual perception abilities not addressed by other evaluations. Blink’s fourteen classic computer vision tasks span a comprehensive range, from basic pattern matching to intermediate reasoning and advanced visual understanding (such as visual similarity). The tasks are deliberately challenging, constructed so that answering them requires a genuine understanding of the image’s content rather than superficial labeling.
The researchers recast each classic task as a multiple-choice question with image or text options. Blink contains roughly 3,800 questions and 7,300 photos, and a single question may involve several images drawn from various datasets. The photographs depict scenes inside and outside homes, in cities, and in nature. Questions and answer options are generated either by humans or from existing datasets, and a human can typically answer every question (except the IQ test) in the blink of an eye.
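To make the format concrete, here is a minimal sketch of how a Blink-style multiple-choice item might be represented and scored. The field names and the `accuracy` helper are hypothetical illustrations, not the benchmark’s actual data schema or evaluation code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BlinkQuestion:
    """Hypothetical record for one Blink-style multiple-choice item."""
    task: str            # e.g. "relative_depth", "visual_correspondence"
    images: List[str]    # paths to one or more input images
    prompt: str          # natural-language question shown to the model
    choices: List[str]   # answer options, either text or image references
    answer: str          # the correct option label, e.g. "(A)"

def accuracy(questions: List[BlinkQuestion], predictions: List[str]) -> float:
    """Fraction of questions where the predicted option matches the gold answer."""
    correct = sum(p == q.answer for p, q in zip(predictions, questions))
    return correct / len(questions)

# Usage sketch: two toy questions and a model that always picks "(A)".
qs = [
    BlinkQuestion("relative_depth", ["img1.jpg"],
                  "Which marked point is closer to the camera?",
                  ["(A) point A", "(B) point B"], "(A)"),
    BlinkQuestion("multi_view", ["img2.jpg", "img3.jpg"],
                  "Did the camera move left or right between the two views?",
                  ["(A) left", "(B) right"], "(B)"),
]
print(accuracy(qs, ["(A)", "(A)"]))  # 0.5
```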
On Blink, the team thoroughly evaluates seventeen multimodal LLMs ranging in size from 7B to 34B parameters. These tasks are quite easy for humans, who achieve 95.70% average accuracy, yet current models find them remarkably hard: GPT-4V manages only 51.26% average accuracy, 44.44% below humans and just 13.17% above random guessing. In addition, the team compared multimodal LLMs with specialist vision models and found that the specialists perform substantially better. In absolute accuracy, for instance, the specialist model beats GPT-4V by 62.8% on visual correspondence estimation, 38.7% on relative depth estimation, and 34.6% on multi-view reasoning.
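The headline gaps are simple differences in average accuracy; the short sketch below reproduces them from the figures quoted above (the random-guess baseline is inferred from the reported margin, not stated directly).

```python
# Reproducing the reported accuracy gaps from the figures quoted above.
human_acc = 95.70            # average human accuracy (%)
gpt4v_acc = 51.26            # GPT-4V average accuracy (%)
margin_over_random = 13.17   # GPT-4V's reported margin over random guessing (%)

gap_to_human = human_acc - gpt4v_acc               # 44.44
random_baseline = gpt4v_acc - margin_over_random   # ~38.09, inferred

print(f"GPT-4V trails humans by {gap_to_human:.2f} points")
print(f"Implied random-guess baseline: {random_baseline:.2f}%")
```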
The findings challenge previous estimates of multimodal LLMs’ perceptual capacities, suggesting they may have been overstated, and indicate that these models could benefit from incorporating insights from specialist models that excel in specific domains. The team envisions Blink as a valuable platform for exploring how multimodal LLMs can integrate more conventional notions of perception with their state-of-the-art generative capabilities, paving the way for future advances in the field.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a computer science engineer with solid experience in FinTech companies covering the financial, cards & payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier in today’s evolving world.