Vision-Language Models (VLMs) have come a long way recently, as demonstrated by the success of OpenAI’s GPT-4V. Recent studies have shown that these models achieve remarkable performance across a variety of vision-language tasks, including captioning, object localization, multimodal world knowledge, commonsense reasoning, visual question answering (VQA), and vision-based coding.
According to earlier studies, these state-of-the-art (SOTA) VLMs perform exceptionally well on a wide range of vision-based reasoning and understanding tasks. They can effectively extract text from images, comprehend and reason with visual data, including tables and charts, and solve basic visual mathematical problems.
In recent research, a team of researchers from Apple has focused on assessing the limitations of VLMs, especially on difficult tasks that require advanced vision-based deduction skills. The team has used Raven’s Progressive Matrices (RPMs) to assess VLMs’ capabilities in complex visual reasoning.
RPMs are widely used to evaluate a person’s multi-hop relational and deductive reasoning skills using only visual cues. Employing common techniques such as in-context learning, self-consistency, and Chain-of-Thought (CoT) prompting, the team has thoroughly evaluated a number of well-known VLMs on three different datasets: the Mensa IQ exam, IntelligenceTest, and RAVEN.
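The evaluation harness itself is not reproduced in this write-up, but the general recipe behind one of these techniques, CoT prompting combined with self-consistency (sampling several reasoning chains and taking a majority vote over the extracted answers), can be sketched roughly as follows. This is a minimal illustration assuming the OpenAI Python client; the model name, prompt wording, and helper functions are assumptions, not the authors’ actual code.

```python
# Minimal sketch of CoT prompting with self-consistency for one RPM puzzle image.
# Assumes the OpenAI Python client; model name and prompt wording are illustrative.
import base64
import re
from collections import Counter

from openai import OpenAI

client = OpenAI()


def encode_image(path: str) -> str:
    """Read a local puzzle image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


def ask_once(image_path: str) -> str | None:
    """Run one Chain-of-Thought query and extract the chosen option letter."""
    response = client.chat.completions.create(
        model="gpt-4o",   # assumption: any vision-capable chat model works here
        temperature=0.7,  # non-zero temperature so repeated samples can differ
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "This is a Raven's Progressive Matrices puzzle. "
                    "Reason step by step about the pattern in each row and column, "
                    "then finish with 'Final answer: <option letter>'."
                )},
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
            ],
        }],
    )
    match = re.search(r"Final answer:\s*([A-H])", response.choices[0].message.content)
    return match.group(1) if match else None


def self_consistent_answer(image_path: str, n_samples: int = 5) -> str | None:
    """Sample several CoT answers and return the majority vote (self-consistency)."""
    votes = [a for a in (ask_once(image_path) for _ in range(n_samples)) if a]
    return Counter(votes).most_common(1)[0][0] if votes else None
```

On text-only benchmarks, this kind of majority voting is one of the tricks that reliably boosts LLM accuracy; the finding discussed below is that it helps far less on RPM-style visual puzzles.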
The results have shown a notable discrepancy between the remarkable performance of Large Language Models (LLMs) on text-based reasoning tasks and VLMs’ competence in visual deductive reasoning. The team has shared that some techniques that work well for improving LLM performance do not transfer well to visual reasoning problems. A detailed study has revealed that VLMs struggle mainly because they have trouble identifying and understanding the multiple, potentially confusing, abstract patterns contained in RPM samples.
The team has summarized their primary contributions as follows.
- Systematic Evaluation Approach: The team has created a systematic approach for evaluating Vision-Language Models (VLMs) on Raven’s Progressive Matrices (RPM) problems. The Mensa IQ exam, IntelligenceTest, and RAVEN datasets have been used for evaluation, providing a thorough picture of VLM performance on image-based reasoning tasks.
- Inference-Time Techniques: To study the potential of VLMs, the team has employed common inference-time techniques used with LLMs, such as self-consistency and in-context learning. It has been found that several tactics that work well for LLMs do not work as well for VLMs.
- Performance Analysis: A thorough analysis of VLM performance has been conducted, breaking the task down into three capabilities: perception, inference, and hypothesis testing. The research has shown that perception is the main bottleneck in today’s VLMs. Specific perception problems have been identified in a case study of GPT-4V.
- Issues Found: A number of problems with the way current VLMs operate have been identified and examined, such as overconfidence, sensitivity to prompt design, and an inability to use in-context examples effectively. The influence of prompt design on model performance has been evaluated by manipulating the prompts, and structured prompts have been suggested as a possible technique for improvement; a sketch of such a prompt is shown after this list.
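The exact structured prompt from the paper is not reproduced in this write-up, but the underlying idea of walking the model through explicit perception, inference, and hypothesis-testing stages before it answers can be sketched as a prompt template like the one below; the wording and the cell-description format are illustrative assumptions.

```python
# Illustrative structured prompt separating perception from reasoning;
# the wording here is an assumption, not the paper's exact prompt.
STRUCTURED_RPM_PROMPT = """\
You are solving a Raven's Progressive Matrices puzzle shown in the image.

Step 1 - Perception:
Describe each of the 8 given cells, one per line, as
"row,col: shapes, count, fill, orientation".

Step 2 - Inference:
State the rule that each row and each column appears to follow.

Step 3 - Hypothesis testing:
For each candidate option, check whether it satisfies the rules from Step 2.

Step 4 - Answer:
Finish with 'Final answer: <option letter>'.
"""
```

Swapping this template in for the free-form CoT instruction in the earlier sketch forces the model to describe every cell explicitly before reasoning, which targets exactly the perception step that the analysis identifies as the main bottleneck.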
Check out the Paper. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.