In the real world, information is often conveyed through a combination of text, images, and videos. To understand and interact with this information effectively, AI systems must be able to process multiple modalities. Visual language models bridge the gap between natural language understanding and computer vision, enabling more comprehensive world comprehension.
These models can generate rich and contextually relevant descriptions, stories, or explanations incorporating textual and visual elements. This is valuable for creating content for various purposes, including marketing, entertainment, and education.
The major tasks of visual language models are visual question answering and image captioning. In visual question answering, the model is presented with an image and a text-based question about that image. It uses computer vision techniques to understand the contents of the image and NLP to process the textual question; the answer should ideally reflect the image's content and address the specific query posed in the question. Image captioning, in contrast, involves automatically generating descriptive textual captions or sentences that explain the content of an image.
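As a minimal sketch of these two tasks (not code from the paper discussed below), the snippet uses off-the-shelf Hugging Face pipelines; the model checkpoints and image path are illustrative assumptions, not the models used by the researchers.

```python
# Minimal sketch: visual question answering and image captioning with
# off-the-shelf VLMs. Checkpoints and file names are illustrative only.
from transformers import pipeline

# Visual question answering: image + text question -> answer
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
answer = vqa(image="kitchen.jpg", question="Is the cup made of glass?")
print(answer)  # e.g. [{'answer': 'yes', 'score': 0.87}, ...]

# Image captioning: image -> descriptive sentence
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("kitchen.jpg")
print(caption)  # e.g. [{'generated_text': 'a glass cup on a wooden table'}]
```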
Current VLMs fall short at capturing physical concepts such as the material type and fragility of common objects, which makes robotic tasks that require physical reasoning about objects extremely difficult. To address this, researchers from Stanford, Princeton, and Google DeepMind propose PhysObjects, an object-centric dataset of 36.9K crowd-sourced and 417K automated physical-concept annotations of common household objects. Crowd-sourced annotation collects and labels large volumes of data using a distributed group of individuals.
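To make the idea of an object-centric physical-concept annotation concrete, here is a hypothetical record layout; the field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of a PhysObjects-style annotation record.
# Field names and values are illustrative, not the dataset's real schema.
from dataclasses import dataclass

@dataclass
class PhysicalConceptAnnotation:
    image_id: str   # source image the object appears in
    object_id: str  # object instance the annotation refers to
    concept: str    # e.g. "material", "fragility", "deformability"
    value: str      # e.g. "glass", "fragile"
    source: str     # "crowd-sourced" or "automated"

example = PhysicalConceptAnnotation(
    image_id="img_000123",
    object_id="obj_0007",
    concept="fragility",
    value="fragile",
    source="crowd-sourced",
)
print(example)
```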
They demonstrate that fine-tuning a VLM on PhysObjects significantly improves its physical reasoning abilities, with the physically grounded VLM achieving improved prediction accuracy on held-out dataset examples. To test its advantages, they combined this physically grounded VLM with an LLM-based robotic planner, where the LLM queries the VLM about the physical concepts of objects in its scene.
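The sketch below illustrates this planner-queries-VLM loop under stated assumptions; it is not the authors' code, and the `vlm_answer` and `llm_plan` functions are hypothetical stand-ins for the fine-tuned VLM and the LLM planner.

```python
# Illustrative sketch (not the authors' implementation) of an LLM-based
# planner that queries a physically grounded VLM about objects in the scene.

def vlm_answer(image, question: str) -> str:
    """Placeholder for the physically grounded VLM; returns canned answers here."""
    return "fragile" if "fragile" in question.lower() else "glass"

def llm_plan(instruction: str, object_facts: dict) -> list[str]:
    """Placeholder for the LLM planner; a real system would prompt an LLM with the facts."""
    return [f"Plan for '{instruction}' given facts: {object_facts}"]

def plan_with_physical_grounding(image, instruction: str, objects: list[str]) -> list[str]:
    # The planner first gathers physical properties of each object from the VLM ...
    object_facts = {
        obj: {
            "material": vlm_answer(image, f"What material is the {obj} made of?"),
            "fragility": vlm_answer(image, f"Is the {obj} fragile?"),
        }
        for obj in objects
    }
    # ... then conditions its plan on those answers (e.g. handle fragile items carefully).
    return llm_plan(instruction, object_facts)

print(plan_with_physical_grounding(None, "clear the table", ["cup", "plate"]))
```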
The researchers used the EgoObjects dataset as their image source, which was the largest publicly released object-centric dataset of real objects at the time PhysObjects was constructed. Because it consists of videos of realistic household arrangements, it is well suited to training household robotics. In total, it includes 117,424 images, 225,466 objects, and 4,203 object instance IDs.
Their results show improved planning performance on tasks that require physical reasoning, compared to baselines that do not use physically grounded VLMs. Future work involves expanding beyond physical reasoning to areas such as geometric or social reasoning. Their methodology and dataset are a first step toward using VLMs for more sophisticated reasoning in robotics.
Check out the Paper and Project Page. All credit for this research goes to the researchers on this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Integrated MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn lead to advances in technology. He is passionate about understanding nature fundamentally with the help of tools like mathematical models, ML models, and AI.