Theory of Mind (ToM), the ability to grasp the thoughts and intentions of others, is crucial for developing machines with human-like social intelligence. Recent advances in machine learning, especially in large language models, show some capability in ToM understanding.
However, current ToM benchmarks rely primarily on either video or text datasets, neglecting the holistic nature of human ToM, which involves flexible reasoning over conceptual representations drawn from many data sources, including visual and linguistic cues. To address this limitation, researchers at MIT and Harvard introduced the Multimodal Theory of Mind Question Answering (MMToM-QA) benchmark. MMToM-QA assesses machine ToM on both multimodal and unimodal data types describing a person's activities in a household environment.
To enhance multimodal ToM capacity, they propose a novel method called BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and employs language models for scalable Bayesian inverse planning. Through systematic comparisons involving human performance, BIP-ALM, and state-of-the-art models, including GPT-4, their experiments reveal that large language and multimodal models still lack robust ToM capacity. In contrast, BIP-ALM exhibits promising results, harnessing the strengths of both model-based mental inference and language models.
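To make the "unified representations" idea concrete, here is a minimal sketch of how parses from two modalities might be fused into one symbolic episode that a single inference engine can consume. All names (`State`, `Step`, `fuse`, the field names) are illustrative assumptions, not the paper's actual data structures:

```python
from dataclasses import dataclass

# Hypothetical unified symbolic representation: observations parsed from
# video frames and from text are mapped into the same state/action format,
# so downstream inverse planning can run on either modality alone or both.
@dataclass(frozen=True)
class State:
    agent_location: str
    visible_objects: frozenset

@dataclass(frozen=True)
class Step:
    state: State
    action: str

def fuse(visual_steps, text_steps):
    """Merge per-timestep visual and text parses into one episode,
    preferring the visual parse when both modalities cover a timestep."""
    episode = []
    for t in range(max(len(visual_steps), len(text_steps))):
        if t < len(visual_steps) and visual_steps[t] is not None:
            episode.append(visual_steps[t])
        elif t < len(text_steps):
            episode.append(text_steps[t])
    return episode
```

The design point is simply that once both modalities share one symbolic format, the same reasoning procedure applies regardless of which inputs are available.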
In their evaluation, they compared BIP-ALM against several state-of-the-art models for text or multimodal question answering, including GPT-4 and Video-LLaMA. Despite the impressive performance of these models on other QA benchmarks, they made substantial and systematic errors on this benchmark, and none matched human performance.
BIP-ALM fine-tunes a language model on synthetic human-activity data so that it can perform inference in realistic scenarios, such as the household activities in their benchmark. It then uses this language model to assess the likelihood of hypotheses about the person's beliefs and goals. This strategy combines the robustness of Bayesian inverse planning with the scalability of language models.
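The inference step above can be sketched as vanilla Bayesian inverse planning: the posterior over goal hypotheses is proportional to the prior times the likelihood of the observed actions under each goal. In the paper, a fine-tuned language model scores that likelihood term; in this toy sketch, a hand-written probability table stands in for it, and all action/goal names are made up for illustration:

```python
# Toy Bayesian inverse planning over goal hypotheses:
# P(goal | actions) ∝ P(goal) * Π_t P(action_t | goal).
def posterior_over_goals(actions, goals, prior, likelihood):
    scores = {}
    for g in goals:
        p = prior[g]
        for a in actions:
            p *= likelihood(a, g)  # paper: scored by a fine-tuned LM
        scores[g] = p
    z = sum(scores.values())
    return {g: s / z for g, s in scores.items()}

# Stand-in likelihood: fridge-directed actions are more probable
# under the goal "get_milk" than under "get_book".
def likelihood(action, goal):
    table = {
        ("walk_to_fridge", "get_milk"): 0.8,
        ("walk_to_fridge", "get_book"): 0.2,
        ("open_fridge", "get_milk"): 0.9,
        ("open_fridge", "get_book"): 0.1,
    }
    return table.get((action, goal), 0.5)

post = posterior_over_goals(
    ["walk_to_fridge", "open_fridge"],
    ["get_milk", "get_book"],
    {"get_milk": 0.5, "get_book": 0.5},
    likelihood,
)
# After observing fridge-directed actions, "get_milk" dominates.
```

Replacing the hand-written table with a language model's scores is what makes the approach scale beyond tiny hand-coded domains, while the Bayesian update keeps the inference itself interpretable.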
In contrast, BIP-ALM demonstrated superior performance, even when utilizing a relatively small language model. These results highlight the limitations of current state-of-the-art models and underscore the effectiveness of the alternative approach provided by BIP-ALM in engineering human-level ToM reasoning.
In summary, their contributions encompass the introduction of (1) the first benchmark for multimodal ToM, (2) a novel ToM reasoning method, BIP-ALM, which integrates Bayesian inverse planning and language models for robust and efficient ToM inference based on multimodal data, and (3) a systematic comparison involving various machine learning models and human ToM capabilities.
Check out the Paper. All credit for this research goes to the researchers of this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Integrated MSc in Physics from the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive technological advancement, and he is passionate about probing nature with tools such as mathematical models, ML models, and AI.