The ability to learn to evaluate is increasingly taking on a pivotal role in the development of modern large multimodal models (LMMs). As pre-training on existing web data reaches its limits, researchers are shifting toward post-training with AI-enhanced synthetic data, a transition that underscores the growing importance of learned evaluation. Reliable AI evaluation can reduce human labor in complex task assessments, generate effective reward signals for reinforcement learning, and guide inference-time search. Despite progress in single-image, multi-image, and video scenarios, the development of open LMMs capable of evaluating the performance of other multimodal models remains a gap in the field.
Existing attempts to address the challenge of AI evaluation have primarily focused on using proprietary LMMs like GPT-4V as generalist evaluators for vision-language tasks. These models have been used in evaluation benchmarks for complex scenarios such as visual chat and detailed captioning. Moreover, open-source alternatives like Prometheus-Vision have emerged as evaluators for specific user-designed scoring criteria. In preference learning for LMMs, techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been applied to align models with human intentions. Recent research has expanded these concepts to the multimodal space, exploring various strategies to improve visual chat abilities and reduce hallucinations in vision-language models.
Researchers from ByteDance and the University of Maryland, College Park have proposed LLaVA-Critic, the first LMM specifically designed for evaluation tasks. The approach focuses on curating instruction-following data tailored for evaluation and addresses two primary scenarios: serving as an LMM-as-a-Judge and facilitating preference learning. In the first, LLaVA-Critic aims to provide reliable evaluation scores comparable to proprietary models such as GPT-4V, offering a free alternative across evaluation benchmarks. In the second, it offers a scalable way to generate effective reward signals, reducing dependence on costly human feedback collection. LLaVA-Critic shows a high correlation with commercial GPT models in evaluation tasks and superior performance in preference learning.
LLaVA-Critic is developed by fine-tuning a pre-trained LMM capable of following diverse instructions, ensuring the model can handle a wide range of vision tasks. Training uses an evaluation prompt that combines the multimodal instruction input, one or more model responses, and an optional reference response. LLaVA-Critic is trained to predict quantitative pointwise scores or pairwise rankings according to specified criteria and to provide detailed justifications for its judgments, with standard cross-entropy loss applied to both judgments and justifications. The researchers start from the LLaVA-OneVision (OV) 7B/72B pre-trained checkpoints and fine-tune on the LLaVA-Critic-113k dataset for one epoch.
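To make the evaluation setup concrete, here is a minimal sketch of how a pointwise critic prompt might be assembled and its output parsed. The template wording and the `Score:`/`Reason:` output format are illustrative assumptions, not the official LLaVA-Critic prompt.

```python
import re

# Hypothetical pointwise-judging template: combines the multimodal
# instruction (question), a model response, and an optional reference,
# as described for LLaVA-Critic's training prompts.
POINTWISE_TEMPLATE = (
    "You are a judge for vision-language responses.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reference: {reference}\n"
    "Rate the response on a 1-10 scale and justify your rating.\n"
    "Answer in the format:\nScore: <n>\nReason: <justification>"
)


def build_pointwise_prompt(question, response, reference=""):
    """Assemble an evaluation prompt from the instruction, a candidate
    response, and an optional reference answer."""
    return POINTWISE_TEMPLATE.format(
        question=question,
        response=response,
        reference=reference or "none provided",
    )


def parse_judgment(critic_output):
    """Extract the numeric score and free-text justification from the
    critic's generated text; returns (None, "") if parsing fails."""
    score_match = re.search(r"Score:\s*(\d+)", critic_output)
    reason_match = re.search(r"Reason:\s*(.+)", critic_output, re.DOTALL)
    score = int(score_match.group(1)) if score_match else None
    reason = reason_match.group(1).strip() if reason_match else ""
    return score, reason
```

A pairwise-ranking variant would follow the same pattern with two responses in the prompt and a preferred-response label parsed from the output.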
The results demonstrate significant improvements in both pointwise scoring and pairwise ranking capabilities of LLaVA-Critic compared to baseline models. LLaVA-Critic-72B achieves the highest average Pearson-r (0.754) and Kendall's Tau (0.933) in pointwise scoring, outperforming the baseline LLaVA-OV-72B. In pairwise ranking, LLaVA-Critic-72B outperforms GPT-4o and GPT-4V in comparisons without ties, achieving 73.6% accuracy. LLaVA-Critic-7B outperforms most commercial-model and open-source LMM baselines in the MLLM-as-a-Judge scenario. These results highlight the effectiveness of LLaVA-Critic as an open-source alternative for multimodal model evaluation.
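The Pearson-r and Kendall's Tau figures above measure how well the critic's scores agree with reference judgments. A minimal, self-contained illustration of both metrics (Tau-a, the tie-free variant, since the reported pairwise results exclude ties):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists:
    covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def kendall_tau(xs, ys):
    """Kendall's Tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Applied to a critic's scores versus reference scores, values near 1.0 indicate the kind of strong agreement reported for LLaVA-Critic-72B.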
In conclusion, researchers have proposed LLaVA-Critic, the first LMM specifically designed for evaluation tasks. The researchers used a high-quality, diverse instruction-following dataset to develop a model that excels in two critical areas. First, as a generalized evaluator, LLaVA-Critic shows remarkable alignment with human and GPT-4o preferences across various evaluation tasks, offering a viable open-source alternative to commercial models. Second, in preference learning scenarios, LLaVA-Critic functions as a reliable reward model, outperforming human feedback-based approaches in enhancing the visual chat capabilities of LMMs. This research is a key step toward building self-critiquing capabilities in open-source LMMs, enabling future advancements in scalable, superhuman AI alignment feedback.
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate from IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.