The rapid development of multimodal large language models (MLLMs) has been noteworthy, particularly those that integrate language and vision modalities (LVMs). Their advancement is attributed to high accuracy, generalization capability, reasoning skills, and robust performance, and these models cope well with unforeseen tasks beyond their initial training scope. MLLMs are changing various fields and prompting a re-evaluation of specialized models. Their swift evolution has sparked interest in using them for computer vision tasks such as object segmentation and in integrating them into intricate pipelines such as instruction-based image editing.
While models like ShareGPT4V are useful for tasks such as data annotation, their high cost limits their practicality in production. In contrast, specialized models like MiVOLO offer a cost-effective solution. This paper compares the best general-purpose MLLMs with specialized models like MiVOLO to understand whether the former can replace the latter. The results indicate significant differences in computational cost and speed for some tasks, such as labeling new data or filtering old datasets.
Researchers from SaluteDevices have presented MiVOLOv2, a model that outperforms not only specialized models such as CNN, ResNet34, and GoogLeNet but also the first version of MiVOLO. This second version, a state-of-the-art model for gender and age determination, is evaluated with Mean Absolute Error (MAE) and Cumulative Score at 5 (CS@5) for age estimation and with accuracy for gender prediction. The team also conducted experiments comparing the best general-purpose MLLMs with specialized models, covering SOTA MLLMs such as LLaVA 1.5, LLaVA-NeXT, ShareGPT4V, and ChatGPT4V.
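To make the three evaluation metrics concrete, here is a minimal sketch in plain Python. It is not the researchers' evaluation code. It assumes the common definitions: MAE is the average absolute difference between true and predicted ages, CS@5 is the share of samples whose absolute age error is at most 5 years, and accuracy is the share of correct gender labels. The function names and the choice to return fractions between 0 and 1 are illustrative assumptions.

```python
from typing import Sequence


def _check_lengths(a: Sequence, b: Sequence) -> None:
    # Both sequences must be non-empty and aligned sample by sample.
    if len(a) == 0 or len(a) != len(b):
        raise ValueError("inputs must be non-empty and of equal length")


def mean_absolute_error(true_ages: Sequence[float], pred_ages: Sequence[float]) -> float:
    """Average absolute age error, in years."""
    _check_lengths(true_ages, pred_ages)
    return sum(abs(t - p) for t, p in zip(true_ages, pred_ages)) / len(true_ages)


def cumulative_score(true_ages: Sequence[float], pred_ages: Sequence[float],
                     threshold: float = 5.0) -> float:
    """Fraction of samples whose absolute age error is <= threshold (CS@5 by default)."""
    _check_lengths(true_ages, pred_ages)
    hits = sum(1 for t, p in zip(true_ages, pred_ages) if abs(t - p) <= threshold)
    return hits / len(true_ages)


def accuracy(true_labels: Sequence[str], pred_labels: Sequence[str]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    _check_lengths(true_labels, pred_labels)
    correct = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
    return correct / len(true_labels)


if __name__ == "__main__":
    # Made-up toy values, for illustration only.
    ages_true = [25, 40, 33, 60]
    ages_pred = [27, 46, 33, 52]
    print(mean_absolute_error(ages_true, ages_pred))
    print(cumulative_score(ages_true, ages_pred))
    print(accuracy(["male", "female", "female", "male"],
                   ["male", "female", "male", "male"]))
```

Lower MAE is better, while higher CS@5 and accuracy are better. Papers often report CS@5 and accuracy as percentages rather than fractions.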
MiVOLO utilizes face and body crops for predictions, whereas the other models make predictions based on prompts and images of body crops. MiVOLO employs a transformer to estimate age and gender from these inputs. Additionally, the researchers fine-tune an MLLM for gender and age estimation and contrast it with the specialized model. The authors also explore the capabilities of multimodal ChatGPT (ChatGPT4V), evaluating its proficiency in predicting facial attributes and performing face recognition tasks. With zero training, the model outperformed a specialized age-recognition model but performed less effectively in gender classification.
For MiVOLOv2, the training dataset is extended by 40% relative to the data used for MiVOLO and now contains 807,694 samples: 390,730 male and 416,964 female. Most of the images were selected from cases where MiVOLOv1 made significant mistakes, and they come primarily from production pipelines and some open-source data such as LAION-5B. Of the two datasets, LAGENDA is chosen over IMDB because it minimizes the risk that MLLMs give correct answers not through age and gender estimation but through their familiarity with famous individuals, well-known movies, and so on. Despite lacking ground truths, LAGENDA offers this reduced risk, and MiVOLOv2 surpasses all general-purpose MLLMs in age estimation. However, LLaVA-NeXT 34B leads in this area among open-source alternatives, which suggests that fine-tuned, specialized versions of LLaVA can be more effective.
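The idea of selecting images on which the earlier model made significant mistakes can be sketched as a simple filter. This is an illustration only, not the researchers' pipeline: the `Sample` class, the `select_hard_samples` function, and the 10-year error threshold are all hypothetical, and the sketch assumes a reference age is available for each candidate image.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    image_id: str         # hypothetical identifier for a candidate image
    true_age: float       # reference age (assumed to be available)
    predicted_age: float  # age predicted by the earlier model


def select_hard_samples(samples: List[Sample], min_abs_error: float = 10.0) -> List[Sample]:
    """Keep samples whose absolute age error is at least min_abs_error years.

    The threshold is an assumed value; the source does not state how a
    "significant mistake" was defined.
    """
    return [s for s in samples if abs(s.true_age - s.predicted_age) >= min_abs_error]


if __name__ == "__main__":
    # Made-up candidates, for illustration only.
    candidates = [
        Sample("a", 30, 32),
        Sample("b", 20, 35),
        Sample("c", 50, 38),
    ]
    for sample in select_hard_samples(candidates):
        print(sample.image_id)
```

Focusing new training data on such cases is a common way to target a model's weak spots, although the exact selection criterion used for MiVOLOv2 is not described here.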
In conclusion, this paper aimed to assess the efficacy of MiVOLOv2 compared to MLLMs for age and gender estimation tasks. MiVOLOv2 surpasses all general-purpose MLLMs in age estimation and performs well when processing images of individuals. The results encourage a comprehensive evaluation of the potential of neural networks, including LLaVA and ShareGPT4V.
Check out the Paper. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate from IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.