VLMs are potent tools for grasping visual and textual data, promising advancements in tasks like image captioning and visual question answering. Limited data availability hampers their performance. Recent strides show that pre-training VLMs on larger image-text datasets improves downstream tasks. Yet, creating such datasets faces challenges: scarcity of paired data, high curation costs, low diversity, and noisy internet-sourced data.
Previous studies demonstrate the effectiveness of VLMs in tasks like image captioning, utilizing diverse architectures, and pretraining strategies. Recent advancements in high-quality image generators have sparked interest in using generative models for synthetic data generation. This trend impacts various computer vision tasks, including semantic segmentation, human motion understanding, and image classification. This study also explores integrating data-driven generative models within VLMs, emphasizing efficiency by generating image embeddings directly integrated into the model, showing superiority over existing approaches.
The researchers from Google DeepMind have proposed Synth2. This method leverages pre-trained generative text and image models to create synthetic paired data for VLMs, addressing data scarcity, cost, and noise challenges. It generates both text and images synthetically, avoiding reliance on real-world data. The approach operates at the embedding level, bypassing costly pixel-space rendering, thus enhancing efficiency without compromising performance. Pre-training the text-to-image model on the same dataset used for VLM training ensures fair evaluation and prevents unintended knowledge transfer.
Synth2 leverages pre-trained generative text and image models to create synthetic paired data for VLM training. It includes components for Caption Generation, utilizing LLMs with class-based prompting for diverse captions, and Image Generation, employing a controlled text-to-image generator trained on the same dataset as the VLM to ensure fair evaluation. The Synth2 VLM architecture integrates VQ-GAN backbones for efficient interaction with synthetically generated image embeddings, bypassing pixel-space processing and enabling seamless training. Also, a Perceiver Resampler component facilitates cross-attention between VQ tokens and language tokens in the VLM, aiding in effective multimodal representations.
In evaluating synthetic images for VLM training, Synth2 significantly improves performance over baselines, even with a smaller volume of human-annotated images. Synthetic images effectively substitute real ones, enhancing VLM capabilities. Synth2 also outperforms state-of-the-art methods like ITIT and DC, achieving competitive results with reduced data usage and computational resources. This highlights Synth2’s effectiveness and efficiency in enhancing VLM performance.
In conclusion, the researchers from Google DeepMind have proposed Synth2, which uses synthetic image-text pairs to enhance VLM training. Results show improved VLM performance compared to baselines, with enhanced data efficiency and scalability. This method offers customization for specific domains and addresses resource-intensive data acquisition challenges. The findings underscore the potential of synthetic data generation in advancing visual language understanding, suggesting avenues for further exploration.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.
If you like our work, you will love our newsletter..
Don’t Forget to join our 38k+ ML SubReddit