Any activity that requires comprehension and production in one or more modalities can be considered a multimodal task, and such tasks are extremely varied and often open-ended. Previous multimodal systems are hard to scale because they rely heavily on gathering a large supervised training set and designing a task-specific architecture, effort that must be repeated for every new task. Current multimodal models have also not mastered the human ability to learn new tasks in context, that is, from only minimal demonstrations or instructions. Generative pretrained language models, by contrast, have recently shown impressive in-context learning skills.
New research from the Beijing Academy of Artificial Intelligence, Tsinghua University, and Peking University introduces Emu2, a 37-billion-parameter model trained and evaluated on several multimodal tasks. The findings show that, when scaled up, a multimodal generative pretrained model can learn in context in much the same way and generalize well to unseen multimodal tasks. Predicting the next multimodal element, whether a text token or a visual embedding, is the only objective used during Emu2’s training. This unified generative pretraining approach trains the model on large-scale multimodal sequences, including text, image-text pairs, interleaved image-text documents, and video.
The Emu2 model is generative and multimodal: it learns, in a multimodal setting, to predict the next element. Emu2’s design has three main parts: a Visual Encoder, the Multimodal Modeling module, and a Visual Decoder. The Visual Encoder tokenizes every input image into continuous embeddings, which are then interleaved with text tokens for autoregressive multimodal modeling. The Visual Decoder turns the regressed visual embeddings back into an image or video.
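The interleaving step described above can be sketched in a toy form. The code below is a minimal illustration only, not Emu2’s actual implementation: the dimensions, the random-projection "encoder," and the function names are all assumptions made for the example. It shows the core idea that image patches become continuous embeddings that sit in the same sequence as text-token embeddings, so one autoregressive model can predict the next position regardless of modality.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension (Emu2's real hidden size is far larger)

def visual_encoder(image):
    """Toy stand-in: tokenize an image into 4 continuous embeddings.

    A real encoder (e.g. a ViT) would produce patch embeddings; here we
    just project the flattened pixels with a fixed random matrix.
    """
    flat = image.reshape(-1)
    W = rng.standard_normal((4, flat.size, D))
    return flat @ W  # shape (4, D): four visual "tokens"

def embed_text(token_ids, vocab=100):
    """Toy text-embedding lookup table."""
    table = rng.standard_normal((vocab, D))
    return table[token_ids]

def multimodal_sequence(text_ids, image):
    """Interleave text-token embeddings with visual embeddings, mirroring
    the unified sequence that autoregressive multimodal modeling consumes."""
    return np.concatenate([embed_text(text_ids), visual_encoder(image)], axis=0)

seq = multimodal_sequence(np.array([3, 17, 42]), rng.standard_normal((2, 2, 3)))
print(seq.shape)  # (7, 8): 3 text positions followed by 4 visual positions
```

At training time, every position in such a sequence becomes a prediction target: a cross-entropy loss for text tokens and a regression loss for visual embeddings, with the Visual Decoder later mapping regressed embeddings back to pixels.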
For Emu2’s pretraining, the researchers drew on several publicly available datasets: image-text pairs from LAION-2B and CapsFusion-120M, video-text pairs from WebVid-10M, interleaved image-text data from Multimodal-C4 (MMC4), interleaved video-text data from YT-Storyboard-1B, and grounded image-text pairs from GRIT-20M (introduced by Kosmos-2) and CapsFusion-grounded-100M (curated from CapsFusion-120M). Language-only data from the Pile is also included to preserve textual reasoning capacity.
Using conventional multimodal datasets as well as tasks not observed during training, the researchers evaluate Emu2’s ability to learn from a few examples or from instructions. In particular, Emu2 is assessed in two settings:
- A few-shot setting, in which a number of examples are placed within the model’s context window
- Instruction tuning, in which the model is trained to follow particular commands

Across a variety of vision-language tasks, Emu2 shows encouraging few-shot results. For example, it achieves top-tier few-shot performance on several visual question-answering datasets, and performance improves as the number of in-context examples grows. Emu2 also excels at multimodal reasoning for real-world tasks such as object recognition and counting, and it learns to follow in-context visual cues, despite struggling to do so at smaller scales or in zero-shot settings.
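A few-shot multimodal prompt of the kind described above is just an interleaved sequence of example images with their question-answer text, followed by the query. The sketch below is hypothetical: the function name, the tuple-based segment format, and the file names are illustrative assumptions, not Emu2’s API.

```python
def build_few_shot_prompt(examples, query_image, question):
    """Assemble an interleaved few-shot prompt.

    examples: list of (image, question, answer) triples serving as
    in-context demonstrations; the model would be asked to continue
    the sequence after the final "A:".
    """
    parts = []
    for img, q, a in examples:
        parts += [("image", img), ("text", f"Q: {q} A: {a}")]
    parts += [("image", query_image), ("text", f"Q: {question} A:")]
    return parts

shots = [("img_cat.png", "How many cats?", "2"),
         ("img_dog.png", "How many dogs?", "1")]
prompt = build_few_shot_prompt(shots, "img_query.png", "How many birds?")
print(len(prompt))  # 6 interleaved segments: 2 per shot + 2 for the query
```

Adding more shots simply appends more (image, text) pairs to the context, which is why performance can improve with the number of in-context examples until the context window is exhausted.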
Because it can handle interleaved text, images, and video on both the input and output sides, Emu2 serves as a robust and flexible base model for many multimodal tasks, and it can follow task-specific instructions. For instance, after further fine-tuning on conversational data, Emu2 outperforms earlier models with more complicated designs and achieves state-of-the-art results on visual question-answering tasks. Furthermore, Emu2 can be adapted into a high-quality, controllable visual generation model: images, locations, and text can all be supplied as conditions, and the model produces images consistent with them.
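To make the conditioning idea concrete, a request to such a controllable generator might bundle the three condition types together. This fragment is purely illustrative: the key names and the normalized-bounding-box convention are assumptions for the sketch, not Emu2’s interface.

```python
# Hypothetical condition bundle for controllable visual generation.
# "location" is assumed here to be a normalized (x0, y0, x1, y1) box
# telling the model where to place the described subject.
condition = {
    "text": "a corgi wearing a red scarf",
    "image": "style_reference.png",        # optional reference image
    "location": (0.25, 0.30, 0.75, 0.90),  # where the subject should appear
}
print(sorted(condition))  # ['image', 'location', 'text']
```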
The team also notes that, in some instances, the hallucination problem common to multimodal models can lead to irrational or inaccurate predictions. Like other generative models, Emu2 may produce biased or harmful output owing to skewed or inappropriate training data.
Nevertheless, these results imply that generative multimodal models at scale may be a significant step toward developing generalizable, flexible multimodal systems.
Check out the Paper and Github. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a computer science engineer with experience at FinTech companies covering the financial, cards & payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier in today’s evolving world.