Recently, there have been significant advances in generating images from text descriptions and in combining text and images to produce new ones. One largely unexplored area, however, is image generation from generalized vision-language inputs: for example, generating an image from a scene description that involves multiple objects and people. A team of researchers from Microsoft Research, New York University, and the University of Waterloo has introduced KOSMOS-G, a model that leverages multimodal large language models (MLLMs) to tackle this problem.
KOSMOS-G can create detailed images from complex combinations of text and multiple input pictures in a zero-shot setting, without having seen such examples during training. It is the first model that can generate images from descriptions involving multiple distinct subjects drawn from the input pictures. Because it can stand in for CLIP, KOSMOS-G also opens the door to pairing techniques such as ControlNet and LoRA with multimodal inputs for a variety of applications.
KOSMOS-G takes a staged approach to generating images from text and pictures. It begins by training a multimodal LLM, which can understand text and images together, and then aligns that model's output with the CLIP text encoder, which excels at representing text.
Given a caption interleaved with text and segmented images, KOSMOS-G is trained to produce images that match the description and follow the instructions. It does this with a pre-trained image decoder, leveraging what it has learned from the input images to generate faithful pictures across different contexts.
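To make the "align before instruct" idea concrete, here is a minimal PyTorch sketch of an aligner module that maps the MLLM's output embeddings into the space produced by the frozen CLIP text encoder, so that a pre-trained diffusion decoder can consume them unchanged. The module structure, names, and dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AlignerNet(nn.Module):
    """Illustrative aligner: maps MLLM outputs into CLIP text-encoder space.
    Dimensions (2048, 768, 77 tokens) are assumptions for the sketch."""

    def __init__(self, mllm_dim=2048, clip_dim=768, num_tokens=77):
        super().__init__()
        # Learnable queries pull a fixed-length sequence out of the
        # variable-length MLLM output, mirroring CLIP's token layout.
        self.queries = nn.Parameter(torch.randn(num_tokens, clip_dim))
        self.proj = nn.Linear(mllm_dim, clip_dim)
        self.cross_attn = nn.MultiheadAttention(clip_dim, num_heads=8, batch_first=True)

    def forward(self, mllm_states):                  # (B, L, mllm_dim)
        kv = self.proj(mllm_states)                  # (B, L, clip_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)          # (B, 77, clip_dim)
        return out

def alignment_loss(aligned, clip_target):
    # Alignment supervision: make the aligned MLLM output match what the
    # CLIP text encoder would produce for the same caption.
    return nn.functional.mse_loss(aligned, clip_target)
```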
KOSMOS-G generates images from instructions and input data, and its training proceeds in three stages. In the first stage, only the MLLM is pre-trained on multimodal corpora. In the second stage, an AlignerNet is trained, with the MLLM frozen, to align KOSMOS-G's output space with the U-Net's input space under CLIP supervision. In the third stage, the AlignerNet and the MLLM are jointly fine-tuned on a compositional generation task over curated data. The image decoder remains frozen throughout all stages.
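The following schematic sketch shows which modules are trainable at each stage, exactly as described above. The module names (`mllm`, `aligner`, `image_decoder`) are placeholders; only the freeze/unfreeze pattern is taken from the text.

```python
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage, mllm, aligner, image_decoder):
    set_trainable(image_decoder, False)      # frozen in every stage
    if stage == 1:                           # multimodal pre-training
        set_trainable(mllm, True)
        set_trainable(aligner, False)
    elif stage == 2:                         # CLIP-supervised alignment
        set_trainable(mllm, False)
        set_trainable(aligner, True)
    elif stage == 3:                         # compositional fine-tuning
        set_trainable(mllm, True)
        set_trainable(aligner, True)
```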
KOSMOS-G excels at zero-shot image generation across a range of settings. It can produce images that are semantically coherent, visually appealing, and customizable: changing an image's context, applying a particular style, making targeted modifications, and adding extra details. KOSMOS-G is also the first model to achieve multi-entity vision-language-to-image (VL2I) generation in a zero-shot setting.
KOSMOS-G can serve as a drop-in replacement for CLIP in image generation systems, which opens up applications that were previously out of reach. By building on CLIP's foundation, KOSMOS-G is expected to accelerate the shift from generating images based on text alone to generating images from a combination of text and visual information, creating opportunities for many innovative applications.
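Because KOSMOS-G emits embeddings in the CLIP text-encoder space, they can in principle be fed to a Stable Diffusion pipeline through the real `prompt_embeds` argument in place of CLIP's output. Here is a hypothetical usage sketch: `kosmos_g_encode` is a placeholder for the model's combined MLLM + AlignerNet forward pass, not an actual published API.

```python
import torch
from diffusers import StableDiffusionPipeline

def kosmos_g_encode(interleaved_inputs):
    # Placeholder for KOSMOS-G's MLLM + AlignerNet forward pass, which
    # would return CLIP-space embeddings of shape (1, 77, 768) on the
    # pipeline's device and dtype.
    raise NotImplementedError

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Embeddings computed from interleaved text + reference images stand in
# for the CLIP text encoder's output.
prompt_embeds = kosmos_g_encode("a photo of <img1> next to <img2> on a beach")

image = pipe(prompt_embeds=prompt_embeds).images[0]
image.save("output.png")
```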
In summary, KOSMOS-G is a model that creates detailed images from both text and multiple pictures, using a training strategy the authors call "align before instruct." It performs well on single-object image generation and is the first to handle multiple objects in a zero-shot setting. It can also replace CLIP and be combined with techniques such as ControlNet and LoRA for new applications. In short, KOSMOS-G is an initial step toward treating image as a foreign language in image generation.
Check out the Paper. All credit for this research goes to the researchers on this project.