Artificial intelligence has advanced significantly in text-to-image generation in recent years. Transforming written descriptions into visual representations has many applications, from content creation and storytelling to accessibility tools for people with visual impairments. However, researchers face two significant obstacles: a shortage of high-quality training data and copyright concerns around datasets scraped from the internet.
In recent research, a team has proposed building an image dataset of Creative Commons (CC) licensed photos and using it to train open diffusion models that can compete with Stable Diffusion 2 (SD2). To do this, two major obstacles need to be overcome:
- Absence of captions: Although high-resolution CC images are openly licensed, they frequently lack the textual descriptions, i.e., the captions necessary for training text-to-image generative models. Without captions, a model cannot learn to understand textual input and produce matching visuals.
- Scarcity of CC images: Although a valuable resource, CC images are far scarcer than large proprietary datasets such as LAION. This scarcity raises the question of whether there is enough data to train high-quality models successfully.
To address the first obstacle, the team used a transfer-learning technique: a pre-trained captioning model generates high-quality synthetic captions, which are then paired with a carefully curated selection of CC images. The approach is simple and leverages an existing model's ability to produce text from images. The result is a dataset of images paired with synthetic captions that can be used to train generative models that translate text into visuals.
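As a rough illustration, the captioning step could look like the sketch below, which uses an off-the-shelf BLIP captioning model from Hugging Face Transformers. The specific checkpoint and file path are illustrative assumptions, not necessarily the setup the authors used.

```python
# Minimal sketch: generate synthetic captions for uncaptioned CC images
# with a pre-trained captioning model. The checkpoint below is an
# illustrative choice, not necessarily the one used in the paper.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

checkpoint = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

def caption_image(path: str) -> str:
    """Produce a synthetic caption for a single openly licensed image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Pair each image with its generated caption to form (image, caption)
# training examples for a text-to-image model.
print(caption_image("example_cc_photo.jpg"))  # hypothetical file path
```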
To tackle the second challenge, the team created a training recipe that is both compute- and data-efficient, aiming to match the quality of current SD2 models with far less data. Only about 70 million examples are needed, roughly 3% of the data originally used to train SD2. This suggests that enough CC images are available to train high-quality models efficiently.
Using this data and the efficient training procedure, the team trained several text-to-image models. Together, these models are called the CommonCanvas family, and they mark a notable advance in open generative modelling, producing visual outputs on par with SD2 in quality.
The largest model in the CommonCanvas family, trained on a CC dataset less than 3% the size of the LAION dataset, achieves performance comparable to SD2 in human evaluations. Despite the dataset-size constraints and the use of synthetic captions, the method produces high-quality results.
The team has summarized their primary contributions as follows.
- The team used a transfer-learning method called telephoning to produce high-quality synthetic captions for Creative Commons (CC) images that originally had none.
- They provide a dataset called CommonCatalog that includes about 70 million openly licensed CC images.
- The CommonCatalog dataset is used to train a series of latent diffusion models (LDMs). Collectively called CommonCanvas, these models perform competitively with the SD2-base baseline both qualitatively and quantitatively.
- The study applies a number of training optimisations that allow the SD2-base model to train almost three times faster (a hedged sketch of typical optimisations of this kind follows this list).
- To encourage collaboration and further research, the team has made the trained CommonCanvas models, the CC images, the synthetic captions, and the CommonCatalog dataset freely available on GitHub.
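The paper details the exact optimisations the team applied; the sketch below only shows common throughput improvements for latent-diffusion training in the diffusers ecosystem, and the model repo id and settings here are assumptions for illustration rather than the authors' configuration.

```python
# Hedged sketch of common training-throughput optimisations for a latent
# diffusion model; the precise set used to make SD2-base train roughly
# three times faster is described in the paper and may differ from this.
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-base", subfolder="unet"
)

# Memory-efficient attention (requires the xformers package) reduces
# activation memory and speeds up each training step.
unet.enable_xformers_memory_efficient_attention()

# Gradient checkpointing trades extra compute for memory, which allows
# larger per-GPU batch sizes.
unet.enable_gradient_checkpointing()

# Inside the training loop, mixed precision is usually applied with an
# autocast context rather than casting the whole model, e.g.:
#
#   with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#       noise_pred = unet(noisy_latents, timesteps, text_embeddings).sample
```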
Check out the Paper. All credit for this research goes to the researchers on this project.
Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.