With a stable training process, diffusion models have revolutionized image generation, reaching unprecedented levels of diversity and realism. Unlike GANs and VAEs, however, their sampling is a slow, iterative process: a Gaussian noise sample is progressively denoised into a complex image, typically requiring tens to hundreds of expensive neural network evaluations. This limits interactivity when the generation pipeline is used as a creative tool. Previous techniques speed up sampling by distilling the noise→image mapping discovered by the original multi-step diffusion sampler into a single-pass student network, but fitting such a high-dimensional, intricate mapping is a difficult task.
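To make the cost gap concrete, here is a minimal sketch contrasting the two sampling regimes, assuming hypothetical `denoiser` and `generator` callables and a simplified Euler-style ODE step (illustrative only, not the paper's code):

```python
import torch

@torch.no_grad()
def multi_step_sample(denoiser, sigmas, shape, device="cpu"):
    """Iterative diffusion sampling: one denoiser call per noise level (tens to hundreds)."""
    x = torch.randn(shape, device=device) * sigmas[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_pred = denoiser(x, sigma)        # predict the clean image at this noise level
        d = (x - x0_pred) / sigma           # ODE derivative estimate (Karras/EDM-style)
        x = x + d * (sigma_next - sigma)    # Euler step toward the next, lower noise level
    return x

@torch.no_grad()
def one_step_sample(generator, shape, device="cpu"):
    """Distilled student: a single forward pass maps noise directly to an image."""
    z = torch.randn(shape, device=device)
    return generator(z)
```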
One obstacle is the high cost of running the full denoising trajectory just to compute a single loss for the student model. Current techniques mitigate this by gradually increasing the student's sampling distance without replaying the original diffusion's entire denoising cycle; still, the distilled versions fall short of the original multi-step diffusion model. Instead of requiring exact correspondences between noise inputs and diffusion-generated images, the research team enforces that the student's generations are indistinguishable in distribution from the original diffusion model's. Their objective is broadly similar in spirit to that of other distribution-matching generative models, such as GMMN or GANs.
However, despite their remarkable performance at producing realistic images, such distribution-matching models have proven difficult to scale up on general text-to-image data. The research team sidesteps this problem by starting from a diffusion model that has already been extensively trained on text-to-image data. Specifically, they fine-tune the pretrained diffusion model to learn both the real data distribution and the fake distribution produced by their distillation generator. Because diffusion models are known to approximate the score functions of diffused distributions, the denoised diffusion outputs can be interpreted as gradient directions that make an image "more realistic" or, when the diffusion model is trained on the fake images, "more fake."
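The link between denoising and score estimation underpins this interpretation. Below is a minimal sketch, assuming a generic EDM-style denoiser callable `denoiser(x_t, sigma)` that predicts the clean image (a generic formulation, not the authors' implementation):

```python
import torch

def score_from_denoiser(denoiser, x_t, sigma):
    """Approximate the score of the diffused distribution from a denoiser's output.

    For a denoiser D trained on images from distribution p, Tweedie's formula gives
        grad_x log p_sigma(x_t) ~= (D(x_t, sigma) - x_t) / sigma**2,
    i.e. a direction that moves the noisy image toward more probable ("more realistic")
    images. If D is instead trained on the generator's outputs, the same formula
    points toward "more fake" images.
    """
    x0_pred = denoiser(x_t, sigma)
    return (x0_pred - x_t) / sigma**2
```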
The generator's gradient update rule is then formed as the difference between the two, pushing the synthetic images toward greater realism and away from the fake distribution. Previous work on test-time optimization of 3D objects has shown that the real and fake distributions can both be modeled with pretrained diffusion models, via a technique called Variational Score Distillation; the research team finds that a similar methodology can instead be used to train an entire generative model. They also find that, alongside the distribution matching loss, a small number of multi-step diffusion sampling results can be precomputed, and a simple regression loss anchoring the one-step generations to them serves as an effective regularizer.
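A hedged sketch of how such an update and the regression regularizer could be wired up in PyTorch; the `real_denoiser` and `fake_denoiser` callables, the single noise level `sigma`, and the plain-MSE regression are illustrative placeholders, and DMD's actual weighting and noise schedule differ:

```python
import torch
import torch.nn.functional as F

def distribution_matching_loss(real_denoiser, fake_denoiser, x_gen, sigma):
    """Surrogate loss whose gradient w.r.t. x_gen is the fake-minus-real direction."""
    noise = torch.randn_like(x_gen)
    x_t = x_gen + sigma * noise                 # diffuse the generated image
    with torch.no_grad():
        x0_real = real_denoiser(x_t, sigma)     # "more realistic" prediction
        x0_fake = fake_denoiser(x_t, sigma)     # "more fake" prediction
        grad = x0_fake - x0_real                # descending this pushes x_gen toward realism
    # MSE against a detached, shifted target reproduces `grad` in the backward pass.
    return 0.5 * F.mse_loss(x_gen, (x_gen - grad).detach(), reduction="sum")

def regression_loss(generator, z_precomputed, x_precomputed):
    """Regularizer: match one-step outputs to precomputed multi-step diffusion samples."""
    return F.mse_loss(generator(z_precomputed), x_precomputed)
```

The detached-target trick is only a convenient way to inject a hand-crafted gradient through standard autograd; gradient descent on this surrogate moves each generated image toward the real denoiser's prediction and away from the fake one's.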
Researchers from MIT and Adobe Research present Distribution Matching Distillation (DMD), a procedure that converts a diffusion model into a one-step image generator with minimal impact on image quality. Drawing inspiration and insights from VSD, GANs, and pix2pix, their approach shows that a high-fidelity one-step generative model can be trained by (1) using diffusion models to model the real and fake distributions and (2) matching the multi-step diffusion outputs with a simple regression loss. The research team evaluates models trained with DMD on a range of tasks, including image generation on CIFAR-10 and ImageNet 64×64 and zero-shot text-to-image generation on MS COCO 512×512. Their one-step generator substantially outperforms published few-step diffusion approaches such as Consistency Models, Progressive Distillation, and Rectified Flow across all benchmarks.
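Putting the pieces together, a minimal training-step outline might look like the following; the module names, optimizers, single noise level, and loss weight are assumptions for illustration rather than the authors' released code, and it reuses the `distribution_matching_loss` helper sketched above:

```python
import torch
import torch.nn.functional as F

def training_step(generator, real_denoiser, fake_denoiser,
                  gen_opt, fake_opt, z_pre, x_pre, sigma, reg_weight=0.25):
    # 1) Fine-tune the fake-distribution denoiser on current generator samples,
    #    so it keeps tracking the generator's (changing) output distribution.
    with torch.no_grad():
        x_gen = generator(torch.randn_like(z_pre))
    noise = torch.randn_like(x_gen)
    fake_loss = F.mse_loss(fake_denoiser(x_gen + sigma * noise, sigma), x_gen)
    fake_opt.zero_grad()
    fake_loss.backward()
    fake_opt.step()

    # 2) Update the one-step generator: distribution matching + regression losses.
    x_gen = generator(torch.randn_like(z_pre))
    dm_loss = distribution_matching_loss(real_denoiser, fake_denoiser, x_gen, sigma)
    reg_loss = F.mse_loss(generator(z_pre), x_pre)  # precomputed noise -> image pairs
    gen_loss = dm_loss + reg_weight * reg_loss
    gen_opt.zero_grad()
    gen_loss.backward()
    gen_opt.step()
    return gen_loss.item()
```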
DMD achieves an FID of 2.62 on ImageNet 64×64, outperforming the Consistency Model by 2.4×. Using the same denoiser architecture as Stable Diffusion, DMD obtains a competitive FID of 11.49 on MS-COCO 2014-30k. Quantitative and qualitative analyses show that the images produced by their model are comparable in quality to those of the far more expensive Stable Diffusion model, while requiring roughly 100× fewer neural network evaluations. Thanks to this efficiency, DMD can generate 512×512 images at 20 frames per second with FP16 inference, opening up many possibilities for interactive applications.
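For context, one-step inference at this resolution boils down to a single half-precision forward pass. A hypothetical sketch, assuming a latent-space `generator` and a VAE-style `decoder` as placeholders (not the released model):

```python
import torch

@torch.no_grad()
def generate(generator, decoder, batch_size=1, latent_shape=(4, 64, 64), device="cuda"):
    """One-step FP16 generation: a single generator evaluation per batch, no denoising loop."""
    z = torch.randn(batch_size, *latent_shape, device=device, dtype=torch.float16)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        latents = generator(z)       # single network evaluation
        images = decoder(latents)    # e.g. a VAE decoder to 512x512 RGB, if latent-space
    return images.clamp(-1, 1)
```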
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.