Deep neural networks are central to synthesizing photorealistic images and videos with large-scale generative models. Turning these models into productive tools for humans requires one critical step: adding control, so that a model follows the instructions a user provides instead of generating random data samples. Extensive work has pursued this goal. For example, in Generative Adversarial Networks (GANs), a widespread solution is adaptive normalization, which dynamically scales and shifts the intermediate feature maps according to the input condition.
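To make adaptive normalization concrete, here is a minimal NumPy sketch, not the implementation of any particular GAN. It assumes per-channel instance normalization and two hypothetical projection matrices (`w_gamma`, `w_beta`) that map a condition embedding to a per-channel scale and shift. All names and shapes are illustrative.

```python
import numpy as np

def adaptive_norm(features, cond, w_gamma, w_beta, eps=1e-5):
    """Normalize each channel, then scale and shift it based on the condition.

    features: (C, H, W) feature map
    cond:     (D,) condition embedding
    w_gamma, w_beta: (C, D) hypothetical projections (assumptions of this sketch)
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    var = features.var(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / np.sqrt(var + eps)

    gamma = 1.0 + w_gamma @ cond  # per-channel scale, shape (C,)
    beta = w_beta @ cond          # per-channel shift, shape (C,)
    return gamma[:, None, None] * normalized + beta[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))
w_gamma = rng.normal(scale=0.1, size=(4, 3))
w_beta = rng.normal(scale=0.1, size=(4, 3))

cond_a = np.array([1.0, 0.0, 0.0])  # e.g. a one-hot class label
cond_b = np.array([0.0, 1.0, 0.0])

out_a = adaptive_norm(features, cond_a, w_gamma, w_beta)
out_b = adaptive_norm(features, cond_b, w_gamma, w_beta)
```

The same feature map yields different outputs for `cond_a` and `cond_b`, yet the condition only touches the features; no layer weight changes.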
However, despite differences in their operations, the widely used techniques share the same underlying mechanism: they add control by manipulating the feature space. Meanwhile, the neural network weights (those of the convolution and linear layers) remain the same across conditions. Two critical questions therefore arise: (a) Can image generative models be controlled by manipulating their weights? (b) Can controlled image generative models benefit from this new conditional control method? The paper sets out to answer both questions efficiently.
Researchers from MIT, Tsinghua University, and NVIDIA introduce Condition-Aware Neural Network (CAN), a new method for adding control to image generative models. CAN controls the image generation process by dynamically manipulating the weights of the neural network. To achieve this, it introduces a condition-aware weight generation module that generates conditional weights for convolution and linear layers based on the input condition. Two insights are critical for CAN. First, making only a subset of modules condition-aware benefits both efficiency and performance. Second, directly generating the conditional weight is much more effective than the alternative of combining a fixed set of base kernels.
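The idea of generating a layer's weight from the condition can be sketched as follows. This is an illustration under stated assumptions, not the authors' module: the class name `ConditionAwareLinear` and its single-projection weight generator (`gen_w`, `gen_b`) are hypothetical, and the article does not describe the actual generator design.

```python
import numpy as np

class ConditionAwareLinear:
    """Linear layer whose weight matrix is generated from the condition."""

    def __init__(self, in_dim, out_dim, cond_dim, rng):
        self.in_dim, self.out_dim = in_dim, out_dim
        # Hypothetical weight generator: one linear projection from the
        # condition embedding to the flattened weight (an assumption).
        self.gen_w = rng.normal(scale=0.02, size=(out_dim * in_dim, cond_dim))
        self.gen_b = rng.normal(scale=0.02, size=(out_dim * in_dim,))

    def generate_weight(self, cond):
        flat = self.gen_w @ cond + self.gen_b
        return flat.reshape(self.out_dim, self.in_dim)

    def __call__(self, x, cond):
        # x: (N, in_dim) -> (N, out_dim), using a condition-specific weight
        return x @ self.generate_weight(cond).T

rng = np.random.default_rng(0)
layer = ConditionAwareLinear(in_dim=6, out_dim=4, cond_dim=3, rng=rng)
x = rng.normal(size=(2, 6))

cond_a = np.array([1.0, 0.0, 0.0])
cond_b = np.array([0.0, 1.0, 0.0])

out_a = layer(x, cond_a)
out_b = layer(x, cond_b)
```

Unlike feature-space control, the input `x` is processed by a different weight matrix for each condition. In line with the first insight, a model would apply such a layer only to a chosen subset of modules and keep the rest static.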
CAN is evaluated on two representative diffusion transformer models, DiT and UViT. It delivers significant performance gains for both while adding negligible computational cost. The main contributions of CAN are:
- It introduces a new mechanism for controlling image generative models and demonstrates that weight manipulation is effective for conditional control.
- With the design insights above, CAN is a conditional control method that can be used in practice. It outperforms prior conditional control methods by a significant margin.
- It benefits the deployment of image generative models: CAN achieves a better FID on ImageNet 512×512 while using 52× fewer MACs per sampling step than DiT-XL/2.
Adaptive Kernel Selection (AKS) is an alternative to directly generating the conditional weight: it maintains a set of base convolution kernels and dynamically generates scaling parameters to combine them. AKS has a smaller parameter overhead than CAN, but it cannot match CAN’s performance. This suggests that dynamic parameterization is not the only key to better performance. CAN is also tested on class-conditional image generation on ImageNet and text-to-image generation on COCO, where it yields significant improvements for diffusion transformer models.
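A rough sketch of the AKS idea, with a toy parameter count for comparison, might look like this. The sizes, the linear `scale_proj`, and the assumption that the direct generator is a single bias-free projection are all illustrative choices made for this sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_base, c_out, c_in, k, cond_dim = 4, 8, 8, 3, 16  # toy sizes (assumed)

# AKS keeps a small bank of static base kernels...
base_kernels = rng.normal(scale=0.02, size=(num_base, c_out, c_in, k, k))
# ...and a hypothetical projection that maps the condition to one
# scaling parameter per base kernel.
scale_proj = rng.normal(scale=0.02, size=(num_base, cond_dim))

def aks_kernel(cond):
    scales = scale_proj @ cond                          # (num_base,)
    # Weighted sum of the base kernels -> (c_out, c_in, k, k)
    return np.tensordot(scales, base_kernels, axes=1)

cond = rng.normal(size=cond_dim)
kernel = aks_kernel(cond)

# Toy parameter counts under the assumptions above.
kernel_size = c_out * c_in * k * k
aks_params = base_kernels.size + scale_proj.size   # kernel bank + scale projection
direct_params = kernel_size * cond_dim             # bias-free linear weight generator
```

With these toy sizes the AKS variant needs fewer parameters than the direct generator, which matches the overhead comparison in the text. The count says nothing about output quality, where the article reports that direct generation wins.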
In conclusion, CAN is a new conditional control method for image generative models. Experiments on class-conditional generation with ImageNet and text-to-image generation with COCO show consistent and significant improvements over prior conditional control methods. In addition, the researchers built a new family of diffusion transformer models by combining CAN with EfficientViT. Future work includes applying CAN to more challenging tasks such as large-scale text-to-image generation and video generation.
All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.