The field of video generation has seen remarkable progress with the advent of diffusion transformer (DiT) models, which deliver higher quality than traditional convolutional neural network approaches. This quality, however, comes at a significant cost in computational resources and inference time, limiting the models' practical applications. In response, researchers have developed Pyramid Attention Broadcast (PAB), a method that achieves real-time video generation without compromising output quality.
Current acceleration methods for diffusion models often focus on reducing sampling steps or optimizing network architectures. These approaches, however, frequently require additional training or compromise output quality. Some recent techniques have revisited the concept of caching to speed up diffusion models. Still, these methods are primarily designed for image generation or convolutional architectures, making them less suitable for video DiTs. The unique challenges posed by video generation, including the need for temporal coherence and the interaction of multiple attention mechanisms, necessitate a new approach.
PAB addresses these challenges by targeting redundancy in attention computations during diffusion. The method is based on a key observation: attention differences between adjacent diffusion steps exhibit a U-shaped pattern, with significant stability in the middle 70% of steps. This indicates considerable redundancy in attention computations, which PAB exploits to improve efficiency.
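The U-shaped pattern can be made concrete with a small experiment. The sketch below is illustrative only (not the authors' code): it fabricates per-step "attention outputs" that change a lot early and late in sampling but barely at all in the middle, then measures the mean absolute difference between adjacent steps — the same quantity whose stability PAB exploits.

```python
import random

# Illustrative sketch (not the authors' code): PAB's observation is that
# attention outputs change little between adjacent steps in the middle ~70%
# of the diffusion process. We fake that U-shape with toy data and measure
# the mean absolute difference between adjacent per-step outputs.
def step_differences(outputs):
    """Mean absolute element-wise difference between each pair of adjacent steps."""
    return [
        sum(abs(a - b) for a, b in zip(outputs[i], outputs[i + 1])) / len(outputs[i])
        for i in range(len(outputs) - 1)
    ]

random.seed(0)
base = [random.gauss(0, 1) for _ in range(64)]

def noisy(scale):
    # A perturbed copy of the base output; `scale` controls step-to-step change.
    return [b + scale * random.gauss(0, 1) for b in base]

# Large changes in the early and late steps, tiny changes in the middle.
outputs = [noisy(1.0 if t < 3 or t > 17 else 0.01) for t in range(20)]
diffs = step_differences(outputs)
# diffs is large at both ends and near zero in the middle — a U-shape.
```

On real models the differences would come from actual attention layers, but the diagnostic is the same: once the per-step differences flatten out, recomputing attention at every step is wasted work.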
The Pyramid Attention Broadcast method identifies the stable middle segment of the diffusion process where attention outputs show minimal differences between steps. It then broadcasts attention outputs from certain steps to subsequent steps within this stable segment, eliminating the need for redundant computations. PAB applies varied broadcast ranges for different types of attention based on their stability and differences. Spatial attention, which varies the most due to high-frequency visual elements, receives the smallest broadcast range. Temporal attention, showing mid-frequency variations related to movements, gets a medium range. Cross-attention, being the most stable as it links text with video content, is given the largest broadcast range. Additionally, the researchers introduce a broadcast sequence parallel technique for more efficient distributed inference. This approach significantly decreases generation time and has lower communication costs compared to existing parallelization methods. By leveraging the unique characteristics of PAB, broadcast sequence parallelism enables more efficient, scalable distributed inference for real-time video generation.
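The broadcast logic above can be sketched as a small caching wrapper. Everything here is a hedged illustration under assumed values — the class name, the `BROADCAST_RANGE` numbers, and the stable-segment boundaries are hypothetical, not the paper's actual configuration; the paper only fixes the ordering (spatial smallest, temporal medium, cross-attention largest).

```python
# Hedged sketch of attention broadcasting. Inside the stable segment, an
# attention output is recomputed only once per `broadcast_range` steps; in
# between, the cached result is broadcast (reused). Cross-attention is the
# most stable, so it gets the largest range; spatial the smallest.
BROADCAST_RANGE = {"spatial": 2, "temporal": 4, "cross": 6}  # assumed values

class BroadcastAttention:
    def __init__(self, attn_type, compute_fn, stable_start, stable_end):
        self.range = BROADCAST_RANGE[attn_type]
        self.compute_fn = compute_fn          # the real attention computation
        self.stable = (stable_start, stable_end)
        self.cache = None
        self.recomputed = 0                   # bookkeeping for the demo

    def __call__(self, step, x):
        in_stable = self.stable[0] <= step < self.stable[1]
        # Recompute outside the stable segment, or at the start of each
        # broadcast window; otherwise reuse the cached output.
        if not in_stable or (step - self.stable[0]) % self.range == 0 or self.cache is None:
            self.cache = self.compute_fn(x)
            self.recomputed += 1
        return self.cache

# Toy usage: 50 diffusion steps, stable segment covering the middle ~70%.
stable = (7, 42)
spatial = BroadcastAttention("spatial", lambda x: x * 2, *stable)
cross = BroadcastAttention("cross", lambda x: x * 2, *stable)
for t in range(50):
    spatial(t, 1.0)
    cross(t, 1.0)
# cross.recomputed is well below spatial.recomputed: the more stable the
# attention type, the more steps a single computed output is broadcast to.
```

The design point is that broadcasting is purely an inference-time decision: no weights change, which is why the method is training-free and can be dropped into existing DiT pipelines.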
PAB demonstrates superior results across three state-of-the-art DiT-based video generation models: Open-Sora, Open-Sora-Plan, and Latte. It achieves real-time generation for videos up to 720p resolution, with speedups of up to 10.5x over baseline methods while maintaining output quality. Across these popular open-source video DiTs, the researchers' experiments show that PAB consistently delivers stable speedups by identifying and exploiting redundancies in the attention mechanism. Its ability to reach real-time generation speeds of up to 20.6 FPS for high-resolution videos opens up new possibilities for practical applications of AI video generation. What sets PAB apart is its training-free nature, which makes it immediately applicable to existing models without resource-intensive fine-tuning.
The development of PAB addresses a critical bottleneck in DiT-based video generation, potentially accelerating the adoption of these models in real-world scenarios where speed is crucial. As the demand for high-quality, AI-generated video content continues to grow across industries, techniques like PAB will play a vital role in making these technologies more accessible and practical for everyday use. The researchers anticipate that their simple yet effective method will serve as a robust baseline and facilitate future research and application for video generation, paving the way for more efficient and versatile AI-driven video creation tools.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Shreya Maji is a consulting intern at MarktechPost. She pursued her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. An AI enthusiast, she enjoys staying updated on the latest advancements. Shreya is particularly interested in the real-life applications of cutting-edge technology, especially in the field of data science.