A team of researchers from Hong Kong has introduced two open-source diffusion models for high-quality video generation. The text-to-video (T2V) model generates cinematic-quality videos from text prompts, outperforming other open-source T2V models, while the image-to-video (I2V) model converts a reference image into a video clip that preserves its content, structure, and style. The researchers expect these models to advance video generation technology in both academia and industry and to serve as valuable resources for researchers and engineers.
Diffusion models (DMs) have excelled at content generation, from text-to-image synthesis to video generation. Video diffusion models (VDMs) such as Make-A-Video and Imagen Video have pushed the field forward, and several open-source T2V models extend the Stable Diffusion (SD) framework with temporal layers to keep frames consistent. However, existing open-source models still suffer from limitations in resolution, quality, and composition. The two models introduced here aim to close that gap, outperforming existing open-source T2V models and advancing technology in the community.
Generative models, particularly diffusion models, have driven rapid progress in image and video generation. Mature open-source text-to-image (T2I) models are widely available, but open-source T2V models remain scarce. The proposed T2V model adds temporal attention layers and uses joint training to keep frames consistent, while the I2V model preserves the content and structure of the reference image. By releasing both models, the researchers aim to strengthen the open-source community and push video generation technology forward.
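To make the temporal-consistency idea concrete, here is a minimal, hypothetical PyTorch sketch of a temporal attention layer of the kind such models insert between spatial layers. The module name and tensor shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Illustrative temporal attention: each spatial location attends
    across the frame axis so the denoiser can keep frames consistent."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs over time only.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        normed = self.norm(tokens)
        tokens = tokens + self.attn(normed, normed, normed, need_weights=False)[0]
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
```

Because the layer only mixes information along the frame axis, it can be added to a pretrained spatial backbone without disturbing its per-frame behaviour.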
The study presents two diffusion models, T2V and I2V. T2V employs a denoising 3D U-Net built from spatial-temporal blocks that combine convolutional layers with spatial and temporal transformers, and dual cross-attention layers align the video features with text and image embeddings. I2V transforms a reference image into a video clip while preserving its content, structure, and style. Both models use a learnable projection network during training. Evaluation relies on metrics for visual quality and for the alignment between the text prompt and the generated video.
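The dual cross-attention with a learnable projection can be pictured roughly as follows. This is a simplified sketch under assumed tensor shapes and module names (DualCrossAttention and img_proj are illustrative), not the released implementation.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Sketch of dual cross-attention: video tokens attend separately to
    text embeddings and to projected image embeddings, and the two
    results are summed back into the tokens."""

    def __init__(self, dim: int, text_dim: int, img_dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.text_attn = nn.MultiheadAttention(
            dim, heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stand-in for the learnable projection network: maps image
        # embeddings into the same space the U-Net tokens live in.
        self.img_proj = nn.Sequential(
            nn.Linear(img_dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, tokens, text_emb, img_emb):
        # tokens:   (B, N, dim)      flattened video features from a U-Net block
        # text_emb: (B, L_text, text_dim) frozen text-encoder outputs
        # img_emb:  (B, L_img, img_dim)   frozen image-encoder outputs
        q = self.norm(tokens)
        out_text, _ = self.text_attn(q, text_emb, text_emb)
        img_ctx = self.img_proj(img_emb)
        out_img, _ = self.img_attn(q, img_ctx, img_ctx)
        return tokens + out_text + out_img
```

Summing the two attention outputs lets the text prompt and the reference image steer the same video tokens without changing the rest of the U-Net block.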
The proposed T2V and I2V models deliver strong results in both video quality and text-video alignment, surpassing other open-source models. T2V produces videos with high visual fidelity, while I2V faithfully converts images into video clips that preserve content, structure, and style. A comparative analysis against models such as Gen-2, Pika Labs, and ModelScope highlights superior performance in visual quality, text-video alignment, temporal consistency, and motion quality.
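Text-video alignment is often approximated by averaging CLIP similarity between the prompt and each generated frame. The snippet below shows that generic proxy using Hugging Face's CLIP; it is a common stand-in, not necessarily the exact evaluation protocol used in the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_text_video_score(frames, prompt, model_name="openai/clip-vit-base-patch32"):
    """Rough text-video alignment score: mean CLIP cosine similarity
    between the prompt and each frame (frames is a list of PIL images)."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings before taking cosine similarities.
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)     # (1, D)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)  # (F, D)
    return (image_emb @ text_emb.T).mean().item()
```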
In conclusion, the recently introduced T2V and I2V models show great promise for advancing video generation in the community. They demonstrate superior video quality and text-video alignment, but there is still room to improve the duration, resolution, and motion quality of the generated videos. With these models now open-sourced, the researchers believe further progress in the field will follow.
In the future, a frame-interpolation model could add frames and extend clip duration beyond 2 seconds, while collaborating with ScaleCrafter or applying spatial upscaling could raise the resolution. Training on higher-quality data would help improve motion and visual quality. Exploring image prompts and image-conditional branches is another promising direction for producing dynamic content with better visual fidelity from the diffusion model.
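As a crude placeholder for a learned frame-interpolation model, the simplest temporal upsampling just blends neighbouring frames. The helper below (upsample_time is a hypothetical name) illustrates the idea in PyTorch; a trained interpolator would produce much sharper motion.

```python
import torch
import torch.nn.functional as F

def upsample_time(video: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Naive temporal upsampling: intermediate frames are linear blends of
    their neighbours. video: (frames, channels, height, width)."""
    # Rearrange to (1, C, T, H, W) so interpolate treats time as a depth axis.
    x = video.permute(1, 0, 2, 3).unsqueeze(0)
    t, h, w = video.shape[0], video.shape[2], video.shape[3]
    x = F.interpolate(x, size=(t * factor, h, w), mode="trilinear", align_corners=False)
    return x.squeeze(0).permute(1, 0, 2, 3)

# Example: a 16-frame clip becomes 32 frames, stretching ~2 s of content to ~4 s
# at the original frame rate, at the cost of softer motion than a learned model.
```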
Check out the Paper, Github, and Project. All credit for this research goes to the researchers of this project.
Hello, My name is Adnan Hassan. I am a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.