Advancements in generative models for text-to-image (T2I) synthesis have been dramatic, and text-to-video (T2V) systems have recently made significant strides, enabling the automatic generation of videos from textual prompt descriptions. A primary challenge in video synthesis is the extensive memory and training data required. Methods built on the pre-trained Stable Diffusion (SD) model have been proposed to address these efficiency issues in T2V synthesis.
These approaches tackle the problem from several perspectives, including fine-tuning and zero-shot learning. However, text prompts alone offer limited control over the spatial layout and trajectories of objects in the generated video. Existing work has approached this problem by supplying low-level control signals, e.g., Canny edge maps or tracked skeletons that guide the objects in the video via ControlNet (Zhang and Agrawala). These methods achieve good controllability but require considerable effort to produce the control signal.
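To give a sense of that effort, below is a hedged sketch (assuming the Hugging Face diffusers and OpenCV APIs) of the kind of low-level conditioning such prior work relies on: a Canny edge map fed to a ControlNet-guided Stable Diffusion pipeline. For video, a control image like this would be needed for every frame. The input file name and Canny thresholds are placeholders.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Build the low-level control signal: a 3-channel Canny edge map from a reference image.
image = np.array(Image.open("reference.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Load a Canny-conditioned ControlNet and attach it to Stable Diffusion.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Generate a single frame that follows the edge map.
result = pipe(
    "a cat walking across a lawn", image=control_image, num_inference_steps=30
).images[0]
result.save("controlled_frame.png")
```

Producing and tracking such edge maps or skeletons over dozens of frames is exactly the burden the work described next aims to remove.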
Capturing the desired motion of an animal or an expensive object would be quite difficult, and sketching the desired movement frame by frame would be tedious. To address the needs of casual users, researchers at NVIDIA Research introduce a high-level interface for controlling object trajectories in synthesized videos. Users only need to provide bounding boxes (bboxes) specifying the desired position of an object at several points in the video, together with text prompts describing the object at the corresponding times.
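Concretely, the user input described above amounts to a handful of keyframes, each pairing a frame index with a bounding box and a prompt. The structure below is a hypothetical illustration of such an interface, not the authors' actual format; field names and the normalized (x0, y0, x1, y1) box convention are assumptions.

```python
# Hypothetical keyframe specification: a few bboxes plus a prompt per keyframe.
scene = {
    "num_frames": 24,
    "keyframes": [
        {"frame": 0,  "bbox": (0.05, 0.40, 0.35, 0.80), "prompt": "a cat sitting on the grass"},
        {"frame": 23, "bbox": (0.60, 0.45, 0.80, 0.70), "prompt": "a cat running across the grass"},
    ],
}
```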
Their strategy involves editing spatial and temporal attention maps for a specific object during the initial denoising diffusion steps to concentrate activation at the desired object location. Their inference-time editing approach achieves this without disrupting the learned text-image association in the pre-trained model and requires minimal code modifications.
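The snippet below is a minimal sketch of how such attention editing can work in principle, not the authors' released code: for the text tokens that refer to the subject, the pre-softmax cross-attention scores are biased toward the user's bounding box during the first few denoising steps. The names (`edit_cross_attention`, `subject_token_ids`, `bbox`, `strength`) and the square-latent-grid assumption are illustrative.

```python
import torch

def bbox_mask(h: int, w: int, bbox, device) -> torch.Tensor:
    """Binary mask over an h x w attention grid; bbox = (x0, y0, x1, y1) in [0, 1]."""
    x0, y0, x1, y1 = bbox
    mask = torch.zeros(h, w, device=device)
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def edit_cross_attention(attn_scores, subject_token_ids, bbox, strength=5.0):
    """
    attn_scores: (batch, h*w, num_text_tokens) pre-softmax cross-attention logits
    for one frame. Adds a positive bias inside the bbox and a negative bias
    outside it, but only for the columns of the subject's text tokens.
    """
    b, hw, n_tokens = attn_scores.shape
    side = int(hw ** 0.5)                      # assume a square latent grid
    mask = bbox_mask(side, side, bbox, attn_scores.device).reshape(1, hw, 1)
    bias = strength * (mask - 0.5) * 2.0       # +strength inside the box, -strength outside
    edited = attn_scores.clone()
    edited[:, :, subject_token_ids] += bias    # broadcast over the subject token columns
    return edited
```

In practice, an edit like this would be registered as a custom attention processor in the denoising U-Net and disabled after the early steps, so the remaining steps refine appearance without the positional bias.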
Their approach enables users to position the subject by keyframing its bounding box. The bbox size can be similarly controlled, thereby producing perspective effects. Finally, users can also keyframe the text prompt to influence the subject’s behavior in the synthesized video.
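As an illustration of how keyframed boxes could drive per-frame guidance, the sketch below linearly interpolates a few bounding-box keyframes across the clip; shrinking the box over time gives a simple perspective-like effect in which the subject appears to recede. The box format and function name are assumptions, not the paper's API.

```python
def interpolate_bboxes(keyframes, num_frames):
    """keyframes: dict {frame_index: (x0, y0, x1, y1)} in [0, 1] coordinates, at least two entries."""
    frames = sorted(keyframes)
    boxes = []
    for f in range(num_frames):
        # Clamp frames outside the keyframed range to the nearest keyframe.
        if f <= frames[0]:
            boxes.append(keyframes[frames[0]])
            continue
        if f >= frames[-1]:
            boxes.append(keyframes[frames[-1]])
            continue
        # Find the surrounding keyframes and interpolate linearly between them.
        for a, b in zip(frames, frames[1:]):
            if a <= f <= b:
                t = (f - a) / (b - a)
                boxes.append(tuple((1 - t) * ka + t * kb
                                   for ka, kb in zip(keyframes[a], keyframes[b])))
                break
    return boxes

# Example: a subject moving left to right while shrinking (receding into the scene).
boxes = interpolate_bboxes({0: (0.05, 0.40, 0.35, 0.80), 23: (0.60, 0.45, 0.80, 0.70)}, 24)
```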
By animating bounding boxes and prompts through keyframes, users can modify the trajectory and basic behavior of the subject over time. This facilitates seamless integration of the resulting subject(s) into a specified environment, providing an accessible video storytelling tool for casual users.
Their approach demands no model finetuning, training, or online optimization, ensuring computational efficiency and an excellent user experience. Lastly, their method produces natural outcomes, automatically incorporating desirable effects like perspective, accurate object motion, and interactions between objects and their environment.
However, their method inherits common failure cases from the underlying diffusion model, including challenges with deformed objects and difficulties generating multiple objects with accurate attributes like color.
Check out the Paper and Project page for more details. All credit for this research goes to the researchers of this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Integrated MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive advances in technology. He is passionate about understanding nature at a fundamental level with the help of tools such as mathematical models, ML models, and AI.