Text-to-video diffusion models have advanced rapidly. Given only a textual description, users can now create realistic or imaginative videos. These foundation models have also been tuned to generate content matching particular appearances, styles, and subjects. However, customizing motion in text-to-video generation remains largely unexplored. Users may want to create videos with specific motions, such as a car moving forward and then turning left. It is therefore important to adapt diffusion models to produce more specific content that caters to users' preferences.
The authors of this paper propose MotionDirector, which enables foundation models to achieve motion customization while preserving appearance diversity. The technique uses a dual-path architecture that trains the model to learn the appearance and the motion in one or more reference videos separately, making it easier to generalize the customized motion to other settings.
The dual architecture comprises a spatial and a temporal pathway. The spatial path consists of the foundation model with trainable spatial LoRAs (low-rank adaptations) injected into its transformer layers for each video. These spatial LoRAs are trained on a single randomly selected frame at each training step to capture the visual attributes of the input videos. In contrast, the temporal pathway duplicates the foundation model and shares the spatial LoRAs with the spatial path to adapt to the appearance of the given input video. In addition, the temporal transformers in this pathway are equipped with temporal LoRAs, which are trained on multiple frames from the input videos to capture the underlying motion patterns. A minimal sketch of this setup follows below.
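To make the dual-path idea concrete, here is a minimal, hypothetical PyTorch sketch, not the authors' implementation. The names `LoRALinear`, `dual_path_losses`, `backbone`, and `denoising_loss` are placeholders; the real method injects LoRAs into the spatial and temporal transformer layers of a video diffusion model and trains them with its denoising objective, which is abstracted away here.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank residual (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # the foundation model stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)     # residual starts at zero, preserving the base output
        self.scale = alpha / rank
        self.enabled = True                # allows switching LoRA sets on and off

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.scale * self.up(self.down(x))
        return out


def dual_path_losses(backbone, spatial_loras, temporal_loras, video, denoising_loss):
    """One simplified dual-path training step.

    `backbone` is the frozen text-to-video model with LoRA layers injected;
    `denoising_loss` stands in for its diffusion training objective.
    `video` has shape (frames, channels, height, width).
    """
    # Spatial path: a single randomly chosen frame, spatial LoRAs only.
    frame = video[torch.randint(len(video), (1,))]                # shape (1, C, H, W)
    for lora in spatial_loras:
        lora.enabled = True
    for lora in temporal_loras:
        lora.enabled = False
    loss_spatial = denoising_loss(backbone, frame)                # drives the spatial LoRAs

    # Temporal path: the full clip, with the shared spatial LoRAs plus the
    # temporal LoRAs enabled so the motion pattern can be captured.
    for lora in temporal_loras:
        lora.enabled = True
    loss_temporal = denoising_loss(backbone, video.unsqueeze(0))  # drives the temporal LoRAs

    return loss_spatial, loss_temporal
```

In practice, each loss would be backpropagated only into its own LoRA parameter group (for example, via separate optimizers), so that the spatial path shapes appearance and the temporal path shapes motion.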
By deploying only the trained temporal LoRAs, the foundation model can synthesize videos of the learned motions with diverse appearances. Because the dual architecture learns appearance and motion separately, MotionDirector can also isolate the two and recombine appearance and motion taken from different source videos.
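Continuing the hypothetical sketch above, this decoupling can be illustrated by toggling which LoRA sets are active at generation time; the helper below and the commented sampling calls are placeholders, not the authors' API.

```python
def set_lora_state(spatial_loras, temporal_loras, use_spatial, use_temporal):
    """Choose which LoRA sets are active on the frozen backbone at generation time."""
    for lora in spatial_loras:
        lora.enabled = use_spatial
    for lora in temporal_loras:
        lora.enabled = use_temporal

# Motion-only customization: keep the learned motion, let the text prompt
# decide the appearance.
#   set_lora_state(spatial_loras, temporal_loras, use_spatial=False, use_temporal=True)
#   video = sample(backbone, prompt="...")          # hypothetical sampling call
#
# Appearance/motion mixing: spatial LoRAs learned from one reference video
# combined with temporal LoRAs learned from another.
#   set_lora_state(spatial_loras_from_clip_a, temporal_loras_from_clip_b, True, True)
```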
The researchers evaluated MotionDirector on two benchmarks covering more than 80 different motions and 600 text prompts. On the UCF Sports Action benchmark (95 videos and 72 text prompts), human raters preferred MotionDirector for motion fidelity around 75% of the time, compared with roughly 25% for the base models. On the LOVEU-TGVE-2023 benchmark (76 videos and 532 text prompts), MotionDirector outperformed other controllable generation and tuning-based methods. The results show that various base models can be customized with MotionDirector to produce videos that exhibit both diversity and the desired motion concepts.
MotionDirector is a promising new method for adapting text-to-video diffusion models to generate videos with desired motions. It excels at learning and adapting the specific motions of subjects and cameras, and it can be used to generate videos in a wide range of visual styles.
One area where MotionDirector can be improved is learning the motion of multiple subjects in the reference videos. However, even with this limitation, MotionDirector has the potential to enhance flexibility in video generation, allowing users to craft videos tailored to their preferences and requirements.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers on this project.