Three-dimensional (3D) tracking from monocular RGB videos is a cutting-edge field in computer vision and artificial intelligence. It focuses on estimating the 3D positions and motions of objects or scenes using only a single two-dimensional video feed.
Existing methods for 3D tracking from monocular RGB videos primarily focus on articulated and rigid objects, such as two hands or humans interacting with rigid environments. The challenge of modeling dense, non-rigid object deformations, such as hand-face interaction, has largely been overlooked. However, these deformations can significantly enhance the realism of applications like AR/VR, 3D virtual avatar communication, and character animations. The limited attention to this issue is attributed to the inherent complexity of the monocular view setup and associated difficulties, such as acquiring appropriate training and evaluation datasets and determining reasonable non-uniform stiffness for deformable objects.
Therefore, this article covers a novel method that tackles the aforementioned fundamental challenges. It enables the tracking of human hands interacting with human faces in 3D from single monocular RGB videos. The method models hands as articulated objects that induce non-rigid facial deformations during active interactions. An overview of the technique is shown in the figure below.
This approach relies on a newly created dataset capturing hand-face motion and interaction, including realistic face deformations. In making this dataset, the authors employ position-based dynamics to process the raw 3D shapes and develop a technique for estimating the non-uniform stiffness of head tissues. These steps result in credible annotations of surface deformations, hand-face contact regions, and head-hand positions.
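To make the role of the simulator concrete, below is a minimal position-based dynamics sketch in Python: per-edge stiffness values (non-uniform for soft facial tissue) scale how strongly each distance constraint pulls vertices back to their rest length. The function and variable names are illustrative and not taken from the Decaf code base.

```python
# Minimal position-based dynamics (PBD) sketch: project distance constraints
# with per-edge stiffness, the kind of solver non-uniform tissue stiffness
# values would be plugged into. Names are illustrative, not from Decaf.
import numpy as np

def project_distance_constraints(positions, edges, rest_lengths, stiffness, n_iters=10):
    """positions    : (V, 3) vertex positions being simulated
       edges        : (E, 2) vertex index pairs
       rest_lengths : (E,) rest length of every edge
       stiffness    : (E,) per-edge stiffness in [0, 1] (non-uniform for soft tissue)"""
    pos = positions.copy()
    # PBD applies stiffness as k' = 1 - (1 - k)^(1/n) so its effect does not
    # depend on the number of solver iterations.
    k_iter = 1.0 - (1.0 - stiffness) ** (1.0 / n_iters)
    for _ in range(n_iters):
        for (i, j), d0, k in zip(edges, rest_lengths, k_iter):
            delta = pos[i] - pos[j]
            dist = np.linalg.norm(delta)
            if dist < 1e-8:
                continue
            # Constraint C = |p_i - p_j| - d0; split the correction equally
            # (unit masses assumed) and scale by the edge stiffness.
            corr = k * 0.5 * (dist - d0) * delta / dist
            pos[i] -= corr
            pos[j] += corr
    return pos
```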
At the heart of their neural approach is a variational auto-encoder that provides a depth prior for the hand-face interaction. Additional modules guide the 3D tracking process by estimating contacts and deformations. Quantitative and qualitative evaluations show that the final 3D reconstructions of hands and faces produced by this method are both realistic and more plausible than those of several baseline methods applicable in this context.
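As a rough illustration of what such a depth module could look like, here is a small PyTorch-style variational auto-encoder that maps image features to a relative hand-face depth; the layer sizes and interfaces are assumptions made for this sketch, not the authors' exact architecture.

```python
# Hedged sketch of a VAE that learns a prior over the relative hand-face depth
# from image features; feature dimensions and outputs are assumptions.
import torch
import torch.nn as nn

class DepthPriorVAE(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),          # relative hand-to-face depth
        )

    def forward(self, image_features):
        h = self.encoder(image_features)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        depth = self.decoder(z)
        # KL term regularizes the latent space toward a unit Gaussian prior
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return depth, kl
```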
Reconstructing both hands and the face simultaneously, considering the surface deformations resulting from their interactions, poses a notably challenging task. This becomes especially crucial for enhancing realism in reconstructions since such interactions are frequently observed in everyday life and significantly influence the impressions others form of an individual. Consequently, reconstructing hand-face interactions is vital in applications like avatar communication, virtual/augmented reality, and character animation, where lifelike facial movements are essential for creating immersive experiences. It also has implications for applications such as sign language transcription and driver drowsiness monitoring.
Despite various studies focusing on the reconstruction of face and hand motions, capturing the interactions between them, along with the corresponding deformations, from a monocular RGB video has remained largely unexplored, as noted by Tretschk et al. in 2023. At the same time, applying existing template-based methods to hand and face reconstruction often leads to artifacts such as collisions and the omission of interactions and deformations. This is primarily due to the inherent depth ambiguity of monocular setups and the absence of deformation modeling in the reconstruction process.
Several significant challenges are associated with this problem. One challenge (I) is the absence of a markerless RGB capture dataset for face and hand interactions with non-rigid deformations, which is essential for training models and evaluating methods. Creating such a dataset is highly challenging due to frequent occlusions caused by hand and head movements, particularly in regions where non-rigid deformation occurs. Another challenge (II) arises from the inherent depth ambiguity of single-view RGB setups, making it difficult to obtain accurate localization information and resulting in errors like collisions or a lack of contact between the hand and head during interactions.
To address these challenges, the authors introduce “Decaf” (short for deformation capture of faces interacting with hands), a monocular RGB method designed to capture face and hand interactions along with facial deformations. Specifically, they propose a solution that combines a multiview capture setup with a position-based dynamics simulator to reconstruct the interacting surface geometry, even in the presence of occlusions. To incorporate the deformable object simulator, they determine the stiffness values of a head mesh using a method called “skull-skin distance” (SSD), which assigns non-uniform stiffness to the mesh. This approach significantly enhances the qualitative plausibility of the reconstructed geometry compared to using uniform stiffness values.
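The snippet below sketches one plausible reading of the skull-skin distance idea: skin vertices close to the underlying skull receive high stiffness, while fleshier regions such as the cheeks receive lower stiffness. The specific distance-to-stiffness mapping is an assumption; only the overall idea follows the paper.

```python
# Illustrative "skull-skin distance" (SSD) stiffness assignment: the closer a
# skin vertex lies to the skull, the stiffer it is. The mapping below is an
# assumption, not the paper's formula.
import numpy as np
from scipy.spatial import cKDTree

def ssd_stiffness(skin_vertices, skull_vertices, k_min=0.2, k_max=1.0):
    """Map each skin vertex's distance to the skull onto a stiffness value."""
    dists, _ = cKDTree(skull_vertices).query(skin_vertices)        # per-vertex SSD
    d_norm = (dists - dists.min()) / (dists.max() - dists.min() + 1e-8)
    return k_max - (k_max - k_min) * d_norm                         # far from bone => softer
```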
Using their newly created dataset, the researchers train neural networks to extract 3D surface deformations, contact regions on the head and hand surfaces, and an interaction depth prior from single-view RGB images. In the final optimization stage, this information from various sources is utilized to obtain realistic 3D hand and face interactions with non-rigid surface deformations, resolving the depth ambiguity inherent in the single-view setup. The results illustrated below demonstrate much more plausible hand-face interactions compared to existing approaches.
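The final optimization can be pictured as minimizing a weighted sum of a keypoint reprojection term, a contact term, a deformation term, and a depth-prior term. The sketch below shows that structure with placeholder names and weights; it is not the paper's exact objective.

```python
# Schematic fitting objective: match 2D keypoints while respecting the
# predicted contacts, deformations, and depth prior. Term names and weights
# are placeholders; only the overall structure follows the paper.
import torch

def fitting_loss(pred, targets, w_kpt=1.0, w_contact=0.1, w_deform=0.1, w_depth=0.1):
    # Reprojection of model keypoints against 2D detections
    l_kpt = torch.nn.functional.mse_loss(pred["keypoints_2d"], targets["keypoints_2d"])
    # Encourage estimated contact vertices on hand and face to actually touch
    l_contact = (pred["hand_contact_pts"] - pred["face_contact_pts"]).norm(dim=-1).mean()
    # Keep the reconstructed surface deformation close to the network prediction
    l_deform = torch.nn.functional.mse_loss(pred["deformation"], targets["deformation_pred"])
    # Keep the relative hand-face depth close to the VAE prior
    l_depth = (pred["relative_depth"] - targets["depth_prior"]).abs().mean()
    return w_kpt * l_kpt + w_contact * l_contact + w_deform * l_deform + w_depth * l_depth
```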
This article summarized Decaf, a novel AI framework designed to capture face and hand interactions along with facial deformations. If you are interested and want to learn more, please refer to the links cited below.
Check out the Paper and Project. All credit for this research goes to the researchers on this project.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.