Large-scale annotated datasets have served as a highway for building accurate models across many computer vision tasks. In this study, the researchers aim to provide such a highway for fine-grained long-range tracking. Given any pixel location in any frame of a video, fine-grained long-range tracking aims to follow the corresponding world surface point for as long as possible. There are multiple generations of datasets aimed at fine-grained short-range tracking (e.g., optical flow) and regularly refreshed datasets aimed at various forms of coarse-grained long-range tracking (e.g., single-object tracking, multi-object tracking, video object segmentation), but relatively little work sits at the interface between these two kinds of tracking.
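To make the task definition concrete, here is a minimal sketch of the input/output contract of a long-range point tracker. The function name `track_point` and its signature are illustrative placeholders, not an API from the paper.

```python
import numpy as np

def track_point(video: np.ndarray, query_t: int, query_xy: tuple):
    """Hypothetical interface for fine-grained long-range tracking.

    video:    (T, H, W, 3) array of frames
    query_t:  index of the frame in which the point is selected
    query_xy: (x, y) pixel coordinates of the query point in frame query_t
    returns:  per-frame positions of shape (T, 2) and visibility flags of shape (T,);
              the point may leave the frame or be occluded and later reappear.
    """
    raise NotImplementedError
```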
Fine-grained trackers have so far been trained on unrealistic synthetic data (FlyingThings++ and Kubric-MOVi-E), consisting of random objects moving in unpredictable directions against random backdrops, and evaluated on real-world videos with sparse human annotations (BADJA and TAP-Vid). While it is encouraging that such models generalize to real videos at all, training on such simplistic data prevents them from developing long-range temporal context and scene-level semantic awareness. The authors contend that long-range point tracking should not be treated as a mere extension of optical flow, where naturalism can be abandoned without penalty.
Although a video’s pixels may appear to move somewhat randomly, their paths reflect several modellable factors: camera shake, object-level motions and deformations, and multi-object relationships, including physical and social interactions. Progress depends on acknowledging this complexity, in both data and methodology. To that end, researchers from Stanford University propose PointOdyssey, a large synthetic dataset for training and evaluating long-term fine-grained tracking. Their dataset aims to capture the complexity, diversity, and realism of real-world video, with pixel-perfect annotation that is only attainable through simulation.
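The toy sketch below illustrates how a ground-truth pixel track arises from composing such modellable factors: a point on a moving object projected through a slightly shaking camera. All poses, intrinsics, and motion values are made up for illustration; this is not the paper's rendering pipeline.

```python
import numpy as np

def project(K, R_cam, t_cam, X_world):
    """Pinhole projection of a 3D world point into pixel coordinates."""
    X_cam = R_cam @ X_world + t_cam
    x = K @ X_cam
    return x[:2] / x[2]

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # toy intrinsics
X_obj = np.array([0.1, 0.2, 0.0])                             # point on the object's surface

trajectory = []
for t in range(48):
    # object-level motion: slow translation along x, 4 m in front of the camera
    X_world = X_obj + np.array([0.02 * t, 0.0, 4.0])
    # camera shake: small random jitter around a fixed pose
    t_cam = 0.005 * np.random.randn(3)
    R_cam = np.eye(3)
    trajectory.append(project(K, R_cam, t_cam, X_world))

trajectory = np.stack(trajectory)  # (48, 2) ground-truth pixel track for this point
```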
The work is distinguished from prior synthetic datasets by its use of motions, scene layouts, and camera trajectories mined from real-world videos and motion captures, rather than random or hand-designed ones. The authors also apply domain randomization to various scene attributes, such as environment maps, lighting, human and animal bodies, camera trajectories, and materials. Thanks to the growing availability of high-quality assets and rendering tools, they achieve more photorealism than was previously possible. The motion profiles in their data are derived from large human and animal motion-capture datasets, which they use to generate realistic long-range trajectories for humanoids and other animals in outdoor scenes.
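Below is a hedged sketch of what per-scene domain randomization over those attributes could look like. The attribute lists, asset names, and the `SceneConfig` container are illustrative placeholders, not the actual PointOdyssey asset catalog or generation code.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneConfig:
    environment_map: str
    light_energy: float        # randomized to yield both dark and bright scenes
    character: str             # humanoid or animal body
    texture: str
    camera_trajectory: str     # mined from real footage / mocap, not hand-designed

def sample_scene(rng: random.Random) -> SceneConfig:
    """Sample one randomized scene configuration (placeholder asset names)."""
    return SceneConfig(
        environment_map=rng.choice(["studio_hdr", "sunset_hdr", "forest_hdr"]),
        light_energy=rng.uniform(0.2, 5.0),
        character=rng.choice(["humanoid_07", "dog_02", "cat_01"]),
        texture=rng.choice([f"texture_{i:04d}" for i in range(1000)]),
        camera_trajectory=rng.choice(["capture_012", "capture_341"]),
    )

scene = sample_scene(random.Random(0))
```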
In these outdoor scenes, the actors are paired with 3D objects scattered randomly on the ground plane. The objects respond to the actors according to physics, for example being kicked away when a foot makes contact. For realistic indoor scenarios, the authors use motion captures recorded in indoor settings and manually recreate the capture environments in their simulator, which lets them reproduce the original motions and interactions while preserving the scene-aware character of the data. To provide complex multi-view data of these scenes, they import camera trajectories derived from real footage and attach additional cameras to the synthetic characters' heads. In contrast to the largely random motion patterns of Kubric and FlyingThings, theirs is a capture-driven approach.
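As a rough illustration of the head-mounted camera idea, one way to derive such a camera's per-frame pose is to compose the character's head transform with a fixed local offset. The accessor `head_pose(t)` is a hypothetical stand-in for reading the mocap-driven head joint transform; this is not the authors' tooling.

```python
import numpy as np

def head_mounted_camera(head_pose_4x4: np.ndarray, offset_4x4: np.ndarray) -> np.ndarray:
    """Compose the head's world transform with a fixed local offset to get the
    camera-to-world transform for one frame."""
    return head_pose_4x4 @ offset_4x4

# fixed local offset: camera placed slightly in front of and above the head joint
offset = np.eye(4)
offset[:3, 3] = [0.0, 0.1, 0.05]

# per-frame camera poses then simply follow the captured head motion, e.g.:
# cam_poses = [head_mounted_camera(head_pose(t), offset) for t in range(num_frames)]
```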
The authors hope their data will stimulate tracking techniques that move beyond the conventional reliance on bottom-up cues such as feature matching and instead exploit scene-level cues as strong priors on tracks. The visual diversity of the data comes from a large collection of simulated assets: 42 humanoid shapes with artist-made textures, 7 animals, 1K+ object/background textures, 1K+ objects, 20 original 3D scenes, and 50 environment maps. Scene lighting is randomized to produce both dark and bright scenes, and dynamic fog and smoke effects are added, introducing a form of partial occlusion entirely absent from FlyingThings and Kubric. One of the new challenges PointOdyssey opens up is how to exploit long-range temporal context.
For instance, the state-of-the-art tracking method Persistent Independent Particles (PIPs) uses an 8-frame temporal window. As a first step toward using arbitrarily long temporal context, the authors propose a few modifications to PIPs, including greatly widening its 8-frame temporal scope and adding a template-update mechanism (sketched below). Experimental results show that their method outperforms all baselines in tracking accuracy, both on the PointOdyssey test set and on real-world benchmarks. In summary, the main contribution of this study is PointOdyssey, a large synthetic dataset for long-term point tracking that aims to reflect the difficulties, and the opportunities, of real-world fine-grained tracking.
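The following schematic sketches the general idea of covering a long video with overlapping temporal windows while refreshing the point's appearance template as tracking proceeds. It is not the authors' implementation; `pips_window` and `extract_feature` are hypothetical stand-ins for an 8-frame PIPs-style tracker and a feature extractor, and the query point is assumed to be given in the first frame.

```python
def track_long_range(frames, query_xy, window=8):
    """Chain short-window tracking across a long video with template updates."""
    template = extract_feature(frames[0], query_xy)   # initial appearance template
    position = query_xy
    track = [position]
    for start in range(0, len(frames) - 1, window - 1):
        chunk = frames[start:start + window]          # windows overlap by one frame
        positions, visibles = pips_window(chunk, position, template)
        track.extend(positions[1:])                   # first frame of the chunk is already tracked
        position = positions[-1]
        if visibles[-1]:
            # template update: refresh the appearance only when the point is
            # confidently visible, so occlusions do not corrupt the template
            template = extract_feature(chunk[-1], position)
    return track
```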
Check out the Paper, Project, and Dataset. All credit for this research goes to the researchers on this project.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.