Scene representation is the process of capturing and encoding information about a visual scene, typically in the context of computer vision, artificial intelligence, or graphics. It involves creating a structured or abstract description of the elements and attributes present in a scene, including objects, their positions, sizes, colors, and relationships. Robots must build these representations online from onboard sensors as they navigate an environment.
These representations must be scalable and efficient with respect to both the volume of the scene and the duration of the robot’s operation. They should also be open-vocabulary: not limited to a predefined set of concepts seen during training, but able to handle new objects and concepts at inference time. Finally, they must be flexible enough to support planning over a range of tasks, capturing both dense geometric information and abstract semantic information.
To meet these requirements, researchers at the University of Toronto, MIT, and the University of Montreal propose ConceptGraphs, a 3D scene representation method for robot perception and planning. Obtaining 3D scene representations by training foundation models directly on 3D data would require internet-scale training data, and existing 3D datasets are not yet of comparable size.
Existing approaches instead assign a redundant semantic feature vector to every point, which consumes more memory than necessary and limits scalability to large scenes. These dense representations are also difficult to decompose and cannot be dynamically updated as the map changes. The method developed by the team, in contrast, describes scenes efficiently with a graph structure whose nodes represent objects, and it can be built into real-time systems that construct hierarchical 3D scene representations.
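For illustration, a minimal Python sketch of what such an object-centric scene-graph node could look like is shown below. The field names and types are assumptions made for exposition, not the actual ConceptGraphs data structures.

```python
from dataclasses import dataclass, field
import numpy as np

# Hypothetical sketch of an object-centric scene-graph node. Instead of storing
# a semantic feature vector at every 3D point, each detected object keeps a
# single fused feature, its accumulated point cloud, and a natural-language tag.
@dataclass
class ObjectNode:
    node_id: int
    points: np.ndarray            # (N, 3) accumulated 3D points for this object
    semantic_feature: np.ndarray  # (D,) one fused embedding per object, not per point
    caption: str = ""             # natural-language description, added later by an LVLM
    num_detections: int = 1       # how many views have observed this object so far

@dataclass
class SceneGraph:
    nodes: dict[int, ObjectNode] = field(default_factory=dict)
    # Edges store inferred spatial/semantic relations, e.g. (3, 7, "on top of").
    edges: list[tuple[int, int, str]] = field(default_factory=list)
```

Storing one feature per object rather than one per point is what keeps the memory footprint roughly proportional to the number of objects, not the number of points.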
ConceptGraphs is an object-centric mapping system that integrates geometric data from 3D mapping systems with semantic data from 2D foundation models. By grounding the 2D representations produced by image and language foundation models in the 3D world, it achieves impressive results on open-vocabulary tasks, including language-guided object grounding, 3D reasoning, and navigation.
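As a rough illustration of how a 2D segment can be grounded in 3D, the sketch below unprojects the pixels of a segmentation mask through the camera intrinsics using the depth image. The function name and signature are hypothetical; the full system would also transform points into a world frame and fuse them across views.

```python
import numpy as np

def backproject_mask(depth: np.ndarray, mask: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift the pixels of a 2D segment into 3D camera coordinates.

    A minimal sketch of grounding a 2D foundation-model segment in 3D: every
    masked pixel with valid depth is unprojected through the intrinsics K,
    yielding the object's partial point cloud for this view.
    """
    v, u = np.nonzero(mask)              # pixel coordinates inside the segment
    z = depth[v, u]
    valid = z > 0                        # drop pixels with missing depth
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - K[0, 2]) * z / K[0, 0]      # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]      # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=-1)  # (N, 3) points in the camera frame
```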
ConceptGraphs efficiently constructs open-vocabulary 3D scene graphs: structured semantic abstractions for perception and planning. The team also deployed ConceptGraphs on real-world wheeled and legged robotic platforms and demonstrated that those robots can perform task planning from abstract language queries with ease.
Given RGB-D frames, the pipeline runs a class-agnostic segmentation model to obtain candidate objects, associates them across multiple views using geometric and semantic similarity measures, and instantiates nodes in a 3D scene graph. An LVLM then captions each node, and an LLM infers relationships between adjoining nodes, building the edges of the scene graph.
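A simplified sketch of the multi-view association step is given below, assuming each existing node stores a point cloud and a semantic feature. The similarity weights, thresholds, and helper names are illustrative assumptions, not the paper’s exact formulation.

```python
import numpy as np

# Hypothetical sketch of per-frame association: each new detection is compared
# to every existing node with a semantic score (cosine similarity of features)
# and a geometric score (fraction of the detection's points lying near the
# node's points). It is merged with the best match above a threshold; otherwise
# a new node is instantiated.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def geometric_overlap(pts_a: np.ndarray, pts_b: np.ndarray, radius: float = 0.03) -> float:
    # Naive O(Na * Nb) nearest-neighbor check; fine for a sketch, not for scale.
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)  # (Na, Nb)
    return float((d.min(axis=1) < radius).mean())

def associate(detection_pts, detection_feat, nodes, sim_threshold: float = 0.6):
    """Return the index of the best-matching node, or None if a new node is needed."""
    best_idx, best_score = None, sim_threshold
    for idx, (node_pts, node_feat) in enumerate(nodes):
        score = (0.5 * cosine(detection_feat, node_feat)
                 + 0.5 * geometric_overlap(detection_pts, node_pts))
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx
```

When a detection matches an existing node, its points and feature would be fused into that node; otherwise a fresh node is added to the graph, which is what keeps the map object-centric and incrementally updatable.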
Researchers say that future work will involve integrating temporal dynamics into the model and assessing its performance in less structured and more challenging environments. Finally, their model addresses key limitations in the existing landscape of dense and implicit representations.
Check out the Paper, GitHub, and Project. All credit for this research goes to the researchers on this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Int. MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive technological advancement. He is passionate about understanding nature at a fundamental level with the help of tools like mathematical models, ML models, and AI.