Transformer-based models have sparked a new transformation in computer vision, and segmentation is no exception. Meta’s Segment Anything Model (SAM) has become a benchmark thanks to its robust performance, and supervised segmentation continues to gain ground in fields such as medicine, defense, and industry. However, SAM still depends on manual labeling, which makes training hard to scale. Human annotation is cumbersome, unreliable for sensitive applications, and both expensive and time-consuming. It also forces a trade-off between accuracy and scalability, capping how far the architecture’s potential can be exploited. SA-1B, despite its enormous size, contains just 11 million images, and its labels carry annotator bias. These issues call for a label-free approach that delivers performance on par with SAM at a much lower cost.
Active strides have been made in this direction with self-supervised transformers and prompt-enabled zero-shot segmentation. TokenCut and LOST were early efforts, followed by CutLER, which generates high-quality pseudo masks for multiple instances and then trains on those masks. VideoCutLER extended this functionality to videos.
Researchers at UC Berkeley developed Unsupervised SAM (UnSAM), a novel unsupervised method to address this challenge. UnSAM uses a divide-and-conquer strategy to identify the hierarchical structure of visual scenes and produce segmentation masks at varying levels of granularity. Its top-down and bottom-up clustering strategies generate masks that capture even the subtlest details, rivaling the human-annotated labels of SA-1B. Used as pseudo ground-truth labels, these masks enable both whole-image and interactive (promptable) segmentation that captures minutiae better than SAM.
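To make the divide-and-conquer idea concrete, here is a minimal, hypothetical Python sketch: a top-down “divide” pass splits per-pixel features into coarse regions, and a bottom-up “conquer” pass clusters the pixels inside each region at several granularities, so a single image yields a hierarchy of masks. The k-means and agglomerative clustering here are stand-ins chosen only to keep the example self-contained; UnSAM’s actual pipeline uses CutLER-style Normalized Cuts and iterative pixel merging, as discussed below.

```python
# Toy divide-and-conquer pseudo-mask generation (illustrative only).
# "Divide" = coarse top-down split; "conquer" = bottom-up merging of pixels
# inside each coarse region at increasing distance thresholds, so every
# threshold level contributes masks of a different granularity.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def divide(features: np.ndarray, num_regions: int = 4) -> np.ndarray:
    """Top-down split of per-pixel features (H*W, dim) into coarse regions."""
    return KMeans(n_clusters=num_regions, n_init=10, random_state=0).fit_predict(features)

def conquer(features: np.ndarray, coarse: np.ndarray, thresholds=(2.0, 4.0, 8.0)):
    """Bottom-up: within each coarse region, merge pixels at growing
    distance thresholds to obtain progressively coarser sub-masks."""
    masks = []
    for region in np.unique(coarse):
        idx = np.flatnonzero(coarse == region)
        for t in thresholds:
            labels = AgglomerativeClustering(
                n_clusters=None, distance_threshold=t
            ).fit_predict(features[idx])
            for part in np.unique(labels):
                mask = np.zeros(len(features), dtype=bool)
                mask[idx[labels == part]] = True
                masks.append(mask)
    return masks

rng = np.random.default_rng(0)
feats = rng.normal(size=(32 * 32, 8))   # stand-in per-pixel features for a 32x32 image
pseudo_masks = conquer(feats, divide(feats))
print(f"{len(pseudo_masks)} multi-granularity pseudo masks")
```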
It is worth discussing CutLER before diving into the nitty-gritty of UnSAM. CutLER introduces a cut-and-learn pipeline built around a normalized-cut-based method, MaskCut, which generates high-quality masks from a patch-wise cosine-similarity matrix computed on features from a self-supervised ViT. MaskCut operates iteratively, masking out the patches of previously segmented instances at each step. This methodology forms the foundation of UnSAM’s divide stage, which leverages CutLER’s Normalized Cuts (NCuts)-based method to obtain semantic- and instance-level masks from unlabeled raw images; a threshold then filters the generated masks to suppress noise. The conquer stage captures the subtleties: within each coarse-grained mask it iteratively merges similar pixels and regions from the bottom up, yielding progressively finer-grained parts, and Non-Maximum Suppression removes the redundancy left after merging. This bottom-up clustering strategy differentiates UnSAM from earlier work and lets it “conquer” other methods while capturing the finest details.
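For illustration, the following sketch (not the authors’ code) shows a MaskCut-style iteration: build a patch-wise cosine-similarity matrix from ViT-like patch features, solve a normalized cut to extract one region, then suppress those patches and repeat. The random stand-in features, the 0.2 similarity threshold, and the fixed number of iterations are assumptions made for this toy example.

```python
# Illustrative MaskCut-style normalized cut over patch cosine similarity.
import numpy as np
from scipy.linalg import eigh

def cosine_similarity(feats: np.ndarray) -> np.ndarray:
    """feats: (num_patches, dim) patch features (random stand-ins here)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

def ncut_bipartition(W: np.ndarray) -> np.ndarray:
    """Solve (D - W)x = lambda * D * x; thresholding the second-smallest
    (Fiedler) eigenvector at its mean gives the normalized-cut bipartition."""
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D, subset_by_index=[1, 1])
    fiedler = vecs[:, 0]
    return fiedler > fiedler.mean()

def maskcut_like(feats: np.ndarray, num_masks: int = 3, tau: float = 0.2):
    """Iteratively cut one region, then mask out its patches before the next cut."""
    sim = cosine_similarity(feats)
    masks, active = [], np.ones(len(feats), dtype=bool)
    for _ in range(num_masks):
        W = np.where(sim > tau, sim, 1e-5)   # sparsify weak affinities
        W[~active] = 1e-5                    # suppress previously segmented patches
        W[:, ~active] = 1e-5
        part = ncut_bipartition(W) & active
        if part.sum() == 0:
            break
        masks.append(part)
        active &= ~part                      # remove patches just segmented
    return masks

feats = np.random.default_rng(0).normal(size=(196, 64))  # toy 14x14 patch grid
print([int(m.sum()) for m in maskcut_like(feats)])
```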
Trained on just 1% of the SA-1B dataset, UnSAM outperformed SAM by over 6.7% in AR and 3.9% in AP on SA-1B. Even against SAM trained on the complete 11M-image dataset, its performance is comparable, differing by a mere 1% despite a training set of only 0.4M images. On average, UnSAM surpassed the previous SOTA by 11.0% in AR, and when evaluated on PartImageNet and PACO it exceeds the SOTA by 16.6% and 12.6%, respectively. Furthermore, UnSAM+, which combines the accuracy of SA-1B ground-truth masks (1% of the dataset) with the intricacy of unsupervised masks, outperforms even SAM by 1.3%, and does so with a backbone three times smaller.
In conclusion, UnSAM proves that high-quality results can be obtained without humongous datasets built through intensive human effort. Small, lightweight architectures could be paired with UnSAM-generated masks to advance sensitive fields like medicine and science. UnSAM may not be the big bang of segmentation, but it shines a light on the field’s wider cosmos, ushering in a new era of research in unsupervised vision learning.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project.
Adeeba Alam Ansari is currently pursuing her Dual Degree at the Indian Institute of Technology (IIT) Kharagpur, earning a B.Tech in Industrial Engineering and an M.Tech in Financial Engineering. With a keen interest in machine learning and artificial intelligence, she is an avid reader and an inquisitive individual. Adeeba firmly believes in the power of technology to empower society and promote welfare through innovative solutions driven by empathy and a deep understanding of real-world challenges.