
Computer vision relies heavily on segmentation: the process of determining which pixels in an image belong to a particular object. It has a variety of uses, from scientific image analysis to artistic photography. However, building an accurate segmentation model for a given task typically requires AI training infrastructure, technical expertise, and access to large volumes of carefully annotated in-domain data.
Recent Meta AI research presents a project called “Segment Anything,” an effort to “democratize segmentation” by introducing a new task, dataset, and model for image segmentation. The company’s Segment Anything Model (SAM) is accompanied by the Segment Anything 1 Billion Masks dataset (SA-1B), the largest segmentation dataset to date.
Previously, there were two main categories of strategies for dealing with segmentation problems. The first, interactive segmentation, could segment any object but required a human operator to iteratively refine the mask. The second, automatic segmentation, could segment predefined object categories, but training such a model required significant computational resources, technical expertise, and a large number of manually annotated examples. Neither method provided a general, fully automatic segmentation tool.
SAM covers both of these broader categories of methods. It is a single, unified model that handles both interactive and automatic segmentation tasks. A flexible prompting interface allows the model to be used for a variety of segmentation tasks simply by designing the appropriate prompt. Additionally, SAM was trained on a diverse, high-quality dataset of over 1 billion masks, allowing it to generalize to new types of objects and images. In many cases, this generalization removes the need for practitioners to collect their own segmentation data and fine-tune a model for their use case.
These capabilities allow SAM to transfer to different domains and perform different tasks. Some of the features of SAM are:
- SAM facilitates object segmentation with a single mouse click or interactive selection of points to include and exclude. Bounding boxes can also be used as model prompts.
- In the face of ambiguity about which object is meant, SAM can generate multiple valid masks, an important feature in practical segmentation problems.
- SAM can instantly detect and mask any object in an image.
- After precomputing the image embedding, SAM can generate a segmentation mask for any prompt on the fly, enabling real-time interaction with the model (see the sketch after this list).
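As a minimal sketch of this promptable workflow, the open-source segment-anything package exposes a predictor that takes point and box prompts; the checkpoint path, image file, and coordinates below are placeholders, and the API names (sam_model_registry, SamPredictor) are assumed from the project’s GitHub repository.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (model type and path are placeholders).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# One-time, heavyweight step: compute the image embedding.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Lightweight step: prompt with a single foreground click.
# Label 1 means "include this point"; label 0 means "exclude it".
point = np.array([[500, 375]])
label = np.array([1])
masks, scores, logits = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,  # return several candidate masks for ambiguous prompts
)

# A bounding box can also serve as a prompt, alone or alongside points.
box = np.array([400, 300, 700, 500])  # x1, y1, x2, y2 (placeholder coordinates)
box_masks, _, _ = predictor.predict(box=box, multimask_output=False)
```

With `multimask_output=True`, the predictor returns a few candidate masks with quality scores, which is how an ambiguous single click (for example, on a shirt that could mean the shirt or the person wearing it) can still yield a valid result.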
The team needed a large and diverse dataset to train the model, and that data was collected using SAM itself. Annotators used SAM to interactively annotate images, and the newly annotated data was then used to update SAM. The researchers ran this loop many times to iteratively refine both the model and the dataset.
New segmentation masks can be collected at high speed using SAM. With the team’s tools, interactive mask annotation takes only about 14 seconds per mask. This is 6.5x faster than COCO’s fully manual polygon-based mask annotation and 2x faster than the previous largest data annotation effort, which was also model-assisted.
The billion-mask dataset could not have been built from interactively annotated masks alone. As a result, the researchers developed a data engine to collect the SA-1B data. This data engine has three “gears.” In the first, the model assists human annotators. The second gear combines fully automatic annotation with human assistance to broaden the range of masks collected. Finally, fully automatic mask generation allows the dataset to scale.
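As an illustration of the fully automatic stage, the segment-anything repository provides an automatic mask generator that proposes masks over a grid of point prompts; this is a sketch rather than Meta’s exact data-engine pipeline, and the checkpoint and image paths are placeholders.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load SAM (checkpoint path is a placeholder) and build the automatic generator.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# Generate masks for everything the model finds in the image.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each entry is a dict containing the binary mask plus quality metadata
# (area, bounding box, predicted IoU, stability score) that can be used
# to filter low-quality proposals before adding them to a dataset.
for m in masks[:3]:
    print(m["area"], m["bbox"], m["predicted_iou"], m["stability_score"])
```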
The final dataset contains over 11 million licensed, privacy-protecting images and 1.1 billion segmentation masks. A human evaluation study confirmed that the SA-1B masks are high quality and diverse, comparable in quality to masks from previous, much smaller, manually annotated datasets. SA-1B contains 400x more masks than any existing segmentation dataset.
Researchers trained SAM to produce accurate segmentation masks in response to a variety of inputs, including foreground/background points, rough boxes or masks, and free-form text. They observed that the pre-training task and interactive data collection imposed certain constraints on model design: for annotators to use SAM effectively during annotation, the model needs to run in real time on a CPU in a web browser.
A heavyweight image encoder produces a one-time embedding of the image, while a lightweight prompt encoder turns any prompt into an embedding vector on the fly. A lightweight mask decoder then combines these two sources of information to predict a segmentation mask. Once the image embedding has been computed, SAM can respond to any prompt in a web browser in roughly 50 milliseconds.
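The split between the one-time image encoder and the per-prompt encoder/decoder can be seen directly in the predictor API. The following is a rough timing sketch, assuming the segment-anything package and a placeholder checkpoint; actual timings depend on hardware and will not match the 50 ms browser figure, which relies on an optimized decoder.

```python
import time
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoint; the smaller vit_b variant keeps the example light.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Heavy, one-time step: the image encoder computes the embedding.
t0 = time.perf_counter()
predictor.set_image(image)
print(f"image embedding: {time.perf_counter() - t0:.2f}s")

# Light, repeated step: prompt encoder + mask decoder run once per prompt.
for x, y in [(100, 200), (300, 150), (450, 400)]:  # example click coordinates
    t0 = time.perf_counter()
    masks, _, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
        multimask_output=False,
    )
    print(f"prompt at ({x}, {y}): {time.perf_counter() - t0:.3f}s")
```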
SAM may enable future applications in a variety of areas where arbitrary objects need to be located and segmented in arbitrary images. For example, SAM could be integrated into larger AI systems for general multimodal understanding of the world, such as understanding both the visual and textual content of a web page.
Check out the paper, demo, blog, and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 18k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, cool AI projects, and more.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the scope of artificial intelligence applications in various fields. Her passion lies in exploring new advancements in technology and their practical applications.