FastSAM for Image Segmentation Tasks – A Brief Explanation



Segmentation is a popular task in computer vision. Its goal is to divide an input image into multiple regions, where each region represents a separate object.

Some classic approaches from the past involve taking model backbones (such as U-Net) and fine-tuning them on specialized datasets. While fine-tuning works well, the advent of GPT-2 and GPT-3 prompted the machine learning community to gradually shift its focus toward zero-shot learning solutions.

Zero-shot learning refers to the ability of a model to perform a task without having explicitly received training examples for it.

The zero-shot concept plays a key role here, as it allows you to skip the fine-tuning phase entirely.

In the context of computer vision, Meta released the widely known "Segment Anything Model" (SAM) in 2023, which made it possible to perform segmentation tasks in a zero-shot manner with decent quality.


SAM's results were impressive, but a few months later the Image and Video Analysis (CASIA-IVA) group from the Chinese Academy of Sciences released the FastSAM model. As the adjective "fast" suggests, FastSAM addresses SAM's speed limitations by accelerating inference up to 50 times while maintaining high segmentation quality.

In this article, we will explore the FastSAM architecture, its possible inference options, and what makes it "fast" compared to the standard SAM model. Code examples are included along the way to help solidify your understanding.

As a prerequisite, we highly recommend being familiar with the basics of computer vision and YOLO models, and understanding the goal of segmentation tasks.

Architecture

The FastSAM inference process takes place in two steps:

  1. All-instance segmentation. The goal is to produce a segmentation mask for every object in the image.
  2. Prompt-guided selection. After all possible masks are obtained, prompt-guided selection returns the image region corresponding to the input prompt.
FastSAM inference takes place in two steps. Once the segmentation masks are obtained, prompt-guided selection filters and merges them into the final mask.

Let's start with all-instance segmentation.

All-Instance Segmentation

Before examining the architecture visually, let's refer to the original paper:

"The FastSAM architecture is based on YOLOv8-seg, an object detector equipped with an instance segmentation branch, which utilizes the YOLACT method" – Fast Segment Anything paper

This definition may seem complicated to those who are not familiar with YOLOv8-seg and YOLACT. To make the meaning behind these two models clearer, let's build a simple intuition about what they are and how they are used.

YOLACT (You Only Look At CoefficienTs)

YOLACT is a real-time convolutional instance segmentation model inspired by YOLO. It focuses on fast detection and achieves performance comparable to the Mask R-CNN model.

YOLACT consists of two main modules (branches):

  1. Prototype branch. YOLACT creates a set of segmentation masks called prototypes.
  2. Prediction branch. YOLACT performs object detection by predicting bounding boxes and estimating mask coefficients, which indicate how to linearly combine the prototypes to create the final mask for each object.
YOLACT architecture: yellow blocks indicate trainable parameters, gray blocks indicate non-trainable parameters. The number of mask prototypes in the image is k = 4. Source: YOLACT: Real-time Instance Segmentation. Adapted by the author.

To extract initial features from the image, YOLACT uses a ResNet backbone followed by a Feature Pyramid Network (FPN) to obtain multi-scale features. Each P-level (shown in the image) processes features of a different size using convolutions (for example, P3 contains the smallest-scale features, while P7 captures high-level image features). This approach helps the model handle objects at different scales.

YOLOv8-seg

YOLOv8-seg is based on YOLACT and incorporates the same principles regarding prototypes. It also has two heads:

  1. Detection head. Used to predict bounding boxes and classes.
  2. Segmentation head. Used to generate masks and combine them.

The key difference is that YOLOv8-seg uses the YOLO backbone architecture instead of the ResNet backbone and FPN used in YOLACT. This makes YOLOv8-seg lighter and faster during inference.

Both YOLACT and YOLOv8-seg use k = 32 prototypes by default, which is a tunable hyperparameter. In most scenarios, this provides a good trade-off between speed and segmentation performance.

For both models, a vector of size k = 32 is predicted for every detected object, representing the weights of the mask prototypes. These weights are used to linearly combine the prototypes into the final mask of the object.
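This linear combination can be illustrated with a minimal numpy sketch (shapes, names, and values here are illustrative assumptions, not the actual YOLACT or YOLOv8-seg code):

```python
import numpy as np

def combine_prototypes(prototypes, coefficients):
    """Linearly combine k prototype masks using per-object coefficients.

    prototypes:   (k, H, W) array of prototype masks.
    coefficients: (num_objects, k) array of predicted mask coefficients.
    Returns a (num_objects, H, W) array of soft masks in [0, 1].
    """
    k, h, w = prototypes.shape
    # Each object's mask is a weighted sum of the k prototypes.
    combined = coefficients @ prototypes.reshape(k, h * w)
    # A sigmoid squashes the result into a soft mask.
    masks = 1.0 / (1.0 + np.exp(-combined))
    return masks.reshape(-1, h, w)

# Example: k = 32 prototypes of size 160x160, 3 detected objects.
prototypes = np.random.randn(32, 160, 160)
coefficients = np.random.randn(3, 32)
masks = combine_prototypes(prototypes, coefficients)
print(masks.shape)  # (3, 160, 160)
```

In practice the soft masks would then be thresholded and cropped to the predicted bounding boxes.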

FastSAM Architecture

FastSAM's architecture is based on YOLOv8-seg, but it also incorporates an FPN similar to YOLACT's. It includes both detection and segmentation heads with k = 32 prototypes. However, because FastSAM performs segmentation of all possible objects in the image, its workflow differs from that of YOLOv8-seg and YOLACT:

  1. First, FastSAM produces k = 32 prototype masks for the image.
  2. These masks are then combined to generate the final segmentation mask.
  3. During post-processing, FastSAM extracts regions, calculates bounding boxes, and performs instance segmentation for each object.
FastSAM architecture: yellow blocks indicate trainable parameters, gray blocks indicate non-trainable parameters. Source: Fast Segment Anything. Adapted by the author.

Notes

The paper does not mention any details about post-processing, but looking at the official FastSAM GitHub repository, it can be observed that the cv2.findContours() method from OpenCV is used during the prediction stage.

# Use of the cv2.findContours() method during the prediction stage.
# Source: FastSAM repository (FastSAM/fastsam/prompt.py)
# The method below belongs to a class; cv2 and numpy (as np) are imported
# at the top of that file.

def _get_bbox_from_mask(self, mask):
    mask = mask.astype(np.uint8)
    contours, hierarchy = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x1, y1, w, h = cv2.boundingRect(contours[0])
    x2, y2 = x1 + w, y1 + h
    if len(contours) > 1:
        for b in contours:
            x_t, y_t, w_t, h_t = cv2.boundingRect(b)
            # Merge multiple bounding boxes into one.
            x1 = min(x1, x_t)
            y1 = min(y1, y_t)
            x2 = max(x2, x_t + w_t)
            y2 = max(y2, y_t + h_t)
        h = y2 - y1
        w = x2 - x1
    return [x1, y1, x2, y2]

In reality, there are several ways to extract instance masks from the final segmentation mask. Examples include contour detection (used in FastSAM) and connected component analysis (cv2.connectedComponents()).
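To build intuition for the connected-component alternative, here is a simplified pure-Python/numpy labeling sketch. In practice you would simply call cv2.connectedComponents(); this BFS version is only an illustration of what that call computes:

```python
from collections import deque

import numpy as np

def connected_components(mask):
    """Label 4-connected regions in a binary mask.

    A simplified stand-in for cv2.connectedComponents().
    Returns (num_labels, labels) where background pixels get label 0.
    """
    labels = np.zeros(mask.shape, dtype=np.int32)
    current = 0
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                current += 1  # start a new instance
                queue = deque([(i, j)])
                labels[i, j] = current
                while queue:  # BFS flood fill over 4-neighbours
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = current
                            queue.append((ny, nx))
    return current, labels

# Two separate square objects in one segmentation mask.
mask = np.zeros((10, 10), dtype=bool)
mask[1:4, 1:4] = True   # first instance
mask[6:9, 6:9] = True   # second instance
num, labels = connected_components(mask)
print(num)  # 2
```

Each distinct label then yields one instance mask, from which a bounding box can be computed as in the repository snippet above.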

Training

The FastSAM researchers used the same SA-1B dataset as the SAM developers, but trained their CNN detector on only 2% of the data. Nevertheless, the CNN detector delivers performance comparable to the original SAM while requiring significantly fewer resources for segmentation. As a result, FastSAM inference is up to 50 times faster!

For reference, SA-1B consists of 11 million diverse images and 1.1 billion high-quality segmentation masks.

Why is FastSAM faster than SAM? SAM uses the Vision Transformer (ViT) architecture, known for its heavy computational requirements. In contrast, FastSAM uses a CNN to perform segmentation, which is much lighter.

Prompt-Guided Selection

The "segment anything task" involves creating a segmentation mask for a given prompt, which can be expressed in a variety of forms.

Different types of prompts handled by FastSAM. Source: Fast Segment Anything. Adapted by the author.

Point Prompt

After obtaining multiple prototypes for an image, a point prompt can be used to indicate that the object of interest is (or is not) located in a particular area of the image. As a result, the specified points affect the coefficients of the prototype masks.

Like SAM, FastSAM allows you to select multiple points and specify whether they belong to the foreground or the background. If multiple masks contain a foreground point corresponding to the object, background points can be used to exclude the irrelevant masks.

However, if several masks still satisfy the point prompt after filtering, mask merging is applied to obtain the final mask of the object.

Additionally, the authors apply morphological operators to smooth the shape of the final mask and remove small artifacts and noise.
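The filter-then-merge logic above can be sketched in a few lines of numpy. This is a simplified illustration under assumed shapes and names, not the actual FastSAM implementation:

```python
import numpy as np

def point_prompt(masks, points, labels):
    """Select and merge candidate masks based on point prompts.

    masks:  (n, H, W) boolean candidate masks.
    points: list of (row, col) pixel coordinates.
    labels: 1 for a foreground point, 0 for a background point.
    Returns the merged (H, W) boolean mask, or None if nothing matches.
    """
    selected = []
    for m in masks:
        hits_fg = any(m[r, c] for (r, c), lbl in zip(points, labels) if lbl == 1)
        hits_bg = any(m[r, c] for (r, c), lbl in zip(points, labels) if lbl == 0)
        # Keep masks that contain a foreground point and no background point.
        if hits_fg and not hits_bg:
            selected.append(m)
    if not selected:
        return None
    # Merge the surviving masks into the final object mask.
    return np.logical_or.reduce(selected)

# Two candidate masks: top-left square and bottom-right square.
masks = np.zeros((2, 8, 8), dtype=bool)
masks[0, 0:4, 0:4] = True
masks[1, 4:8, 4:8] = True
# Foreground point in the first square, background point in the second.
final = point_prompt(masks, points=[(1, 1), (6, 6)], labels=[1, 0])
print(final[1, 1], final[6, 6])  # True False
```

A real implementation would also apply the morphological smoothing mentioned above before returning the mask.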

Box prompt

For the box prompt, the bounding box specified in the prompt is used to select the mask with the highest Intersection over Union (IoU) with it.

Text prompt

Similarly, for text prompts, the mask that best matches the text description is chosen. To achieve this, a CLIP model is used:

  1. Embeddings are calculated for the text prompt and for the k = 32 prototype masks.
  2. The similarity between the text embedding and each prototype embedding is then computed. The prototype with the highest similarity is post-processed and returned.
For text prompts, the CLIP model calculates a text embedding for the prompt and image embeddings for the mask prototypes. The similarity between the text embedding and each image embedding is computed, and the prototype corresponding to the most similar image embedding is selected.
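The selection step boils down to an argmax over cosine similarities. In the sketch below, random vectors stand in for real CLIP embeddings, so only the selection logic is meaningful:

```python
import numpy as np

def select_by_text(text_emb, mask_embs):
    """Return the index of the mask embedding most similar to the text embedding."""
    # Normalize so that the dot product equals cosine similarity.
    text = text_emb / np.linalg.norm(text_emb)
    masks = mask_embs / np.linalg.norm(mask_embs, axis=1, keepdims=True)
    sims = masks @ text
    return int(np.argmax(sims))

rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)          # stand-in for a CLIP text embedding
mask_embs = rng.normal(size=(32, 512))   # stand-ins for k = 32 prototype embeddings
mask_embs[7] = 3.0 * text_emb            # make one embedding clearly aligned
best = select_by_text(text_emb, mask_embs)
print(best)  # 7
```

In the real pipeline, the chosen prototype's mask is then post-processed and returned as the answer to the text prompt.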

In general, this is how most segmentation models that support prompting handle prompts.

FastSAM Repository

Below is a link to the official FastSAM repository, which includes a clear README.md file and documentation.

If you plan to run FastSAM models on a Raspberry Pi, make sure to check out the Hailo-Application-Code-Examples GitHub repository. It contains all the code and scripts needed to launch FastSAM on an edge device.

In this article, we looked at FastSAM, an improved version of SAM. By combining the best practices of the YOLACT and YOLOv8-seg models, FastSAM significantly improves prediction speed while maintaining high segmentation quality, accelerating inference dozens of times compared to the original SAM.

The ability to use prompts in FastSAM provides a flexible way to obtain the segmentation mask of the object of interest. Additionally, decoupling prompt-guided selection from all-instance segmentation reduces the overall complexity.

Below are some examples of using FastSAM with various prompts, visually showing that it retains segmentation quality comparable to SAM.

Source: Fast Segment Anything
Source: Fast Segment Anything

Resources

All images are by the author unless otherwise stated.



