In the rapidly evolving landscape of artificial intelligence, the quest for models that are both powerful and efficient is paramount. As AI continues to permeate various sectors, from healthcare to autonomous vehicles, the need for models that can deliver high performance without excessive resource consumption becomes increasingly important. In this blog post, we will introduce two innovative approaches that address these challenges: EfficientSAM and SqueezeSAM. These advancements aim to optimize the Segment-Anything Model (SAM), making it more accessible and practical for a wide range of applications.
The challenge of balancing performance and efficiency
AI models have grown exponentially in complexity and capability, often requiring substantial computational resources. This can limit their deployment, particularly in environments where resources are constrained, such as mobile devices or edge-computing platforms. The Segment-Anything Model (SAM) is a powerful tool for image segmentation, but like many advanced models, it can be resource-intensive. Our two approaches, EfficientSAM and SqueezeSAM, seek to overcome these limitations by enhancing the model’s efficiency and reducing its size, all while maintaining its robust capabilities and adding some new features.
EfficientSAM: Leveraged masked-image pretraining for efficient segment anything
Imagine being able to process images in real time without sacrificing accuracy or performance. This is now possible with EfficientSAM, a groundbreaking innovation that optimizes the Segment-Anything Model (SAM) architecture to reduce computational overhead and improve inference times.
The original SAM model, powered by a massive transformer trained on the high-quality SA-1B dataset, has shown impressive results in zero-shot transfer and versatility. However, its hefty computation cost has limited its applications in the real world. That’s where EfficientSAM comes in: It’s a lightweight version of the model that achieves remarkable performance while significantly reducing complexity.
By leveraging masked-image pretraining through SAMI (SAM-leveraged masked-image pretraining), EfficientSAM learns to reconstruct features from the SAM image encoder, resulting in effective visual-representation learning. When fine-tuned on the SA-1B dataset, EfficientSAM models demonstrate exceptional performance in various vision tasks, including image classification, object detection, instance segmentation, and semantic-object detection.
But what really sets EfficientSAM apart is its ability to outperform other fast SAM models in zero-shot instance-segmentation tasks, achieving a significant gain of up to 4 AP on COCO/LVIS datasets. With EfficientSAM, developers can now unlock real-time image-processing capabilities, opening up new possibilities for applications that require minimal resource usage and rapid processing times.
Key features of EfficientSAM
- Optimized architecture: Redesigned model structure eliminates unnecessary computations, enhancing efficiency without compromising accuracy.
- Faster inference: Reduced latency enables real-time applications in industries such as autonomous vehicles and robotics.
How it works
We developed a new approach to image segmentation called SAM-Leveraged Masked-Image Pretraining (SAMI). This method adapts the masked autoencoder (MAE) framework to create efficient image encoders for segment-anything models.
Our model consists of an encoder and a decoder, both built on transformer layers. The input image is divided into non-overlapping patches, which are then grouped into unmasked and masked tokens. The encoder uses the unmasked tokens to extract features, while the decoder reconstructs the masked tokens during self-supervised learning. The SAMI approach modifies the MAE framework by using latent features from a pre-trained image encoder, such as SAM, as the reconstruction target. This allows the model to transfer knowledge embedded in SAM to the new encoder.
After pretraining, the SAMI-pretrained lightweight encoder can be used as the image encoder for EfficientSAM models. These models can be fine-tuned on datasets such as SA-1B for the segment-anything task. The resulting models are efficient and effective for image-segmentation tasks. Overall, the SAMI approach offers a promising solution for efficient image segmentation, leveraging the power of masked autoencoders and pre-trained image encoders.
Key components
- Cross-attention decoder: This component reconstructs the representation of masked tokens using the output feature embedding from the encoder.
- Linear projection head: This component aligns the features from the SAM image encoder with the output of the MAE model.
- Reconstruction loss: This loss function is minimized to optimize the encoder, decoder, and linear projection head (a minimal sketch of these components appears below).
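To make these components concrete, here is a minimal, illustrative sketch of the SAMI pretraining objective in PyTorch. It is not the released implementation: the toy module sizes, the 75 percent masking ratio, the use of mean-squared error as the reconstruction loss, and supervising only the masked tokens are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class CrossAttentionDecoder(nn.Module):
    """Reconstructs masked-token features by attending to the encoder's outputs."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, encoded_unmasked: torch.Tensor, num_masked: int) -> torch.Tensor:
        batch = encoded_unmasked.size(0)
        queries = self.mask_token.expand(batch, num_masked, -1)  # one query per masked patch
        attended, _ = self.attn(queries, encoded_unmasked, encoded_unmasked)
        return self.mlp(attended)


# Toy setup: 196 patches per image, 75% masked, feature dimension 256.
num_patches, dim = 196, 256
num_masked = int(0.75 * num_patches)
num_unmasked = num_patches - num_masked

# Stand-ins for the real components.
light_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2
)
decoder = CrossAttentionDecoder(dim)
projection_head = nn.Linear(dim, dim)  # aligns outputs with the SAM encoder's feature space

# One dummy batch of patch embeddings, split into unmasked and masked tokens.
patch_embeddings = torch.randn(4, num_patches, dim)
unmasked_tokens = patch_embeddings[:, :num_unmasked]

encoded = light_encoder(unmasked_tokens)                       # lightweight encoder features
reconstructed = projection_head(decoder(encoded, num_masked))  # predicted masked-token features

# Reconstruction target: latent features of the masked patches from the frozen SAM
# image encoder (replaced here by random tensors purely for illustration).
sam_target_features = torch.randn(4, num_masked, dim)
loss = nn.functional.mse_loss(reconstructed, sam_target_features)
loss.backward()
```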
Results
In Table 1 below, we present quantitative results of EfficientSAM on zero-shot instance segmentation. EfficientSAM significantly closes the gap to SAM (44.4 AP versus 46.5 AP) while reducing SAM’s inference time by approximately 20x and its parameter count by roughly 20x, with only a small performance drop.
Figure 2 below illustrates that EfficientSAM produces qualitative output competitive with other efficient methods and with SAM.
We proposed a masked-image pretraining approach, SAMI, to explore the potential of ViTs under the guidance of the SAM foundation model, and demonstrated that SAMI helps build efficient SAMs with pretrained lightweight encoders. The resulting EfficientSAM enables promptable image segmentation for a wide range of real-world applications. We are also adapting EfficientSAM to video segmentation, with the goal of enabling on-device video segmentation.
SqueezeSAM: User-friendly mobile-interactive segmentation
The SAM model has been a driving force behind advancements in interactive segmentation, with applications in generative AI, computational photography, and medical imaging. However, its massive 600-million-parameter architecture has made it incompatible with mobile hardware, limiting its potential for widespread adoption.
We set out to change this by developing SqueezeSAM, a fully convolutional model architecture that distills the power of SAM into a package that’s 62.5 times faster and 31.6 times smaller. This breakthrough enables automated segmentation on mobile devices, opening up new possibilities for photographers and creators.
But what does this mean in practice? With SqueezeSAM, users can enjoy seamless object detection and segmentation, even on lower-end mobile hardware. Our model achieves an accuracy within one percent of the original SAM, ensuring that results are both fast and reliable.
To further enhance the user experience, we’ve developed a novel data-augmentation scheme that addresses a common limitation of traditional segmentation models. When a user clicks on a specific part of an object, our model can now intelligently segment the entire object, rather than just the clicked area. This means that clicking on a person’s T-shirt will segment the entire person, while clicking on a person holding a basketball will segment both the person and the ball.
Key features of SqueezeSAM
- Smaller model/faster inference: Compressed architecture reduces parameters and operations, making it efficient for devices with limited resources.
- Maintained accuracy: High accuracy is retained despite reduced size.
- Whole-object segmentation: Accurately segments entire objects, even in complex scenes or with partial occlusion.
How it works
Zero-shot instance segmentation is a challenging task that involves identifying and segmenting objects in an image without prior training on those specific objects. SqueezeSAM is a novel approach to zero-shot instance segmentation, promptable with user clicks.
SqueezeSAM consists of an encoder-decoder architecture, where the encoder embeds the input image into a rich intermediate representation and the decoder consumes this representation along with user input (clicks) to produce a set of segmentation masks. Our proposed early fusion combines the encoder’s input image with the user clicks, allowing the model to focus more attention on the image regions the user wants to segment.
Key design choices to achieve a good speed/accuracy tradeoff
- Architecture: A UNet-style architecture with transformer layers at its lowest-resolution (bottom) scale.
- Low channel count: Low channel count for small model size and low latency.
- Normalization: Employing BatchNorm instead of LayerNorm for computational efficiency.
- Skip connections: Skip connections between encoder and decoder layers.
- Early fusion: Early fusion adds the user’s input points to the RGB image to focus attention on the desired regions. In addition to this early fusion, points are also encoded by the decoder transformer (as is done in SAM). The mix of early and late fusion further helps the network attend to the objects the user wants to segment (see the sketch after this list).
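To make the early-fusion idea concrete, here is a minimal sketch, assuming user clicks are rasterized into positive and negative heat-map channels that are concatenated with the RGB channels before a convolutional encoder. The channel layout, click encoding, and toy encoder are assumptions for illustration, not the SqueezeSAM implementation.

```python
import torch
import torch.nn as nn


def clicks_to_heatmaps(clicks, height: int, width: int) -> torch.Tensor:
    """Rasterize (x, y, is_positive) clicks into two channels: positive
    (foreground) clicks in channel 0, negative (background) clicks in channel 1."""
    maps = torch.zeros(2, height, width)
    for x, y, is_positive in clicks:
        maps[0 if is_positive else 1, y, x] = 1.0
    return maps


# Dummy RGB image and two user clicks (one foreground, one background).
image = torch.rand(3, 256, 256)
clicks = [(120, 80, True), (10, 10, False)]

# Early fusion: stack the click channels onto the RGB channels before the encoder.
fused_input = torch.cat([image, clicks_to_heatmaps(clicks, 256, 256)], dim=0)

# A tiny stand-in encoder that accepts the 5-channel fused input
# (BatchNorm mirrors the normalization choice described above).
encoder = nn.Sequential(
    nn.Conv2d(5, 16, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
)
features = encoder(fused_input.unsqueeze(0))  # (1, 32, 64, 64) feature map
```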
The importance of automatic segmentation in photography applications
Leading players in photography applications, such as Apple and CapCut, use automatic segmentation in their solutions. With the goal of training a model for autocreation, we explored classic (non-interactive) instance segmentation and salient-object detection (SOD). Our qualitative observations indicated that SOD helped us create superior input, leading to an enhanced interactive-segmentation experience.
Sampling points from the saliency heatmap
Salient-object detection (SOD) identifies the most notable regions in an image, producing a heat map with higher values indicating more salient areas. We use SOD to sample points from this heat map for our segmentation model, allowing us to detect objects beyond the training set. To do this, we threshold the map, detect blobs, and select the most salient one. We then divide it into four sections and use the center of mass from each section, plus the whole blob, as our selected points, resulting in five clicks.
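Here is a minimal sketch of that sampling heuristic, assuming a saliency map with values in [0, 1]. The threshold value, ranking blobs by mean saliency, and splitting around the bounding-box center are assumptions about details left unspecified above.

```python
import numpy as np
from scipy import ndimage


def sample_clicks(saliency: np.ndarray, threshold: float = 0.5):
    """Return up to five (row, col) click points sampled from a saliency heat map."""
    # 1. Threshold the heat map and find connected blobs.
    binary = saliency > threshold
    labels, num_blobs = ndimage.label(binary)
    if num_blobs == 0:
        return []

    # 2. Keep the most salient blob (ranked here by mean saliency).
    scores = ndimage.mean(saliency, labels, index=np.arange(1, num_blobs + 1))
    blob = labels == (int(np.argmax(scores)) + 1)

    # 3. One click at the whole blob's center of mass, plus one per quadrant of its
    #    bounding box (quadrant centers of mass mapped back to image coordinates).
    ys, xs = np.nonzero(blob)
    cy, cx = (ys.min() + ys.max()) // 2, (xs.min() + xs.max()) // 2
    points = [ndimage.center_of_mass(blob)]
    quadrants = [((0, 0), blob[:cy, :cx]), ((0, cx), blob[:cy, cx:]),
                 ((cy, 0), blob[cy:, :cx]), ((cy, cx), blob[cy:, cx:])]
    for (off_y, off_x), quad in quadrants:
        if quad.any():
            qy, qx = ndimage.center_of_mass(quad)
            points.append((qy + off_y, qx + off_x))
    return points
```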
Augmenting SqueezeSAM for whole-object segmentation
Segmenting people, pets, and objects is crucial for our purposes, but traditional SAM models often fail to capture full objects. To address this, we analyzed the SA-1B dataset, identifying limitations such as poor representation of salient-object masks and incomplete masks. We applied the following data-augmentation techniques to overcome these challenges.
- Mask merging: Merging small masks into larger ones for better segmentation.
- Outlier injection: Introducing background points during training to sharpen object detection.
- Center cropping around random objects: We randomly sample an object in the image and crop part of the image around that chosen object.
By integrating these techniques, we’ve significantly enhanced our model’s ability to capture whole objects with precision, paving the way for more accurate and reliable performance.
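As one concrete example, here is a minimal sketch of the center-cropping augmentation, assuming instance masks are provided as boolean arrays. The margin size and the rule for discarding masks that leave the crop are assumptions for illustration.

```python
import random
import numpy as np


def crop_around_random_object(image: np.ndarray, masks: list, margin: int = 32):
    """Pick one instance mask at random and crop the image (and all masks) to a
    window around that object's bounding box, expanded by `margin` pixels."""
    target = random.choice(masks)
    ys, xs = np.nonzero(target)
    height, width = target.shape
    y0, y1 = max(int(ys.min()) - margin, 0), min(int(ys.max()) + margin + 1, height)
    x0, x1 = max(int(xs.min()) - margin, 0), min(int(xs.max()) + margin + 1, width)

    cropped_image = image[y0:y1, x0:x1]
    # Keep only masks that still contain foreground pixels after the crop.
    cropped_masks = [m[y0:y1, x0:x1] for m in masks if m[y0:y1, x0:x1].any()]
    return cropped_image, cropped_masks
```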
Results
Figure 3 below shows the qualitative comparison of SqueezeSAM with other fast segmentation models.
We evaluated our models on an internal dataset of images, depicted above, containing person masks. We found that the salient SAM variants perform significantly better than the original SAM. We also evaluated our models on the COCO dataset, using the 5k validation partition. We used the masks from COCO instance segmentation, replacing a COCO mask with the corresponding LVIS mask whenever the LVIS mask strongly overlaps it. We found that SqueezeSAM dramatically outperforms the prior literature, achieving higher mIoU with a single click.
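For reference, mIoU (mean intersection over union) here is the average per-mask overlap between prediction and ground truth; a minimal computation over binary masks looks like this.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union > 0 else 1.0


def mean_iou(preds, gts) -> float:
    """mIoU: the average IoU over a set of predicted/ground-truth mask pairs."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))
```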
Table 2 below presents a comparative analysis of various SAM candidates, starting with the original SAM model and including our proposed SqueezeSAM architecture in both floating-point (fp32) and quantized variants. Notably, our quantized model demonstrates comparable performance to its fp32 counterpart, suggesting that our model architecture is more amenable to quantization with minimal degradation in quality.
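The exact quantization recipe is not described here; purely as an illustration of how an int8 variant of a convolutional model can be produced, the following sketch applies PyTorch’s post-training static quantization in FX graph mode to a hypothetical stand-in network. This is not necessarily how the quantized SqueezeSAM in Table 2 was built.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx

# Hypothetical stand-in for a small convolutional segmentation network.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),
).eval()

example = torch.randn(1, 3, 256, 256)
qconfig_mapping = QConfigMapping().set_global(get_default_qconfig("fbgemm"))

# Insert observers, run a few representative calibration batches, then convert to int8.
prepared = prepare_fx(model, qconfig_mapping, example_inputs=(example,))
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(1, 3, 256, 256))
quantized = convert_fx(prepared)
mask_logits = quantized(example)  # int8 inference
```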
Furthermore, as shown in Figure 4 below, we have developed a pipeline that effectively isolates prominent objects from input images. The process generates a saliency map, strategically samples points from that map, and then uses these sampled points to accurately identify the salient object.
Our experimental pipeline reveals that SqueezeSAM surpasses the performance of original SAM on a custom dataset tailored for person segmentation. (See Table 3 below.) Notably, SqueezeSAM achieves a four-percent improvement in segmentation accuracy over original SAM on this dataset. Furthermore, fine-tuning SqueezeSAM on the LVIS dataset yields an additional four-percent enhancement in performance, underscoring the efficacy of our approach.
We proposed a new AI-powered model, SqueezeSAM, for segmenting images on mobile devices. SqueezeSAM can capture correlated objects together, providing a more comprehensive understanding of the image context. This has the potential to improve the mobile photo-editing experience, enabling users to achieve their desired results more efficiently and effectively.
Join us on this journey
We invite you to explore the full details of our research in the EfficientSAM/SqueezeSAM papers. Your feedback and insights are crucial as we continue to refine and expand the capabilities of AI.
Join us in our mission to shape the future of artificial intelligence by making it more efficient, accessible, and impactful for everyone.