Open-Vocabulary · Camouflaged Instance Segmentation · Text-to-Image Diffusion
1HKUST 2CFAR & IHPC, A*STAR 3Deakin University 4Nankai University 5NUS 6Trinity College Dublin
Abstract
Text-to-image diffusion models have shown exceptional capabilities in producing high-quality, dense visual predictions from open-vocabulary text. However, these advantages do not carry over to camouflaged instances, whose visual boundaries blend heavily into their surroundings.
We propose a method built upon a state-of-the-art text-to-image diffusion model, empowered by open-vocabulary text, to learn multi-scale textual-visual features for camouflaged object representation. Such cross-domain representations are desirable both for segmenting camouflaged objects, where only subtle visual cues distinguish objects from the background, and for segmenting novel classes unseen during training.
We devise complementary modules that effectively fuse cross-domain features and steer the relevant features towards their respective foreground objects. Extensive experiments confirm that our method outperforms existing baselines on both camouflaged and generic open-vocabulary instance segmentation benchmarks.
Contributions
Method
Our pipeline takes an image and a text prompt describing target objects, and produces instance masks with open-vocabulary category labels. The Stable Diffusion (SD) and CLIP backbones are frozen; only the specialised modules are trained.
Multi-scale Features Fusion
This module fuses multi-scale features from the SD encoder with the final decoder layer via 1×1 convolutions, element-wise multiplication, and residual addition, capturing both the fine-grained detail and the broad context critical for detecting objects at varying scales.
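The fusion step above can be sketched in PyTorch as follows; the channel sizes and the class name `MultiScaleFeatureFusion` are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatureFusion(nn.Module):
    """Sketch: project each encoder scale with a 1x1 conv, upsample it to
    the decoder resolution, multiply element-wise with the decoder feature,
    and add the result back residually."""

    def __init__(self, enc_channels, dec_channels):
        super().__init__()
        # one 1x1 projection per encoder scale
        self.projs = nn.ModuleList(
            [nn.Conv2d(c, dec_channels, kernel_size=1) for c in enc_channels]
        )

    def forward(self, enc_feats, dec_feat):
        fused = dec_feat
        for proj, f in zip(self.projs, enc_feats):
            f = proj(f)  # align channels with the decoder feature
            f = F.interpolate(f, size=dec_feat.shape[-2:],
                              mode="bilinear", align_corners=False)
            # element-wise multiplication followed by residual addition
            fused = fused + fused * f
        return fused
```

In this form the multiplicative term acts as a spatial gate on the decoder feature, so fine encoder detail modulates, rather than replaces, the broad decoder context.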
Textual-Visual Aggregation
This module computes instance-aware cross-modal interactions between mask embeddings and CLIP text embeddings, using a softmax-weighted dot product with mean normalisation to suppress background noise and focus features on the text-specified foreground objects.
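A minimal sketch of such an aggregation, assuming per-instance mask embeddings of shape (N, D) and text embeddings of shape (K, D); the temperature value and the residual fusion at the end are illustrative choices, not the paper's exact formulation:

```python
import torch

def textual_visual_aggregation(mask_emb, text_emb, temperature=0.07):
    """Softmax-weighted dot product between mask and text embeddings,
    followed by mean normalisation to suppress background responses."""
    # mask_emb: (N, D) per-instance mask embeddings
    # text_emb: (K, D) CLIP text embeddings of category prompts
    sim = mask_emb @ text_emb.t() / temperature   # (N, K) cross-modal similarity
    attn = sim.softmax(dim=-1)                    # softmax weights over prompts
    agg = attn @ text_emb                         # (N, D) text-conditioned features
    agg = agg - agg.mean(dim=0, keepdim=True)     # mean-normalise across instances
    return mask_emb + agg                         # fuse back with visual features
```

Subtracting the mean response cancels activations shared by all instances, which is one simple way to remove background-like components common to every embedding.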
Camouflaged Instance Normalisation
Inspired by adaptive instance normalisation, CIN applies learnable affine transformations to the textual-visual features, conditioned on instance masks, to produce the final refined masks. Instance existence is predicted category-agnostically with a confidence score rather than a class score.
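A rough AdaIN-style sketch of this idea; the affine-prediction head, the pooling, and the confidence head below are illustrative stand-ins for the paper's actual design:

```python
import torch
import torch.nn as nn

class CamouflagedInstanceNorm(nn.Module):
    """Sketch: instance-normalise the textual-visual features, then scale
    and shift them with affine parameters predicted from the instance mask.
    A linear head yields a class-agnostic existence confidence."""

    def __init__(self, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        # predict per-channel scale/shift from the instance mask
        self.to_affine = nn.Sequential(
            nn.Conv2d(1, channels * 2, kernel_size=1),
            nn.AdaptiveAvgPool2d(1),
        )
        self.confidence = nn.Linear(channels, 1)

    def forward(self, feat, mask):
        # feat: (B, C, H, W) textual-visual features, mask: (B, 1, H, W)
        gamma, beta = self.to_affine(mask).chunk(2, dim=1)  # (B, C, 1, 1) each
        refined = self.norm(feat) * (1 + gamma) + beta
        # pooled features -> category-agnostic existence confidence in [0, 1]
        score = self.confidence(refined.mean(dim=(-2, -1))).sigmoid()
        return refined, score
```

Using a single sigmoid confidence rather than a class logit keeps the existence decision decoupled from the open-vocabulary label, which is assigned separately from the text embeddings.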
Experiments
The model is pre-trained on MS-COCO (80 categories) and fine-tuned on COD10K-v3. Only 6 categories are shared between the training and test datasets, validating open-vocabulary generalisation.
Results on COD10K-v3:

| Method | AP | AP50 | AP75 | Trainable Params (M) |
|---|---|---|---|---|
| Closed-set Supervised Learning | ||||
| Mask R-CNN | 25.0 | 55.5 | 20.4 | 43.9 |
| SOLOv2 | 32.5 | 63.2 | 29.9 | 46.2 |
| Mask2Former | 39.4 | 67.7 | 38.5 | 43.9 |
| OSFormer | 41.0 | 71.1 | 40.8 | 46.6 |
| UQFormer | 45.2 | 71.6 | 46.6 | 37.5 |
| DCNet | 45.3 | 70.7 | 47.5 | 53.4 |
| Open-Vocab T2I (w/o fine-tuning) | ||||
| ODISE | 21.1 | 37.8 | 20.5 | 28.1 |
| Ours | 23.9 | 44.3 | 23.1 | 28.7 |
| Ours (task-specific) | 45.1 | 71.1 | 47.4 | 28.7 |
Results on NC4K:

| Method | AP | AP50 | AP75 | Trainable Params (M) |
|---|---|---|---|---|
| Closed-set Supervised Learning | ||||
| Mask2Former | 45.8 | 73.6 | 47.5 | 43.9 |
| OSFormer | 42.5 | 72.5 | 42.3 | 46.6 |
| DCNet | 52.8 | 77.1 | 56.5 | 53.4 |
| Open-Vocab T2I (w/o fine-tuning) | ||||
| ODISE | 22.9 | 37.2 | 21.4 | 28.1 |
| Ours | 24.8 | 44.2 | 23.9 | 28.7 |
| Ours (task-specific) | 52.9 | 76.8 | 55.9 | 28.7 |
| Method | ADE20K AP | Cityscapes AP | Trainable Params (M) |
|---|---|---|---|
| MaskCLIP | 6.1 | — | 354.1 |
| ODISE | 13.9 | — | 28.1 |
| X-Decoder | 13.1 | 24.9 | 38.3 |
| OpenSeeD | 15.0 | 33.2 | 116.2 |
| Ours | 14.1 | 25.6 | 28.7 |
↗ 2nd on both benchmarks, with ~4× fewer trainable parameters than OpenSeeD at similar performance.
Qualitative Analysis
Qualitative comparison on COD10K-v3 & NC4K (Input / GT / OSFormer / DCNet / ODISE / Ours)
Failure cases on COD10K-v3
Limitations & Failure Cases
Our method may struggle to (1) separate touching/overlapping instances of highly similar appearance, (2) handle severely occluded objects that are fragmented into non-semantic parts, and (3) segment objects with extreme spatial discontinuity (e.g., an animal's body occluded by a tree). These are inherent challenges in camouflage understanding that represent avenues for future work.
Ablation
| Variant | AP | ΔAP |
|---|---|---|
| No text (embed = 0) | 12.2 | −7.1 |
| Skip MSFF (last layer only) | 18.4 | −0.9 |
| Skip MSFF (concat all) | 18.1 | −1.2 |
| Skip CIN module | 17.6 | −1.7 |
| Skip TVA module | 18.8 | −0.5 |
| Full setting | 19.3 | — |
Text embeddings are the most critical component: removing them causes a 37% relative AP drop (19.3 → 12.2).
| Prompt Strategy | AP |
|---|---|
| "A photo of <class>." | 22.8 |
| + Synonym ensembling | 23.4 |
| + Camouflage-specific templates | 23.9 |
Template Examples
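A sketch of how such prompt ensembling might be assembled; the template strings and the synonym table below are illustrative assumptions, not the paper's exact set:

```python
# Illustrative camouflage-aware templates; the paper's exact set is not listed here.
TEMPLATES = [
    "A photo of a {}.",
    "A photo of a {} camouflaged in the background.",
    "A photo of a {} blending into its surroundings.",
]

# Hypothetical synonym table used for synonym ensembling.
SYNONYMS = {
    "katydid": ["katydid", "bush cricket"],
}

def build_prompts(class_name):
    """Expand one class name into an ensemble of prompt strings."""
    names = SYNONYMS.get(class_name, [class_name])
    return [t.format(n) for t in TEMPLATES for n in names]
```

The CLIP text embeddings of all prompts for a class would then typically be averaged into a single per-class embedding before matching against mask embeddings.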
CLIP's reasonable zero-shot accuracy on camouflaged datasets supports the utility of textual supervision even in these visually challenging scenarios.
Citation