International Journal of Computer Vision  ·  2026

Catch Me If You Can Describe Me:
Open-Vocabulary Camouflaged Instance Segmentation
with Diffusion

Open-Vocabulary  ·  Camouflaged Instance Segmentation  ·  Text-to-Image Diffusion

Tuan-Anh Vu1,2 Duc Thanh Nguyen3 Qing Guo4 Nhat Chung2,5 Binh-Son Hua6 Ivor W. Tsang2 Sai-Kit Yeung1

1HKUST    2CFAR & IHPC, A*STAR    3Deakin University    4Nankai University    5NUS    6Trinity College Dublin

Paper (IJCV) · Code (GitHub, TBD) · Results

Cross-domain representations
for hidden targets

Text-to-image diffusion techniques have shown exceptional capabilities in producing high-quality, dense visual predictions from open-vocabulary text. However, these advantages do not carry over to camouflaged objects, whose visual boundaries blend heavily into their surroundings.

We propose a method built upon a state-of-the-art text-to-image diffusion model, empowered with open-vocabulary capability, to learn multi-scale textual-visual features for camouflaged object representation. Such cross-domain representations are desirable for segmenting camouflaged objects, where visual cues only subtly distinguish objects from the background, and for segmenting novel classes unseen during training.

We devise complementary modules to effectively fuse cross-domain features and steer relevant features towards their respective foreground objects. Extensive experiments confirm the advantages of our method over existing baselines on camouflaged and generic open-vocabulary instance segmentation benchmarks.

KEYWORDS

Camouflaged Instance Segmentation · Open Vocabulary · Text-to-Image Diffusion · CLIP · Stable Diffusion · Mask2Former

TASK COMPARISON

Task | Instance Sep. | Vocabulary | Supervision
COD | ✗ | N/A | Binary mask
CIS | ✓ | Closed | Instance masks
OVCIS (Ours) | ✓ | Open-vocab | Masks + categories

CORE INSIGHT

When visual features alone fail to distinguish camouflaged objects from cluttered backgrounds, textual representations learnt from large-scale language-image data provide rich, complementary cues that dramatically improve discriminability.

VISUAL-ONLY AP
12.2
TEXTUAL-VISUAL AP
23.9

What we propose

01
🎯
New Task: OVCIS
We define and address Open-Vocabulary Camouflaged Instance Segmentation — the first framework to unify camouflage understanding, instance separation, and open-vocabulary recognition.
02
🔀
Diffusion-based Pipeline
A method built on Stable Diffusion and CLIP that fuses multi-scale visual and textual features to learn powerful cross-domain representations for camouflaged targets.
03
🧩
Specialised Modules
Three novel components — MSFF, TVA, and CIN — tailored specifically to enhance camouflaged object representations through multi-scale fusion and textual guidance.
04
📊
Comprehensive Evaluation
Extensive experiments on COD10K-v3, NC4K, ADE20K, and Cityscapes benchmarks with ablation studies validating each design choice.

Pipeline Overview

Our pipeline takes an image and a text prompt about target objects, producing instance masks with open-vocabulary category labels. The SD and CLIP models are frozen; only the specialised modules are trained.
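This frozen-backbone recipe can be sketched as follows, assuming PyTorch. The tiny modules below are hypothetical stand-ins for Stable Diffusion, CLIP, and the specialised heads, not the authors' actual classes; the point is only the gradient-flow pattern.

```python
import torch
import torch.nn as nn

# Stand-ins for the frozen backbones and the trainable heads (illustrative).
sd_encoder = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # stands in for frozen SD
clip_text_encoder = nn.Linear(16, 8)                    # stands in for frozen CLIP
fusion_modules = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# Freeze the backbones; only the specialised modules receive gradients.
for backbone in (sd_encoder, clip_text_encoder):
    for p in backbone.parameters():
        p.requires_grad_(False)

# Hand only the trainable parameters to the optimiser.
trainable = [p for p in fusion_modules.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

n_trainable = sum(p.numel() for p in trainable)
n_total = n_trainable + sum(
    p.numel() for m in (sd_encoder, clip_text_encoder) for p in m.parameters()
)
```

With the real backbones, the trainable fraction is what the page reports as 28.7M parameters (about 1.88% of the total).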

Pipeline
MSFF

Multi-scale Features Fusion

Fuses multi-scale features from the SD encoder with the final decoder layer via 1×1 convolution, element-wise multiplication, and residual addition. Captures both fine-grained and broad contextual information critical for detecting objects at varying scales.
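A minimal NumPy sketch of this fusion pattern, resize, 1×1 projection, element-wise multiplication, then residual addition. The function names, shapes, and nearest-neighbour resizing are illustrative assumptions, not the paper's exact module.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as a per-pixel channel projection.
    x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)

def msff(encoder_feats, decoder_feat, weights):
    """MSFF-style fusion sketch (illustrative, not the authors' exact module).

    Each multi-scale encoder feature is resized to the decoder's resolution,
    projected by a 1x1 convolution, multiplied element-wise with the decoder
    feature, and added back residually.
    """
    C, H, W = decoder_feat.shape
    fused = decoder_feat.copy()
    for feat, w in zip(encoder_feats, weights):
        c, h, w_ = feat.shape
        # Nearest-neighbour resize to (H, W) via index sampling.
        ys = np.arange(H) * h // H
        xs = np.arange(W) * w_ // W
        resized = feat[:, ys][:, :, xs]
        projected = conv1x1(resized, w)           # (C, H, W)
        fused = fused + projected * decoder_feat  # multiply, then residual add
    return fused
```

Mixing coarse and fine encoder scales this way lets one feature map carry both broad context and fine boundary detail.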

TVA

Textual-Visual Aggregation

Computes instance-aware cross-modal interactions between mask embeddings and CLIP text embeddings. Uses softmax-weighted dot product with mean-normalisation to remove background noise and focus features on text-specified foreground objects.
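A sketch of this aggregation under simplifying assumptions (plain arrays in place of real mask and CLIP embeddings, and a residual fusion step chosen for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tva(mask_embed, text_embed):
    """TVA-style aggregation sketch (illustrative, not the paper's exact equations).

    mask_embed: (N, D) per-instance mask embeddings
    text_embed: (K, D) CLIP text embeddings for K class prompts

    Mean-normalise the mask embeddings to suppress the shared background
    component, then aggregate text features with softmax-weighted dot
    products so each instance attends to its most relevant prompts.
    """
    centred = mask_embed - mask_embed.mean(axis=0, keepdims=True)
    sim = centred @ text_embed.T   # (N, K) instance-to-text similarity
    attn = softmax(sim, axis=-1)   # softmax weights over prompts
    aggregated = attn @ text_embed # (N, D) text-guided features
    return centred + aggregated    # residual fusion
```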

CIN

Camouflaged Instance Normalisation

Inspired by adaptive instance normalisation, CIN applies learnable affine transformations to textual-visual features conditioned on instance masks, producing final refined masks. Uses a confidence score rather than a class score for category-agnostic instance existence.
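The masked, AdaIN-inspired normalisation can be sketched as below; the soft-mask weighting and the plain-array affine parameters are assumptions made for a self-contained example.

```python
import numpy as np

def cin(features, mask, gamma, beta, eps=1e-5):
    """CIN-style normalisation sketch (AdaIN-inspired; illustrative).

    features: (C, H, W) textual-visual feature map
    mask:     (H, W) soft instance mask in [0, 1]
    gamma, beta: (C,) learnable affine parameters (plain arrays here)

    Normalises each channel over the masked region, then applies the
    learnable affine transform, conditioning the statistics on the
    instance mask rather than the whole image.
    """
    w = mask / (mask.sum() + eps)                        # normalised spatial weights
    mu = (features * w).sum(axis=(1, 2), keepdims=True)  # masked per-channel mean
    var = (((features - mu) ** 2) * w).sum(axis=(1, 2), keepdims=True)
    normed = (features - mu) / np.sqrt(var + eps)
    return gamma[:, None, None] * normed + beta[:, None, None]
```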

Quantitative Results

Pre-trained on MS-COCO (80 categories), fine-tuned on COD10K-v3. Only 6 categories are shared between training and test datasets, validating open-vocabulary generalisation.

COD10K-v3
Method | AP | AP50 | AP75 | Trainable Params (M)
Closed-set Supervised Learning:
Mask R-CNN | 25.0 | 55.5 | 20.4 | 43.9
SOLOv2 | 32.5 | 63.2 | 29.9 | 46.2
Mask2Former | 39.4 | 67.7 | 38.5 | 43.9
OSFormer | 41.0 | 71.1 | 40.8 | 46.6
UQFormer | 45.2 | 71.6 | 46.6 | 37.5
DCNet | 45.3 | 70.7 | 47.5 | 53.4
Open-Vocab T2I (w/o fine-tuning):
ODISE | 21.1 | 37.8 | 20.5 | 28.1
Ours | 23.9 | 44.3 | 23.1 | 28.7
Ours (task-specific) | 45.1 | 71.1 | 47.4 | 28.7
NC4K
Method | AP | AP50 | AP75 | Trainable Params (M)
Closed-set Supervised Learning:
Mask2Former | 45.8 | 73.6 | 47.5 | 43.9
OSFormer | 42.5 | 72.5 | 42.3 | 46.6
DCNet | 52.8 | 77.1 | 56.5 | 53.4
Open-Vocab T2I (w/o fine-tuning):
ODISE | 22.9 | 37.2 | 21.4 | 28.1
Ours | 24.8 | 44.2 | 23.9 | 28.7
Ours (task-specific) | 52.9 | 76.8 | 55.9 | 28.7
Method | ADE20K AP | Cityscapes AP | Trainable Params (M)
MaskCLIP | 6.1 | – | 354.1
ODISE | 13.9 | – | 28.1
X-Decoder | 13.1 | 24.9 | 38.3
OpenSeeD | 15.0 | 33.2 | 116.2
Ours | 14.1 | 25.6 | 28.7

↗ 2nd on both benchmarks, with ~4× fewer trainable parameters than OpenSeeD at similar performance.

45.1
AP · COD10K-v3
52.9
AP · NC4K
28.7M
Trainable Params
1.88%
of Total Params

Visual Comparisons

Our method excels at pixel-level instance segmentation, accurately delineating camouflaged objects along their blurry boundaries in cluttered backgrounds. Compared to OSFormer, DCNet, and ODISE baselines, our approach produces crisper mask boundaries and better handles multiple overlapping instances in complex underwater and terrestrial scenes. The textual-visual features learned by our pipeline visibly cluster around foreground objects even when their boundaries are nearly imperceptible to visual-only models.

Qualitative comparison on COD10K-v3 & NC4K (Input / GT / OSFormer / DCNet / ODISE / Ours)


Failure cases on COD10K-v3


LIMITATIONS & FAILURE CASES

Our method may struggle to (1) separate touching/overlapping instances of highly similar appearance, (2) handle severely occluded objects that are fragmented into non-semantic parts, and (3) segment objects with extreme spatial discontinuity (e.g., an animal's body occluded by a tree). These are inherent challenges in camouflage understanding that represent avenues for future work.

Module & Design Validation

MODULE ABLATION · COD10K-v3

Variant | AP | ΔAP
No text (embed = 0) | 12.2 | −7.1
Skip MSFF (last layer only) | 18.4 | −0.9
Skip MSFF (concat all) | 18.1 | −1.2
Skip CIN module | 17.6 | −1.7
Skip TVA module | 18.8 | −0.5
Full setting | 19.3 | –

Text embeddings are the most critical component — removing them causes a 37% AP drop.

PROMPT ENGINEERING · COD10K-v3

Prompt Strategy | AP
"A photo of <class>." | 22.8
+ Synonym ensembling | 23.4
+ Camouflage-specific templates | 23.9

TEMPLATE EXAMPLES

☞ "A photo of the camouflaged <class>."
☞ "A photo of the <class> concealed in the background."
☞ "A photo of the <class> camouflaged to blend in with its surroundings."
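The template-plus-synonym ensembling strategy can be sketched as below. The templates are the ones listed above; `embed_text` is a toy deterministic stand-in for CLIP's text encoder, and the class names in the test are hypothetical.

```python
import zlib
import numpy as np

# Camouflage-specific templates from the page ({} replaces <class>).
TEMPLATES = [
    "A photo of the camouflaged {}.",
    "A photo of the {} concealed in the background.",
    "A photo of the {} camouflaged to blend in with its surroundings.",
]

def embed_text(prompt, dim=8):
    """Toy deterministic unit embedding; stands in for CLIP text features."""
    rng = np.random.default_rng(zlib.crc32(prompt.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def class_embedding(class_name, synonyms=()):
    """Ensemble over templates and synonyms, then renormalise,
    mirroring the prompt-ensembling rows in the ablation table."""
    names = (class_name, *synonyms)
    vecs = [embed_text(t.format(n)) for n in names for t in TEMPLATES]
    mean = np.mean(vecs, axis=0)
    return mean / np.linalg.norm(mean)
```

Averaging embeddings over many phrasings of the same class tends to cancel template-specific noise, which is the effect the +0.6 and +0.5 AP rows above attribute to ensembling and camouflage-aware wording.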

CLIP ZERO-SHOT CLASSIFICATION ON CAMOUFLAGED DATASETS

48.1%
COD10K-v3
45.7%
NC4K
46.5%
CAMO

CLIP's reasonable zero-shot accuracy on camouflaged datasets confirms the utility of textual supervision even in challenging visual scenarios.
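The protocol behind these numbers is standard CLIP-style nearest-text classification; here is a minimal sketch with placeholder embeddings (a real run would take them from CLIP's image and text encoders).

```python
import numpy as np

def zero_shot_classify(image_feat, text_feats):
    """CLIP-style zero-shot classification sketch: pick the class whose
    unit-normalised text embedding has the highest cosine similarity
    with the image embedding. Embeddings here are plain arrays."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = txt @ img                 # cosine similarity per class
    return int(np.argmax(sims)), sims
```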

How to Cite

BIBTEX
@article{vu2026ovcis,
  title     = {Catch Me If You Can Describe Me: Open-Vocabulary Camouflaged Instance Segmentation with Diffusion},
  author    = {Vu, Tuan-Anh and Nguyen, Duc Thanh and Guo, Qing and Chung, Nhat and Hua, Binh-Son and Tsang, Ivor W. and Yeung, Sai-Kit},
  journal   = {International Journal of Computer Vision},
  year      = {2026},
  publisher = {Springer}
}

DATASETS

MS-COCO · COD10K-v3 · NC4K · CAMO · ADE20K · Cityscapes

BACKBONE

Stable Diffusion v1.3 (LAION-5B) + CLIP ViT (400M pairs)

TRAINING

4× NVIDIA A40 · 90k iters · Batch 64 · ~4.3 days