Open-Vocabulary · Camouflaged Instance Segmentation · Text-to-Image Diffusion
1HKUST 2CFAR & IHPC, A*STAR 3Deakin University 4Nankai University 5NUS 6Trinity College Dublin
Abstract
Text-to-image diffusion models have shown exceptional capabilities in producing high-quality, dense visual predictions from open-vocabulary text. However, these advantages do not carry over to camouflaged instances, whose visual boundaries blend heavily into their surroundings.
We propose a method built upon a state-of-the-art text-to-image diffusion model, empowered by open-vocabulary text, to learn multi-scale textual-visual features for camouflaged object representation. Such cross-domain representations are desirable both for segmenting camouflaged objects, where only subtle visual cues distinguish objects from the background, and for segmenting novel classes unseen during training.
We devise complementary modules that effectively fuse cross-domain features and steer the relevant features towards their respective foreground objects. Extensive experiments confirm that our method outperforms existing baselines on both camouflaged and generic open-vocabulary instance segmentation benchmarks.
Contributions
Method
Our pipeline takes an image and a text prompt describing target objects, and produces instance masks with open-vocabulary category labels. The Stable Diffusion (SD) and CLIP backbones are frozen; only the specialised modules are trained.
Multi-scale Features Fusion
This module fuses multi-scale features from the SD encoder with the final decoder layer via 1×1 convolutions, element-wise multiplication, and residual addition, capturing both the fine-grained detail and the broad context critical for detecting objects at varying scales.
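The fusion step above can be sketched in PyTorch as follows; the channel sizes and the class name `MultiScaleFeatureFusion` are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatureFusion(nn.Module):
    """Sketch: project each encoder scale with a 1x1 conv, upsample it to
    the decoder resolution, multiply element-wise with the decoder feature,
    and add the result back residually."""

    def __init__(self, enc_channels, dec_channels):
        super().__init__()
        # one 1x1 projection per encoder scale
        self.projs = nn.ModuleList(
            [nn.Conv2d(c, dec_channels, kernel_size=1) for c in enc_channels]
        )

    def forward(self, enc_feats, dec_feat):
        fused = dec_feat
        for proj, f in zip(self.projs, enc_feats):
            f = proj(f)  # align channels with the decoder feature
            f = F.interpolate(f, size=dec_feat.shape[-2:],
                              mode="bilinear", align_corners=False)
            # element-wise multiplication followed by residual addition
            fused = fused + fused * f
        return fused
```

In this form the multiplicative term acts as a spatial gate on the decoder feature, so fine encoder detail modulates, rather than replaces, the broad decoder context.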
Textual-Visual Aggregation
This module computes instance-aware cross-modal interactions between mask embeddings and CLIP text embeddings, using a softmax-weighted dot product with mean normalisation to suppress background noise and focus features on the text-specified foreground objects.
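A minimal sketch of such an aggregation, assuming per-instance mask embeddings of shape (N, D) and text embeddings of shape (K, D); the temperature value and the residual fusion at the end are illustrative choices, not the paper's exact formulation:

```python
import torch

def textual_visual_aggregation(mask_emb, text_emb, temperature=0.07):
    """Softmax-weighted dot product between mask and text embeddings,
    followed by mean normalisation to suppress background responses."""
    # mask_emb: (N, D) per-instance mask embeddings
    # text_emb: (K, D) CLIP text embeddings of category prompts
    sim = mask_emb @ text_emb.t() / temperature   # (N, K) cross-modal similarity
    attn = sim.softmax(dim=-1)                    # softmax weights over prompts
    agg = attn @ text_emb                         # (N, D) text-conditioned features
    agg = agg - agg.mean(dim=0, keepdim=True)     # mean-normalise across instances
    return mask_emb + agg                         # fuse back with visual features
```

Subtracting the mean response cancels activations shared by all instances, which is one simple way to remove background-like components common to every embedding.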
Camouflaged Instance Normalisation
Inspired by adaptive instance normalisation, CIN applies learnable affine transformations to the textual-visual features, conditioned on instance masks, to produce the final refined masks. Instance existence is predicted category-agnostically with a confidence score rather than a class score.
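A rough AdaIN-style sketch of this idea; the affine-prediction head, the pooling, and the confidence head below are illustrative stand-ins for the paper's actual design:

```python
import torch
import torch.nn as nn

class CamouflagedInstanceNorm(nn.Module):
    """Sketch: instance-normalise the textual-visual features, then scale
    and shift them with affine parameters predicted from the instance mask.
    A linear head yields a class-agnostic existence confidence."""

    def __init__(self, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        # predict per-channel scale/shift from the instance mask
        self.to_affine = nn.Sequential(
            nn.Conv2d(1, channels * 2, kernel_size=1),
            nn.AdaptiveAvgPool2d(1),
        )
        self.confidence = nn.Linear(channels, 1)

    def forward(self, feat, mask):
        # feat: (B, C, H, W) textual-visual features, mask: (B, 1, H, W)
        gamma, beta = self.to_affine(mask).chunk(2, dim=1)  # (B, C, 1, 1) each
        refined = self.norm(feat) * (1 + gamma) + beta
        # pooled features -> category-agnostic existence confidence in [0, 1]
        score = self.confidence(refined.mean(dim=(-2, -1))).sigmoid()
        return refined, score
```

Using a single sigmoid confidence rather than a class logit keeps the existence decision decoupled from the open-vocabulary label, which is assigned separately from the text embeddings.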
Experiments
The model is pre-trained on MS-COCO (80 categories) and fine-tuned on COD10K-v3. Only 6 categories are shared between the training and test datasets, validating open-vocabulary generalisation.
Results on COD10K-v3:

| Method | AP | AP50 | AP75 | Trainable Params (M) |
|---|---|---|---|---|
| Closed-set Supervised Learning | ||||
| Mask R-CNN | 25.0 | 55.5 | 20.4 | 43.9 |
| SOLOv2 | 32.5 | 63.2 | 29.9 | 46.2 |
| Mask2Former | 39.4 | 67.7 | 38.5 | 43.9 |
| OSFormer | 41.0 | 71.1 | 40.8 | 46.6 |
| UQFormer | 45.2 | 71.6 | 46.6 | 37.5 |
| DCNet | 45.3 | 70.7 | 47.5 | 53.4 |
| Open-Vocab T2I (w/o fine-tuning) | ||||
| ODISE | 21.1 | 37.8 | 20.5 | 28.1 |
| Ours | 23.9 | 44.3 | 23.1 | 28.7 |
| Ours (task-specific) | 45.1 | 71.1 | 47.4 | 28.7 |
Results on NC4K:

| Method | AP | AP50 | AP75 | Trainable Params (M) |
|---|---|---|---|---|
| Closed-set Supervised Learning | ||||
| Mask2Former | 45.8 | 73.6 | 47.5 | 43.9 |
| OSFormer | 42.5 | 72.5 | 42.3 | 46.6 |
| DCNet | 52.8 | 77.1 | 56.5 | 53.4 |
| Open-Vocab T2I (w/o fine-tuning) | ||||
| ODISE | 22.9 | 37.2 | 21.4 | 28.1 |
| Ours | 24.8 | 44.2 | 23.9 | 28.7 |
| Ours (task-specific) | 52.9 | 76.8 | 55.9 | 28.7 |
| Method | ADE20K AP | Cityscapes AP | Trainable Params (M) |
|---|---|---|---|
| MaskCLIP | 6.1 | — | 354.1 |
| ODISE | 13.9 | — | 28.1 |
| X-Decoder | 13.1 | 24.9 | 38.3 |
| OpenSeeD | 15.0 | 33.2 | 116.2 |
| Ours | 14.1 | 25.6 | 28.7 |
↗ 2nd on both benchmarks, with ~4× fewer trainable parameters than OpenSeeD at similar performance.
Qualitative Analysis
Qualitative comparison on COD10K-v3 & NC4K (Input / GT / OSFormer / DCNet / ODISE / Ours)
Failure cases on COD10K-v3
Limitations & Failure Cases
Our method may struggle to (1) separate touching/overlapping instances of highly similar appearance, (2) handle severely occluded objects that are fragmented into non-semantic parts, and (3) segment objects with extreme spatial discontinuity (e.g., an animal's body occluded by a tree). These are inherent challenges in camouflage understanding that represent avenues for future work.
Ablation
| Variant | AP | ΔAP |
|---|---|---|
| No text (embed = 0) | 12.2 | −7.1 |
| Skip MSFF (last layer only) | 18.4 | −0.9 |
| Skip MSFF (concat all) | 18.1 | −1.2 |
| Skip CIN module | 17.6 | −1.7 |
| Skip TVA module | 18.8 | −0.5 |
| Full setting | 19.3 | — |
Text embeddings are the most critical component: removing them causes a 37% relative AP drop (19.3 → 12.2).
| Prompt Strategy | AP |
|---|---|
| "A photo of <class>." | 22.8 |
| + Synonym ensembling | 23.4 |
| + Camouflage-specific templates | 23.9 |
Template Examples
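A sketch of how such prompt ensembling might be assembled; the template strings and the synonym table below are illustrative assumptions, not the paper's exact set:

```python
# Illustrative camouflage-aware templates; the paper's exact set is not listed here.
TEMPLATES = [
    "A photo of a {}.",
    "A photo of a {} camouflaged in the background.",
    "A photo of a {} blending into its surroundings.",
]

# Hypothetical synonym table used for synonym ensembling.
SYNONYMS = {
    "katydid": ["katydid", "bush cricket"],
}

def build_prompts(class_name):
    """Expand one class name into an ensemble of prompt strings."""
    names = SYNONYMS.get(class_name, [class_name])
    return [t.format(n) for t in TEMPLATES for n in names]
```

The CLIP text embeddings of all prompts for a class would then typically be averaged into a single per-class embedding before matching against mask embeddings.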
CLIP's reasonable zero-shot accuracy on camouflaged datasets supports the utility of textual supervision even in these visually challenging scenarios.
Citation