‘Zero-Shot Referring Expression Comprehension via Structural Similarity Between Images and Captions’

“Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object).”
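To make the triplet format concrete, here is a minimal Python sketch of how a caption triplet could be compared against triplets extracted for candidate boxes. It is not the authors' method. The names `Triplet`, `token_overlap`, `triplet_similarity`, and `best_match` are hypothetical, the triplets are written by hand rather than produced by a foundation model, and the word-overlap score is only a toy stand-in for a learned vision-language similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """A (subject, predicate, object) triplet, as named in the abstract."""
    subject: str
    predicate: str
    object: str


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (assumption: toy similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def triplet_similarity(text_t: Triplet, image_t: Triplet) -> float:
    """Average the slot-wise scores for subject, predicate, and object."""
    return (
        token_overlap(text_t.subject, image_t.subject)
        + token_overlap(text_t.predicate, image_t.predicate)
        + token_overlap(text_t.object, image_t.object)
    ) / 3.0


def best_match(text_t: Triplet, candidates: dict) -> str:
    """Return the key of the candidate box whose triplet scores highest."""
    return max(candidates, key=lambda k: triplet_similarity(text_t, candidates[k]))


# Hand-written example: caption "the dog sitting on the couch".
caption = Triplet("dog", "sitting on", "couch")
boxes = {
    "box_a": Triplet("dog", "sitting on", "couch"),
    "box_b": Triplet("dog", "standing near", "table"),
}
chosen = best_match(caption, boxes)
```

In this toy setup both boxes contain a dog, so the subject alone cannot separate them. The predicate and object slots decide the match, which illustrates why the abstract stresses relationships among the disentangled entities.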

Find the paper and the full list of authors on arXiv.
