'Zero-Shot Referring Expression Comprehension via Structural Similarity Between Images and Captions'

March 31, 2024

“Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object).”

Find the paper and full list of authors at arXiv.

Huaizu Jiang

Computer Science
