Task-Aware In-Context Retrieval for Visual Question Answering

Authors

  • Yifan Wang

DOI:

https://doi.org/10.61173/cdvv8617

Keywords:

Visual Question Answering (VQA), In-Context Learning, Task-Aware Retrieval, Unsupervised Clustering, Large Vision-Language Models

Abstract

Multimodal In-Context Learning (ICL) has demonstrated remarkable potential in enabling Large Vision-Language Models to adapt to new tasks without parameter updates. However, existing training-free methods rely primarily on visual or semantic similarity for demonstration retrieval and often overlook the latent task intent of the query. This limitation leads to a “task-mismatch” problem, in which retrieved examples are visually similar to the query but follow different logical reasoning patterns, thereby misleading the model. This paper presents a novel Task-Aware Retrieval (TAR) framework for enhancing Visual Question Answering (VQA). An unsupervised semantic clustering mechanism based on Sentence-BERT is proposed to categorize questions into distinct task clusters in a data-driven manner. At inference time, a hybrid retrieval strategy selects demonstrations that align with both the latent task intent and the visual context of the test instance. Experimental results on the OK-VQA dataset show that the proposed method attains an accuracy of 41.44%, surpassing strong training-free baselines. Qualitative analysis confirms that TAR effectively mitigates reasoning errors caused by task misalignment, validating the importance of intent consistency in multimodal ICL.
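To make the described pipeline concrete, the Python sketch below clusters support-set questions with Sentence-BERT and, at retrieval time, combines task-cluster agreement with visual similarity. It is a minimal illustration only: the encoder checkpoint, the choice of k-means, the weighting scheme, and all names are assumptions for exposition, not details taken from the paper.

```python
# Minimal sketch of the Task-Aware Retrieval (TAR) idea from the abstract.
# Model choice, clustering method, and score weighting are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# 1) Unsupervised task clustering: embed support-set questions with Sentence-BERT
#    and group them into latent task clusters.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed Sentence-BERT variant
support_questions = [
    "What brand is the laptop?",
    "Why is the man holding an umbrella?",
]
q_emb = encoder.encode(support_questions, normalize_embeddings=True)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(q_emb)  # cluster count is a hyperparameter

def retrieve_demonstrations(test_question, test_image_feat, support_image_feats, k=4, alpha=0.5):
    """Hybrid retrieval: favor support examples whose question lies in the same
    task cluster as the test question, ranked by visual similarity (assumed scoring)."""
    test_emb = encoder.encode([test_question], normalize_embeddings=True)
    test_cluster = kmeans.predict(test_emb)[0]
    same_cluster = (kmeans.labels_ == test_cluster).astype(float)
    visual_sim = support_image_feats @ test_image_feat        # cosine similarity on normalized features
    scores = alpha * same_cluster + (1 - alpha) * visual_sim  # simple weighted combination (assumption)
    return np.argsort(-scores)[:k]                            # indices of the top-k demonstrations
```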

Published

2026-02-28

Issue

Section

Articles