AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering
Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Early VQA systems relied heavily on language biases, motivating subsequent work to emphasize visual grounding and balanced datasets. With the success of large-scale pre-trained transformers in both the text and vision domains, such as PhoBERT for Vietnamese language understanding and Vision Transformers (ViT) for image representation learning, multimodal fusion has achieved remarkable progress. For Vietnamese VQA, several datasets have been introduced to promote research in low-resource multimodal learning, including ViVQA, OpenViVQA, and the recently proposed ViTextVQA. These resources enable benchmarking of models that integrate linguistic and visual features in the Vietnamese context. Evaluation of VQA systems often employs automatic metrics originally designed for image captioning or machine translation, such as BLEU, METEOR, and CIDEr, together with Recall, Precision, and F1-score. However, recent research suggests that large language models can further improve the alignment between automatic evaluation and human judgment in VQA tasks. In this work, we explore Vietnamese Visual Question Answering using transformer-based architectures, leveraging both textual and visual pre-training while systematically comparing automatic evaluation metrics under multilingual settings.
💡 Research Summary
The paper introduces AutoViVQA, a large‑scale Vietnamese Visual Question Answering (VQA) dataset created entirely through an automated pipeline that leverages large language models (LLMs) and multimodal resources. Recognizing the scarcity of high‑quality VQA benchmarks for low‑resource languages, the authors combine real‑world images from the MS‑COCO dataset (19,411 images) with Vietnamese captions and conversational descriptions from the VIST‑A corpus. This multimodal foundation ensures strong visual‑text alignment and linguistic naturalness.
The core of the pipeline is a constraint‑guided generation process. For each image, all associated captions are merged into a unified textual context. An LLM (Gemini‑2.5 Flash) is prompted with explicit instructions that specify a target reasoning level drawn from a five‑level schema: (1) recognition, (2) spatial/relational, (3) compositional, (4) commonsense/causal, and (5) text‑in‑image reasoning. For each question, the model generates five short answers (1‑10 tokens each), emulating multiple annotators and enabling consensus‑based validation. Prompt constraints enforce natural Vietnamese syntax, strict grounding in the provided captions, and logical consistency between question and answers.
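The constraint-guided prompting step can be sketched as follows. This is a hypothetical illustration: the function name `build_prompt`, the English prompt wording, and the exact level labels are assumptions, not the authors' actual prompt.

```python
# Illustrative sketch of a constraint-guided prompt builder for the
# five-level reasoning schema. The wording and structure are assumptions;
# the paper's exact prompt is not reproduced here.

REASONING_LEVELS = {
    1: "recognition",
    2: "spatial/relational",
    3: "compositional",
    4: "commonsense/causal",
    5: "text-in-image",
}

def build_prompt(captions, level):
    """Merge an image's captions into one context and request one question
    plus five short answers at the target reasoning level."""
    if level not in REASONING_LEVELS:
        raise ValueError(f"reasoning level must be 1-5, got {level}")
    context = " ".join(captions)
    return (
        f"Context (image captions): {context}\n"
        f"Target reasoning level: {level} ({REASONING_LEVELS[level]})\n"
        "Generate ONE question in natural Vietnamese, grounded strictly in "
        "the context above, and FIVE short answers (1-10 tokens each) that "
        "are logically consistent with the question."
    )
```

The returned string would then be sent to the generation model (Gemini‑2.5 Flash in the paper) once per image and target level.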
To guarantee quality without human annotation, the authors employ an ensemble‑based validation protocol. Multiple LLMs and vision‑language models evaluate each generated QA pair against criteria such as visual grounding, linguistic fluency, and adherence to the prescribed reasoning level. Criterion‑wise thresholds filter out low‑scoring samples, and a majority‑voting scheme determines the final inclusion. This multi‑model filtering dramatically reduces hallucinations, weak grounding, and cultural bias.
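The filtering logic described above (criterion-wise thresholds followed by majority voting across judges) can be sketched as below. The criterion names, threshold values, and score range are illustrative assumptions; the paper does not publish these exact numbers here.

```python
# Minimal sketch of the ensemble validation protocol, assuming each judge
# model returns per-criterion scores in [0, 1]. Criteria and thresholds
# are hypothetical placeholders.

THRESHOLDS = {"visual_grounding": 0.7, "fluency": 0.7, "level_adherence": 0.6}

def judge_passes(scores):
    """A single judge accepts a QA pair only if every criterion clears
    its threshold."""
    return all(scores.get(c, 0.0) >= t for c, t in THRESHOLDS.items())

def keep_qa_pair(judge_scores):
    """Majority vote over per-judge accept/reject decisions determines
    final inclusion in the dataset."""
    votes = [judge_passes(s) for s in judge_scores]
    return sum(votes) > len(votes) / 2
```

A sample failing any single criterion for most judges is dropped, which is how the pipeline screens out hallucinated or weakly grounded questions without human review.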
AutoViVQA contains 37,077 questions and 185,385 answers (five per question), covering nine distinct question categories (spatial, relational, causal, counting, etc.) and the full five‑level reasoning hierarchy. Compared with existing Vietnamese datasets (ViVQA, OpenViVQA, ViTextVQA) and English benchmarks (VQA‑v2, CLEVR), AutoViVQA offers greater scale, richer reasoning diversity, and more balanced answer distributions (short, open‑ended responses rather than single‑token classifications).
Experimental evaluation uses a PhoBERT‑ViT multimodal transformer. The authors assess standard automatic metrics (BLEU, METEOR, CIDEr, Recall, Precision, F1) and analyze their correlation with human judgments, confirming recent findings that large language models can improve alignment between automatic scores and human perception. Performance breakdown across reasoning levels reveals that current models still struggle with higher‑order tasks (causal inference, text‑in‑image), highlighting the dataset’s utility for probing advanced multimodal reasoning.
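Since each question carries five reference answers, token-level Precision/Recall/F1 can be computed against the best-matching reference, a common convention for short open-ended answers. The aggregation below (max-over-references) is an assumption about the setup, not the paper's verified formula.

```python
# Hedged sketch: token-overlap Precision, Recall, and F1 of a predicted
# answer against multiple reference answers, scored against whichever
# reference yields the highest F1.
from collections import Counter

def prf1(pred, refs):
    pred_tokens = pred.lower().split()
    best = (0.0, 0.0, 0.0)  # (precision, recall, f1)
    for ref in refs:
        ref_tokens = ref.lower().split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if not pred_tokens or not ref_tokens or overlap == 0:
            continue
        p = overlap / len(pred_tokens)
        r = overlap / len(ref_tokens)
        f1 = 2 * p * r / (p + r)
        if f1 > best[2]:
            best = (p, r, f1)
    return best
```

BLEU, METEOR, and CIDEr would be computed with standard implementations against the same five references; the function above only illustrates the overlap-based trio.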
In summary, the contributions are threefold: (1) the release of AutoViVQA, a publicly available, large‑scale Vietnamese VQA benchmark; (2) a reproducible, LLM‑driven generation framework that explicitly controls reasoning complexity; and (3) an ensemble validation pipeline that ensures high visual grounding and linguistic quality without costly human labeling. The methodology is adaptable to other low‑resource languages and domain‑specific VQA tasks, offering a roadmap for scalable multimodal dataset creation. Future work will explore extending reasoning types, domain adaptation, and integration with state‑of‑the‑art large multimodal models.