Reducing Text Bias in Synthetically Generated MCQAs for VLMs in Autonomous Driving
Multiple Choice Question Answering (MCQA) benchmarks are an established standard for measuring Vision Language Model (VLM) performance in driving tasks. However, we observe a known phenomenon: synthetically generated MCQAs are highly susceptible to hidden textual cues that let models exploit linguistic patterns rather than visual context. Our results show that a VLM fine-tuned on such data can match the accuracy of human-validated benchmarks even without visual input. Our proposed method reduces blind accuracy from +66.9% above random to +2.9%, eliminating the vast majority of exploitable textual shortcuts. By decoupling the correct answer from linguistic artifacts and employing a curriculum learning strategy, we force the model to rely on visual grounding, ensuring that performance accurately reflects perceptual understanding.
💡 Research Summary
The paper investigates a critical flaw in the way multiple‑choice question answering (MCQA) benchmarks are generated for Vision‑Language Models (VLMs) used in autonomous driving. Current pipelines first create a natural‑language question‑answer pair from a maneuver label using an expert language model (e.g., Gemini 2.5). In a second stage, the same model generates distractor options conditioned on the correct answer. This conditioning introduces subtle linguistic regularities that link the correct answer to its distractors. As a result, even small VLMs can achieve high performance by ignoring visual input and exploiting these hidden textual cues.
The authors demonstrate the problem with two diagnostic experiments. In a zero‑shot test, pretrained VLMs are fed only the question and answer options, with the video completely masked. Several models (Gemini 2.5 Pro, Gemini 2.5 Flash, Qwen2‑VL‑7B) achieve accuracies well above random guessing (+13 % to +14 %). In a supervised fine‑tuning (SFT) test, a Qwen2‑VL‑2B model is fine‑tuned on the synthetic MCQA dataset (denoted D_llm) and then evaluated with the video zeroed out. The model retains a “blind” accuracy of +66.9 % above random, confirming that the benchmark can be solved without any visual grounding.
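The blind diagnostics described above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: `model_answer` is a hypothetical callable standing in for the VLM, and the sample field names are invented for this sketch.

```python
def blind_accuracy(samples, model_answer, num_options=4):
    """Measure MCQA accuracy with the visual input masked out.

    `samples`: dicts with 'frames', 'question', 'options', 'answer_idx'
    (illustrative schema). `model_answer`: hypothetical callable
    (frames, question, options) -> predicted option index.
    Returns accuracy relative to random guessing (0.0 = chance level).
    """
    correct = 0
    for s in samples:
        masked = [0.0] * len(s["frames"])  # video completely zeroed out
        if model_answer(masked, s["question"], s["options"]) == s["answer_idx"]:
            correct += 1
    return correct / len(samples) - 1.0 / num_options
```

A model that ignores the masked video but exploits a textual shortcut (for instance, always picking the longest or most specific-sounding option) can score far above 0.0 here, which is exactly the failure mode the diagnostic exposes.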
To eliminate this bias, the paper proposes two complementary strategies:
- **Distractor Re‑sampling** – After generating the correct answer, distractors are no longer produced by the same LLM. Instead, the authors sample other maneuver labels from the dataset and reuse the ground‑truth answers of those samples as distractors. Agent identifiers are rewritten to match the target agent in the question, ensuring semantic plausibility while breaking the answer‑conditioned generation loop. This creates a new dataset D_new whose distractors are independent of the correct answer’s linguistic pattern.
- **Curriculum‑Based Option Dropping** – During SFT, a fraction of MCQA instances have their answer options removed, turning them into open‑ended questions (question + video → answer). The drop ratio x(t) follows a quadratic schedule that starts high (≈80 % of samples have options dropped) and gradually decreases to a minimum value as training proceeds. Early aggressive dropping forces the model to rely on visual cues; later re‑introduction of options lets the model learn to select the correct choice when options are present.
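A minimal sketch of the re‑sampling strategy, assuming each dataset entry carries a maneuver `label`, an `agent` identifier, and a ground‑truth `answer` string (these field names are illustrative, not the paper's actual schema):

```python
import random

def resample_distractors(target, pool, k=3, rng=random):
    """Pick k distractors for `target` by reusing ground-truth answers of
    samples with *different* maneuver labels, instead of asking an LLM to
    invent options conditioned on the correct answer."""
    by_label = {}
    for s in pool:
        if s["label"] != target["label"]:
            by_label.setdefault(s["label"], []).append(s)
    distractors = []
    # One distractor per distinct maneuver label keeps the options diverse.
    for label in rng.sample(sorted(by_label), k):
        s = rng.choice(by_label[label])
        # Rewrite the agent identifier so the reused answer refers to the
        # agent the question asks about, keeping it semantically plausible.
        distractors.append(s["answer"].replace(s["agent"], target["agent"]))
    return distractors
```

Because every distractor is a genuine answer from elsewhere in the dataset, it shares the corpus-wide phrasing distribution rather than any answer-conditioned regularity.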
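The quadratic drop schedule can be sketched as below. The paper specifies a quadratic decay from a high initial ratio (≈80 %) to a minimum value; the exact constants `x0` and `x_min` here are illustrative assumptions.

```python
def drop_ratio(t, total_steps, x0=0.8, x_min=0.1):
    """Fraction of training samples whose answer options are removed at
    step t: starts at x0 and decays quadratically toward x_min."""
    p = min(t / total_steps, 1.0)  # training progress in [0, 1]
    return x_min + (x0 - x_min) * (1.0 - p) ** 2

def maybe_drop_options(sample, t, total_steps, rng):
    """With probability drop_ratio(t, ...), turn the MCQA instance into an
    open-ended question (question + video -> answer, no options)."""
    if rng.random() < drop_ratio(t, total_steps):
        return {**sample, "options": None}  # open-ended variant
    return sample
```

Early in training nearly every sample is open‑ended, forcing the model to answer from the video alone; options are then gradually reintroduced so the model also learns to select among given choices.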
Experimental results show that the proposed changes dramatically reduce textual shortcuts. In zero‑shot evaluation on D_new, all tested models perform at chance level (−5 % to +1 %), whereas the same models score up to +13 % above random on D_llm. In SFT, a Qwen2‑VL‑2B trained on D_new reaches overall accuracies of 75.7 %–77.3 % (versus 93.8 % on D_llm), while its blind accuracy drops to +2.9 %–+5.4 %, indicating genuine reliance on visual input. Subsets designed to test “agent not visible” scenarios (D_N and D_V) further confirm that the model can no longer infer the answer from textual patterns alone.
Curriculum learning yields additional gains: models trained with option dropping achieve slightly higher overall accuracy (≈2–3 % improvement) while maintaining low blind accuracy, suggesting stronger visual‑textual grounding. Full fine‑tuning of the vision encoder and projector further boosts performance, highlighting the importance of adapting visual representations for bird’s‑eye‑view (BEV) video, which differs from the data used in most pre‑training regimes.
The authors acknowledge limitations: experiments are confined to a single VLM (Qwen2‑VL‑2B) and a BEV‑based driving dataset with a limited set of maneuver labels. The re‑sampling approach may struggle when the label space is small, and broader validation on other modalities (front‑camera video, richer scene descriptions) is needed. Future work could explore hybrid distractor generation that mixes label‑based and language‑model‑based options, and develop automatic bias detection metrics (e.g., BLEU differences, embedding distance) to flag problematic MCQAs before training.
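One such bias detector could be a crude lexical-similarity check; the heuristic below is hypothetical, not a metric from the paper, with unigram F1 standing in for BLEU. The intuition: if the correct answer is systematically more similar to its distractors than the distractors are to each other, the options were likely generated conditioned on the answer.

```python
import itertools

def token_f1(a, b):
    """Crude lexical-similarity proxy (unigram F1) standing in for BLEU."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(tb), overlap / len(ta)
    return 2 * p * r / (p + r)

def answer_similarity_gap(answer, distractors):
    """Hypothetical heuristic: how much more similar the correct answer is
    to its distractors than the distractors are to each other. Large
    positive values may flag answer-conditioned distractor generation."""
    a_sim = sum(token_f1(answer, d) for d in distractors) / len(distractors)
    pairs = list(itertools.combinations(distractors, 2))
    d_sim = sum(token_f1(x, y) for x, y in pairs) / len(pairs)
    return a_sim - d_sim
```

Questions scoring above some threshold could be re-sampled or dropped before training; an embedding-distance variant would catch paraphrase-level leakage that token overlap misses.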
In summary, the paper provides a clear diagnosis of hidden textual bias in synthetically generated MCQA benchmarks for autonomous‑driving VLMs and offers a simple yet effective remedy: decoupling distractor generation from the answer‑producing LLM and applying a curriculum that temporarily removes answer choices during training. These techniques suppress the model’s ability to “cheat” using language alone, forcing it to ground its reasoning in visual evidence. The work therefore contributes valuable guidance for constructing robust, vision‑centric evaluation suites that more faithfully measure a VLM’s true multimodal understanding in safety‑critical driving scenarios.