Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending the LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: (i) shuffled multiple-choice options, (ii) open-ended questions, and (iii) needle-in-a-haystack contexts containing both relevant and adversarially irrelevant information. Results show substantial performance drops under both shuffled multiple-choice options and open-ended questions, and brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context evaluations and provide a broader benchmark for assessing code reasoning in both legacy and modern systems.
💡 Research Summary
This paper conducts a comprehensive robustness analysis of state‑of‑the‑art large language models (LLMs) on long‑context code question answering (QA). Building on the existing LongCodeBench benchmark, which focuses on Python code and a four‑option multiple‑choice format, the authors extend the dataset with two new multilingual tracks: (1) COBOL, represented by the public OPPSCAL dataset and an internal IBM enterprise codebase, and (2) Java, drawn from four high‑profile open‑source projects (Elasticsearch, Cassandra, Dubbo, Kafka). Each track contains questions that require multi‑file reasoning, variable‑flow tracing, API behavior understanding, and exception handling across contexts ranging from 32 k to 512 k tokens (and up to 1 M tokens for Java).
To probe model behavior beyond raw accuracy, the study introduces three controlled perturbations: (i) shuffling the order of multiple‑choice options, (ii) removing the options entirely to force open‑ended answer generation, and (iii) “needle‑in‑a‑haystack” distractor injection, where relevant code snippets are mixed with adversarially irrelevant fragments placed at varying depths within the context. The open‑ended setting is evaluated with an LLM‑as‑a‑Judge framework to assess semantic correctness.
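The first and third perturbations are straightforward to reproduce. Below is a minimal, illustrative sketch (not the authors' code; function names and the string-based context representation are assumptions) of how option shuffling and depth-controlled needle placement can be constructed:

```python
import random

def shuffle_options(options, answer_index, rng=None):
    """Perturbation (i): randomly reorder multiple-choice options.

    Returns the shuffled option list and the new index of the
    correct answer, so accuracy can still be scored after shuffling.
    """
    rng = rng or random.Random(0)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_index = order.index(answer_index)
    return shuffled, new_index

def inject_needle(relevant_snippet, distractors, depth):
    """Perturbation (iii): place the relevant code snippet at a relative
    depth (0.0 = start of context, 1.0 = end) among irrelevant fragments,
    then join everything into a single long context string.
    """
    haystack = list(distractors)
    position = int(round(depth * len(haystack)))
    haystack.insert(position, relevant_snippet)
    return "\n\n".join(haystack)
```

Sweeping `depth` over, say, `[0.0, 0.25, 0.5, 0.75, 1.0]` yields the positional-bias curves the study reports, while perturbation (ii) simply omits the options from the prompt entirely.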
A diverse suite of models is evaluated, including GPT‑4o, Gemini‑2.5‑Flash and Pro, Claude‑4.5‑Sonnet, Llama‑3.1‑405B, Mistral‑Small‑24B and Large‑675B, Qwen‑2.5‑72B, and Granite‑4.0‑8B. Results are reported for each context length and each perturbation. The key findings are:
- Recognition‑generation gap – Across all languages, accuracy drops 15–35 percentage points when answer options are removed. Models that excel in the multiple‑choice setting (e.g., Claude‑Sonnet at 80 % with options) fall below 45 % without options, indicating heavy reliance on pattern matching rather than genuine code reasoning.
- Sensitivity to option ordering – Shuffling the options degrades performance by 10–15 % for many models, revealing positional bias. Even large models such as Llama‑3.1‑405B and Claude‑Sonnet exhibit this weakness.
- Distractor vulnerability and positional bias – In the needle‑in‑a‑haystack experiments, models preferentially attend to recent tokens; when the correct evidence is placed early in the context, accuracy declines sharply. This effect is especially pronounced for COBOL, where legacy syntax (e.g., REDEFINES, PERFORM loops) appears to confuse modern LLMs.
- Scaling inconsistencies – Increasing context windows does not uniformly improve performance. Some models degrade beyond 128 k tokens, while others (e.g., Gemini‑2.5‑Flash) maintain relatively stable scores, suggesting that sheer context length is insufficient without effective retrieval or memory strategies.
Model‑specific observations: Gemini‑2.5‑Flash and Pro show the smallest recognition‑generation gap (≈10 percentage points), hinting at stronger generative capabilities. GPT‑4o and Claude‑Sonnet achieve the highest multiple‑choice scores but suffer the steepest drops in the open‑ended setting. Mistral‑Small‑24B, despite its modest size, attains over 70 % with options but falls below 45 % without them. Qwen‑2.5‑72B displays comparable performance across both settings, suggesting more balanced recognition and generation abilities.
The authors conclude that current LLMs are not yet robust enough for real‑world software engineering tasks that involve long, noisy codebases and ambiguous question formats. They propose future directions: (a) integrating retrieval‑augmented pipelines to isolate relevant code fragments before reasoning, (b) language‑specific pre‑training or fine‑tuning to capture legacy language idioms, and (c) designing attention mechanisms or training objectives that mitigate distractor sensitivity and positional bias. Overall, the paper provides a valuable multilingual benchmark and a systematic methodology for evaluating and improving long‑context code reasoning in LLMs.