RealFin: How Well Do LLMs Reason About Finance When Users Leave Things Unsaid?

Reliable financial reasoning requires knowing not only how to answer, but also when an answer cannot be justified. In real financial practice, problems often rest on implicit assumptions that are taken for granted rather than stated explicitly, so a question can appear solvable while lacking the information needed for a definite answer. We introduce REALFIN, a bilingual benchmark that evaluates financial reasoning by systematically removing essential premises from exam-style questions while keeping them linguistically plausible. Building on this dataset, we evaluate models under three formulations that test answering, recognizing missing information, and rejecting unjustified options, and we find consistent performance drops when key conditions are absent. General-purpose models tend to over-commit and guess, while most finance-specialized models fail to clearly identify missing premises. These results highlight a critical gap in current evaluations and show that reliable financial models must know when a question should not be answered.


💡 Research Summary

The paper introduces RealFin, a bilingual benchmark designed to test large language models’ (LLMs) ability to recognize when a financial question lacks sufficient information for a definitive answer. Existing financial QA benchmarks assume that every question is fully specified and admits a unique correct answer, which masks a critical capability: the model’s meta‑reasoning about the completeness of the premises. RealFin addresses this gap by starting from professional exam‑style questions (CFA‑style in English, CPA‑style in Chinese) and manually removing one or more essential assumptions—such as macro‑economic context, valuation models, contractual constraints, or accounting standards—while keeping the text fluent and plausible. The resulting “condition‑missing” questions are under‑determined; any concrete answer would require clarification of the omitted premises.
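To make the construction concrete, here is a minimal sketch of an original/revised pair; the question text and field names are hypothetical illustrations, not items drawn from the benchmark:

```python
# Hypothetical illustration of RealFin's condition-removal step.
# Neither question below is taken from the actual benchmark.
original = {
    "question": (
        "A bond pays a 5% annual coupon, matures in 3 years, and its "
        "yield to maturity is 4%. What is its price per 100 of face value?"
    ),
    "choices": {"A": "97.23", "B": "100.00", "C": "102.78", "D": "105.15"},
    "answer": "C",  # fully determined: discount the cash flows at 4%
}

revised = {
    # The yield to maturity has been removed; the text still reads
    # naturally, but the price is no longer determined.
    "question": (
        "A bond pays a 5% annual coupon and matures in 3 years. "
        "What is its price per 100 of face value?"
    ),
    "choices": {"A": "97.23", "B": "100.00", "C": "102.78", "D": "105.15"},
    "answer": None,  # under-determined: no choice can be justified
}
```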

The dataset contains 2,020 questions (1,062 English, 959 Chinese) split into Original (full‑condition) and Revised (condition‑missing) versions, and further augmented with a “None‑of‑the‑Above” (NOTA) option that forces a model to explicitly state that none of the provided choices can be justified. Three task formulations are evaluated: (i) Original – answer the fully specified question, (ii) Revised – identify that information is missing, and (iii) NOTA – select the “no correct answer” option when appropriate.
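One way to picture how a single item expands into the three formulations is the sketch below; it assumes items are stored as plain dicts, and the field names are illustrative rather than the paper's actual schema:

```python
def build_formulations(item):
    """Expand one benchmark item into the three evaluation formulations.
    Field names are illustrative; the paper's exact schema may differ."""
    original = {
        "question": item["original_question"],
        "choices": item["choices"],
        "gold": item["answer"],            # the unique correct option
    }
    revised = {
        "question": item["revised_question"],  # essential premise removed
        "choices": item["choices"],
        "gold": "INSUFFICIENT",            # model should flag missing info
    }
    nota_choices = dict(item["choices"])
    nota_choices["E"] = "None of the above options can be justified."
    nota = {
        "question": item["revised_question"],
        "choices": nota_choices,
        "gold": "E",                       # NOTA is the only defensible pick
    }
    return original, revised, nota
```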

Ten LLMs are evaluated in a zero‑shot setting: five general‑purpose commercial models (GPT‑5.1‑mini, Gemini‑2.5‑Flash, Claude‑3.5‑Sonnet, DeepSeek‑V3, Qwen3‑Max) and five finance‑specialized open‑source models (XuanYuan‑3‑70B, Fin‑R1‑7B, CFGPT‑2‑7B, DISC‑FinLLM‑13B, FinGPT‑7B). All models receive the same prompt asking them to return a strict JSON object with “reason”, “answer”, and “confidence” fields, and decoding is performed at temperature 0 to eliminate sampling randomness.
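A minimal sketch of such an evaluation harness, assuming an OpenAI-compatible chat API; the system-prompt wording, model name, and parse-failure fallback are assumptions, not the paper's exact protocol:

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()

SYSTEM = (
    "Answer the multiple-choice finance question. Respond ONLY with a "
    'JSON object of the form {"reason": str, "answer": str, "confidence": float}.'
)

def ask(model, question, choices):
    """Query one model at temperature 0 and parse its strict-JSON reply."""
    options = "\n".join(f"{k}. {v}" for k, v in choices.items())
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic decoding, as in the paper
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"{question}\n{options}"},
        ],
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        # Malformed reply: score as abstention-by-failure
        return {"reason": "", "answer": None, "confidence": 0.0}
```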

Results show a consistent performance drop when essential premises are removed, but the nature of the drop differs across model families. General‑purpose models retain relatively high accuracy (≈85–90%) on Revised questions, yet when they do err they tend to over‑commit and guess rather than admit uncertainty. Finance‑specialized models, while sometimes competitive on Original questions, struggle to recognize missing information and rarely select the NOTA option, leading to sharp declines in accuracy. Language‑specific trends also emerge: Claude‑3.5‑Sonnet leads on English CFA Revised questions (89.62%), while Qwen3‑Max dominates Chinese CPA Revised questions (92%). Across question types, more complex categories such as Complex Calculation and Statistical/Econometric Methods score lower, underscoring the difficulty of multi‑step numerical reasoning under uncertainty.
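Reproducing this kind of grid reduces to accuracy grouped by formulation, language, and question type. A small aggregation sketch, assuming each per-question record carries those fields (the record schema is an illustrative assumption):

```python
from collections import defaultdict

def accuracy_by(records, *keys):
    """Mean accuracy of records grouped by the given metadata keys.
    Each record is assumed to look like (illustrative schema):
    {"model": ..., "formulation": ..., "language": ...,
     "qtype": ..., "correct": bool}."""
    buckets = defaultdict(list)
    for r in records:
        buckets[tuple(r[k] for k in keys)].append(r["correct"])
    return {group: sum(v) / len(v) for group, v in buckets.items()}

# e.g. accuracy_by(records, "model", "formulation") yields the
# Original / Revised / NOTA comparison; adding "language" separates
# the English CFA and Chinese CPA trends.
```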

The authors argue that reliable financial AI must not only produce correct answers when the problem is well‑posed but also know when to withhold an answer. RealFin thus fills a critical evaluation gap by measuring this “knowledge of ignorance”. The paper calls for future model development to incorporate uncertainty‑aware training, calibrated confidence estimation, and explicit handling of under‑specified inputs, especially in high‑stakes domains like finance where regulatory compliance and risk management demand honest acknowledgment of informational gaps.
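Of the directions listed, calibrated confidence estimation is the most directly measurable against RealFin-style outputs: since each reply carries a self-reported confidence, one can compute expected calibration error (ECE) over it. Below is a standard ECE sketch (the equal-width binning scheme is a common convention, not something the paper prescribes):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence|
    over equal-width confidence bins. Standard formulation; not
    specific to this paper."""
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    # Map each confidence in [0, 1] to one of n_bins equal-width bins.
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(hit[mask].mean() - conf[mask].mean())
    return ece
```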

