Bridging the Arithmetic Gap: The Cognitive Complexity Benchmark and Financial-PoT for Robust Financial Reasoning

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

While Large Language Models excel at semantic tasks, they face a critical bottleneck in financial quantitative reasoning, frequently suffering from "Arithmetic Hallucinations" and a systemic failure mode we term "Cognitive Collapse". To quantify this phenomenon rigorously, we introduce the Cognitive Complexity Benchmark (CCB), a robust evaluation framework grounded in a dataset constructed from 95 real-world Chinese A-share annual reports. Unlike traditional datasets, the CCB stratifies financial queries along a three-dimensional taxonomy (Data Source, Mapping Difficulty, and Result Unit), enabling precise diagnosis of reasoning degradation in high-cognitive-load scenarios. To address these failures, we propose the Iterative Dual-Phase Financial-PoT framework. This neuro-symbolic architecture enforces a strict architectural decoupling: it first isolates semantic variable extraction and logic formulation, then offloads computation to an iterative, self-correcting Python sandbox to ensure deterministic execution. Evaluation on the CCB demonstrates that while standard Chain-of-Thought falters on complex tasks, our approach offers superior robustness, raising the Qwen3-235B model's average accuracy from 59.7% to 67.3% and achieving up to 10-fold gains on high-complexity reasoning tasks. These findings suggest that architectural decoupling is a critical enabling factor for reliability in financial reasoning, providing a transferable architectural insight for precision-critical domains that require tight alignment between semantic understanding and quantitative computation.


💡 Research Summary

The paper tackles a critical weakness of large language models (LLMs) when they are asked to perform quantitative reasoning on financial documents: the models frequently generate “arithmetic hallucinations” (incorrect numbers or misplaced signs) and suffer a systemic failure mode the authors name “cognitive collapse.” To measure this phenomenon rigorously, the authors construct the Cognitive Complexity Benchmark (CCB), a new evaluation suite built from 95 real‑world Chinese A‑share annual reports.

Benchmark design
From the reports the authors extract 3,200+ financial queries and annotate each with a three‑dimensional taxonomy:

  1. Data Source – where the needed figure resides (balance sheet, income statement, footnotes, etc.).
  2. Mapping Difficulty – the logical complexity required to obtain the answer (simple sum, ratio, multi‑step formula).
  3. Result Unit – the unit of the answer (currency, percentage, absolute count).

This 3‑D taxonomy yields 27 distinct cells (e.g., Low‑Low‑Currency, High‑High‑Ratio) and enables precise diagnosis of how performance degrades as cognitive load increases. The authors observe a non‑linear “collapse curve”: once the combination of difficulty and unit crosses a certain threshold, accuracy drops sharply, confirming that the problem is not merely about token length but about the breakdown of the link between semantic understanding and numeric computation.
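The 3 × 3 × 3 cell structure can be sketched directly. This is a minimal illustration of how a CCB query might be tagged and bucketed; the specific label names (`balance_sheet`, `low`, `currency`, etc.) are assumptions for illustration, not the paper's exact vocabulary.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative label sets for the three taxonomy dimensions
# (assumed names; the paper defines its own categories).
DATA_SOURCES = ("balance_sheet", "income_statement", "footnotes")
MAPPING_DIFFICULTIES = ("low", "medium", "high")
RESULT_UNITS = ("currency", "percentage", "count")


@dataclass
class CCBQuery:
    """One annotated benchmark query."""
    question: str
    data_source: str
    mapping_difficulty: str
    result_unit: str

    @property
    def cell(self) -> tuple[str, str, str]:
        """The taxonomy cell this query falls into (one of 27)."""
        return (self.data_source, self.mapping_difficulty, self.result_unit)


# Enumerating the full grid gives the 27 distinct cells.
all_cells = list(product(DATA_SOURCES, MAPPING_DIFFICULTIES, RESULT_UNITS))
print(len(all_cells))  # 27

q = CCBQuery("What is the 2023 gross margin?",
             "income_statement", "high", "percentage")
print(q.cell)  # ('income_statement', 'high', 'percentage')
```

Grouping per-query accuracy by `cell` is what lets the authors plot the non-linear "collapse curve" as cognitive load increases.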

Cognitive collapse analysis
Two error families are identified:

  • Arithmetic Hallucination – the model fabricates numbers or flips signs, often because the probabilistic decoder tries to “guess” a value rather than compute it.
  • Logic Dropout – the model fails to extract required variables or mis‑orders operations, effectively losing the logical premise of the query.

Both errors become dramatically more frequent in the high-difficulty cells, where standard Chain-of-Thought (CoT) prompting sees accuracy fall below 30%.

Iterative Dual‑Phase Financial‑PoT
To remedy the collapse, the authors propose a neuro‑symbolic pipeline that strictly decouples semantic parsing from numeric execution:

  1. Phase 1 – Semantic Parsing
    The LLM receives the natural‑language question and is forced to output a structured schema (JSON) that lists variable names, the exact arithmetic expression, and the location of each variable in the source document. This step is treated as extraction rather than generation, reducing stochastic drift.

  2. Phase 2 – Symbolic Execution & Self‑Correction
    The schema is translated into Python code and run inside an isolated sandbox. The result is compared to the ground‑truth value; if a mismatch is detected, an iterative self‑correction loop triggers a new prompt that asks the model to revise the parsing output. Typically one to three correction cycles suffice to eliminate the hallucination.
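The two phases and the correction loop can be sketched as follows. This is a minimal illustration under stated assumptions: `llm` is a hypothetical callable standing in for the model, `validate` stands in for the framework's result check, and `eval` over a restricted namespace stands in for the isolated Python sandbox.

```python
import json


def phase1_parse(question: str, document: str, llm) -> dict:
    """Phase 1: elicit a structured schema (treated as extraction, not generation)."""
    prompt = (
        "Extract the variables and the exact arithmetic expression as JSON "
        'with keys "variables" and "expression".\n'
        f"Question: {question}\nDocument: {document}"
    )
    return json.loads(llm(prompt))


def phase2_execute(schema: dict) -> float:
    """Phase 2: deterministic evaluation of the extracted expression.

    eval() over a stripped-down namespace stands in for the sandboxed
    interpreter; builtins are removed so only arithmetic on the extracted
    variables is possible.
    """
    return eval(schema["expression"], {"__builtins__": {}}, dict(schema["variables"]))


def financial_pot(question: str, document: str, llm, validate, max_iters: int = 3):
    """Run parse -> execute, with up to `max_iters` self-correction cycles."""
    schema = phase1_parse(question, document, llm)
    result = phase2_execute(schema)
    for _ in range(max_iters):
        if validate(result):
            return result
        # Mismatch detected: ask the model to revise its parsing output.
        revision = llm(
            f"Result {result} failed validation. Revise this schema: {json.dumps(schema)}"
        )
        schema = json.loads(revision)
        result = phase2_execute(schema)
    return result
```

The design point is visible in the code: the LLM only ever produces a schema; every number in the final answer comes from the interpreter, never from the decoder.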

The key design principle is architectural decoupling: deterministic computation is delegated to an external interpreter, while the LLM only performs language understanding and schema generation.

Experimental evaluation
Four models (Qwen3-235B, GPT-4-Turbo, LLaMA-2-70B, InternLM-2-20B) are evaluated on the CCB and on existing financial QA datasets (FinQA, TAT-QA). Results for the flagship Qwen3-235B are:

  • Overall accuracy rises from 59.7% to 67.3% (+7.6 percentage points).
  • In the hardest cells (high mapping difficulty, ratio/complex unit), accuracy improves up to 10× (from ~2% to ~20%).
  • Average sandbox execution time is 0.12 s per call; the full pipeline runs in ~1.8 s, compatible with real-time services.

Ablation studies show that removing either phase degrades performance by 4–5 percentage points, confirming the synergy of the two stages.

Implications and future work
The study demonstrates that for domains where precise numeric output is non‑negotiable (finance, law, engineering, medicine), a neuro‑symbolic architecture that separates language understanding from arithmetic execution is essential. The CCB benchmark provides a systematic way to locate “cognitive bottlenecks” during model development and to guide data‑collection or prompt‑engineering efforts. Future directions suggested include: automatic estimation of mapping difficulty, multimodal parsing of tables and charts, and scaling the sandbox execution securely for large‑scale deployment.

Conclusion
By introducing a rigorously constructed benchmark (CCB) and a dual‑phase, iterative Financial‑PoT framework, the authors quantitatively expose and effectively mitigate the arithmetic hallucination and cognitive collapse problems that plague LLMs in financial quantitative reasoning. The approach yields substantial accuracy gains on a challenging real‑world dataset while keeping inference latency low, offering a transferable blueprint for building reliable, precision‑critical AI systems in finance and beyond.

