FinTradeBench: A Financial Reasoning Benchmark for LLMs


Authors: Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta (University of Central Florida)

Abstract

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with the advancement of Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals. To take advantage of the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human–LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.
1 Introduction

Real-world financial analysis requires reasoning over two complementary information sources: company fundamentals and market dynamics.

[Figure 1: Performance comparison of proprietary LLMs on a trading-signal-focused question. There was no pullback in Nvidia's stock in July 2025, and it was not a lucrative buying opportunity; only Claude correctly identified the pullback component. On the buying component, all LLMs failed.]

Company fundamentals are accounting-based metrics derived from company balance sheets or Securities and Exchange Commission (SEC) filings, such as profitability, leverage, and valuation ratios, that capture a company's underlying financial health (Fama and French, 1992; Harvey et al., 2016). In contrast, trading signals, computed from historical price and volume data, capture market dynamics and investor sentiment, including momentum, volatility, and trend reversals (Brock et al., 1992; Jegadeesh and Titman, 1993; Lo et al., 2000; Andersen et al., 2003; Park and Irwin, 2007; Choi, 2021). Effective financial analysis and valuation therefore require integrating these two perspectives rather than relying on either source in isolation. Moreover, synthesizing heterogeneous information sources, reasoning over numerical indicators, and interpreting market behavior under uncertainty make financial analysis an inherently challenging task, even for expert human analysts.

With the advent of artificial intelligence (AI), LLMs are increasingly used to assist analysts with financial analysis tasks such as summarizing earnings calls, interpreting regulatory filings, and answering questions about company performance and risk (Lee et al., 2025; Yang et al., 2025, 2023; Djagba and Saley, 2025; Wu et al., 2023; Shah et al., 2022; Araci, 2019). Existing benchmarking datasets such as FinQA (Chen et al., 2022a), ConvFinQA (Chen et al.
, 2022b), and TAT-QA (Zhu et al., 2021) focus primarily on numerical reasoning over financial reports and tables. Recent benchmarks evaluate long-context reasoning and retrieval-based question answering over financial text (Li et al., 2024; Reddy et al., 2025; Islam et al., 2023; Choi et al., 2025). For a comprehensive overview of FinLLMs and their benchmarking, see Lee et al. (2025). Nevertheless, these datasets rarely assess reasoning over trading signals derived from historical price dynamics, and typically do not require models to integrate both sources of information. Consequently, it remains unclear whether current LLMs can answer financial questions that require joint reasoning over company fundamentals and market behavior. For example, answering the trading question "Is NVIDIA's pullback in July 2025 a buying opportunity?", as in Figure 1, requires reasoning over both company fundamentals and market dynamics. An analyst must consider metrics such as return on assets and cash-flow strength while interpreting trading signals reflected in price momentum and trading volume to determine whether the company's stock is undervalued or trading at a premium. As Figure 1 shows, most LLMs fail to reason over the relevant trading signals and give incorrect answers.

A related class of ambiguities arises when fundamentals and market behavior conflict. For example, in April 2025, despite weak first-quarter earnings (earnings per share (EPS) of $0.27 vs. $0.42 expected, and revenue of $19.34B vs. $21B expected) and a generally cautious analyst consensus, Tesla's stock rallied by nearly 20% within days, driven by a forward-looking market narrative rather than contemporaneous fundamentals (Investing.com, 2025; McDade, 2025).
This is a textbook case of how investor sentiment and market narratives can drive stock prices independently of current fundamentals (De Bondt and Thaler, 1985; Baker and Wurgler, 2006; Shiller, 2017; Bybee et al., 2023). If one needs to know whether to buy or sell Tesla stock in April 2025, they may not find a reliable answer using financial statements alone.¹ Evaluating such reasoning is challenging, since high-quality annotations require domain expertise, and LLMs often fail to capture numerical fidelity or alignment with expert judgment.

¹This is not an isolated event; see §A for analogous cases.

In this paper, we address these challenges by making the following contributions:

(i) FinTradeBench: A Financial Reasoning Benchmark (§3). We introduce FinTradeBench, a benchmark for evaluating financial reasoning over company fundamentals (from SEC filings) and trading signals (from historical price data). We curate a compact set of signals commonly used in financial analysis, including valuation ratios, leverage metrics, momentum indicators, and volatility measures, and integrate them to support structured reasoning across heterogeneous data sources; see §B, Table 6 for the full set of signals. Questions are organized into fundamentals-focused, trading-signal-focused, and hybrid reasoning categories for granular model evaluation. Using a calibration-then-scaling pipeline, we combine 150 expert-authored seed questions (50 per category), each with golden key indicators, and scale them across firms and time periods to yield 1,400 total benchmark questions.

(ii) Benchmarking & Evaluation (§4 & §5). We benchmark 14 LLMs in zero-shot prompting and retrieval-augmented settings and witness a clear performance gap in financial reasoning.
Retrieval substantially improves performance on fundamentals-focused questions (up to +37% relative accuracy) and hybrid reasoning questions (up to +55%), but offers limited or negative gains for trading-signal questions derived from time-series data; see Table 2. This suggests that while current LLMs can effectively leverage textual financial information, they struggle to interpret quantitative market dynamics.

2 Background and Related Work

Financial Question-Answering (QA) Benchmarks. The last decade witnessed a surge in question-answering and numerical reasoning datasets in finance. For example, FinQA (Chen et al., 2022a) and TAT-QA (Zhu et al., 2021) are numerical reasoning datasets based on financial reports, tables, and textual disclosures. While ConvFinQA (Chen et al., 2022b) extended these tasks to conversational settings, FinanceBench (Islam et al., 2023), FinDER (Choi et al., 2025), and DocFinQA (Reddy et al., 2025) expanded evaluation to long-context financial reasoning and retrieval tasks over financial documents. These benchmarks significantly advanced financial QA, with a primary focus on reasoning over textual financial disclosures and accounting-derived indicators. However, they rarely evaluate reasoning over trading signals derived from historical price dynamics, or require models to integrate both sources of financial information; see Lee et al. (2025).

Trading Signals and Quantitative Finance. Trading signals derived from price and volume data play a key role in understanding market behavior and risk dynamics (Lo et al., 2000). Indicators such as momentum, volatility, moving averages, and drawdowns have been widely studied in asset pricing and quantitative trading (Fama and French, 1992; Jegadeesh and Titman, 1993; Lo et al., 2000).
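As a concrete illustration of these indicators, the snippet below computes them from a daily close-price series. These are the common textbook definitions (k-day momentum, simple moving average, annualized volatility of log returns, peak-to-trough drawdown), not necessarily the exact parameterizations used in any particular benchmark.

```python
import math

def momentum(closes, k):
    # k-day momentum: relative price change over the last k sessions
    return closes[-1] / closes[-1 - k] - 1.0

def moving_average(closes, window):
    # simple moving average over the trailing window
    return sum(closes[-window:]) / window

def realized_volatility(closes, trading_days=252):
    # annualized sample standard deviation of daily log returns
    rets = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return math.sqrt(var) * math.sqrt(trading_days)

def max_drawdown(closes):
    # largest peak-to-trough decline, as a fraction of the running peak
    peak, worst = closes[0], 0.0
    for p in closes:
        peak = max(peak, p)
        worst = max(worst, (peak - p) / peak)
    return worst
```

On a series like [100, 110, 105, 120, 90], the one-day momentum is −25% and the maximum drawdown is 25% (from the 120 peak to the 90 trough).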
Volatility measures are also used to capture perceived market risk and regime changes (Engle, 2004; Ang and Timmermann, 2012; Bollerslev et al., 2015, 2018). While machine learning approaches have recently been applied to forecasting volatility and detecting market regimes (Han et al., 2025; Mishra et al., 2024; Moreno-Pino and Zohren, 2024; Li, 2024), they typically frame the problem as prediction rather than question answering. As a result, existing NLP benchmarks rarely evaluate whether LLMs can reason about these signals for financial analysis.

LLM Evaluation and Benchmark Design. Carefully designed benchmarks for evaluating the reasoning capabilities of LLMs are important. Prior studies emphasize controlled evaluation settings, high-quality reference answers, and scalable annotation strategies when constructing benchmarks for complex reasoning tasks (Chen et al., 2024a; Ye et al., 2025; Hossain et al., 2025). Financial QA presents additional challenges due to numerical fidelity, domain expertise, and the integration of heterogeneous data sources (Yang et al., 2023; Ran et al., 2019; Zhang et al., 2024).

Comparison with Existing Benchmarks. Table 1 summarizes key differences between existing financial benchmarks and ours. Prior financial QA benchmarks (Islam et al., 2023; Chen et al., 2022b) emphasize reasoning over textual financial documents, while quantitative finance research (Oberlechner, 2001; Jegadeesh and Titman, 1993) focuses on predictive modeling of trading signals. None explicitly evaluates reasoning over trading signals or the joint interaction between fundamentals and market dynamics. FinTradeBench bridges these two areas by introducing a benchmark that evaluates financial reasoning across both company fundamentals and trading signals within a unified evaluation framework. By explicitly modeling these two signal types, we pose financial reasoning tasks that closely reflect real-world financial analysis.

Table 1: Comparison of financial QA benchmarks. The columns indicate different features: retrieval support (RAG), time-series trading signals (TS), multi-hop reasoning (MH), cross-modal joint reasoning over fundamentals and trading signals (F+T), and LLM-oriented design. ✓ = supported, × = not supported, ◦ = partially supported.

Dataset                            RAG  TS  MH  F+T  LLM
FinQA (Chen et al., 2022a)          ×   ×   ×   ×    ×
DocFinQA (Reddy et al., 2025)       ◦   ×   ×   ×    ×
ConvFinQA (Chen et al., 2022b)      ×   ×   ×   ×    ×
FinDER (Choi et al., 2025)          ✓   ×   ◦   ×    ◦
FinTextQA (Chen et al., 2024b)      ✓   ×   ◦   ×    ✓
AlphaFin (Li et al., 2024)          ✓   ×   ×   ×    ✓
FinanceBench (Islam et al., 2023)   ✓   ×   ◦   ×    ✓
FinTradeBench (this paper)          ✓   ✓   ✓   ✓    ✓

3 FinTradeBench: Benchmark Design

In this section, we construct FinTradeBench using the calibration-then-scaling paradigm that grounds expert financial intuition in automated large-scale evaluation (Srivastava et al., 2023; Liang et al., 2022; Thrush et al., 2022; Cobbe et al., 2021). The benchmark curation has three primary components, each with multiple sub-components; see the pipeline in Figure 2.

(1) Scope and Data Sources. FinTradeBench covers NASDAQ-100 companies over a ten-year window (2015–2025), ensuring reporting consistency and availability of both regulatory filings and trading data. For each company-quarter pair, we aggregate two primary sources: (i) Regulatory Filings (10-K/10-Q): SEC filings from which we extract company fundamentals such as profitability, leverage, valuation, and efficiency ratios.
(ii) Daily Trading Data: OHLCV (Open, High, Low, Close, and Volume) data used to compute trading signals such as momentum, volatility, drawdown, and moving averages. All signals are aligned by ticker and financial quarter to ensure benchmark questions correspond to verifiable historical data; see Table 6 for the full signal list.

Signal selection follows three principles, Interpretability, Empirical Relevance, and Liquidity, which are consistent with the established asset pricing and trading literature (Fama and French, 1992; Zheng et al., 2023; Jegadeesh and Titman, 1993; Harvey et al., 2016).

[Figure 2: FinTradeBench design pipeline. We sketch the 3 primary components and their sub-components on the top block of the pipeline: 1. Data & Design (top left), 2. Question Taxonomy (top middle), and 3. Calibration (top right). The lower block emphasizes the scaling phase, which is a sub-pipeline on its own.]

Signals are organized into: (i) Company Fundamentals, which are accounting-based indicators including return on assets (ROA), return on equity (ROE), earnings-to-price, book-to-price, debt-to-equity, and sales-to-assets; and (ii) Trading Signals, which are price time-series-derived indicators including moving averages, momentum, realized volatility, drawdowns, and volume measures.

(2) Question Taxonomy. We divide the questions into three reasoning categories: (i) Fundamentals-Focused (F-type): reasoning over accounting-based indicators such as ROA and ROE; (ii) Trading-Focused (T-type): reasoning over market trading signals such as price momentum, volatility, and market dynamics; and (iii) Hybrid (FT-type): joint reasoning across both signal types. This taxonomy enables fine-grained performance analysis and tests whether models can integrate heterogeneous financial signals. See sample questions with gold responses in Table 9.

(3) Calibration-Then-Scaling Framework.
To make our benchmark scalable, unbiased, and reproducible, we use a calibration-then-scaling framework; see Figure 2. The framework proceeds in three phases: first, expert-guided seed construction; then, evaluation and calibration; and finally, automated scaling using a calibrated LLM judge (Zheng et al., 2023; Gu et al., 2024).

(3a) Phase 1: Multi-Model Candidate Generation and Self-Selection. In this phase, we generate responses via LLMs as potential candidates for the gold answer in our benchmark, following three steps: (i) Multi-model, multi-prompt sampling. For each question Q, we generate N = 6 candidate responses per model using distinct prompt templates derived from the TELeR taxonomy (Santu and Feng, 2023), which defines a structured hierarchy for reasoning-level prompts; see Table 7. This promotes intra-model response diversity while maintaining cross-model comparability, paralleling best-of-N sampling (Chow et al., 2025) and self-reflective refinement (Shinn et al., 2023). (ii) Intra-model self-filtering. Each model independently selects its best response a⋆ by comparing its N candidates on factual accuracy, reasoning completeness, and relevance. This symmetric, bias-neutral design avoids cross-model preference leakage (Li et al., 2025), as each model evaluates only its own outputs. Because all candidates within a model share identical priors, the selection process remains symmetric and acts as a bias-neutral quality filter. This is consistent with prior self-evaluation work (Lee et al., 2024; Yuan et al., 2024; Wu et al., 2024). (iii) Automated numerical audit. Each self-selected response is audited against a structured financial knowledge base by an independent LLM auditor. Numerical claims are classified as SUPPORTED, CONTRADICTED, or NOT FOUND, yielding a binary is_accurate indicator.
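The audit rule can be sketched as follows. The 1% relative tolerance, the dictionary-style knowledge base, and the convention that NOT FOUND claims do not count against accuracy are illustrative assumptions, not the paper's exact implementation.

```python
def audit_claim(metric, claimed, knowledge_base, rel_tol=0.01):
    """Classify one numerical claim against a reference knowledge base.
    rel_tol (1% here) is an assumed matching tolerance."""
    if metric not in knowledge_base:
        return "NOT FOUND"
    ref = knowledge_base[metric]
    if ref == 0:
        return "SUPPORTED" if claimed == 0 else "CONTRADICTED"
    ok = abs(claimed - ref) / abs(ref) <= rel_tol
    return "SUPPORTED" if ok else "CONTRADICTED"

def is_accurate(claims, knowledge_base):
    # Binary indicator: accurate iff no claim is contradicted
    # (assumption: NOT FOUND claims are tolerated rather than penalized)
    verdicts = [audit_claim(m, v, knowledge_base) for m, v in claims]
    return "CONTRADICTED" not in verdicts
```

For instance, a response claiming ROA = 0.20 when the knowledge base records 0.12 would be flagged CONTRADICTED, and the whole response marked inaccurate.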
To quantify filtering effectiveness, we compute mean numerical accuracy before and after self-filtering, as well as precision, recall, and F1 over the overlap between the financial metrics referenced in the generated response (M_gen) and the ground-truth reference metrics (M_ref); see §E.1.

(3b) Phase 2: Evaluation and Calibration. Following Liang et al. (2022), we ask a financial expert and an independent LLM to evaluate the self-filtered responses and then align their evaluations as follows: (i) Human Evaluation. Domain experts evaluate self-filtered responses {a⋆_m} across all models on a 5-point Likert scale for four criteria: factual and numerical accuracy, completeness, relevance, and clarity. Evaluation is performed double-blind to prevent rater bias (Zheng et al., 2023). These human judgments serve as the gold standard against which the automated judge is calibrated. (ii) LLM-as-a-Judge Evaluation. We evaluate each question Q, self-selected response a⋆_m, and numerical audit summary by giving a structured rubric mirroring the human criteria to Claude Sonnet 4.5 (Anthropic, 2025), which acts as an independent LLM judge distinct from all generator models.

[Figure 3: Overview of the RAG architecture. The pipeline features a dual-track retrieval engine that processes unstructured text and structured time series separately. The generation phase utilizes the TELeR taxonomy to produce multiple candidate responses across the model zoo, which are filtered via self-selection before final evaluation.]

This separation of generator and evaluator mitigates known self-preference biases in LLM-judge systems (Chen et al., 2024a; Ye et al., 2025, 2024; Zheng et al., 2023). We measure human–LLM-judge alignment via mean absolute error (MAE) per model and aggregate across models to assess inter-rater consistency (Hossain et al., 2025); see §E.1.
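Both quantities reduce to a few lines. The sketch below assumes metrics are matched by exact name and that human and judge Likert scores are paired per response.

```python
def indicator_prf1(m_gen, m_ref):
    # Precision/recall/F1 over the overlap between metrics referenced in
    # the generated response (m_gen) and the reference metrics (m_ref)
    gen, ref = set(m_gen), set(m_ref)
    tp = len(gen & ref)
    p = tp / len(gen) if gen else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def judge_mae(human_scores, judge_scores):
    # Mean absolute error between paired human and LLM-judge Likert ratings
    pairs = list(zip(human_scores, judge_scores))
    return sum(abs(h - j) for h, j in pairs) / len(pairs)
```

For example, a response citing {ROA, RSI, EPS} against reference metrics {ROA, EPS} scores precision 2/3, recall 1.0, and F1 0.8.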
We achieve human–LLM alignment through prompt engineering (see §F) on our seed set of 150 questions, each annotated with golden indicators; see the sample question in Table 9.

(3c) Phase 3: Scaling. In this phase, we generate scaled versions of the seed questions using NASDAQ-100 companies spanning 2015–2025. As in Phase 1 of §3, we use the same pipeline to generate and filter responses. Finally, using the human-aligned LLM judge (MAE < 10%; see §G), we evaluate these responses, scaling the benchmark to 1,400 historically grounded financial reasoning questions (Kiela et al., 2021).

4 Experimental Setup

We evaluate LLMs on FinTradeBench under zero-shot (No-RAG) and realistic-RAG conditions to isolate the contribution of heterogeneous information sources: textual/tabular SEC filings vs. numerical time-series market data. This design follows recent recommendations for high-stakes RAG evaluation (Friel et al., 2025; Niu et al., 2024).

Models Evaluated. We evaluate 14 LLMs across FinTradeBench. We categorize the LLMs by parameter scale and reasoning capability into large (proprietary and open-source LLMs with stronger reasoning and ⪆100B parameters), mid, and small (open-source and distilled or instruction-tuned, with parameters ranging from 1–70B); see Table 2. Evaluating this diverse set under different signal combinations (Fundamentals (F), Trading (T), and Hybrid (FT)) allows us to isolate how model size and architecture affect performance across heterogeneous financial signals.

Domain-Aware Hybrid Retrieval via RAG. We design a RAG-based architecture for the dual nature of financial analysis by integrating text and tabular data with numerical time-series data. Our four-part RAG implements a Dual-Track Retrieval Engine followed by TELeR-guided generation and self-filtering before evaluation; see Figure 3.

(i) Data Ingestion and Indexing.
Financial documents often contain high token density and complex tables that standard chunking corrupts. We address this via two strategies: (a) Hierarchical Indexing: A parent-child strategy segments documents by logical boundaries (e.g., SEC "Item 7") (Shaukat et al., 2026; Zhou et al., 2026; Lewis et al., 2021, 2020). Retrieval matches on smaller child chunks (L = 300 tokens) trigger the loading of the full parent context, preserving narrative coherence. (b) Metadata Injection: Structured metadata (ticker, fiscal year) is prepended to every chunk embedding to mitigate temporal hallucination.

(ii) Dual-Track Retrieval Engine. We use a dual-track retrieval architecture to handle the asymmetric structure of financial evidence; see Figure 3. Track A indexes SEC filings using parent–child chunking: smaller child chunks are embedded for retrieval, while larger parent sections are returned to preserve document-level context. Retrieval over this track combines dense embeddings (BAAI/bge-large-en-v1.5), BM25 lexical matching, and cross-encoder re-ranking (ms-marco-MiniLM-L-6-v2). Track B indexes market data as time-period-aligned price chunks retrieved via an auxiliary temporal query mechanism; these chunks bypass cross-encoder re-ranking, as semantic relevance models tend to underweight structured time-series evidence.
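A minimal sketch of the parent-child strategy in Track A is given below. The function names, the whitespace approximation of tokens, and the toy query_score hook are our own illustrative choices; in the real engine, dense-embedding and BM25 scores over child chunks would replace query_score.

```python
def build_index(sections, child_len=300):
    """Split each parent section into ~child_len-token child chunks.
    Returns (child_chunks, parent_of) mapping child id -> parent id.
    Tokens are approximated by whitespace-separated words here."""
    children, parent_of = [], {}
    for pid, (meta, text) in enumerate(sections):
        words = text.split()
        for start in range(0, len(words), child_len):
            cid = len(children)
            # Metadata (ticker, fiscal year) is prepended before embedding
            children.append(f"{meta} " + " ".join(words[start:start + child_len]))
            parent_of[cid] = pid
    return children, parent_of

def retrieve_parents(query_score, children, parent_of, sections, k=2):
    # Score child chunks, then load the FULL parent sections of the
    # top-k matches (deduplicated), preserving narrative coherence
    ranked = sorted(range(len(children)),
                    key=lambda c: query_score(children[c]), reverse=True)
    parents = []
    for cid in ranked[:k]:
        pid = parent_of[cid]
        if pid not in parents:
            parents.append(pid)
    return [sections[pid][1] for pid in parents]
```

The key design point is that matching happens on small chunks while generation context comes from the larger parent, so a hit on one sentence of "Item 7" pulls in the whole discussion section.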
Table 2: Overall and category-specific accuracy (%) across LLMs. Here Δ denotes the relative improvement of RAG over No-RAG (Δ = (RAG − No-RAG) / No-RAG × 100%). Significance assessed via paired t-test (* p < 0.05, ** p < 0.01).

Model                       F (No-RAG / RAG, Δ)    T (No-RAG / RAG, Δ)    FT (No-RAG / RAG, Δ)   Overall (No-RAG / RAG, Δ)

Large LLMs
DeepSeek-R1                 34.0 / 42.0, +23.6%    24.8 / 24.8, 0.0%      33.2 / 46.4, +39.8%    30.7 / 37.7, +23.0%**
Gemini 2.5 Flash            31.0 / 38.4, +23.8%    24.1 / 21.6, -10.3%    30.8 / 40.4, +31.2%    28.6 / 33.4, +16.7%**
Gemini 2.5 Flash-Lite       34.4 / 34.4, 0.0%      26.4 / 21.2, -19.7%    33.2 / 37.6, +13.1%    31.3 / 31.0, -1.0%
GPT-5-mini                  37.1 / 42.0, +13.1%    27.8 / 23.2, -16.4%    34.8 / 44.1, +26.7%    33.2 / 36.4, +9.4%*

Mid LLMs
R1-Distill-LLaMA (70B)      34.2 / 32.3, -5.5%     24.6 / 20.1, -18.4%    27.1 / 27.2, +0.3%     28.5 / 26.3, -7.7%
R1-Distill-Qwen (32B)       31.7 / 43.5, +37.0%    21.6 / 22.0, +2.2%     24.1 / 37.4, +55.1%    25.7 / 33.9, +32.0%**
LLaMA 3.3 70B               29.4 / 34.8, +18.4%    21.2 / 21.6, +1.8%     26.8 / 30.2, +12.7%    25.8 / 28.9, +11.8%**
LLaMA 3.3 Instruct (70B)    40.1 / 36.9, -7.9%     25.0 / 20.5, -18.0%    29.6 / 28.5, -3.7%     31.4 / 28.4, -9.5%**
Qwen 2.5 Instruct (32B)     42.3 / 47.0, +11.3%    24.8 / 22.7, -8.5%     33.4 / 40.2, +20.3%    33.2 / 36.2, +8.9%**

Small LLMs
LLaMA 3.1 Instruct (8B)     35.7 / 35.0, -2.0%     23.2 / 20.9, -9.8%     28.2 / 30.0, +6.3%     28.9 / 28.4, -1.7%
Phi-4 (14B)                 36.8 / 38.6, +4.9%     23.6 / 23.2, -1.6%     29.6 / 31.3, +5.8%     29.8 / 30.8, +3.3%*
Mistral v0.2 (7B)           33.7 / 34.2, +1.3%     24.3 / 29.9, +23.2%    27.4 / 30.0, +9.7%     28.3 / 31.3, +10.6%**
R1-Distill-Qwen (14B)       30.8 / 41.3, +33.9%    21.0 / 21.8, +3.7%     23.6 / 35.3, +49.5%    25.0 / 32.5, +29.6%**
LFM 2.5 (1.2B)              24.8 / 23.5, -5.2%     20.1 / 20.0, -0.4%     21.3 / 21.5, +0.8%     22.0 / 21.6, -1.8%

At query time, we dynamically merge the outputs from both tracks: after ticker detection, the system retrieves evidence independently per track, applies source-specific quotas, filters by temporal relevance, and removes duplicate parent contexts.
We assemble the final prompt under a global token budget, balancing long-form financial texts containing company fundamentals with the short-horizon market evidence needed to calculate trading signals, while preserving signal-specific retrieval strengths.

(iii) TELeR-guided response generation and self-filtering. In the generation phase, we use the TELeR taxonomy (Santu and Feng, 2023) to generate six distinct prompts per question, ranging from simple directives (L1) to complex RAG-aware reasoning tasks (L6); see §C, Table 8. This reduces reasoning errors associated with any single prompt structure. A self-selection module evaluates these candidates against the retrieved context to identify the best RAG and best No-RAG response per model, consistent with the self-filtering approach used during benchmark construction; see §3.

(iv) Evaluation. An LLM judge evaluates the selected best RAG and best No-RAG responses against ground-truth answers and a set of golden indicators, i.e., the key financial metrics required for a correct answer, ensuring evaluation captures both reasoning quality and factual precision.

Evaluation Metrics & Statistical Testing. We evaluate model performance along four dimensions: (i) Absolute Accuracy normalizes the judge's 1–5 correctness score to a percentage. (ii) Relative Retrieval Delta (Δ) measures the relative accuracy shift of RAG architectures compared to No-RAG, with statistical significance assessed via a paired-samples t-test over question-level scores. (iii) Golden Indicator F1 measures precision and recall over expert-defined financial metrics in model responses. (iv) Integration Score assesses how well models synthesize textual and tabular signals; see full metric definitions in Table 6.

5 Results and Discussion

Table 2 reports the performance comparison of RAG-based and No-RAG architectures for the 14 evaluated LLMs on FinTradeBench.
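The Δ values and significance markers in Table 2 follow the definitions above; in sketch form (pure Python, no SciPy — converting the t statistic to a p-value would additionally require the t distribution with n−1 degrees of freedom):

```python
import math

def retrieval_delta(no_rag_acc, rag_acc):
    # Relative improvement of RAG over No-RAG, in percent:
    # delta = (RAG - No-RAG) / No-RAG * 100
    return (rag_acc - no_rag_acc) / no_rag_acc * 100.0

def paired_t(scores_no_rag, scores_rag):
    # Paired-samples t statistic over question-level score differences
    diffs = [b - a for a, b in zip(scores_no_rag, scores_rag)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

For example, Table 2's F-type accuracies of 31.7 (No-RAG) and 43.5 (RAG) for R1-Distill-Qwen (32B) give Δ ≈ +37.2%.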
Paired t-tests on question-level correctness scores assess the statistical reliability of RAG-induced changes. Table 3 (and Figure 5 in §H.3) complements this with global generative quality metrics, revealing how RAG reshapes model reasoning behavior beyond raw accuracy. Our analysis surfaces the following findings:

(1) RAG strongly benefits fundamental reasoning (F) and degrades trading-signal (T) reasoning. Fundamental (F) questions require extracting accounting metrics such as debt-to-equity ratios from SEC 10-K/10-Q filings. On them, RAG produced large, statistically significant gains for reasoning-capable LLMs. For example, R1-Distill-Qwen (32B) improved by 37% relative to its No-RAG baseline on F-type questions. Among proprietary models, Gemini 2.5 Flash gained 23.8% and DeepSeek-R1 gained 23.6% on fundamentals. This confirms that hybrid retrieval over text-heavy financial disclosures effectively anchors generation and mitigates hallucination when pre-training representations of fundamental concepts are strong.

Trading (T) questions require computing technical indicators (e.g., momentum, RSI) from raw OHLCV price series. LLMs with RAG perform systematically worse on them. For example, the performance drops of Gemini 2.5 Flash-Lite, GPT-5-mini, and LLaMA 3.3 Instruct (70B) are in the range of 16.4–19.7% relative to their No-RAG baselines. Even when the auxiliary temporal retrieval track correctly surfaced the relevant price chunks, models could not reliably parse unrolled numerical tables to derive trend indicators. This suggests that quantitative market data demands intermediate computational steps, such as code execution, rather than retrieval alone. Our study indicates that LLMs struggle to compute metrics on the fly based only on retrieval. This observation aligns with Cobbe et al.
(2021), where LLMs fail to perform robustly on multi-step mathematical reasoning, and it opens a broader research direction beyond finance: systematically exploring these failure cases.

Takeaway Message: The pre-training data corpora of LLMs play a significant role in modulating their performance. SEC filings are legally mandated public records, indexed by EDGAR and widely represented in financial datasets (Islam et al., 2023; Yang et al., 2025; Choi et al., 2025; Chen et al., 2022a). As a result, LLMs enter evaluation with strong latent representations of fundamental financial concepts, and RAG acts as a grounding anchor that activates this prior knowledge. In contrast, tick-level trading data and proprietary technical trading signals are commonly behind a paywall (e.g., Bloomberg, Refinitiv). Their scarcity in pre-training mixtures leaves models without the representational framework needed to reason over retrieved numerical price tables; injecting this unfamiliar structure into the context window causes distraction rather than augmentation.

(2) Reasoning LLMs Dominate Hybrid Questions. Hybrid (FT) questions impose the highest cognitive load in our benchmark, requiring a model to retrieve company fundamentals and trading-signal context, and to reason across both. Models equipped with latent chain-of-thought reasoning capabilities outperformed standard instruction-tuned models in this category. DeepSeek-R1 achieved the highest Hybrid RAG accuracy at 46.4%, a 39.8% relative gain over its No-RAG baseline. This capability transferred to distilled open-weight models: R1-Distill-Qwen (32B) gained 55.1% and R1-Distill-Qwen (14B) gained 49.5% on FT questions. By contrast, instruction-tuned models without explicit reasoning steps showed modest or even negative gains in this category.
We attribute this to the fact that latent reasoning models allocate additional inference-time computation to reconciling conflicting signal types, precisely the capability that hybrid financial questions demand.

Table 3: Global quality metrics averaged across all LLMs.

Metric                               No-RAG   RAG    Δ (%)
Golden Indicator Precision (%)       0.44     0.20   -55.6%
Golden Indicator Recall (%)          0.22     0.10   -55.8%
Golden Indicator F1 (%)              0.27     0.12   -56.5%
Fundamental Integration (1–5)        1.60     1.81   +13.4%
Trading-Signal Integration (1–5)     1.54     1.47   -4.6%
Reasoning Depth (1–5)                2.74     2.44   -10.8%

(3) RAG actively harms certain model families regardless of their parameter count. While Qwen models and their DeepSeek distillations showed significant improvements (p < 0.01), the LLaMA models exhibited systematic degradation. LLaMA 3.3 Instruct (70B) declined by 9.5% overall (p < 0.01), with drops across all three question categories: 7.9% on F, 18.0% on T, and 3.7% on FT. R1-Distill-LLaMA (70B) declined by 7.7% overall, and LLaMA 3.1 Instruct (8B) dropped 1.7%. Notably, the 14B R1-Distill-Qwen model outperformed all three LLaMA models under RAG despite having fewer parameters. This pattern suggests that architecture and pre-training data mixture matter more than scale; LLaMA base weights appear highly susceptible to distraction when presented with dense, jargon-heavy SEC text and unstructured numerical tables, causing the model to abandon its internal reasoning in favor of surface-level context summarization.

(4) LLMs are Distracted by RAG-Based Extra Information Sources. Table 3 shows that, although LLMs ground their answers in the fundamental texts when aided by the RAG architecture, reasoning over the golden indicators decreases. This shows LLMs are prone to distraction when extra information is provided. Only Fundamental Integration scores improved (+13.4%) with RAG.
But the precise extraction of expert-defined Golden Indicators collapsed; Golden Indicator F1 dropped by ↓56.5%, and overall Reasoning Depth dropped by ↓10.8%. This shows that dense financial context improves surface-level factual grounding but actively suppresses the abstract analytical reasoning that expert evaluation requires. Models absorb the retrieved documents and produce fluent, citation-heavy summaries, yet fail to isolate the specific financial signals an expert would prioritize. Injecting rich evidence while preserving an LLM's analytical depth therefore remains the central challenge for financial RAG architectures.

5.1 A Case Study on the Impact of Context Quality on Reasoning

To qualitatively illustrate the distraction effect (the final finding), we compare Gemini 2.5 Flash-Lite outputs on a Hybrid (FT) query about Apple Inc. (AAPL) under three retrieval conditions: No-RAG, standard RAG, and a new condition that we call ideal RAG, in which the context is retrieved from precomputed signals. Table 4 shows that the No-RAG model produces generic textbook definitions with no actual data. The standard RAG model grounds its answer in real figures but is overwhelmed by raw revenue totals and daily price points, failing to surface the required Golden Indicators. Only the ideal RAG condition, where precomputed context is provided, activates precise reasoning; the model correctly identifies the Book/Price ratio (0.02), the RSI (60.39), and synthesizes a concrete investment conclusion. This confirms that the bottleneck is not model capability but context structure. When numerical financial signals are precomputed rather than raw, a mid-tier model can reason over them effectively.
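The "ideal RAG" condition amounts to serializing precomputed indicators into the prompt instead of raw tables. A minimal sketch of such a context builder, using the values from the AAPL example; the function and field names are hypothetical illustrations, not the benchmark's actual pipeline:

```python
def build_precomputed_context(indicators):
    """Serialize precomputed financial indicators into a compact prompt context.

    Sketch of the 'ideal RAG' condition: instead of raw price tables,
    the model receives named, precomputed signal values.
    """
    lines = [f"- {name}: {value}" for name, value in indicators.items()]
    return "Precomputed indicators:\n" + "\n".join(lines)


# Values taken from the AAPL case study in Table 4
ctx = build_precomputed_context({"Book/Price": 0.02, "RSI": 60.39, "ROE (%)": 35.36})
print(ctx)
```

Prepending such a block to the question, rather than dumping OHLCV rows, is what allowed the mid-tier model to cite the Golden Indicators directly.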
We provide unimodal ablations (either with precomputed company fundamentals or trading signals) in §H.1; §H.2 extends this analysis across market sectors and query types to confirm that the pattern is systemic rather than instance-specific.

Case Study: AAPL Valuation and Momentum
Question: As of August 2025, is Apple a good buy given its valuation and price trend?
Golden Indicators: Earnings/Price, Book/Price, RSI, One-Day Reversal.
1. Zero-Shot (No-RAG) Output: "To assess Apple's valuation, we would examine several key metrics... We would need to know Apple's projected EPS... A P/E significantly above its historical average might suggest the stock is expensive. [Model proceeds to list textbook definitions of P/E, P/S, P/B, and RSI without providing a single actual data point or conclusion for August 2025]..."
2. Standard RAG Output: "...Total net sales show a consistent upward trend from 2023 to 2025. Services segment has shown strong growth (14% in 2025)... August 2025 prices show a strong upward trend, starting around $202.38 and reaching $232.56. [Model gets distracted by raw text/prices, completely failing to calculate or cite the Golden Indicators.] Without forward-looking earnings estimates, a definitive valuation assessment is not possible..."
3. Ideal RAG (Precomputed Context): "...Apple's fundamental data presents a mixed picture. A low Book/Price ratio (0.02) suggests the market values Apple significantly higher than its book value, indicating a premium valuation. However, it demonstrates an excellent Return on Equity (35.36%). The market data for August 2025 shows a strong positive price trend... The RSI is at 60.39, suggesting the stock is leaning towards bullish sentiment without being overbought. Conclusion: Given the strong positive price trend and excellent profitability, Apple presents a compelling case, though investors must be comfortable with its premium valuation."
Table 4: Comparison of generative reasoning paths. Only the Ideal RAG successfully isolates the Golden Indicators to form a concrete, data-backed conclusion.

6 Conclusion and Future Direction

We introduce FinTradeBench to evaluate financial reasoning across company fundamentals, trading signals, and hybrid queries. FinTradeBench is built on a decade of NASDAQ-100 data using a calibration-then-scaling framework that combines expert-authored seeds with scalable automated evaluation. We benchmark 14 LLMs under No-RAG and RAG settings and observe several key patterns. Quantitatively, we witness the highest performance gains across all models on hybrid reasoning questions (up to ↑39.8% for large, ↑55.1% for mid-size, and ↑49.5% for small LLMs). RAG yields statistically significant improvements for fundamental reasoning, where retrieved SEC filings activate relevant prior knowledge. But RAG degrades performance on trading-signal questions, where raw numerical time-series data confuses models rather than assists them. It also introduces information overload, improving surface-level grounding while reducing the precise indicator extraction and reasoning depth required for financial analysis. Additionally, we witness that model architecture plays a significant role. Latent reasoning models (e.g., DeepSeek-R1 and its distillations) substantially outperform instruction-tuned models on hybrid questions, suggesting that inference-time chain-of-thought computation is crucial for reasoning over heterogeneous financial signals. Model families also exhibit different sensitivities to context; Qwen models generally benefit from RAG, while LLaMA models degrade overall, even at 70B parameters. We expect FinTradeBench to support consistent benchmarking for financial reasoning and provide a scalable, industry-standard dataset for researchers and practitioners.
We will release FinTradeBench upon acceptance; presently, we provide a representative subset and evaluation scripts for review. Greater data augmentation and evaluation diversity are essential; in the future, we plan to augment FinTradeBench, diversify the evaluation pipeline, and explore agentic RAG.

Acknowledgments

Aritra Dutta is partially supported by the Florida Department of Health Grant, AWD00007072, and the National Science Foundation Grant, 2321986.

References

Torben G. Andersen, Tim Bollerslev, Francis X. Diebold, and Paul Labys. 2003. Modeling and forecasting realized volatility. Econometrica, 71(2):579–625.

Andrew Ang and Allan Timmermann. 2012. Regime changes and financial markets. Annu. Rev. Financ. Econ., 4(1):313–337.

Anthropic. 2025. Claude Sonnet 4.5. https://www.anthropic.com/claude. Accessed: 2025-12-21.

Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063.

Malcolm Baker and Jeffrey Wurgler. 2006. Investor sentiment and the cross-section of stock returns. The Journal of Finance, 61(4):1645–1680.

Tim Bollerslev, Benjamin Hood, John Huss, and Lasse Heje Pedersen. 2018. Risk everywhere: Modeling and managing volatility. The Review of Financial Studies, 31(7):2729–2773.

Tim Bollerslev, Lai Xu, and Hao Zhou. 2015. Stock return and cash flow predictability: The role of volatility risk. Journal of Econometrics, 187(2):458–471.

William Brock, Josef Lakonishok, and Blake LeBaron. 1992. Simple technical trading rules and the stochastic properties of stock returns. The Journal of Finance, 47(5):1731–1764.

Leland Bybee, Bryan Kelly, and Yinan Su. 2023. Narrative asset pricing: Interpretable systematic risk factors from news text. The Review of Financial Studies, 36(12):4759–4787.

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024a. Humans or LLMs as the judge? A study on judgement bias.
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8301–8327.

Jian Chen, Peilin Zhou, Yining Hua, Loh Xin, Kehui Chen, Ziyuan Li, Bing Zhu, and Junwei Liang. 2024b. FinTextQA: A dataset for long-form financial question answering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6025–6047. Association for Computational Linguistics.

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2022a. FinQA: A dataset of numerical reasoning over financial data. Preprint.

Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. 2022b. ConvFinQA: Exploring the chain of numerical reasoning in conversational finance question answering. Preprint, arXiv:2210.03849.

Chanyeol Choi, Jihoon Kwon, Jaeseon Ha, Hojun Choi, Chaewoon Kim, Yongjae Lee, Jy-yong Sohn, and Alejandro Lopez-Lira. 2025. FinDER: Financial dataset for question answering and evaluating retrieval-augmented generation. Preprint, arXiv:2504.15800.

Jaehyung Choi. 2021. Maximum drawdown, recovery, and momentum. Journal of Risk and Financial Management, 14(11):542.

Yinlam Chow, Yu Li, Xuchao Han, Prateek Jain, Ruiqi Sun, Huazhe Xu, Jiawei Gao, Dong Zhou, and Bin Yu. 2025. Inference-aware fine-tuning for best-of-n sampling in large language models. In Proceedings of the 13th International Conference on Learning Representations (ICLR).

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Werner F. M. De Bondt and Richard Thaler. 1985. Does the stock market overreact?
The Journal of Finance, 40(3):793–805.

Prudence Djagba and Abdelkader Y. Saley. 2025. Exploring large language models for financial applications: Techniques, performance, and challenges with FinMA. Preprint, arXiv:2510.05151.

Robert Engle. 2004. Risk and volatility: Econometric models and financial practice. American Economic Review, 94(3):405–420.

Eugene F. Fama and Kenneth R. French. 1992. The cross-section of expected stock returns. The Journal of Finance, 47(2):427–465.

Robert Friel, Masha Belyi, and Atindriyo Sanyal. 2025. RAGBench: Explainable benchmark for retrieval-augmented generation systems. Preprint.

Robin Greenwood, Samuel G. Hanson, and Gordon Y. Liao. 2018. Asset price dynamics in partially segmented markets. The Review of Financial Studies, 31(9):3307–3343.

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. 2024. A survey on LLM-as-a-judge. The Innovation.

Beining Han, Anqi Liu, Jing Chen, and William Knottenbelt. 2025. Can machine learning models better volatility forecasting? A combined method. The European Journal of Finance, pages 1–22.

Campbell R. Harvey, Yan Liu, and Heqing Zhu. 2016. ...and the cross-section of expected returns. The Review of Financial Studies, 29(1):5–68.

MD Kamrul Hossain, Runpeng Zhang, Haoran Hu, and Kelvin Lo. 2025. LLMs as meta-reviewers' assistants: Benchmarking reliability, calibration, and bias in automated paper evaluation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. To appear.

Investing.com. 2025. Earnings call transcript: Tesla's Q1 2025 results fall short, stock rises post-call. investing.com. Accessed: 2026-03-15.

Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. 2023. FinanceBench: A new benchmark for financial question answering.
Preprint, arXiv:2311.11944.

Narasimhan Jegadeesh and Sheridan Titman. 1993. Returns to buying winners and selling losers: Implications for stock market efficiency. Journal of Finance, 48:65–91.

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, et al. 2021. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110–4124.

Jean Lee, Nicholas Stevens, and Soyeon Caren Han. 2025. Large language models in finance (FinLLMs). Neural Computing and Applications, 37(30):24853–24867.

Soomin Lee, Hyounghun Kim, Sungdong Kim, Joonho Kim, Soojin Park, and Juneyoung Park. 2024. Aligning large language models by on-policy self-judgment. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL).

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-augmented generation for knowledge-intensive NLP tasks. Preprint.

Alex Li. 2024. Volatility forecasting in global financial markets using TimeMixer. Preprint.

Xiang Li, Zhenyu Li, Chen Shi, Yong Xu, Qing Du, Mingkui Tan, Jun Huang, and Wei Lin. 2024. AlphaFin: Benchmarking financial analysis with retrieval-augmented stock-chain framework.
Preprint, arXiv:2403.12582.

Zeming Li, Zhuowan Jiang, Bowen Cheng, Tianle Zhao, Tianyi Zhang, and Yizhe Wang. 2025. Preference leakage: A contamination problem in LLM-as-a-judge. In Proceedings of the 13th International Conference on Learning Representations (ICLR).

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint.

Andrew W. Lo, Harry Mamaysky, and Jiang Wang. 2000. Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation. The Journal of Finance, 55(4):1705–1765.

Aaron McDade. 2025. Tesla's Q1 earnings miss estimates. investopedia.com. Accessed: 2026-01-15.

Aswini Kumar Mishra, Jayashree Renganathan, and Aaryaman Gupta. 2024. Volatility forecasting and assessing risk of financial markets using multi-transformer neural network based architecture. Engineering Applications of Artificial Intelligence, 133:108223.

Fernando Moreno-Pino and Stefan Zohren. 2024. DeepVol: Volatility forecasting from high-frequency data with dilated causal convolutions. Preprint.

Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. 2024. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. Preprint, arXiv:2401.00396.

Thomas Oberlechner. 2001. Importance of technical and fundamental analysis in the European foreign exchange market. International Journal of Finance & Economics, 6(1):81–93.

Cheol-Ho Park and Scott H. Irwin. 2007. What do we know about the profitability of technical analysis? Journal of Economic Surveys, 21(4):786–826.

Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2474–2484.

Varshini Reddy, Rik Koncel-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, and Chris Tanner. 2025. DocFinQA: A long-context financial reasoning dataset. Preprint, arXiv:2401.06915.

Shubhra Kanti Karmaker Santu and Dongji Feng. 2023. TELeR: A general taxonomy of LLM prompts for benchmarking complex tasks. Preprint.

Raj Sanjay Shah, Kunal Chawla, Dheeraj Eidnani, Agam Shah, Wendi Du, Sudheer Chava, Natraj Raman, Charese Smiley, Jiaao Chen, and Diyi Yang. 2022. When FLUE meets FLANG: Benchmarks and large pre-trained language model for financial domain. Preprint, arXiv:2211.00083.

Muhammad Arslan Shaukat, Muntasir Adnan, and Carlos C. N. Kuhn. 2026. A systematic investigation of document chunking strategies and embedding sensitivity. Preprint, arXiv:2603.06976.

Robert C. Shiller. 2000. Irrational exuberance. Philosophy & Public Policy Quarterly, 20(1):18–23.

Robert J. Shiller. 2017. Narrative economics. American Economic Review.

Noah Shinn, G. Labash, D. Gopinath, T. Khot, and A. Sabharwal. 2023. Reflexion: An autonomous agent with dynamic memory and self-refinement. arXiv preprint arXiv:2303.11366.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.

Tristan Thrush, Kushal Tirumala, Anmol Gupta, Max Bartolo, Pedro Rodriguez, Tariq Kane, William Gaviria Rojas, Peter Mattson, Adina Williams, and Douwe Kiela. 2022. Dynatask: A framework for creating dynamic AI benchmark tasks.
In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 174–181.

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564.

Yikai Wu, Xiao Liu, Yuchen Xu, Yichong Zhou, Zihao Liu, Jiahai Zhou, Geng Cui, and Maosong Sun. 2024. Meta-rewarding language models: Self-improving alignment with LLM-as-a-meta-judge. arXiv preprint arXiv:2405.15000.

Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2025. FinGPT: Open-source financial large language models. Preprint, arXiv:2306.06031.

Yi Yang, Yixuan Tang, and Kar Yan Tam. 2023. InvestLM: A large language model for investment using financial domain instruction tuning. arXiv preprint arXiv:2309.13064.

Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, et al. 2024. Justice or prejudice? Quantifying biases in LLM-as-a-judge. arXiv preprint.

Xiang Ye, Bowen Zhang, Yujie Zhu, Jiayi Hu, Xinyi Wang, Pengfei Liu, and Yue Zhang. 2025. Justice or prejudice? Quantifying biases in LLM-as-a-judge. In Proceedings of the 13th International Conference on Learning Representations (ICLR).

Hongyi Yuan, Runxin Sun, Jie Zhang, Zekun Xu, Geng Cui, Zeyu Zhang, Yang Gao, Zhiyuan Liu, and Maosong Sun. 2024. Self-rewarding language models. arXiv preprint.

Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, et al. 2024. A careful examination of large language model performance on grade school arithmetic. Advances in Neural Information Processing Systems, 37:46819–46836.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623.

Yongjie Zhou, Shuai Wang, Bevan Koopman, and Guido Zuccon. 2026. Beyond chunk-then-embed: A comprehensive taxonomy and evaluation of document chunking strategies for information retrieval. Preprint, arXiv:2602.16974.

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. Preprint, arXiv:2105.07624.

FinTradeBench: A Financial Reasoning Benchmark for LLMs
Supplementary Material

Organization. We organize the Supplementary Material as follows. §A lists additional historical examples that motivate the need for joint reasoning over company fundamentals and trading signals. §B lists all financial signals used in the benchmark construction. §C presents the TELeR prompt taxonomy used for generation. §D provides sample seed questions alongside their expert-authored golden answers. §E details the mathematical formulae and definitions for all evaluation metrics. §F outlines the automated LLM-as-a-Judge prompt and the corresponding human annotation rubric. §G presents the alignment statistics between human experts and the automated judge. §H provides qualitative case studies illustrating failures based on signal types and the RAG distraction effect. §I discusses the limitations of the paper. Finally, we discuss the ethical considerations in §J.
A Extended Motivating Examples: Fundamentals–Market Divergence

The Tesla April 2025 case discussed in §1 is illustrative of a broader and well-documented phenomenon in financial markets: sustained divergences between company fundamentals and market price dynamics, driven by investor sentiment, narrative momentum, and forward-looking expectations. We present three additional historical examples that further motivate the need for benchmarks capable of reasoning over both types of signals jointly.

(i) Amazon (1999–2001). During the dot-com era, Amazon sustained extremely high market valuations despite persistent operating losses and negative earnings. Rather than anchoring on contemporaneous accounting metrics, investors priced in long-run platform dominance and e-commerce adoption narratives. This example illustrates how growth narratives can sustain valuations that are entirely disconnected from current fundamental signals (Shiller, 2000). A system reasoning solely from financial statements would have consistently flagged Amazon as financially distressed during this period, while the market priced in the opposite trajectory.

(ii) Nvidia (2016–2017). NVIDIA's stock appreciated substantially in 2016–2017, well before its earnings reports fully reflected the revenue impact of GPU adoption in deep learning workloads. The rally was driven primarily by forward-looking narratives around artificial intelligence and autonomous vehicles, with price movements leading rather than following the fundamental confirmation (Greenwood et al., 2018; Bybee et al., 2023). This case illustrates the temporal asymmetry that motivates hybrid reasoning: trading signals captured the market's forward expectation long before quarterly filings reflected it, meaning neither signal alone would have been sufficient to correctly make the investment decision.
(iii) Tesla (2020): Sentiment-Driven Rally Under Modest Profitability. Tesla's stock rose dramatically in 2020 despite modest profitability at the time, as investors priced in narratives around electric vehicle adoption and autonomous driving at scale (Shiller, 2017; Baker and Wurgler, 2006). This example predates the main example we give in §1 (the April 2025 earnings miss) and hence establishes that the fundamentals–market divergence is a recurring feature of Tesla's price history rather than an isolated anomaly. It also shows how investor sentiment (Baker and Wurgler, 2006) and narrative economics (Shiller, 2017) can sustain multi-year divergences, not merely short-term reactions.

Takeaway. From these examples, a consistent pattern emerges: price dynamics reflect investor expectations and narrative momentum that often precede or contradict the signal available in the company's financial statements. Table 5 summarises the key characteristics of each case.

These cases collectively motivate FinTradeBench's dual-signal design. A benchmark that evaluates reasoning over fundamentals alone would reward models that correctly identify Amazon, Nvidia, and Tesla as weak or fairly valued, missing the market trading signal entirely. A benchmark that evaluates trading signals alone would capture momentum but provide no mechanism for assessing whether that momentum is anchored in improving fundamentals or is purely sentiment-driven. Only by evaluating both types of signals jointly, and by including hybrid questions that require reconciling conflicting signals, can we create a benchmark that reflects the reasoning demands of real-world financial analysis.

Table 5: Summary of fundamentals–market divergence episodes. "Dominant Signal" refers to which signal better characterised the market outcome at the time.

Company   Period      Fundamental Signal                         Dominant Signal
Amazon    1999–2001   Persistent losses, negative EPS            Market (narrative)
Nvidia    2016–2017   Moderate earnings, GPU revenue lag         Market (forward expectation)
Tesla     2020        Modest profitability, high P/E             Market (sentiment)
Tesla     Apr. 2025   EPS miss ($0.27 vs. $0.42), revenue miss   Market (narrative rally)

B Financial Signal Reference

Table 6 summarizes all company fundamentals and trading signals used in question design, including their formulae and economic interpretation. Trading signals are derived from historical OHLCV data; fundamental signals originate from standardised SEC 10-K and 10-Q filings. These variables are among the most widely studied features in the financial domain (Fama and French, 1992; Harvey et al., 2016).

C TELeR Prompt Taxonomy

Table 7 describes the six TELeR-inspired prompt variants used for multi-prompt candidate generation during benchmark construction (§3). Prompts vary systematically in instruction richness and reasoning explicitness, from Level 0 (minimal context) to Level 6 (maximal justification).

Table 8 describes the TELeR taxonomy used during RAG evaluation (§4). Prompts vary systematically in instruction richness and reasoning explicitness, from Level 1 (single-sentence high-level directive) to Level 6 (maximalist, evidence-citing, self-justified). Levels 0–4 are used without RAG; Levels 5–6 include RAG context.

D Sample Questions and Golden Answers

Table 9 presents representative seed questions from each of the three FinTradeBench categories alongside their expert-authored golden answers. Each answer cites the specific golden indicators required for a complete response; these indicators serve as the reference set for Golden Indicator F1 scoring.
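To make the signal definitions catalogued in §B (Table 6) concrete, the following is a minimal, dependency-free Python sketch of three representative trading signals: EMA, RSI, and OBV. The price and volume series are illustrative, and this RSI variant averages gains and losses over the full window rather than a fixed 14-day lookback; it is a sketch of the formulas, not the benchmark's actual feature pipeline.

```python
def ema(prices, alpha=0.2):
    """Exponential moving average: EMA_t = alpha * P_t + (1 - alpha) * EMA_{t-1}."""
    out = [prices[0]]  # seed with the first price
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out


def rsi(prices):
    """Relative Strength Index: 100 - 100 / (1 + avg_gain / avg_loss)."""
    deltas = [prices[i] - prices[i - 1] for i in range(1, len(prices))]
    avg_gain = sum(d for d in deltas if d > 0) / len(deltas)
    avg_loss = sum(-d for d in deltas if d < 0) / len(deltas)
    if avg_loss == 0:
        return 100.0  # no losses in the window: maximally overbought reading
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)


def obv(prices, volumes):
    """On-balance volume: OBV_t = OBV_{t-1} + V_t * sgn(P_t - P_{t-1})."""
    out = [0.0]
    for i in range(1, len(prices)):
        sign = (prices[i] > prices[i - 1]) - (prices[i] < prices[i - 1])
        out.append(out[-1] + volumes[i] * sign)
    return out


# Illustrative series (not benchmark data)
prices = [100.0, 102.0, 101.0, 105.0, 104.0]
volumes = [0.0, 1200.0, 800.0, 1500.0, 900.0]
```

With this toy series, `rsi(prices)` averages gains of (2 + 4)/4 against losses of (1 + 1)/4, giving RS = 3 and RSI = 75, the kind of elevated-but-not-extreme reading discussed in the sample answers of Table 9.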
E Evaluation Metrics and Statistical Testing

This section provides complete definitions and formulae for the metrics used across the benchmark construction pipeline (§3) and the RAG evaluation (§4).

E.1 Benchmark Construction Metrics

Below, we describe the metrics used during the benchmark construction phase (§3).

(i) Numerical Accuracy (Acc_num(M_m)) measures the fraction of numerically accurate statements produced by model M_m across its N candidate responses, computed before and after self-filtering to quantify the filtering benefit:

    Acc_num(M_m) = (1/N) Σ_{i=1}^{N} I[is_accurate(a_i^m) = 1],    (1)

where a_i^m is the i-th candidate response from model M_m, and is_accurate(·) ∈ {0, 1} is the binary indicator returned by the automated numerical auditor (SUPPORTED = 1, otherwise 0).

(ii) Metric Extraction Precision (P(a*_m)), Recall (R(a*_m)), and F1 (F1(a*_m)) quantify the overlap between the set of financial metrics cited in a generated response and the expert-defined reference set. For a self-selected response a*_m, let M_gen(a*_m) denote the set of financial metrics mentioned in the response and M_ref the ground-truth reference set. Precision, recall, and F1 are then given by:

    P(a*_m) = |M_gen ∩ M_ref| / |M_gen|,    (2)
    R(a*_m) = |M_gen ∩ M_ref| / |M_ref|,    (3)
    F1(a*_m) = 2 · P(a*_m) · R(a*_m) / (P(a*_m) + R(a*_m)),    (4)

respectively. The macro-averaged F1 across all K models is:

    F1 = (1/K) Σ_{m=1}^{K} F1(a*_m),    (5)

where K is the total number of evaluated models and a*_m is the self-selected best response from model M_m.

Table 6: Summary of trading signals and company fundamentals used in the question design.
Trading Signals. Notation: P_t = price at time t, V_t = volume, R_t = return, N and k = lookback periods, α = smoothing factor.

MA (Moving Average): (1/N) Σ_{i=1}^{N} P_{t−i}. Average price of a stock over a fixed lookback window; smooths out short-term fluctuations to reveal longer-term trends.

EMA (Exponential Moving Average): αP_t + (1−α) EMA_{t−1}. Weighted moving average emphasizing recent prices; reacts faster to new information (α is the smoothing factor).

MACD: EMA_short − EMA_long. Momentum indicator based on the difference between short- and long-term EMAs; captures trend strength and reversal signals.

RSI (Relative Strength Index): 100 − 100/(1 + RS), where RS = Avg Gain / Avg Loss. Scaled measure of recent gains vs. losses; high RSI suggests overbought conditions, low RSI indicates oversold conditions.

OBV (On-Balance Volume): OBV_{t−1} + V_t · sgn(P_t − P_{t−1}). Cumulative volume measure linking price movement and trading volume; rising OBV indicates accumulation, falling OBV suggests distribution.

One-Day Reversal: (P_t − P_{t−1}) / P_{t−1}. Daily return from previous close to current close; captures immediate short-term reversals or shocks.

Max Return (20-day): max_{1≤i≤20} (P_{t−i} − P_{t−i−1}) / P_{t−i−1}. Maximum single-day return observed over the past 20 trading days; indicates short-term volatility extremes.

Medium-Term Momentum: Π_{i=1}^{k} (1 + R_{t−i}) − 1. Price persistence over several weeks or months; positive values indicate sustained trends.

Long-Term Mean Reversal: −(P_t − P̄_long). Tendency of price to revert toward its historical average, representing equilibrium-seeking behavior in markets.

Company Fundamentals.

Cash Flow / Assets: Operating Cash Flow / Total Assets. Measures how efficiently assets generate cash.
Book / Price (Quarterly): Book Value of Equity / Market Capitalization. Ratio of book value to market price; high values may indicate undervaluation.

Earnings / Price (Quarterly): Earnings Per Share (EPS) / Price Per Share. Inverse of the price-to-earnings ratio; higher values imply cheaper valuations relative to earnings.

Forecast Earnings / Price: Expected Future EPS / Price Per Share. Forward-looking E/P ratio using analyst forecasts; reflects expected profitability.

Sales / Assets (Quarterly): Total Sales / Total Assets. Asset turnover ratio; measures how effectively a company uses assets to generate revenue.

Debt / Assets (Quarterly): Total Debt / Total Assets. Leverage ratio showing the proportion of assets financed by debt.

Debt / Equity (Quarterly): Total Debt / Shareholders' Equity. Higher values indicate greater financial leverage.

Dividend Yield (Quarterly): Dividends Per Share / Price Per Share. Represents cash return to shareholders.

Return on Assets (Quarterly): Net Income / Total Assets. Gauges profitability relative to firm size.

Return on Equity (Quarterly): Net Income / Shareholders' Equity. Measures profitability relative to owners' capital.

Table 7: TELeR-inspired prompt variants used for multi-prompt generation during benchmark creation (§3).

L0 — Data-only, no task framing: Provides only the Trading Signals Context and Fundamental Data Context, with no explicit question or role; baseline for spontaneous reasoning.

L1 — Single-turn, low detail, instruction-style, role-specified: Instructs the model in a simple one-sentence instruction focusing on the high-level goal.
- L2 (Single-turn, moderate detail, instruction-style, role-specified): Paragraph-style instructions expressing the high-level goal and the sub-tasks that need to be performed to achieve it.
- L3 (Step-by-step reasoning, moderate detail, decomposed): Adds a structured reasoning template (bulleted-list style): clarify the goal, decompose it into sub-questions, answer the sub-parts using both contexts, and synthesize a final answer.
- L4 (Step-by-step reasoning, moderate detail, decomposed): "Level 3 Prompt" + a guideline on how the LLM's output will be evaluated.
- L5 (Step-by-step reasoning, high detail, decomposed): "Level 4 Prompt" + additional relevant evidence gathered via RAG.
- L6 (Step-by-step reasoning, maximalist, high detail, evidence-citing, justified): "Level 5 Prompt" + an explicit statement asking the LLM to explain its own output.

Lvl / Type / Strategy / RAG:
- L1, Baseline: Single-sentence high-level directive. RAG: No.
- L2, Focus: Role specification + paragraph-style breakdown. RAG: No.
- L3, CoT: Bulleted step-by-step chain-of-thought. RAG: No.
- L4, Auditor: Adds explicit evaluation criteria (e.g., coherence). RAG: No.
- L5, Context: Injects retrieved evidence and citations. RAG: Yes.
- L6, Explain: Adds self-justification for data usage. RAG: Yes.

Table 8: The TELeR prompt taxonomy used for benchmarking and evaluation. The framework systematically increases prompt complexity, with Levels 5 and 6 specifically designed to leverage retrieved context.

(iii) LLM-Judge Agreement (MAE) measures the calibration between human expert scores and automated LLM-judge scores. The primary alignment metric is mean absolute error (MAE(M_m)), which is invariant to score variance and directly measures practical disagreement magnitude (Hossain et al.
, 2025):

MAE(M_m) = (1/n_m) Σ_{i=1}^{n_m} |S^m_{h,i} − S^m_{J,i}|,  (6)

and its macro average across all models:

MAE = (1/K) Σ_{m=1}^{K} MAE(M_m),  (7)

where n_m is the number of annotated responses for model M_m, and S^m_{h,i} ∈ [1, 5] is the human Likert score

Sample Golden Answers by Category

1. Company Fundamentals
Q: Is Nvidia's profitability sustainable with its valuation in September 2025?
A: Nvidia's profitability is highly sustainable and supported by strong operational efficiency. Its exceptional financial health is characterized by an ROE of 33.38% and an ROA of 21.31%, achieved with low leverage (Debt/Equity of 0.102). However, the valuation is extremely premium, evidenced by a Book/Price ratio of 0.023.

2. Trading Signals
Q: Based on trading in H1 2025, which stocks show the strongest EMA support?
A: INTC, MU, and LRCX show the strongest support, trading approx. 18.5%, 16.8%, and 14.4% above their 20-day EMAs. However, LRCX provides the most sustainable profile, as its technical trend is backed by strong fundamentals (18.7% ROE), whereas INTC's support coincides with negative profitability metrics.

3. Hybrid Reasoning
Q: Is Microsoft overvalued based on Q3 2025 prices despite strong performance?
A: Microsoft is trading at a premium valuation, but it is not necessarily overvalued. While the Book/Price ratio of 0.0929 indicates the stock is expensive relative to its book value, this premium is supported by high operational efficiency (Cash Flow/Assets of 0.0689) and strong profitability (ROE of 0.0890). Technical indicators show strong momentum (RSI 65.42, MACD > Signal) without yet reaching extreme "overbought" conditions (RSI > 70).

Table 9: Sample seed questions and corresponding golden answers; key financial terms are detailed in § E, Table 6.
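Equations (6)–(7) reduce to a per-model mean of absolute score gaps followed by an unweighted macro average. A minimal sketch (the toy scores are illustrative, not the paper's data):

```python
def model_mae(human_scores, judge_scores):
    """Eq. (6): mean absolute error between paired human and judge Likert scores."""
    assert len(human_scores) == len(judge_scores)
    return sum(abs(h - j) for h, j in zip(human_scores, judge_scores)) / len(human_scores)

def macro_mae(per_model_scores):
    """Eq. (7): unweighted average of per-model MAEs across all K models."""
    maes = [model_mae(h, j) for h, j in per_model_scores.values()]
    return sum(maes) / len(maes)

# Illustrative toy scores (human, judge) per model.
scores = {
    "model_a": ([5, 4, 3], [5, 3, 3]),  # MAE = 1/3
    "model_b": ([4, 4], [5, 3]),        # MAE = 1
}
print(macro_mae(scores))  # (1/3 + 1) / 2 ≈ 0.667
```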
Benchmark Construction

- Numerical Accuracy: Fraction of numerically accurate statements per model before and after self-filtering; measures factual grounding (Eq. 1).
- Metric Extraction F1: Precision, recall, and F1 over the overlap between generated and reference financial metrics; quantifies topical completeness (Eq. 4).
- LLM–Judge Agreement (MAE): Mean absolute error between human and LLM-judge scores per model (primary metric); Spearman ρ reported where score variance permits (Eq. 6).
- Self-Critique Effectiveness: Fraction of models whose self-selected response matches the numerically best candidate; evaluates internal self-consistency (Eq. 8).
- Prompt Sensitivity: Intra-model F1 variance across TELeR prompt levels; lower variance implies robustness to prompt formulation (Eq. 9).
- Overall Composite Score: Weighted aggregate of F1, numerical accuracy, inverse MAE, and prompt variance for cross-model ranking (Eq. 10).

RAG Evaluation

- Absolute Accuracy (%): Judge's 1–5 correctness score normalised to a percentage; reported overall and by question type (Eq. 11).
- Retrieval Delta (∆): Relative accuracy shift of RAG over No-RAG; positive values indicate grounding benefit, negative values indicate distraction (Eq. 12).
- Statistical Significance: Paired t-test on question-level score differences; * p < 0.05, ** p < 0.01 (Eq. 13).
- Golden Indicator F1: Precision and recall of expert-defined financial signals in model responses; a drop under RAG signals the distraction effect (Eq. 14).
- Context Integration (1–5): Separate judge scores for fundamental (10-K/10-Q) and trading-signal integration; isolates signal-specific failures (Eq. 15).
- Reasoning Depth (1–5): Judge score for logical-chain quality independent of factual correctness; a decline under RAG signals information overload (Eq. 16).

Table 10: Summary of evaluation metrics used across the benchmark construction and RAG evaluation pipelines.
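The RAG-evaluation rows above combine a normalised judge score, a relative delta, and a paired t-test (Eqs. 11–13). A minimal sketch with illustrative numbers:

```python
import math
import statistics

def absolute_accuracy(judge_score):
    """Eq. (11): normalise a 1-5 Likert judge score to a percentage."""
    return judge_score / 5 * 100

def retrieval_delta(acc_rag, acc_no_rag):
    """Eq. (12): relative accuracy shift of RAG over the No-RAG baseline, in %."""
    return (acc_rag - acc_no_rag) / acc_no_rag * 100

def paired_t(diffs):
    """Eq. (13): t = mean(d) / (stdev(d) / sqrt(N)) over per-question differences."""
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Illustrative: judge scores of 4 (RAG) vs. 3 (No-RAG) for one model.
a_rag, a_base = absolute_accuracy(4), absolute_accuracy(3)
print(retrieval_delta(a_rag, a_base))  # ~ +33.3% -> grounding benefit
```

In practice the t statistic would be compared against a t distribution with N − 1 degrees of freedom (e.g., via `scipy.stats.ttest_rel`) to obtain the reported p-values.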
for response i, and S^m_{J,i} ∈ [1, 5] is the corresponding LLM-judge score. All scores are on the raw 1–5 Likert scale; an MAE of 0.27 corresponds to a 5.4% relative deviation.

(iv) Self-Critique Effectiveness (SCR) measures how often each model's self-selection identifies its most numerically correct response, assessing internal self-consistency (Lee et al., 2024; Yuan et al., 2024; Wu et al., 2024):

SCR = |{ M_m : a*_m = argmax_{a^m_i} Acc_num(a^m_i) }| / K,  (8)

where a*_m is the model's self-selected best response, Acc_num(a^m_i) is the numerical accuracy of candidate i from model M_m, and K is the total number of models. A higher SCR indicates that self-filtering reliably surfaces the most factually accurate response.

(v) Prompt Sensitivity and Robustness (Var_prompt(M_m)) measures the variance of intra-model F1 scores across TELeR prompt levels, capturing how sensitive a model's output quality is to prompt structure (Chow et al., 2025):

Var_prompt(M_m) = (1/N) Σ_{i=1}^{N} (F1(a^m_i) − F̄1_m)²,  (9)

where F1(a^m_i) is the indicator F1 score of the i-th candidate response from model M_m, and F̄1_m = (1/N) Σ_{i=1}^{N} F1(a^m_i) is the mean F1 for that model across all N prompt variants. A lower variance indicates greater robustness to prompt formulation.

(vi) Overall Composite Score (S_ov(M_m)) aggregates the four primary signals into a single cross-model ranking metric that balances factuality, alignment reliability, and prompt robustness:

S_ov(M_m) = w_1 F̄1_m + w_2 A_num(M_m) + w_3 MAE^{-1}_m + w_4 (1 − V_p(M_m)),  (10)

where F̄1_m is model M_m's mean indicator F1 score, A_num(M_m) is its numerical accuracy, MAE^{-1}_m is the inverse MAE (so that lower judge disagreement yields a higher composite score), V_p(M_m) is its prompt sensitivity, and the weights {w_1, w_2, w_3, w_4} balance the four components.
All terms are normalised to [0, 1] before aggregation.

E.2 RAG Evaluation Metrics

(i) Absolute Accuracy (A(M_m)) measures the factual correctness and overall alignment of the model's response with the expert-provided gold answer. The LLM judge assigns a correctness score on a 1–5 Likert scale, which is normalised to a percentage for comparability across models and question categories:

A(M_m) = (S^m_J / 5) × 100%,  (11)

where S^m_J ∈ {1, 2, 3, 4, 5} is the LLM-judge correctness score for model M_m. We report accuracy globally (Overall) and stratified by question type (F, T, FT); see § 2.

(ii) Retrieval Delta (∆(M_m)) measures the relative performance shift induced by retrieval augmentation compared to a zero-shot baseline. A positive ∆ indicates successful grounding; a negative ∆ indicates context distraction or information overload:

∆(M_m) = (A_RAG(M_m) − A_No-RAG(M_m)) / A_No-RAG(M_m) × 100%,  (12)

where A_RAG(M_m) and A_No-RAG(M_m) are the normalised accuracy scores of model M_m under RAG and No-RAG conditions, respectively.

(iii) Statistical Significance (Paired t-test (t)) assesses whether RAG-induced accuracy changes are statistically reliable. We apply a paired-samples t-test on question-level correctness scores:

t = d̄ / (s_d / √N),  d̄ = (1/N) Σ_{i=1}^{N} d_i,  d_i = x_{i,RAG} − x_{i,No-RAG},  (13)

where d_i is the per-question score difference, d̄ is the mean difference, s_d is the standard deviation of the differences, and N is the number of questions. We report p < 0.05 (denoted *) and p < 0.01 (denoted **); see § 2. A significant result at the ** level indicates, with 99% confidence, that RAG systematically alters model reasoning rather than producing localised fluctuations on a few queries.

(iv) Golden Indicator F1 (F1_GI(a^m)) measures the precision and recall of expert-defined financial signals in model-generated responses. Using the same formulation as Eq.
4, let M_gen(a^m) denote the financial metrics cited in the model response under evaluation and M_ref the expert-defined golden indicator set:

P_GI(a^m) = |M_gen ∩ M_ref| / |M_gen|,  R_GI(a^m) = |M_gen ∩ M_ref| / |M_ref|,  F1_GI(a^m) = 2 · P_GI · R_GI / (P_GI + R_GI),  (14)

where a^m is the model response under evaluation (best RAG or best No-RAG). A drop in F1_GI under RAG signals the distraction effect: the model absorbs retrieved content without isolating the specific indicators an expert would prioritize.

(v) Context Integration Scores (FI(M_m) and TI(M_m)) are assigned by the LLM judge, which separately scores two signal-specific integration dimensions, each on a 1–5 scale:

FI(M_m) ∈ [1, 5],  TI(M_m) ∈ [1, 5],  (15)

where FI reflects the use of 10-K/10-Q fundamental content and TI reflects the processing of numerical time-series data. These scores isolate signal-specific failures: a model may score highly on FI while failing on TI, directly evidencing the gap between the two signal types. We report macro-averages across all models and separately by question type.

(vi) Reasoning Depth (RD(M_m)) evaluates the quality of the model's logical reasoning chain independently of factual correctness, specifically its ability to chain intermediate analytical steps (e.g., observing momentum → checking leverage → concluding the stock's rally is over-leveraged). It is scored by the LLM judge on a 1–5 scale:

RD(M_m) ∈ [1, 5].  (16)

A decline in Reasoning Depth under RAG alongside stable or rising accuracy indicates that the model is summarising retrieved text rather than reasoning analytically over it, a clear indication of the information-overload effect reported in § 5. A complete summary of all metrics is provided in Table 10.
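Golden Indicator F1 (Eq. 14) is a set-overlap computation over metric names. A minimal sketch (the indicator names in the example are illustrative):

```python
def golden_indicator_f1(generated, reference):
    """Eq. (14): precision, recall, and F1 of expert-defined indicators in a response."""
    gen, ref = set(generated), set(reference)
    hits = gen & ref
    if not gen or not ref or not hits:
        return 0.0, 0.0, 0.0
    p = len(hits) / len(gen)
    r = len(hits) / len(ref)
    return p, r, 2 * p * r / (p + r)

# Illustrative: response cites RSI and ROE; experts expected ROE, EMA, Debt/Equity.
p, r, f1 = golden_indicator_f1({"RSI", "ROE"}, {"ROE", "EMA", "Debt/Equity"})
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.5 0.33 0.4
```

The same routine serves both the benchmark-construction Metric Extraction F1 (Eq. 4) and the RAG-evaluation variant, differing only in which response set is scored.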
F Evaluation Prompts and Human Annotation Rubric

This section presents the complete LLM-as-a-Judge prompt and the human annotation rubric used during the calibration phase (§ 3). Both instruments share the same four scoring dimensions to enable direct human–LLM alignment measurement. The LLM judge additionally performs metric extraction (computing M_gen, precision, recall, and F1 against M_ref), which is not required of human annotators. Figure 4 presents both instruments side by side.

Group / Bias / MAE

Overall & Dimensions:
- Overall composite: −0.021 / 0.404
- Accuracy: −0.059 / 0.163
- Completeness: +0.401 / 0.545
- Relevance: +0.314 / 0.399
- Clarity: +0.317 / 0.698

By Generator Model:
- Gemini 3 Pro: +0.159 / 0.277
- Grok-4.1: +0.345 / 0.495
- Qwen3-235B: +0.222 / 0.437

By Question Type:
- Fundamental (F): −0.083 / 0.417
- Hybrid (FT): −0.060 / 0.348
- Trading (T): +0.087 / 0.452

Bias = LLM score − Human score.

Table 11: Human–LLM judge alignment results. Scores are on the raw 1–5 Likert scale.

G Human–LLM Judge Alignment

To validate the LLM-as-a-Judge pipeline described in § 3, we measure alignment between automated judge scores and human expert annotations on a set of annotated responses to our seed questions spanning all three question types (F, T, FT) and three generator models (Gemini 3 Pro, Grok-4.1, Qwen3-235B). For each response, both a human expert and the LLM judge independently assigned scores on the same four 1–5 Likert dimensions: factual and numerical accuracy, completeness, relevance, and clarity. Table 11 reports the resulting alignment statistics.

(i) Overall Alignment. We find an overall MAE of 0.40 across all 404 responses, corresponding to an 8.0% relative deviation on the 1–5 scale. At the dimension level, relevance achieves the tightest alignment (MAE = 0.40, 8.0%), followed by completeness (MAE = 0.54, 10.9%) and clarity (MAE = 0.70, 14.0%). The dimension-level accuracy MAE is notably low (MAE = 0.
16, 3.3%), which we attribute to both raters awarding near-perfect accuracy scores for filtered responses generated by strong proprietary models. All MAE values fall below 1.0 on a 5-point scale, and the overall bias of −0.02 is close to zero, indicating that the LLM

Ablation: AAPL Valuation and Momentum

Question: As of August 2025, is Apple a good buy given its valuation and price trend?

Ablation 1: Precomputed Fundamental Context ONLY
"Based on the provided fundamental data for AAPL, I cannot determine if it is a good buy as of August 2025. The fundamental data provided does not include information about its price trend or any forward-looking projections that would be necessary to assess its valuation in August 2025."

Ablation 2: Precomputed Trading Signal Context ONLY
"Based on the provided volatility and price data for Apple (AAPL) in August 2025, the stock experienced a significant positive price change of 14.84%... The RSI of 60.3940 indicates bullish momentum. However, I cannot assess Apple's valuation. Valuation requires fundamental data such as earnings, book value, or cash flow, which are not included in the provided market context."

Table 12: Signal-based ablation study for § 5.1.

Extended Case 1: Hybrid Reasoning (TSLA)

Question: As of mid-2025, is Tesla showing signs of fundamental deterioration beneath its technical strength?
Golden Indicators: Return on Equity, Cash Flow/Assets, EMA, One-Day Reversal

1. Zero-Shot (No-RAG) Output
"As a financial analyst, I will assess Tesla's financial health as of mid-2025 by examining key fundamental indicators and comparing them against its observed technical strength... [Model hallucinates a generic methodology without answering the prompt]"

2. Standard RAG Output: Textual Distraction
"Based on the provided market data and risk factor disclosures, it's challenging to definitively assess fundamental deterioration beneath its technical strength as of mid-2025...
[Model gets distracted by SEC risk-factor boilerplate text and fails to extract any quantitative metrics to prove or disprove the premise]"

3. Ideal RAG (Precomputed Context)
"...Tesla is not showing clear signs of fundamental deterioration; in fact, some metrics suggest a stable position.
Fundamental Analysis: For every dollar of assets, Tesla generates about 2 cents in cash flow (0.0198). Tesla has a very low level of Debt/Assets (0.0562), a strong indicator of financial health. Return on Equity is 1.64%, which is modest but expected for a high-investment growth company.
Technical Analysis: The stock experienced a price increase of 18.45%. The current price is above the 20-day EMA (415.60), suggesting an uptrend. RSI (56.48) suggests balanced market sentiment."

Table 13: Extended case study for a hybrid-focused question.

judge neither systematically inflates nor deflates scores relative to the human expert.

(ii) Variation across models. Breaking down by generator model, all three models fall within a narrow MAE band (0.28–0.49). This indicates that judge calibration is broadly consistent across generators, with the LLM judge reliably matching human evaluation regardless of which model produced the response.

(iii) Variation across question types. Across question types, FT (Hybrid) questions yield the lowest MAE (0.35, 7.0%), while T (Trading) questions yield the highest (0.45, 9.0%). The higher MAE for Trading questions is consistent with their numerical complexity: both human and automated raters show greater disagreement when evaluating responses that require precise computation over time-series data. Nonetheless, all three question types remain within an acceptable alignment range, and no systematic bias toward any single category is observed (bias values: −0.060 for FT, −0.083 for F, +0.087 for T).
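The per-group statistics reported in Table 11 follow directly from paired (human, judge) scores grouped by dimension, generator, or question type. A minimal sketch of the bias and MAE computation (toy scores, not the paper's annotations):

```python
def bias_and_mae(pairs):
    """Bias = mean(judge - human); MAE = mean |judge - human|, raw 1-5 Likert scale."""
    diffs = [j - h for h, j in pairs]
    bias = sum(diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return bias, mae

# Toy example: the judge slightly over-scores -> small positive bias.
pairs = [(4, 5), (3, 3), (5, 4), (4, 5)]  # (human, judge)
bias, mae = bias_and_mae(pairs)
print(bias, mae)  # 0.25 0.75
```

Note that bias and MAE can diverge: the third pair's under-score partially cancels in the bias but still contributes in full to the MAE, which is why Table 11 reports both.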
H Qualitative Case Studies

This section expands on the qualitative analysis in § 5.1 with two additional components: unimodal ablations confirming that hybrid reasoning requires both data tracks, and extended cross-sector examples demonstrating that the distraction effect is systemic rather than instance-specific.

H.1 Unimodal Ablations: What Happens When One Signal Is Withheld?

We test Gemini 2.5 Flash-Lite on the AAPL Hybrid (FT) query from § 5.1 by providing each signal's precomputed context in isolation. As shown in Table 12, the model exhibits strong prompt adherence: when starved of one signal type, it explicitly declines to answer the missing component rather than hallucinating, confirming that hybrid financial reasoning strictly requires a multimodal retrieval pipeline.

H.2 Extended Case Studies: Systemic Distraction Across Sectors

The two examples in Tables 13 and 14 sample from different market sectors and question types to confirm that the RAG distraction effect observed for AAPL is not an isolated instance. In each case we compare Zero-Shot, Standard RAG, and Ideal RAG outputs from Gemini 2.5 Flash-Lite.

Extended Case 2: Tabular Trading Signal Failure

Question: Based on trading in the first half of 2025, which stocks are showing the strongest exponential moving average support?
Golden Indicators: EMA (Exponential Moving Average)

1. Zero-Shot (No-RAG) Output
"As a financial analyst, I can provide a detailed response to your question about identifying stocks showing strong Exponential Moving Average (EMA) support... [Model provides a textbook definition of EMA without naming a single stock]"

2. Standard RAG Output: Tabular Blindness
"I cannot answer your question. The provided context does not contain information about exponential moving averages or any technical analysis indicators. The data provided includes market data (close prices), but...
[Standard semantic search completely failed to map the tabular numerical array to the technical concept of EMA.]"

3. Ideal RAG (Precomputed Context)
"To determine the strongest EMA support, we look for stocks where the current price is significantly above the 20-day EMA. Based on the analysis, the stocks showing the strongest support are:
1. INTC: Last Close = 36.83, EMA_20 = 31.06. Difference = 18.56%
2. MU: Last Close = 187.83, EMA_20 = 160.82. Difference = 16.80%
3. LRCX: Last Close = 145.81, EMA_20 = 127.48. Difference = 14.37%
These stocks have their current prices trading at a significant premium to their 20-day EMA, suggesting it is acting as a strong support level."

Table 14: Extended case study for a trading-focused question.

H.3 Global Quality Metric Figures

Figure 5 presents the grouped bar chart (No-RAG vs. RAG scores per metric) and the relative ∆ horizontal chart discussed in Table 3.

I Limitations

While FinTradeBench provides a new benchmark for evaluating financial reasoning across company fundamentals and trading signals, several limitations should be noted.

(i) Market Coverage. The benchmark focuses on companies in the NASDAQ-100 index over a ten-year period (2015–2025). These firms predominantly represent the technology sector; the benchmark may not fully capture financial reasoning challenges present in other sectors, emerging markets, or other asset classes such as commodities or fixed-income instruments.

(ii) Signal Coverage. FinTradeBench includes a curated set of widely used financial indicators derived from SEC filings and historical price data. However, financial analysis in practice may involve additional signals such as macroeconomic variables, analyst forecasts, alternative data sources, or high-frequency market features. Future benchmarks could extend the signal set to incorporate these additional sources of information.

(iii) Temporal Generalization.
The benchmark is based on historical price data from 2015 to 2025. Questions are designed to be answerable from publicly available filings and price data, but the benchmark does not cover forward-looking predictions, real-time market events, or macroeconomic shocks that post-date the evaluation window. Models evaluated on future data releases may exhibit different performance gaps as pre-training corpora evolve.

(iv) Evaluation with LLM Judges. Our evaluation pipeline relies on an LLM-as-a-Judge calibrated against expert annotations on a seed set of 150 questions. Despite strong measured human–LLM alignment, the judge may not fully replicate the nuanced judgments of professional financial analysts, particularly for subjective or context-dependent reasoning steps (Zheng et al., 2023; Ye et al., 2024). All judge scores should therefore be interpreted as approximations of expert assessment rather than ground truth.

(v) Ideal RAG Replicability. The ideal RAG architecture in § 5.1 represents a manually curated upper bound rather than a realistic RAG system. The observed performance ceiling under an ideal context should not be interpreted as achievable by current automated pipelines without further engineering.

(vi) Benchmark Scope. FinTradeBench focuses on question-answering tasks that require reasoning over structured financial indicators and historical market data. The benchmark does not evaluate other important financial tasks such as portfolio optimization, trading strategy generation, or risk management decisions. Therefore, performance on FinTradeBench should be interpreted as measuring financial reasoning capabilities rather than overall financial decision-making ability.

J Ethical Considerations

(i) Financial decision-making risk. FinTradeBench evaluates language model reasoning over real financial data for named public companies.
Scores on this benchmark should not be interpreted as endorsements of any model for live trading, investment advisory, or automated financial decision-making. Even the highest-performing models in our evaluation exhibit substantial error rates, and financial decisions based on LLM outputs carry real economic risk to end users.

(ii) Benchmark misuse. While FinTradeBench is designed for research evaluation, we acknowledge that fine-tuning models specifically to maximise FinTradeBench scores without genuine improvement in financial reasoning could inflate reported performance. We encourage the community to treat benchmark results as one signal among many and to complement automated evaluation with human expert review before drawing strong conclusions about financial reasoning capability.

(iii) Annotator and expert involvement. The human financial experts involved in seed question authoring were part of the research team. Evaluation was conducted double-blind to minimise rater bias.

(iv) Societal impact. Improvements in LLM financial reasoning could benefit retail investors and analysts by democratising access to structured financial analysis. However, the same capabilities could be exploited to automate misleading financial narratives or market manipulation at scale. We call for responsible disclosure norms and human-in-the-loop oversight in any deployment of LLM-based financial analysis tools.

(A) LLM-as-a-Judge Prompt

System: You are an expert financial analyst and a meticulous fact-checker.
Input Fields: [Question], [Reference Metrics] (M_ref), [Automated Audit Report], [Generated Answer]

Evaluation Rubric (1–5 scale):

(1) Factual & Numerical Accuracy – Relies heavily on the Numerical Audit Report.
• 5: All numerical claims are audit-supported.
• 3: Minor errors that do not change the overall thesis.
• 1: Severe hallucinations or math errors that invalidate the conclusion.
(2) Completeness & Context [Critical Human Alignment Rule] – Does not penalise omission of reference metrics if the response fully answers the question with a highly relevant subset.
• 5: Fully addresses the prompt with strong explanatory power.
• 3: Addresses main points but leaves minor sub-questions unanswered.
• 1: Fails to address the core question or omits critical context.

(3) Relevance & Utility – Usefulness to an investor or financial decision-maker.
• 5: Highly actionable, directly answers the prompt without digressing.
• 3: Generally relevant but includes some tangential information.
• 1: Misses the point or provides information of no practical value.

(4) Clarity & Rationale [Critical Human Alignment Rule] – Rewards structured, step-by-step breakdowns; penalises verbosity and repetitive formatting.
• 5: Crisp, highly readable, actionable, gets straight to the point.
• 3: Understandable but overly wordy or clunky in formatting.
• 1: Confusing, disjointed, or buried in jargon.

Few-Shot Anchor Examples:
• Completeness Anchor: If a response perfectly answers the question using 2 metrics with strong reasoning, do not dock Completeness for omitting a 3rd or 4th reference metric; score it a 5.
• Clarity Anchor: If a response is accurate but opens with a long definition of basic concepts, or uses highly repetitive step headers that waste space, cap Clarity at 3.

Output: JSON object containing qualitative scores (four scored dimensions with justifications) and metric analysis (M_gen, M_ref ∩ M_gen, precision, recall, F1).

(B) Human Annotation Rubric

Task: Score AI-generated answers as a financial expert. For each response, provide scores on the five criteria below, and optionally supply a golden answer or comments.

Annotation Criteria:

(1) Audit Validation Agreement (0/1) – Does your independent review agree with the automated numerical audit's "is numerically accurate" flag?
• 1: Agree – the audit conclusion is correct.
• 0: Disagree – the audit missed an error, or incorrectly flagged a correct claim.

(2) Factual & Numerical Accuracy (1–5) – Based on your own review (and the audit), what is the final accuracy score?
• 5: 100% correct.
• 1: Contains significant, misleading numerical errors.

(3) Completeness & Context (1–5) – Does the answer fully address the question and correctly use and contextualise the golden indicators?
• 5: Excellent; uses required metrics in a deep, integrated analysis.
• 1: Superficial; misses most required metrics or necessary context.

(4) Relevance & Utility (1–5) – Is every piece of information relevant? Does the response avoid fluff or potentially misleading tangents?
• 5: High precision; no fluff, no harmful information.
• 1: Cluttered with irrelevant or misleading content.

(5) Clarity & Rationale (1–5) – Is the answer clear and well-structured, and does it explain its reasoning?
• 5: Exceptionally clear and well-reasoned.
• 1: Confusing, poorly written, or reasoning is opaque.

Output columns to complete: H_Audit_Agreement (0/1), H_Accuracy (1–5), H_Completeness (1–5), H_Relevance (1–5), H_Clarity (1–5), H_Golden_Answer (optional), H_Notes (optional).

Figure 4: Evaluation instruments used for human–LLM calibration. (A) LLM-as-a-Judge prompt, which additionally extracts M_gen and computes Golden Indicator F1 against M_ref. (B) Human annotation rubric administered as a CSV task. Both instruments share the same four scored dimensions ((1)–(4) in A, (2)–(5) in B), enabling direct Spearman ρ and MAE alignment measurement.

[Figure 5 chart; values recovered from the plot.] No-RAG vs. RAG: GI Precision 0.44 → 0.20 (−55.6%), GI Recall 0.22 → 0.10 (−55.8%), GI F1 0.27 → 0.12 (−56.5%), Fundamental Integration (1–5) 1.60 → 1.81 (+13.4%), Volatility Integration (1–5) 1.54 → 1.47 (−4.6%), Reasoning Depth (1–5) 2.74 → 2.44 (−10.8%).

Figure 5: Global quality metrics, RAG vs. No-RAG, and relative RAG improvement per metric.
