Synthetic Data Innovation for Financial Fact Verification

Reading time: 5 minutes

📝 Abstract

Financial applications of large language models (LLMs) require factual reliability and computational efficiency, yet current systems often hallucinate details and depend on prohibitively large models. We propose FISCAL (Financial Synthetic Claim-Document Augmented Learning), a modular framework for generating synthetic data tailored to financial fact-checking. Using FISCAL, we generate a dataset called FISCAL-data and use it to train MiniCheck-FISCAL, a lightweight verifier for numerical financial claims. MiniCheck-FISCAL outperforms its baseline, surpasses GPT-3.5 Turbo and other open-source peers of similar size, and approaches the accuracy of much larger systems (roughly 20x its size), such as Mixtral-8x22B and Command R+. On the external datasets FinDVer and Fin-Fact, it rivals GPT-4o and Claude-3.5 while outperforming Gemini-1.5 Flash. These results show that domain-specific synthetic data, combined with efficient fine-tuning, enables compact models to achieve state-of-the-art accuracy, robustness, and scalability for practical financial AI. The dataset and scripts are available in the project repository (link provided in the paper).

📄 Content

Large Language Models (LLMs) have transformed the finance domain by enabling applications such as report generation [11,6], question answering [11,4], and investment analysis [13]. However, their adoption in production systems remains limited by two critical barriers. First, financial tasks demand factual reliability: even small hallucinations in numbers, entities, or dates can lead to costly errors in high-stakes decision-making [14,2,15,7]. Second, state-of-the-art models are computationally inefficient: proprietary systems such as GPT-4 or Gemini are accurate but expensive or slow [1,8]. For financial institutions, these constraints make large-scale deployment impractical.

We address these challenges by showing that compact, open models can be both reliable and efficient when equipped with the right training data. Specifically, we introduce FISCAL, a modular data generator that produces diverse and challenging financial claim-document-label triplets, and MINICHECK-FISCAL, a 7B parameter fact-checking model fine-tuned on this data. Unlike prior work that relies on increasingly large LLMs or complex multi-step reasoning pipelines, our approach trains a lightweight verifier that makes fast, single-token predictions with interpretable confidence scores.
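The single-token prediction described above can be illustrated with a minimal sketch: the verifier reduces its decision to comparing the logits of two label tokens. The token names and the two-way softmax below are assumptions for illustration, not the paper's exact implementation:

```python
import math

def verify_claim(label_logits: dict) -> tuple:
    """Map a verifier's next-token logits to a label and a confidence score.

    The model is prompted with a (claim, document) pair and emits a single
    token; we compare the logits of the two label tokens. The token names
    "supported"/"unsupported" are hypothetical.
    """
    z_sup = label_logits["supported"]
    z_unsup = label_logits["unsupported"]
    # A two-way softmax over the label tokens yields an interpretable score.
    p_sup = math.exp(z_sup) / (math.exp(z_sup) + math.exp(z_unsup))
    label = 1 if p_sup >= 0.5 else 0
    confidence = p_sup if label == 1 else 1.0 - p_sup
    return label, confidence

label, confidence = verify_claim({"supported": 2.1, "unsupported": -0.4})
```

Because the prediction is a single forward pass with no multi-step reasoning chain, the softmax probability doubles as the interpretable confidence score mentioned above.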

Across three benchmarks (our FISCAL dataset, FinDVer, and Fin-Fact), MiniCheck-FISCAL consistently outperforms its baseline (MiniCheck-7B) and approaches the accuracy of models more than twenty times its size, including Mixtral-8x22B, Qwen2-72B, Gemini-1.5 Flash, and GPT-4o. These results demonstrate that domain-specific synthetic data and a lightweight model can yield compact fact-checkers that rival much larger systems. By reducing inference cost while maintaining high factual accuracy, MiniCheck-FISCAL offers a practical path toward deployable financial AI that is trustworthy, cost-effective, and scalable.

Our approach consists of two main components: (1) a Modular Claim-Document Generator (FISCAL), which synthesizes claim-document-label triplets (FISCAL-Data) from financial texts, and (2) MiniCheck-FISCAL, a fact-checking model fine-tuned on this synthetic data. Together, these components enable scalable training of domain-adapted fact-checkers while preserving efficiency at inference time.

The generator takes as input a collection of financial documents D = {d_1, ..., d_M} and produces a set of labeled triplets T = {(c_i, d_i, y_i)}_{i=1}^{N}, where c_i denotes a claim, d_i the document, and y_i ∈ {1, 0} indicates whether the document supports the claim. Constructing T involves three stages.
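The triplet structure can be represented with a minimal, hypothetical schema (field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    claim: str      # c_i: an atomic numerical claim
    document: str   # d_i: the financial source text
    label: int      # y_i: 1 if the document supports the claim, else 0

# Toy example of a supported (label = 1) triplet.
t = Triplet(
    claim="Q3 revenue grew 12% year over year.",
    document="... revenue increased 12% compared to Q3 of the prior year ...",
    label=1,
)
```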

Dataset selection. We begin with FinanceBench [5], which provides QA pairs with ground-truth evidence. Evidence contexts are extracted to form our base set of documents, ensuring realism and alignment with factual financial data.

Claim extraction. Because numerical reasoning is a primary source of hallucination in financial LLMs, we focus on extracting numerical atomic claims. We employ Qwen3 32B [12] in a prompt-based pipeline to identify quantitative statements. Each extracted claim is verified for atomicity using an LLM-as-judge, as explained in Section 2.2, ensuring that it corresponds to a single, checkable fact.

Data synthesis. To balance the dataset, we generate both positive and negative examples for each claim-document pair. Rather than relying on a single transformation, we employ a modular synthesizer comprising multiple strategies.

• Claim Paraphraser Module. Generates professionally styled paraphrases of financial claims, preserving all factual details while varying syntax and wording to increase linguistic diversity.

• Conflict Insertion Module. Introduces a subtle, contextually consistent contradiction to a financial document by inserting a realistic detail that conflicts with the claim, while leaving all original content otherwise unchanged.

• Fact Exclusion Module. Removes all evidence of the claim from a financial document, deleting or rewriting only claim-related content while preserving coherence and leaving unrelated text intact.

• Fact Value Distortion Module. Subtly alters claim-related details in a financial document (e.g., values, dates, entities) with small, plausible distortions, while keeping all unrelated content intact and professionally coherent.

• Mis-attribution Module. Edits supporting evidence in a financial document by subtly reassigning attribution (e.g., year, entity, or category), leaving the claim unsupported while preserving coherence and all unrelated content.

• Summarization Module. Produces a concise summary of a financial document that retains only claim-relevant details, ensuring the claim remains explicitly verifiable while omitting unrelated content.
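The modular design can be sketched as a registry of transformation functions, each mapping a (claim, document) pair to a new pair plus a label. The two toy modules below (a paraphraser and a value distorter) are simplified stand-ins for the LLM-backed modules described above:

```python
import random
import re

# Hypothetical stand-ins for two of the six modules; in the paper these
# transformations are performed by an LLM, not by string manipulation.
def paraphrase(claim, doc):
    """Claim Paraphraser: reword the claim, preserving facts (label 1)."""
    reworded = f"According to the filing, {claim[0].lower()}{claim[1:]}"
    return reworded, doc, 1

def distort_value(claim, doc):
    """Fact Value Distortion: bump the first number in the claim (label 0)."""
    distorted = re.sub(r"\d+", lambda m: str(int(m.group()) + 1), claim, count=1)
    return distorted, doc, 0

MODULES = [paraphrase, distort_value]  # ...plus the four other modules

def synthesize(claim, doc, rng=None):
    """Pick one transformation per call to diversify the training set."""
    rng = rng or random.Random(0)
    return rng.choice(MODULES)(claim, doc)
```

Sampling across modules is what exposes the verifier to both easy perturbations (paraphrases) and hard ones (subtle value or attribution changes).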

This modular design ensures coverage of both easy and challenging cases, exposing the model to a broad spectrum of factual errors. In total, we construct 14,304 training triplets, 1,792 evaluation triplets, and 1,784 test triplets.
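The three stages above (dataset selection, claim extraction, data synthesis) compose into a single generation loop. The sketch below uses hypothetical callables in place of the LLM-backed extraction and synthesis steps:

```python
def build_triplets(qa_pairs, extract_claims, synthesize):
    """Sketch of the three-stage FISCAL pipeline.

    `extract_claims` and `synthesize` stand in for the prompt-based
    LLM modules described in the text (hypothetical signatures).
    """
    triplets = []
    for qa in qa_pairs:                   # stage 1: evidence contexts
        doc = qa["evidence"]
        for claim in extract_claims(doc):  # stage 2: atomic numerical claims
            # stage 3: positive and negative variants per claim-document pair
            for variant, label in synthesize(claim, doc):
                triplets.append((variant, doc, label))
    return triplets

# Toy stand-ins: one evidence context, one claim, one positive + one negative.
qa = [{"evidence": "Revenue was $5.2B in FY2023."}]
extract = lambda d: ["Revenue was $5.2B in FY2023."]
synth = lambda c, d: [(c, 1), (c.replace("5.2", "6.2"), 0)]
data = build_triplets(qa, extract, synth)
```

Generating both a supported and an unsupported variant per claim, as the toy `synth` does, is what keeps the resulting dataset label-balanced.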

This content is AI-processed based on ArXiv data.
