An LLM-Based Macro-Financial Stress-Testing Pipeline: Transparency, Verifiability, and Risk Assessment


📝 Abstract

We develop a transparent and fully auditable LLM-based pipeline for macro-financial stress testing, combining structured prompting with optional retrieval of country fundamentals and news. The system generates machine-readable macroeconomic scenarios for the G7 covering GDP growth, inflation, and policy rates, which are translated into portfolio losses through a factor-based mapping that enables Value-at-Risk and Expected Shortfall assessment relative to classical econometric baselines. Across models, countries, and retrieval settings, the LLMs produce coherent, country-specific stress narratives, yielding stable tail-risk amplification with limited sensitivity to retrieval choices. Comprehensive plausibility checks, scenario diagnostics, and ANOVA-based variance decomposition show that risk variation is driven primarily by portfolio composition and prompt design rather than by the retrieval mechanism. The pipeline incorporates snapshotting, deterministic modes, and hash-verified artifacts to ensure reproducibility and auditability. Overall, the results demonstrate that LLM-generated macro scenarios, when paired with transparent structure and rigorous validation, can provide a scalable and interpretable complement to traditional stress-testing frameworks.

CCS Concepts: • Computing methodologies → Natural language generation; • Information systems → Retrieval models and ranking; • Applied computing → Economics; • General and reference → Evaluation.

📄 Content

Macroeconomic stress testing is central to financial stability analysis and bank supervision [66]. Stress tests articulate adverse but plausible macroeconomic conditions to assess vulnerabilities, determine capital adequacy, and inform policy design. Regulatory authorities such as the Federal Reserve and the ECB provide top-down narratives, while financial institutions implement internal frameworks based on historical replays, econometric models, or Monte Carlo simulations [48,68]. Yet, despite their institutional importance, traditional stress-testing pipelines face persistent challenges. First, they struggle to represent low-frequency disruptions such as pandemics, supply-chain failures, energy shocks, or geopolitical fragmentation [9,11] that fall outside econometric training windows. Second, scenario design remains manually intensive [1,6] and difficult to scale across jurisdictions or portfolios. Third, econometric systems are often slow to adapt to real-time information [56], limiting responsiveness in fast-moving macro-financial environments.

Large Language Models (LLMs) offer a promising complement. Their ability to synthesize structured macro narratives from heterogeneous information sources has been demonstrated across domains including software engineering [34,41,53,70], education [14,72], and policy analysis [13,25,61]. For stress testing, LLMs can rapidly generate country-specific macroeconomic scenarios while remaining interpretable to human analysts. However, unconstrained generation poses well-known risks: hallucination, numerical drift, internal inconsistency, and limited reproducibility [35,37,62,65,69]. These issues motivate hybrid architectures with explicit grounding, structure, and diagnostics.

In this paper we develop and evaluate a transparent, retrieval-optional pipeline for macro-financial scenario generation. Our system couples structured country profiles with optional news retrieval, prompting GPT-5-mini and Llama-3.1-8B-Instruct to emit machine-readable macro shocks (GDP, inflation, interest rates). These shocks are then translated into portfolio losses through a linear factor channel, enabling direct computation of scenario-induced VaR and CVaR multiples relative to historical and econometric baselines. The design preserves narrative flexibility while enforcing numerically stable, auditable outputs.
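The macro→portfolio channel described above can be sketched in a few lines. Everything below is illustrative: the JSON field names, the factor loadings (`betas`), and the simulated baseline distribution are assumptions for exposition, not the paper's actual scenario schema or calibration.

```python
import json
import numpy as np

# A hypothetical machine-readable scenario, as the pipeline might emit it.
# Shocks are in percentage points; the schema here is an illustrative assumption.
scenario_json = '''{
  "country": "DE",
  "shocks": {"gdp_growth": -3.0, "inflation": 4.5, "policy_rate": 2.0}
}'''
scenario = json.loads(scenario_json)

# Linear factor channel: portfolio loss = -(beta . shocks).
# Loadings below are assumed, stated in % portfolio loss per pp of shock.
betas = {"gdp_growth": 0.8, "inflation": -0.3, "policy_rate": -0.5}
shocks = scenario["shocks"]
scenario_loss = -sum(betas[k] * shocks[k] for k in betas)  # → 4.75

# Baseline loss distribution; here simulated standard-normal losses stand in
# for a historical-bootstrap or GARCH baseline.
rng = np.random.default_rng(0)
baseline_losses = rng.normal(0.0, 1.0, 100_000)

def var_es(losses, alpha=0.99):
    """Value-at-Risk and Expected Shortfall at level alpha (loss convention)."""
    var = np.quantile(losses, alpha)
    es = losses[losses >= var].mean()
    return var, es

base_var, base_es = var_es(baseline_losses)
print(f"scenario loss: {scenario_loss:.2f}")
print(f"VaR multiple: {scenario_loss / base_var:.2f}")
print(f"ES multiple:  {scenario_loss / base_es:.2f}")
```

The "multiples" at the end mirror the paper's reporting convention: the scenario-induced loss expressed relative to a baseline tail measure, so a value above 1 indicates tail-risk amplification.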

Motivation. Traditional econometric stress tests struggle to scale or update quickly, while fully generative LLM approaches lack governance guarantees. Our aim is to bridge these approaches: retaining the interpretability and auditability of structured stress-testing frameworks while exploiting the adaptability and expressiveness of modern LLMs.

Our main contributions are as follows:

(1) A fully auditable Prompt-RAG pipeline for macro-financial scenario generation, with structured JSON outputs and optional grounding via country profiles and news retrieval.

(2) A comprehensive G7 experiment comprising 840 intended scenarios per model (7 countries × 30 prompt variants × 4 retrieval configurations), of which 627/617/307 survive plausibility filtering for deterministic GPT-5-mini, non-deterministic GPT-5-mini, and Llama-3.1-8B-Instruct, respectively.

(3) A consistent macro→portfolio mapping using a linear factor channel, enabling computation of VaR/CVaR multiples relative to historical bootstrap, EWMA, and GARCH(1,1)-t baselines.

(4) Extensive diagnostics, including scenario plausibility checks, dispersion analysis, cross-run stability, fairness metrics, and ANOVA variance decomposition, showing that portfolio composition and prompt design dominate risk variance while RAG/news retrieval has only marginal effects.

(5) A complete reproducibility and governance layer: deterministic run modes, hash-verified artifact manifests, and explicit snapshotting to ensure replayability across models and retrieval configurations.
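The hash-verified artifact manifests in the governance layer can be sketched as follows. This is a minimal illustration of the idea only; the actual file layout, manifest format, and hashing scheme used in the paper are not specified here, so the `*.json` glob and dict-based manifest are assumptions.

```python
import hashlib
from pathlib import Path

def build_manifest(artifact_dir: str) -> dict:
    """Record a SHA-256 digest for every JSON artifact in a run directory,
    so a replayed run can be verified byte-for-byte against the original."""
    manifest = {}
    for path in sorted(Path(artifact_dir).glob("*.json")):
        manifest[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest

def verify_manifest(artifact_dir: str, manifest: dict) -> bool:
    """Return True iff every recorded artifact still matches its stored hash
    and no artifact has been added or removed."""
    return build_manifest(artifact_dir) == manifest
```

Combined with deterministic run modes, an equality check on the manifest is sufficient to certify that a rerun reproduced the original scenario outputs exactly; any silent edit to an artifact flips `verify_manifest` to `False`.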

From a practitioner perspective, the pipeline is best viewed as a “scenario generator” for risk committees: given a fixed regulatory or internal baseline, the LLM produces a menu of country-specific, narrative-rich shocks that can be screened, edited, and selectively added to an institution’s scenario library, rather than replacing existing frameworks.

Large language models in finance. Early NLP applications relied on domain lexicons and linear models but struggled with contextual nuance [46,49,67]. Transformer architectures and, later, large language models (LLMs) closed that gap [4,77]. The survey of Xing et al. documented accuracy gains of over 40% relative to bag-of-words baselines on stock-return prediction tasks [75]. Domain-specific pre-training further boosts performance: BloombergGPT (50B parameters) outperformed general models by up to 15 pp on 14 financial NLP benchmarks [74], while the open-source FinGPT project emphasises continual web-scale fine-tuning for reproducibility [44]. LLMs also show strong zero-shot capabilities: ChatGPT improves short-horizon equity-return forecasts from headlines [45], and GPT-4 c

