CN-Buzz2Portfolio: A Chinese-Market Dataset and Benchmark for LLM-Based Macro and Sector Asset Allocation from Daily Trending Financial News
Large Language Models (LLMs) are rapidly transitioning from static Natural Language Processing (NLP) tasks, such as sentiment analysis and event extraction, to acting as dynamic decision-making agents in complex financial environments. However, the evolution of LLMs into autonomous financial agents faces a significant dilemma in evaluation paradigms. Direct live trading is irreproducible and prone to outcome bias because it confounds luck with skill, whereas existing static benchmarks are often confined to entity-level stock picking and ignore broader market attention. To facilitate the rigorous analysis of these challenges, we introduce CN-Buzz2Portfolio, a reproducible benchmark grounded in the Chinese market that maps daily trending news to macro and sector asset allocation. Spanning a rolling horizon from 2024 to mid-2025, our dataset simulates a realistic public attention stream, requiring agents to distill investment logic from high-exposure narratives instead of pre-filtered entity news. We propose a Tri-Stage CPA Agent Workflow involving Compression, Perception, and Allocation to evaluate LLMs on broad asset classes such as Exchange Traded Funds (ETFs) rather than individual stocks, thereby reducing idiosyncratic volatility. Extensive experiments on nine LLMs reveal significant disparities in how models translate macro-level narratives into portfolio weights. This work provides new insights into the alignment between general reasoning and financial decision-making, and all data, code, and experiments are released to promote sustainable financial agent research.
💡 Research Summary
The paper introduces CN‑Buzz2Portfolio, a reproducible benchmark designed to evaluate large language models (LLMs) as autonomous financial agents in the Chinese market. Unlike existing static benchmarks that focus on entity‑level stock picking, this dataset captures a daily stream of the top‑20 trending topics from four major Chinese financial platforms, representing the “public attention” that real investors encounter. The benchmark spans a rolling horizon from 2024 through mid‑2025 and requires agents to translate these open‑world news narratives into macro‑level and sector‑level portfolio allocations using diversified exchange‑traded funds (ETFs) rather than individual equities, thereby reducing idiosyncratic volatility.
To standardize evaluation, the authors propose a three‑stage “Compression‑Perception‑Allocation” (CPA) workflow. In the Compression stage, a summarizer (ASum) filters noisy click‑bait and non‑financial items, producing a concise list of financially relevant events. The Perception stage (Ana) performs qualitative analysis, mapping each event to potential impacts on predefined macro and sector ETFs without relying on price data. Finally, the Allocation stage (Trade) integrates Ana’s insights with historical price and trading records and issues rebalancing commands to a deterministic execution engine that handles all arithmetic. Buy orders are expressed as budgets and sell orders as ratios, which avoids numerical errors.
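The division of labour between the Trade stage and the execution engine can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `Portfolio`, the method names, the ticker `ETF_A`, fractional units, and the reading of a sell "ratio" as a fraction of the currently held position are all assumptions made for illustration. The 0.01 % fee matches the transaction cost stated in the summary's experimental setup.

```python
from dataclasses import dataclass, field

FEE_RATE = 0.0001  # 0.01 % per trade, as stated in the summary's setup


@dataclass
class Portfolio:
    """Hypothetical deterministic engine: the model names a budget or a
    ratio, and the engine does all of the arithmetic."""
    cash: float
    holdings: dict = field(default_factory=dict)  # ticker -> units held

    def buy(self, ticker: str, budget: float, price: float) -> float:
        """Spend at most `budget` RMB (fee included); return units bought."""
        if budget <= 0 or price <= 0:
            raise ValueError("budget and price must be positive")
        spend = min(budget, self.cash)  # never spend more cash than is available
        # Assumption: fractional units are allowed (real ETFs trade in lots).
        units = spend / (price * (1 + FEE_RATE))
        self.cash -= spend
        self.holdings[ticker] = self.holdings.get(ticker, 0.0) + units
        return units

    def sell(self, ticker: str, ratio: float, price: float) -> float:
        """Sell the fraction `ratio` of the held position; return net proceeds."""
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("ratio must lie in [0, 1]")
        held = self.holdings.get(ticker, 0.0)
        units = held * ratio
        proceeds = units * price * (1 - FEE_RATE)
        self.holdings[ticker] = held - units
        self.cash += proceeds
        return proceeds

    def value(self, prices: dict) -> float:
        """Mark the portfolio to market at the given prices."""
        return self.cash + sum(u * prices[t] for t, u in self.holdings.items())


portfolio = Portfolio(cash=100_000.0)  # initial capital from the summary
bought = portfolio.buy("ETF_A", budget=20_000.0, price=2.0)
received = portfolio.sell("ETF_A", ratio=0.5, price=2.0)
```

Under these assumptions the model never has to compute a share count, so an order cannot overspend cash or sell more than is held. That is one plausible reading of how budget and ratio orders avoid numerical errors.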
The experimental setup simulates a retail investor with an initial capital of 100,000 RMB, daily rebalancing at market close, and a realistic transaction cost of 0.01 %. Two distinct market periods are tested: (1) the full year of 2024, characterized by a “bear‑to‑bull” transition and intense policy shifts, and (2) the first half of 2025, marked by high volatility but sideways price movement. Nine state‑of‑the‑art LLMs are evaluated, divided into “reasoning‑oriented” models (DeepSeek‑R1, Qwen‑3‑Max‑Think, Qwen‑3‑32B‑Think) that incorporate chain‑of‑thought mechanisms, and “general instruction” models (GPT‑5, Gemini‑2.5‑Pro, DeepSeek‑V3, GLM‑4.6, Qwen‑3‑Max, Qwen‑3‑32B).
Performance is measured across four metrics: cumulative return, Sharpe ratio, maximum drawdown, and volatility. Results show that reasoning‑oriented models consistently outperform general models. Qwen‑3‑Max‑Think achieves the highest cumulative return (12.3 % in the macro‑thematic task) and a Sharpe ratio above 1.2, while maintaining a maximum drawdown under 10 %. In contrast, GPT‑5 records a modest 4.1 % return, a Sharpe ratio near 0.6, and a maximum drawdown of 18 %, indicating poorer risk management and less effective translation of policy narratives into sector bets. All models struggle with delayed policy effects and simultaneous multi‑sector impacts, highlighting a current limitation of LLMs in capturing temporal causal relationships.
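For reference, the four metrics named above can be computed from a series of daily portfolio values as in the following sketch. The summary does not give the paper's exact formulas, so the conventions here are assumptions: 252 trading days for annualization, a zero risk-free rate by default, the sample standard deviation, and drawdown measured as a fraction of the running peak. All function names are hypothetical.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualization convention


def daily_returns(values):
    """Simple day-over-day returns of a portfolio value series."""
    return [b / a - 1.0 for a, b in zip(values, values[1:])]


def cumulative_return(values):
    """Total return from the first to the last observation."""
    return values[-1] / values[0] - 1.0


def annualized_volatility(values):
    """Sample standard deviation of daily returns, annualized."""
    return stdev(daily_returns(values)) * math.sqrt(TRADING_DAYS)


def sharpe_ratio(values, risk_free_daily=0.0):
    """Annualized mean excess return divided by its standard deviation."""
    excess = [r - risk_free_daily for r in daily_returns(values)]
    sd = stdev(excess)
    if sd == 0:
        return float("nan")  # undefined when returns never vary
    return mean(excess) / sd * math.sqrt(TRADING_DAYS)


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


# Toy value series for illustration only; not data from the paper.
curve = [100.0, 110.0, 99.0, 120.0]
total = cumulative_return(curve)
drawdown = max_drawdown(curve)
```

On the toy series the portfolio ends 20 % above its starting value, and the largest decline is the fall from 110 to 99, which is 10 % of the peak.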
The authors openly release the full dataset, code, and experimental results, ensuring reproducibility and facilitating future research. They acknowledge limitations: the ETF universe is confined to Chinese A‑share markets, and the simulation does not fully capture market frictions such as slippage or liquidity shocks. Future work is suggested in three directions: expanding the asset universe to include global bonds and alternative investments, integrating temporal causal reasoning (e.g., graph‑based or meta‑learning approaches) to better model policy lag effects, and developing multi‑agent collaborative frameworks that combine qualitative narrative analysis with quantitative optimization.
In summary, CN‑Buzz2Portfolio provides the first comprehensive benchmark that bridges the gap between LLM semantic understanding and actionable macro‑sector investment decisions in an emerging market context. By focusing on public‑attention‑driven asset allocation and offering a clear, reproducible evaluation pipeline, the work sets a solid foundation for advancing LLM‑based autonomous financial agents.