Alpha Discovery via Grammar-Guided Learning and Search
Automatically discovering formulaic alpha factors is a central problem in quantitative finance. Existing methods often ignore syntactic and semantic constraints, relying on exhaustive search over unstructured and unbounded spaces. We present AlphaCFG, a grammar-based framework for defining and discovering alpha factors that are syntactically valid, financially interpretable, and computationally efficient. AlphaCFG uses an alpha-oriented context-free grammar to define a tree-structured, size-controlled search space, and formulates alpha discovery as a tree-structured linguistic Markov decision process, which is then solved using a grammar-aware Monte Carlo Tree Search guided by syntax-sensitive value and policy networks. Experiments on Chinese and U.S. stock market datasets show that AlphaCFG outperforms state-of-the-art baselines in both search efficiency and trading profitability. Beyond trading strategies, AlphaCFG serves as a general framework for symbolic factor discovery and refinement across quantitative finance, including asset pricing and portfolio construction.
💡 Research Summary
This paper addresses a fundamental challenge in quantitative finance: the automated discovery of formulaic “alpha factors”, which are explicit, interpretable mathematical functions that predict future stock returns. Existing approaches, including heuristic methods, data-driven machine learning, and genetic programming, often search inefficiently over vast, unstructured combinatorial spaces and produce factors that are hard to interpret.
The authors introduce AlphaCFG, a novel grammar-guided framework that reformulates alpha discovery as a structured language generation and learning problem. The core innovation lies in defining the search space using formal grammars. First, a syntax-enforcing grammar (α-Syn) generates only well-formed expressions using prefix notation and arity-checked operators. This is then refined into a semantics-enforcing grammar (α-Sem) that incorporates domain-specific financial constraints (e.g., rolling window sizes must be integer constants, operands must be temporally consistent). Finally, a length-bounded version (α-Sem-K) renders the space finite and tractable.
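The three grammar levels can be illustrated with a toy sketch. The operator names, feature names, and window sizes below are hypothetical placeholders, not the paper's actual vocabulary, and the temporal-consistency constraint is omitted. The arity check loosely corresponds to α-Syn, the integer-window check to one α-Sem constraint, and the max_len bound to α-Sem-K.

```python
import random

# Hypothetical toy vocabulary (assumption; not the paper's operator set).
FEATURES = ["open", "close", "high", "low", "volume"]
WINDOWS = [5, 10, 20]                  # rolling windows are integer constants
WINDOW_TOKENS = {str(w) for w in WINDOWS}
UNARY = ["Abs", "Log"]                 # arity 1: (expr)
BINARY = ["Add", "Sub", "Mul", "Div"]  # arity 2: (expr, expr)
ROLLING = ["Mean", "Std", "Delta"]     # arity 2: (expr, window)


def generate(max_len, rng):
    """Sample a prefix-notation token list with at most max_len (>= 1) tokens."""
    # Shortest expansions: feature = 1 token, unary = 2, binary/rolling = 3.
    choices = ["feature"]
    if max_len >= 2:
        choices.append("unary")
    if max_len >= 3:
        choices += ["binary", "rolling"]
    kind = rng.choice(choices)
    if kind == "feature":
        return [rng.choice(FEATURES)]
    if kind == "unary":
        return [rng.choice(UNARY)] + generate(max_len - 1, rng)
    if kind == "rolling":
        # The window slot can only be filled by an integer constant.
        inner = generate(max_len - 2, rng)
        return [rng.choice(ROLLING)] + inner + [str(rng.choice(WINDOWS))]
    # Binary: keep at least one token in reserve for the right operand.
    left = generate(max_len - 2, rng)
    right = generate(max_len - 1 - len(left), rng)
    return [rng.choice(BINARY)] + left + right


def parse(tokens, pos=0):
    """Return the index just past one well-formed expression starting at pos."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of expression")
    tok = tokens[pos]
    if tok in FEATURES:
        return pos + 1
    if tok in UNARY:
        return parse(tokens, pos + 1)
    if tok in BINARY:
        return parse(tokens, parse(tokens, pos + 1))
    if tok in ROLLING:
        end = parse(tokens, pos + 1)
        if end >= len(tokens) or tokens[end] not in WINDOW_TOKENS:
            raise ValueError("window must be an integer constant")
        return end + 1
    raise ValueError(f"unexpected token {tok!r}")


def is_valid(tokens, max_len):
    """True if tokens form exactly one expression within the length bound."""
    try:
        return len(tokens) <= max_len and parse(tokens) == len(tokens)
    except ValueError:
        return False
```

For example, is_valid(["Mean", "close", "10"], 8) accepts a rolling mean with an integer window, while is_valid(["Mean", "close", "volume"], 8) rejects a feature in the window slot. Because generate only ever expands legal rules, every sampled expression passes the validator by construction, which is the point of searching inside a grammar rather than filtering afterwards.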
On this grammar-defined space, the paper formulates alpha discovery as a Tree-Structured Linguistic Markov Decision Process (TSL-MDP). Each state is a partial expression tree, actions are grammar production rules, terminal states are complete alpha factors, and the reward is the Information Coefficient (IC)—the correlation between the factor’s scores and subsequent returns.
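As a rough illustration of the reward, the sketch below computes a per-day Pearson correlation between factor scores and next-period returns across stocks, then averages over days. The summary only says that IC is a correlation between scores and subsequent returns; the choice of Pearson correlation, the per-day averaging, and the function names are assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mean_ic(scores_by_day, returns_by_day):
    """Average the cross-sectional IC over days.

    scores_by_day[t][i]  : factor score of stock i on day t
    returns_by_day[t][i] : return of stock i over the period following day t
    """
    daily = [pearson(s, r) for s, r in zip(scores_by_day, returns_by_day)]
    return sum(daily) / len(daily)
```

A factor that ranks stocks perfectly on one day and in exactly reversed order on the next has daily ICs of 1 and -1, so its mean IC is about zero.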
To solve this TSL-MDP, the authors propose a grammar-aware Monte Carlo Tree Search (MCTS) algorithm enhanced with neural guidance. Partial expression trees are encoded using a Tree-LSTM, which preserves syntactic structure. This encoding feeds into two networks: a value network that estimates the expected IC of a completed factor from the partial tree, and a policy network that predicts promising grammar rules for expansion. During MCTS, node selection is guided by a syntax-sensitive Upper Confidence Bound (UCB) rule that integrates the neural network priors (policy and value) with simulation-derived statistics. This synergy allows the search to be both informed by learned generalizations and driven by actual performance feedback.
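The selection step can be sketched with an AlphaZero-style PUCT score. This is an assumption: the summary does not give the exact form of the syntax-sensitive UCB rule, so the formula, the constant c_puct, and the Edge structure below are illustrative only. The sketch presumes that only grammar-legal rules for the current partial tree are passed in.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    """Search statistics for applying one grammar rule at a node (hypothetical structure)."""
    prior: float            # policy-network probability of this rule
    visits: int = 0         # number of simulations that went through this edge
    value_sum: float = 0.0  # sum of backed-up values (e.g. IC estimates)

    @property
    def q(self):
        """Mean backed-up value; 0.0 for an unvisited edge."""
        return self.value_sum / self.visits if self.visits else 0.0


def select_rule(edges, c_puct=1.0):
    """Pick the rule maximising Q + c_puct * P * sqrt(N_total + 1) / (1 + N).

    edges maps rule name -> Edge and should contain grammar-legal rules only.
    The "+ 1" under the square root is a common variant that lets the prior
    break ties before any simulation has run.
    """
    total = sum(e.visits for e in edges.values())

    def score(edge):
        explore = c_puct * edge.prior * math.sqrt(total + 1) / (1 + edge.visits)
        return edge.q + explore

    return max(edges, key=lambda rule: score(edges[rule]))


def backup(edge, value):
    """Record one simulation result on an edge."""
    edge.visits += 1
    edge.value_sum += value
```

With no visits, the rule with the higher prior is chosen. Once that rule has been visited many times without reward, the exploration term shrinks and a less-visited rule is selected instead, which is how prior knowledge and simulation statistics are balanced in this kind of rule.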
Experiments on Chinese (CSI 300) and U.S. (S&P 500) equity market data show that AlphaCFG outperforms state-of-the-art baselines such as AlphaGen and genetic programming on four metrics: information coefficient (predictive power), Sharpe ratio, cumulative returns, and maximum drawdown. Ablation studies confirm that both the grammar constraints and the tree-structured neural encoding contribute to this performance. AlphaCFG is also shown to refine existing, known alpha factors into better-performing ones.
In conclusion, AlphaCFG provides a principled and effective framework for automated alpha discovery by integrating formal language theory with reinforcement learning and search. Its grammar-based approach ensures syntactic validity, financial interpretability, and computational efficiency. The framework is generalizable, offering a blueprint for symbolic factor discovery and refinement across various quantitative finance tasks beyond trading, such as asset pricing and portfolio construction.