FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment



Qinhong Lin^1, Ruitao Feng^2, Yinglun Feng^1, Zhenxin Huang^3, Yukun Chen^1, Zhongliang Yang^1, Linna Zhou^1, Binjie Fei^2, Jiaqi Liu^2, and Yu Li^2 (✉)

^1 Beijing University of Posts and Telecommunications, greenred99@bupt.edu.cn
^2 Beijing Value Simplex Technology Co. Ltd.
^3 Yangtze Delta Research Institute, University of Electronic Science and Technology of China

Abstract. We study alpha factor mining—the automated discovery of predictive signals from noisy, non-stationary market data—under a practical requirement that mined factors be directly executable and auditable, and that the discovery process remain computationally tractable at scale. Existing symbolic approaches are limited by bounded expressiveness, while neural forecasters often trade interpretability for performance and remain vulnerable to regime shifts and overfitting. We introduce FactorEngine (FE), a program-level factor discovery framework that casts factors as Turing-complete code and improves both effectiveness and efficiency via three separations: (i) logic revision vs. parameter optimization, (ii) LLM-guided directional search vs. Bayesian hyperparameter search, and (iii) LLM usage vs. local computation. FE further incorporates a knowledge-infused bootstrapping module that transforms unstructured financial reports into executable factor programs through a closed-loop multi-agent extraction–verification–code-generation pipeline, and an experience knowledge base that supports trajectory-aware refinement (including learning from failures).
Across extensive backtests on real-world OHLCV data, FE produces factors with substantially stronger predictive stability and portfolio impact (for example, higher IC/ICIR and Rank IC/ICIR, and improved AR/Sharpe) than baseline methods, achieving state-of-the-art predictive and portfolio performance.

Keywords: alpha factor mining · programmatic factors · program synthesis · large language models · Bayesian optimization.

1 Introduction

Alpha mining is a central objective in quantitative investment, aiming to discover predictive factors that extract actionable signals from noisy, non-stationary market data. Despite decades of research [15, 6, 23, 17], effective and efficient factor discovery remains challenging due to market complexity, regime shifts, and severe noise. Existing factor mining approaches can generally be divided into

Fig. 1: Overview of FactorEngine (FE). Left: Bootstrapping extracts factor ideas and converts pseudocode into executable Python to seed a knowledge-infused pool.
Center: Evolution performs macro–micro co-evolution: LLM agents propose macro mutations guided by chains of experience, and Bayesian search conducts micro-level parameter tuning with fast validation and feedback updates. Right: Integration selects elite factors to train models for backtesting, producing portfolio-level feedback.

two categories: symbolic expression-based methods and neural network-based methods. Symbolic factors [4, 8] are built on explicit mathematical expressions, providing strong interpretability and clear financial intuition. These factors often rely on handcrafted rules and domain expertise, leading to heavy manual effort and limited scalability. Additionally, symbolic factors tend to be fragile in the face of rapidly changing market conditions and less adaptive to real-world complexities. In recent years, genetic programming (GP) and reinforcement learning (RL) [13, 21, 22] have been used for symbolic factor discovery, enabling automated search within predefined operator spaces and accelerating factor evolution while preserving a degree of interpretability. However, their strong dependence on manually designed operator sets constrains expressiveness, leading to limited performance and efficiency in practice. Neural network-based approaches [3, 20, 19] can capture implicit patterns and nonlinear relationships in market data. While achieving strong predictive performance, these methods typically suffer from poor interpretability and are prone to overfitting, especially under unconstrained model architectures and limited financial inductive bias. More recently, large language models (LLMs) have demonstrated remarkable capabilities across a wide range of domains [1, 18], sparking growing interest in their application to alpha mining.
AlphaAgent [16] integrates LLM reasoning with financial report knowledge and regularized exploration to mitigate alpha decay, while RD-Agent-Quant (RD-AGENT) [9] proposes an agent-based, data-centric framework for joint factor and model optimization.

Despite recent advancements, existing alpha mining methods remain challenged by effective domain-knowledge integration and efficient factor discovery. Specifically, we identify three critical challenges in current approaches: (1) Bounded expressiveness due to symbolic factor reliance: symbolic factors are constrained by limited operator spaces, resulting in restricted search capacity, fragile evolved factors, and a heavy dependence on specific training periods. (2) Limited factor diversity and stability: current methods lack mechanisms to effectively integrate financial theory and transform complex, high-level features from financial reports into executable factors. (3) Inefficient evolution pipelines: a significant speed mismatch exists between LLM generation (e.g., proposals) and evaluation-signal production (e.g., backtesting), leading to high computational costs and low overall efficiency. These challenges highlight the need for more robust and scalable factor evolution frameworks.

To address these challenges, we introduce FactorEngine (FE), a program-level factor evolution framework integrating logic evolution with Bayesian hyperparameter optimization.
FE treats parameter optimization as a computationally intensive process distinct from semantic reasoning and realizes three key separations for efficient macro–micro co-evolution: (1) logic separation between program logic/idea evolution and parameter optimization; (2) search-strategy separation between LLM-driven directional search and automated Bayesian search; (3) resource separation between LLM utilization and local computation resources. The LLM agents focus on logic discovery, while local computation with Bayesian search automates parameter optimization. Unlike prior works, FE continuously evolves factors as Turing-complete programs, allowing complex control flow, conditional logic, and iterative computation. This enables more flexible modeling of market dynamics and higher-order feature interactions, making the factors more adaptable to rapidly changing market conditions. Initialized with domain-knowledge-infused factors derived from financial reports and expert-designed factors, FE enhances both efficiency and performance. Across extensive backtesting on real-world market data, FE consistently outperforms existing methods on predictive and portfolio metrics. Our contributions are as follows:

– Program-Level Hyper-Heuristic Framework: We propose FactorEngine (FE), a system that transforms factor mining into a Turing-complete program evolution problem. FactorEngine leverages environmental feedback and chains of experience to guide LLMs in heuristic searches within high-dimensional code spaces, yielding high-performing and interpretable factors.

– Macro–Micro Co-evolution: FactorEngine decouples evolution into macro-level heuristic logic evolution and micro-level hyperparameter optimization via Bayesian search, effectively addressing the local-optimum issue of parameters, overcoming efficiency bottlenecks, and reducing evolution costs.
– Knowledge-Infused Factor Diversity: We propose a closed-loop multi-agent module that precisely transforms features from unstructured financial reports into programmatic factors, enabling the system to exploit prior knowledge from diverse research grounded in transparent economic rationales.

– Superior Performance & Diversity: Extensive experiments demonstrate that FE surpasses state-of-the-art baseline methods on predictive and portfolio metrics. Notably, FE achieves a 58% improvement in Information Coefficient (IC) and a 126% increase in excess annual return compared to Alpha158 with factors initially derived from financial reports. Additionally, FE enhances the diversity of the factor pool compared to state-of-the-art methods.

2 Related Work

2.1 Traditional Alpha Mining

Traditional alpha mining involves handcrafted factors derived from financial domain knowledge, such as Alpha158 and Alpha360 from Qlib^4, which are known for their stability and powerful performance. However, manual factor design is labor-intensive and difficult to scale, which motivates automated symbolic approaches. To this end, genetic programming (GP) methods automatically discover factors with predefined operators. AlphaEvolve [12] further enhances GP with optimization over parameters and matrix-based operations, incorporating AutoML techniques. In parallel, reinforcement learning (RL)-based methods formulate factor mining as a sequential decision-making problem, using financial signals such as Sharpe or Calmar ratios as rewards. AlphaForge [13] adopts a two-stage RL framework to discover factor combinations and adaptively adjust weights. Neural factor models have also been widely explored in alpha mining.
Classical machine learning models [2], deep learning models [7, 17], and time-series models [5] have been proposed to extract implicit representations, reducing reliance on explicit symbolic expressions. Nevertheless, these methods still exhibit limited stability and robustness, and are prone to alpha decay under rapidly changing market conditions.

2.2 LLM in Finance

Recently, large language models have emerged as a promising direction for alpha mining. FAMA [10] introduces dynamic factor combination and cross-sample selection to adapt across market regimes. Shi et al. [14] leverage LLM-powered Monte Carlo Tree Search to improve exploration efficiency. AlphaAgent [16] leverages agent-based frameworks to extract inspiration for factor evolution from financial materials, incorporating diversity-aware constraints to mitigate factor decay. RD-Agent [9] proposes a data-centric framework that jointly evolves factors and multi-factor models, achieving an end-to-end automation pipeline that translates model knowledge into symbolic expressions and executable code. Although these methods significantly improve performance, they still rely on symbolic representations, restricting expressive power and search space. Furthermore, LLMs in these works are required to handle both logic evolution and parameter optimization, limiting scalability and evolution efficiency.

^4 https://github.com/microsoft/qlib

3 Problem Formulation

Consider an N-stock universe S = {s_1, s_2, ..., s_N} observed over T trading days T = {t_1, t_2, ..., t_T}. For each stock s_i on each day t ∈ T, we observe an M-dimensional feature vector. Let X_{t−L+1:t} ∈ R^{N×L×M} denote the raw market features over a lookback window of length L ending at day t.
The objective of factor mining is to learn an alpha factor f that maps historical features to an l-step-ahead predictive signal r_{t+l} ∈ R^N, where each element corresponds to the predicted signal for one stock. Formally, a factor is defined as f(X_{t−L+1:t}) → r_{t+l}. In practice, we often construct a set of K factors {f_k}_{k=1}^{K}. The outputs of these factors are aggregated by a function g (e.g., linear regression or a neural network) into a composite predictive signal: z_t = g(f_1(X_{t−L+1:t}), ..., f_K(X_{t−L+1:t})). Let Y_t = {y_{t,1}, y_{t,2}, ..., y_{t,N}} ∈ R^N denote the ground-truth future returns at time t, where y_{t,i} represents the realized return of stock s_i over a predefined horizon (e.g., next-day or next-10-day returns). Collecting predictions over time yields Z = {z_t}_{t=L}^{T−1} and Y = {Y_t}_{t=L}^{T−1}. The objective of alpha mining is then to construct a new set of K factors that maximizes a predefined performance metric R(Z, Y), such as the Information Coefficient (IC), evaluated over the entire time horizon.

4 Methodology

In this work, we focus on programmatic (code-based) factors as the fundamental representation for alpha mining. In practice, FactorEngine (FE) enforces explicit interface constraints within each factor program, including predefined input data types, output formats, permissible Python libraries, and task-specific execution semantics. This design ensures that all evolved factors are executable, comparable, and compatible with downstream evaluation and modeling components. As illustrated in Fig.
1, the FE system consists of three functionally decoupled yet collaboratively interacting modules, forming a closed-loop pipeline: (1) a Bootstrapping Module, which constructs a knowledge-infused initial factor pool; (2) an Evolution Module, which performs code-level factor evolution guided by chains of experience and empirical feedback signals; and (3) an Integration Module, which supports multi-factor modeling and market-data backtesting.

4.1 Bootstrapping Module

The Bootstrapping Module enables the systematic extraction, refinement, and transformation of expert knowledge inside financial reports and expert-designed factors into programmatic factors. In contrast to traditional symbolic approaches that use reports only as conceptual cues for hypothesis generation, we propose a closed-loop multi-agent system that transforms report-derived knowledge into executable factors, thereby overcoming the limitations of predefined expression spaces. The Bootstrapping Module consists of three interconnected submodules. (1) PDF Processing: performs LLM-based compliance screening to retain valid reports, and consolidates the core knowledge with the model's domain knowledge, yielding reliable inputs for downstream factor extraction. (2) Factor Extraction: implements a two-step understanding-to-generation workflow with iterative reflection and verification to distill core financial ideas from research reports into structured JSON representations accompanied by LaTeX-formatted pseudocode. (3) Code Generation: transforms verified pseudocode and core-idea summaries into executable Python code through iterative refinement that validates structural compliance.

Fig. 2: Overview of the Bootstrapping module. (The figure walks a polar-coordinate-based price-volume fusion factor from report extraction through pseudocode refinement, code synthesis, an execution error, and self-refinement into executable Python.)

All successfully extracted factors, along with their core-idea summaries and economic rationales, are stored as a knowledge-infused initial factor pool that serves as the seed population for the Evolution Module.
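To make the interface contract concrete, here is a minimal sketch of a factor program in the required shape. It uses plain Python containers instead of the DataFrame pipeline the system actually runs, and the momentum logic is purely illustrative, not a factor from the pool; only the fixed input/output contract reflects the constraints described above.

```python
def factor(prices: dict[str, list[float]], window: int = 5) -> list[tuple[int, str, float]]:
    """Map per-instrument close histories to (day_index, instrument, value) rows.

    Mirrors the enforced contract: a predefined input schema (market history)
    and a predefined output schema (datetime, instrument, value). The signal
    itself, simple momentum over a lookback window, is an illustrative stand-in.
    """
    rows = []
    for inst, closes in prices.items():
        for t in range(window, len(closes)):
            # momentum: relative price change over the lookback window
            rows.append((t, inst, closes[t] / closes[t - window] - 1.0))
    return rows

# usage: one instrument, seven trading days
out = factor({"s1": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]})
```

Because every evolved program honors the same signature, downstream evaluation and multi-factor modeling can treat factors interchangeably.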
This module not only ensures quality by automatically identifying and repairing logical gaps in reports, but also enables high-fidelity, scalable extraction, rapidly constructing an extensive initial pool that substantially broadens the scope of factor mining.

4.2 Evolution Module

The Evolution Module, the heart of the FE system, is designed to improve factor performance through a macro–micro co-evolution mechanism. It combines empirical validation and analysis with human-like self-refinement. The co-evolution mechanism separates logic evolution from parameter optimization: at the macro level, agents explore and refine factor logic, while at the micro level, Bayesian optimization fine-tunes parameters. Our evolution framework is inspired by OpenEvolve^5. In particular, we reuse its agent orchestration mechanisms with our re-designed evolution logic, added chain-of-experience, and new program-level mutation strategies suited to factor mining.

Each evolution iteration follows a four-stage pipeline: Program Selection, Idea Generation, Implementation, and Analysis. It can be interpreted through the paradigm of reinforcement learning: program selection and idea generation jointly define the action, automated instantiation and evaluation correspond to the environment transition, and mutation analysis produces a reward signal that guides future exploration. It can be formulated as F = (P, E, φ), where P is the tree-structured program search space, E is the execution and verification environment, and φ is the agent's parameterized priors.

^5 https://github.com/algorithmicsuperintelligence/openevolve

Program Selection. The evolution factor pool is organized as a tree, where each node corresponds to an executable program evolved from its parent and can be directly evaluated to obtain an immediate reward.
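The four-stage iteration (selection, idea generation, implementation, analysis/feedback) can be sketched as a single loop step. Every callable below is a hypothetical stand-in for the corresponding FE component (UCT selection, the LLM agent, the Bayesian parameter search, the backtester), not the system's actual API.

```python
def evolve_step(pool, select, propose_idea, implement, evaluate, backpropagate):
    """One FE-style iteration over a pool of candidate programs."""
    parent = select(pool)                 # Program Selection (UCT over the tree)
    idea = propose_idea(parent, pool)     # Idea Generation (macro mutation via LLM)
    child = implement(parent, idea)       # Implementation (incl. micro parameter search)
    reward = evaluate(child)              # backtest -> combined_score
    backpropagate(pool, child, reward)    # Analysis / feedback propagation
    return child, reward

# toy run: the pool is a list of (program_text, reward) pairs
pool = [("root_program", 0.0)]
child, reward = evolve_step(
    pool,
    select=lambda p: max(p, key=lambda node: node[1]),
    propose_idea=lambda parent, p: "widen the lookback window",
    implement=lambda parent, idea: parent[0] + " # " + idea,
    evaluate=lambda prog: 0.03,
    backpropagate=lambda p, prog, r: p.append((prog, r)),
)
```

The separation of concerns is the point: only `propose_idea` touches the LLM, while `implement` and `evaluate` run on local compute.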
At each iteration, the evolution process selects the most promising node from the current tree as the context for subsequent refinement. We define the node value Q(v) as the empirical mean reward over all evaluations within the subtree rooted at v (including v itself), reflecting both its performance and its potential. After each evaluation, Q(v) and the visit count N(v) are updated via backpropagation along the ancestor path. To balance exploration and exploitation, we adopt the Upper Confidence Bound for Trees (UCT) criterion to score candidate nodes. Specifically, the value of a node is

UCT(v) = Q(v) + c · sqrt( ln N_parent(v) / N(v) ),   (1)

where N_parent(v) denotes the visit count of v's parent, and c is a constant that controls the exploration-exploitation trade-off; we set c = √2, as is common. Unlike standard MCTS settings, where nodes represent partial states or intermediate decisions, each node in our tree is a fully specified program and thus independently executable and evaluable, enabling flexible evolution.

Idea Generation. We prompt the agent to synthesize environmental feedback and its parametric knowledge to generate high-level inspirations and structural modifications of programs. This process emphasizes semantic reasoning, pattern abstraction, and conceptual exploration, where the LLM excels, realizing the macro mutations shown in Fig. 1. Based on the selected program node, we construct C, an evolution chain of experience (CoE) drawn from the evolution pool, representing the historical trajectory leading to the current node. From the global tree, we further select n = 3 candidate paths {p_i}_{i=1}^{n} that jointly balance high empirical performance and low overlap with the current chain. Each path serves as a compact representation of prior evolution experience.
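The node bookkeeping behind Eq. (1), UCT scoring plus Q/N backpropagation, can be sketched as follows. The node fields are assumptions consistent with the text, not FE's actual data structures.

```python
import math

class Node:
    """Program node: Q is the empirical mean reward of its subtree, N the visit count."""
    def __init__(self, parent=None):
        self.parent, self.children = parent, []
        self.Q, self.N = 0.0, 0

def uct(node, c=math.sqrt(2)):
    """Eq. (1); unvisited nodes score infinity so they are explored first."""
    if node.N == 0:
        return float("inf")
    return node.Q + c * math.sqrt(math.log(node.parent.N) / node.N)

def backpropagate(node, reward):
    """Fold a new evaluation into Q/N along the ancestor path up to the root."""
    while node is not None:
        node.Q = (node.Q * node.N + reward) / (node.N + 1)  # running mean
        node.N += 1
        node = node.parent
```

Selection then amounts to descending the tree, always picking the child with the highest `uct` value.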
Formally, given C, p_i = {p_{i,1}, p_{i,2}, ..., p_{i,j}}, and an overlap function Cvg(·), we define the coverage score as

S_cov(p_i) = Cvg(C, p_i) = α |Φ| / |p_i| + β |Φ| / |C|,   (2)

where Φ = p_i ∩ C, Cvg(·) measures the degree of overlap between the candidate path and the current chain, and |p_i| denotes the number of nodes in the path. The effectiveness score of a path is defined as

S_eff(p_i) = (1 / |p_i|) Σ_{m=1}^{|p_i|} Score(p_{i,m}),   (3)

where Score(·) denotes the empirical evaluation metric of a node (e.g., IC or backtesting performance). The final path score of p_i is then computed as

S_total(p_i) = S_eff(p_i) − γ S_cov(p_i).   (4)

These selected paths are then assembled into a structured context, which exposes explicit experience knowledge to the LLM and cooperates with its intrinsic parametric knowledge to generate a mutation idea. In practice, evolutionary optimization is inherently non-monotonic. Consequently, FE explicitly captures the full dynamic process, including transient fluctuations and local setbacks, rather than just the final success. Unlike prior works, which predominantly focus on static high-performing nodes, exposing the agent to these winding historical trajectories stimulates human-like reasoning. This enables LLMs to internalize feedback from failures, learn to recover from performance dips, and steer exploration toward more robust and promising directions.

Implementation. In this stage, we explore micro mutations, i.e., optimization of parameter-related components such as window sizes and decay factors, through an automated search algorithm validated via high-throughput execution.
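The path scoring of Eqs. (2)-(4) reduces to a few lines. Paths are simplified here to lists of node ids with per-node scores, and alpha, beta, gamma default to 1 as in the experimental setup; the real nodes carry richer structure.

```python
def coverage(current, path, alpha=1.0, beta=1.0):
    """Eq. (2): overlap of a candidate path with the current chain of experience."""
    overlap = len(set(path) & set(current))   # |Phi|
    return alpha * overlap / len(path) + beta * overlap / len(current)

def effectiveness(scores):
    """Eq. (3): mean empirical score (e.g., IC) of the nodes on a path."""
    return sum(scores) / len(scores)

def path_score(current, path, scores, gamma=1.0):
    """Eq. (4): favor effective paths that overlap little with the current chain."""
    return effectiveness(scores) - gamma * coverage(current, path)
```

Ranking candidate paths by `path_score` and keeping the top n = 3 yields the experience context handed to the LLM.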
Parameter optimization employs Bayesian search; our implementation supports multiple Bayesian search methods, including the Tree-structured Parzen Estimator (TPE), Gaussian-process-based methods, and other probabilistic optimization algorithms. For a given evolved program P with parameter vector θ ∈ Θ, the optimization problem is θ* = argmax_{θ ∈ Θ} f(P, θ), where f(P, θ) is the evaluation function returning the combined_score metric. These Bayesian methods model the objective function probabilistically and suggest parameters that maximize the Expected Improvement

EI(θ) = ∫_{−∞}^{+∞} max(y* − y, 0) · p(y | θ) dy,

where y* is a performance threshold (typically the top 25% of observed scores), balancing exploration of uncertain regions against exploitation of promising regions to converge efficiently to good parameter combinations. During the Idea Generation phase, the LLM agent specifies parameter search ranges (e.g., window sizes, decay factors) based on domain knowledge and previous results, but the actual parameter exploration is delegated to this automated Bayesian search process.

The validation process employs a two-phase strategy to operationalize the resource separation:

- Phase 1: sequential code validation with LLM-based automatic correction, ensuring evolved programs execute correctly before expensive parallel optimization.
- Phase 2: parallel Bayesian optimization entirely on local computational resources, without the LLM, using validated code. This phase employs multi-process parallel execution, distributing trials across multiple workers that run concurrently. Each worker runs multiple trials, with the Bayesian search algorithm coordinating across workers to avoid redundant exploration.

The evaluation function f(P, θ) computes empirical performance metrics through high-throughput execution on historical market data.
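As a concrete illustration of the Phase 2 interface, the sketch below substitutes plain seeded random search for the TPE/GP methods the system actually uses, so that it stays self-contained. The search-space shape and the toy objective are illustrative assumptions, not taken from FE.

```python
import random

def optimize_params(evaluate, search_space, n_trials=200, seed=0):
    """Stand-in for Phase 2: propose theta, score f(P, theta), keep the argmax.

    `evaluate` plays the role of f(P, theta) returning combined_score;
    `search_space` maps each parameter name to an integer (lo, hi) range,
    as the LLM agent would specify during Idea Generation.
    """
    rng = random.Random(seed)  # seeded for reproducible trials
    best_theta, best_score = None, float("-inf")
    for _ in range(n_trials):
        theta = {k: rng.randint(lo, hi) for k, (lo, hi) in search_space.items()}
        score = evaluate(theta)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score

# toy objective with a peak at window=20, decay=5 (illustrative only)
objective = lambda t: -abs(t["window"] - 20) - abs(t["decay"] - 5)
space = {"window": (5, 60), "decay": (1, 10)}
theta, score = optimize_params(objective, space)
```

A real deployment would replace the uniform proposals with an EI-maximizing suggester and shard the trial loop across worker processes, as described above.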
The evaluation environment supports this through data caching (reducing data-loading time from 30 s to under 1 s), parallel execution across multiple workers, and efficient metrics calculation that aggregates IC (Information Coefficient) and ICIR (the information ratio of the IC) across multiple lag periods (1, 3, 5, and 10 days) into a single combined_score objective. This design ensures that LLM resources are used efficiently (code correction happens once, in Phase 1) while local computation resources are maximized for high-throughput parameter evaluation in Phase 2.

Feedback Propagation. After a new program is instantiated and evaluated, the framework performs feedback propagation to transform empirical outcomes into actionable guidance for subsequent evolution. Specifically, the LLM is prompted to summarize the core implementation logic of the new program, identify its changes relative to both the root and the parent program, and assess its performance improvement or degradation. These elements are jointly organized into a structured format and stored within the node for experience reuse in later iterations. Concurrently, the quantitative feedback is propagated back along the evolution path, updating the Q and N values of all traversed nodes up to the root.

Through the integration of these four stages, FE forms an efficient evolution framework supporting experience-guided reasoning, diverse idea generation, macro–micro coordinated evolution, and online validation. Despite forming a complete evolution loop, we observe that LLM-driven inspiration is inherently stochastic. Consequently, even under identical configurations, individual evolution trajectories may diverge substantially in both structure and performance, potentially leading to inefficient exploration.
To address this issue, we introduce a multi-island evolution configuration on top of the FE framework to improve both efficiency and sampling diversity. Specifically, we initialize N independent evolution processes by duplicating the initial program node and evolve them concurrently, each acting as an isolated island. Every M evolution rounds, each island selects its top-3 programs based on evaluation metrics and migrates copies to the other islands as child nodes of the target island's root node, allowing advantageous mutations discovered in one trajectory to propagate to others. Migrated programs are incorporated into the context construction for subsequent idea generation, increasing the likelihood that successful structural insights from different trajectories are reused and recombined. This design preserves diversity while significantly improving evolution efficiency, facilitating more effective exploration under stochastic LLM-driven inspiration.

4.3 Integration Module

Building upon the bootstrapping and evolution modules, the Integration Module constructs high-performing multi-factor signals from the evolving factor pool via elite node selection and multi-factor modeling, targeting robust predictive performance in real-world financial markets. While explicitly modeling factor correlations against the existing factor pool is an effective strategy, the computational cost of IC-based dependency analysis grows rapidly as the factor pool expands. To address this, we adopt a lightweight threshold-based filtering mechanism to control evaluation complexity. Specifically, for each initial node with candidate evolved factor nodes, we apply a hard performance threshold to
Each factor is ev aluated using a weigh ted fitness score (FS) defined as: F S = 1 4 ( I C ∗ 10 + I C I R + RI C ∗ 10 + RI C I R ) (5) whic h jointly captures b oth the magnitude and stability of linear and rank-based predictiv e signals. In practice, under a rolling ev aluation windo w of length L =2, w e retain at most the top 5 factor no des whose fitness scores exceed 0.4. F or each retained no de, we further select the top 10 asso ciated parameter configurations according to the same fitness criterion. 5 Exp erimen t Setup Baselines. W e comp ared F actorEngine (FE) with sev eral represen tativ e base- line methods: (1) GPlearn [15], which p erforms sym b olic factor discov ery via genetic programming; (2) T raditional time-series forecasting neural mo dels, such as Ligh tGBM (LGBM), LSTM and T ransformer, which capture temp oral depen- dencies in the market data; (3) Sp ecialized financial model, TRA [11], which fo cuses on in tegrating multiple trading strategies and mo deling non-i.i.d. mar- k et patterns; (4) Agent-based alpha factor mining metho ds, including AlphaA- gen t [16] and RD-Agen t-Quan t [9] (RD-Agent); (5) hand-crafted factor baseline: Alpha-158. F or FE, we ran t w o experiments: one starting with manual factors (FE-alpha) and another with financial reports (FE-rep ort). W e ac knowledge that mo dern LLMs are trained on data extending beyond our test perio d. How ev er, this limitation is shared b y all agent-based metho ds, and we ensured fair compar- ison b y using Gemini-2.5-Pro 6 as the bac kb one mo del across all agen t baselines. Hyp erparameters. W e conducted tw o exp eriments with differen t budgets, in whic h each framew ork ev olved 200 and 400 iterations respectively , with one factor generated p er iteration. All generated factors w ere filtered according to eac h framework’s criteria. 
For backtesting, we merged the generated factors with the Alpha-158 set to train an LGBM model and implemented a strategy that selects the top-50 assets and holds them for 5 days. For FE, the number of islands was set to 2, with migration performed every 7 iterations. Under the two budgets, the framework was initialized with 5 and 10 alpha factors (or reports), respectively. α, β, γ in Eqs. (2) and (4) were all set to 1.

Datasets. All methods were run on the full-market dataset and evaluated in the CSI300 and CSI500 markets. The dataset, collected from Qlib, was divided into training (2008-01-01 to 2014-12-31), validation (2015-01-01 to 2016-12-31), and testing (2017-01-01 to 2024-12-31) periods. The raw data used to calculate alpha factors consist solely of OHLCV features. To prevent potential leakage in the knowledge-infused bootstrapping module, we used only financial research reports published before 2017 for factor extraction and code bootstrapping, ensuring that no report content overlaps with the test period.

^6 https://ai.google.dev/gemini-api/docs/models

Table 1: Predictive and portfolio performance of FE and baseline methods in the CSI300 and CSI500 markets (both 200- and 400-iteration settings). Bold denotes the best result within each block, and underlining indicates the second best; "-1" and "-2" denote 200 and 400 iterations, respectively.
CSI300:
Methods        IC      ICIR    RIC     RICIR   AR      |MDD|   IR      SR
LGBM           0.0040  0.0326  0.0078  0.0587  0.0129  39.18%  0.1706  0.1006
LSTM           0.0053  0.0313  0.0129  0.0704  0.0486  34.09%  0.4810  0.3241
Transformer   -0.0012 -0.0066 -0.0078 -0.0388  0.0342  29.48%  0.3027  0.1490
TRA            0.0256  0.1559  0.0302  0.1964  0.0674  16.02%  0.6881  0.3747
Alpha158       0.0299  0.2008  0.0331  0.2164  0.0840  17.49%  0.7440  0.4196
GPLearn        0.0292  0.1971  0.0321  0.2120  0.0814  15.99%  0.7337  0.4152
RD-Agent-1     0.0255  0.1667  0.0294  0.1881  0.0507  23.71%  0.4770  0.2627
AlphaAgent-1   0.0282  0.1978  0.0313  0.2142  0.0673  17.00%  0.6346  0.3499
FE-alpha-1     0.0319  0.2178  0.0346  0.2308  0.0888  16.91%  0.7886  0.4526
FE-report-1    0.0333  0.2325  0.0360  0.2459  0.1017  15.89%  0.8959  0.5147
RD-Agent-2     0.0269  0.1833  0.0300  0.1978  0.0917  15.23%  0.8113  0.4647
AlphaAgent-2   0.0314  0.2089  0.0346  0.2252  0.0755  16.23%  0.6779  0.3868
FE-alpha-2     0.0315  0.2211  0.0344  0.2360  0.0943  15.07%  0.8241  0.4762
FE-report-2    0.0474  0.3185  0.0475  0.3146  0.1899  12.61%  1.6001  1.0093

CSI500:
Methods        IC      ICIR    RIC     RICIR   AR      |MDD|   IR      SR
LGBM           0.0057  0.0531  0.0108  0.0913 -0.0514  48.22% -0.3317 -0.3051
LSTM           0.0054  0.0306  0.0138  0.0726 -0.0104  36.05% -0.0080 -0.1156
Transformer    0.0012  0.0076 -0.0024 -0.0130 -0.0221  53.04% -0.0542 -0.1598
TRA            0.0341  0.3184  0.0295  0.2717  0.0320  31.62%  0.2877  0.0072
Alpha158       0.0403  0.3100  0.0416  0.3172  0.0197  25.17%  0.2152  0.0089
GPLearn        0.0409  0.3190  0.0427  0.3113  0.0272  22.79%  0.2751  0.0451
RD-Agent-1     0.0385  0.2887  0.0398  0.2963 -0.0033  30.55%  0.0404 -0.0915
AlphaAgent-1   0.0400  0.3076  0.0407  0.3135  0.0102  24.40%  0.1431 -0.0319
FE-alpha-1     0.0413  0.3112  0.0430  0.3241  0.0346  21.44%  0.3315  0.0793
FE-report-1    0.0420  0.3244  0.0429  0.3289  0.0458  25.44%  0.4030  0.1222
RD-Agent-2     0.0402  0.3070  0.0411  0.3130  0.0189  24.15%  0.2092  0.0052
AlphaAgent-2   0.0385  0.2870  0.0396  0.2906  0.0235  25.36%  0.2437  0.0266
FE-alpha-2     0.0417  0.3183  0.0434  0.3293  0.0399  23.84%  0.3770  0.1064
FE-report-2    0.0536  0.4140  0.0487  0.3744  0.0836  21.51%  0.6719  0.2945

[Fig. 3 shows three panels: cumulative excess return curves (2017–2025) for the CSI300 (left) and CSI500 (middle) markets, comparing Benchmark, Alpha158, AlphaAgent-2, RD-Agent-2, FE-factor-1/-2, and FE-report-1/-2; and an MDS scatter (right) annotated with RD-Agent RoG=0.482, Keep_Ratio=3.9%; AlphaAgent RoG=0.454, Keep_Ratio=21.9%; FE-alpha RoG=0.532, Keep_Ratio=57.1%; reference circle r=0.5.]

Fig. 3: (Left) Cumulative excess return comparison in the CSI300 market. (Middle) Cumulative excess return comparison in the CSI500 market. (Right) Visualization of the factor correlation structure of three agent-based methods based on MDS.

Metrics. We evaluated methods using a comprehensive set of metrics. For predictive performance, we reported the Information Coefficient (IC), Information Coefficient Information Ratio (ICIR), Rank IC, and Rank ICIR. For portfolio performance, we evaluated Annualized Return (AR), Information Ratio (IR), Maximum Drawdown (MDD), and Sharpe Ratio (SR). We calculated these metrics on the excess return series, computed as the difference between the portfolio return and the benchmark return.

6 Results Analysis

6.1 Main Result

Tab. 1 presents the experimental results of FactorEngine (FE) and baseline methods in the CSI300 and CSI500 markets. Traditional methods like LGBM, LSTM, and Transformer lack effective feature modeling, leading to poor performance in both predictive accuracy and portfolio performance. Specifically, in the CSI500 market, they yielded negative excess returns. The TRA method used artificial factors for deep learning modeling, integrating their feature representations, which significantly enhanced the predictive capability of the RNN network. The 50 factors generated by GPlearn show that, in the CSI300 market, mixing these 50 factors with Alpha158 does not further improve the correlation between the features and future returns.
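The daily cross-sectional IC and Rank IC used throughout this comparison can be sketched on synthetic data as follows; the scores and labels below are randomly generated for illustration and are not experimental data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days, n_assets = 60, 50
dates = np.repeat(pd.date_range("2017-01-02", periods=n_days), n_assets)
scores = rng.normal(size=n_days * n_assets)
# Toy next-period returns, weakly driven by the factor scores
labels = 0.1 * scores + rng.normal(size=n_days * n_assets)
df = pd.DataFrame({"datetime": dates, "score": scores, "label": labels})

# IC: mean daily cross-sectional Pearson correlation between scores and returns;
# Rank IC replaces Pearson with Spearman rank correlation
daily_ic = df.groupby("datetime")[["score", "label"]].apply(
    lambda g: g["score"].corr(g["label"]))
daily_ric = df.groupby("datetime")[["score", "label"]].apply(
    lambda g: g["score"].corr(g["label"], method="spearman"))
ic, icir = daily_ic.mean(), daily_ic.mean() / daily_ic.std()
ric, ricir = daily_ric.mean(), daily_ric.mean() / daily_ric.std()
```

ICIR and RICIR follow Eqs. (6) and (7) in the appendix: the mean of the daily series divided by its standard deviation.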
However, in the CSI500 market, GPlearn's factors provide significant returns. Comparing the agent-based frameworks, both FE-alpha and FE-report clearly outperformed the other baseline methods in both experimental setups. In the CSI300 market, FE-report achieved the highest IC of 0.0474 and an excess annual return of 18.99%, while in the CSI500 market it reached 0.0536 and 8.36%, respectively. In our experiments, although the three agent-based methods showed improvements as the number of iterations increased, AlphaAgent and RD-Agent performed noticeably worse than the Alpha158 factors in the CSI300 market with fewer iterations. However, as the number of iterations increased, the performance gap narrowed, and they eventually surpassed Alpha158. In contrast, both FE setups consistently showed better predictive and portfolio performance than the other methods at different iteration rounds. This indicates that the program-level factor evolution framework (FE) is effective at discovering factors that better capture market characteristics. Furthermore, when the amount of financial report data increased, the performance of the factors evolved from financial reports improved significantly: IC increased from 0.0333 to 0.0474, AR improved from 0.1017 to 0.1899, and MDD decreased from 15.89% to 12.61%. This demonstrates that FE can effectively extract financial knowledge from financial reports for modeling and continuously evolve factors during iterations. Across the two FE experimental setups, we found that the factors evolved from reports consistently outperformed those evolved from alpha seeds, regardless of the iteration rounds. We believe that the financial knowledge embedded in financial reports helped the agent better analyze market characteristics, leading to the generation of more stable and powerful factors. Fig.
3 shows the cumulative excess return curves of the three agent-based factor mining frameworks in the CSI300 and CSI500 markets from 2017 to 2024. The factors evolved using FE outperformed Alpha158, AlphaAgent, and RD-Agent in both the 200-iteration and 400-iteration experiments.

6.2 Factor Diversity Analysis

To evaluate the non-redundancy of generated factors, we used Multidimensional Scaling (MDS) to project the correlation matrix of generated factors into a 2D space. Specifically, we constructed a dissimilarity matrix with entries 1 − |ρ|, such that larger Euclidean distances correspond to weaker absolute correlations. As shown in the right subfigure of Fig. 3, after filtering out low-quality factors with an IC below a threshold of 0.015, FE-alpha retained 36 effective factors, corresponding to a 57.1% keep ratio and significantly surpassing the yield of AlphaAgent and RD-Agent. Beyond mere quantity, the spatial topology further highlights factor diversity. FE-alpha exhibited a clear "circular dispersion" pattern, with most factors distributed near the periphery, suggesting stronger mutual independence and reduced redundancy.

[Fig. 4 shows yearly IC and Rank IC curves (2017–2024) for Alpha158, AlphaAgent, RD-Agent, FE-factor, and FE-report, and mean IC/Rank IC versus T+N lag length (2–30) for the three agent-based methods.]

Fig. 4: Yearly IC and Rank IC comparisons in the CSI300 (Left) and CSI500 (Middle) markets. Mean IC and Rank IC between the top 10% factors and future returns at T+N on the CSI300 market across three experimental settings (Right).
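The diversity analysis can be reproduced in outline as follows. We use randomly generated factor series as a stand-in for the mined factors, and compute the Radius of Gyration as the root-mean-square distance of the embedded points from their centroid (one common definition; the text does not spell out the exact formula used):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# Hypothetical daily values of 8 factors observed over 250 trading days
factors = rng.normal(size=(250, 8))

corr = np.corrcoef(factors, rowvar=False)   # 8 x 8 factor correlation matrix
dissimilarity = 1.0 - np.abs(corr)          # weaker |corr| -> larger distance
np.fill_diagonal(dissimilarity, 0.0)

# Embed the precomputed dissimilarities into 2D
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)

# Radius of Gyration: RMS distance of embedded points from their centroid,
# so a larger RoG indicates a more dispersed (less redundant) factor set
center = coords.mean(axis=0)
rog = float(np.sqrt(((coords - center) ** 2).sum(axis=1).mean()))
```

With real factors, `factors` would hold the aligned daily factor values and a keep ratio would first filter columns by IC before embedding.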
This observation is consistent with the Radius of Gyration (RoG) metric, on which FE-alpha achieved the largest value, indicating the highest overall dispersion in the embedded space. Collectively, these results suggest that FE-alpha produced a richer and less redundant factor set, offering more complementary alpha signals for downstream multi-factor models.

6.3 Alpha Decay Analysis

In Fig. 4, the first two subplots show the yearly IC variation trends on the test data of the CSI300 market after 400 iterations for the baseline and the three agent-based frameworks. All factors exhibited some degree of decay over time, but throughout the entire period FE-report consistently maintained a high IC and Rank IC. Notably, it stopped decaying in 2021 and even showed an improvement. The FE-alpha experimental group experienced a more gradual decay than the other factors, with no significant fluctuations, resulting in superior cumulative returns and portfolio performance. Although the IC of other methods was slightly higher in some years, their sharp declines in other years made it difficult to offset the losses caused by this decay. The third subplot of Fig. 4 presents the changes in IC and Rank IC of the top 10% of evolved factors in the factor pools after 400 rounds of experiments, using return signals of different lag lengths as labels on the test data, for the three agent-based methods. We observed that the correlation of all three experimental groups increased smoothly as the lag window grew. Moreover, the average correlation between the FE-evolved factors and the return signals was higher than that of the other two baselines.

6.4 Token Efficiency and Executability

As shown in Tab. 2, we compared the execution performance of the three agent-based frameworks running for 200 iterations. Our FE framework has an overhead comparable to AlphaAgent but uses fewer resources than RD-Agent.
FE significantly surpasses the others in operational efficiency, thanks to the use of the Polars framework for computation acceleration and the parallelization of factor evolution and calculation. In contrast, RD-Agent's generation of numerous DL-based factors reduced its efficiency. Additionally, FE evolves progressively within the code space, requiring fewer API calls for debugging, whereas the other frameworks rely heavily on API calls to convert expression-based factors into code. Moreover, FE achieves the lowest rate of unexecutable factors.

Table 2: Comparison of the runtime performance. "Debug" indicates the ratio of API calls used for code debugging.

Methods       Search Space   Run Space   Cost($)   Time(h)   Executable Ratio   Debug
RD-Agent      symbolic       code        16.91     48.0      96%                68%
AlphaAgent    symbolic       code        11.61     9.7       93%                51%
FactorEngine  code           code        12.01     0.5       99%                32%

[Fig. 5 shows, on the left, the average metric over iterations 0–40 for bay_avg (max 0.384) and no_bay_avg (max 0.254), and on the right, a comparison over Rank IC, IR, ARR, Rank ICIR, MDD, and SR of RD-Agent, AlphaAgent, and FE-factor with the Gemini-2.5-flash-lite and GPT-4o backbones.]

Fig. 5: Left: Effect of Bayesian micro-search. Bayesian parameter search (bay_avg) yields higher final performance and a faster improvement trajectory than the variant without Bayesian tuning (no_bay_avg). Right: Comparison of three methods evolved using the GPT-4o and Gemini-2.5-flash-lite models as backbone agents.

6.5 Ablation Study

Bayesian Micro-Search vs. No Bayesian Search. We ablated the micro-level parameter optimization by comparing two variants under the same macro-level evolution budget (40 iterations): w/ Bayes, which applied Bayesian search to tune parameters before evaluation, and w/o Bayes, which used fixed/default parameters. The left of Fig.
5 shows that Bayesian search improves both final quality and search speed. Concretely, the fitness of the best evolved program after 40 iterations was substantially higher with Bayesian search (about 0.38 vs. 0.25). Moreover, the improvement trajectory was consistently steeper: Bayesian tuning provides a stronger and less noisy fitness signal, allowing FE to identify and promote promising program logic earlier and accelerating the discovery of high-performing factors rather than only refining performance at the end.

Backbone Ablation. We conducted experiments with the three agent-based frameworks using Gemini-2.5-Flash-Lite^7 and GPT-4o^8 for backbone ablation, testing them on the CSI300 market data. As shown in the right of Fig. 5, all frameworks evolved for 200 iterations. Since GPT-4o generally exhibits stronger reasoning capabilities, the alpha factors it generated showed stronger predictive and portfolio performance. Among the six setups, FE-factor with GPT-4o showed the best results in Rank IC, Rank ICIR, AR, IR, and SR.

^7 https://ai.google.dev/gemini-api/docs/models
^8 https://openai.com/index/gpt-4o-system-card/

Table 3: Experimental comparison of FE-alpha under different prompt settings, island configurations, and numbers of initial factors.

config                  RIC     RICIR   AR      IR      |MDD|
6alpha,1island,CoE      0.0325  0.2165  0.0728  0.6737  0.1678
6alpha,1island,top-k    0.0319  0.2125  0.0696  0.6428  0.1626
6alpha,2island,CoE      0.0346  0.2308  0.0888  0.7886  0.1691
6alpha,2island,top-k    0.0332  0.2189  0.0775  0.7085  0.1614
10alpha,1island,CoE     0.0344  0.2358  0.0782  0.7079  0.1787
10alpha,1island,top-k   0.0341  0.2266  0.0761  0.6944  0.1673
10alpha,2island,CoE     0.0344  0.2360  0.0943  0.8241  0.1557
10alpha,2island,top-k   0.0353  0.2408  0.0839  0.7648  0.1708

Configuration Ablation.
During the FE evolution process, the use of a multi-island setup and prompts with Chain-of-Experience (CoE) information feedback enhanced factor evolution performance. We initiated the evolution with artificial factors and conducted ablation experiments to analyze the gains produced by these two configurations in factor mining. Tab. 3 presents the results of the ablation experiments. Across different numbers of initial factors, experiments with the 2-island configuration, whether prompted with CoE or top-K factors, consistently produced factors with higher Rank IC than the 1-island configuration, and also effectively improved AR and IR. Similarly, factors evolved with CoE prompts generally exhibited higher Rank IC and RICIR, while also improving the portfolio performance of the evolved factors.

7 Conclusion

We presented FactorEngine, a program-level alpha factor mining framework for discovering executable and auditable factors while keeping the overall pipeline computationally tractable. FE departs from prior symbolic expression search by representing factors as Turing-complete programs and improving effectiveness, efficiency, and diversity. FE further introduces a knowledge-infused bootstrapping module transforming financial reports into executable programs via a closed-loop multi-agent extraction–verification–generation pipeline, together with the CoE mechanism that supports trajectory-aware refinement and learning from failures.

More broadly, FE is a gradient-free optimization framework for discrete, structured search spaces: the key optimization signal is produced by the evolution machinery (experience-guided exploration and Bayesian micro-search) rather than by alpha-specific assumptions, making the approach applicable to other black-box discrete optimization problems with expensive execution-based evaluation.
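The macro/micro separation behind this gradient-free loop can be illustrated schematically. The sketch below is a toy stand-in: the "program logics" are quadratic objectives, the micro-level search is plain random search (FE uses Bayesian optimization via Optuna), and all names and values are hypothetical:

```python
import random

def evaluate(logic, params):
    """Stand-in for execution-based fitness; in FE this would be a factor backtest."""
    a, b = logic                           # 'logic' picks a toy quadratic family
    return a - a * (params["x"] - b) ** 2  # peaks at x = b with height a

def micro_search(logic, space, n_trials=50, seed=0):
    """Micro-level parameter tuning under fixed logic (random search here,
    standing in for FE's Bayesian micro-search over declared ranges)."""
    rng = random.Random(seed)
    best_p, best_f = None, float("-inf")
    for _ in range(n_trials):
        params = {"x": rng.uniform(*space["x"])}
        f = evaluate(logic, params)
        if f > best_f:
            best_p, best_f = params, f
    return best_p, best_f

# Macro level: each candidate logic revision gets its own parameter search,
# and only the tuned fitness is used to compare logics against each other
candidates = [(1.0, 0.3), (2.0, 0.7)]
tuned = {c: micro_search(c, {"x": (0.0, 1.0)}) for c in candidates}
best_logic = max(candidates, key=lambda c: tuned[c][1])
```

The point of the separation is that a logic revision is never judged on untuned parameters, which is what makes the macro-level fitness signal less noisy.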
Future work includes extending to richer data modalities, improving robustness under distribution shift and transaction costs, enabling the LLM to actively interrogate market data, and better characterizing diversity and generalization in experience-guided program evolution.

References

1. Brown, T.B., et al.: Language models are few-shot learners. NeurIPS (2020)
2. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning (1995)
3. Duan, Y., Wang, L., Zhang, Q., Li, J.: FactorVAE: A probabilistic dynamic factor model based on variational autoencoder for predicting cross-sectional stock returns. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 4468–4476 (2022)
4. Fama, E.F., French, K.R.: The cross-section of expected stock returns. The Journal of Finance 47(2), 427–465 (1992)
5. Fan, X., et al.: Modeling the momentum and mean reversion of stock prices via multiscale representation learning. KDD (2022)
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. (1997)
8. Hou, K., Xue, C., Zhang, L.: Replicating anomalies. The Review of Financial Studies 33(5), 2019–2133 (2020). https://doi.org/10.1093/rfs/hhy131
9. Li, Y., Xu, Y., Xiao, Y., Xu, M., Wang, X., Liu, W., Bian, J.: R&D-Agent-Quant: A multi-agent framework for data-centric factors and model joint optimization. arXiv preprint arXiv:2505.15155 (2025)
10. Li, Z., Song, R., Sun, C., Xu, W., Yu, Z., Wen, J.R.: Can large language models mine interpretable financial factors more effectively? A neural-symbolic factor mining agent model. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 3891–3902 (2024)
11.
Lin, H., Zhou, D., Liu, W., Bian, J.: Learning multiple stock trading patterns with temporal routing adaptor and optimal transport. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. pp. 1017–1026 (2021)
12. Novikov, A., Vũ, N., Eisenberger, M., et al.: AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131 (2025)
13. Shi, H., Song, W., Zhang, X., Shi, J., Luo, C., Ao, X., Arian, H., Seco, L.A.: AlphaForge: A framework to mine and dynamically combine formulaic alpha factors. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 12524–12532 (2025)
14. Shi, Y., Duan, Y., Li, J.: Navigating the alpha jungle: An LLM-powered MCTS framework for formulaic factor mining. arXiv preprint arXiv:2505.11122 (2025)
15. Stephens, T.: gplearn: Genetic programming in Python. https://github.com/trevorstephens/gplearn (2016)
16. Tang, Z., Chen, Z., Yang, J., et al.: AlphaAgent: LLM-driven alpha mining with regularized exploration to counteract alpha decay. In: Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (KDD). pp. 2813–2822 (2025)
17. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems. vol. 30 (2017)
18. Wei, J., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)
19. Xu, W., Liu, W., Wang, L., Xia, Y., Bian, J., Yin, J., Liu, T.Y.: HIST: A graph-based framework for stock trend forecasting via mining concept-oriented shared information. arXiv preprint arXiv:2110.13716 (2021)
20. Xu, W., Liu, W., Xu, C., Bian, J., Yin, J., Liu, T.Y.: REST: Relational event-driven stock trend forecasting. In: Proceedings of the Web Conference 2021. pp. 1–10 (2021)
21.
Yu, S., Xue, H., Ao, X., et al.: Generating synergistic formulaic alpha collections via reinforcement learning. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 5476–5486 (2023)
22. Zhang, T., Li, Y., Jin, Y., Li, J.: AutoAlpha: An efficient hierarchical evolutionary algorithm for mining alpha factors in quantitative investment. arXiv preprint arXiv:2002.08245 (2020)
23. Zhang, X., Li, P., Zhu, J., Tang, J.: Temporal routing adaptor for deep time series forecasting. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 2447–2457 (2022)

A Experimental Details

A.1 Implementation Settings

Hardware Setup. All experiments were conducted on a server equipped with 56 CPU cores, providing a total of 56 parallel threads.

A.2 Dataset

The market data used in our experiments were generated using the Qlib framework^9. All methods were trained and evaluated in the same manner. To avoid potential data leakage, we adopt a clean data-splitting strategy for AlphaAgent and RD-Agent-Quant during factor mining. Specifically, the training data are further split into 2008-01-01 – 2012-12-31, 2013-01-01 – 2013-12-31, and 2014-01-01 – 2014-12-31 for training, validation, and backtesting during the mining stages. After all factors are generated, we revert to the original train/validation/test split to train the multi-factor models and backtest, ensuring a fair and leakage-free evaluation. For AlphaAgent and RD-Agent-Quant, which require backtesting during the factor mining process, the training data were further split into train/validation/test subsets for signal backtesting. In contrast, our proposed method relies solely on single-factor metrics, including IC, ICIR, RIC, and RICIR, and therefore does not require additional data splitting.
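The two-level split described above can be written down as plain date ranges. The dictionary layout below is illustrative (Qlib's actual segment configuration may differ), and the check simply verifies that every mining-time window ends before the held-out validation period begins:

```python
# Final split used to train the multi-factor models and backtest
final_segments = {
    "train": ("2008-01-01", "2014-12-31"),
    "valid": ("2015-01-01", "2016-12-31"),
    "test":  ("2017-01-01", "2024-12-31"),
}

# Mining-time split for AlphaAgent / RD-Agent-Quant: the original training
# window is re-split so mining-time backtests never observe data after 2014-12-31
mining_segments = {
    "train":    ("2008-01-01", "2012-12-31"),
    "valid":    ("2013-01-01", "2013-12-31"),
    "backtest": ("2014-01-01", "2014-12-31"),
}

def leak_free(mining, final):
    """ISO date strings compare lexicographically, so string '<' is date '<'."""
    valid_start = final["valid"][0]
    return all(end < valid_start for _, end in mining.values())

ok = leak_free(mining_segments, final_segments)  # True for the splits above
```

A window such as ("2008-01-01", "2015-06-30") would fail the check, since it overlaps the held-out validation period.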
To construct a more robust factor pool, all generated factors from different methods were combined with the Alpha158 factor set widely used in financial research. The definitions of these 158 factors can be found in the Qlib repository^10. In the main experiments, we use two predefined factor subsets for the 200-iteration and 400-iteration experiments:

^9 https://github.com/microsoft/qlib
^10 https://github.com/microsoft/qlib/blob/main/qlib/contrib/data/loader.py

– 5-factor set: {corr5, resi5, klen, klow, vstd5}.
– 10-factor set: {corr5, resi10, roc60, rsqr5, cord5, std5, klen, klow, vstd5, wvma5}.

Further Analysis. It is worth noting that, in Fig. 3 of the paper, from 2017 to 2021, due to the overall market characteristics of A-shares, cross-sectional factors struggled to generate profits, and the overall cumulative excess return in the CSI300 market was negative. Only after 2021, when the market characteristics shifted, did these factors start to generate profits. In the CSI500 market, multiple significant backtest results appeared within the backtest period, all of which were similarly affected by market shocks. This contrasts with the findings in the AlphaAgent and RD-Agent-Quant experimental reports.

A.3 Evaluation Metrics

We adopt both predictive and strategy-level metrics to evaluate performance.

Information Coefficient (IC). IC measures the cross-sectional correlation between predicted scores and realized returns, and is widely used in quantitative finance.

Information Coefficient Information Ratio (ICIR). ICIR evaluates the temporal stability of IC and is defined as:

ICIR = mean(IC) / std(IC).    (6)

Rank Information Coefficient (RIC). RIC refers to the Spearman rank correlation between predicted and realized return rankings.

Rank Information Coefficient Information Ratio (RICIR).
RICIR evaluates the stability of RIC over time:

RICIR = mean(RIC) / std(RIC).    (7)

Annual Return (AR). AR reflects the compound geometric growth rate of the portfolio:

AR = ( ∏_{t=1}^{T} (1 + r_t) )^{252/T} − 1,    (8)

where r_t denotes the daily return and T is the total number of trading days.

Annual Excess Return (AER). AER reflects the compound annual growth rate of the portfolio relative to a benchmark:

AER = ( (P_T / P_0) / (B_T / B_0) )^{252/T} − 1,    (9)

where P_t and B_t denote the portfolio value and the benchmark value at time t, respectively, and T is the total number of trading days.

Maximum Drawdown (MDD). MDD measures the maximum loss from peak to trough during the evaluation period:

MDD = max_{t∈[1,T]} ( ( max_{s∈[1,t]} P_s − P_t ) / max_{s∈[1,t]} P_s ),    (10)

where P_t denotes the portfolio value at time t.

Relative Maximum Drawdown (RMDD). RMDD measures the maximum drawdown of the strategy relative to a benchmark. We first define the relative net value as:

V^rel_t = P_t / B_t,    (11)

and compute the maximum drawdown on V^rel_t:

RMDD = max_{t∈[1,T]} ( ( max_{s∈[1,t]} V^rel_s − V^rel_t ) / max_{s∈[1,t]} V^rel_s ).    (12)

Sharpe Ratio (SR). The Sharpe Ratio evaluates risk-adjusted returns by normalizing excess returns with their volatility. It is defined as:

SR = E[r_t − r_f] / sqrt( Var(r_t − r_f) ),    (13)

where r_t denotes the portfolio return at period t and r_f is the risk-free rate over the same period. In our experiments, following common practice in empirical backtesting, we set r_f = 0 when computing SR on daily returns. We report the annualized Sharpe Ratio, computed as:

SR_ann = sqrt(252) · mean(r_t − r_f) / std(r_t − r_f).    (14)

A.4 Trading Strategy

During backtesting, we explicitly account for the market's daily price-limit (limit-up/limit-down) rules and impose corresponding constraints on trade execution.
The trading strategy is defined as follows:

– At the close of trading day t, the model generates a ranking score for each stock in the pool based on predicted returns.
– We adopt a rolling update strategy with a fixed 5-day holding period to balance signal freshness and turnover costs. Specifically, the total capital is managed as five overlapping sub-portfolios. On each trading day t+1, we liquidate only the sub-portfolio that has reached its 5-day maturity and reinvest the released cash into the top 50 stocks currently ranked by the model.
– The selected top 50 stocks within each newly constructed tranche are weighted equally.
– We employ a realistic cost model aligned with the Chinese A-share market. This includes a bilateral commission rate of 1.5 × 10⁻⁴ (0.00015) charged on both buy and sell orders, and a unilateral stamp duty of 5 × 10⁻⁴ (0.0005) charged only on sell orders.
– To account for execution uncertainty and market impact, we incorporate a proportional slippage of 8 × 10⁻⁴ (0.0008) on all trades.
– We adhere to strict realistic constraints: the minimum trading unit is 1 lot (100 shares). To ensure liquidity, we impose a volume limit preventing the strategy from exceeding 10% of any stock's daily trading volume. The initial capital is set to 100,000,000 CNY to stabilize portfolio construction and minimize the impact of rounding errors on small positions.

A.5 Baselines

We compare our method with the following baselines:

– GPLearn: A symbolic regression method based on genetic programming.
– Transformer: A multi-head self-attention model that captures long-range dependencies in time-series data.
– LSTM: A recurrent neural network with memory cells and gating mechanisms for modeling long-term dependencies.
– TRA: A Transformer-based model incorporating a dynamic temporal routing mechanism to adaptively capture diverse market patterns.
– LightGBM: A gradient boosting decision tree (GBDT) framework that builds an ensemble of trees in a stage-wise manner, optimized with histogram-based split finding and leaf-wise growth to achieve high efficiency and strong performance on tabular features.

A.6 Prompt Design

Evolution Module Idea Generation. The agent behavior is constrained by an explicit system prompt. We show the template format below.

System Prompt

Listing 1.1: System prompt format used for idea generation in the Evolution Module.

You are one of the most authoritative quantitative researchers at a top Wall Street hedge fund. I need your expertise to design and implement new factors or models to enhance investment returns. You will receive information about # Original Program and # Current Program, and # Program Evolution History contains samples of your historical evolution tests and results. Your goal is to improve the factor or raise a new one and maximize the specified evaluation metrics while avoiding any look-ahead bias or data leakage based on your knowledge.

Metrics description: ...
Task description:
1. Implement optimizations: ...
2. Propose & implement alternative factors: ...
3. Compliance & rigor: ...
Hard requirements:
1. ...
2. ...
...

The system prompt is further conditioned on a chain of experience containing historical evolution trajectories, as shown below.

Chain of Experience

Listing 1.2: Chain-of-experience template provided to the agent, including programs, metrics, evolution history, and response constraints.
# Original Program
```{language}
{original_program}
```

# Original Information
- Metrics: {original_metrics}
- Fitness: {original_fitness_score}
- Feature coordinates: {original_feature_coords}

# Program Evolution History: your historical continuous evolutionary attempt paths, including historical ideas, and changes in metrics compared to the initial program and every previous step.
{evolution_history}

# Current Program: a program evolved from a previous attempt along an evolution path
```{language}
{current_program}
```

# Current Program Information
- Metrics: {current_metrics}
- Fitness: {current_fitness_score}
- Feature coordinates: {current_feature_coords}
- Focus areas against previous step: {current_improvement_areas_against_previous}
- Focus areas against #Original Program: {current_improvement_areas_against_origin}
{current_artifacts}

# Task
Suggest improvements to the program that will improve its
- Metrics following 'Metrics description' and
- Fitness.
The system maintains diversity across these dimensions: {feature_dimensions}
Different solutions with similar fitness but different features are valuable.

# Response requirement
You MUST use the format shown below with the exact SEARCH/REPLACE diff of code changes:
###Analyse: Analyze the domain insights you have gained from the comparison between # Current Program Information and # Original Program Information, and the lessons learned from previous evolution attempts.
###IDEA: Your idea about how to improve the performance according to your domain insights. Learn from attempts that lead to high scores and avoid attempts that have already degraded. You should focus on both the factor function and the parameters.
###Code changes:
<<<<<<< SEARCH
# Original code that needs to be replaced (must match exactly)
=======
# New replacement code
>>>>>>> REPLACE
You can suggest multiple changes. Each SEARCH section must exactly match code in the '# Current Program'.
IMPORTANT: Do not rewrite the entire program - focus on targeted improvements.
###Parameters: Define the search ranges for Bayesian optimization (Optuna). For each parameter, specify the type and range.
Format for numeric parameters:
{{"param_name": {{"type": "float", "low": min_value, "high": max_value}}, "param_name2": {{"type": "int", "low": min_int, "high": max_int}}}}
Example:
{{"w_v": {{"type": "float", "low": 0.3, "high": 0.9}}, "N_r": {{"type": "int", "low": 5, "high": 30}}}}

A.7 Illustrative Example: From Report-Inspired Seed to Evolved Programmatic Factor

To concretely demonstrate how FactorEngine (FE) operationalizes program-level evolution, we provide an end-to-end example of a factor program. We show (i) an initial executable factor generated by the bootstrapping module from a financial research report in Listing 1.3, and (ii) an evolved factor produced after 40 evolution iterations in Listing 1.4, under the same I/O contract. This example highlights how FE refines factor logic (e.g., turnover-aware proxies, rank-based normalization, and temporal smoothing) while maintaining executability and auditability throughout the evolution process.

Seed Factor Program (Bootstrapped from Research Report)

Listing 1.3: Report-inspired initial factor (seed program). An executable programmatic factor produced by the bootstrapping module from a financial research report, serving as a seed in the initial factor pool.
import pandas as pd
import polars as pl


def factor(pricing_data: pl.DataFrame, parameters):
    w1 = parameters.get("w1", 0.25)
    w2 = parameters.get("w2", 0.25)
    w3 = parameters.get("w3", 0.50)
    EPSILON = parameters.get("epsilon", 1e-9)

    if isinstance(pricing_data, pd.DataFrame):
        # Handle pandas DataFrame input
        df_pl = pl.from_pandas(pricing_data.reset_index()).rename({
            '$close': 'close', '$open': 'open', '$high': 'high',
            '$low': 'low', '$volume': 'volume'
        })
    else:
        df_pl = pricing_data.rename({
            '$close': 'close', '$open': 'open', '$high': 'high',
            '$low': 'low', '$volume': 'volume'
        })

    df_pl = df_pl.select(
        ['instrument', 'datetime', 'open', 'high', 'low', 'close', 'volume']
    ).with_columns([
        pl.col("datetime").cast(pl.Date),
        pl.col(['open', 'high', 'low', 'close', 'volume']).cast(pl.Float64)
    ])

    # Intraday sub-signals scaled by volume and the daily high-low range
    daily_range_expr = pl.col('high') - pl.col('low')
    sf1_expr = -pl.col('volume') * (pl.col('close') - pl.col('low')) / (daily_range_expr + EPSILON)
    sf2_expr = -pl.col('volume') * (pl.col('high') - pl.col('open')) / (daily_range_expr + EPSILON)
    sf3_expr = pl.col('volume') * (pl.min_horizontal('open', 'close') - pl.col('low')) / (daily_range_expr + EPSILON)

    # Cross-sectional z-score of each sub-signal per trading day, then a weighted blend
    df_factor = df_pl.with_columns(
        z1=(sf1_expr - sf1_expr.mean().over('datetime')) / (sf1_expr.std(ddof=0).over('datetime') + EPSILON),
        z2=(sf2_expr - sf2_expr.mean().over('datetime')) / (sf2_expr.std(ddof=0).over('datetime') + EPSILON),
        z3=(sf3_expr - sf3_expr.mean().over('datetime')) / (sf3_expr.std(ddof=0).over('datetime') + EPSILON),
    ).with_columns(
        (w1 * pl.col('z1') + w2 * pl.col('z2') + w3 * pl.col('z3')).alias('Factor')
    )

    df_tf = df_factor.select(['instrument', 'datetime', 'Factor'])
    df_tf = df_tf.filter(
        pl.col('Factor').is_not_nan() & pl.col('Factor').is_finite()
    )
    df_tf = df_tf.with_columns(pl.col("datetime").cast(pl.Date).alias("datetime"))
    return df_tf

Evolved Factor Program (After 40 Evolution Iterations)

Listing 1.4: Evolved factor after 40 iterations. A representative factor program evolved from the seed via FE's macro–micro co-evolution, incorporating refined signal construction (e.g., turnover-based proxies, rank normalization, and exponential smoothing) while preserving the same executable interface.

def trend_factor(pricing_data: pl.DataFrame, parameters):
    w3 = parameters.get("w3", 0.50)
    w1 = parameters.get("w1", (1.0 - w3) / 2.0)
    w2 = parameters.get("w2", (1.0 - w3) / 2.0)
    smoothing_window = parameters.get("smoothing_window", 5)
    EPSILON = parameters.get("epsilon", 1e-9)

    if isinstance(pricing_data, pd.DataFrame):
        # Handle pandas DataFrame input
        df_pl = pl.from_pandas(pricing_data.reset_index()).rename({
            '$close': 'close', '$open': 'open', '$high': 'high',
            '$low': 'low', '$volume': 'volume'})
    else:
        df_pl = pricing_data.rename({
            '$close': 'close', '$open': 'open', '$high': 'high',
            '$low': 'low', '$volume': 'volume'})

    df_pl = df_pl.select(
        ['instrument', 'datetime', 'open', 'high', 'low', 'close', 'volume']
    ).with_columns([
        pl.col("datetime").cast(pl.Date),
        pl.col(['open', 'high', 'low', 'close', 'volume']).cast(pl.Float64)
    ]).sort(['instrument', 'datetime'])

    daily_range_expr = pl.col('high') - pl.col('low')  # daily high-low range, as in the seed
    turnover_expr = pl.col('volume') * pl.col('close')  # Use turnover for capital-weighted signal

    sf1_expr = -turnover_expr * (pl.col('close') - (pl.col('high') + pl.col('low')) / 2.0) / (daily_range_expr + EPSILON)
    sf2_expr = -turnover_expr * (pl.col('high') - pl.col('open')) / (daily_range_expr + EPSILON)
    sf3_expr = turnover_expr * (pl.min_horizontal('open', 'close') - pl.col('low')) / (daily_range_expr + EPSILON)

    # Map each sub-signal to a centered cross-sectional rank in (-0.5, 0.5)
    rank_norm_expr = lambda expr: (expr.rank(method='average').over('datetime') / (expr.count().over('datetime') + 1)) - 0.5

    df_factor = df_pl.with_columns(
        # Calculate daily raw combined factor using rank-normalized components
        raw_combined_factor=(
            w1 * rank_norm_expr(sf1_expr)
            + w2 * rank_norm_expr(sf2_expr)
            + w3 * rank_norm_expr(sf3_expr))
    ).with_columns(
        # Smooth each instrument's factor over time with an exponential moving average
        smoothed_factor=pl.col('raw_combined_factor').ewm_mean(
            span=smoothing_window,
            min_periods=max(1, smoothing_window // 2)).over('instrument')
    ).with_columns(
        # Re-standardize cross-sectionally per trading day
        Factor=(
            (pl.col('smoothed_factor') - pl.col('smoothed_factor').mean().over('datetime'))
            / (pl.col('smoothed_factor').std(ddof=0).over('datetime') + EPSILON))
    )

    df_tf = df_factor.select(['instrument', 'datetime', 'Factor'])
    df_tf = df_tf.filter(pl.col('Factor').is_not_nan() & pl.col('Factor').is_finite())
    df_tf = df_tf.with_columns(pl.col("datetime").cast(pl.Date).alias("datetime"))
    return df_tf
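As a concrete illustration of the ###Parameters step in the prompt template above, the following is a minimal, self-contained sketch (not FE's actual implementation) of drawing a candidate parameter set from a spec written in that JSON format. A uniform random draw stands in for Optuna's Bayesian sampler, and the spec reuses the hypothetical w_v / N_r example from the prompt.

```python
import random


def sample_parameters(spec, rng=None):
    """Draw one candidate parameter set from a ###Parameters-style spec.

    Each spec entry gives a type ("float" or "int") and a [low, high]
    search range. A uniform random draw stands in here for Optuna's
    Bayesian suggestions (trial.suggest_float / trial.suggest_int).
    """
    rng = rng or random.Random(0)
    params = {}
    for name, cfg in spec.items():
        if cfg["type"] == "float":
            params[name] = rng.uniform(cfg["low"], cfg["high"])
        elif cfg["type"] == "int":
            params[name] = rng.randint(cfg["low"], cfg["high"])
        else:
            raise ValueError(f"unsupported parameter type: {cfg['type']}")
    return params


# Spec taken from the prompt's example; the sampled dict has the same shape
# as the `parameters` argument of the factor programs above.
spec = {
    "w_v": {"type": "float", "low": 0.3, "high": 0.9},
    "N_r": {"type": "int", "low": 5, "high": 30},
}
candidate = sample_parameters(spec)
```

With Optuna installed, the same spec could instead be consumed inside an objective function, replacing the random draws with trial.suggest_float(name, low, high) and trial.suggest_int(name, low, high).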
