An Auditable AI Agent Loop for Empirical Economics: A Case Study in Forecast Combination


Authors: Minchul Shin

Minchul Shin
Federal Reserve Bank of Philadelphia*

First version: March 17, 2026. This version: March 23, 2026.

Abstract

AI coding agents make empirical specification search fast and cheap, but they also widen hidden researcher degrees of freedom. Building on an open-source agent-loop architecture, this paper adapts that framework to an empirical economics workflow and adds a post-search holdout evaluation. In a forecast-combination illustration, multiple independent agent runs outperform standard benchmarks in the original rolling evaluation, but not all continue to do so on a post-search holdout. Logged search and holdout evaluation together make adaptive specification search more transparent and help distinguish robust improvements from sample-specific discoveries.

Keywords: Agent loops, Autoresearch, Specification search, Researcher degrees of freedom, Responsible AI, Forecast combination

JEL Codes: C53, C52, C18

* The views expressed herein are those of the author and do not necessarily reflect the views of the Federal Reserve Bank of Philadelphia or the Federal Reserve System. Email: visiblehand@gmail.com.

1 Introduction

Empirical researchers have always explored alternative specifications. AI makes that process cheap and fast. An AI coding agent can rewrite an estimation script, run a fixed evaluator, observe the score, and try again hundreds of times in one session. Once that loop is in place, the practical boundary between coding assistance and undisclosed specification search becomes thin. For a field long concerned with researcher degrees of freedom (Leamer, 1983; White, 2000; Gelman and Loken, 2013; Miguel, 2021), the central question is therefore not whether AI agents can improve an empirical score, but whether their search can be made transparent and auditable.
This paper studies one response. Building on Karpathy's (2026) open-source autoresearch, I adapt a simple agentic coding loop to a research protocol for empirical economics. The protocol has four ingredients: a written instruction contract, an immutable evaluator, a single editable script, and a complete experiment log. I then add an explicit post-search holdout evaluation. Together, they do not eliminate adaptive search, but they make it inspectable, bounded, and easier to disclose, while also making validation beyond the search sample explicit. The point is therefore not a new agent architecture, but a way to use an existing one in empirical work with clearer external validation.

The empirical illustration uses the forecast-combination problem of Diebold and Shin (2019). An AI coding agent is asked to find a look-ahead-free real-time tuning rule for peLASSO while leaving the evaluator fixed. On the original rolling evaluation, three independent runs improve on the simple average and on the paper's ex post peLASSO benchmark. A post-search holdout for 2017Q1–2025Q4 then provides the key validation: two discovered methods continue to outperform standard benchmarks, while one does not.

That split result is the main message. The experiment logs show that nominal hyperparameter tuning readily drifts into broader method discovery. Some departures survive external evaluation; others do not. The value of the framework is therefore twofold: the underlying loop records how the agent searched, and the added holdout evaluation allows the resulting departures to be judged rather than hidden. For empirical economics, that is the relevant promise of agentic search: not automation without discretion, but automation with an audit trail and explicit post-search holdout evaluation.

Recent work on AI for science spans a wide range of autonomy.
On one end, systems such as AlphaEvolve (Novikov et al., 2025) and DS-STAR (Nam et al., 2025) use large language models to iteratively rewrite code and refine analyses against automated evaluators. On the other end, systems such as The AI Scientist (Lu et al., 2024) and AI co-scientist (Gottweis et al., 2025) pursue broader research automation across idea generation, experiment design, analysis, and communication. In economics, Korinek (2025) surveys AI agents for research, while Dawid et al. (2025) develop broader agentic workflows. Our contribution is narrower: we study evaluator-locked local search within a fixed empirical workflow, where the researcher sets the question, data, evaluator, editable surface, and search budget. The goal is not full automation, but an auditable protocol for standard empirical practice.

The remainder of the paper describes the protocol (Section 2), presents the empirical illustration (Section 3), and concludes (Section 4).

2 An auditable agent-loop protocol

Consider a researcher who starts with a baseline empirical model and wants to explore nearby alternatives using an AI coding agent.¹ We formalize the search as follows. Let C denote a written instruction contract that specifies the research objective, admissible modifications, and search budget. Let T(C) denote the class of admissible editable scripts under C. Let the available data be partitioned as D = (D_S, D_H), where D_S is the search sample and D_H is a post-search holdout reserved for external validation. Let S(τ; D_S) denote the scalar score returned by an immutable evaluator when script τ ∈ T(C) is run on search sample D_S. Both T(C) and S are fixed ex ante by the researcher as part of the search design.
¹ By AI coding agent we mean a general-purpose coding assistant such as Claude Code or OpenAI Codex that plans, writes code, and executes tool calls within a development environment.

Algorithm 1: Auditable agent-loop protocol

Require: instruction contract C, evaluator S(·; D_S), search sample D_S, initial script τ_0, budget K
1: Run evaluator on τ_0; record score; set τ_best ← τ_0
2: for k = 1, ..., K do
3:   Agent proposes candidate script τ_k, informed by log L_{k−1}
4:   Run evaluator on τ_k; record score (or crash)
5:   If score improves, set τ_best ← τ_k; otherwise revert to τ_best
6:   Append (k, τ_k, s_k, outcome) to log L_k
7: end for

In practice, the contract fixes the evaluator, editable surface, and no-look-ahead constraint, but it need not impose a machine-checkable restriction that candidate scripts belong to a particular model family; the effective admissible set T(C) may therefore be broader than the researcher's semantic intent. A complete experiment log L records every attempted modification and its outcome. These four elements (contract, evaluator, editable script, and log) constitute the protocol.

When lower scores are preferred, the search problem is

    τ* ∈ argmin_{τ ∈ T(C)} S(τ; D_S).

In practice, the agent sequentially generates and evaluates a finite sequence of candidate scripts {τ_0, τ_1, ..., τ_K}, each informed by the outcomes of previous attempts, and returns

    τ̂_K ∈ argmin_{τ ∈ {τ_0, τ_1, ..., τ_K}} S(τ; D_S).

This distinction matters. τ* is the best admissible script in the full class T(C); τ̂_K is the best script found by a particular bounded search. The audit log records the outcome of every experiment as an ordered sequence

    L_K = ((k, τ_k, s_k, d_k))_{k=0}^{K},

where s_k = S(τ_k; D_S) when the run succeeds, s_k = +∞ when it crashes, and d_k ∈ {keep, discard, crash} records the outcome status. Algorithm 1 summarizes the protocol.
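Algorithm 1 can be sketched as a short driver. The evaluator and proposer callbacks below are hypothetical stand-ins for the immutable evaluator and the coding agent; they are not part of the paper's implementation.

```python
import math

def agent_loop(evaluate, propose, tau0, budget):
    """Greedy keep/revert search in the spirit of Algorithm 1.

    evaluate(script) -> float score (lower is better); may raise on crash.
    propose(log)     -> candidate script, informed by the full history.
    Returns (best_script, best_score, log), where log mirrors L_K:
    a list of (k, script, score, outcome) tuples.
    """
    best_script, best_score = tau0, evaluate(tau0)
    log = [(0, tau0, best_score, "keep")]
    for k in range(1, budget + 1):
        candidate = propose(log)            # agent edit, informed by L_{k-1}
        try:
            score = evaluate(candidate)
        except Exception:
            log.append((k, candidate, math.inf, "crash"))  # s_k = +inf
            continue
        if score < best_score:              # keep the improvement...
            best_script, best_score = candidate, score
            log.append((k, candidate, score, "keep"))
        else:                               # ...otherwise revert to tau_best
            log.append((k, candidate, score, "discard"))
    return best_script, best_score, log
```

Because the log is append-only and records crashes as well as keeps and discards, every attempted specification remains reportable after the search ends.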
The protocol does not impose a model of how the sequence of candidates is generated; that depends on the underlying large language model and the coding environment. Because the agent observes what it has already tried, the search is path dependent and stochastic: independent runs from the same starting point will generally follow different paths and may reach different solutions.

The search is adaptive but bounded. The agent searches freely within the editable script, but the evaluator, search sample, and scoring rule are fixed ex ante. This separation between immutable search design and editable implementation is analogous to a pre-analysis plan, except that the audit trail is generated automatically and covers every specification the agent considered. Implementation details, including the instruction contract, are in the Appendix.

Once the search is complete, the researcher may evaluate the selected script τ̂_K on the holdout D_H. Let S_H(τ; D_H) denote a scoring function computed on the holdout. The realized holdout score S_H(τ̂_K; D_H) provides an external check on whether improvements discovered during adaptive search on D_S generalize beyond the search sample. Because D_H need not even exist during the search, the holdout evaluation is a researcher action, not part of the agentic loop.

3 Empirical illustration: real-time tuning of peLASSO

3.1 Setup

We illustrate the protocol using the real-time tuning problem in Diebold and Shin's (2019) peLASSO forecast combination. peLASSO first uses LASSO to select a subset of forecasters from the Survey of Professional Forecasters and then shrinks the surviving weights toward equality using an egalitarian penalty. In the two-step implementation, two regularization parameters must be chosen at each forecast origin using only information available at that time.
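As a schematic of the two-step idea only (not the paper's implementation, which uses glmnet in R), the sketch below selects forecasters with a proximal-gradient LASSO and then pulls the surviving weights toward equality. The blend in step 2 is a simplified stand-in for the egalitarian penalty, and both tuning parameters are taken as given, which is exactly the real-time choice the agent is asked to automate.

```python
import numpy as np

def pelasso_two_step(X, y, lam1, lam2, n_iter=2000, lr=0.01):
    """Schematic two-step peLASSO: LASSO selection, then egalitarian shrink.

    X: (n, K) forecaster predictions; y: (n,) realizations.
    lam1 governs step-1 selection; lam2 governs step-2 shrinkage toward
    equal weights over the selected set (a simplified stand-in).
    """
    n, K = X.shape
    w = np.full(K, 1.0 / K)                 # start from the simple average
    for _ in range(n_iter):                 # proximal gradient for LASSO
        w = w - lr * X.T @ (X @ w - y) / n
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam1, 0.0)
    A = np.flatnonzero(w)                   # surviving forecasters
    if A.size == 0:
        return np.full(K, 1.0 / K)          # fall back to the simple average
    gamma = lam2 / (1.0 + lam2)             # shrinkage intensity in [0, 1)
    out = np.zeros(K)
    out[A] = (1.0 - gamma) * w[A] + gamma / A.size  # pull survivors toward 1/|A|
    return out
```

With lam2 large the survivors are averaged equally, recovering the peLASSO(LASSO, Avg) limit; with lam2 = 0 the step-1 weights pass through unshrunk.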
Diebold and Shin (2019) evaluate the procedure with ex post optimal parameters and do not provide a data-based rule for real-time selection. Instead, they propose other methods, such as subset averaging, inspired by the ex post analysis of peLASSO, that perform well on their evaluation sample. The present application takes the tuning problem of peLASSO, which was left unresolved, as the object of agentic search.

The instruction contract asks the agent to find a look-ahead-free lambda selection rule for peLASSO, editing only a single forecasting script (train.R). The immutable evaluator (prepare.R) implements rolling-origin RMSE evaluation under the original Diebold and Shin (2019) design: at each date t, only data through t − 1 is available, and the score is the RMSE over the full evaluation period. We conduct three independent runs, each starting from the simple average as the initial script, with an approximate budget of K = 200 experiments. We also evaluate the discovered methods on a post-search holdout (2017Q1–2025Q4, 36 quarters) that was withheld from the evaluator during the agentic search. As additional benchmarks, we include two methods from Diebold and Shin (2019): best ≤ N_max-average (N_max = 6, their Table 4) and best (≤ N_max, ≤ W_max)-average (their Table 5), both of which exhaustively search over forecaster subsets at each forecast origin.

3.2 Search-sample and holdout results

Table 1 reports RMSE for the search sample (1999Q3–2016Q4) and the holdout (2017Q1–2025Q4). On the search sample, all three agent runs beat both the simple average and the peLASSO ex post benchmark: Run 1 achieves a relative RMSE of 0.858, Run 3 reaches 0.808, and Run 2 achieves 0.510. The two benchmark methods from the original paper also outperform the simple average, with relative RMSEs of 0.954 (best ≤ N_max-average) and 0.916 (best (≤ N_max, ≤ W_max)-average).
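The rolling-origin, no-lookahead design that produces these scores can be sketched as follows. The names here are hypothetical stand-ins (make_forecast plays the role of the agent-editable rule in train.R); the constraint that only data through t − 1 is visible is enforced by slicing strictly before each origin.

```python
import numpy as np

def rolling_origin_rmse(X, y, make_forecast, burn_in):
    """Rolling-origin evaluation: at each origin t, the forecasting rule
    sees only rows 0..t-1 of (X, y), mirroring the immutable evaluator.

    X: (T, K) panel of individual forecasts; y: (T,) realizations.
    make_forecast(X_past, y_past, x_now) -> scalar combined forecast.
    Returns RMSE over origins t = burn_in, ..., T-1.
    """
    errors = []
    for t in range(burn_in, len(y)):
        f_t = make_forecast(X[:t], y[:t], X[t])  # info through t-1 only
        errors.append(y[t] - f_t)
    return float(np.sqrt(np.mean(np.square(errors))))

def simple_average(X_past, y_past, x_now):
    """Baseline combination: equally weighted average of current forecasts."""
    return float(np.mean(x_now))
```

Holding this harness fixed while the agent edits only the forecasting rule is what keeps the search evaluator-locked.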
On the holdout, Run 2 remains the strongest method, with a relative RMSE of 0.811 (0.739 excluding the COVID quarters of 2020). Run 1 also holds up, at 0.945 (0.861 excluding COVID). The two benchmark methods from the original paper generalize modestly, each at 0.974. In contrast, Run 3 (1.089) performs worse than the simple average on the holdout, and peLASSO ex post shows essentially no improvement (0.995). The holdout therefore shows that agentic search can uncover methods that generalize, but it also reveals that some discovered improvements are sample specific.

Table 1: Search-sample and holdout evaluation

                      Search sample      Holdout                Holdout excl. COVID
                      RMSE   Relative    RMSE          Relative  RMSE          Relative
Simple average        1.504  1.000       2.979         1.000     1.120         1.000
peLASSO ex post       1.400  0.930       2.964 [.191]  0.995     1.075 [.136]  0.960
Best ≤6-avg           1.435  0.954       2.901 [.240]  0.974     1.043 [.347]  0.932
Best (≤6, ≤40)-avg    1.378  0.916       2.901 [.127]  0.974     1.142 [.602]  1.020
Run 1                 1.291  0.858       2.816 [.177]  0.945     0.965 [.234]  0.861
Run 2                 0.767  0.510       2.417 [.127]  0.811     0.827 [.108]  0.739
Run 3                 1.216  0.808       3.244 [.852]  1.089     1.305 [.755]  1.165

Notes: RMSE of forecast errors for euro area real GDP growth (year on year). Search sample: 1999Q3–2016Q4 (65 evaluation periods). Holdout: 2017Q1–2025Q4 (36 quarters). Holdout excl. COVID: holdout dropping 2020Q1–Q4 (32 quarters). Relative RMSE is computed against the simple average on each respective sample. Best ≤6-avg and best (≤6, ≤40)-avg are from Diebold and Shin (2019), Tables 4 and 5. Bracketed values in the holdout columns are one-sided p-values from the Diebold–Mariano test that the method has superior predictive accuracy relative to the simple average, using the EWC fixed-b approximation (Shin and Schor, 2026); these have intrinsically low power given the small holdout sample and serial correlation in forecast errors, and should be read as descriptive. peLASSO ex post re-optimizes λ on each evaluation window separately; see Appendix E for the fixed-λ variant. Full results including intermediate methods are in Appendix E.

3.3 What the log reveals

The instruction contract asks the agent to tune regularization parameters for peLASSO. The experiment logs,² however, reveal that two of the three runs drifted beyond that task: Run 1 toward stability selection and performance weighting, and Run 2 toward a ranking-based method with bias correction. Run 3 stayed closest to the original task, using adaptive LASSO with forward cross-validation for lambda selection, though it added an egalitarian elastic net blend.

The holdout makes these departures evaluable. Run 2's bias correction, which adjusts for recent forecast bias, turns out to generalize and accounts for most of its holdout gain. Run 3's egalitarian elastic net blend, which drove much of its search-sample improvement, degrades holdout performance. The protocol does not prevent the agent from leaving the intended design space; it ensures that every departure is recorded, so that a holdout or other external evidence can later distinguish structural insight from sample adaptation.

² Appendix C reproduces an excerpt from the experiment log of Run 3, and Appendix D gives a detailed description of the method discovered by each run.

4 Conclusion

Agent loops can widen researcher degrees of freedom, but they need not do so invisibly. This paper adapts a minimal agent-loop architecture to an auditable protocol for agentic specification search and adds a post-search holdout evaluation.
In the forecast-combination illustration based on Diebold and Shin (2019), three independent runs improve on the search sample, but a genuine holdout shows that only two of the three discovered methods generalize. The contribution of the framework is not to eliminate adaptive search, but to make it visible, bounded, and evaluable. Because the protocol disciplines the search process rather than the subject matter, it applies wherever automated specification search is used. Appendix F sketches how the same architecture maps to regression specification search, structural VAR identification, and other standard empirical tasks.

Several limitations deserve emphasis. First, a language model may encode background knowledge that overlaps with the holdout period, so residual look-ahead concerns remain. Second, the model's internal trade-offs between score improvement and contract boundaries are opaque; it is often unclear how those objectives are balanced. Third, the effective search class may recombine existing ideas more readily than it generates novel ones. Fourth, the current search over T(C) is greedy: propose one candidate, evaluate it, then keep or discard it. More structured strategies, such as the evolutionary population-based search in AlphaEvolve (Novikov et al., 2025) or tree search in Aygün et al. (2025), could explore the space more efficiently. Fifth, the protocol is a practical guardrail against data-adaptive search, but it is not a substitute for formal post-selection inference or other corrections for p-hacking.

References

Eser Aygün, Anastasiya Belyaeva, Gheorghe Comanici, Marc Coram, Hao Cui, Jake Garrison, Anton Renee Johnston Kast, Cory Y. McLean, Peter Norgaard, Zahra Shamsi, David Smalling, James Thompson, Subhashini Venugopalan, Brian P. Williams, Chujun He, Sarah Martinson, Martyna Plomecka, Lai Wei, Yuchen Zhou, Qian-Ze Zhu, Matthew Abraham, Erica Brand, Anna Bulanova, Jeffrey A. Cardille, Chris Co, Scott Ellsworth, Grace Joseph, Malcolm Kane, Ryan Krueger, Johan Kartiwa, Dan Liebling, Jan-Matthis Lueckmann, Paul Raccuglia, Xuefei (Julie) Wang, Katherine Chou, James Manyika, Yossi Matias, John C. Platt, Lizzie Dorfman, Shibl Mourad, and Michael P. Brenner. An AI system to help scientists write expert-level empirical software. arXiv preprint arXiv:2509.06503, 2025.

Herbert Dawid, Philipp Harting, Hankui Wang, Zhongli Wang, and Jiachen Yi. Agentic workflows for economic research: Design and implementation. arXiv preprint arXiv:2504.09736, 2025. doi: 10.48550/arXiv.2504.09736.

Francis X. Diebold and Minchul Shin. Machine learning for regularized survey forecast combination: Partially-egalitarian LASSO and its derivatives. International Journal of Forecasting, 35(4):1679–1691, 2019. doi: 10.1016/j.ijforecast.2018.09.006.

Andrew Gelman and Eric Loken. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Unpublished manuscript, 2013.

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R. D. Costa, José R. Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, and Vivek Natarajan. Towards an AI co-scientist. arXiv preprint arXiv:2502.18864, 2025. doi: 10.48550/arXiv.2502.18864.

Andrej Karpathy. autoresearch. GitHub repository, 2026. https://github.com/karpathy/autoresearch.

Anton Korinek. AI agents for economic research. NBER Working Paper 34202, National Bureau of Economic Research, 2025.

Edward E. Leamer. Let's take the con out of econometrics. American Economic Review, 73(1):31–43, 1983.

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024. doi: 10.48550/arXiv.2408.06292.

Edward Miguel. Evidence on research transparency in economics. Journal of Economic Perspectives, 35(3):193–214, 2021. doi: 10.1257/jep.35.3.193.

Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Raj Sinha, Jinwoo Shin, and Tomas Pfister. DS-STAR: Data science agent for solving diverse tasks across heterogeneous formats and open-ended queries. arXiv preprint arXiv:2509.21825, 2025. doi: 10.48550/arXiv.2509.21825.

Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025. doi: 10.48550/arXiv.2506.13131.

Minchul Shin and Nathan Schor. ForeComp: An R package for comparing predictive accuracy using fixed-smoothing asymptotics. arXiv preprint arXiv:2603.07458, 2026. doi: 10.48550/arXiv.2603.07458.

Halbert White. A reality check for data snooping. Econometrica, 68(5):1097–1126, 2000. doi: 10.1111/1468-0262.00152.
Online Appendix
An Auditable AI Agent Loop for Empirical Economics
Not for publication

A Implementation details

A.1 Four-file architecture

The protocol requires only four files in a shared directory: C ≡ program.md, τ ≡ train.R, S ≡ prepare.R, L_K ≡ results.tsv. program.md is a plain-text instruction contract that defines the research objective, the admissible modifications, the rules, and the experiment loop of Algorithm 1. train.R is the editable script containing one function the agent rewrites. prepare.R is the immutable evaluator: it loads the search sample, sources train.R, and returns the scalar score; the holdout data is entirely outside the agent's workspace. results.tsv is a tab-separated audit log in which each row records a commit identifier, the score, the outcome status, and a verbal description of the strategy attempted. In the working directory, the git commit history preserves the actual code of every retained τ_k.³

The four-file architecture is adapted from Karpathy's (2026) autoresearch repository.⁴ The contribution here is not the loop itself but its translation into a protocol for empirical economics: the immutable evaluator fixes what counts as success, the editable script defines the search surface, the instruction contract specifies the researcher's permissions, and the audit log turns every attempted specification into a reportable object.

³ The replication archive includes the TSV logs and final code but not the full git object store. Full recoverability from commit history requires access to the original working directory.
⁴ See https://github.com/karpathy/autoresearch. The architecture is representative of a broader class of agent loops in which a language model acts as the optimizer within a fixed evaluation harness.
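The results.tsv bookkeeping can be sketched with two small helpers. This is a simplified stand-in, not the paper's tooling: it omits the header row and uses the paper's four-column layout (commit, rmse, status, description), with crashes recorded as 0.0000 but excluded from comparisons via the status field.

```python
import csv
import io
import math

FIELDS = ["commit", "rmse", "status", "description"]

def append_row(log_file, commit, rmse, status, description):
    """Append one experiment to the tab-separated audit log (one row per attempt)."""
    value = f"{rmse:.4f}" if math.isfinite(rmse) else "0.0000"  # crash convention
    csv.writer(log_file, delimiter="\t").writerow([commit, value, status, description])

def best_kept_rmse(log_file):
    """Session recovery: the current best is the lowest rmse with status 'keep'."""
    rows = csv.DictReader(log_file, fieldnames=FIELDS, delimiter="\t")
    kept = [float(r["rmse"]) for r in rows if r["status"] == "keep"]
    return min(kept) if kept else None

def remaining_budget(log_file, budget):
    """Soft budget stop: count logged rows and refuse new sessions at the cap."""
    return budget - sum(1 for _ in log_file)
```

The row count doubles as the budget state across restarted sessions, matching the soft stop rule described in Appendix A.2.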
A.2 Triggering the agent loop

In the simplest case, the researcher opens a general-purpose AI coding assistant, such as Claude Code or Codex, and prompts it to implement program.md. The agent reads the instruction contract, which describes the experiment loop of Algorithm 1 in plain language, and begins the edit–evaluate–log–retain/revert cycle autonomously. No custom software is needed beyond the four files and a standard coding assistant.

In our application, we used an outer shell script (run.sh) to restart agent sessions when the model reached its context window limit. Each restart reads results.tsv and the git history to recover state, and the session prompt injects the remaining experiment budget: "You have a budget of K remaining more experiments. Stop after that." This is how the budget K is communicated and enforced in practice. The outer script also provides an approximate stop by counting rows in results.tsv and refusing to launch a new session once the count reaches K; a session already in progress may overshoot. The full run.sh is available in the replication archive.

A.3 Isolating the holdout

The post-search holdout D_H must remain entirely outside the agent's workspace during the search. In our application, the holdout period (2017Q1–2025Q4) was constructed after the agentic search was complete, so no holdout data existed in the working directory while the agent was running.
More generally, researchers should ensure that holdout data is not accessible to the agent during the search. Because current coding agents can access the internet and execute arbitrary code, an agent could in principle download or construct holdout-period data on its own. Disabling internet access during the search, restricting the agent's file system permissions, or simply withholding the holdout data from the working directory are practical safeguards. Without such boundaries, the transparency of the holdout evaluation cannot be guaranteed.

A.4 Actual run

Each run uses Claude Code with Opus 4.6 at default thinking effort, starting from a fresh agent with the same initial script (simple average) and an approximate budget of K = 200 experiments. The evaluator calls the forecasting function 66 times (t = 5, ..., 70) but scores RMSE over 65 periods (t = 6, ..., 70), matching the burn-in and evaluation design in Diebold and Shin (2019).

Table 2 reports the experiment counts for the three agent runs. The budget is enforced as a soft constraint; the realized counts are 232, 228, and 201 for Runs 1, 2, and 3, respectively. If a candidate script exceeds 30 minutes of runtime, the experiment is terminated and logged as a crash.

Table 2: Experiment counts

        Experiments  Keeps  Discards  Crashes
Run 1   232          42     188       2
Run 2   228          61     164       3
Run 3   201          46     152       3

Notes: "Keep" means the modification improved on the current best RMSE; "discard" means it did not; "crash" means the candidate script exceeded the 30-minute timeout or produced an error. The budget is enforced as a soft constraint; realized counts vary.

B Instruction contract

The following is the verbatim program.md used in the empirical application of Section 3.

# autoresearch | PE Lasso lambda selection

## Goal

Find the best look-ahead-free, data-based lambda selection algorithm for PE Lasso forecast combination.
PE Lasso forecast combination we mean partially egalitary lasso method in the paper /original_paper/DiebodShinEgalitarianLasso.md (both one-step conceptualization and two-step implementation). For both cases, we have two regularization parameters.

IMPORTANT: Focus on two-step implementation of PE Lasso. Do not try to combine with other methods. I want to focus on finding a way to choose these regularization parameters for PE Lasso so that it becomes tuning free method. We could try bunch of different CV methods. Or, other data adaptive method automatically selects these tuning parameters. We understand it is difficult problem finding two regularization parameters from two different stages. But, this is exactly why we want to run this experiments.

The simple average benchmark has RMSE ~1.50. The ex-post oracle (which cheats by using future data) achieves ~1.40. Your job is to close that gap. You decide what strategies to explore. The reference implementation is in `original_program/R/` | read it for ideas.

## Setup

1. Agree on a run tag with the user: propose a tag based on today's date (e.g. `mar11`). The branch `autoresearch/` must not already exist.
2. Create the branch: `git checkout -b autoresearch/`
3. Read the in-scope files:
   - `program.md` | this file (your instructions)
   - `prepare.R` | fixed evaluation harness. Do not modify.
   - `train.R` | the file you modify. Defines `select_lambda(info)`.
4. Check session history: Read `results.tsv` and `git log` to understand what has been tried before. This is your memory across sessions.
5. Initialize results.tsv if it doesn't exist: create it with just the header row.
6. Confirm and go.

## The rules

- One file: You only edit `train.R`. It must define `select_lambda(info)` returning `list(forecast, method)`.
- No lookahead: At time t, only data from 1:(t-1) is available. The `info` object enforces this.
- No new packages: Only glmnet, forecast, and base R.
- Runtime: Each full run (66 evaluation periods) should complete in under 30 minutes.
- Deterministic: Use a fixed seed if randomness is needed.

## What's in `info`

info$X_train      # rolling window training forecasts [max(1,t-w):(t-1), 23]
info$y_train      # corresponding actuals
info$X_history    # full history [1:(t-1), 23]
info$y_history    # full history actuals
info$x_new        # current period forecasts [23]
info$lambda_grid  # 200-point lambda grid, exp(15) to exp(-15)
info$lambda_grid2 # same grid (for two-step methods)
info$t            # current time index
info$w            # window size (20)

All functions from `prepare.R` are in scope: `pelasso_lasso_avg()`, `pelasso_lasso_eridge()`, `pelasso_lasso_elasso()`, `fit_lasso()`, `fit_ridge()`, `fit_elasso()`, `fit_eridge()`, `cv_loo_bandwidth()`, `cv_loo_bandwidth_2step()`, etc.

## Benchmarks

| Method                              | RMSE      | Notes                     |
|-------------------------------------|-----------|---------------------------|
| Simple average                      | ~1.50     | Baseline (no lambda)      |
| Ex-post optimal peLASSO(LASSO, Avg) | ~1.40     | Oracle – uses future data |
| Good data-based method              | 1.42-1.48 | Realistic target range    |

## Output format

The script prints:

---
method:
rmse:
benchmark_rmse:
relative_rmse:

Extract the key metric: `grep "^rmse:" run.log`

## Logging results

Log every experiment to `results.tsv` (tab-separated). Do NOT commit this file.

commit   rmse    status   description
a1b2c3d  1.5012  keep     baseline simple average
b2c3d4e  1.4723  keep     LOO CV for pelasso_lasso_avg
c3d4e5f  1.4891  discard  BIC-based lambda selection
d4e5f6g  0.0000  crash    2D CV OOM on full grid

## Session recovery

You may be starting a fresh session after a previous one was interrupted (context limit, crash, etc.). On every session start:

1. Read `results.tsv` and `git log --oneline` to see what has been tried.
2. Reconcile state: Compare the last `results.tsv` entry to the latest git commit.
- If ‘train.R‘ has uncommitted changes, they are leftover from an interrupted edit. Review and either commit or discard. 3. Identify the current best RMSE from ‘results.tsv‘ (the lowest ‘rmse‘ with status ‘keep‘). 17 4. Continue experimenting from there. ## The experiment loop LOOP (until experiment budget is exhausted or interrupted): 1. Read the current state: ‘train.R‘, ‘results.tsv‘, git log 2. Edit ‘train.R‘ with a new lambda selection strategy 3. ‘git commit -m "descriptive message"‘ 4. Run: ‘Rscript prepare.R > run.log 2>&1‘ 5. Extract: ‘grep "^rmse:" run.log‘ 6. If empty (crash): ‘tail -n 50 run.log‘ to diagnose 7. Log to ‘results.tsv‘ 8. If RMSE improved -> keep the commit 9. If RMSE worse -> ‘git reset --hard HEAD~1‘ 10. Repeat Logging is step 7, BEFORE the keep/revert decision in steps 8-9. This ensures every experiment is recorded even if the session is interrupted right after. Timeout: If a run exceeds 30 minutes, kill it and treat as crash. Crashes: Fix trivial bugs (typos, off-by-one). If the idea is fundamentally broken, log crash and move on. NEVER STOP: Do not pause to ask if you should continue. Run autonomously until your experiment budget runs out or you are manually interrupted. If you run out of ideas, re-read ‘original_program/R/‘ for new angles, try combining previous near-misses, or try more radical approaches. C Exp erimen t log excerpt T able 3 presen ts a selected excerpt from the 201-experiment log of Run 3. The tra jectory illustrates a systematic exploration pattern. The agen t b egins with standard cross-v alidation for peLASSO (exp eriments 1–4), quic kly mo v es to adaptiv e LASSO with ridge-based p enalt y w eigh ts and forw ard cross-v alidation (experiments 25–45), disco vers that multi-horizon forw ard 18 T able 3: Exp erimen t log excerpt: Run 3 Exp. Status RMSE Description 1 k eep 1 . 5043 baseline simple av erage 2 k eep 1 . 
4692 pelasso(lasso,avg) LOO CV B=0 3 crash — pelasso(lasso,avg) LOO CV B=1 – zero v ariance error 4 discard 1 . 9556 p elasso(lasso,eridge) 2D LOO CV coarse 30pt grid 25 k eep 1 . 4455 p e(adalasso,avg) fwdCV exp 0.95 ridge p enalt y weigh ts 45 k eep 1 . 3839 p e(adalasso,avg) rolling window ridge p enalties decay=0.6 86 k eep 1 . 3539 p e(adalasso,avg) h=2 step ahead forw ard CV LOO ridge 113 k eep 1 . 3202 p e(adalasso,avg) h=2 mo del av eraging top 2p ct lam b das 138 k eep 1 . 2533 p e(adalasso,blend) 80p ct PE + 20p ct eridge ensem ble 152 k eep 1 . 2253 p e(adalasso,blend) 70/30 PE + eEN cv.glmnet LOO 201 k eep 1 . 2159 p e(adalasso,blend) mo del avg 1p ct h1(0.05)+h2(0.95) Notes: Selected rows from the 201-exp eriment log of Run 3. Descriptions are the agen t’s verbatim one-line summaries, truncated for display . Exp erimen t 1 is the initial simple av erage baseline. RMSE for crash en tries is undefined; the implemen tation records them as zero in results.tsv , but they are excluded from all comparisons (formally , s k = + ∞ ). F ull logs are in the replication archiv e. CV impro ves regularization parameter selection (experiment 86), in tro duces mo del a v eraging o v er near-optimal lambda v alues (exp eriment 113), and finally arriv es at an ensemble that blends p eLASSO with an egalitarian elastic net comp onen t (exp eriments 138–201). Each phase builds on the structural lessons of the previous one. D Disco v ered metho ds Eac h agent run produced a final metho d and sev eral in termediate v arian ts. The holdout ev al- uation in Section E rep orts results for the final metho d and t w o represen tative intermediates p er run, selected from distinct algorithmic phases. W e describe eac h metho d here. Except where noted, all metho ds op erate within the same rolling windo w framew ork. 
At each forecast origin $t$, let $X_t \in \mathbb{R}^{n_t \times K}$ and $y_t \in \mathbb{R}^{n_t}$ denote the training window of forecaster predictions and GDP realizations, where $n_t = \min(t-1, W)$, $W = 20$, and $K = 23$. Let $x_t = (x_{1,t}, \ldots, x_{K,t})'$ denote the new forecaster predictions for period $t$. Each method produces a combined forecast $\hat{y}_t$; most methods use only $(X_t, y_t, x_t)$, with exceptions noted below.

D.1  Run 1: stability selection with performance weighting

Run 1 (final). Stability selection with performance-weighted aggregation.

1. Fit elastic net ($\alpha = 0.65$) with 5-fold CV on the training window. Define $\hat{\lambda}_{1.5\mathrm{se}}$ as the largest $\lambda$ such that the CV error is within 1.5 standard errors of the minimum.

2. Stability selection. Let $\mathcal{W} = \{ j : |\log \lambda_j - \log \hat{\lambda}_{1.5\mathrm{se}}| < \delta \}$ with $\delta = 0.25$. The selection frequency of forecaster $k$ is
$$\pi_k = |\mathcal{W}|^{-1} \sum_{j \in \mathcal{W}} \mathbf{1}\{\hat{\beta}_k(\lambda_j) \neq 0\}. \qquad (1)$$
The active set is $\mathcal{A} = \{ k : \pi_k > 0.4 \}$.

3. Composite weights. Using the $m = 2$ most recent training observations, compute individual $\mathrm{RMSE}_k$ for each $k \in \mathcal{A}$ and form
$$w_k = \pi_k^2 \cdot \mathrm{RMSE}_k^{-p}, \quad p = 14. \qquad (2)$$
The exponent $p = 14$ was selected by the search from nearby alternatives (powers 2 through 20 were tried).

4. Dominance check. If $w_{(1)} > 5 \cdot w_{(2)}$ (where $w_{(j)}$ denotes the $j$th largest weight), use the single best forecaster.

5. Otherwise, the combined forecast is the weighted $q$th quantile ($q = 0.44$) of $\{x_{k,t}\}_{k \in \mathcal{A}}$ with normalized weights $w_k / \sum_k w_k$, computed by linear interpolation of the weighted empirical CDF.

Run 1a (Phase 2: elastic net with median). PE selection using elastic net ($\alpha = 0.65$) with 5-fold CV and $\hat{\lambda}_{1\mathrm{se}}$. The combined forecast is the median of the selected forecasters:
$$\hat{y}_t = \mathrm{median}\big( \{x_{k,t}\}_{k \in \mathcal{A}(\hat{\lambda}_{1\mathrm{se}})} \big). \qquad (3)$$
No stability selection, no performance weighting.

Run 1b (Phase 1: peLASSO with CV). Standard peLASSO(LASSO, Avg) with $\hat{\lambda}_{1\mathrm{se}}$ from cv.glmnet:
$$\hat{y}_t = |\mathcal{A}(\hat{\lambda}_{1\mathrm{se}})|^{-1} \sum_{k \in \mathcal{A}(\hat{\lambda}_{1\mathrm{se}})} x_{k,t}. \qquad (4)$$
This is the closest variant to the original Diebold and Shin (2019) method, but with data-driven $\lambda$ selection rather than ex post optimization.

D.2  Run 2: ranking, adaptive windows, and bias correction

Run 2 (final). Temporally weighted ranking with adaptive window selection and bias correction. For a given training window $(X, y)$ of size $n$:

1. Temporal weights. Define $\omega_s = \exp\big(-\alpha(n-s)/(n-1)\big) \big/ \sum_{j=1}^{n} \exp\big(-\alpha(n-j)/(n-1)\big)$ with $\alpha = 6$. More recent observations receive higher weight.

2. Weighted MAE ranking. $\mathrm{MAE}_k = n^{-1} \sum_{s=1}^{n} n\omega_s \, |y_s - x_{k,s}|$. Rank forecasters by ascending MAE.

3. LOO CV for subset size. For each candidate $N \in \{3, \ldots, \min(K, 18)\}$ and each left-out observation $i$:
   - Rerank by RMSE on the remaining $n-1$ observations.
   - Select the top $N$ forecasters, weight them by $\tilde{w}_j \propto \mathrm{MAE}_j^{-3}$ (on leave-out data).
   - Compute the weighted average LOO forecast $\hat{y}_i^{\mathrm{LOO}} = \sum_j \tilde{w}_j x_{j,i}$ and the median-based LOO forecast $\hat{y}_i^{\mathrm{pct}} = Q_{0.50}(\{x_{j,i}\}_{j \in \mathrm{top}\,N})$.

   Select $N^* = \arg\min_N \sqrt{n^{-1} \sum_i (\hat{y}_i^{\mathrm{LOO}} - y_i)^2}$. After $N^*$ is selected, the current-period forecast (step 5) uses the full-window weighted MAE ranking, not the LOO reranking or the $\mathrm{MAE}^{-3}$-weighted average.

4. Adaptive window. Evaluate steps 1-3 at the full rolling window and at each sub-window $w' \in \{4, 5, \ldots, 19\}$, keeping whichever $(N^*, w')$ combination yields the lowest LOO RMSE.

5. Median aggregation. The base forecast is $\tilde{y}_t = Q_{0.50}(\{x_{k,t}\}_{k \in \mathrm{top}\,N^*})$.

6. Bias correction. The final forecast applies a bias adjustment based on the most recent leave-one-out error:
$$\hat{y}_t = \tilde{y}_t - \gamma \cdot \hat{e}_n^{\mathrm{LOO,pct}}, \quad \gamma = 0.80, \qquad (5)$$
where $\hat{e}_n^{\mathrm{LOO,pct}} = \hat{y}_n^{\mathrm{pct}} - y_n$ is the percentile-based LOO error from the most recent training observation.
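As an illustration, steps 1-2 of Run 2 (the temporal weights and the weighted MAE ranking) reduce to a few lines. The sketch below is Python with illustrative names; the paper's implementation is in R.

```python
import numpy as np

def temporal_weights(n, alpha=6.0):
    """Step 1: exponential weights omega_s; more recent observations get more weight."""
    s = np.arange(1, n + 1)
    raw = np.exp(-alpha * (n - s) / (n - 1))
    return raw / raw.sum()

def weighted_mae_ranking(X, y, alpha=6.0):
    """Step 2: MAE_k = n^-1 * sum_s n*omega_s*|y_s - x_ks|; rank ascending."""
    n = len(y)
    omega = temporal_weights(n, alpha)
    mae = (np.abs(y[:, None] - X) * (n * omega)[:, None]).mean(axis=0)
    return np.argsort(mae), mae  # forecaster indices, best (lowest MAE) first
```

Because the weights sum to one, the $n^{-1} \cdot n\omega_s$ factors in step 2 simply reproduce $\sum_s \omega_s |y_s - x_{k,s}|$, i.e., a recency-weighted mean absolute error.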
Run 2a (no bias correction). Identical to Run 2 (steps 1-5) but omits the bias correction (5):
$$\hat{y}_t = \tilde{y}_t. \qquad (6)$$
This isolates the contribution of the forecaster ranking and adaptive window mechanism.

Run 2b (weighted ranking, no adaptive window). Steps 1-2 as in Run 2 with a fixed window. Step 3 uses a penalized LOO criterion, $\mathrm{RMSE}_{\mathrm{LOO}}(N) + 0.01 \cdot \log(N+1)$, to select $N^*$. The combined forecast is a weighted average (not median) of the top $N^*$ forecasters:
$$\hat{y}_t = \sum_{j=1}^{N^*} \tilde{w}_j x_{\sigma(j),t}, \quad \tilde{w}_j \propto \mathrm{MAE}_{\sigma(j)}^{-3}, \qquad (7)$$
where $\sigma$ denotes the MAE ranking. No adaptive window, no bias correction.

D.3  Run 3: adaptive LASSO with forward CV and ensemble

Run 3 (final). Adaptive LASSO with multi-horizon forward CV, model averaging, and egalitarian elastic net ensemble.

1. Pilot estimator. Fit elastic net ($\alpha_{\mathrm{pilot}} = 0.1$) with LOO CV on the rolling window to obtain $\hat{\beta}^{\mathrm{ridge}}$.

2. Adaptive LASSO penalties. Define forecaster-specific penalty factors
$$\tilde{w}_k = \frac{(|\hat{\beta}_k^{\mathrm{ridge}}| + \epsilon)^{-1}}{\overline{(|\hat{\beta}^{\mathrm{ridge}}| + \epsilon)^{-1}}}, \quad \epsilon = 5 \times 10^{-3}, \qquad (8)$$
where $\overline{\,\cdot\,}$ denotes the cross-sectional mean. Fit LASSO with penalty $\lambda \cdot \tilde{w}_k |\beta_k|$ on the full history.

3. Multi-horizon forward CV. Lambda is selected by walk-forward CV on the full history preceding the current origin, rows $1, \ldots, t-1$. For fold $f$, fit adaptive LASSO on the first $t_f$ rows and compute forecast errors on the next two rows $t_f + 1$ and $t_f + 2$ (where $t_f + 2 \leq t-1$ by construction). Let $e_{f,s}(\lambda) = \hat{y}_{t_f+s}(\lambda) - y_{t_f+s}$ for $s = 1, 2$, where $\hat{y}_{t_f+s}(\lambda)$ applies the $\lambda$-dependent combination rule to the forecaster predictions at row $t_f + s$. Two candidate aggregation rules are evaluated at each $\lambda$: simple average and blended average ($0.7 \times$ mean $+\, 0.3 \times$ ridge-weighted mean). The CV criterion is
$$\mathrm{CV}(\lambda) = \sum_f w_f \big( 0.05 \cdot e_{f,1}^2(\lambda) + 0.95 \cdot e_{f,2}^2(\lambda) \big), \qquad (9)$$
with fold weights $w_f \propto \gamma^{F-f}$ and $\gamma = 0.75$.

4. Model averaging. Average forecasts over all $\lambda$ within 1% of the best CV score, weighted by inverse RMSE.

5. Egalitarian elastic net. Fit elastic net ($\alpha = 0.5$) via cv.glmnet with LOO CV on the rolling window, selecting $\hat{\lambda}_{\min}$, on the egalitarian transformation $\tilde{y}_s = y_s - K^{-1} \sum_k x_{k,s}$, obtaining $\tilde{\beta}^{\mathrm{eEN}}$, and compute $\hat{y}_t^{\mathrm{eEN}} = \sum_k (\tilde{\beta}_k^{\mathrm{eEN}} + K^{-1}) x_{k,t}$.

6. Final blend.
$$\hat{y}_t = 0.70 \cdot \hat{y}_t^{\mathrm{PE}} + 0.30 \cdot \hat{y}_t^{\mathrm{eEN}}. \qquad (10)$$

Run 3a (Phase 4: single horizon, no blend). Steps 1-2 as in Run 3. Forward CV evaluating only at the next panel row ($t_f + 1$) with $\gamma = 0.75$. No model averaging. Simple average of selected forecasters. No egalitarian elastic net blend:
$$\hat{y}_t = |\mathcal{A}(\hat{\lambda})|^{-1} \sum_{k \in \mathcal{A}(\hat{\lambda})} x_{k,t}. \qquad (11)$$

Run 3b (Phase 5: multi horizon, no blend). Same pilot and adaptive penalty construction as Run 3, with $\gamma = 0.80$ and $\epsilon = 10^{-3}$. Forward CV evaluates only the simple average aggregation rule (no blended alternative). Model averaging over $\lambda$ values within 1% of the best CV score is applied. No egalitarian elastic net blend: $\hat{y}_t = \hat{y}_t^{\mathrm{PE}}$.

E  Post-search holdout evaluation

This section reports the full holdout evaluation summarized in Section 3 of the main text. The holdout period (2017Q1-2025Q4, 36 quarters) was withheld from the evaluator during the agentic search.

E.1  Data and processing

The extended dataset combines ECB Survey of Professional Forecasters (SPF) individual micro data with Eurostat GDP realizations (table namq_10_gdp, chain-linked volumes, seasonally and calendar adjusted). The original sample (1999Q3-2016Q4, 70 observations) is copied verbatim from the original H1_gdp.csv of Diebold and Shin (2019) to guarantee exact RMSE reproduction. The new sample (2017Q1-2025Q4, 36 quarters) is constructed from the SPF extraction.
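The penalty-factor construction in eq. (8) above reduces to a two-line computation: invert the (shifted) absolute pilot coefficients and normalize by their cross-sectional mean. A Python sketch with the paper's default $\epsilon$ (illustrative only; the actual implementation is in R):

```python
import numpy as np

def adaptive_penalty_factors(beta_ridge, eps=5e-3):
    """Eq. (8): w_k = (|b_k| + eps)^-1 / cross-sectional mean of (|b| + eps)^-1."""
    inv = 1.0 / (np.abs(beta_ridge) + eps)
    return inv / inv.mean()
```

Forecasters with small pilot coefficients receive large penalty factors and are shrunk harder in the subsequent LASSO step; the unit-mean normalization keeps the overall penalty level comparable across windows.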
Missing forecaster responses in the new sample are imputed by linear interpolation for interior gaps and by cross-sectional mean for trailing gaps and inactive forecasters. COVID quarters are defined as 2020Q1-2020Q4.

E.2  Full holdout results

Table 4 reports RMSE for all methods, including intermediate variants that isolate individual components. For each run, we include two representative intermediate methods selected from distinct algorithmic phases. The intermediate methods are informative because they isolate the contribution of individual components such as stability selection, bias correction, or ensemble blending.

Table 4: Full holdout evaluation of forecast combination methods

|                                      | Search RMSE | Search Rel. | Holdout RMSE | Holdout Rel. | Excl. COVID RMSE | Excl. COVID Rel. |
|--------------------------------------|-------|-------|--------------|-------|--------------|-------|
| Panel A: Benchmarks and original paper methods | | | | | | |
| Simple average                       | 1.504 | 1.000 | 2.979        | 1.000 | 1.120        | 1.000 |
| peLASSO ex post (original)           | 1.400 | 0.930 | 3.172 [.880] | 1.065 | 1.360 [.885] | 1.214 |
| peLASSO ex post (per window)         | 1.400 | 0.930 | 2.964 [.191] | 0.995 | 1.075 [.136] | 0.960 |
| Best ≤6-avg                          | 1.435 | 0.954 | 2.901 [.240] | 0.974 | 1.043 [.347] | 0.932 |
| Best (≤6, ≤40)-avg                   | 1.378 | 0.916 | 2.901 [.127] | 0.974 | 1.142 [.602] | 1.020 |
| Best individual                      | 1.403 | 0.933 | 2.994 [.716] | 1.005 | 1.164 [.714] | 1.040 |
| Panel B: Run 1 family                | | | | | | |
| Run 1 (final)                        | 1.291 | 0.858 | 2.816 [.177] | 0.945 | 0.965 [.234] | 0.861 |
| Run 1a (ENet, median)                | 1.390 | 0.924 | 2.991 [.742] | 1.004 | 1.158 [.794] | 1.034 |
| Run 1b (peLASSO, CV)                 | 1.451 | 0.964 | 3.059 [.938] | 1.027 | 1.270 [.929] | 1.134 |
| Panel C: Run 2 family                | | | | | | |
| Run 2 (final)                        | 0.767 | 0.510 | 2.417 [.127] | 0.811 | 0.827 [.108] | 0.739 |
| Run 2a (no bias corr.)               | 1.308 | 0.869 | 2.949 [.087] | 0.990 | 1.093 [.290] | 0.976 |
| Run 2b (weighted avg)                | 1.351 | 0.898 | 2.918 [.113] | 0.980 | 1.120 [.499] | 1.000 |
| Panel D: Run 3 family                | | | | | | |
| Run 3 (final)                        | 1.216 | 0.808 | 3.244 [.852] | 1.089 | 1.305 [.755] | 1.165 |
| Run 3a (h=1, no blend)               | 1.309 | 0.870 | 2.994 [.648] | 1.005 | 1.238 [.818] | 1.106 |
| Run 3b (h=2, no blend)               | 1.307 | 0.869 | 3.037 [.865] | 1.020 | 1.276 [.855] | 1.140 |

Notes: RMSE of forecast errors for euro area real GDP growth (year on year). Search sample: 1999Q3-2016Q4 (65 evaluation periods). Holdout: 2017Q1-2025Q4 (36 quarters). Holdout excl. COVID: holdout dropping 2020Q1-Q4 (32 quarters). Relative RMSE is computed against the simple average on each respective sample. Bracketed values in the holdout columns are one-sided p-values from the Diebold-Mariano test that the method has superior predictive accuracy relative to the simple average, using the EWC fixed-b approximation (Shin and Schor, 2026); these have intrinsically low power given the small holdout sample (36 quarters, 32 excluding COVID) and serial correlation in forecast errors, and should be read as descriptive. peLASSO ex post (original) and best individual use parameters selected on the search sample and are evaluated without re-optimization on the holdout. peLASSO ex post (per window) re-optimizes $\lambda$ on each evaluation window separately.

E.3  Role of individual components

The intermediate methods isolate what drives generalization.

Stability selection (Run 1 vs. 1a, 1b). The full Run 1 (holdout relative RMSE 0.945) substantially outperforms both alternatives: Run 1a (1.004), which uses elastic net selection without stability, and Run 1b (1.027), which uses standard peLASSO with CV. The stability selection mechanism and the composite performance weighting are jointly responsible for the holdout gain. Without them, elastic net PE selection is essentially no better than the simple average.

Bias correction (Run 2 vs. 2a, 2b). Removing the bias correction from Run 2 degrades performance from 0.811 to 0.990 (Run 2a) or 0.980 (Run 2b). The bias correction contributes most of Run 2's advantage. However, Run 2a and 2b are still competitive with the simple average, indicating that the temporally weighted ranking and adaptive window mechanism also add modest value. The bias correction adjusts for persistent short-run bias in combined forecast errors, which appears to reflect a structural feature rather than overfitting to the search sample.

Egalitarian elastic net blend (Run 3 vs. 3a, 3b). Run 3's 70/30 blend with the egalitarian elastic net (holdout relative RMSE 1.089) performs worse than the simpler PE adaptive LASSO without the blend (3a at 1.005, 3b at 1.020). The ensemble component that improved search-sample fit introduced additional estimation noise that does not pay off on the holdout.

E.4  Independent dataset robustness

To verify that these results are not an artifact of using the original H1_gdp.csv verbatim for the search-sample period, we also constructed a fully independent dataset where all 106 rows are built from downloaded SPF and Eurostat data, with no verbatim copy from the original file. The independent dataset uses the current (2026) GDP vintage rather than the 2018Q1 vintage in the original study; over the original sample the mean absolute revision is 0.079 pp and the maximum is 0.20 pp.

Table 5 repeats the exercise on this independent dataset. The key rankings are preserved: Run 2 achieves a relative RMSE of 0.808 on the holdout, Run 1 achieves 0.946, and the original paper methods achieve 0.974 and 0.968. Run 3's relative RMSE shifts from 1.089 to 0.962 on the independent dataset, but its p-values remain very large in both cases, so the difference is not meaningful. The independent dataset confirms that the main ranking is preserved across vintages: Run 2 strongest, Run 1 second, and Run 3 not distinguishable from the simple average.

Table 5: Full holdout evaluation: independent dataset

|                                      | Search RMSE | Search Rel. | Holdout RMSE | Holdout Rel. | Excl. COVID RMSE | Excl. COVID Rel. |
|--------------------------------------|-------|-------|--------------|-------|--------------|-------|
| Panel A: Benchmarks and original paper methods | | | | | | |
| Simple average                       | 1.510 | 1.000 | 2.979        | 1.000 | 1.120        | 1.000 |
| peLASSO ex post (original)           | 1.417 | 0.938 | 3.041 [.763] | 1.021 | 1.350 [.881] | 1.206 |
| peLASSO ex post (per window)         | 1.417 | 0.938 | 2.976 [.249] | 0.999 | 1.119 [.188] | 0.999 |
| Best ≤6-avg                          | 1.500 | 0.994 | 2.901 [.239] | 0.974 | 1.043 [.346] | 0.931 |
| Best (≤6, ≤40)-avg                   | 1.298 | 0.860 | 2.884 [.100] | 0.968 | 1.094 [.340] | 0.977 |
| Best individual                      | 1.407 | 0.932 | 2.909 [.245] | 0.977 | 1.109 [.463] | 0.990 |
| Panel B: Run 1 family                | | | | | | |
| Run 1 (final)                        | 1.295 | 0.858 | 2.817 [.177] | 0.946 | 0.965 [.233] | 0.861 |
| Run 1a (ENet, median)                | 1.479 | 0.980 | 2.991 [.730] | 1.004 | 1.157 [.779] | 1.033 |
| Run 1b (peLASSO, CV)                 | 1.455 | 0.964 | 3.051 [.952] | 1.024 | 1.250 [.931] | 1.116 |
| Panel C: Run 2 family                | | | | | | |
| Run 2 (final)                        | 0.834 | 0.552 | 2.408 [.127] | 0.808 | 0.826 [.107] | 0.737 |
| Run 2a (no bias corr.)               | 1.304 | 0.864 | 2.947 [.090] | 0.989 | 1.096 [.298] | 0.978 |
| Run 2b (weighted avg)                | 1.343 | 0.890 | 2.898 [.100] | 0.973 | 1.065 [.054] | 0.950 |
| Panel D: Run 3 family                | | | | | | |
| Run 3 (final)                        | 1.305 | 0.864 | 2.865 [.309] | 0.962 | 1.349 [.728] | 1.205 |
| Run 3a (h=1, no blend)               | 1.490 | 0.987 | 2.960 [.400] | 0.994 | 1.270 [.802] | 1.133 |
| Run 3b (h=2, no blend)               | 1.502 | 0.995 | 2.963 [.414] | 0.995 | 1.276 [.812] | 1.139 |

Notes: Same as Table 4 but using the independently constructed dataset (all rows from downloaded SPF and Eurostat data, no verbatim copy from the original H1_gdp.csv). Search-sample RMSEs differ from Table 4 due to GDP vintage differences and independent imputation. Bracketed values in the holdout columns are one-sided p-values from the Diebold-Mariano test using the EWC fixed-b approximation (Shin and Schor, 2026); see Table 4 notes for details and caveats. peLASSO ex post (original) and best individual use parameters selected on the search sample; see Table 4 notes for details.
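The imputation rule described in E.1 (linear interpolation for interior gaps, cross-sectional means otherwise) can be sketched as follows. This is a schematic reconstruction in Python, not the paper's code; in particular, the treatment of leading gaps here is an assumption.

```python
import numpy as np

def impute_panel(X):
    """X: (T, K) forecaster panel with NaN gaps.
    Interior gaps: linear interpolation along time (per forecaster).
    Trailing gaps and inactive forecasters: cross-sectional mean.
    (Leading gaps also get the cross-sectional mean -- an assumption here.)"""
    X = X.astype(float).copy()
    T, K = X.shape
    row_mean = np.nanmean(X, axis=1)  # cross-sectional mean per period
    for k in range(K):
        col = X[:, k]
        obs = np.flatnonzero(~np.isnan(col))
        if obs.size >= 2:
            interior = np.arange(obs[0], obs[-1] + 1)
            col[interior] = np.interp(interior, obs, col[obs])  # interior gaps
        col[np.isnan(col)] = row_mean[np.isnan(col)]  # remaining edge gaps
        X[:, k] = col
    return X
```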
F  Illustration: beyond forecasting combination

In the empirical application of Diebold and Shin (2019) studied in Section 3, program.md asks the agent to find a look-ahead-free, data-based lambda selection rule for peLASSO forecast combination. train.R defines a function that takes admissible historical data and returns a forecast. prepare.R implements a rolling-origin evaluation: at each date $t$, it passes train.R only information available through $(t-1)$, produces a forecast for $t$, and computes RMSE over the evaluation period. results.tsv records one line per candidate rule. The empirical task is narrow and inspectable: construct a function that maps admissible historical information into one forecast, and judge every attempted rule under the same rolling-origin RMSE criterion.

The same four-file architecture applies whenever the researcher can separate an immutable evaluation design from an editable implementation rule. The remainder of this section provides three stylized illustrations that map the protocol into familiar empirical settings: regression specification search, structural VARs, and inflation forecasting for monetary policy. These illustrations are intentionally schematic. In forecasting problems, the evaluator can often be a straightforward out-of-sample loss. In inference-heavy settings, by contrast, any scalar score should be understood as a diagnostic screen within a broader research design, not as a complete validation of identification or inference.

Example 1: Specification search in a regression model. Consider a researcher studying a descriptive wage regression using cross-sectional data. The estimating equation is
$$\log w_i = \beta \cdot \mathrm{educ}_i + g(z_i) + e_i,$$
where $w_i$ is earnings, $\mathrm{educ}_i$ is years of schooling, $\beta$ is a coefficient of interest, and $z_i$ is a vector of controls including experience, experience squared, demographic indicators, occupation, industry, and geographic region. The researcher wants to compare alternative specifications for the nuisance function $g(\cdot)$ while handling inference on $\beta$ separately.

File mapping.

• program.md ($C$): "Compare candidate specifications of $g(z_i)$. Only edit train.R. The search sample and its internal evaluation split are fixed in prepare.R. Do not modify the sample or the outcome variable."

• train.R ($\tau$): Defines a function that receives the training sample and returns fitted values for the nuisance component, potentially along with fitted values for $\mathrm{educ}_i$ if a partialling-out design is used. The agent may explore polynomial terms, splines, interactions, LASSO-penalized specifications, or other functional forms for $g$.

• prepare.R ($S$): Splits the search sample $D_S$ into a training partition $I_{\mathrm{train}}$ and an evaluation partition $I_{\mathrm{eval}}$, fixed ex ante. On $I_{\mathrm{train}}$, it sources train.R to obtain the nuisance fit, and if desired partials out both outcome and treatment before estimating $\hat{\beta}$. On $I_{\mathrm{eval}}$, it computes a holdout prediction error such as
$$S(\tau; D_S) = \frac{1}{|I_{\mathrm{eval}}|} \sum_{i \in I_{\mathrm{eval}}} \big( \log w_i - \hat{\beta} \cdot \mathrm{educ}_i - \hat{g}(z_i) \big)^2.$$

• results.tsv ($L_K$): Records each specification attempted, its evaluation MSE, and the implied $\hat{\beta}$.

If the scoring rule were the in-sample p-value for $\hat{\beta}$, the loop would amount to automated p-hacking. The discipline comes from fixing the evaluator and the sample split before the search begins and from disclosing the full search path. Causal claims about returns to schooling would still require separate identification assumptions. After the search, the researcher can evaluate the winning specification on a post-search holdout $D_H$ (e.g., a later survey year or a pre-reserved subsample withheld from the search) to check whether the nuisance fit generalizes.

Example 2: Structural VAR identification. Consider a three-variable VAR for macroeconomic analysis of monetary policy.
Let $y_t = (\Delta \mathrm{GDP}_t, \pi_t, i_t)'$, where $\Delta \mathrm{GDP}_t$ is real GDP growth, $\pi_t$ is inflation, and $i_t$ is the federal funds rate. The reduced-form VAR is $y_t = \Phi_1 y_{t-1} + \cdots + \Phi_p y_{t-p} + u_t$, where $u_t$ is the reduced-form innovation with $\mathrm{Var}(u_t) = \Sigma_u$. The structural representation is $u_t = A_0^{-1} \varepsilon_t$, where $\varepsilon_t$ is the vector of structural shocks and $A_0$ encodes the contemporaneous causal structure. The goal here is more modest: compare candidate ways of constructing a monetary policy shock series and record how each one lines up with an external narrative diagnostic.

File mapping.

• program.md ($C$): "Compare candidate constructions of a monetary policy shock series using a fixed external narrative diagnostic. Only edit train.R. Do not modify the VAR specification."

• train.R ($\tau$): Defines a function that receives the estimated reduced-form parameters $(\hat{\Phi}_1, \ldots, \hat{\Phi}_p, \hat{\Sigma}_u)$, together with any auxiliary diagnostic series supplied by prepare.R, and returns a candidate monetary policy shock series $\hat{\varepsilon}_t^{mp}$ and the associated GDP impulse response at horizon $h$. The agent may explore recursive orderings, sign restrictions, or other mappings from reduced-form innovations into candidate shock measures.

• prepare.R ($S$): Estimates the reduced-form VAR on a fixed training sample, calls train.R to obtain the candidate shock, and computes a diagnostic score on $D_S$:
$$S(\tau; D_S) = -\big| \mathrm{corr}\big( \hat{\varepsilon}_t^{mp}, z_t^{RR} \big) \big|,$$
where $z_t^{RR}$ is the Romer and Romer narrative monetary policy shock measure. The negative sign ensures that higher correlation corresponds to a lower (better) score. This score is best read as an external diagnostic rather than a standalone proof of identification.

• results.tsv ($L_K$): Records the candidate construction, the diagnostic score, and the implied peak response of GDP to a monetary policy shock.
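The evaluator in Example 2 is just a signed absolute correlation, which makes its fixed, inspectable nature easy to see. A minimal Python sketch (the argument names are placeholders; `z_narrative` stands in for the Romer-Romer series):

```python
import numpy as np

def narrative_score(eps_hat, z_narrative):
    """S = -|corr(candidate shock, narrative measure)|: lower is better."""
    c = np.corrcoef(eps_hat, z_narrative)[0, 1]
    return -abs(float(c))
```

Note that the absolute value makes the score invariant to the sign normalization of the candidate shock, which is a free choice in most identification schemes.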
The point is not that the agent "finds the largest effect" or settles identification, but that any search over candidate shock constructions is visible, bounded, and judged against a criterion fixed ex ante. After the search, the researcher can evaluate the chosen shock construction on a later sample period $D_H$ (e.g., search on 1969-2000, holdout on 2001-2007) to check whether the narrative correlation and impulse response patterns hold outside the search sample.

Example 3: Inflation forecasting for monetary policy. Central banks routinely produce inflation forecasts to guide monetary policy decisions. The workhorse framework is the Phillips curve:
$$\pi_{t+h} = \alpha + \beta \cdot \mathrm{slack}_t + \gamma \cdot \pi_t^e + \delta' z_t + e_{t+h},$$
where $\pi_{t+h}$ is $h$-quarter-ahead CPI inflation, $\mathrm{slack}_t$ is a measure of economic slack, $\pi_t^e$ is an inflation expectations proxy, and $z_t$ is a vector of additional conditioning variables (commodity prices, exchange rates, financial conditions). The practical difficulty is that the specification involves many choices: which slack measure (unemployment gap, output gap, capacity utilization), which expectations proxy (survey expectations, breakeven inflation, adaptive expectations), which additional predictors, which estimation method (OLS, ridge, random forest), and which estimation window.

File mapping.

• program.md ($C$): "Compare candidate real-time Phillips curve specifications for $h = 4$ quarter-ahead CPI inflation forecasting. No look-ahead: at forecast origin $t$, only real-time vintage data through $t$ is available. Only edit train.R. Budget: $K = 200$ experiments."

• train.R ($\tau$): Defines a function that receives the information set $\mathcal{F}_t = \{y_s, x_s\}_{s=1}^{t}$ and returns a point forecast $\hat{\pi}_{t+h|t}$. The agent may explore different slack measures, expectations proxies, transformations, variable selection methods, penalized estimation, or nonlinear specifications.
• prepare.R ($S$): Implements pseudo-out-of-sample evaluation. For each forecast origin $t \in \{t_0, t_0+1, \ldots, T-h\}$, passes the real-time information set to train.R and collects the forecast. The score is
$$S(\tau; D_S) = \mathrm{RMSE} = \left( \frac{1}{|\mathcal{E}|} \sum_{t \in \mathcal{E}} \big( \pi_{t+h} - \hat{\pi}_{t+h|t} \big)^2 \right)^{1/2}.$$

• results.tsv ($L_K$): Records each Phillips curve specification, its RMSE, and a description of the method (e.g., "ridge Phillips curve, unemployment gap, SPF expectations, 40-quarter rolling window").

This is a natural fit for the protocol because the evaluator is a genuine pseudo-out-of-sample forecast loss. For central bank staff who routinely compare Phillips curve specifications, the protocol would make the search path more visible and easier to disclose. The researcher can further extend the evaluation to a later period $D_H$ not available during the search to check whether the selected specification generalizes.

These three illustrations suggest how the four-file protocol can be adapted across settings. The architecture applies whenever the researcher can separate an immutable evaluation design from an editable implementation rule. Forecasting is the cleanest instance because the score is naturally out-of-sample. In more inference-heavy settings, the protocol is better viewed as a way to make adaptive search more visible and easier to disclose, not as a substitute for the underlying econometric argument.

A post-search holdout can also help reduce concerns about p-hacking, but only in a limited sense. Its value is that it separates discovery from evaluation: the agent may search adaptively over many specifications on the search sample, but the selected output is then judged on data not used during that search. This makes sample-specific improvements less likely to be mistaken for robust findings. At the same time, a holdout is not a complete solution.
If the holdout is consulted repeatedly and the method is revised in response, it becomes part of the search process itself. And in inference-heavy settings, holdout performance on a diagnostic criterion does not by itself validate identification, causal interpretation, or post-selection inference. The holdout should therefore be understood as a practical guardrail against hidden adaptive search, not as a full cure.
