Understanding by Reconstruction: Reversing the Software Development Process for LLM Pretraining



Zhiyuan Zeng 1,2,*, Yichi Zhang 3,*, Yong Shan 1,*, Kai Hua 1,*, Siyuan Fang 3, Qian Liu 4, Jiaheng Liu 3, Haozhe Wang 6,3, Yining Zheng 2, Ming Ding 1, Ke Shen 1, Ge Zhang 1,†, Wenhao Huang 1,†, Xipeng Qiu 2,5,†

1 ByteDance Seed, 2 Fudan University, 3 M-A-P, 4 ByteDance TikTok, 5 Shanghai Innovation Institute, 6 The Hong Kong University of Science and Technology
* Core Contributors, † Corresponding Authors

Abstract

While Large Language Models (LLMs) have achieved remarkable success in code generation, they often struggle with the deep, long-horizon reasoning required for complex software engineering. We attribute this limitation to the nature of standard pre-training data: static software repositories represent only the terminal state of an intricate intellectual process, abstracting away the intermediate planning, debugging, and iterative refinement. To bridge this gap, we propose a novel paradigm: understanding via reconstruction. We hypothesize that reverse-engineering the latent agentic trajectories—the planning, reasoning, and debugging steps—behind static repositories provides a far richer supervision signal than raw code alone. To operationalize this, we introduce a framework that synthesizes these trajectories using a multi-agent simulation. This process is grounded in the structural realities of the source repositories (e.g., dependency graphs and file hierarchies) to ensure fidelity. Furthermore, to guarantee the logical rigor of the synthetic data, we employ a search-based optimization technique that iteratively refines the Chain-of-Thought (CoT) reasoning to maximize the likelihood of the ground-truth code.
Empirical results demonstrate that continued pre-training on these reconstructed trajectories significantly enhances Llama-3-8B's performance across diverse benchmarks, including long-context understanding, coding proficiency, and agentic capabilities.

Date: March 20, 2026
Correspondence: Zhiyuan Zeng at cengzy23@m.fudan.edu.cn, Ge Zhang at zhangge.eli@bytedance.com

1 Introduction

The remarkable success of Large Language Models (LLMs) can be viewed as a modern validation of Richard Feynman's famous dictum: "What I cannot create, I do not understand." The dominant paradigm of generative pre-training [7, 16] is built on this very principle—that the ability to generate text token-by-token serves as a proxy for understanding language. By learning to predict the next token, models internalize the syntax, semantics, and world knowledge embedded within vast corpora. However, this "understanding via generation" paradigm faces a fundamental limit when applied to complex, long-horizon artifacts, such as substantial software repositories. A software repository, in its final form, is the terminal state of an intricate intellectual process. It is a highly compressed artifact where the "computational steps" of human reasoning—the requirement analysis, architectural planning, trial-and-error debugging, and iterative refinement—have been abstracted away. When we train models solely on this static code, we are essentially asking them to memorize the destination without showing them the map. Consequently, models often learn to mimic the surface-level structural patterns of the result rather than mastering the generative reasoning required to derive it. This explains why models that excel at generating short snippets often fail to grasp the deep, causal logic required to construct and maintain complex software systems [19].
To bridge this gap, we propose that to truly understand a repository, a model should learn to reconstruct the process that created it. Our motivation is to reverse-engineer the latent agentic trajectory hidden behind static code [30]. We hypothesize that by restoring the missing details of the generation process—explicitly expanding a static repository into a dynamic sequence of planning, reasoning, and execution steps—we can provide a far richer supervision signal than the raw code alone. This allows the model to learn not just what the code is, but why and how it was written, thereby aligning the training data more effectively with the model's next-token prediction objective. To implement this data-centric philosophy, we developed a framework to synthesize these trajectories from existing high-quality open-source repositories. We treat the repository as the ground-truth answer and simulate the problem-solving steps required to arrive there. Specifically, we employ a multi-agent simulation, in which a main agent generates high-level requirements and implementation plans, while sub-agents are delegated to handle individual files. These agents use a "Read" tool to gather context and a "Write" tool to generate code. Crucially, to prevent the simulation from drifting, we inject structural ground-truth information—such as file hierarchies and dependency graphs extracted from the repository—to guide the agents, ensuring the synthesized trajectory faithfully reconstructs the target artifact. While this reconstruction provides the "missing steps", the quality of the reasoning itself remains a variable. The initial CoT generated during simulation may be suboptimal. To address this, we introduce a search-based optimization technique to refine the thinking process. We posit that a high-quality thought (z) should maximize the likelihood of the correct code (x), formalized as maximizing log p(x | z).
Drawing inspiration from tree-search algorithms [21, 30], we decompose the trajectory into steps and iteratively sample refinements. We replace the original reasoning with refined thoughts only when they lower the perplexity of the target ground-truth code. This process polishes the synthetic trajectory, yielding a dataset that is not only causally complete but also logically rigorous. We empirically validate our paradigm by continually pre-training Llama-3-8B on our synthesized dataset. The results demonstrate that learning from these reconstructed trajectories leads to significant performance gains across diverse benchmarks, including long-context understanding, coding, reasoning, and agentic capabilities. Our contributions are summarized as follows:

1. We propose a novel paradigm for scaling LLM capabilities based on the principle of understanding via reconstruction. We argue that static repositories miss crucial generative details, and we introduce a method to reverse-engineer these latent agentic trajectories to provide richer supervision.
2. We develop a multi-agent simulation framework that synthesizes these trajectories by grounding the generation process in the structural realities of source repositories, effectively converting static data into dynamic thinking and acting.
3. Experimental results show that Llama-3-8B, when pre-trained on our reconstructed data, achieves superior performance across benchmarks for long-context understanding, coding, reasoning, and agentic tasks.

2 Related Work

2.1 Reverse Reasoning in Pretraining Data

The paradigm of "reasoning recovery" posits that logical capabilities are latent within pre-training data and can be activated through explicit structural modeling.
Early efforts such as Quiet-STaR [38] internalize this process at the token level by training models to generate implicit rationales that minimize future-token uncertainty, effectively embedding an "internal monologue" within the model's latent space. Shifting toward structural optimization, BOLT [18] learns latent reasoning for pre-training documents via an EM framework, systematically bridging the gap between raw text and logical derivation. From a data-engineering perspective, Thinking Augmented Pre-training (TPT) [32] prepends synthetic thinking trajectories to pre-training corpora, effectively reallocating computational budget toward logic-dense segments to enhance data efficiency. Most recently, REER [30] introduced a reverse-engineering approach for open-ended generation, utilizing perplexity-driven path searching to reconstruct the logical scaffolding behind high-quality reference answers. Unlike existing approaches that recover isolated reasoning steps, our framework reconstructs a holistic agentic trajectory—integrating high-level architectural planning, file-level action sequencing, and iterative tool use—thereby capturing the multi-dimensional generative process of entire repositories.

2.2 Synthetic Agent Trajectories

There are two primary methods for constructing agent trajectories in existing research. The first generates trajectories through agent exploration in real-world environments [6, 17, 27, 29, 36]. While this approach ensures the authenticity of the trajectories, it has significant drawbacks, including potentially expensive tool-invocation costs and the substantial engineering effort required for environment setup and maintenance. The second places the agent in an environment simulated by an LLM [3, 14, 26] or prompts an LLM to generate an entire synthetic trajectory [23, 30].
The main advantage of this approach is its low cost. However, the resulting trajectories may suffer from extensive hallucinations generated by the LLM, compromising data reliability. The trajectory synthesis method proposed in this paper draws inspiration from the second method by using an LLM to generate both the tool calls and their corresponding outcomes. Although the synthesized trajectories may contain some noise, we ensure that their terminal state is a real repository, which serves as the ground truth.

2.3 Synthetic Data for Coding

The use of synthetic data has become a cornerstone in advancing the capabilities of LLMs for code-related tasks. A significant body of work focuses on generating instruction-following datasets. For instance, Magicoder [35] synthesizes user instructions for open-source code snippets to create a dataset aimed at enhancing the coding abilities of LLMs. Similarly, Code Alpaca [1] employs the self-instruct methodology [33] to generate a dataset of 20,000 code instructions. To improve the quality of these instructions, WizardCoder [15] introduces an evolutionary pipeline that progressively increases the complexity and diversity of the initial instructions. Other research explores different forms of synthetic data. Case2Code [24], for example, collects a vast number of input-output test cases by executing existing programs and then generates new programs that satisfy these test cases. More recently, SWE-Synth [19] focuses on generating synthetic data for program bug fixing, which has proven effective in improving LLM performance on benchmarks like SWE-Bench [13]. The widespread adoption of such synthetic data in both the pre-training and post-training phases of modern code LLMs, such as Qwen2 [12] and DeepSeek-Coder [9], underscores its critical importance [12, 31].
While our work also contributes to the field of synthetic data for code generation, it diverges from previous efforts in two fundamental aspects. First, instead of augmenting isolated code snippets, we focus on augmenting entire repositories. Second, rather than merely capturing the final code or the associated chain-of-thought, we reconstruct the entire agentic process of developing a repository. This involves synthesizing a sequence of actions, tool interactions, and evolving states, thereby providing a more comprehensive and realistic representation of the software development lifecycle.

Figure 1: The pipeline of synthetic agent trajectory curation.

3 Approach

Our goal is to create a high-quality, structured dataset of agentic trajectories from existing code repositories for LLM pretraining. Our method consists of two main stages: (1) Multi-Agent Trajectory Curation, where we simulate a developer workflow to reverse-engineer an agentic trajectory from a complete repository, and (2) LongCoT Optimization, where we refine the reasoning within these trajectories using a search-based algorithm. Figure 1 provides an overview of the entire pipeline. The primary objective of this framework is not agent training, but the curation of a high-fidelity pre-training corpus. By simulating the development process, we provide the model with a reasoning-dense supervision signal characterized by long-horizon context dependencies that go far beyond surface-level code patterns.

3.1 Multi-Agent Trajectory Curation

We design a multi-agent workflow that mirrors a human software development process. Table 1 shows an illustrative example of such a synthesized trajectory. Instead of building a live agent framework, we prompt a powerful LLM to simulate the entire workflow and generate the corresponding trajectory data.
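The simulated workflow can be sketched as a small loop. This is only an illustrative toy, not the production pipeline: `plan_order`, `simulate_trajectory`, and the `repo`/`deps` dictionaries are hypothetical stand-ins, the LLM "Think" steps are omitted, and tool outcomes are taken directly from the repository, mirroring how our curation stage grounds tool responses in ground truth.

```python
def plan_order(repo, deps):
    """Main-Agent planning: order files so that dependencies come first."""
    order, seen = [], set()
    def visit(path):
        if path in seen:
            return
        seen.add(path)
        for dep in deps.get(path, []):
            visit(dep)
        order.append(path)
    for path in sorted(repo):
        visit(path)
    return order

def simulate_trajectory(repo, deps):
    """Record Main-Agent/Sub-Agent steps; tool outcomes use ground-truth content."""
    traj = [("Main", "Think", f"Plan: create files in order {plan_order(repo, deps)}")]
    for path in plan_order(repo, deps):
        traj.append(("Main", "Action", f"CallSubAgent(task='Create {path}')"))
        for dep in deps.get(path, []):
            traj.append(("Sub", "Action", f"ReadTool(file='{dep}')"))
            traj.append(("Sub", "Observation", repo[dep]))   # grounded: real file content
        traj.append(("Sub", "Action", f"WriteTool(file='{path}')"))
        traj.append(("Sub", "Observation", repo[path]))      # grounded: ground-truth code
    return traj

# The tiny calculator project from Table 1.
repo = {"operations.py": "def add(a, b):\n    return a + b",
        "main.py": "from operations import add\nprint(add(2, 3))"}
deps = {"main.py": ["operations.py"]}
trajectory = simulate_trajectory(repo, deps)
```

In the real pipeline the plan, the delegation messages, and the interleaved thoughts are all produced by the prompted LLM; only the structural skeleton above is fixed by the extracted ground truth.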
The simulation unfolds as follows:

Main Agent: Project Planning. The process begins with a Main Agent, whose responsibilities are high-level planning and coordination. Given the entire code repository as context, the Main Agent is prompted to:

• Generate Project Requirements: Synthesize a high-level description of the project's purpose and functionality, as if it were a task brief.
• Formulate an Implementation Plan: Decompose the project into a logical sequence of file-creation steps. This plan outlines which files should be created and in what order, establishing a dependency-aware development path.

For each file in the implementation plan, the Main Agent then invokes a specialized Sub-Agent to handle the implementation.

Sub-Agent: File Implementation. A Sub-Agent is responsible for generating the code for a single file. This process is also broken down into thought and action steps:

• Plan File Implementation: The Sub-Agent first outlines a plan for the specific file's structure and logic.
• Information Gathering (Read Tool): Before writing code, the Sub-Agent may need to understand the context of other parts of the repository. It simulates this by calling a Read tool to access the content of other, already "implemented" files.
• Code Generation (Write Tool): Finally, the Sub-Agent calls a Write tool, providing the full code content for the current file.

Table 1: An illustrative example of a synthesized agent trajectory for creating a tiny project. The process starts with a user instruction, followed by the interaction between the Main Agent and Sub-Agents.

Step | Role       | Component   | Content
0    | User       | Task        | Create a simple calculator project in Python. It should include a file for arithmetic operations and a main script to run a demonstration.
1    | Main Agent | Think       | The project requires a calculator. It needs a file for core logic, operations.py, and a main file to run it, main.py. The latter depends on the former, so I must create operations.py first. The plan is: 1. Create operations.py. 2. Create main.py.
2    | Main Agent | Action      | CallSubAgent(task="Create 'operations.py'...")
...  | ...        | ...         | ...
5    | Main Agent | Observation | Sub-agent finished task for operations.py. The file has been generated.
6    | Main Agent | Think       | Now proceeding to the next step in my plan.
7    | Main Agent | Action      | CallSubAgent(task="Create 'main.py' that uses the add function...")
8    | Sub-Agent  | Think       | I need to create main.py. This file must import the add function from operations.py. To ensure correctness, I should first read the content of operations.py.
9    | Sub-Agent  | Action      | ReadTool(file="operations.py")
10   | Sub-Agent  | Observation | def add(a, b):\n    return a + b
11   | Sub-Agent  | Think       | Okay, I have the content of operations.py. Now I can write main.py. I will import add, call it with 2 and 3, and print the output.
12   | Sub-Agent  | Action      | WriteTool(file="main.py", content="from operations import add...")
13   | Sub-Agent  | Observation | Successfully wrote 89 bytes to main.py.
14   | Main Agent | Think       | Sub-agent finished task for main.py. All files in the plan have been generated. The project is complete.

This entire sequence of thoughts, tool calls (Read, Write), and tool responses constitutes a single, coherent agentic trajectory (see Steps 8-13 in Table 1 for a detailed instance).

Grounding the Simulation with Extracted Information. A purely LLM-simulated trajectory is prone to noise and hallucinations. To enhance the fidelity and accuracy of our synthetic data, we ground the simulation by injecting ground-truth information extracted directly from the source repository. This serves two purposes: guiding the LLM's generation and replacing noisy outputs with factual data. We extract the following ground-truth information:

• File Structure Tree: A complete directory and file layout of the repository. This is provided to the LLM to simulate the implementation plan of the Main Agent.
• Inter-File Dependency Graph: We analyze import statements to build a graph representing how files depend on one another. This is important for the LLM to simulate the tool calls and tool responses of the Read tool.
• Intra-File Structure: For each file, we parse its Abstract Syntax Tree (AST) to extract key structural elements such as class and function definitions. This information is provided to the LLM to simulate the Sub-Agent trajectory.

Furthermore, we use this ground-truth data to correct parts of the simulated trajectory. For example:

• The response to a Read tool call is replaced with the actual content of the file from the repository.
• The final output of the Write tool call is replaced with the ground-truth code of the file.

This grounding process ensures that while the reasoning is generated by the LLM, the actions and outcomes are anchored to reality.

3.2 CoT Optimization via Search

The initial trajectory curation stage leverages an LLM's ability to simulate agentic behavior. However, the generated CoT reasoning (z) may not be optimal for generating the target code (x). An ideal thought process should make the subsequent code generation step as simple as possible. Formally, we aim to find a reasoning path z* that maximizes the conditional log-probability of the code:

    z* = arg max_z log p(x | z)    (1)

While this objective could be optimized with RL (using log p(x | z) as the reward), RL training is often complex, expensive, and unstable. We therefore opt for a simpler yet effective inference-time search strategy. Following Wang et al. [30], we decompose the CoT into steps (z_1, ..., z_n) and optimize each z_i:

1. Sample: We prompt an LLM to generate a set of k alternative "refinements" for the thought step z_i.
2. Evaluate: For each candidate z_cand = (z_1, ..., z'_i, ..., z_n), measure the perplexity (PPL) of the ground-truth code x: PPL(x | z_cand).
3.
Update: If the best refinement z'_i results in a lower perplexity than the original step z_i, we permanently update the CoT with this new, improved step.

This iterative refinement ensures the reasoning path is causally structured and directly facilitates correct code generation.

3.3 Continual Pretraining on Synthetic Agent Trajectories

We use the synthesized agent trajectories for continual pre-training rather than SFT or post-training. This choice is motivated by the inherent nature of our synthetic data. The trajectories inevitably contain noise and biases stemming from the LLM's potential hallucinations and our agent workflow. Continual pre-training, which typically involves larger and more diverse datasets than SFT, is inherently more robust to such imperfections.

Trajectory Flattening. To prepare the data, we transform the hierarchical multi-agent interaction into a single sequential document. When the Main Agent calls a Sub-Agent, we recursively inject that Sub-Agent's entire trajectory (thoughts, tool calls, and observations) directly into the call point. This creates a monolithic, chronological sequence that mirrors the complete development lifecycle of the repository, structurally similar to the example shown in Table 1.

Targeted Loss Masking. To ensure the model learns the causal link between reasoning and action rather than memorizing feedback, we mask the tokens corresponding to Observations (tool responses). The model is thus trained exclusively to predict Think and Action tokens, forcing it to internalize the logic of the development process.

4 Experiments

4.1 Experiment Setup

Data Generation: We curate approximately 300k GitHub repositories by filtering out those that are too short or too long. Using Qwen3-30B-A3B-Instruct-2507 [28], we generate 4B tokens of synthetic agent trajectories.
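The sample-evaluate-update loop of Section 3.2 can be sketched as follows. This is a toy illustration of the control flow only: `score_ppl` is a hypothetical surrogate (it rewards CoT steps that mention tokens of the target code) standing in for the language model's actual perplexity PPL(x | z_cand), and `proposer` is a hypothetical refinement sampler; the real pipeline queries an LLM for both.

```python
import math

def score_ppl(code, cot_steps):
    # Toy surrogate for PPL(x | z): lower when the CoT mentions more code tokens.
    code_tokens = set(code.split())
    mentioned = {t for step in cot_steps for t in step.split() if t in code_tokens}
    return math.exp(-len(mentioned))  # lower is better, like real perplexity

def refine_cot(code, cot_steps, propose, rounds=3, k=2):
    """Greedy step-wise search: accept a sampled refinement of step i only if it
    lowers the (surrogate) perplexity of the ground-truth code."""
    steps = list(cot_steps)
    for _ in range(rounds):
        for i in range(len(steps)):
            best, best_ppl = steps[i], score_ppl(code, steps)
            for cand in propose(steps[i], k):             # Sample k refinements
                trial = steps[:i] + [cand] + steps[i + 1:]
                ppl = score_ppl(code, trial)              # Evaluate candidate CoT
                if ppl < best_ppl:                        # Update only on improvement
                    best, best_ppl = cand, ppl
            steps[i] = best
    return steps

code = "def add(a, b): return a + b"
initial = ["write a function", "it sums things"]
# Hypothetical proposer that elaborates a step with code-relevant detail.
proposer = lambda step, k: [step + " return a + b", step + " def add(a, b):"]
refined = refine_cot(code, initial, proposer)
```

The accept-only-on-improvement rule makes the search monotone: the surrogate perplexity of the target code can only decrease across iterations, matching the trend reported in Section 5.2.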
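The trajectory flattening and targeted loss masking of Section 3.3 can be sketched as below. Whitespace tokenization and the `(role, component, text)` tuple format are simplifications introduced here for illustration; real training operates on model tokenizer IDs and typically masks observation positions out of the loss (e.g., via the conventional -100 ignore index).

```python
def flatten_with_masks(trajectory):
    """Flatten (role, component, text) steps into one token stream plus a
    supervision mask: Think/Action tokens are trained on, while Observation
    (tool-response) tokens are excluded from the loss."""
    tokens, supervised = [], []
    for _role, component, text in trajectory:
        for tok in text.split():
            tokens.append(tok)
            supervised.append(component != "Observation")  # mask tool feedback
    return tokens, supervised

# A tiny trajectory in the style of Table 1 (Steps 8-10).
trajectory = [
    ("Sub-Agent", "Think", "I should read operations.py first."),
    ("Sub-Agent", "Action", "ReadTool(file='operations.py')"),
    ("Sub-Agent", "Observation", "def add(a, b): return a + b"),
]
tokens, supervised = flatten_with_masks(trajectory)
```

The mask forces the model to predict its own reasoning and actions while treating environment feedback as given context, which is the causal direction the pre-training objective should learn.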
For CoT optimization, we generate two candidates for each CoT step and iterate the search-and-replace process for 3 rounds.

Table 2: Summary of long-context understanding performance (average scores). Detailed sub-task results are provided in Appendix A.

Benchmark | Context | Prolong | Raw-Repo | Repo2Agent | Repo2Agent-Search
Ruler     | 16,384  | 83.61   | 86.90    | 87.50      | 87.10
Ruler     | 32,768  | 81.77   | 83.20    | 84.00      | 84.40
Ruler     | 65,536  | 57.10   | 61.00    | 58.10      | 61.80
Helmet    | 16,384  | 60.17   | 60.41    | 61.56      | 61.99
Helmet    | 32,768  | 61.57   | 60.98    | 62.03      | 62.65
Helmet    | 65,536  | 58.10   | 57.13    | 57.32      | 57.84

Training Configuration: We continually pre-train Llama3-8B-Instruct [5] for 20B tokens with a 64k context window, following Gao et al. [8]. To ensure a fair comparison, all models share a 70% general-domain and 30% repository-related data mixture. Within the 30% repository slot, 18% is fixed (Prolong Repos), while the remaining 12% is allocated to our experimental data variants.

Baselines and Model Variants: We compare the official Prolong baseline against three internal variants, differing only in the 12% experimental data slot:

• Raw-Repos: 12% slot filled with raw source code from our 300k repos.
• Repo2Agent: 12% slot filled with unoptimized synthetic trajectories.
• Repo2Agent-Search: 12% slot filled with search-optimized trajectories.

Our primary comparison focuses on Raw-Repos, Repo2Agent, and Repo2Agent-Search to isolate the impact of converting code into agentic trajectories.

Evaluation Benchmarks: We assess the models across four key capabilities. The selection of these benchmarks is directly motivated by the long-context, reasoning-intensive, and code-centric nature of our reconstructed data.

• Long-Context Understanding: As our reconstruction unfolds repositories into massive sequential traces, it introduces long-range causal dependencies. We evaluate this via Ruler [11] and Helmet [37] to test information retrieval across extended horizons.
• Coding: Given our code-domain focus, we use LongCodeBench [22] and HumanEval [2] to verify whether observing the "process" of code creation enables better synthesis than memorizing static files.
• Reasoning: A central feature of our data is the search-optimized CoT. We evaluate the transferability of this structured logic to general domains using BBH [25], AGIEval [39], GSM-8k [4], MATH [10], and MMLU-Pro [34].
• Software-Engineering Capability: We use APTBench [20], which is specifically designed to assess the foundational agentic capabilities of pre-trained models (without post-training) on SWE-Bench and Deep-Research, to measure the inherent potential instilled by our trajectories.

4.2 Main Results

4.2.1 Long-Context Understanding

We evaluate the long-context capabilities of our models using two comprehensive benchmarks: Ruler and Helmet. Across both benchmarks, our primary observation is that training on structured agent trajectories (Repo2Agent variants) consistently yields superior performance compared to training on flattened code (Raw-Repos). This confirms that reconstructing the process of code generation provides a denser, more instructive signal for long-context modeling than static code files alone. Furthermore, our optimized model, Repo2Agent-Search, frequently surpasses the strong external baseline (Prolong), particularly in tasks requiring complex information retrieval. We present the average long-context understanding scores in Table 2.

Table 3: Results on reasoning and coding benchmarks, including AGI-Eval, BBH, GSM-8k, MATH, HumanEval, and LongCodeBench.

Benchmark         | Prolong | Raw-Repos | Repo2Agent | Repo2Agent-Search
AGI-Eval          | 36.91   | 35.78     | 36.32      | 36.85
BBH               | 66.69   | 66.27     | 66.00      | 67.03
GSM-8k            | 59.67   | 61.94     | 61.94      | 60.96
MATH              | 1.64    | 2.18      | 3.72       | 3.76
HumanEval         | 16.46   | 34.76     | 36.59      | 37.20
LongCodeBench-32k | 29.38   | 34.16     | 34.51      | 36.46
LongCodeBench-64k | 30.52   | 27.37     | 31.05      | 30.26
For a more granular breakdown of performance across all sub-tasks in Ruler and Helmet, please refer to Appendix A (Tables 5 and 6).

Performance on Ruler. As shown in Table 2, the models trained on our synthetic data (Repo2Agent and Repo2Agent-Search) consistently outperform the internal Raw-Repo baseline across all tested context lengths. At shorter context lengths (16k and 32k), Repo2Agent and Repo2Agent-Search maintain a steady lead over raw-code pre-training. The advantage of agentic synthetic data becomes most evident at the 64k window size. While the official Prolong baseline and the Raw-Repo ablation show significant degradation, Repo2Agent-Search achieves the highest robustness with an average score of 61.80. This suggests that learning from a structured, step-by-step construction process helps the model maintain information integrity even when the context is heavily populated.

Performance on Helmet. The results on the Helmet benchmark further reinforce the superiority of the reconstruction paradigm. At 16k and 32k context lengths, Repo2Agent-Search achieves peak performance, reaching an average of 62.65 at 32k. This represents a significant improvement over the Raw-Repo baseline (60.98), indicating that the search-optimized reasoning steps provide a cleaner and more effective supervision signal for long-range retrieval and reasoning than flattened code files. At the maximum length of 64k, while the Prolong baseline remains highly competitive, our Repo2Agent-Search continues to outperform the primary Raw-Repo ablation. This confirms that, even considering holistic performance across diverse long-context tasks, converting static repositories into dynamic histories is a more potent data strategy than standard code pre-training.
4.2.2 Coding and Reasoning

We further evaluate whether agentic pre-training benefits fundamental coding and general reasoning (Table 3).

Coding Capabilities. Our reconstruction paradigm shows a clear advantage in code generation. On HumanEval, Repo2Agent-Search scores 37.20, outperforming the Raw-Repos baseline (34.76). This confirms that learning the "process" of creation—incorporating planning and refinement—is superior to memorizing static code. This edge extends to long-horizon tasks; Repo2Agent-Search leads on LongCodeBench-32k (36.46).

Reasoning Transfer. Despite the lack of math-specific tuning, our method induces positive transfer to general reasoning. On MATH, although absolute scores are low across all models—reflecting the inherent limitations of Llama-3-8B in complex mathematics—Repo2Agent-Search still yields the best results. Furthermore, on BBH and AGI-Eval, our models match or slightly exceed the baselines. These results demonstrate that the structured logic within agentic trajectories provides a higher-quality supervision signal than raw code, enhancing specialized skills without compromising general intelligence.

Table 4: Results on APTBench (merged En/Zh sub-tasks).

Category     | Sub-task         | Raw-Repos | Repo2Agent | Repo2Agent-Search
DeepResearch | Openend-Citation | 11.20     | 10.94      | 11.49
DeepResearch | Openend-Plan     | 16.11     | 13.42      | 10.40
DeepResearch | Openend-Quality  | 21.99     | 24.74      | 26.20
DeepResearch | Plan             | 47.09     | 49.54      | 47.81
DeepResearch | Summ-Ans         | 43.12     | 45.30      | 44.40
DeepResearch | Average          | 29.21     | 30.49      | 30.02
Env-Setup    | Action           | 20.39     | 22.05      | 21.13
Env-Setup    | Error            | 22.45     | 23.13      | 24.49
Env-Setup    | Plan             | 18.99     | 17.85      | 19.22
Env-Setup    | Average          | 20.61     | 21.01      | 21.61
Issue-Fix    | Fix-Patch        | 26.72     | 28.02      | 25.43
Issue-Fix    | Locate           | 24.03     | 23.67      | 24.03
Issue-Fix    | Plan             | 37.04     | 40.74      | 38.68
Issue-Fix    | Test-Patch       | 26.60     | 27.08      | 26.60
Issue-Fix    | Tool-Call        | 54.23     | 54.69      | 54.28
Issue-Fix    | Average          | 33.72     | 34.84      | 33.80
Overall      | Average          | 29.02     | 30.10      | 29.65

4.2.3 Software-Engineering Capability

We use APTBench to evaluate the foundational agentic potential instilled by our pre-training.
By deconstructing complex trajectories into atomic skills (e.g., planning, fix-patch, tool selection, test-patch), APTBench measures a model's inherent aptitude without the confounding effects of post-training. Repo2Agent excels in planning-centric categories like Issue-Fix (34.84%). This suggests that natural, unrefined CoT provides a more generalizable signal for holistic workflows. Repo2Agent-Search leads in Env-Setup (21.61%), particularly in the Error-diagnosis sub-task (24.49%). This indicates that search-refined reasoning, being logically more rigorous, is more effective for teaching meticulous, low-level implementation and debugging logic. In summary, pre-training on synthetic trajectories (Repo2Agent) significantly fosters innate agentic capabilities compared to raw code, with the choice of optimization (Search) offering a tunable balance between broad planning and logical precision.

5 Analysis on Synthetic Data

5.1 Token Distribution

We analyze the structural composition and length of our synthetic trajectories to evaluate the impact of agentic reconstruction and search-based optimization (Figure 2).

Figure 2: (a) Token distribution across thinking, tool-call, and tool-response segments of the main agent and sub-agents. (b) Average number of tokens per repository.

Composition and Reasoning Expansion. As shown in Figure 2a, tokens are primarily concentrated in sub-agent activities (tool calls and responses), reflecting the detailed implementation process. Crucially, our search optimization significantly deepens the reasoning trace: Sub-Agent-Call-Think tokens more than double, from 900 in Repo2Agent to 2,300 in Repo2Agent-Search. This validates that the search process does not merely refine thoughts but substantially elaborates on the logical steps required for implementation.

Information Expansion. Figure 2b highlights how our paradigm decompresses static code into explicit narratives. Transforming raw code (avg. 4,865.5 tokens) into an agentic trajectory (Repo2Agent-Search, avg. 12,083.4 tokens) significantly increases the per-repository token count by making latent planning and execution steps explicit. Importantly, despite the increased sample length, all model variants are trained on a fixed budget of 12% of 20B total tokens. This ensures a fair comparison: the performance gains are driven by the structural quality and informational density of the trajectories, rather than an increase in the total volume of training data.

5.2 Impact of CoT Optimization

We evaluate the relationship between optimization iterations, CoT length, and target-code perplexity (PPL) using 100 sample trajectories over 10 iterations (Figure 3).

Figure 3: (a) CoT length increases with more CoT-optimization iterations. (b) PPL of the code to be generated decreases with more iterations.

As shown in Figure 3a, the average CoT length correlates positively with the number of iterations, confirming that our search-based method actively elaborates on the initial reasoning to produce more explicit thought processes. Crucially, this elaboration directly improves reasoning quality: Figure 3b illustrates a steady decrease in code PPL as iterations increase. This inverse relationship supports our hypothesis that more detailed reasoning provides a more informative and predictive context, thereby simplifying the subsequent code generation task.

6 Conclusion

In this work, we addressed the limitations of training on static software artifacts by proposing a novel paradigm of understanding via reconstruction.
By reverse-engineering latent agentic trajectories through grounded multi-agent simulation and refining the reasoning via search-based optimization, we transformed static repositories into dynamic, causally rich training data. Our experiments with Llama-3-8B demonstrate that learning from these reconstructed processes significantly enhances coding, reasoning, and agentic capabilities.
Appendix

Table 5: Results on Ruler. We average the results on NIAH-Multi-Key, NIAH-Multi-Value, and NIAH-Multi-Query as NIAH-Multi, and the results on RULER-QA-Hotpot and RULER-QA-Squad as RULER-QA.

Family        Context   Prolong   Raw-Repo   Repo2Agent   Repo2Agent-Search
NIAH-Multi    16384     99.40     99.50      99.60        99.70
NIAH-Multi    32768     98.40     99.00      99.20        99.20
NIAH-Multi    65536     66.20     76.30      68.70        80.40
NIAH-Single   16384    100.00    100.00     100.00       100.00
NIAH-Single   32768    100.00     99.90      99.90        99.90
NIAH-Single   65536     92.10     89.90      90.30        91.30
RULER-CWE     16384     80.40     85.00      79.90        87.30
RULER-CWE     32768     27.60     34.60      33.20        42.30
RULER-CWE     65536      0.30      6.50       0.40         1.60
RULER-FWE     16384     92.70     92.10      95.10        92.20
RULER-FWE     32768     87.70     88.10      90.10        86.00
RULER-FWE     65536     73.70     70.30      79.20        67.90
RULER-QA      16384      8.90     27.90      32.90        28.20
RULER-QA      32768     30.30     33.90      39.00        39.70
RULER-QA      65536     20.30     25.60      21.50        20.80
RULER-VT      16384     99.40     99.00      99.00        98.20
RULER-VT      32768     95.20     96.00      95.30        94.10
RULER-VT      65536     20.50     14.40      18.60        16.60
Average       16384     83.61     86.90      87.50        87.10
Average       32768     81.77     83.20      84.00        84.40
Average       65536     57.10     61.00      58.10        61.80

A Detailed Results on Long-Context Benchmarks

Ruler. Table 5 presents the performance on the Ruler benchmark. The results highlight the robustness of agent-based training data at extreme context lengths.

• Superiority over Raw Code: Our proposed methods consistently outperform the internal Raw-Repo baseline. For instance, at the 16k context length, Repo2Agent achieves 87.50 compared to 86.90 for Raw-Repo. This gap widens in specific tasks: in RULER-CWE (32k), Repo2Agent-Search scores 42.30, significantly outpacing Raw-Repo's 34.60.

• Robustness at 64k: Performance stability at the maximum window size (64k) is a key differentiator. While the Prolong baseline degrades to 57.10 and Raw-Repo to 61.00, Repo2Agent-Search maintains the highest robustness with an average score of 61.80.
• Complex Retrieval (NIAH): The benefits of agentic data are most pronounced in the NIAH-Multi tasks, which require retrieving multiple pieces of scattered information, a process analogous to an agent locating dependencies across a file system. At 64k tokens, Repo2Agent-Search achieves 80.40, drastically outperforming Raw-Repo (76.30) and establishing a massive lead over the Prolong baseline (66.20).

Helmet. The Helmet benchmark results (Table 6) further validate the efficacy of learning from trajectories, particularly in In-Context Learning (ICL) and Recall tasks.

Table 6: Results on Helmet.

Category   Context   Prolong   Raw-Repo   Repo2Agent   Repo2Agent-Search
ICL        16384     72.52     68.08      68.88        73.52
ICL        32768     75.84     71.84      72.72        76.32
ICL        65536     80.68     75.92      77.36        78.72
LongQA     16384     28.13     33.44      36.59        35.04
LongQA     32768     40.40     38.28      41.55        40.20
LongQA     65536     46.78     45.03      44.84        45.48
RAG        16384     64.17     63.88      64.46        64.08
RAG        32768     63.33     63.67      63.13        63.25
RAG        65536     56.00     57.42      57.67        55.58
Recall     16384     99.94     99.94      99.69        99.94
Recall     32768     99.38     98.94      99.19        99.81
Recall     65536     95.75     93.31      91.94        96.00
Rerank     16384     36.11     36.71      38.17        37.35
Rerank     32768     28.90     32.19      33.56        33.69
Rerank     65536     11.31     13.95      14.77        13.42
Avg        16384     60.17     60.41      61.56        61.99
Avg        32768     61.57     60.98      62.03        62.65
Avg        65536     58.10     57.13      57.32        57.84

• Consistent Gains over Raw-Repo: Across all context lengths (16k, 32k, and 64k), Repo2Agent-Search consistently achieves a higher average score than the Raw-Repo baseline. Notably, at 32k, our search-optimized model reaches 62.65 compared to 60.98 for raw code, demonstrating that the reasoning steps injected during training translate to better general understanding.

• Recall and ICL Capabilities: Repo2Agent-Search excels in tasks that mirror the "Recall-Plan-Act" loop of our synthetic agents.
In the Recall category, it outperforms both Raw-Repo and the external Prolong baseline at 32k (99.81) and 64k (96.00). Similarly, in ICL (In-Context Learning), our method dominates at 16k and 32k, suggesting that observing the step-by-step history of code construction improves the model's ability to learn patterns from context.

• Competitive Performance at Scale: While the Prolong baseline shows strength in the LongQA sub-task at 64k, leading to a slightly higher overall average (58.10), our method remains highly competitive (57.84) and, crucially, still outperforms the Raw-Repo baseline (57.13).

To interpret these results correctly, it is necessary to distinguish the roles of the baselines. While Prolong serves as an external state-of-the-art reference, Raw-Repo is the primary controlled ablation. In summary, both benchmarks confirm that transforming raw repositories into dynamic agent trajectories is a more effective strategy than standard pre-training on static code, yielding models with sharper retrieval and reasoning capabilities.

B Case Study: Example of Synthetic Pretraining Data

C Case Study: Evolution of CoT

To provide a concrete illustration of how our LongCoT optimization refines the agent's reasoning, we present a case study tracking the evolution of a single Chain-of-Thought (CoT) step through multiple rounds of search. The task is to generate the code for a Python script, 2_Connect_Postgres_DB.py, which involves database interaction within a Streamlit application.

Original CoT in a Synthetic Agent Trajectory

Now I understand that authentication state is managed via st.session_state by the authenticator. The 'status_2FA' flag is likely set in another page after 2FA setup, so I will proceed to implement the access control based on session state as described. I will now generate the full code for '2_Connect_Postgres_DB.py', including:
1. Page title
2. Loading authentication config from 'config.yaml'
3. Checking login and 2FA status
4. Initializing database connection with @st.cache_resource
5. Query execution with @st.cache_data
6. Displaying results in the required format
7. Error handling for database operations
I will write the code directly since no other function definitions are needed.

The CoT after 1st-Round Search Optimization

Now I understand that the authentication state is managed through `st.session_state` by the authenticator. The `status_2FA` flag is likely set in a different page after 2FA setup, so I need to ensure the access control checks both `authentication_status` and `status_2FA` in the session state. I will proceed to implement the access control logic as described. Next, I will generate the full code for `2_Connect_Postgres_DB.py`, making sure to:
1. Set the page title.
2. Load the authentication configuration from `config.yaml`.
3. Check for valid login and 2FA status in the session state.
4. Initialize the PostgreSQL connection using `@st.cache_resource` for one-time initialization.
5. Execute a query using `@st.cache_data` with a 5-minute cache TTL.
6. Display the query results in the specified format.
7. Include proper error handling for database connection and query execution.
I will now write the full code, ensuring all requirements are met.

The Original CoT (first box above), generated by the initial simulation, is functional but generic. It outlines a correct but high-level plan, listing seven necessary steps, such as "Checking login and 2FA status" and "Initializing database connection". The reasoning is sparse, providing little detail on the "how" or "why" behind each step.

After the first round of optimization (second box above), the CoT becomes more specific and technically precise. Key improvements include:

• Explicit Variable Checks: It explicitly mentions the need to check both authentication_status and status_2FA in the session state, a detail absent in the original plan.
• Technical Justification: It begins to add rationale, explaining that @st.cache_resource is for "one-time initialization" and specifying a "5-minute cache TTL" for @st.cache_data. This demonstrates a deeper understanding of the tools being used.

The final CoT after the second round (below) represents a significant leap in reasoning quality, transforming a simple checklist into a comprehensive implementation blueprint.

The CoT after 2nd-Round Search Optimization

Now I understand that the authentication state is managed through `st.session_state` by the authenticator. The `status_2FA` flag is likely set in a different page after 2FA setup, so I need to ensure the access control checks both `authentication_status` and `status_2FA` in the session state. I will proceed to implement the access control logic as described. Next, I will generate the full code for `2_Connect_Postgres_DB.py`, making sure to:
1. Set the page title to "Connect Postgres DB" to align with the user's requirement.
2. Load the authentication configuration from `config.yaml` as it's used to initialize the authenticator, which is crucial for the authentication flow.
3. Check for valid login and 2FA status in the session state. If either is missing or not verified, display a warning and prevent further execution.
4. Initialize the PostgreSQL connection using `@st.cache_resource` for one-time initialization, ensuring that the connection is not re-established on every page rerun.
5. Execute a query using `@st.cache_data` with a 5-minute cache TTL, which is specified in the requirement to cache query results.
6. Display the query results in the specified format, looping through the fetched rows and formatting the output as `{row[0]} has a :{row[1]}:`.
7. Include proper error handling for database connection and query execution to ensure the app handles failures gracefully.
I will now write the full code, ensuring all requirements are met, including environment variable loading, caching, and user access control. I will also make sure the code is modular, readable, and follows best practices for Streamlit app development.

The enhancements are substantial:

• Detailed Rationale and User Intent: Each step is now accompanied by a rich explanation that links the action to a requirement. For example, it specifies the exact page title to "align with the user's requirement" and explains why loading config.yaml is "crucial for the authentication flow".

• Elaboration on Edge Cases and Best Practices: The plan now includes explicit error-handling logic ("display a warning and prevent further execution") and implementation details ("looping through the fetched rows and formatting the output").

• Holistic Project Awareness: The concluding thought expands beyond the immediate file, mentioning broader concerns like "environment variable loading" and "modularity, readability, and best practices for Streamlit app development". This indicates a shift from a narrow, file-centric view to a more holistic, project-aware mindset.

This qualitative analysis empirically demonstrates that our search-based optimization does not simply rephrase CoTs. It systematically enriches the reasoning process, making it more detailed, explicit, and context-aware. This enriched reasoning, which more closely mirrors that of an expert developer, provides a much stronger learning signal for the model, which we believe is a key factor behind the performance improvements observed in our experiments.

D Prompt

Prompt for Generating the Main-agent Trajectory

A github repo: $repo_code
The tree structure of repo: $file_tree.
Given the repo code and the tree structure of the repo, I want to use it to construct multi-agent synthetic data.
The main agent needs to generate the implementation plan for the repo based on the detailed requirement document of the repo provided by the user, including the tree structure of the repo and the implementation order of files. It will also call sub-agents to realize the code generation of each file.

```
[
  {
    "role": "system_prompt",
    "content": "you are a helpful assistant. ... Show the sub-agent tool usage here."
  },
  {
    "role": "user",
    "content": "A detailed requirement document for repo, but DO NOT mention implementation details of repo"
  },
  {
    "role": "gpt",
    "content": "tree structure of repo, implementation order of repo, call sub-agent to generate code for the first file",
    "tool-call": {
      "function_name": "code_generator",
      "arguments": {
        "requirement_for_repo": "requirement for repo",
        "tree_structure": "tree structure of repo",
        "file_name": "first_file.py",
        "file_path": "first_file.py",
        "requirement": "requirement for first_file.py"
      }
    }
  },
  {
    "role": "tool-response",
    "content": "return of function call"
  },
  {
    "role": "gpt",
    "content": "call sub-agent to generate code for the second file",
    "tool-call": { ... }
  },
  ...
]
```

The Tool usage which should be put at the system prompt:
Arguments of sub-agent: {requirement for repo, tree structure, file_name, file_path, requirement for file}
Return of sub-agent: {file_path has been generated successfully}

The memory of the main agent should cover the planning of all the files in the repo, and call code_generator to generate all these files.

Prompt for Generating the Sub-agent Trajectory

I have a GitHub repo, and I want to use it to construct multi-agent synthetic data. The main agent needs to generate the implementation plan for the repo based on the detailed requirement document of the repo provided by the user, including the repo's tree structure and the implementation order of files. It will also call sub-agents to realize the code generation of each file.
The sub-agent requires information provided by the main agent, including the repo's requirement document, the repo's tree structure, the name and path of the code file that the sub-agent needs to generate, and the requirement description for this code file. Your task is to generate a JSON list representing the simulated sub-agent's memory. This memory should chronicle the step-by-step thought process of creating a specific file from scratch, based on a user's requirement. **Crucially, you are simulating the *creation* process, not explaining or refactoring existing code.** The agent you are simulating does not have access to the final source code at the beginning; it must figure out how to write it.

The format of the memory is as follows:

[
  {
    "role": "system_prompt",
    "content": "You are 'code_generator', an expert software engineer. \nYour goal is to implement robust, production-ready code from a given requirement.\n\nWorkflows:\n1. ANALYZE the file requirement and its place in the repo structure.\n2. IDENTIFY dependencies. If you need to use external classes/functions, use the `read` tool to check their definitions first.\n3. PLAN the implementation details (class structure, methods, logic).\n4. WRITE the code using the `write` tool.\n\nTools:\n- read(file_to_read): Returns the definition/signature of a file. Usage: When you need to understand how to invoke another module.\n- write(file_path, content): Writes the code to the file system.\n- final_answer(answer): Reports completion."
  },
  {
    "role": "user",
    "content": "requirement for repo, tree structure, file_name, file_path, requirement for file"
  },
  {
    "role": "gpt",
    "content": "Here, the agent analyzes the requirement. It decides if external dependencies need to be checked based on the specific logic needed. It expresses curiosity or caution about specific interfaces it might need to interact with.",
    "tool-call": {
      "function_name": "read",
      "arguments": { "file_to_read": "file name" }
    }
  },
  {
    "role": "tool-response",
    "content": "the content of the file that was read"
  },
  {
    "role": "gpt",
    "content": "Here, the agent synthesizes the information from the requirement and any dependencies it read. It DOES NOT just list 'Plan: 1, 2, 3.' Instead, it narrates its engineering decisions, mentions specific variable names it *plans* to use, considers edge cases for the `file_name`'s logic, and explicitly reasons about how its planned implementation will satisfy the requirements.",
    "tool-call": {
      "function_name": "write",
      "arguments": {
        "file_path": "path of file",
        "content": "The source code that the agent decides to write."
      }
    }
  }
  ... (rest of the JSON structure)
]

To help you generate this simulated memory, you are provided with the following information. Use it as a guide to construct a realistic and accurate thought process.

* **Information to construct the user prompt:** `$arguments_from_main_agent`
* **The Golden Source Code for `$file_name` (The Goal):** This is the target code the simulated agent should ultimately produce. **You must not assume the agent has seen this code beforehand.** Use it as the "ground truth" to form a plausible thinking path that leads to this exact implementation. `$source_code`
* **Source code of related files (Dependencies):** This is the content the agent will see when it uses the `read` tool on other files. `$related_source_code`

### CRITICAL INSTRUCTION: THOUGHT PROCESS DIVERSITY

The `content` fields in the "gpt" turns must contain **highly intelligent, specific, and varied** thought processes. **STRICTLY AVOID** using the same template (e.g., "Okay, I have checked... Plan: 1. 2. 3.") for every file.

**Follow these guidelines to generate the thought process:**

1.
**Context-Driven Reasoning**:
- If the **target** `$source_code` contains complex algorithms, the simulated thought process should focus on algorithmic efficiency and data structures.
- If the **target** `$source_code` is a simple DTO or config file, the thought process should be brief and focused on correctness.
- **Mention specific names**: The thought process MUST mention the actual class names, variable names, or function names found in the **target** `$source_code` and `$related_source_code` as part of its reasoning and planning.

2. **Dependency Logic**:
- When simulating a `read` call: Explain *specifically* what the agent is looking for (e.g., "I need to see if the `User` class has a `get_id` method or just a public `id` field before I can implement the logic that uses it.").
- After simulating a `read`: The agent should react to the content found (e.g., "Ah, I see `User`'s constructor requires a positional argument, not a keyword argument. I'll make sure to call it correctly in my implementation.").

# Output Format
Return strictly a JSON list representing the memory.

Prompt for Optimizing CoT

You are an expert software engineer. Your task is to simulate the human reasoning process required to solve a programming problem.

**The Goal:** You need to rewrite a specific part of a reasoning chain (the "Target Block"). The goal is to make the reasoning logic more precise, detailed, and aligned with the correct solution, WITHOUT breaking the narrative flow.

**Input Data:**
1. **Reference Source Code:** (The correct answer, for your understanding ONLY) {}
2. **Full Reasoning Context:** (The story so far) {}
3. **Target Block to Rewrite:** (The weak step needs replacing) {}

**CRITICAL INSTRUCTIONS (Read Carefully):**
1. **The "Time Travel" Rule:** You must act as if you are solving this problem *for the first time*. You do NOT know the final code yet; you are currently deriving it.
* **STRICTLY FORBIDDEN:** Do not mention "Reference Code", "Provided Solution", or "Ground Truth".
* **CORRECT APPROACH:** Instead of saying "The reference code uses a HashMap...", say "I think a HashMap would be the best data structure here because..."

2. **The "Invisible Stitch" Rule:** Your output will be copy-pasted directly into the original text to replace the old block. It must fit perfectly.
* **STRICTLY FORBIDDEN:** Do not verify or announce the correction. Never use phrases like "In this refinement...", "Correcting the previous step...", "Here is the better reasoning...", or "Let's refine this".
* **CORRECT APPROACH:** Just write the thought process directly. Start immediately with "I need to analyze...", "Next, I will...", etc.

3. **Tone & Style:**
* Use **First-Person Singular** ("I check...", "I decide...").
* Use **Present Tense** (Reasoning happens *now*).
* Be technical, precise, and deductive.

**Your Workflow:**
1. **Analyze (`` tags):**
* Briefly analyze the Reference Code to understand the *correct* logic.
* Identify why the original `Target Block` was weak or incorrect.
* Plan the logic steps needed to bridge the gap.
2. **Generate (`` tags):**
* Write the purely deductive thought process.
* Ensure it starts and ends in a way that connects with the surrounding text in `reasoning_chain`.

Now, generate the replacement block.

[Your analysis of the gap between the reasoning and the code]

[The seamless, first-person reasoning stream ONLY]
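The optimization prompt above performs a single rewrite step; the search described in Section 5.2 then keeps a rewritten CoT only when it lowers the perplexity (PPL) of the ground-truth code. The outer loop can be sketched as follows. This is a minimal mock, not the released pipeline: `score_ppl` stands in for a real model's conditional PPL of the code given the CoT, and `propose_rewrite` for an LLM call using the prompt above.

```python
import math

def ppl_from_logprobs(token_logprobs):
    """Perplexity = exp(-mean per-token log-likelihood)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def search_optimize_cot(cot, code, score_ppl, propose_rewrite, iterations=10):
    """Greedy search over CoT rewrites: accept a candidate only if it
    reduces the PPL of the ground-truth code conditioned on the CoT."""
    best_ppl = score_ppl(cot, code)
    for _ in range(iterations):
        candidate = propose_rewrite(cot)           # one rewrite step (LLM call)
        candidate_ppl = score_ppl(candidate, code)
        if candidate_ppl < best_ppl:               # keep only improving rewrites
            cot, best_ppl = candidate, candidate_ppl
    return cot, best_ppl

# Toy stand-ins: longer, more explicit CoTs score lower PPL, mirroring Figure 3.
mock_ppl = lambda cot, code: 10.0 / (1 + len(cot.split()))
mock_rewrite = lambda cot: cot + " because the session state must be checked first"
cot, ppl = search_optimize_cot("write the file", "print('hi')", mock_ppl, mock_rewrite, 3)
```

As a sanity check on `ppl_from_logprobs`: four tokens each with probability 0.5 give a perplexity of exactly 2.0; and in the toy run above each accepted rewrite lengthens the CoT, so PPL falls monotonically across iterations, matching the trend in Figure 3b.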
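Trajectories produced from the templates above can be checked mechanically before entering the pretraining mix. The sketch below is our illustration, not the paper's pipeline: it assumes the role names from the templates (`system_prompt`, `user`, `gpt`, `tool-response`) and enforces one invariant, namely that every `gpt` turn carrying a `tool-call` is answered by a `tool-response`.

```python
def validate_trajectory(memory):
    """Check that a synthetic agent memory follows the template's shape:
    system_prompt, user, then gpt turns whose tool-calls are each answered."""
    if len(memory) < 2:
        return False
    if memory[0]["role"] != "system_prompt" or memory[1]["role"] != "user":
        return False
    awaiting_response = False
    for msg in memory[2:]:
        if awaiting_response:
            if msg["role"] != "tool-response":
                return False           # a tool-call must be answered next
            awaiting_response = False
        elif msg["role"] == "gpt":
            awaiting_response = "tool-call" in msg
        else:
            return False               # unexpected role ordering
    return not awaiting_response       # no dangling tool-call at the end

good = [
    {"role": "system_prompt", "content": "you are a helpful assistant."},
    {"role": "user", "content": "A detailed requirement document for repo"},
    {"role": "gpt", "content": "call sub-agent for the first file",
     "tool-call": {"function_name": "code_generator"}},
    {"role": "tool-response", "content": "first_file.py has been generated successfully"},
]
print(validate_trajectory(good))       # the well-formed example passes
print(validate_trajectory(good[:3]))   # a dangling tool-call fails
```

A real filter would also check the argument schemas shown in the prompts (e.g., that `code_generator` receives `file_name` and `file_path`), but the role-ordering invariant alone already catches truncated simulations.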
