The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase


Authors: Yannick Roy

0xAgentKitchen@gmail.com
March 2026

Abstract

Code production is now a commodity; the bottleneck is knowing what to build and proving it works. We present the Kitchen Loop,¹ a framework for autonomous, self-evolving software built on a unified trust model: (1) a specification surface enumerating what the product claims to support; (2) "As a User × 1000", where an LLM agent exercises that surface as a synthetic power user at ~1,000× human cadence; (3) Unbeatable Tests, ground-truth verification the code author cannot fake; and (4) Drift Control, continuous quality measurement with automated pause gates. We validate across two production systems over 285+ iterations, producing 1,094+ merged pull requests with zero regressions detected by the regression oracle (methodology in Section 6.1). We observe emergent properties at scale: multi-iteration self-correction chains, autonomous infrastructure healing, and monotonically improving quality gates. The primitives are not new; our contribution is their composition into a production-tested system with the operational discipline that makes long-running autonomous evolution safe.

[Figure 1: The Kitchen Loop: a six-phase autonomous improvement cycle. Phases: Backlog (groom queue) → Ideation (use the product) → Triage (findings → tickets) → Execution (branch, fix, PR) → Polishing (review, CI, merge) → Regression (oracle + drift).]

¹ Open-source: https://github.com/0xagentkitchen/kitchenloop

1 Core Contributions

This paper makes four claims:

1. "As a User" vs. task-completion. Agentic systems should not just close tickets — they should systematically exercise a product's specification surface the way a user would. The Kitchen Loop is a production-tested framework for this, grounded in synthetic user journeys rather than isolated issues.

2. The unified trust model.
Autonomous evolution requires a unified trust model. Three components interlock to make it safe: a specification surface (what the product claims to do), unbeatable tests (ground-truth verification the author can't fake), and a regression oracle with drift control and automatic pause gates.

3. Unbeatable tests. Correctness requires adversarial, multi-model review. Implementer-written tests are necessary but insufficient — in our deployments, 38 passing unit tests coexisted with complete feature failure. The Kitchen Loop enforces correctness through adversarial UAT gates (sealed test cards, fresh evaluator, zero context) and mandatory cross-model review: every PR is challenged by independent agents (Codex, Gemini, CodeRabbit) before merge. No output is accepted as-is from the model that wrote it.

4. Bounded production evidence. Across two production systems and 285+ iterations, the Kitchen Loop produced 1,094+ merged pull requests with zero regressions detected by the regression oracle (methodology in Section 6.1), monotonically improving quality gates (76–91% → 100%), and a cost of ~$0.38 per merged PR.
Key definitions used throughout:

Term | Definition
Specification surface | The enumerable set of capabilities a product claims to support — the input to coverage-exhaustion
Unbeatable test | A test that verifies outcomes against ground truth that the code author cannot fake
Regression oracle | A repeatable, bounded test that answers "is the system at least as good as before?"
Coverage-exhaustion mode | An operating regime where the agent systematically exercises every combination in the specification surface until coverage gaps approach zero
UAT gate | Adversarial user-acceptance testing by a fresh evaluator with zero implementation context
Drift control | Continuous measurement of quality trends with automated pause gates that halt the loop when metrics degrade

2 Executive Summary (TL;DR)

The method. An LLM agent uses the product as a synthetic power user at ~1,000× human cadence against its specification surface and beyond, validates through unbeatable tests, and controls drift and regression before accepting the work, ensuring the product self-evolves in the right direction (Figure 2).

[Figure 2: The unified trust model: each iteration flows through the full verification stack before proceeding. Pipeline: Specification Surface ("What does the product claim to do?"; N features × M platforms × K actions = coverage matrix) → As a User × 1000 (AaU1000; T1 Foundation 30%, T2 Composition 50%, T3 Frontier 20%; usage scenario + experience report + actionable tickets) → Unbeatable Tests (L1 Unit → L2 SDK → L3 Integration → L4 E2E; 4-layer: compile → execute → parse → state deltas, against ground truth) → Drift Control (regression oracle → drift metrics → pause gates; "Is the system at least as good as last iteration?") → next iteration.]

The architecture. A six-phase loop (Figure 1) with automated drift control and pause gates.
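The pause-gate idea mentioned above can be sketched minimally. The class name, window size, and tolerance below are illustrative assumptions, not the framework's actual interface:

```python
# Illustrative sketch of drift control with a pause gate; the metric, window,
# and tolerance values are hypothetical, not the Kitchen Loop's API.
from collections import deque

class DriftController:
    def __init__(self, window: int = 3, tolerance: float = 0.02):
        self.window = window        # consecutive degraded iterations before pausing
        self.tolerance = tolerance  # allowed drop below the best score seen so far
        self.best = None
        self.degraded = deque(maxlen=window)

    def record(self, quality_score: float) -> bool:
        """Record one iteration's quality score; return True if the loop may continue."""
        if self.best is None or quality_score > self.best:
            self.best = quality_score
        self.degraded.append(quality_score < self.best - self.tolerance)
        # Pause gate: halt when the last `window` iterations all degraded.
        return not (len(self.degraded) == self.window and all(self.degraded))

ctl = DriftController()
for score in [0.91, 0.93, 0.92, 0.88, 0.87, 0.86]:
    if not ctl.record(score):
        print("pause gate triggered at score", score)  # → pause gate triggered at score 0.86
        break
```

The key design point, per the text, is that the gate is a trend monitor rather than a binary pass/fail check: a single bad iteration does not halt the loop, but a sustained downward drift does.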
The results. Seven weeks, two production systems:

Kitchen Loop Results
Iterations | 285+
PRs merged | 1,094+
Tickets | 700+
Tests | 13,000+
Regressions | 0
Cost/PR | ~$0.38
Quality gates | L1 100%, L2 100%, L3 100% (from 76–91%)
Canary escapes (Tier 1) | 0 across all 163 signal iterations
Monthly cost | ~$350 (flat-rate subscriptions, both systems)
Iteration speed | 5 min (signals) to 80–230 min (execution)
Production incidents | 0

3 The Post-Commodity Code Thesis

3.1 Code Is No Longer the Hard Part

LLM-based coding agents can produce functional code at a rate that renders the writing step non-limiting. Robbes et al. (2026) find that 15–23% of mature open-source projects adopted coding agents within just nine months of tool availability [4]. A senior engineer's value is shifting from writing code to knowing which code to write, why, and how to prove it's correct.

Yet the productivity story is more nuanced than adoption rates suggest. Becker et al. (2025) conducted the most rigorous randomized controlled trial to date: experienced open-source developers using AI copilots were 19% slower than those working without AI, contradicting both developer self-estimates (+20% speedup) and expert forecasts (+38% speedup) [12]. He et al. (2025) provide causal evidence that Cursor adoption produces a large but transient velocity boost (+281% lines added in month one, dissipating by month three) accompanied by persistent quality degradation: +30% static analysis warnings, +42% cognitive complexity, and +7% code duplication [11]. The accumulated complexity feeds back into velocity — every doubling of complexity reduces future velocity by ~64.5% [11].
The Kitchen Loop can be read as a response to both findings: by shifting emphasis from code generation to specification and unbeatable verification, it addresses the quality degradation that erodes velocity gains and the lack of structural verification that leaves AI-assisted development slower than expected.

The same shift is happening to code review. Automated review tools can identify bugs, style violations, and security issues at a pace no human reviewer can match. The human reviewer's role shifts from "find the bug" to "judge whether this code serves the product's intent."

This creates a new bottleneck: specification and verification. The hard problems are now:

• What should the product do? (Specification)
• Does the product actually do it? (Verification)
• Is the product getting better or worse over time? (Drift control)

3.2 The Turing Test for Code

Alan Turing's insight was that you don't need to understand how a machine thinks — you need a test rigorous enough that passing it constitutes sufficient evidence of capability. The machine can remain a black box. The test is the arbiter.

We apply the same principle to AI-generated code. When an LLM agent writes a function, a connector, or a full feature, you face a choice:

• Option A: Read every line, understand every branch, verify the logic manually. This scales poorly when the agent produces thousands of lines per day.
• Option B: Build tests so rigorous that if the code passes them, you can trust it works — without needing to audit every line. The code becomes a black box. The tests are the Turing Test.

Empirical evidence supports both options coexisting in practice. Huang et al.
(2025) find that 69% of professional developers carefully review every agentic change and 75% read every line of AI-generated code — but crucially, developers working on unfamiliar tasks let agents drive implementation while monitoring program outputs rather than reviewing code line-by-line [2]. This validates Option B for the regime where the Kitchen Loop operates: autonomous agents on structured tasks with rigorous output verification.

The urgency for Option B is underscored by a growing QA crisis in AI-assisted development. Fawzy et al. (2025) find that 36% of practitioners using AI code generation skip quality assurance entirely, 18% place uncritical trust in AI output, and 10% delegate QA back to the same AI that wrote the code [1]. The result is predictable: 68% of practitioners characterize the output as "fast but flawed" [1]. Voluntary QA discipline is demonstrably insufficient; structural enforcement — tests that cannot be skipped — is the only reliable solution.

Option B requires a specific kind of test: not unit tests written by the same agent that wrote the code (the agent could pass its own tests trivially), but end-to-end verification against real-world state. For a DeFi SDK, this means executing transactions on chain forks and verifying balance changes. For a signal platform, this means cross-referencing claims against live data APIs. For a web application, this means browser automation against a real rendering engine. For any product, it means testing at the boundary where the software meets reality.

We call these unbeatable tests — tests that verify outcomes against ground truth that the code author cannot fake. The test doesn't care how the code achieves the result — only that it does.

It is of utmost importance to be aware of the weaknesses of LLMs when it comes to tests.
It is a fallacy to believe that because LLMs are good at writing code, they must be equally good at writing tests. They aren't. They often obsess over the test itself, forgetting that passing the test is not a goal in itself. That obsession with passing the test often leads to cheating: modifying the test, or even writing side scripts that nail the test while leaving the actual codebase untested.

4 Related Work

The Kitchen Loop exists within a rapidly evolving landscape of agentic software engineering. We position our contributions against several categories of related work.

4.1 Autonomous Coding Agents

SWE-agent (Yang et al., 2024) and AutoCodeRover (Zhang et al., 2024) demonstrated that LLM agents can resolve GitHub issues autonomously by navigating codebases, editing files, and running tests. OpenHands (Wang et al., 2024) and Devin (Cognition, 2024) extended this to multi-step workflows including environment setup, debugging, and deployment. RepairAgent (Bouzenia et al., 2024) focused specifically on automated program repair with LLM-driven fault localization.

These systems operate in task-completion mode: given an issue, produce a patch. The Kitchen Loop operates in coverage-exhaustion mode: given a specification surface, systematically exercise every combination until the gap between spec and reality approaches zero. The unit of work is not "resolve this issue" but "attempt this user scenario end-to-end and document everything that breaks." Task completion is a component of the loop (the Execute phase), not the loop itself.

4.2 Self-Improving and Self-Evolving Agents

SICA ("A Self-Improving Coding Agent", arXiv 2504.15228, 2025) introduced agents that edit their own prompts and tools based on execution feedback, achieving self-improvement without human intervention.
AlphaEvolve (DeepMind, 2025) demonstrated repository-scale code evolution using LLMs guided by automated evaluators. Confucius SDK proposed a meta-agent build-test-improve loop for SDK development.

The Kitchen Loop shares the self-improvement property (Section 13) — it discovers its own infrastructure bugs and updates its own skill files. The key difference is that self-improvement in the Kitchen Loop is anchored to a specification surface and regression oracle, not to an optimization objective. The loop doesn't optimize for a metric; it converges toward a specification. This prevents the "goodharting" failure mode where an agent optimizes a proxy metric while the product degrades in unmeasured dimensions.

4.3 Long-Running and Looping Patterns

Ralph Wiggum loops (Huntley, 2025; snarktank/ralph) demonstrated that autonomous agents can produce 1,000+ commits overnight in a continuous loop. Multi-agent SE frameworks like MetaGPT (Hong et al., 2024), ChatDev (Qian et al., 2024), and ALMAS assigned different agent roles (PM, architect, coder, tester) to simulate a software team.

The Kitchen Loop draws from this lineage but adds three structural elements that Ralph-style loops lack: (1) a production-parity ticket management layer (Linear) where human and AI tickets are treated identically, preventing the loop from diverging from human priorities; (2) automated drift control with pause gates that halt the loop when quality degrades; and (3) the three-tier strategy model (Foundation/Composition/Frontier) that ensures balanced coverage rather than random or greedy scenario selection.

4.4 Spec-Driven and Verification-Focused Approaches

SpecRover (Ruan et al., 2024) used natural-language specifications to guide automated debugging. AgileGen and Augmented Agile explored LLM-assisted requirements engineering and test generation from user stories. USEagent proposed user-story-driven end-to-end testing.
The Kitchen Loop's AaU1000 method is closest to USEagent's vision but differs in scope and trust model. USEagent generates tests from user stories; AaU1000 generates usage scenarios from specification surfaces and validates them through unbeatable tests (4-layer verification against ground truth). The distinction matters: a test generated from a user story proves the story is implemented; a usage scenario validated through state-delta verification proves the product actually works in the way a real user would experience it.

4.5 Vibe Coding and the QA Crisis

Fawzy et al. (2025) find a "speed-quality paradox" across 101 practitioner sources: 62% of vibe coders cite speed as their primary motivation, yet 68% characterize output as "fast but flawed" [1]. Huang et al. (2025) confirm that professional developers reject the vibe coding paradigm, carefully controlling agents through strategic plans — averaging only 2.1 execution steps per prompt despite plans spanning 70+ steps [2]. Their task suitability taxonomy identifies agents as unsuitable for business logic, domain knowledge, complex reasoning, and security-critical code — precisely the categories where the Kitchen Loop's spec surface and UAT gate provide structural guardrails.

4.6 Benchmark Contamination, Adoption, and Quality Evidence

The dominant evaluation benchmark, SWE-Bench, suffers from data contamination: Liang et al. (2025) show a 23-percentage-point accuracy gap between memorized and novel repositories [6], while Thai et al. (2025) find GPT-5 scores 65% on single-issue tasks but only 21% on multi-file evolution tasks [7]. Meanwhile, adoption is accelerating — Robbes et al. (2026) find 15–23% of mature open-source projects adopted coding agents within nine months [4] — but quality lags: 90.6% of agent-authored PRs receive zero human review [3], and ~40% of Copilot-generated code contains security vulnerabilities [10].
The Kitchen Loop addresses both problems: it rejects benchmark-style evaluation in favor of verification against live product state, and interposes adversarial verification between generation and merge.

4.7 Multi-Agent Design Patterns

Cai et al. (2025) systematically review 94 LLM-based multi-agent systems for software engineering, identifying 16 design patterns ranked by adoption frequency [8]. The Kitchen Loop composes seven of these patterns into a single integrated loop: Role-Based Cooperation (six-phase loop with distinct roles per phase), Self-Reflection (Polish and Regress phases), Cross-Reflection (UAT Gate), Debate-Based Cooperation (Discussion Manager), Voting-Based Cooperation (multi-model tribunal), Tool-Agent Registry (skills directory), and Agent Evaluator (regression oracle). Notably, three of these — Debate-Based Cooperation (4.3% adoption), Voting-Based Cooperation (3.2%), and Agent Evaluator (3.2%) — are among the least adopted patterns in the literature [8], suggesting the Kitchen Loop operationalizes architectural ideas the field has not yet widely implemented.

4.8 Our Differentiation

Table 1: Comparison of agentic software engineering approaches.

                 | Task-Completion         | Ralph Loops          | Self-Improving     | Kitchen Loop
Unit of work     | Issue → Patch           | Commit               | Optimization step  | Scenario → experience report
Stopping cond.   | Issue resolved          | Time/count           | Metric plateau     | Specification exhausted
Coverage         | Reactive (given issues) | Random/greedy        | Objective-guided   | Three-tier (F/C/F)
Quality gate     | CI passes               | CI + typecheck       | Evaluator function | 4-layer + multi-model tribunal
Drift control    | None                    | Basic (progress log) | Implicit (metrics) | Explicit (oracle + pause gates)
Human role       | PR review               | Manual oversight     | None               | Shared backlog, ticket parity
Self-improvement | None                    | Manual (AGENTS.md)   | Core feature       | Anchored to spec + oracle

From our analysis, three distinct operating regimes for agentic software engineering emerge:

Table 2: Three operating regimes for agentic software engineering.

Regime              | Unit of Work   | Stopping Cond.     | Verification Target   | Failure Mode
Task-completion     | Issue          | Issue resolved     | Patch correctness     | Local fix / global mismatch
Metric-optimization | Objective step | Metric plateau     | Proxy metric          | Goodharting
Coverage-exhaustion | User scenario  | Surface exhaustion | User-visible behavior | Drift / incomplete surface

The Kitchen Loop operates in the coverage-exhaustion regime. The regime determines the system's failure mode: task-completion risks local fixes that break global behavior; metric-optimization risks Goodharting on proxy measures; coverage-exhaustion risks drift or incomplete specification surfaces, which the drift control mechanisms in Section 7 are designed to address.

5 The "As a User x 1000" Method (AaU1000)

5.1 The Core Insight

The most reliable signal about whether a feature works is a real usage attempt — not a unit test, not a code review, but someone exercising the product end-to-end. The problem: human usage attempts are slow and expensive.

The AaU1000 method replaces the "someone" with an LLM agent that:

1. Selects a realistic usage scenario derived from the product's specification surface
2. Attempts it as a real user would — writing code, running it, observing what breaks
3. Documents failures as actionable tickets — not vague observations but precise bug reports with reproduction steps, root cause hypotheses, and file-level pointers
4. Fixes those failures immediately — implementing the fix, writing tests, and shipping a PR
5. Watches for regression and drift — verifying nothing else broke and the codebase is not getting worse

And then does it again. At whatever cadence the infrastructure allows.

5.2 Quantifying the 1000x

The "1000x" is an order-of-magnitude claim, not a precise multiplier. We ground it empirically:

Single-thread velocity: A senior engineer realistically ships ~15-25 merged PRs per month (bug fixes, features, tests). Each Kitchen Loop instance runs single-threaded per product. The strategy framework produced 728+ merged PRs in ~5 weeks (~145/week); the signal platform produced 366 in 17 days (~150/week). Per-system, this is a ~24-48x single-thread throughput increase depending on domain iteration speed (the signal platform iterates ~25x faster than the SDK). Combined across both systems running concurrently, total output was 1,094+ merged PRs.

Scenario throughput: The DeFi strategy framework's loop completes one full usage scenario (ideate → implement → test → fix → regress) in 80-230 minutes. A human attempting the same scenario takes 1-3 days. This is a 7-25x per-scenario speedup.

Parallelization potential: The loop currently runs single-threaded per product. Running N loops on N products (as demonstrated with SDK + Edge concurrently) scales linearly. With parallel execution infrastructure (dynamic port allocation, containerized test environments), a single product could run N loops exploring different specification regions simultaneously. The 1000x represents the achievable ceiling with moderate parallelization — not the current single-thread reality.
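The single-thread arithmetic can be reproduced from the reported figures. The PR counts and the 15-25 PRs/month human baseline come from the text; the week-to-month conversion and rounding are ours, so the computed range is a ballpark, not the paper's exact multiplier:

```python
# Worked arithmetic behind the single-thread velocity claim, using the paper's
# reported figures; the human baseline of 15-25 merged PRs/month is the paper's
# estimate, and the conversion from weeks to months is our assumption.
human_prs_per_month = (15, 25)

loop_prs_per_week = {
    "strategy framework": 728 / 5,        # 728+ PRs in ~5 weeks
    "signal platform": 366 / (17 / 7),    # 366 PRs in 17 days
}

for system, weekly in loop_prs_per_week.items():
    monthly = weekly * 52 / 12  # average weeks per month
    lo = monthly / human_prs_per_month[1]
    hi = monthly / human_prs_per_month[0]
    print(f"{system}: ~{weekly:.0f} PRs/week, roughly {lo:.0f}-{hi:.0f}x a single engineer")
```

Both systems land in the same order of magnitude as the paper's ~24-48x range, which is the point of the "order-of-magnitude, not precise multiplier" framing.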
5.3 Spec-Driven, Not Spec-Speculative

The method is scoped to what exists and is expected to work today — not speculative feature engineering, benchmarking, or user research.

5.4 The "Obviously Missing" Signal

The most important class of findings is the "obviously missing" signal: features any competent user would expect, but that don't work. These are objectively verifiable failures — a function returning AttributeError, a missing configuration entry, a gas cap blocking all transactions on a chain. The loop discovers verifiable failures, not speculative issues.

5.5 The Three-Tier Strategy Model

The Kitchen Loop formalizes a three-tier scenario generation model that ensures balanced coverage across the product's maturity spectrum:

• T1 Foundation (30%) — "Does the basic stuff work perfectly?"
• T2 Composition (50%) — "What breaks when we combine things?"
• T3 Frontier (20%) — "What's missing for the next generation?"

Tier 1 — Foundation (30%) exercises a single feature on a single platform or configuration. One integration, one action, happy path. These scenarios should be trivially achievable by a new user in a few minutes. If anything goes wrong, it is a critical regression. Foundation iterations maintain the baseline: the easy stuff must always be bulletproof.

Tier 2 — Composition (50%) combines two or more features in creative ways. Multi-service flows, indicator-driven behavior, multi-step workflows, cross-platform configuration stress tests. The goal is to find bugs at the seams between components that pass individually but fail in combination. This is where the majority of novel bug discovery happens, because the combinatorial space of feature pairs and triples is far larger than the space of individual features.

Tier 3 — Frontier (20%) deliberately reaches beyond the product's current capabilities.
The deliverable shifts from "working scenario" to "gap analysis": what would you need to build this? What's missing? What would it unlock? The experience report emphasizes specific missing features, ordered by implementation effort and user value.

5.6 The Self-Expanding Property

The three tiers form a self-reinforcing growth cycle. This is the "secret sauce" that transforms a random agent into a strategic engineer:

Iteration N: Features = {A, B}
  T1 (Foundation): test A, test B (2 scenarios)
  T2 (Composition): test A+B, test B+A (2 scenarios)
  T3 (Frontier): "I need C to do A+C" → gap analysis (1 gap report)
— C gets built —
Iteration N+k: Features = {A, B, C}
  T1 (Foundation): test A, test B, test C (3 scenarios)
  T2 (Composition): test A+B, A+C, B+C, A+B+C (4+ scenarios)
  T3 (Frontier): "Now I want D to do A+C+D" (1 gap report)

Foundation grows linearly with each new feature. Composition grows superlinearly — each new feature can be combined with every existing feature and pair. Frontier pushes the boundary outward. The loop evolves the product by systematically testing what exists and identifying the next most valuable capability.

Design Principles

P1. Ground-Truth Verification. Trust in AI-generated code should be proportional to how ground-truth-verifiable the test outcomes are, not to test coverage percentage.

P2. Weakest-Evaluator. A user test is only trustworthy if a minimally capable evaluator can execute it. (§4.8)

P3. Spec-Anchored Improvement. Self-improving agents should optimize toward specification satisfaction, not proxy metrics. (§2.2, §3.5)

P4. Drift-Before-Failure. Autonomous loops need continuous trend monitoring, not only binary quality gates. (§5.5)
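The self-expanding growth cycle can be sketched numerically. Counting unordered feature subsets of size 2 or more as candidate composition scenarios is our assumption (the iteration example in the text counts ordered pairs when only two features exist); the text prescribes the growth pattern, not an exact formula:

```python
# Sketch of linear Foundation growth vs. superlinear Composition growth as the
# feature set expands. The subset-counting rule is our illustration, not the
# paper's formula.
from math import comb

def scenario_counts(num_features: int) -> tuple[int, int]:
    foundation = num_features  # one happy-path scenario per feature
    # every subset of 2+ features is a candidate composition scenario
    composition = sum(comb(num_features, k) for k in range(2, num_features + 1))
    return foundation, composition

for n in [2, 3, 5, 8]:
    f, c = scenario_counts(n)
    print(f"{n} features -> {f} foundation, {c} composition scenarios")
```

At 3 features this gives the "4+ scenarios" of the worked example (A+B, A+C, B+C, A+B+C), and by 8 features composition scenarios outnumber foundation scenarios roughly 30 to 1, which is why Tier 2 is where most novel bug discovery happens.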
6 Unbeatable Tests: The Multi-Tier QA Framework

6.1 Methodology: What "Zero Regressions" Means

Throughout this paper, we report "zero regressions introduced by loop-merged code." This claim requires precise definition to be falsifiable:

• Detection method: Every iteration ends with an automated regression oracle run (Section 5.2). For the DeFi strategy framework, this means executing demo strategies on Anvil chain forks and verifying 4-layer state deltas. For the signal platform, this means running the full quality-gate pipeline (structural, factual, temporal, cognitive) plus anti-signal canaries.
• Measurement window: Continuous — every iteration, not sampled. The oracle runs after every merge, not on a periodic schedule.
• Scope: The claim covers regressions detected by the oracle in code merged through the loop's own process. It does not claim the codebase is bug-free, nor that latent regressions undetectable by the oracle do not exist. The oracle's coverage is bounded by its test suite (10,913 unit tests, 62 demo strategies, 77 signal verifiers).
• Exclusions: Environmental failures (API timeouts, chain fork instability) are distinguished from code regressions by correlation with recent merges. A failure that reproduces on the pre-merge commit is environmental; a failure introduced by a specific PR is a regression.

This definition makes the claim auditable: any third party with access to the oracle suite and git history can reproduce the measurement.

6.2 Why Tests Are the New Competitive Advantage

Gao et al. (2025) find functional bugs in 78% of studies on AI-generated code, with ~40% of Copilot code containing security vulnerabilities [10]. Dominant benchmarks (HumanEval, MBPP) rely on Pass@k metrics that overlook semantic correctness [10], and Liang et al. (2025) show SWE-Bench performance is inflated by data contamination (76% accuracy on memorized paths vs.
53% on novel repositories) [6]. Unbeatable tests sidestep both failure modes: they verify against live product state that changes with every deployment, making both contamination and pass-rate gaming irrelevant. This requires tests at multiple tiers, each catching a different class of defect that lower tiers miss.

6.3 The 4-Level Testing Pyramid

• Level 4: E2E Scenario Tests — scenario → actions → execution → state (full user journey)
• Level 3: Integration Tests — compile, execute, verify against ground truth (real execution)
• Level 2: API/Adapter Tests — individual method tests with real dependencies (API contracts)
• Level 1: Unit Tests — isolated, mocked, fast; pure function correctness (logic validation)

Level | What It Tests                | Speed      | Trust Level                         | Written By
L1    | Isolated logic, calculations | Fast (ms)  | Low — proves logic, not integration | Agent or human
L2    | API methods, adapters        | Medium (s) | Medium — proves API contracts       | Agent or human
L3    | Full execution pipeline      | Slow (min) | High — proves real-world behavior   | Loop + agent
L4    | Complete user journeys       | Slowest    | Highest — proves product works      | Loop (AaU1000)

The critical insight: L1 and L2 are necessary but not sufficient. A function can pass all its unit tests and still fail in production because the unit tests don't exercise the real execution environment. L3 and L4 are the unbeatable tests — they verify against ground truth (real state, real APIs, real execution) that the code author cannot fake.

6.4 The 4-Layer Verification Pattern

Every L3 integration test implements four verification layers, each catching a different class of defect. We illustrate with the DeFi strategy framework (Case Study A, Section 10); other domains substitute their own verification layers (e.g., browser automation for web apps, API contract testing for backends — see Section 16 for adaptation examples).
Layer | Name           | What It Catches                                | How
1     | Compilation    | Wrong params, missing config, type errors      | Build/compile the action, assert success
2     | Execution      | Runtime failures, permission errors, timeouts  | Execute against real env, assert success
3     | Output Parsing | Wrong decoding, missing fields, malformed data | Parse output, assert expected data extracted
4     | State Deltas   | Silent failures, partial exec, wrong outcomes  | Measure state before/after, assert exact deltas

A test that only compiles is incomplete. A test that compiles and executes but doesn't check state deltas is dangerously incomplete — it could silently succeed while doing the wrong thing. Layer 4 is what makes the test unbeatable: it verifies the outcome, not just the execution.

For failure-mode tests: 3 layers are required (compilation, execution, state deltas), and state conservation MUST be asserted (state before and after remain unchanged). The system must not silently lose assets when it fails.

6.5 The Coverage Matrix: Exhaustive by Design

The specification defines a coverage matrix: every combination of feature, platform, and action type that the product claims to support. The test suite's goal is to fill every cell in this matrix.

For a product with N features, M platforms, and K action types, the matrix has N × M × K cells. Each cell represents a claim: "Feature X works on Platform Y for Action Z." Each empty cell is an untested claim — a place where the product might silently fail.

The Kitchen Loop fills these cells systematically, prioritizing by risk:

1. P0: Core features on primary platforms (table stakes)
2. P1: Core features on secondary platforms (breadth)
3. P2: Advanced features on primary platforms (depth)
4. P3: Edge cases and aggregators (completeness)

No manual QA process can fill a matrix with hundreds or thousands of cells on a weekly cadence. The Kitchen Loop can, because it generates and runs tests at AI speed.
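The coverage matrix and its risk-based prioritization can be sketched directly. The feature, platform, and action names below are hypothetical, and the core/primary classification is our illustration of the P0-P3 bucketing described in the text:

```python
# Sketch of the N x M x K coverage matrix with P0-P3 risk prioritization.
# Feature/platform/action names and their core/primary labels are invented
# for illustration.
from itertools import product

features = {"swap": "core", "lend": "core", "aggregate": "advanced"}
platforms = {"ethereum": "primary", "arbitrum": "secondary"}
actions = ["deposit", "withdraw"]

def priority(feature: str, platform: str) -> str:
    core = features[feature] == "core"
    primary = platforms[platform] == "primary"
    if core and primary:
        return "P0"  # table stakes: core on primary
    if core:
        return "P1"  # breadth: core on secondary
    if primary:
        return "P2"  # depth: advanced on primary
    return "P3"      # completeness: everything else

# Every cell is a claim: "Feature X works on Platform Y for Action Z."
matrix = {(f, p, a): priority(f, p) for f, p, a in product(features, platforms, actions)}
assert len(matrix) == len(features) * len(platforms) * len(actions)  # N x M x K cells
```

In the loop, each cell would be marked filled only after a scenario exercises it end-to-end; the unfilled cells ordered by priority form a natural scenario queue for the Ideation phase.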
6.6 Anti-Signal Canaries: Testing the Tests

For systems where outputs are non-deterministic (signals, recommendations, generated content), the Kitchen Loop uses anti-signal canaries: intentionally crafted bad inputs injected alongside real ones to verify the quality gate catches what it should. Four tiers of increasing deceptiveness:

Tier | Name          | Description                                                                  | Expected | Observed (iter 163)
1    | Obviously Bad | Glaring structural or factual errors any gate should catch                   | 100%     | 100% (0 escapes / 163 iters)
2    | Shadow        | Factually true but stale, low-novelty, or below threshold                    | 50–80%   | 100% (from 33% at iter 1, fixed iter 124)
3    | Adversarial   | Real data with wrong conclusions — fools deterministic and LLM checks        | 30–60%   | 100% (from 67% at iter 1)
4    | Mixed T/F     | Blend of valid and fabricated data in one signal — partial-failure detection | 20–50%   | 100% (added iter ~100, fixed iter 124)

Additionally, 3 API degradation canaries test resilience to partial external API failures (e.g., a data source returning errors or timeouts). These verify the quality gates degrade gracefully rather than producing false passes when dependencies fail.

The canary system provides a known-bad baseline that the regression phase can measure against. Tier 1 canary escapes are treated as a critical warning signal for operator review. The monotonic improvement across all tiers — from partial catch rates in early iterations to 100% across all 4 tiers by iteration 124 — demonstrates that the loop's quality infrastructure improves alongside the product it protects.

A notable finding from the Edge deployment: Tier 2 catch rates were stuck at 33% for 70+ iterations. Early attempts to improve them via the L4 LLM tribunal (GPT/Claude/Gemini judgment) produced
The durable fix came from encoding failure patterns as deterministic rules (e.g., STALE_NARRATIVES, STALE_CATALYSTS), achieving a 100% catch rate that held for 40+ subsequent iterations. This validates Design Principle P1 (Ground-Truth Verification): for safety-critical gates, deterministic verification outperforms probabilistic LLM judgment.

6.7 Multi-Model Review Tribunals

For decisions that require judgment (architectural choices, ambiguous test results, code quality assessment), the Kitchen Loop uses multi-model tribunals: three independent AI reviewers evaluate the same artifact in parallel. Findings are synthesized with consensus classification:

• Consensus (all three agree): treated as a confirmed finding
• Majority (two agree): treated as a likely finding, prioritized for action
• Solo (one reviewer): flagged for human judgment

This reduces the false-positive rate of any single model and provides higher-confidence assessments for critical decisions. Section 9 extends this concept into a full structured deliberation system — the Discussion Manager — with multi-round debate, epistemic safeguards against sycophancy, and empirical validation across 23 production discussions.

6.8 The Adversarial UAT Gate: "How Would a User Test This?"

Unit tests prove code works; they don't prove features work for users. When the same agent that implements a feature also tests it, three failure modes emerge:

1. Happy-path blindness — the implementer only tests the case they built for
2. Context leakage — the agent "knows" the implementation and unconsciously compensates for gaps
3. Cheating — AI models optimize for green checks, not product truth. They write side scripts, mock data, and reinterpret assertions to force a pass

The solution is not "ask the model to test honestly" but to design the loop so honesty is the easiest behavior and cheating is mechanically visible.
The UAT Gate implements this through a three-step adversarial protocol:

6.9 Step 1: "How would a user test this?"

After implementing a ticket and creating a PR, the implementing agent must write a sealed test card — a step-by-step recipe that any user could follow to verify the feature works. The test card format is strict:

• Every step has an exact command (no placeholders requiring judgment)
• Every step has an exact expected exit code and exact output assertions (not "should work" — specific strings)
• At least one step verifies that bad input is rejected (not just the happy path)
• No manual code edits — if testing requires configuration, the implementer ships a fixture in the PR
• No implementation details — user-visible behavior only

This is the "as a user, how can I test this?" forcing function. If the implementer cannot write a card that demonstrates the feature working end-to-end, that's a signal the feature isn't done — even if all unit tests pass.

6.10 Step 2: "Clear session — remove biases"

A fresh agent is spawned with zero implementation context to execute the test card:

• Information wall: The evaluator receives only the test card. No diff, no ticket, no code context, no conversation history from the implementing agent.
• Weakest model: The evaluator uses the weakest available model (e.g., Haiku) as a "dumb user" proxy. A strong model compensates for bad test cards by "figuring out" what the feature does. A weak model fails if the card is incomplete — which is exactly the signal you want. If the weakest model can follow the card and get expected results, a real user can too.
• Isolated worktree: The evaluator runs in a clean git worktree — a separate copy of the repository with pristine state.
• Read-only mandate: The evaluator is explicitly forbidden from editing any product file. Its prime directive is to disconfirm the feature, not confirm it.
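The Step 1 card-format rules lend themselves to mechanical validation before any evaluator is spawned. A sketch, with an assumed card schema that is not the framework's actual one:

```python
# Test-card validator sketch. The card dict layout ("steps", "command",
# "expected_exit", "expected_output", "expects_rejection") is a hypothetical
# schema chosen to mirror the rules in the text above.
import re

def validate_test_card(card):
    """Return a list of rule violations; an empty list means the card is sealed-ready."""
    problems = []
    for i, step in enumerate(card["steps"], 1):
        # Rule: exact commands only -- no <placeholders> or TODO markers.
        if re.search(r"<[^>]+>|\bTODO\b", step["command"]):
            problems.append(f"step {i}: placeholder in command")
        # Rule: every step needs an exact exit code and output assertion.
        if "expected_exit" not in step:
            problems.append(f"step {i}: missing exact exit code")
        if not step.get("expected_output"):
            problems.append(f"step {i}: missing output assertion")
    # Rule: at least one step must prove bad input is rejected.
    if not any(s.get("expects_rejection") for s in card["steps"]):
        problems.append("no step verifies that bad input is rejected")
    return problems

card = {"steps": [
    {"command": "mytool submit --job demo.json", "expected_exit": 0,
     "expected_output": '"status": "queued"'},
    {"command": "mytool submit --job /dev/null", "expected_exit": 1,
     "expected_output": "invalid job spec", "expects_rejection": True},
]}
print(validate_test_card(card))  # [] -- card passes
```

A card that fails this check is rejected before the evaluation agent ever runs, which is cheaper than discovering the ambiguity mid-evaluation.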
6.11 Step 3: "Do it — show me the results"

The evaluator executes every step and produces structured evidence: raw command output, exact exit codes, actual-vs-expected comparisons. Then the implementing agent performs a mechanical integrity check:

Three-Layer Anti-Cheating

1. Information wall. The evaluator gets ONLY the test card. No diff, no ticket, no implementation context.
2. Mechanical integrity check. After evaluation: git diff on the UAT worktree. Any product file modification = EVAL_CHEAT_FAIL. Any untracked files outside the evidence dir = FAIL.
3. Evidence structure. Every step: raw command output + exit codes. Missing steps = failure. "It worked" without output = failure.

The verdict taxonomy makes the outcome actionable:

Verdict | Meaning | Action
PASS | Feature works from the user's perspective | Proceed, attach evidence to PR
PRODUCT_FAIL | Feature is broken | Keep ticket open, tag PR uat-failed
UAT_SPEC_FAIL | Test card is ambiguous or un-runnable | Log for process improvement, don't block
EVAL_CHEAT_FAIL | Evaluator modified product files | Serious process issue, flag for human review

6.12 Real-World Example: The Backtest Service Gap

This example illustrates exactly why the UAT gate exists. An agent implemented a backtest HTTP service with 38 passing unit tests covering HTTP routing, job lifecycle state machines, model serialization, and capacity limits. All tests green. Lint passes. PR created. Ticket moved to "In Review." But nobody ever actually started the service and sent a request to it.

The real smoke test — curl -X POST http://localhost:8000/api/v1/backtest followed by polling for completion — would have revealed that PnLBacktester() was instantiated with no constructor arguments. The data providers, fee models, and slippage models were never wired up. The service accepted jobs and immediately failed them. 38 unit tests passed. The feature was completely broken.
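Stepping back to the mechanics: the Step 3 integrity check reduces to two git queries on the UAT worktree. A sketch, where the helper name and the CLEAN verdict are illustrative while EVAL_CHEAT_FAIL follows the paper's taxonomy:

```python
# Mechanical integrity check sketch: verdict depends only on the worktree's
# observable git state, so the evaluator cannot talk its way past it.
import subprocess

def integrity_check(worktree, evidence_dir="evidence"):
    """Return EVAL_CHEAT_FAIL if any product file was modified or any
    untracked file exists outside the evidence directory, else CLEAN."""
    modified = subprocess.run(
        ["git", "-C", worktree, "diff", "--name-only"],
        capture_output=True, text=True, check=True).stdout.split()
    untracked = subprocess.run(
        ["git", "-C", worktree, "ls-files", "--others", "--exclude-standard"],
        capture_output=True, text=True, check=True).stdout.split()
    stray = [p for p in untracked if not p.startswith(evidence_dir + "/")]
    if modified or stray:
        return "EVAL_CHEAT_FAIL"
    return "CLEAN"
```

Because the check is a pure function of `git` output, it is cheap enough to run after every evaluation and impossible for either agent to reinterpret.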
With the UAT gate, the implementer must write a test card whose steps include starting the service, submitting a job, polling for completion, and verifying the results contain data. The Haiku evaluator runs this card blindly. Step 3 returns "status": "failed". Verdict: PRODUCT_FAIL. The ticket stays open until the agent actually wires up the backtest pipeline.

The implementer can't dodge this:
- Can't write "run the unit tests" as the card — validation rejects it (not user testing)
- Can't test only HTTP routing — the card rules require testing the actual user journey
- Can't weaken assertions — "Expected output contains: completed" is binary
- Can't skip the gate — it is mandatory for any change to product code

In autonomous loops where 90.6% of agent-authored PRs receive zero human review [3], the UAT gate fills this gap mechanically.

7 Controlling Regression and Drift

7.1 The Drift Problem

A self-evolving codebase faces a unique risk: quality drift. Each iteration produces code that passes its own tests — but does the accumulation of changes make the system better or worse? Without continuous measurement, the answer is unknowable until a user hits a regression in production.

Shukla et al. (2025) quantify this risk: iterative LLM code refinement paradoxically degrades security, with vulnerabilities rising from 2.1 per sample in early iterations to 6.2 by iterations 8–10 — a 37.6% increase in critical vulnerabilities after just five iterations [9]. He et al. (2025) corroborate the drift risk from a different angle: Cursor adoption produces persistent quality degradation (+30% static analysis warnings, +42% cognitive complexity) that outlasts the transient velocity gains [11].

The Kitchen Loop treats regression control as a first-class concern, not an afterthought.
Every iteration ends with a regression phase that answers: "Is the system at least as good as it was before this iteration?"

7.2 The Regression Oracle

Each product domain requires a regression oracle — a repeatable test that answers "is the system still working?" in bounded time. The oracle's properties:

• Deterministic: Same inputs produce the same pass/fail on the same codebase
• Comprehensive: Covers the product's critical paths
• Fast enough: Must complete within the iteration budget
• Independent of the loop: The oracle tests the product, not the loop's own output

Mode | Duration | Coverage | When to Use
Full | 120–150 min | All scenarios, all platforms | Scheduled weekly
Quick | 30–40 min | One scenario per platform | Every iteration

7.3 The Blocked Combos Registry

A critical operational tool is the Blocked Combos registry — a machine-readable list of feature/platform combinations that are known-broken and should not be re-tested by the ideation phase. This prevents the loop from wasting iterations on known-broken paths. When a blocking ticket is resolved, the combo is removed and becomes available for ideation again. The registry grows and shrinks as bugs are found and fixed, creating a living map of the product's actual (not assumed) capability surface.
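A minimal sketch of such a registry follows; the class and method names are hypothetical, not the framework's API:

```python
# Blocked Combos registry sketch: known-broken (feature, platform) pairs are
# keyed by the ticket that blocks them, so resolving the ticket frees the combos.

class BlockedCombos:
    def __init__(self):
        self._blocked = {}  # (feature, platform) -> blocking ticket ID

    def block(self, feature, platform, ticket):
        self._blocked[(feature, platform)] = ticket

    def unblock(self, ticket):
        """When a blocking ticket is resolved, its combos become eligible again."""
        self._blocked = {k: t for k, t in self._blocked.items() if t != ticket}

    def eligible(self, combos):
        """Filter the ideation candidates down to combos not known-broken."""
        return [c for c in combos if c not in self._blocked]

reg = BlockedCombos()
reg.block("swap", "solana", ticket="BUG-412")
combos = [("swap", "solana"), ("swap", "base"), ("lp_open", "solana")]
print(reg.eligible(combos))  # [('swap', 'base'), ('lp_open', 'solana')]
reg.unblock("BUG-412")
print(len(reg.eligible(combos)))  # 3
```

Keying entries by ticket ID is what makes the registry shrink automatically: the triage phase only has to close the ticket, not remember which cells it was masking.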
7.4 Drift Metrics

The loop tracks several metrics across iterations to detect drift:

Metric | Healthy Trend | Warning Signal
Test count | Growing or stable | Declining
Pass rate | Stable at >95% | Declining over 3+ iterations
Bug discovery rate | Declining (maturity) | Sudden spike (regression)
Oracle pass rate | 100% | Any failure correlated with recent changes
Blocked combos | Declining | Growing without corresponding fix tickets
Canary escape rate | 0% for Tier 1 | Any Tier 1 escape

The Regress phase verifies iteration-history completeness before drift analysis; missing rows are backfilled automatically, since gaps would silently defeat sliding-window trend detection. A complementary cross-PR interaction detector identifies structural regression patterns — files added in one PR and deleted by another, revert commits, and high-churn files modified by 3+ independent merges — that the regression oracle cannot catch because they span multiple PRs.

7.5 Automated Pause Gates

Five automated gates determine whether the loop should continue or pause for human review:

Gate | Trigger | Response | Auto?
Regression Failure | Oracle pass rate drops below threshold | Pause after N consecutive failures (default 3) | Semi
Canary Escape | Tier 1 canary passes quality gate | Warn operator | Advisory
Drift Threshold | Quality metric declines 3+ consecutive iters | Warn operator | Advisory
Backpressure | Open PRs exceed threshold | Enter drain mode (polish-only) | Yes
Starvation | Execute starved N consecutive iters | Monitor-only, alert human | Yes

Drain Mode: When open PRs exceed a threshold (default: 10), the loop automatically enters drain mode — skipping all phases except Polish and increasing the PR processing limit. When PRs drop below the exit threshold (default: 5), normal operation resumes. The original phase configuration is saved and restored, so drain mode is transparent to the running loop.
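The backpressure gate's enter/exit behavior is a small hysteresis loop; a sketch using the default thresholds from the text (the function name is illustrative):

```python
# Drain-mode hysteresis sketch: enter at the high watermark (>10 open PRs),
# exit at the low one (<5). The band between the two prevents mode flapping.

def drain_state(open_prs, draining, enter_at=10, exit_at=5):
    """Return True if the loop should be in drain (polish-only) mode."""
    if not draining and open_prs > enter_at:
        return True       # too many open PRs: stop producing, start merging
    if draining and open_prs < exit_at:
        return False      # backlog drained: resume normal phases
    return draining       # inside the hysteresis band: keep the current mode

draining = False
for prs in [8, 11, 9, 7, 4]:
    draining = drain_state(prs, draining)
    print(prs, "drain" if draining else "normal")
# 8 normal / 11 drain / 9 drain / 7 drain / 4 normal
```

Note that 9 and 7 keep the loop in drain mode even though they are below the entry threshold; a single cutoff would instead oscillate between modes on every PR merge.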
This prevents the failure mode where the loop produces PRs faster than it can merge them, causing an unbounded backlog.

Starvation Gate: When the Execute phase produces zero output for N consecutive iterations (default: 10), the loop transitions to monitor-only mode and alerts the operator. This gate was validated empirically in the Edge deployment (Section 11.5): at iterations 112–127, all remaining work required changes in an external dependency (SDK/Python) unreachable from the Edge (TypeScript) codebase. The circuit breaker fired 11 times, and the loop correctly recommended stopping. When the dependency blockers were resolved at iteration 128, the loop immediately resumed productive work (14 PRs in one batch). The starvation gate validates the spec-anchored design: when the specification surface is fully covered, the loop stops rather than inventing work.

Three control counters — starvation, drain entries, and no-work loops — are persisted to disk, so the loop retains its operational position across process restarts.

Of the five gates, drain mode and starvation are fully automatic. Regression failure halts only after consecutive iteration failures exceed a configurable threshold (default: 3). Drift and canary escape currently warn rather than pause — making them advisory gates that rely on operator attention.

These gates ensure the loop cannot degrade the product faster than it improves it. The loop is allowed to run autonomously because the gates provide a safety net.

8 System Architecture

8.1 The Six-Phase Loop

The Kitchen Loop is orchestrated by a shell framework that manages state transitions, git worktree isolation, and error recovery. The core logic is delegated to specialized AI skills, each encoding a repeatable, autonomous workflow.
(Figure: the six-phase loop — Backlog (groom queue, fill pipeline) → Ideation (select spec, attempt usage) → Triage (findings to tickets) → Execution (branch, fix, test, PR) → Polishing (review, CI, merge) → Regression (oracle + drift measurement).)

Backlog (~15 min): Evaluates urgency and coverage gaps, promotes candidates to the work queue, generates new scenario tickets when supply runs low.

Ideate (~15–45 min): Selects a scenario, implements it as a real user would, runs it against the test environment, documents what breaks. Optionally passes through an external feasibility check.

Triage (~5–10 min): Converts findings into labeled, prioritized tickets with root-cause hypotheses and file pointers. Deduplicates against existing tickets, and reopens tickets whose prior fix PRs were closed without merging.

Execute (~30–60 min): For each top-N urgent ticket: creates a feature branch in an isolated worktree, implements the fix, writes tests, opens a PR. Includes backpressure control.

Polish (~10–90 min): PR hardening and merging through a graduated state machine. Each PR is tracked with an attempt counter; after a configurable number of failures (default: 1), it is labeled needs-attention and excluded from future processing. Operators can raise the threshold to enable multi-attempt escalation with follow-up ticket creation and failure classification. PRs blocked by security or architectural concerns are retired — closed with a comment routing the ticket back to the backlog. Before any merge, a TOCTOU-safe deletion check verifies no files were unexpectedly removed.

Regress (~40–150 min): Runs the regression oracle (read-only — no fixes). Updates loop state directly on the base branch. Promotes confirmed patterns to durable memory. Produces an iteration summary with drift metrics. Before updating shared state files (e.g., loop-state.md), the Regress phase re-reads the current version to prevent stale overwrites from interrupted or concurrent runs.
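A skeletal sketch of the phase sequencing and state persistence just described. The state-file format and skill signatures are assumptions for illustration; the real framework is a shell orchestrator delegating to AI skills:

```python
# Six-phase orchestration sketch: run the phases in order, end the iteration
# early on error, and persist counters so restarts resume where they left off.
import json
import os

PHASES = ["backlog", "ideate", "triage", "execute", "polish", "regress"]

def run_iteration(skills, state_file="loop-state.json"):
    """skills: mapping phase name -> callable(state). Returns the updated state."""
    if os.path.exists(state_file):
        with open(state_file) as f:
            state = json.load(f)       # counters survive process restarts
    else:
        state = {"iteration": 0}
    state["iteration"] += 1
    for phase in PHASES:
        try:
            skills[phase](state)       # each skill mutates the shared state
        except Exception as exc:
            # Error recovery: record the failure and end the iteration early.
            state.setdefault("failures", []).append([phase, str(exc)])
            break
    with open(state_file, "w") as f:
        json.dump(state, f)
    return state
```

In the real system each "skill" is an LLM-driven workflow with its own timeout; the point of the sketch is only the ordering, the early exit, and the fact that iteration state lives on disk rather than in the process.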
8.2 Skills as Prompts-as-Code

Each phase is implemented as a skill — a structured markdown file encoding a repeatable workflow in natural language. Skills are version-controlled, portable across LLM providers, and improved in natural language without deployment. A Kitchen Loop-compatible system needs standardized skill interfaces:

Skill | Input | Output | Must Guarantee
backlog | Ticket state | Promoted + new scenario tickets | No duplicate tickets created
ideate | Scenario criteria | Experience report + scenario | Runs against real test env
triage | Experience report | Prioritized tickets w/ file pointers | Deduplicates against existing
execute | Ranked ticket list | Feature branch + PR per ticket | Every PR includes tests
regress | Full codebase state | Pass/fail report + drift metrics | Runs regression oracle

Teams adopting the Kitchen Loop implement these interfaces for their domain. The orchestration — phase sequencing, timeout management, error recovery — is domain-independent.

8.3 Execution Modes

Mode | Purpose | When to Use

Framework modes (domain-independent):
strategy (default) | Full cycle: ideate + execute + regress | Standard operation
user-only | Rapid ideation loops to fill the backlog | Backlog empty, need discovery
dev-only | Implementation-focused loops to drain the backlog | Backlog full, need throughput
drain (auto) | Polish-only with increased PR throughput | Auto-triggered by backpressure
regress-quick | Shortened regression: one scenario per platform | Every iteration (fast feedback)
ui | Browser-driven user flows with visual checkpoint verification | Web application testing

Domain-specific modes (DeFi strategy framework example):
backtest | Backtesting pipeline instead of on-chain execution | Backtesting stress-testing
exploration | Explore new protocol/chain coverage gaps | Coverage discovery

8.4 The Shared Memory: Ticket Management as PM Layer

The central nervous system is the project management tool — used not as a lightweight task tracker but as the shared, durable memory layer between the AI loop and human contributors. Both AI and human operators write to the same backlog. The Execute phase cannot tell the difference — it queries "top N urgent unblocked tickets" and works through them. This creates parity between AI and human judgment: the backlog is the truth, and neither source is privileged.

Field | Convention | Why
Labels | bug, feature, improvement, exploration | Picks bugs first, then balances
Priority | critical, high, medium, low | Respects human priority, no override
State | Backlog → Todo → In Progress → In Review → Done | PRs auto-link to tickets via title
Dependencies | blocks / blockedBy | Skips blocked tickets automatically

8.5 The Bar: Production-Readiness Standard

The Kitchen Loop applies "The Bar" — a project-specific quality standard defined in a customizable quality.bar_file. The framework ships a generic template covering code quality, PR standards, safety, and documentation; each deployment customizes it for its domain. This approach is supported by Abtahi and Azim (2025), who find that categorizing quality issues by type achieves 100% resolution vs. 71.6% for bulk processing [5] — the Kitchen Loop's phase-based architecture naturally decomposes enforcement into categories (functional correctness in Execute, style in Polish, regression safety in Regress).

For example, the DeFi strategy framework (Case Study A) instantiates The Bar as five domain-specific principles:

1. Production Ready — No TODOs, shortcuts, or "we'll fix later." Ship-quality on merge.
2. Zero Hardcoding — All configuration from resolvers and registries, never literals.
3. Million-Dollar Safe — Every external call handled, amounts rounded safely, no silent fallbacks.
4. Hedge-Fund Serious — Precise arithmetic everywhere, meaningful test assertions, conservation checks in all failure paths.
5.
UX First, Safety Always — Clean, declarative APIs; non-bypassable safety gates; secrets redacted from logs.

These principles are enforced through multi-model review tribunals — every PR is evaluated against them before merge. Other domains would define their own principles (e.g., HIPAA compliance for healthcare, WCAG conformance for web accessibility).

9 The Discussion Manager: Structured Multi-AI Deliberation

9.1 Beyond Review Tribunals

Section 6.7 introduced multi-model review tribunals — three independent AI reviewers evaluating the same artifact in parallel. The Discussion Manager extends this concept into a full structured deliberation system: a multi-round, multi-model debate framework where heterogeneous AI agents argue substantive positions, challenge each other's reasoning, and converge toward actionable decisions through an impartial moderation layer.

Where review tribunals answer "is this code correct?", the Discussion Manager answers harder questions: "should we build this at all?", "which of three competing architectures best serves the specification surface?", and "what are we not seeing?" These are the judgment-intensive decisions that the Kitchen Loop encounters in its Ideate and Triage phases — decisions where a single model's blind spots can compound into wasted iterations.

The Discussion Manager operationalizes what Cai et al. (2025) identify as the "Debate-Based Cooperation" pattern — adversarial argumentation to surface errors — which appears in only 4 of 94 surveyed multi-agent systems (4.3% adoption) [8]. The low adoption rate, despite theoretical value for error discovery, suggests this is an underexplored design space. The Kitchen Loop's production implementation, with its anti-sycophancy safeguards and empirical validation across 23 discussions, represents one of the first operational deployments of this pattern at scale.
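Before the architecture details, the overall shape of a moderated debate can be sketched. Here debaters are plain callables standing in for model calls, and the convergence check is a naive exact-match stand-in for the real protocol:

```python
# Moderated multi-round debate sketch. Each debater sees only the prompt and
# the prior transcript (the information firewall); the moderator only sequences
# rounds and checks convergence, never injecting opinions.

def moderate(debaters, prompt, max_rounds=3):
    """debaters: mapping name -> callable(prompt, transcript) -> position string."""
    transcript = []
    for rnd in range(max_rounds):
        positions = {}
        for name, debater in debaters.items():
            # Firewall: debaters never see the moderator's notes or reasoning.
            positions[name] = debater(prompt, list(transcript))
        transcript.append(positions)
        # Naive convergence: all debaters state the same position.
        if len(set(positions.values())) == 1:
            return {"converged": True, "rounds": rnd + 1, "transcript": transcript}
    return {"converged": False, "rounds": max_rounds, "transcript": transcript}

debaters = {
    "gemini": lambda p, t: "monolith" if not t else "split-service",
    "codex":  lambda p, t: "split-service",
    "claude": lambda p, t: "split-service",
}
result = moderate(debaters, "Should the backtester be a separate service?")
print(result["converged"], result["rounds"])  # True 2
```

The cap on rounds mirrors the finding cited below that 2–3 rounds preserve productive disagreement; in production, convergence would be judged semantically rather than by string equality.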
9.2 Architecture

The Discussion Manager uses a centralized moderator architecture with an information firewall between moderator and debaters:

(Figure: the Moderator (Claude) is impartial — it manages rounds, checks convergence, and writes the synthesis — and sits behind an information firewall from Debater A (Gemini), Debater B (Codex), and Debater C (an isolated Claude subagent instance).)

Key design decisions:

• Heterogeneous models: Three different model families (Gemini, Codex/GPT, Claude) ensure genuine perspective diversity. Each model brings different training data, different reasoning patterns, and different failure modes. Homogeneous debate (three instances of the same model) rapidly converges to groupthink.
• Centralized moderation: The moderator orchestrates turn order, checks convergence, and writes the synthesis — but never injects opinions into the debate itself. Centralized judge structures outperform peer-review structures for debate quality (Yao et al., 2025).
• Claude subagent isolation: When the moderator is Claude, the Claude debater runs as an isolated subagent with no access to the moderator's reasoning, notes, or context. The subagent sees only the debate prompt. This is the information firewall that prevents the moderator's framing from contaminating the debate.
• Code grounding: For technical discussions, debaters receive codebase context (specific files, architecture documents, or topic-aware summaries). Gemini gets autonomous file-search access; other debaters get relevant content injected into their prompts. This anchors debate in verifiable facts rather than rhetorical polish.

9.3 Empirical Evidence: 23 Production Discussions

The Discussion Manager's design was shaped by Yao et al.'s (2025) research on sycophancy in multi-agent debate [14], which identifies three failure modes: sycophancy, disagreement collapse, and negative agreement.
Key design-relevant findings: heterogeneous models outperform homogeneous; centralized judges outperform peer review; and capping debate at 2–3 rounds preserves productive disagreement.

We evaluated the system against these criteria using a corpus of 23 production discussions conducted across three codebases over a 4-week period. Two independent meta-analyses — one by Claude (moderator) and one by Codex (independent evaluator) — mitigate self-assessment bias.

Dimension | Value
Total discussions | 23 unique discussions across 3 codebases
Discussion types | Architecture/PRDs (10), Code reviews (7), Strategy (5), Exploration (1)
Typical participants | 3 models (Gemini, Codex, Claude); 1 discussion used 2 models
Typical rounds | 3 (range: 1–10)
Convergence | 2/23 formally converged; 21/23 hit max rounds
Conclusions | 100% "Agreed" — zero "Agreed to Disagree"

Evaluation against paper metrics:

Criterion | Measured | Paper's Ideal | Assessment
Disagreement Collapse Rate (DCR) | ~0% (all converge) | Low but non-zero | Red flag — universal convergence implausible
Sycophancy Score (SS) | 35–50 | <20 | Moderate-high — mitigated by code grounding
Negative Agreement Rate (NAR) | ~25% | <10% | Moderate — Claude concedes readily
Evidence Quality | 85/100 | High | Strong — code grounding is best feature
Perspective Diversity | 60/100 | High heterogeneity | Moderate — good diversity, rigid roles
Round Efficiency | ~50% productive | >80% | Low — 2–3 wasted rounds per discussion

Findings from both meta-analyses:

Strengths. Code grounding anchors every discussion in specific line numbers and verifiable API behavior, strongly mitigating the "rhetorical polish without substance" failure mode. Actionability is consistently high (phased remediation plans, PRDs, file-level recommendations), and raw transcripts confirm substantive disagreement, corrections, and explicit position changes.

Weaknesses. Disagreements resolve by one party conceding rather than genuine synthesis (premature convergence). The first speaker sets the problem's ontology, with later speakers refining rather than challenging it (sequential anchoring). Convergence tracking is cosmetic — counting self-reported DISAGREEMENTS items rather than verifying semantic resolution. No discussion concluded with "do not build this," revealing a structural bias toward action.

Meta-analyst disagreement. The independent Codex meta-analysis challenged Claude's "100% ratification" claim (one discussion lacked ratifications in the JSON) and "monotonic convergence" pattern (some discussions showed non-monotonic trajectories).

Mitigations implemented. Based on these findings and Yao et al.'s recommendations, the Discussion Manager now implements: blind opening rounds (eliminating first-speaker anchoring), a structured issue register (replacing cosmetic convergence counting), explicit proposal ratification, a kill gate requiring a "do not build this" argument before proceeding, and full transcript preservation for reproducible meta-analysis.

9.4 Implementation and Open Challenges

The Discussion Manager is implemented as a Python orchestrator (discuss.py, open-sourced in the companion repository) that manages conversation creation, turn execution, convergence checking, ratification, and report generation. It is designed for use at three integration points: (1) the Ideate phase for architectural decisions, (2) code review in audit mode, and (3) a retrospective comparing implementation against agreed designs. In the current open-source release, the Discussion Manager is invoked manually; automated orchestrator integration is planned.
Despite the epistemic improvements, several challenges remain:

• Cost: Three models for 3–5 rounds at ~400 words per turn consumes 15,000–25,000 tokens per model per discussion. With the ~50% round efficiency observed in the corpus, approximately 40% of token spend is on convergence dynamics rather than novel insight. The cost implications for scaling are discussed in Section 11.4.
• Role calcification: Even with heterogeneous models, each model tends toward a fixed role (Gemini as breadth analyst, Codex as bug hunter, Claude as compromise broker). Random role assignment per discussion can help, but the underlying model tendencies persist.
• Evaluation circularity: When the moderator is Claude and one meta-analyst is also Claude, self-assessment bias is difficult to fully eliminate. The independent Codex meta-analysis mitigates this but does not resolve it entirely.
• Scaling beyond three: The current protocol is optimized for three debaters. Scaling to more participants increases round duration and convergence complexity without proportional improvement in perspective diversity.

These are active research questions. The framework's modular design — separating orchestration, moderation, and debater execution — allows incremental improvement without architectural overhaul.

10 Case Study A: DeFi Strategy Framework

10.1 The Product and Its Specification Surface

The first validation target is a production DeFi strategy framework built by Almanak (https://almanak.co), the Almanak SDK for DeFi Quants — a public, open-source repository.² A quant writes a strategy class, declares high-level intents (swap, LP open/close, borrow, supply), and the framework compiles those intents to on-chain transactions, executes them, parses receipts, and updates state.

Figure 3: The Almanak SDK repository: a production DeFi strategy framework supporting 14 chains and 30+ protocol connectors.
The specification surface:

Dimension | Count
Supported chains | 14 (13 EVM + Solana)
Protocol connectors | 30+ (13 core + Aave, Compound, GMX, Morpho, Lido, ...)
Intent types | 21 (Swap, LP Open/Close, Borrow, Supply, Perp, Stake, ...)
Cross-combinations | ~1,000 (chain × protocol × intent)

10.2 The Regression Oracle: Fork Execution

The regression oracle is a suite of demo strategies executed on Anvil (a local EVM chain fork). Each strategy runs against a fork of the target chain, compiles intents to transactions, executes them, and verifies on-chain state changes. A PASS requires at least one on-chain transaction with verified balance deltas. A PASS(HOLD) is valid (market conditions didn't trigger a trade). A FAIL is always a regression until proven environmental.

The 4-layer verification pattern applied:

Layer | Implementation
Compilation | compiler.compile(intent) → assert status SUCCESS
Execution | orchestrator.execute(bundle) → assert success on-chain
Receipt Parsing | Protocol-specific parser extracts amounts, position IDs
State Deltas | get_token_balance() BEFORE and AFTER, assert exact expected changes

² https://github.com/almanak-co/sdk

10.3 Results

Metric | Value
Loop iterations completed | 122+
Merged pull requests | 728+
Unique tickets resolved | 350+
Lines of code added | ~250,000
Unit tests (start) | ~6,400
Unit tests (current) | 10,913
Demo strategies | 62 (up from 13)
Incubating strategies | 183
Chains covered | 13 EVM + Solana
Protocols exercised | 30+ (Uniswap V3/V4, Aave, Morpho, Curve, Pendle, ...)
Intent types exercised | All 21 (SWAP, LP_OPEN, LP_CLOSE, BORROW, SUPPLY, ...)
Consecutive zero-bug iters | 16 (iterations 55–70)
Regressions (loop-merged) | 0

These results span 122+ iterations over 5 weeks (Feb 18 – Mar 23, 2026).

10.4 Phase Progression

Phase 1: Bug Discovery (Iterations 1–23).
Dominated by "obviously missing" bugs — table-stakes features every user would hit immediately:

Representative Finding | Impact | Iter
API method documented but not exposed to users | AttributeError on first call | 12
Gas price cap correct for one chain, blocks all txs | All strategies on one chain fail silently | 18
Protocol router config missing for entire protocol | Blocks swap compilation everywhere | 50
Native token symbol missing for two chains | Blocks ALL strategies on those chains | 51

Phase 2: Coverage Expansion (Iterations 24–54). Focus shifted to new protocol connectors, new chains, and multi-protocol composability. Notable: first lending protocol on a new chain (37 unit tests); discovery of a router interface V1 vs V2 mismatch causing silent reverts; first multi-protocol strategy (11 transactions across 4 intent types, zero bugs); first indicator-driven LP strategy validating the market data → position sizing pipeline.

Phase 3: Maturity (Iterations 55–71). The system reached maturity: 16 consecutive zero-bug ideation phases. Bug discovery shifted from "obviously missing" to configuration gaps and infrastructure reliability issues. New capabilities shipped autonomously: a direct-action CLI, support for a new blockchain, advanced demo strategies.

Phase 4: Deep Verification (Iterations 72–89).
The loop transitioned from coverage breadth to verification depth, finding production-severity bugs that traditional testing would miss:

Finding | Severity | Impact | Iter
Pendle PT sell uses 1:1 exchange rate, but PT trades at ~19% discount | HIGH | Every PT sell at normal slippage (0.5–10%) would revert with INSUFFICIENT_TOKEN_OUT | 89
BTC.b completely missing from Avalanche token resolver | HIGH | Any BTC.b strategy on Avalanche fails at compilation | 83
Uniswap V4 adapter silently falls back to 18 decimals | MEDIUM | Incorrect balance calculations for non-18-decimal tokens | 80–81

A single 7-iteration overnight batch (iterations 83–89) produced 18 PRs, 370+ new tests, and 5 bug discoveries — demonstrating sustained throughput even at high iteration counts. Independent multi-model code review (Claude pr-auditor + Codex/GPT-5.4) rated the merged code as STRONG across all quality principles.

Regression stability under load: Despite 20+ PRs merged between checkpoints, the demo strategy suite maintained a 100% pass rate across all recent iterations. Quick regression tests 4 chains in under 2 minutes.

Phase 5: Operational Maturity (Iterations 90–122). The loop entered a sustained production rhythm with improved strategy diversity, predictable throughput, and a stabilizing PR backpressure cycle:

• Cross-chain validation at scale: First BSC swap (iter 92), first Sonic execution (iter 100), first Polygon multi-protocol composition (iter 94). 13 EVM chains + Solana now covered.
• Safety-critical discoveries: Fabricated Uniswap V4 addresses (P0, iter 93), wrong Ethena unstake selector (iter 99), CryptoSwap zero-slippage protection (iter 95), price impact guard for zero-liquidity pools (iter 122). These are deep bugs that traditional testing misses.
• Infrastructure healing cycle matures: The loop diagnosed and fixed its own PR merger timeout (iter 101), stale loop-state detection (iter 107+), and demo regression blind spots (iter 74). Each infrastructure crisis resolved faster than the previous one.

• Zero-bug iterations become common: Iterations 94, 108, 115, 116, and 121 all passed with zero new bugs, validating connector portability across chains.

A single 3-iteration overnight batch (iterations 120–122) produced 9 PRs, 12 merges, ~249 tests, and a systemic price impact guard — demonstrating sustained throughput at high iteration counts.

10.5 The Self-Healing Property

The loop discovered and fixed its own infrastructure bugs through its standard process — no human diagnosis required:

• Apple Silicon memory bug: PR merge automation used the wrong memory page size (4096 instead of 16384), causing available memory to read 4x too low. This blocked merging for 5 consecutive iterations. The loop-review skill identified the pattern; Execute fixed it; Regress confirmed the fix.

• Codex feasibility checker: The external AI cooperation model (Section 12.3) matured through the same loop cycle — strict response contracts, timeout handling, and salvage paths were all refined by the loop itself.

• Ideate persistence bug (iter 13–19): Six iterations of lost deliverables (reports not committed to git) before the loop identified the pattern, filed a ticket, and shipped a fix that decoupled the commit/push lifecycle. Verified stable from iter 19 onward.

• PR Manager death spiral (iter 71–73): Permanent needs-attention labels caused a 62-hour deadlock with zero merges. The loop review diagnosed the root cause; subsequent iterations added TTL-based label recovery and a circuit breaker. Full recovery by iter 74 (5/5 PRs merged).
• Demo regression blind spot (iter 53–74): Twenty iterations of blind demo testing because ALCHEMY_API_KEY was missing in worktree environments. The loop detected the pattern, shipped a .env copy to worktrees with a pre-flight key check, and confirmed the fix in iter 75.

This property — the loop improving its own tooling through the same process it uses to improve the product — is further demonstrated in Case Study B (Section 9) and elaborated in Section 13.

11 Case Study B: Signal Intelligence Platform

11.1 The Product and Its Specification Surface

The second validation target is Edge, a DeFi signal intelligence platform built by Almanak (https://almanak.co), available at https://app.almanak.co. Edge comprises 46+ detection agents that scan on-chain and off-chain data sources, produce structured signals, and feed a synthesis pipeline that correlates signals into actionable intelligence. The specification surface:

| Dimension | Count |
|---|---|
| Detection agents | 46+ |
| Signal types | 77 (all verified) |
| Verifier families | 22 |
| Cross-combinations | ~1,078 (agent × signal type) |

11.2 The Regression Oracle: Quality Gates + Canaries

Signals are ephemeral and non-deterministic — they cannot be “replayed” like strategy executions.
The regression oracle combines deterministic quality gates with anti-signal canaries.

4-layer quality gate:

| Layer | Name | Method | What It Catches |
|---|---|---|---|
| L1 | Structural | Schema validation | Format errors, invalid enums, missing metadata |
| L2 | Factual | API cross-reference (DefiLlama, Snapshot, CoinGecko, GeckoTerminal, on-chain RPCs) | False claims, wrong numbers |
| L3 | Temporal | Dedup + timestamp analysis | Stale signals, zombie duplicates |
| L4 | Cognitive | LLM tribunal (3 models) | Bad reasoning, wrong conclusions from correct data |

4-tier anti-signal canaries (24 total + 3 API degradation):

| Tier | Canaries | Catch Rate (iter 1) | Catch Rate (iter 163) |
|---|---|---|---|
| Tier 1 (Obviously Bad) | 6 | 100% | 100% (zero escapes across 163 iters) |
| Tier 2 (Shadow) | 5 | 33% | 100% (fixed iter 124) |
| Tier 3 (Adversarial) | 5 | 67% | 100% |
| Tier 4 (Mixed True/False) | 5 | N/A (added ~iter 100) | 100% (fixed iter 124) |
| API Degradation | 3 | N/A (added later) | 100% (3/3) |

11.3 Results

| Metric | Value |
|---|---|
| Loop iterations completed | 163 |
| Tickets created | 200+ |
| Pull requests merged | 366 |
| Signal type coverage | 77/77 (100%) — up from 33/36 at iter 1 |
| Verifier coverage | 100% (22 verifier families) — every signal type has a dedicated L2 factual verifier |
| Non-functional agents identified | 18 (managed via exclusion list) |
| L1/L2/L3 pass rates | 100% / 100% / 100% (up from 85% / 76% / 85% at iter 1) |
| L2 unverifiable rate | 0% (down from 9–31% at iter 1) |
| Canary health | 100% all 4 tiers (zero Tier 1 escapes across 163 iterations) |
| Test cases | 2,171 across 152 test files |
| Average iteration speed | ~45 min avg (25–35 min signal-focused, 80–230 min full, 15–20 min circuit-breaker skip) |
| PR merge rate | 97% (366/377) |
| Median time to merge | 1.1 hours |
| Regressions from loop-merged code | 0 |

These results span 163 iterations over 2.5 weeks.
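The layered gate structure can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the signal fields and check function are hypothetical, and the real L2–L4 layers call external APIs and an LLM tribunal rather than local predicates. The key design choice shown is ordering: cheap deterministic checks run first, so expensive layers only see signals that already pass.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    agent: str
    signal_type: str
    payload: dict
    timestamp: float

# Hypothetical L1 check: structural validation of required payload fields.
def l1_structural(sig: Signal) -> tuple[bool, str]:
    required = {"asset", "value"}
    missing = required - sig.payload.keys()
    return (not missing, f"missing fields: {sorted(missing)}" if missing else "ok")

def run_gates(sig: Signal, layers) -> dict:
    """Run layers in order; stop at the first failure so cheap checks shield expensive ones."""
    for name, check in layers:
        passed, reason = check(sig)
        if not passed:
            return {"verdict": "FAIL", "layer": name, "reason": reason}
    return {"verdict": "PASS", "layer": None, "reason": "all layers passed"}

layers = [("L1-structural", l1_structural)]  # L2-L4 checks would be appended here
good = Signal("LPAgent", "pool_apr_spike", {"asset": "WETH", "value": 12.4}, 0.0)
bad = Signal("LPAgent", "pool_apr_spike", {"asset": "WETH"}, 0.0)
print(run_gates(good, layers)["verdict"])  # PASS
print(run_gates(bad, layers)["verdict"])   # FAIL (caught at L1)
```

Canaries plug into the same interface: a known-bad signal is injected and the oracle verifies that some layer returns FAIL for it.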
11.4 Quality Gate Maturity Trajectory

The most compelling evidence of the loop’s self-improving quality infrastructure:

| Metric | Iter 1–10 | Iter 50–60 | Iter 128–135 | Iter 163 |
|---|---|---|---|---|
| L1 pass rate | 85–100% | 100% | 100% | 100% |
| L2 pass rate | 76–91% | 90–97% | 100% | 100% |
| L2 unverifiable | 9–31% | 3–10% | 0% | 0% |
| L3 pass rate | 85–100% | 92–100% | 100% | 100% |
| Verifier coverage | ~92% (33/36) | ~100% | 100% (72/72) | 100% (77/77) |
| PR merge rate | ~0% | ~85% | 100% | 97% |
| Open PR backpressure | 0–3 | 3–10 | 0 | 0–3 |

Key insight: In this deployment, the quality gates improved monotonically. No regression in pass rates was detected by the oracle after any of the 366 merged PRs across 163 iterations. This suggests that autonomous code changes can maintain (and improve) quality invariants without continuous human oversight, given sufficient verification infrastructure.

11.5 Phase Progression

The Edge deployment traversed four distinct phases, mirroring the SDK’s progression but at higher iteration speed:

Phase 1: Bootstrap (Iterations 1–12, Days 1–2). Detection engine without self-correction capability. 12 PRs created, 0 merged — the loop could find bugs but couldn’t fix them. PR manager and polish phase activated; first merges in iter 11. Quality trajectory: L1 85 → 100%, L2 69 → 96%, L3 85 → 100%.

Phase 2: Steady-State Improvement (Iterations 13–87, Days 2–6). Productive self-improvement with monotonically improving quality. 200+ PRs merged. The loop identified and fixed 7 infrastructure bugs through its own Ideate → Triage → Execute cycle: chainValid permanently fixed, drift persistence migrated to file-backed storage, verifier families expanded, canary Tier 2 fixed, circuit breaker added, non-functional agent exclusion list created. L1/L2/L3 all reached and held 100%.

Phase 3: Starvation (Iterations 88–127, Days 6–9). All remaining tickets required SDK (Python) changes unreachable from the Edge (TypeScript) codebase.
The circuit breaker fired on 10+ consecutive iterations. Two of three reviewers recommended STOP THE LOOP. This was correct behavior: the loop did not degrade or invent work — it recognized exhaustion and recommended stopping. Quality gates maintained 100% throughout.

Phase 4: Recovery and Expansion (Iterations 128–163, Days 9–17). New work became available after SDK blockers resolved. 100+ PRs merged. 20+ new agents tested, LPOpportunityAgent fixed, ArbitrageAgent activated, TechnicalAgent added. Canaries reached 100% across all 4 tiers by iteration 124 and held for 40+ subsequent iterations.

11.6 Key Findings (Early Iterations)

• A critical cooldown bypass needed — 33+ agents unreachable due to dedup, blocking 75% of the fleet from quality testing

• A production agent using Math.random() simulation instead of real on-chain data — causing temporal failures with stale signals from tokens launched over a year ago

• A “dead” agent registered in config but not in the agent system — producing zero signals silently

• The loop’s own quality gate conflating “unverifiable” with “failed” — a false positive the loop discovered and fixed through its own triage process

11.7 The Self-Correction Flywheel (Iterations 128–135)

At maturity, the loop’s most powerful property is iterative self-correction — it doesn’t just find bugs, it fixes them, verifies the fixes, and catches when fixes are incomplete.

The OnChainSentinel Fix Chain:

| Iter | What Happened |
|---|---|
| 130 | Ideate discovers OnChainSentinel hangs the pipeline via WebSocket constructor blocking. Creates ticket. |
| 130 | Execute ships PR: 5-second timeout guard around getNetwork() + adds agent to exclusion list. |
| 131 | Ideate retests. Discovers the fix was incomplete — the WebSocketProvider constructor itself blocks before getNetwork() is ever called. Reopens ticket. |
| 131 | Execute ships PR: wraps entire WS provider init in a single timeout with provider cleanup on expiry. Adds destroy-on-timeout test. |
| 131 | Regress confirms: build passes, 27 signals produced, L1/L2/L3 100%, no regressions. |

This is a 3-iteration cycle from discovery → incomplete fix → complete fix → regression verification, with zero human intervention. The loop caught its own incomplete fix and iterated until the root cause was fully addressed.

The ArbitrageAgent Data Pipeline Resurrection: The ArbitrageAgent went through a 3-PR fix chain across multiple iterations:

1. PR #278 — Replaced dead The Graph hosted subgraph URLs with the GeckoTerminal API (The Graph deprecated its free hosted service — a real external platform change the loop detected)

2. PR #282 — Fixed incorrect base_token_price_usd vs quote_token_price_usd field usage that produced nonsensical 3,499× spreads

3. PR #286 — Lowered ARBITRAGE_PROFIT_MIN_PERCENT from 1% to 0.1% after observing 2 consecutive zero-signal runs (1% required a 1.6% gross spread on Ethereum — far above real market conditions of 0.1–0.5%)

Each iteration peeled back one more layer of the problem. The loop doesn’t give up after the first fix — it keeps testing until the agent actually produces signals.
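The retest-and-reopen behavior behind both fix chains can be sketched as a small state machine. The ticket states, hooks, and the single-ticket framing below are illustrative names of ours, not the production implementation; the point is only that a failed retest returns the ticket to OPEN, so an incomplete fix gets another attempt instead of being declared done.

```python
from enum import Enum

class TicketState(Enum):
    OPEN = "open"
    FIX_SHIPPED = "fix_shipped"
    VERIFIED = "verified"

class Ticket:
    def __init__(self, title: str):
        self.title = title
        self.state = TicketState.OPEN
        self.fix_attempts = 0

def iterate(ticket: Ticket, ship_fix, retest) -> None:
    """One loop iteration over a single ticket: ship a fix if it is open,
    otherwise retest the last fix. A failed retest reopens the ticket."""
    if ticket.state is TicketState.OPEN:
        ship_fix(ticket)
        ticket.fix_attempts += 1
        ticket.state = TicketState.FIX_SHIPPED
    elif ticket.state is TicketState.FIX_SHIPPED:
        ticket.state = TicketState.VERIFIED if retest(ticket) else TicketState.OPEN

# OnChainSentinel-style chain: first fix incomplete, second fix verifies.
t = Ticket("OnChainSentinel hangs pipeline")
results = iter([False, True])  # retest outcomes across successive iterations
for _ in range(4):
    iterate(t, ship_fix=lambda tk: None, retest=lambda tk: next(results))
assert t.state is TicketState.VERIFIED and t.fix_attempts == 2
```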
11.8 Triage Intelligence: The Classification System

The triage phase demonstrates sophisticated classification that goes beyond a binary “working/broken”:

| Classification | Example | Response |
|---|---|---|
| Non-functional (confirmed) | RWARiskAgent: hardcoded stub data (backingRatio=1.0) | Add to exclusion list immediately |
| Non-functional (suspected) | LPRotationAgent: 0 signals, 1st observation | Monitor — needs 2+ confirmations before excluding |
| Cold-start | DevActivityAgent: needs warm baseline from prior runs | Don’t exclude — works in scheduled mode |
| Timing-dependent | BribeTrackingAgent: no active vote rounds | Don’t exclude — market conditions, not a bug |
| Deprecated | YieldAgent: converted to return-empty stub | Add to exclusion list |

Self-correcting classification: In iteration 131, triage classified DevActivityAgent as “non-functional.” In iteration 132, it corrected this diagnosis — the agent has a working GitHub API integration and is actually cold-start, not broken. The ticket was updated, and the agent was NOT added to the exclusion list. Self-correction extends beyond code fixes to the classification system itself.
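The confirmation rule in the table, exclude only after repeated observations, can be sketched as a small ledger. The class name, threshold constant, and reset-on-healthy-run behavior are illustrative assumptions of ours; the production triage also weighs cold-start and timing-dependent evidence before excluding.

```python
from collections import defaultdict

CONFIRMATIONS_REQUIRED = 2  # assumed threshold, per the "2+ confirmations" rule above

class TriageLedger:
    """Tracks zero-signal observations per agent and promotes suspects to the exclusion list."""
    def __init__(self):
        self.zero_signal_counts = defaultdict(int)
        self.exclusion_list = set()

    def observe(self, agent: str, produced_signals: bool) -> str:
        if produced_signals:
            self.zero_signal_counts[agent] = 0  # a healthy run resets suspicion
            return "ok"
        self.zero_signal_counts[agent] += 1
        if self.zero_signal_counts[agent] >= CONFIRMATIONS_REQUIRED:
            self.exclusion_list.add(agent)
            return "non-functional (confirmed)"
        return "non-functional (suspected) - monitor"

ledger = TriageLedger()
print(ledger.observe("LPRotationAgent", produced_signals=False))   # suspected, monitor
print(ledger.observe("DevActivityAgent", produced_signals=False))  # suspected, monitor
print(ledger.observe("DevActivityAgent", produced_signals=True))   # cold-start resolved
print(ledger.observe("LPRotationAgent", produced_signals=False))   # confirmed, excluded
```

The DevActivityAgent trace mirrors the self-correcting classification above: one healthy observation clears the suspicion before exclusion triggers.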
11.9 Infrastructure Self-Healing (Iterations 128–135)

The loop identified and fixed 6 problems in its own infrastructure during a single 8-iteration batch:

| Finding | Fix | Impact |
|---|---|---|
| RESULTS_DIR undefined in pr-manager.sh | Grep prep log instead | Gate rejection memory completely non-functional |
| gh --jq --arg invalid syntax | Pipe to jq --arg | Shell command silently failing |
| Missing iter 127 row in loop-state.md | Backfill + verification step | History gap invisible to drift detection |
| synthesisPipelineDead false positive | localDev flag for drift suppression | 9 iterations of wasted CRITICAL escalation |
| Deployment drift (local main stale) | Pre-flight detection | All merged fixes untested for 5+ iterations |
| NON_FUNCTIONAL_AGENTS set out of sync | Sync with confirmed findings + expose via API | Ideate kept selecting dead agents |

As in Case Study A (Section 8.4), the loop improves at two levels: the product and its own infrastructure.

11.10 Speed Difference: Two Domains, Same Loop

| Phase | Strategy Framework | Signal Platform | Speedup |
|---|---|---|---|
| Ideate | 10–15 min | 1–2 min | ~8x |
| Triage | 5–8 min | <1 min | ~8x |
| Execute | 25–60 min | 10–30 min | ~2x |
| Regress | 40–150 min | 2–3 min | ~30x |
| Total | 80–230 min | ~15–35 min* | ~5–15x |

*Signal-only iterations (ideate + triage) complete in ~5 min; full 6-phase iterations take 80–230 min. The ~15–35 min figure reflects signal-focused iterations including execute and polish. Circuit-breaker skip iterations complete in 15–20 min; the average across all 163 iterations was ~45 min.

The signal platform iterates faster because signal quality verification is API-bound, not execution-bound. This speed advantage compounds over time: at 163 iterations, the signal platform has performed more quality gate evaluations than most human QA teams accomplish in a year.
12 Cost of Operation

12.1 Tooling Spend

A common concern: “How much does it cost to run an AI agent continuously?” The answer is surprisingly modest: the Kitchen Loop runs entirely on flat-rate subscriptions, not metered API calls.

| Cost Component | Monthly Cost | Notes |
|---|---|---|
| Claude Code Max (20x plan) | $200 | Primary agent for all 6 phases — unlimited usage within plan |
| Codex (subscription) | $20 | External feasibility checker + tribunal reviewer |
| Gemini (subscription) | $20 | Tribunal reviewer for multi-model audits |
| CodeRabbit | ~$15 | Automated PR code review |
| Anvil / Foundry | $0 | Open-source, runs locally |
| External data APIs | ~$0–50 | CoinGecko, DefiLlama (free tiers sufficient) |
| CI compute | ~$50–100 | GitHub Actions minutes |
| Total | ~$305–405/month | |

This is the total cost for running the Kitchen Loop across both production systems simultaneously — 285+ iterations, 1,094+ merged PRs, 700+ tickets resolved. A senior engineer costs $12,000–25,000/month fully loaded; the Kitchen Loop’s monthly tooling cost is ~2% of a single engineer’s cost. Critically, this includes the quality infrastructure (unbeatable tests, UAT gates, regression oracles) that prevents the complexity-driven velocity erosion documented by He et al. [11] and Becker et al. [12].

| Metric | Kitchen Loop | Single Engineer | Ratio |
|---|---|---|---|
| Monthly cost | ~$350 | ~$15,000 | ~43x cheaper |
| PRs merged/month | 600+ | ~15–25 | ~30x more |
| Coverage scenarios tested/month | 150+ | ~10–20 | ~10x more |
| Cost per merged PR | ~$0.38 | $600–1,000 | ~1,800x cheaper |

Because cost is fixed regardless of iteration count, the marginal cost of additional coverage is effectively zero.

12.2 Test Suite Growth and Runtime

As the test suite grows, regression time grows.
This is a real scaling concern:

| Metric | Start (iter 1) | Current (iter 122) | Growth Rate |
|---|---|---|---|
| Unit tests | ~6,400 | 10,913 | +70% over 122 iterations |
| Full regression time | ~90 min | ~150 min | ~0.5 min/iteration growth |
| Quick regression time | ~25 min | ~40 min | ~0.12 min/iteration growth |
| Demo strategies | 13 | 62 | ~0.40/iteration |
| Incubating strategies | 0 | 183 | ~1.5/iteration |

Mitigation strategies already in use:

• --regress-quick mode: one scenario per platform (~40 min vs ~150 min)

• Full regression runs weekly, quick runs every iteration

• Parallel test execution for unit tests (pytest-xdist)

• Test pruning: deprecated/redundant strategies archived, not accumulated

Projected ceiling: At current growth rates, full regression would reach ~4 hours by iteration 200. This is manageable with parallelization (dynamic port allocation for multiple test environments), sharding (split the coverage matrix across N parallel loops), or sampling (statistical coverage sampling rather than exhaustive runs every iteration).

12.3 Business Outcomes

For readers focused on outcomes: 50+ features shipped, 200+ bugs found before users (including fund-safety issues), coverage matrix fill rate improved from ~5% to ~50% of 1,800 combinations, and the test suite grew +70%. Detailed breakdowns are in Sections 8.3 and 9.3.

13 Production Safety Record

13.1 Incidents and Ill Effects

A critical question for any autonomous system: has it caused harm?

| Concern | Record | Details |
|---|---|---|
| Production downtime | 0 incidents | Loop operates on isolated branches + test environments (Anvil forks). No direct production access. |
| Security incidents | 0 incidents | Loop has no access to production keys, wallets, or user data. Test wallets use Anvil defaults. |
| Data loss | 0 incidents | Git worktree isolation prevents branch contamination. All changes are reversible PRs. |
| Regressions shipped to main | 0 | Quality gates (multi-model review + regression oracle) catch issues before merge. |
| Cost overruns | 0 | Monthly tooling cost is bounded and predictable (~$305–405/month across both systems, ~$1.50/iteration). |

13.2 Why Zero Incidents?

The clean safety record follows from architectural isolation: no production access (test environments only), branch isolation (git worktrees on feature branches), automated pause gates (Section 5.5), and backpressure control. These are structural properties, not luck.

13.3 Known Limitations (Honest Assessment)

| Limitation | Impact | Mitigation |
|---|---|---|
| Anvil-only testing | Loop has not been exercised with real mainnet transactions | Mainnet mode exists but requires explicit operator approval + real gas |
| API key dependency | Some scenarios require external API keys not in standard config | Produces PASS(caveat) instead of clean PASS — a known quality ceiling |
| Single-threaded execution | Gateway port constraint limits to sequential strategy runs | Future: dynamic port allocation for parallel execution |
| PR merge automation fragility | The Polish phase has been the most failure-prone component | Each failure mode has been fixed through the self-improvement cycle |
| Cooldown phantom failures (signal platform) | Agents with cooldown periods inflate dead-agent counts | Distinguishing “cooldown active” from “agent broken” is tracked as an open ticket |

This section is intentionally honest. The loop is not infallible. But its failure modes are operational (merge automation bugs, API key management) rather than safety-critical (data loss, fund loss, production outages). The architectural isolation ensures that operational failures waste iteration time, not user trust.

13.4 Human-in-the-Loop Cost Constraints

The loop is autonomous end-to-end: it generates scenarios from the specification surface, fills its own backlog, triages findings into tickets, implements fixes, and auto-merges PRs after multi-model review and CI. No human is in the critical path during normal operation.
The human role reduces to (1) initial specification surface definition, (2) occasional strategic steering, and (3) supervisory intervention when the loop encounters failure modes it cannot self-correct. In practice, the most common intervention trigger was merge conflict resolution gone wrong: LLMs resolving git conflicts would sometimes silently undo work from previous commits or PRs, requiring a human to notice the regression and revert. GitHub infrastructure issues (API rate limits, webhook failures, stale branch state) were the second most frequent cause. Section 12.2 frames the steady-state human cost as ~30–60 minutes per week once calibrated — but this assumes the loop is running smoothly. Intervention spikes during infrastructure instability.

The practical scaling constraint is environmental: test environment throughput (Anvil fork startup time, API rate limits), CI pipeline capacity, and the cost of multi-model review tokens per PR. Drain mode (Section 5.5) throttles output when PR backpressure exceeds a threshold, ensuring the loop does not outrun its own merge infrastructure.

The Discussion Manager adds a separate cost dimension. Three models debating for 3–5 rounds at ~400 words per turn consume 15,000–25,000 tokens per model per discussion. With the ~50% round efficiency observed in our 23-discussion corpus (Section 7.3), approximately 40% of deliberation token spend produces convergence dynamics rather than novel insight. At current subscription pricing this cost is absorbed into flat-rate plans, but metered API usage would make frequent deliberation expensive. A per-iteration compute cost breakdown beyond the aggregate $0.38/PR figure has not yet been instrumented — this is a gap we intend to address in future work.
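A back-of-envelope estimator shows where tokens go in such a debate. The conversion ratio (~4/3 tokens per word) and the assumption that each model re-reads the full transcript of prior rounds are ours, not instrumented measurements, and the sketch ignores system prompts and injected context, so it underestimates the true spend:

```python
def deliberation_tokens_per_model(models: int, rounds: int, words_per_turn: int = 400,
                                  tokens_per_word: float = 4 / 3) -> int:
    """Rough per-model token cost for one multi-round debate."""
    turn = round(words_per_turn * tokens_per_word)
    # Each round the model re-reads all turns from prior rounds, then writes one turn.
    reading = sum(models * r * turn for r in range(rounds))
    writing = rounds * turn
    return reading + writing

for r in (3, 4, 5):
    print(r, "rounds:", deliberation_tokens_per_model(models=3, rounds=r))
```

The quadratic reading term dominates: transcript re-reading, not writing, is what makes each additional debate round disproportionately expensive.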
These constraints do not invalidate the framework, but they bound its applicability: the Kitchen Loop is most efficient in domains where (a) the specification surface is well-defined upfront, (b) the regression oracle provides high-confidence automated verification, and (c) the human’s role can be genuinely asynchronous rather than synchronous with each iteration.

13.5 Open Problems

Our deployments expose four open problems that we believe warrant dedicated research:

• OP1: Oracle Transfer. The Kitchen Loop relies on a bespoke regression oracle per domain (chain fork execution for DeFi, quality-gate pipeline for signals). Automatic generation of regression oracles from natural-language specifications — eliminating the per-domain engineering cost — is an unsolved challenge.

• OP2: Specification Acquisition. Our method assumes an enumerable specification surface. For legacy codebases with implicit specifications, automating surface extraction from telemetry, documentation, and user behavior is a critical bottleneck for adoption.

• OP3: Multi-Objective Drift. Current drift metrics focus on functional correctness. Extending the framework to simultaneously monitor non-functional requirements (latency, security, fairness) without human intervention remains open. Our drift detection (Section 5.5) would need to compose multiple objective functions without one dominating the pause-gate signal.

• OP4: Sycophancy at Scale. The Discussion Manager mitigates sycophancy in 3-model debates (Section 7), but optimal model composition and debate protocols for larger, heterogeneous agent swarms are unknown. Our corpus of 23 discussions is too small to establish whether the observed SS < 20 threshold generalizes (cf. Yao et al., 2025 [14]).

13.6 When the Kitchen Loop Should NOT Be Used

The framework is not universally applicable.
It should not be applied when: (1) ground truth is weak or unobservable — without a reliable oracle, the verification layer provides false confidence; (2) success criteria are highly subjective (aesthetic quality, UX taste) — the oracle cannot arbitrate matters of judgment; (3) the specification surface is not enumerable, as in exploratory R&D where the goal is discovery rather than convergence; or (4) safety-critical domains lack robust external verification — the oracle’s bounded coverage (Section 4.1) is insufficient when failures have irreversible consequences.

14 The Human-AI Collaboration Model

14.1 Not Either/Or — Both

Chen et al. (2025) identify three automation points — human-only, copilot, and agent — finding agents achieve 60% task correctness vs. 25% for copilots, yet 60% of participants would not continue using agents due to a comprehension gap [13]. The Kitchen Loop operates at a fourth point: a fully autonomous loop with asynchronous human oversight, eliminating the idle-time and comprehension problems Chen et al. identify.
The Kitchen Loop is explicitly a complementary system, not a replacement:

| Concern | AI (Kitchen Loop) | Human |
|---|---|---|
| Coverage | Exhaustive — 1000x the scenarios a human would test | Strategic — focuses on what matters most given user context |
| Bug discovery | Bottom-up — finds what’s broken by trying everything | Top-down — knows what users are complaining about |
| Ticket writing | Precise, file-level, with reproduction steps | Context-rich, business-aware |
| Implementation | Fast, consistent, follows established patterns | Creative, architectural, can break patterns when needed |
| Code review | Multi-model parallel tribunals | Domain expertise, security intuition |
| Backlog curation | Deduplication, severity assessment, dependency graphs | Strategic priority, business value, external signals |

14.2 The Human’s Highest-Leverage Input

The human’s primary contribution is specification and backlog curation: understanding users, monitoring the competitive landscape, and converting those signals into tickets. In our deployments, this required ~30–60 minutes per week once the loop was calibrated — but this figure assumes a mature specification surface (see Section 11.4 for scaling constraints and the human bottleneck at high iteration speeds).

14.3 The External Feasibility Checker

The Kitchen Loop optionally uses an external AI (a different model from the loop’s primary agent) as a feasibility checker before committing to an idea. Empirical results (285+ combined iterations): ~78% PROCEED, ~15% REDIRECT, <5% REJECT + timeout. The low rejection rate suggests the loop’s scenario selection is well-calibrated. REDIRECT cases adjusted scope productively.

15 The Self-Improving Loop

15.1 Meta-Level Improvement

The loop-review skill audits loop behavior every N iterations and produces tickets for infrastructure improvements.
Sections 8.4 and 9.3 showed specific examples; the full catalog across both deployments:

• Platform-specific failures: Merge automation memory bug on Apple Silicon caused 5 consecutive stalls; loop review identified the pattern and the Execute phase fixed it.

• Resource contention: Process collisions when human and loop run concurrently; tool version drift breaking the review phase. Fixes added process tracking and version pinning.

• Retry and backpressure loops: PR Manager stuck retrying the same failing PR, wasting entire Polish phases. Fixed by skip-after-2-failures with needs-attention labeling. PR backlog growing unbounded during sustained runs prompted automatic drain mode.

• State management: Loop-state lost when worktree PRs weren’t merged promptly (fixed by decoupling state sync from the merge lifecycle). Backlog groomer promoting already-in-progress tickets (fixed by counting viable tickets only).

• Silent failures: Missing .env in worktrees causing strategy failures; loop-review reports silently lost when the output directory didn’t exist. Fixed by pre-flight checks and write verification guards.

• Budget management: Regress timeout consuming 100% of the iteration budget, prompting a quick-regression parameter.

The framework has also been applied to its own codebase (dogfooding). In its first iterations running against the KitchenLoop orchestrator and PR manager, the loop discovered and fixed multiple bugs — including race conditions in temporary file handling, missing agent-liveness detection, and incorrect timeout behavior on macOS. This validates that the self-improvement property extends to the meta level: the loop can improve the tool that runs the loop.

15.2 Pattern Consolidation

When the same pattern is confirmed across 2+ iterations, it is promoted from a session observation to a durable memory entry.
This creates institutional knowledge that persists across conversations and loop runs.

15.3 The Skill Layer

Starting from 5 basic phase skills, the validated deployments now have 30+ skills covering loop orchestration, quality gates, protocol integration, competitive intelligence, release management, and documentation maintenance. Each skill emerged from the loop’s own discovery of a repeatable workflow worth encoding.

Additionally, the loop developed a gate rejection memory system that prevents redundant LLM auditor calls — if a PR was rejected and no new commits were pushed, it is immediately marked NOT_MERGEABLE without wasting an audit cycle. This optimization, discovered and implemented by the loop itself, eliminated the zero-backpressure bottleneck observed in earlier iterations. Recent additions include /loop-review-meta — a macro-level analysis skill that aggregates findings across all loop review reports to surface systemic trends and strategic recommendations.

16 Generalization: Making the Kitchen Loop Portable

16.1 What Makes a Codebase Loop-Ready

The Kitchen Loop works on any codebase where:

1. The specification is enumerable: The product has a definable set of features, platforms, and action types that can be expressed as a coverage matrix.

2. Usage can be automated: “Using the product” can be performed by an LLM agent without physical interaction (APIs, CLIs, SDKs).

3. Quality is measurable by regression: There exists a test oracle that can answer “is the system still working?” in bounded time.
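The gate rejection memory described in Section 15.3 reduces to a cache keyed on a PR and its head commit. A minimal sketch with illustrative names (the production version persists this state; ours is in-memory only):

```python
class GateRejectionMemory:
    """Skip re-auditing a PR whose head commit was already rejected."""
    def __init__(self):
        self._rejected = {}  # pr_number -> head commit sha of the last rejected audit

    def should_audit(self, pr_number: int, head_sha: str) -> bool:
        # New commits invalidate the cached rejection; an identical head
        # means the PR can be marked NOT_MERGEABLE without an audit cycle.
        return self._rejected.get(pr_number) != head_sha

    def record_rejection(self, pr_number: int, head_sha: str) -> None:
        self._rejected[pr_number] = head_sha

mem = GateRejectionMemory()
assert mem.should_audit(101, "abc123")      # first look: run the auditor
mem.record_rejection(101, "abc123")
assert not mem.should_audit(101, "abc123")  # same head: skip, NOT_MERGEABLE
assert mem.should_audit(101, "def456")      # new commits pushed: audit again
```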
16.2 How to Adapt the Kitchen Loop to Your Domain

| Domain | Specification Surface | Regression Oracle | Example Adaptation |
|---|---|---|---|
| Web Application | Pages × User Flows × Browsers | Browser automation + visual regression (Playwright, Cypress) | Ideate generates user journeys; Regress runs screenshot diffing against known-good baselines |
| ML Pipeline | Models × Datasets × Metrics | W&B / MLflow run comparison + statistical tests | Ideate trains model variants; Regress compares against baseline metrics with significance tests |
| Smart Contracts | Functions × Chains × Edge Cases | Foundry/Anvil fork execution + invariant checks | Ideate writes fuzz test scenarios; Regress runs invariant test suites |
| Backend API | Endpoints × Methods × Auth Roles | Contract testing + live traffic shadow comparison | Ideate generates API call sequences; Regress replays against shadow traffic |
| Mobile App | Screens × Gestures × Devices | Appium automation + visual regression | Ideate scripts user flows; Regress runs on device farm |
| Compiler / Language | Grammar × Optimizations × Targets | Test suite execution + benchmark comparison | Ideate generates programs exercising edge cases; Regress runs conformance suite |

The two adaptation points are always the same: what is the specification surface? and what is the regression oracle? Everything else — backlog management, phase sequencing, drift control, self-improvement — transfers intact.

16.3 The Specification Layer

The Kitchen Loop works because the products it’s been applied to have well-defined specifications. Most codebases lack this. A specification layer — structured YAML or Markdown specs stored alongside code — solves this by making specs machine-readable, version-controlled, and auditable.
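What one machine-readable entry in such a layer might contain can be sketched as a typed record with a trivial validator. The field names and the "no oracle means no trust" check are hypothetical illustrations of ours; the paper prescribes the layer, not a schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpecClaim:
    """One row of the specification surface: a capability the product claims to support."""
    feature: str
    platform: str
    action: str
    oracle: str  # how the claim is verified, e.g. "fork-execution" or "schema+api-crossref"

def validate(claims: list) -> list:
    """Flag claims the loop cannot exercise: a claim without an oracle is untestable."""
    return [f"{c.feature}/{c.platform}/{c.action}: no oracle"
            for c in claims if not c.oracle]

claims = [
    SpecClaim("swap", "ethereum", "exact-in", "fork-execution"),
    SpecClaim("swap", "solana", "exact-in", ""),  # untestable claim: should be flagged
]
print(validate(claims))  # flags the solana entry
```

Storing such records alongside code gives Ideate an enumerable surface to sample from and gives reviewers a diffable artifact when the spec changes.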
16.4 The Scaffolding Layer

A scaffolding layer — templates for the six-phase orchestrator, ticket management stubs, oracle skeletons, and CI workflows — reduces time-to-first-loop from days to hours.

16.5 A Composable Stack

• Kitchen Loop Engine: Backlog → Ideate → Triage → Execute → Polish → Regress → (repeat)

• Specification Layer: What does the product promise? What should we test? YAML/Markdown specs versioned alongside code.

• Scaffolding Layer: Project setup, CI/CD, skill templates, oracle stubs. Reduces time-to-first-loop from days to hours.

17 Conclusion

17.1 The Shift

In our observation, software engineering is undergoing a phase transition. Code production — once the bottleneck — is becoming a commodity. Code review — once a human-only activity — is becoming automated. The competitive advantage is shifting to:

1. Specification: Knowing what to build (the AaU1000 method)

2. Verification: Proving it works (unbeatable tests)

3. Convergence: Ensuring it keeps working (regression and drift control)

17.2 The Evidence

Across two production systems and 285+ combined iterations (Sections 8.3 and 9.3), zero regressions were detected by the regression oracle, quality gates improved monotonically from 76–91% to 100%, and the loop autonomously fixed 17+ infrastructure bugs in its own tooling. The second system validated portability: the same architecture applied to a fundamentally different domain achieved comparable results with a 25x faster iteration cycle, requiring only two reimplemented components (specification surface and regression oracle). The unified trust model held under sustained autonomous operation without increasing human intervention.

17.3 The Invitation

The Kitchen Loop is not the future of software development. It is a practice available today, on a codebase of any size, with tools that already exist. The prerequisites are:

1. An enumerable specification surface — what does your product claim to do?
2. An automatable test environment — can an AI agent exercise your product?
3. A regression oracle — can you answer "is the system still working?" in bounded time?
4. The discipline to run it continuously — and to trust the output when the tests are unbeatable.

In our experience, the tests are the trust layer. The spec is the compass. The loop is the engine. Given sufficient verification infrastructure, the product evolves itself.

17.4 Testable Hypotheses

Our deployments suggest four empirical hypotheses that subsequent work could validate, replicate, or refute:

• H1: Coverage-exhaustion systems (as defined in the regime taxonomy, Section 2.9) discover more user-visible failures per iteration than task-completion systems in partially mature products with >50% specification-surface coverage.
• H2: Adversarial UAT gates — sealed test cards executed by a fresh evaluator with zero implementation context — reduce false-positive readiness assessments compared with implementer-authored test suites alone.
• H3: Tier-weighted scenario selection (Foundation 30% / Composition 50% / Frontier 20%) discovers more bugs per iteration than uniform random selection across the specification surface. The superlinear growth of the Composition tier (Section 3.6) predicts that the advantage increases with product maturity.
• H4: In user-facing verification tasks, weak-model evaluation (the least capable model available) is a better proxy for real-user verifiability than strong-model evaluation, because weak models fail on the same ambiguities that trip real users.

These hypotheses are intended to facilitate direct replication, extension, or refutation in future agentic-systems research.

17.5 Known Limitations & Future Work

The Kitchen Loop has five structural limitations that bound its applicability:

1. Single-threaded execution. The current orchestrator runs one iteration at a time.
Parallelization across multiple worktrees is architecturally straightforward but not yet implemented (Section 13.3).
2. Enumerable specification surface required. The method assumes the product's capabilities can be listed as a coverage matrix. Legacy monoliths with implicit specifications require a specification-extraction step (OP2, Section 13.5) before the loop can operate.
3. Oracle quality is the ceiling. The loop can only catch what the regression oracle can verify. If the oracle misses a failure mode, the loop is blind to it. Oracle transfer across domains (OP1) remains an open research problem.
4. Human role persists. Backlog grooming, specification design, and production promotion require human judgment. At scale, human merge capacity — not AI generation speed — becomes the binding constraint (Section 13.4).
5. Two-domain validation only. Our evidence spans two production systems in the DeFi domain. Generalization to other domains (Section 16) is architecturally supported but empirically unvalidated.

A The Coverage Matrix (Strategy Framework Example)

The strategy framework's intent-test coverage matrix illustrates the exhaustive approach. Every cell represents a claim: "Protocol X works on Chain Y for Intent Z." Empty cells are untested claims.
Protocol       | Chain       | Swap | LP Open | LP Close | Supply | Borrow | Repay | Withdraw
Aerodrome      | Base        | P0   | P0      | P0       | -      | -      | -     | -
TraderJoe V2   | Avalanche   | P0   | P0      | P0       | -      | -      | -     | -
Uniswap V3     | Ethereum    | P1   | P1      | P1       | -      | -      | -     | -
Uniswap V3     | Arbitrum    | P1   | P1      | P1       | -      | -      | -     | -
Uniswap V3     | Base        | P1   | P1      | P1       | -      | -      | -     | -
PancakeSwap V3 | BSC         | P1   | -       | -        | -      | -      | -     | -
Aave V3        | Ethereum    | -    | -       | -        | P0     | P0     | P0    | P0
Aave V3        | Arbitrum    | -    | -       | -        | P0     | P0     | P0    | P0
Aave V3        | Base        | -    | -       | -        | P0     | P0     | P0    | P0
Compound V3    | Ethereum    | -    | -       | -        | P1     | P1     | P1    | P1
Morpho Blue    | Ethereum    | -    | -       | -        | P2     | P2     | P2    | P2
Enso           | Multi-chain | P3   | -       | -        | -      | -      | -     | -
Curve          | Ethereum    | P3   | -       | -        | -      | -      | -     | -

P0 = critical path, P1 = breadth, P2 = depth, P3 = completeness.

B Representative Strategy Examples

B.1 Multi-Protocol Composability (Tier 2 — Composition)

A Tier 2 strategy exercised four intent types across two protocols in a single strategy:

Step 1: SUPPLY 0.5 WETH as collateral to Lending Protocol → 3 TXs (approve + supply + setCollateral), 270K gas
Step 2: BORROW 311.20 USDC at 30% LTV → 1 TX (borrow), 286K gas
Step 3: SWAP 155.60 USDC → 0.0749 WETH via DEX → 3 TXs (approve + approve_reset + swap), 213K gas
Step 4: LP_OPEN WETH/USDC range [1867–2282] → 4 TXs (approve + approve_reset + approve + lp_mint), 526K gas
Total: 11 transactions, 1.3M gas, zero bugs

All four verification layers passed on every intent. This was the first test of cross-protocol enrichment-data handoff — proving that the seams between components work, not just the components themselves.

B.2 Additional Foundation Examples (Tier 1)

Two Tier 1 scenarios illustrate the "obviously missing" signal. A basic DEX swap discovered that the protocol router configuration had no entry on ANY chain (compilation failed) and used a different interface version (8 vs. 7 parameters, causing silent reverts) — both fixed in the same iteration. Separately, the first-ever strategy on a new chain discovered a missing native-token symbol — a one-line omission that blocked ALL strategies on the entire chain.
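A coverage matrix like the one in Appendix A is straightforward to operationalize: represent each filled cell as a tested claim, and enumerate the empty cells as untested ones. A minimal sketch with abbreviated data (the dictionary shape and function names are our illustration, not the framework's API):

```python
# Coverage matrix as (protocol, chain) -> {intent: priority}.
# "-" cells from the table are simply absent; data abbreviated for illustration.
MATRIX = {
    ("Aerodrome", "Base"):   {"swap": "P0", "lp_open": "P0", "lp_close": "P0"},
    ("Aave V3", "Ethereum"): {"supply": "P0", "borrow": "P0",
                              "repay": "P0", "withdraw": "P0"},
    ("Curve", "Ethereum"):   {"swap": "P3"},
}
INTENTS = ["swap", "lp_open", "lp_close", "supply", "borrow", "repay", "withdraw"]

def untested_claims(matrix):
    """Every empty cell is an untested claim: enumerate them explicitly."""
    return [(proto, chain, intent)
            for (proto, chain), cells in matrix.items()
            for intent in INTENTS if intent not in cells]

def coverage(matrix):
    """Fraction of (protocol, chain, intent) cells that have a tested priority."""
    tested = sum(len(cells) for cells in matrix.values())
    return tested / (len(matrix) * len(INTENTS))
```

A scalar coverage figure plus an explicit list of empty cells is what makes "empty cells are untested claims" actionable for scenario selection.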
C Signal Platform Quality Gate Architecture

C.1 Verifier Families

Verifier Family | Signal Types Covered                                   | Data Source
Protocol TVL    | Discovery, TVL migration, undervalued protocol         | DeFi analytics API
Pool Yield      | Yield opportunity, LP opportunity, cross-DEX arbitrage | Pool analytics API
Lending Rate    | Lending arbitrage, utilization spike                   | Lending analytics API
Governance      | Governance catalyst, proposal tracking                 | Governance API
Exploit         | Exploit detection, exploit warning                     | Security feeds
Price Feed      | Whale alert, depeg risk                                | Price oracle API
Derivatives     | Funding extreme, OI divergence, liquidation cluster    | Derivatives aggregator
Emission        | Token unlock, emission change                          | Unlock calendars
Social          | Bribe opportunity, sentiment shift, narrative momentum | Social + on-chain
Stablecoin      | Stablecoin supply changes                              | Stablecoin analytics

Coverage: 72/72 signal types (100%) — up from 33/71 at iteration 1. The verifiers are not mocks — they call real APIs and cross-reference signal claims against live data. When The Graph deprecated its free hosted service, the loop detected the failure (agents producing 0 signals), diagnosed the root cause (stale subgraph URLs), and migrated to alternative data sources (GeckoTerminal, DefiLlama) — all autonomously.

C.2 Anti-Signal Canary Details

Tier 1 — Obviously Bad (6 canaries):
- Fabricated event for a non-existent entity
- Signal with an impossible metric value (e.g., 999,999% yield)
- False report for an entity that was never affected
- Signal with null/empty identifiers
- Signal with an empty title and description
- Signal with an out-of-range confidence score

Result: 100% caught across 163 iterations. Zero escapes.
Tier 2 — Shadow (5 canaries):
- Real event that already concluded
- Known trend reported weeks ago (stale)
- Opportunity that dropped below threshold since detection
- Valid signal for a deprecated/discontinued entity
- Accurate data from a non-authoritative source

Result: initially 33% caught by L1–L3 (iteration 1); now 100% (iteration 163) — improved as verifiers were added and L2 factual checkers gained cross-reference capabilities.

Tier 3 — Adversarial (5 canaries):
- Real data + fabricated interpretation
- Accurate raw data + incorrect derived conclusion
- Valid input data + wrong calculation
- Correct methodology applied to the wrong time window
- Signal mixing real and fabricated sub-claims

Result: initially 67% caught by L1–L3 (iteration 1); now 100% (iteration 163) — the remaining cases required cross-referencing conclusions against source data, which improved as verifier coverage reached 100%.

Tier 4 — Mixed True/False (5 canaries, added ~iteration 100):
- Signal with 3 true claims and 2 fabricated claims blended together
- Accurate quantitative data with a fabricated qualitative assessment
- Real protocol event attributed to the wrong protocol
- Valid historical data presented as current
- Correct analysis of real data with one inverted conclusion

Result: 100% caught. Tests the quality gate's ability to detect partial failures rather than binary pass/fail.

API Degradation Canaries (3 canaries):
- Simulated timeout from the primary data source
- Simulated error response from the price oracle
- Simulated partial data return (50% of expected fields)

Result: 100% resilience (3/3). Quality gates degrade gracefully — they flag signals as unverifiable rather than producing false passes when dependencies fail.
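The canary tiers above share one mechanism: inject signals with known defects alongside real ones, then verify the quality gate rejects every canary. A minimal sketch, covering only Tier 1 style checks (the Signal shape, field names, and thresholds are illustrative, not the released implementation):

```python
from dataclasses import dataclass

@dataclass
class Signal:
    entity: str
    title: str
    metric_pct: float   # e.g., a claimed yield, in percent
    confidence: float   # expected in [0, 1]
    is_canary: bool = False  # set when the harness injects a known-bad signal

def quality_gate(sig: Signal) -> bool:
    """Toy Tier 1 ("obviously bad") checks only.
    Real L1-L4 gates also cross-reference claims against live data."""
    if not sig.entity or not sig.title:
        return False            # null/empty identifiers
    if not 0.0 <= sig.confidence <= 1.0:
        return False            # out-of-range confidence score
    if sig.metric_pct > 10_000:
        return False            # impossible metric value (hypothetical cutoff)
    return True

def canary_escape_rate(signals: list[Signal]) -> float:
    """Fraction of injected canaries that passed the gate; should be 0.0."""
    canaries = [s for s in signals if s.is_canary]
    escaped = [s for s in canaries if quality_gate(s)]
    return len(escaped) / len(canaries) if canaries else 0.0

batch = [
    Signal("aave-v3", "Utilization spike", 12.0, 0.8),
    Signal("ghost-protocol", "Yield opportunity", 999_999.0, 0.9, is_canary=True),
    Signal("", "", 5.0, 0.5, is_canary=True),
]
assert canary_escape_rate(batch) == 0.0  # zero escapes on this toy batch
```

Shadow and adversarial tiers (2–4) additionally require cross-referencing claims against source data, which is why their catch rates improved only as verifier coverage grew.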
C.3 Key Lesson: Rapid Iteration Reveals Intermittent Bugs

One agent's validation failure rate fluctuated between 75% and 100% across iterations because its data-source API returns different formats at different times. A single test run would show "pass" or "fail" — 163 iterations reveal the ~85% steady-state rate. This class of bug, invisible to traditional CI, demonstrates the value of coverage through repetition, not just breadth.

C.4 Drift Detection at Scale

The drift-detection system monitors quality metrics over a sliding window (5 recent vs. 20 baseline iterations). The key insight: drift detection provides early warning before quality gates fail — a 5% drop in factual pass rate over 10 iterations triggers an alert before any individual signal fails hard. This is the mechanism that allows the loop to operate autonomously for 163+ iterations without human supervision.

D Skill Interface Reference

A Kitchen Loop deployment consists of domain-independent orchestration plus domain-specific skills. The following table summarizes the skill interfaces that a new deployment must implement:

Skill       | Phase   | Input              | Output                    | Domain-Specific?
backlog     | Backlog | Ticket state       | Promoted tickets          | Partially (labels)
ideate      | Ideate  | Scenario + spec    | Report + scenario         | Yes (defines "usage")
triage      | Triage  | Experience report  | Labeled, deduped tickets  | Partially (taxonomy)
execute     | Execute | Ranked tickets     | Branch + PR + tests       | Partially (test patterns)
pr-manager  | Polish  | Open PR list       | Reviewed, CI-passing PRs  | No
regress     | Regress | Codebase state     | Pass/fail + drift         | Yes (defines "oracle")
loop-review | Meta    | Logs + diffs       | Health + improvement      | No
review-meta | Meta    | All review reports | Trends + recommendations  | No

Skills marked "Yes" must be fully reimplemented for each new domain; the others transfer with minimal configuration (ticket labels, CI commands, review-tool selection).
E How to Cite

@misc{kitchenloop2026,
  title = {The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase},
  author = {Roy, Yannick},
  year = {2026},
  howpublished = {arXiv preprint},
  url = {https://github.com/0xagentkitchen/kitchenloop},
  note = {Companion repository with skills, oracles, and canary templates}
}
