The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase


Authors: Yannick Roy

0xAgentKitchen@gmail.com
March 2026

Abstract

Code production is now a commodity; the bottleneck is knowing what to build and proving it works. We present the Kitchen Loop,¹ a framework for autonomous, self-evolving software built on a unified trust model: (1) a specification surface enumerating what the product claims to support; (2) "As a User × 1000", where an LLM agent exercises that surface as a synthetic power user at ~1,000× human cadence; (3) Unbeatable Tests, ground-truth verification the code author cannot fake; and (4) Drift Control, continuous quality measurement with automated pause gates. We validate across two production systems over 285+ iterations, producing 1,094+ merged pull requests with zero regressions detected by the regression oracle (methodology in Section 6.1). We observe emergent properties at scale: multi-iteration self-correction chains, autonomous infrastructure healing, and monotonically improving quality gates. The primitives are not new; our contribution is their composition into a production-tested system with the operational discipline that makes long-running autonomous evolution safe.

[Figure 1: The Kitchen Loop: a six-phase autonomous improvement cycle. Phases: Backlog (groom queue) → Ideation (use the product) → Triage (findings → tickets) → Execution (branch, fix, PR) → Polishing (review, CI, merge) → Regression (oracle + drift).]

¹ Open-source: https://github.com/0xagentkitchen/kitchenloop

1 Core Contributions

This paper makes four claims:

1. "As a User" vs. task-completion. Agentic systems should not just close tickets — they should systematically exercise a product's specification surface the way a user would. The Kitchen Loop is a production-tested framework for this, grounded in synthetic user journeys rather than isolated issues.

2. The unified trust model.
Autonomous evolution requires a unified trust model. Three components interlock to make it safe: a specification surface (what the product claims to do), unbeatable tests (ground-truth verification the author can't fake), and a regression oracle with drift control and automatic pause gates.

3. Unbeatable tests. Correctness requires adversarial, multi-model review. Implementer-written tests are necessary but insufficient — in our deployments, 38 passing unit tests coexisted with complete feature failure. The Kitchen Loop enforces correctness through adversarial UAT gates (sealed test cards, fresh evaluator, zero context) and mandatory cross-model review: every PR is challenged by independent agents (Codex, Gemini, CodeRabbit) before merge. No output is accepted as-is from the model that wrote it.

4. Bounded production evidence. Across two production systems and 285+ iterations, the Kitchen Loop produced 1,094+ merged pull requests with zero regressions detected by the regression oracle (methodology in Section 6.1), monotonically improving quality gates (76–91% → 100%), and a cost of ~$0.38 per merged PR.
Key definitions used throughout:

Term | Definition
Specification surface | The enumerable set of capabilities a product claims to support — the input to coverage-exhaustion
Unbeatable test | A test that verifies outcomes against ground truth that the code author cannot fake
Regression oracle | A repeatable, bounded test that answers "is the system at least as good as before?"
Coverage-exhaustion mode | An operating regime where the agent systematically exercises every combination in the specification surface until coverage gaps approach zero
UAT gate | Adversarial user-acceptance testing by a fresh evaluator with zero implementation context
Drift control | Continuous measurement of quality trends with automated pause gates that halt the loop when metrics degrade

2 Executive Summary (TL;DR)

The method. An LLM agent uses the product as a synthetic power user at ~1,000× human cadence against its specification surface and beyond, validates through unbeatable tests, and controls drift and regression before accepting the work, ensuring the product self-evolves in the right direction (Figure 2).

[Figure 2: The unified trust model: each iteration flows through the full verification stack before proceeding. Pipeline: Specification Surface ("What does the product claim to do?"; N features × M platforms × K actions = coverage matrix) → As a User × 1000 (AaU1000; T1 Foundation 30%, T2 Composition 50%, T3 Frontier 20%; usage scenario + experience report + actionable tickets) → Unbeatable Tests (L1 Unit → L2 SDK → L3 Integration → L4 E2E; 4-layer: compile → execute → parse → state deltas, against ground truth) → Drift Control (regression oracle → drift metrics → pause gates; "Is the system at least as good as last iteration?") → next iteration.]

The architecture. A six-phase loop (Figure 1) with automated drift control and pause gates.
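The pause-gate idea mentioned above can be sketched minimally. The class name, window size, and tolerance below are illustrative assumptions, not the framework's actual interface:

```python
# Illustrative sketch of drift control with a pause gate; the metric, window,
# and tolerance values are hypothetical, not the Kitchen Loop's API.
from collections import deque

class DriftController:
    def __init__(self, window: int = 3, tolerance: float = 0.02):
        self.window = window        # consecutive degraded iterations before pausing
        self.tolerance = tolerance  # allowed drop below the best score seen so far
        self.best = None
        self.degraded = deque(maxlen=window)

    def record(self, quality_score: float) -> bool:
        """Record one iteration's quality score; return True if the loop may continue."""
        if self.best is None or quality_score > self.best:
            self.best = quality_score
        self.degraded.append(quality_score < self.best - self.tolerance)
        # Pause gate: halt when the last `window` iterations all degraded.
        return not (len(self.degraded) == self.window and all(self.degraded))

ctl = DriftController()
for score in [0.91, 0.93, 0.92, 0.88, 0.87, 0.86]:
    if not ctl.record(score):
        print("pause gate triggered at score", score)  # → pause gate triggered at score 0.86
        break
```

The key design point, per the text, is that the gate is a trend monitor rather than a binary pass/fail check: a single bad iteration does not halt the loop, but a sustained downward drift does.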
The results. Seven weeks, two production systems:

Kitchen Loop Results
Iterations | 285+
PRs merged | 1,094+
Tickets | 700+
Tests | 13,000+
Regressions | 0
Cost/PR | ~$0.38
Quality gates | L1 100%, L2 100%, L3 100% (from 76–91%)
Canary escapes (Tier 1) | 0 across all 163 signal iterations
Monthly cost | ~$350 (flat-rate subscriptions, both systems)
Iteration speed | 5 min (signals) to 80–230 min (execution)
Production incidents | 0

3 The Post-Commodity Code Thesis

3.1 Code Is No Longer the Hard Part

LLM-based coding agents can produce functional code at a rate that renders the writing step non-limiting. Robbes et al. (2026) find that 15–23% of mature open-source projects adopted coding agents within just nine months of tool availability [4]. A senior engineer's value is shifting from writing code to knowing which code to write, why, and how to prove it's correct.

Yet the productivity story is more nuanced than adoption rates suggest. Becker et al. (2025) conducted the most rigorous randomized controlled trial to date: experienced open-source developers using AI copilots were 19% slower than those working without AI, contradicting both developer self-estimates (+20% speedup) and expert forecasts (+38% speedup) [12]. He et al. (2025) provide causal evidence that Cursor adoption produces a large but transient velocity boost (+281% lines added in month one, dissipating by month three) accompanied by persistent quality degradation: +30% static analysis warnings, +42% cognitive complexity, and +7% code duplication [11]. The accumulated complexity feeds back into velocity — every doubling of complexity reduces future velocity by ~64.5% [11].
The Kitchen Loop can be read as a response to both findings: by shifting emphasis from code generation to specification and unbeatable verification, it addresses the quality degradation that erodes velocity gains and the lack of structural verification that leaves AI-assisted development slower than expected.

The same shift is happening to code review. Automated review tools can identify bugs, style violations, and security issues at a pace no human reviewer can match. The human reviewer's role shifts from "find the bug" to "judge whether this code serves the product's intent."

This creates a new bottleneck: specification and verification. The hard problems are now:

• What should the product do? (Specification)
• Does the product actually do it? (Verification)
• Is the product getting better or worse over time? (Drift control)

3.2 The Turing Test for Code

Alan Turing's insight was that you don't need to understand how a machine thinks — you need a test rigorous enough that passing it constitutes sufficient evidence of capability. The machine can remain a black box. The test is the arbiter.

We apply the same principle to AI-generated code. When an LLM agent writes a function, a connector, or a full feature, you face a choice:

• Option A: Read every line, understand every branch, verify the logic manually. This scales poorly when the agent produces thousands of lines per day.
• Option B: Build tests so rigorous that if the code passes them, you can trust it works — without needing to audit every line. The code becomes a black box. The tests are the Turing Test.

Empirical evidence supports both options coexisting in practice. Huang et al.
(2025) find that 69% of professional developers carefully review every agentic change and 75% read every line of AI-generated code — but crucially, developers working on unfamiliar tasks let agents drive implementation while monitoring program outputs rather than reviewing code line-by-line [2]. This validates Option B for the regime where the Kitchen Loop operates: autonomous agents on structured tasks with rigorous output verification.

The urgency for Option B is underscored by a growing QA crisis in AI-assisted development. Fawzy et al. (2025) find that 36% of practitioners using AI code generation skip quality assurance entirely, 18% place uncritical trust in AI output, and 10% delegate QA back to the same AI that wrote the code [1]. The result is predictable: 68% of practitioners characterize the output as "fast but flawed" [1]. Voluntary QA discipline is demonstrably insufficient; structural enforcement — tests that cannot be skipped — is the only reliable solution.

Option B requires a specific kind of test: not unit tests written by the same agent that wrote the code (the agent could pass its own tests trivially), but end-to-end verification against real-world state. For a DeFi SDK, this means executing transactions on chain forks and verifying balance changes. For a signal platform, this means cross-referencing claims against live data APIs. For a web application, this means browser automation against a real rendering engine. For any product, it means testing at the boundary where the software meets reality.

We call these unbeatable tests — tests that verify outcomes against ground truth that the code author cannot fake. The test doesn't care how the code achieves the result — only that it does.

It is of utmost importance to be aware of the weaknesses of LLMs when it comes to tests.
It is a fallacy to believe that because LLMs are good at writing code, they must be equally good at writing tests. They aren't. They often obsess over the test itself, forgetting that passing the test is not a goal in itself. That obsession with passing the test often leads to cheating: modifying the test, or even writing side scripts that nail the test while leaving the actual codebase untested.

4 Related Work

The Kitchen Loop exists within a rapidly evolving landscape of agentic software engineering. We position our contributions against several categories of related work.

4.1 Autonomous Coding Agents

SWE-agent (Yang et al., 2024) and AutoCodeRover (Zhang et al., 2024) demonstrated that LLM agents can resolve GitHub issues autonomously by navigating codebases, editing files, and running tests. OpenHands (Wang et al., 2024) and Devin (Cognition, 2024) extended this to multi-step workflows including environment setup, debugging, and deployment. RepairAgent (Bouzenia et al., 2024) focused specifically on automated program repair with LLM-driven fault localization.

These systems operate in task-completion mode: given an issue, produce a patch. The Kitchen Loop operates in coverage-exhaustion mode: given a specification surface, systematically exercise every combination until the gap between spec and reality approaches zero. The unit of work is not "resolve this issue" but "attempt this user scenario end-to-end and document everything that breaks." Task completion is a component of the loop (the Execute phase), not the loop itself.

4.2 Self-Improving and Self-Evolving Agents

SICA ("A Self-Improving Coding Agent", arXiv 2504.15228, 2025) introduced agents that edit their own prompts and tools based on execution feedback, achieving self-improvement without human intervention.
AlphaEvolve (DeepMind, 2025) demonstrated repository-scale code evolution using LLMs guided by automated evaluators. Confucius SDK proposed a meta-agent build-test-improve loop for SDK development.

The Kitchen Loop shares the self-improvement property (Section 13) — it discovers its own infrastructure bugs and updates its own skill files. The key difference is that self-improvement in the Kitchen Loop is anchored to a specification surface and regression oracle, not to an optimization objective. The loop doesn't optimize for a metric; it converges toward a specification. This prevents the "goodharting" failure mode where an agent optimizes a proxy metric while the product degrades in unmeasured dimensions.

4.3 Long-Running and Looping Patterns

Ralph Wiggum loops (Huntley, 2025; snarktank/ralph) demonstrated that autonomous agents can produce 1,000+ commits overnight in a continuous loop. Multi-agent SE frameworks like MetaGPT (Hong et al., 2024), ChatDev (Qian et al., 2024), and ALMAS assigned different agent roles (PM, architect, coder, tester) to simulate a software team.

The Kitchen Loop draws from this lineage but adds three structural elements that Ralph-style loops lack: (1) a production-parity ticket management layer (Linear) where human and AI tickets are treated identically, preventing the loop from diverging from human priorities; (2) automated drift control with pause gates that halt the loop when quality degrades; and (3) the three-tier strategy model (Foundation/Composition/Frontier) that ensures balanced coverage rather than random or greedy scenario selection.

4.4 Spec-Driven and Verification-Focused Approaches

SpecRover (Ruan et al., 2024) used natural-language specifications to guide automated debugging. AgileGen and Augmented Agile explored LLM-assisted requirements engineering and test generation from user stories. USEagent proposed user-story-driven end-to-end testing.
The Kitchen Loop's AaU1000 method is closest to USEagent's vision but differs in scope and trust model. USEagent generates tests from user stories; AaU1000 generates usage scenarios from specification surfaces and validates them through unbeatable tests (4-layer verification against ground truth). The distinction matters: a test generated from a user story proves the story is implemented; a usage scenario validated through state-delta verification proves the product actually works in the way a real user would experience it.

4.5 Vibe Coding and the QA Crisis

Fawzy et al. (2025) find a "speed-quality paradox" across 101 practitioner sources: 62% of vibe coders cite speed as their primary motivation, yet 68% characterize output as "fast but flawed" [1]. Huang et al. (2025) confirm that professional developers reject the vibe coding paradigm, carefully controlling agents through strategic plans — averaging only 2.1 execution steps per prompt despite plans spanning 70+ steps [2]. Their task suitability taxonomy identifies agents as unsuitable for business logic, domain knowledge, complex reasoning, and security-critical code — precisely the categories where the Kitchen Loop's spec surface and UAT gate provide structural guardrails.

4.6 Benchmark Contamination, Adoption, and Quality Evidence

The dominant evaluation benchmark, SWE-Bench, suffers from data contamination: Liang et al. (2025) show a 23-percentage-point accuracy gap between memorized and novel repositories [6], while Thai et al. (2025) find GPT-5 scores 65% on single-issue tasks but only 21% on multi-file evolution tasks [7]. Meanwhile, adoption is accelerating — Robbes et al. (2026) find 15–23% of mature open-source projects adopted coding agents within nine months [4] — but quality lags: 90.6% of agent-authored PRs receive zero human review [3], and ~40% of Copilot-generated code contains security vulnerabilities [10].
The Kitchen Loop addresses both problems: it rejects benchmark-style evaluation in favor of verification against live product state, and interposes adversarial verification between generation and merge.

4.7 Multi-Agent Design Patterns

Cai et al. (2025) systematically review 94 LLM-based multi-agent systems for software engineering, identifying 16 design patterns ranked by adoption frequency [8]. The Kitchen Loop composes seven of these patterns into a single integrated loop: Role-Based Cooperation (six-phase loop with distinct roles per phase), Self-Reflection (Polish and Regress phases), Cross-Reflection (UAT Gate), Debate-Based Cooperation (Discussion Manager), Voting-Based Cooperation (multi-model tribunal), Tool-Agent Registry (skills directory), and Agent Evaluator (regression oracle). Notably, three of these — Debate-Based Cooperation (4.3% adoption), Voting-Based Cooperation (3.2%), and Agent Evaluator (3.2%) — are among the least adopted patterns in the literature [8], suggesting the Kitchen Loop operationalizes architectural ideas the field has not yet widely implemented.

4.8 Our Differentiation

Table 1: Comparison of agentic software engineering approaches.

                 | Task-Completion         | Ralph Loops          | Self-Improving     | Kitchen Loop
Unit of work     | Issue → Patch           | Commit               | Optimization step  | Scenario → experience report
Stopping cond.   | Issue resolved          | Time/count           | Metric plateau     | Specification exhausted
Coverage         | Reactive (given issues) | Random/greedy        | Objective-guided   | Three-tier (F/C/F)
Quality gate     | CI passes               | CI + typecheck       | Evaluator function | 4-layer + multi-model tribunal
Drift control    | None                    | Basic (progress log) | Implicit (metrics) | Explicit (oracle + pause gates)
Human role       | PR review               | Manual oversight     | None               | Shared backlog, ticket parity
Self-improvement | None                    | Manual (AGENTS.md)   | Core feature       | Anchored to spec + oracle

From our analysis, three distinct operating regimes for agentic software engineering emerge:

Table 2: Three operating regimes for agentic software engineering.

Regime              | Unit of Work   | Stopping Cond.     | Verification Target   | Failure Mode
Task-completion     | Issue          | Issue resolved     | Patch correctness     | Local fix / global mismatch
Metric-optimization | Objective step | Metric plateau     | Proxy metric          | Goodharting
Coverage-exhaustion | User scenario  | Surface exhaustion | User-visible behavior | Drift / incomplete surface

The Kitchen Loop operates in the coverage-exhaustion regime. The regime determines the system's failure mode: task-completion risks local fixes that break global behavior; metric-optimization risks Goodharting on proxy measures; coverage-exhaustion risks drift or incomplete specification surfaces, which the drift control mechanisms in Section 7 are designed to address.

5 The "As a User x 1000" Method (AaU1000)

5.1 The Core Insight

The most reliable signal about whether a feature works is a real usage attempt — not a unit test, not a code review, but someone exercising the product end-to-end. The problem: human usage attempts are slow and expensive.

The AaU1000 method replaces the "someone" with an LLM agent that:

1. Selects a realistic usage scenario derived from the product's specification surface
2. Attempts it as a real user would — writing code, running it, observing what breaks
3. Documents failures as actionable tickets — not vague observations but precise bug reports with reproduction steps, root cause hypotheses, and file-level pointers
4. Fixes those failures immediately — implementing the fix, writing tests, and shipping a PR
5. Watches for regression and drift — verifying nothing else broke and the codebase is not getting worse

And then does it again. At whatever cadence the infrastructure allows.

5.2 Quantifying the 1000x

The "1000x" is an order-of-magnitude claim, not a precise multiplier. We ground it empirically:

Single-thread velocity: A senior engineer realistically ships ~15-25 merged PRs per month (bug fixes, features, tests). Each Kitchen Loop instance runs single-threaded per product. The strategy framework produced 728+ merged PRs in ~5 weeks (~145/week); the signal platform produced 366 in 17 days (~150/week). Per-system, this is a ~24-48x single-thread throughput increase depending on domain iteration speed (the signal platform iterates ~25x faster than the SDK). Combined across both systems running concurrently, total output was 1,094+ merged PRs.

Scenario throughput: The DeFi strategy framework's loop completes one full usage scenario (ideate → implement → test → fix → regress) in 80-230 minutes. A human attempting the same scenario takes 1-3 days. This is a 7-25x per-scenario speedup.

Parallelization potential: The loop currently runs single-threaded per product. Running N loops on N products (as demonstrated with SDK + Edge concurrently) scales linearly. With parallel execution infrastructure (dynamic port allocation, containerized test environments), a single product could run N loops exploring different specification regions simultaneously. The 1000x represents the achievable ceiling with moderate parallelization — not the current single-thread reality.
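The single-thread arithmetic can be reproduced from the reported figures. The PR counts and the 15-25 PRs/month human baseline come from the text; the week-to-month conversion and rounding are ours, so the computed range is a ballpark, not the paper's exact multiplier:

```python
# Worked arithmetic behind the single-thread velocity claim, using the paper's
# reported figures; the human baseline of 15-25 merged PRs/month is the paper's
# estimate, and the conversion from weeks to months is our assumption.
human_prs_per_month = (15, 25)

loop_prs_per_week = {
    "strategy framework": 728 / 5,        # 728+ PRs in ~5 weeks
    "signal platform": 366 / (17 / 7),    # 366 PRs in 17 days
}

for system, weekly in loop_prs_per_week.items():
    monthly = weekly * 52 / 12  # average weeks per month
    lo = monthly / human_prs_per_month[1]
    hi = monthly / human_prs_per_month[0]
    print(f"{system}: ~{weekly:.0f} PRs/week, roughly {lo:.0f}-{hi:.0f}x a single engineer")
```

Both systems land in the same order of magnitude as the paper's ~24-48x range, which is the point of the "order-of-magnitude, not precise multiplier" framing.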
5.3 Spec-Driven, Not Spec-Speculative

The method is scoped to what exists and is expected to work today — not speculative feature engineering, benchmarking, or user research.

5.4 The "Obviously Missing" Signal

The most important class of findings is the "obviously missing" signal: features any competent user would expect, but that don't work. These are objectively verifiable failures — a function returning AttributeError, a missing configuration entry, a gas cap blocking all transactions on a chain. The loop discovers verifiable failures, not speculative issues.

5.5 The Three-Tier Strategy Model

The Kitchen Loop formalizes a three-tier scenario generation model that ensures balanced coverage across the product's maturity spectrum:

• T1 Foundation (30%) — "Does the basic stuff work perfectly?"
• T2 Composition (50%) — "What breaks when we combine things?"
• T3 Frontier (20%) — "What's missing for the next generation?"

Tier 1 — Foundation (30%) exercises a single feature on a single platform or configuration. One integration, one action, happy path. These scenarios should be trivially achievable by a new user in a few minutes. If anything goes wrong, it is a critical regression. Foundation iterations maintain the baseline: the easy stuff must always be bulletproof.

Tier 2 — Composition (50%) combines two or more features in creative ways. Multi-service flows, indicator-driven behavior, multi-step workflows, cross-platform configuration stress tests. The goal is to find bugs at the seams between components that pass individually but fail in combination. This is where the majority of novel bug discovery happens, because the combinatorial space of feature pairs and triples is far larger than the space of individual features.

Tier 3 — Frontier (20%) deliberately reaches beyond the product's current capabilities.
The deliverable shifts from "working scenario" to "gap analysis": what would you need to build this? What's missing? What would it unlock? The experience report emphasizes specific missing features, ordered by implementation effort and user value.

5.6 The Self-Expanding Property

The three tiers form a self-reinforcing growth cycle. This is the "secret sauce" that transforms a random agent into a strategic engineer:

Iteration N: Features = {A, B}
  T1 (Foundation): test A, test B (2 scenarios)
  T2 (Composition): test A+B, test B+A (2 scenarios)
  T3 (Frontier): "I need C to do A+C" → gap analysis (1 gap report)
— C gets built —
Iteration N+k: Features = {A, B, C}
  T1 (Foundation): test A, test B, test C (3 scenarios)
  T2 (Composition): test A+B, A+C, B+C, A+B+C (4+ scenarios)
  T3 (Frontier): "Now I want D to do A+C+D" (1 gap report)

Foundation grows linearly with each new feature. Composition grows superlinearly — each new feature can be combined with every existing feature and pair. Frontier pushes the boundary outward. The loop evolves the product by systematically testing what exists and identifying the next most valuable capability.

Design Principles

P1. Ground-Truth Verification. Trust in AI-generated code should be proportional to how ground-truth-verifiable the test outcomes are, not to test coverage percentage.

P2. Weakest-Evaluator. A user test is only trustworthy if a minimally capable evaluator can execute it. (§4.8)

P3. Spec-Anchored Improvement. Self-improving agents should optimize toward specification satisfaction, not proxy metrics. (§2.2, §3.5)

P4. Drift-Before-Failure. Autonomous loops need continuous trend monitoring, not only binary quality gates. (§5.5)
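The self-expanding growth cycle can be sketched numerically. Counting unordered feature subsets of size 2 or more as candidate composition scenarios is our assumption (the iteration example in the text counts ordered pairs when only two features exist); the text prescribes the growth pattern, not an exact formula:

```python
# Sketch of linear Foundation growth vs. superlinear Composition growth as the
# feature set expands. The subset-counting rule is our illustration, not the
# paper's formula.
from math import comb

def scenario_counts(num_features: int) -> tuple[int, int]:
    foundation = num_features  # one happy-path scenario per feature
    # every subset of 2+ features is a candidate composition scenario
    composition = sum(comb(num_features, k) for k in range(2, num_features + 1))
    return foundation, composition

for n in [2, 3, 5, 8]:
    f, c = scenario_counts(n)
    print(f"{n} features -> {f} foundation, {c} composition scenarios")
```

At 3 features this gives the "4+ scenarios" of the worked example (A+B, A+C, B+C, A+B+C), and by 8 features composition scenarios outnumber foundation scenarios roughly 30 to 1, which is why Tier 2 is where most novel bug discovery happens.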
6 Unbeatable Tests: The Multi-Tier QA Framework

6.1 Methodology: What "Zero Regressions" Means

Throughout this paper, we report "zero regressions introduced by loop-merged code." This claim requires precise definition to be falsifiable:

• Detection method: Every iteration ends with an automated regression oracle run (Section 5.2). For the DeFi strategy framework, this means executing demo strategies on Anvil chain forks and verifying 4-layer state deltas. For the signal platform, this means running the full quality-gate pipeline (structural, factual, temporal, cognitive) plus anti-signal canaries.
• Measurement window: Continuous — every iteration, not sampled. The oracle runs after every merge, not on a periodic schedule.
• Scope: The claim covers regressions detected by the oracle in code merged through the loop's own process. It does not claim the codebase is bug-free, nor that latent regressions undetectable by the oracle do not exist. The oracle's coverage is bounded by its test suite (10,913 unit tests, 62 demo strategies, 77 signal verifiers).
• Exclusions: Environmental failures (API timeouts, chain fork instability) are distinguished from code regressions by correlation with recent merges. A failure that reproduces on the pre-merge commit is environmental; a failure introduced by a specific PR is a regression.

This definition makes the claim auditable: any third party with access to the oracle suite and git history can reproduce the measurement.

6.2 Why Tests Are the New Competitive Advantage

Gao et al. (2025) find functional bugs in 78% of studies on AI-generated code, with ~40% of Copilot code containing security vulnerabilities [10]. Dominant benchmarks (HumanEval, MBPP) rely on Pass@k metrics that overlook semantic correctness [10], and Liang et al. (2025) show SWE-Bench performance is inflated by data contamination (76% accuracy on memorized paths vs.
53% on novel repositories) [6]. Unbeatable tests sidestep both failure modes: they verify against live product state that changes with every deployment, making both contamination and pass-rate gaming irrelevant. This requires tests at multiple tiers, each catching a different class of defect that lower tiers miss.

6.3 The 4-Level Testing Pyramid

• Level 4: E2E Scenario Tests — scenario → actions → execution → state (full user journey)
• Level 3: Integration Tests — compile, execute, verify against ground truth (real execution)
• Level 2: API/Adapter Tests — individual method tests with real dependencies (API contracts)
• Level 1: Unit Tests — isolated, mocked, fast; pure function correctness (logic validation)

Level | What It Tests                | Speed      | Trust Level                         | Written By
L1    | Isolated logic, calculations | Fast (ms)  | Low — proves logic, not integration | Agent or human
L2    | API methods, adapters        | Medium (s) | Medium — proves API contracts       | Agent or human
L3    | Full execution pipeline      | Slow (min) | High — proves real-world behavior   | Loop + agent
L4    | Complete user journeys       | Slowest    | Highest — proves product works      | Loop (AaU1000)

The critical insight: L1 and L2 are necessary but not sufficient. A function can pass all its unit tests and still fail in production because the unit tests don't exercise the real execution environment. L3 and L4 are the unbeatable tests — they verify against ground truth (real state, real APIs, real execution) that the code author cannot fake.

6.4 The 4-Layer Verification Pattern

Every L3 integration test implements four verification layers, each catching a different class of defect. We illustrate with the DeFi strategy framework (Case Study A, Section 10); other domains substitute their own verification layers (e.g., browser automation for web apps, API contract testing for backends — see Section 16 for adaptation examples).
Layer | Name           | What It Catches                                | How
1     | Compilation    | Wrong params, missing config, type errors      | Build/compile the action, assert success
2     | Execution      | Runtime failures, permission errors, timeouts  | Execute against real env, assert success
3     | Output Parsing | Wrong decoding, missing fields, malformed data | Parse output, assert expected data extracted
4     | State Deltas   | Silent failures, partial exec, wrong outcomes  | Measure state before/after, assert exact deltas

A test that only compiles is incomplete. A test that compiles and executes but doesn't check state deltas is dangerously incomplete — it could silently succeed while doing the wrong thing. Layer 4 is what makes the test unbeatable: it verifies the outcome, not just the execution.

For failure-mode tests: 3 layers are required (compilation, execution, state deltas), and state conservation MUST be asserted (state before and after remain unchanged). The system must not silently lose assets when it fails.

6.5 The Coverage Matrix: Exhaustive by Design

The specification defines a coverage matrix: every combination of feature, platform, and action type that the product claims to support. The test suite's goal is to fill every cell in this matrix.

For a product with N features, M platforms, and K action types, the matrix has N × M × K cells. Each cell represents a claim: "Feature X works on Platform Y for Action Z." Each empty cell is an untested claim — a place where the product might silently fail.

The Kitchen Loop fills these cells systematically, prioritizing by risk:

1. P0: Core features on primary platforms (table stakes)
2. P1: Core features on secondary platforms (breadth)
3. P2: Advanced features on primary platforms (depth)
4. P3: Edge cases and aggregators (completeness)

No manual QA process can fill a matrix with hundreds or thousands of cells on a weekly cadence. The Kitchen Loop can, because it generates and runs tests at AI speed.
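The coverage matrix and its risk-based prioritization can be sketched directly. The feature, platform, and action names below are hypothetical, and the core/primary classification is our illustration of the P0-P3 bucketing described in the text:

```python
# Sketch of the N x M x K coverage matrix with P0-P3 risk prioritization.
# Feature/platform/action names and their core/primary labels are invented
# for illustration.
from itertools import product

features = {"swap": "core", "lend": "core", "aggregate": "advanced"}
platforms = {"ethereum": "primary", "arbitrum": "secondary"}
actions = ["deposit", "withdraw"]

def priority(feature: str, platform: str) -> str:
    core = features[feature] == "core"
    primary = platforms[platform] == "primary"
    if core and primary:
        return "P0"  # table stakes: core on primary
    if core:
        return "P1"  # breadth: core on secondary
    if primary:
        return "P2"  # depth: advanced on primary
    return "P3"      # completeness: everything else

# Every cell is a claim: "Feature X works on Platform Y for Action Z."
matrix = {(f, p, a): priority(f, p) for f, p, a in product(features, platforms, actions)}
assert len(matrix) == len(features) * len(platforms) * len(actions)  # N x M x K cells
```

In the loop, each cell would be marked filled only after a scenario exercises it end-to-end; the unfilled cells ordered by priority form a natural scenario queue for the Ideation phase.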
6.6 Anti-Signal Canaries: Testing the Tests

For systems where outputs are non-deterministic (signals, recommendations, generated content), the Kitchen Loop uses anti-signal canaries: intentionally crafted bad inputs injected alongside real ones to verify the quality gate catches what it should. Four tiers of increasing deceptiveness:

Tier | Name          | Description                                                                  | Expected | Observed (iter 163)
1    | Obviously Bad | Glaring structural or factual errors any gate should catch                   | 100%     | 100% (0 escapes / 163 iters)
2    | Shadow        | Factually true but stale, low-novelty, or below threshold                    | 50–80%   | 100% (from 33% at iter 1, fixed iter 124)
3    | Adversarial   | Real data with wrong conclusions — fools deterministic and LLM checks        | 30–60%   | 100% (from 67% at iter 1)
4    | Mixed T/F     | Blend of valid and fabricated data in one signal — partial-failure detection | 20–50%   | 100% (added iter ~100, fixed iter 124)

Additionally, 3 API degradation canaries test resilience to partial external API failures (e.g., a data source returning errors or timeouts). These verify the quality gates degrade gracefully rather than producing false passes when dependencies fail.

The canary system provides a known-bad baseline that the regression phase can measure against. Tier 1 canary escapes are treated as a critical warning signal for operator review. The monotonic improvement across all tiers — from partial catch rates in early iterations to 100% across all 4 tiers by iteration 124 — demonstrates that the loop's quality infrastructure improves alongside the product it protects.

A notable finding from the Edge deployment: Tier 2 catch rates were stuck at 33% for 70+ iterations. Early attempts to improve them via the L4 LLM tribunal (GPT/Claude/Gemini judgment) produced
The durable fix came from encoding failure patterns as deterministic rules (e.g., STALE_NARRATIVES, STALE_CATALYSTS), achieving a 100% catch rate that held for 40+ subsequent iterations. This validates Design Principle P1 (Ground-Truth Verification): for safety-critical gates, deterministic verification outperforms probabilistic LLM judgment.

6.7 Multi-Model Review Tribunals

For decisions that require judgment (architectural choices, ambiguous test results, code quality assessment), the Kitchen Loop uses multi-model tribunals: three independent AI reviewers evaluate the same artifact in parallel. Findings are synthesized with consensus classification:

• Consensus (all three agree): treated as a confirmed finding
• Majority (two agree): treated as a likely finding, prioritized for action
• Solo (one reviewer): flagged for human judgment

This reduces the false-positive rate of any single model and provides higher-confidence assessments for critical decisions. Section 9 extends this concept into a full structured deliberation system — the Discussion Manager — with multi-round debate, epistemic safeguards against sycophancy, and empirical validation across 23 production discussions.

6.8 The Adversarial UAT Gate: "How Would a User Test This?"

Unit tests prove code works; they don't prove features work for users. When the same agent that implements a feature also tests it, three failure modes emerge:

1. Happy-path blindness — the implementer only tests the case they built for
2. Context leakage — the agent "knows" the implementation and unconsciously compensates for gaps
3. Cheating — AI models optimize for green checks, not product truth. They write side scripts, mock data, and reinterpret assertions to force a pass

The solution is not "ask the model to test honestly" but to design the loop so honesty is the easiest behavior and cheating is mechanically visible.
The UAT Gate implements this through a three-step adversarial protocol:

6.9 Step 1: "How would a user test this?"

After implementing a ticket and creating a PR, the implementing agent must write a sealed test card — a step-by-step recipe that any user could follow to verify the feature works. The test card format is strict:

• Every step has an exact command (no placeholders requiring judgment)
• Every step has an exact expected exit code and exact output assertions (not "should work" — specific strings)
• At least one step verifies that bad input is rejected (not just the happy path)
• No manual code edits — if testing requires configuration, the implementer ships a fixture in the PR
• No implementation details — user-visible behavior only

This is the "as a user, how can I test this?" forcing function. If the implementer cannot write a card that demonstrates the feature working end-to-end, that's a signal the feature isn't done — even if all unit tests pass.

6.10 Step 2: "Clear session — remove biases"

A fresh agent is spawned with zero implementation context to execute the test card:

• Information wall: The evaluator receives only the test card. No diff, no ticket, no code context, no conversation history from the implementing agent.
• Weakest model: The evaluator uses the weakest available model (e.g., Haiku) as a "dumb user" proxy. A strong model compensates for bad test cards by "figuring out" what the feature does. A weak model fails if the card is incomplete — which is exactly the signal you want. If the weakest model can follow the card and get expected results, a real user can too.
• Isolated worktree: The evaluator runs in a clean git worktree — a separate copy of the repository with pristine state.
• Read-only mandate: The evaluator is explicitly forbidden from editing any product file. Its prime directive is to disconfirm the feature, not confirm it.
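The Step 1 card-format rules lend themselves to mechanical validation before any evaluator is spawned. A sketch, with an assumed card schema that is not the framework's actual one:

```python
# Test-card validator sketch. The card dict layout ("steps", "command",
# "expected_exit", "expected_output", "expects_rejection") is a hypothetical
# schema chosen to mirror the rules in the text above.
import re

def validate_test_card(card):
    """Return a list of rule violations; an empty list means the card is sealed-ready."""
    problems = []
    for i, step in enumerate(card["steps"], 1):
        # Rule: exact commands only -- no <placeholders> or TODO markers.
        if re.search(r"<[^>]+>|\bTODO\b", step["command"]):
            problems.append(f"step {i}: placeholder in command")
        # Rule: every step needs an exact exit code and output assertion.
        if "expected_exit" not in step:
            problems.append(f"step {i}: missing exact exit code")
        if not step.get("expected_output"):
            problems.append(f"step {i}: missing output assertion")
    # Rule: at least one step must prove bad input is rejected.
    if not any(s.get("expects_rejection") for s in card["steps"]):
        problems.append("no step verifies that bad input is rejected")
    return problems

card = {"steps": [
    {"command": "mytool submit --job demo.json", "expected_exit": 0,
     "expected_output": '"status": "queued"'},
    {"command": "mytool submit --job /dev/null", "expected_exit": 1,
     "expected_output": "invalid job spec", "expects_rejection": True},
]}
print(validate_test_card(card))  # [] -- card passes
```

A card that fails this check is rejected before the evaluation agent ever runs, which is cheaper than discovering the ambiguity mid-evaluation.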
6.11 Step 3: "Do it — show me the results"

The evaluator executes every step and produces structured evidence: raw command output, exact exit codes, actual-vs-expected comparisons. Then the implementing agent performs a mechanical integrity check:

Three-Layer Anti-Cheating

1. Information wall. The evaluator gets ONLY the test card. No diff, no ticket, no implementation context.
2. Mechanical integrity check. After evaluation: git diff on the UAT worktree. Any product file modification = EVAL_CHEAT_FAIL. Any untracked files outside the evidence dir = FAIL.
3. Evidence structure. Every step: raw command output + exit codes. Missing steps = failure. "It worked" without output = failure.

The verdict taxonomy makes the outcome actionable:

Verdict | Meaning | Action
PASS | Feature works from the user's perspective | Proceed, attach evidence to PR
PRODUCT_FAIL | Feature is broken | Keep ticket open, tag PR uat-failed
UAT_SPEC_FAIL | Test card is ambiguous or un-runnable | Log for process improvement, don't block
EVAL_CHEAT_FAIL | Evaluator modified product files | Serious process issue, flag for human review

6.12 Real-World Example: The Backtest Service Gap

This example illustrates exactly why the UAT gate exists. An agent implemented a backtest HTTP service with 38 passing unit tests covering HTTP routing, job lifecycle state machines, model serialization, and capacity limits. All tests green. Lint passes. PR created. Ticket moved to "In Review." But nobody ever actually started the service and sent a request to it.

The real smoke test — curl -X POST http://localhost:8000/api/v1/backtest followed by polling for completion — would have revealed that PnLBacktester() was instantiated with no constructor arguments. The data providers, fee models, and slippage models were never wired up. The service accepted jobs and immediately failed them. 38 unit tests passed. The feature was completely broken.
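Stepping back to the mechanics: the Step 3 integrity check reduces to two git queries on the UAT worktree. A sketch, where the helper name and the CLEAN verdict are illustrative while EVAL_CHEAT_FAIL follows the paper's taxonomy:

```python
# Mechanical integrity check sketch: verdict depends only on the worktree's
# observable git state, so the evaluator cannot talk its way past it.
import subprocess

def integrity_check(worktree, evidence_dir="evidence"):
    """Return EVAL_CHEAT_FAIL if any product file was modified or any
    untracked file exists outside the evidence directory, else CLEAN."""
    modified = subprocess.run(
        ["git", "-C", worktree, "diff", "--name-only"],
        capture_output=True, text=True, check=True).stdout.split()
    untracked = subprocess.run(
        ["git", "-C", worktree, "ls-files", "--others", "--exclude-standard"],
        capture_output=True, text=True, check=True).stdout.split()
    stray = [p for p in untracked if not p.startswith(evidence_dir + "/")]
    if modified or stray:
        return "EVAL_CHEAT_FAIL"
    return "CLEAN"
```

Because the check is a pure function of `git` output, it is cheap enough to run after every evaluation and impossible for either agent to reinterpret.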
With the UAT gate, the implementer must write a test card whose steps include starting the service, submitting a job, polling for completion, and verifying the results contain data. The Haiku evaluator runs this card blindly. Step 3 returns "status": "failed". Verdict: PRODUCT_FAIL. The ticket stays open until the agent actually wires up the backtest pipeline.

The implementer can't dodge this:
- Can't write "run the unit tests" as the card — validation rejects it (not user testing)
- Can't test only HTTP routing — the card rules require testing the actual user journey
- Can't weaken assertions — "Expected output contains: completed" is binary
- Can't skip the gate — it is mandatory for any change to product code

In autonomous loops where 90.6% of agent-authored PRs receive zero human review [3], the UAT gate fills this gap mechanically.

7 Controlling Regression and Drift

7.1 The Drift Problem

A self-evolving codebase faces a unique risk: quality drift. Each iteration produces code that passes its own tests — but does the accumulation of changes make the system better or worse? Without continuous measurement, the answer is unknowable until a user hits a regression in production.

Shukla et al. (2025) quantify this risk: iterative LLM code refinement paradoxically degrades security, with vulnerabilities rising from 2.1 per sample in early iterations to 6.2 by iterations 8–10 — a 37.6% increase in critical vulnerabilities after just five iterations [9]. He et al. (2025) corroborate the drift risk from a different angle: Cursor adoption produces persistent quality degradation (+30% static analysis warnings, +42% cognitive complexity) that outlasts the transient velocity gains [11].

The Kitchen Loop treats regression control as a first-class concern, not an afterthought.
Every iteration ends with a regression phase that answers: "Is the system at least as good as it was before this iteration?"

7.2 The Regression Oracle

Each product domain requires a regression oracle — a repeatable test that answers "is the system still working?" in bounded time. The oracle's properties:

• Deterministic: Same inputs produce the same pass/fail on the same codebase
• Comprehensive: Covers the product's critical paths
• Fast enough: Must complete within the iteration budget
• Independent of the loop: The oracle tests the product, not the loop's own output

Mode | Duration | Coverage | When to Use
Full | 120–150 min | All scenarios, all platforms | Scheduled weekly
Quick | 30–40 min | One scenario per platform | Every iteration

7.3 The Blocked Combos Registry

A critical operational tool is the Blocked Combos registry — a machine-readable list of feature/platform combinations that are known-broken and should not be re-tested by the ideation phase. This prevents the loop from wasting iterations on known-broken paths. When a blocking ticket is resolved, the combo is removed and becomes available for ideation again. The registry grows and shrinks as bugs are found and fixed, creating a living map of the product's actual (not assumed) capability surface.
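A minimal sketch of such a registry follows; the class and method names are hypothetical, not the framework's API:

```python
# Blocked Combos registry sketch: known-broken (feature, platform) pairs are
# keyed by the ticket that blocks them, so resolving the ticket frees the combos.

class BlockedCombos:
    def __init__(self):
        self._blocked = {}  # (feature, platform) -> blocking ticket ID

    def block(self, feature, platform, ticket):
        self._blocked[(feature, platform)] = ticket

    def unblock(self, ticket):
        """When a blocking ticket is resolved, its combos become eligible again."""
        self._blocked = {k: t for k, t in self._blocked.items() if t != ticket}

    def eligible(self, combos):
        """Filter the ideation candidates down to combos not known-broken."""
        return [c for c in combos if c not in self._blocked]

reg = BlockedCombos()
reg.block("swap", "solana", ticket="BUG-412")
combos = [("swap", "solana"), ("swap", "base"), ("lp_open", "solana")]
print(reg.eligible(combos))  # [('swap', 'base'), ('lp_open', 'solana')]
reg.unblock("BUG-412")
print(len(reg.eligible(combos)))  # 3
```

Keying entries by ticket ID is what makes the registry shrink automatically: the triage phase only has to close the ticket, not remember which cells it was masking.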
7.4 Drift Metrics

The loop tracks several metrics across iterations to detect drift:

Metric | Healthy Trend | Warning Signal
Test count | Growing or stable | Declining
Pass rate | Stable at >95% | Declining over 3+ iterations
Bug discovery rate | Declining (maturity) | Sudden spike (regression)
Oracle pass rate | 100% | Any failure correlated with recent changes
Blocked combos | Declining | Growing without corresponding fix tickets
Canary escape rate | 0% for Tier 1 | Any Tier 1 escape

The Regress phase verifies iteration-history completeness before drift analysis; missing rows are backfilled automatically, since gaps would silently defeat sliding-window trend detection. A complementary cross-PR interaction detector identifies structural regression patterns — files added in one PR and deleted by another, revert commits, and high-churn files modified by 3+ independent merges — that the regression oracle cannot catch because they span multiple PRs.

7.5 Automated Pause Gates

Five automated gates determine whether the loop should continue or pause for human review:

Gate | Trigger | Response | Auto?
Regression Failure | Oracle pass rate drops below threshold | Pause after N consecutive failures (default 3) | Semi
Canary Escape | Tier 1 canary passes quality gate | Warn operator | Advisory
Drift Threshold | Quality metric declines 3+ consecutive iters | Warn operator | Advisory
Backpressure | Open PRs exceed threshold | Enter drain mode (polish-only) | Yes
Starvation | Execute starved N consecutive iters | Monitor-only, alert human | Yes

Drain Mode: When open PRs exceed a threshold (default: 10), the loop automatically enters drain mode — skipping all phases except Polish and increasing the PR processing limit. When PRs drop below the exit threshold (default: 5), normal operation resumes. The original phase configuration is saved and restored, so drain mode is transparent to the running loop.
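The backpressure gate's enter/exit behavior is a small hysteresis loop; a sketch using the default thresholds from the text (the function name is illustrative):

```python
# Drain-mode hysteresis sketch: enter at the high watermark (>10 open PRs),
# exit at the low one (<5). The band between the two prevents mode flapping.

def drain_state(open_prs, draining, enter_at=10, exit_at=5):
    """Return True if the loop should be in drain (polish-only) mode."""
    if not draining and open_prs > enter_at:
        return True       # too many open PRs: stop producing, start merging
    if draining and open_prs < exit_at:
        return False      # backlog drained: resume normal phases
    return draining       # inside the hysteresis band: keep the current mode

draining = False
for prs in [8, 11, 9, 7, 4]:
    draining = drain_state(prs, draining)
    print(prs, "drain" if draining else "normal")
# 8 normal / 11 drain / 9 drain / 7 drain / 4 normal
```

Note that 9 and 7 keep the loop in drain mode even though they are below the entry threshold; a single cutoff would instead oscillate between modes on every PR merge.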
This prevents the failure mode where the loop produces PRs faster than it can merge them, causing an unbounded backlog.

Starvation Gate: When the Execute phase produces zero output for N consecutive iterations (default: 10), the loop transitions to monitor-only mode and alerts the operator. This gate was validated empirically in the Edge deployment (Section 11.5): at iterations 112–127, all remaining work required changes in an external dependency (SDK/Python) unreachable from the Edge (TypeScript) codebase. The circuit breaker fired 11 times, and the loop correctly recommended stopping. When the dependency blockers were resolved at iteration 128, the loop immediately resumed productive work (14 PRs in one batch). The starvation gate validates the spec-anchored design: when the specification surface is fully covered, the loop stops rather than inventing work.

Three control counters — starvation, drain entries, and no-work loops — are persisted to disk, so the loop retains its operational position across process restarts.

Of the five gates, drain mode and starvation are fully automatic. Regression failure halts only after consecutive iteration failures exceed a configurable threshold (default: 3). Drift and canary escape currently warn rather than pause — making them advisory gates that rely on operator attention.

These gates ensure the loop cannot degrade the product faster than it improves it. The loop is allowed to run autonomously because the gates provide a safety net.

8 System Architecture

8.1 The Six-Phase Loop

The Kitchen Loop is orchestrated by a shell framework that manages state transitions, git worktree isolation, and error recovery. The core logic is delegated to specialized AI skills, each encoding a repeatable, autonomous workflow.
(Figure: the six-phase loop — Backlog (groom queue, fill pipeline) → Ideation (select spec, attempt usage) → Triage (findings to tickets) → Execution (branch, fix, test, PR) → Polishing (review, CI, merge) → Regression (oracle + drift measurement).)

Backlog (~15 min): Evaluates urgency and coverage gaps, promotes candidates to the work queue, generates new scenario tickets when supply runs low.

Ideate (~15–45 min): Selects a scenario, implements it as a real user would, runs it against the test environment, documents what breaks. Optionally passes through an external feasibility check.

Triage (~5–10 min): Converts findings into labeled, prioritized tickets with root-cause hypotheses and file pointers. Deduplicates against existing tickets, and reopens tickets whose prior fix PRs were closed without merging.

Execute (~30–60 min): For each top-N urgent ticket: creates a feature branch in an isolated worktree, implements the fix, writes tests, opens a PR. Includes backpressure control.

Polish (~10–90 min): PR hardening and merging through a graduated state machine. Each PR is tracked with an attempt counter; after a configurable number of failures (default: 1), it is labeled needs-attention and excluded from future processing. Operators can raise the threshold to enable multi-attempt escalation with follow-up ticket creation and failure classification. PRs blocked by security or architectural concerns are retired — closed with a comment routing the ticket back to the backlog. Before any merge, a TOCTOU-safe deletion check verifies no files were unexpectedly removed.

Regress (~40–150 min): Runs the regression oracle (read-only — no fixes). Updates loop state directly on the base branch. Promotes confirmed patterns to durable memory. Produces an iteration summary with drift metrics. Before updating shared state files (e.g., loop-state.md), the Regress phase re-reads the current version to prevent stale overwrites from interrupted or concurrent runs.
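A skeletal sketch of the phase sequencing and state persistence just described. The state-file format and skill signatures are assumptions for illustration; the real framework is a shell orchestrator delegating to AI skills:

```python
# Six-phase orchestration sketch: run the phases in order, end the iteration
# early on error, and persist counters so restarts resume where they left off.
import json
import os

PHASES = ["backlog", "ideate", "triage", "execute", "polish", "regress"]

def run_iteration(skills, state_file="loop-state.json"):
    """skills: mapping phase name -> callable(state). Returns the updated state."""
    if os.path.exists(state_file):
        with open(state_file) as f:
            state = json.load(f)       # counters survive process restarts
    else:
        state = {"iteration": 0}
    state["iteration"] += 1
    for phase in PHASES:
        try:
            skills[phase](state)       # each skill mutates the shared state
        except Exception as exc:
            # Error recovery: record the failure and end the iteration early.
            state.setdefault("failures", []).append([phase, str(exc)])
            break
    with open(state_file, "w") as f:
        json.dump(state, f)
    return state
```

In the real system each "skill" is an LLM-driven workflow with its own timeout; the point of the sketch is only the ordering, the early exit, and the fact that iteration state lives on disk rather than in the process.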
8.2 Skills as Prompts-as-Code

Each phase is implemented as a skill — a structured markdown file encoding a repeatable workflow in natural language. Skills are version-controlled, portable across LLM providers, and improved in natural language without deployment. A Kitchen Loop-compatible system needs standardized skill interfaces:

Skill | Input | Output | Must Guarantee
backlog | Ticket state | Promoted + new scenario tickets | No duplicate tickets created
ideate | Scenario criteria | Experience report + scenario | Runs against real test env
triage | Experience report | Prioritized tickets w/ file pointers | Deduplicates against existing
execute | Ranked ticket list | Feature branch + PR per ticket | Every PR includes tests
regress | Full codebase state | Pass/fail report + drift metrics | Runs regression oracle

Teams adopting the Kitchen Loop implement these interfaces for their domain. The orchestration — phase sequencing, timeout management, error recovery — is domain-independent.

8.3 Execution Modes

Mode | Purpose | When to Use

Framework modes (domain-independent):
strategy (default) | Full cycle: ideate + execute + regress | Standard operation
user-only | Rapid ideation loops to fill the backlog | Backlog empty, need discovery
dev-only | Implementation-focused loops to drain the backlog | Backlog full, need throughput
drain (auto) | Polish-only with increased PR throughput | Auto-triggered by backpressure
regress-quick | Shortened regression: one scenario per platform | Every iteration (fast feedback)
ui | Browser-driven user flows with visual checkpoint verification | Web application testing

Domain-specific modes (DeFi strategy framework example):
backtest | Backtesting pipeline instead of on-chain execution | Backtesting stress-testing
exploration | Explore new protocol/chain coverage gaps | Coverage discovery

8.4 The Shared Memory: Ticket Management as PM Layer

The central nervous system is the project management tool — used not as a lightweight task tracker but as the shared, durable memory layer between the AI loop and human contributors. Both AI and human operators write to the same backlog. The Execute phase cannot tell the difference — it queries "top N urgent unblocked tickets" and works through them. This creates parity between AI and human judgment: the backlog is the truth, and neither source is privileged.

Field | Convention | Why
Labels | bug, feature, improvement, exploration | Picks bugs first, then balances
Priority | critical, high, medium, low | Respects human priority, no override
State | Backlog → Todo → In Progress → In Review → Done | PRs auto-link to tickets via title
Dependencies | blocks / blockedBy | Skips blocked tickets automatically

8.5 The Bar: Production-Readiness Standard

The Kitchen Loop applies "The Bar" — a project-specific quality standard defined in a customizable quality.bar_file. The framework ships a generic template covering code quality, PR standards, safety, and documentation; each deployment customizes it for its domain. This approach is supported by Abtahi and Azim (2025), who find that categorizing quality issues by type achieves 100% resolution vs. 71.6% for bulk processing [5] — the Kitchen Loop's phase-based architecture naturally decomposes enforcement into categories (functional correctness in Execute, style in Polish, regression safety in Regress).

For example, the DeFi strategy framework (Case Study A) instantiates The Bar as five domain-specific principles:

1. Production Ready — No TODOs, shortcuts, or "we'll fix later." Ship-quality on merge.
2. Zero Hardcoding — All configuration from resolvers and registries, never literals.
3. Million-Dollar Safe — Every external call handled, amounts rounded safely, no silent fallbacks.
4. Hedge-Fund Serious — Precise arithmetic everywhere, meaningful test assertions, conservation checks in all failure paths.
5.
UX First, Safety Always — Clean, declarative APIs; non-bypassable safety gates; secrets redacted from logs.

These principles are enforced through multi-model review tribunals — every PR is evaluated against them before merge. Other domains would define their own principles (e.g., HIPAA compliance for healthcare, WCAG conformance for web accessibility).

9 The Discussion Manager: Structured Multi-AI Deliberation

9.1 Beyond Review Tribunals

Section 6.7 introduced multi-model review tribunals — three independent AI reviewers evaluating the same artifact in parallel. The Discussion Manager extends this concept into a full structured deliberation system: a multi-round, multi-model debate framework where heterogeneous AI agents argue substantive positions, challenge each other's reasoning, and converge toward actionable decisions through an impartial moderation layer.

Where review tribunals answer "is this code correct?", the Discussion Manager answers harder questions: "should we build this at all?", "which of three competing architectures best serves the specification surface?", and "what are we not seeing?" These are the judgment-intensive decisions that the Kitchen Loop encounters in its Ideate and Triage phases — decisions where a single model's blind spots can compound into wasted iterations.

The Discussion Manager operationalizes what Cai et al. (2025) identify as the "Debate-Based Cooperation" pattern — adversarial argumentation to surface errors — which appears in only 4 of 94 surveyed multi-agent systems (4.3% adoption) [8]. The low adoption rate, despite theoretical value for error discovery, suggests this is an underexplored design space. The Kitchen Loop's production implementation, with its anti-sycophancy safeguards and empirical validation across 23 discussions, represents one of the first operational deployments of this pattern at scale.
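Before the architecture details, the overall shape of a moderated debate can be sketched. Here debaters are plain callables standing in for model calls, and the convergence check is a naive exact-match stand-in for the real protocol:

```python
# Moderated multi-round debate sketch. Each debater sees only the prompt and
# the prior transcript (the information firewall); the moderator only sequences
# rounds and checks convergence, never injecting opinions.

def moderate(debaters, prompt, max_rounds=3):
    """debaters: mapping name -> callable(prompt, transcript) -> position string."""
    transcript = []
    for rnd in range(max_rounds):
        positions = {}
        for name, debater in debaters.items():
            # Firewall: debaters never see the moderator's notes or reasoning.
            positions[name] = debater(prompt, list(transcript))
        transcript.append(positions)
        # Naive convergence: all debaters state the same position.
        if len(set(positions.values())) == 1:
            return {"converged": True, "rounds": rnd + 1, "transcript": transcript}
    return {"converged": False, "rounds": max_rounds, "transcript": transcript}

debaters = {
    "gemini": lambda p, t: "monolith" if not t else "split-service",
    "codex":  lambda p, t: "split-service",
    "claude": lambda p, t: "split-service",
}
result = moderate(debaters, "Should the backtester be a separate service?")
print(result["converged"], result["rounds"])  # True 2
```

The cap on rounds mirrors the finding cited below that 2–3 rounds preserve productive disagreement; in production, convergence would be judged semantically rather than by string equality.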
9.2 Architecture

The Discussion Manager uses a centralized moderator architecture with an information firewall between moderator and debaters:

(Figure: the Moderator (Claude) is impartial — it manages rounds, checks convergence, and writes the synthesis — and sits behind an information firewall from Debater A (Gemini), Debater B (Codex), and Debater C (an isolated Claude subagent instance).)

Key design decisions:

• Heterogeneous models: Three different model families (Gemini, Codex/GPT, Claude) ensure genuine perspective diversity. Each model brings different training data, different reasoning patterns, and different failure modes. Homogeneous debate (three instances of the same model) rapidly converges to groupthink.
• Centralized moderation: The moderator orchestrates turn order, checks convergence, and writes the synthesis — but never injects opinions into the debate itself. Centralized judge structures outperform peer-review structures for debate quality (Yao et al., 2025).
• Claude subagent isolation: When the moderator is Claude, the Claude debater runs as an isolated subagent with no access to the moderator's reasoning, notes, or context. The subagent sees only the debate prompt. This is the information firewall that prevents the moderator's framing from contaminating the debate.
• Code grounding: For technical discussions, debaters receive codebase context (specific files, architecture documents, or topic-aware summaries). Gemini gets autonomous file-search access; other debaters get relevant content injected into their prompts. This anchors debate in verifiable facts rather than rhetorical polish.

9.3 Empirical Evidence: 23 Production Discussions

The Discussion Manager's design was shaped by Yao et al.'s (2025) research on sycophancy in multi-agent debate [14], which identifies three failure modes: sycophancy, disagreement collapse, and negative agreement.
Key design-relevant findings: heterogeneous models outperform homogeneous; centralized judges outperform peer review; and capping debate at 2–3 rounds preserves productive disagreement.

We evaluated the system against these criteria using a corpus of 23 production discussions conducted across three codebases over a 4-week period. Two independent meta-analyses — one by Claude (moderator) and one by Codex (independent evaluator) — mitigate self-assessment bias.

Dimension | Value
Total discussions | 23 unique discussions across 3 codebases
Discussion types | Architecture/PRDs (10), Code reviews (7), Strategy (5), Exploration (1)
Typical participants | 3 models (Gemini, Codex, Claude); 1 discussion used 2 models
Typical rounds | 3 (range: 1–10)
Convergence | 2/23 formally converged; 21/23 hit max rounds
Conclusions | 100% "Agreed" — zero "Agreed to Disagree"

Evaluation against paper metrics:

Criterion | Measured | Paper's Ideal | Assessment
Disagreement Collapse Rate (DCR) | ~0% (all converge) | Low but non-zero | Red flag — universal convergence implausible
Sycophancy Score (SS) | 35–50 | <20 | Moderate-high — mitigated by code grounding
Negative Agreement Rate (NAR) | ~25% | <10% | Moderate — Claude concedes readily
Evidence Quality | 85/100 | High | Strong — code grounding is best feature
Perspective Diversity | 60/100 | High heterogeneity | Moderate — good diversity, rigid roles
Round Efficiency | ~50% productive | >80% | Low — 2–3 wasted rounds per discussion

Findings from both meta-analyses:

Strengths. Code grounding anchors every discussion in specific line numbers and verifiable API behavior, strongly mitigating the "rhetorical polish without substance" failure mode. Actionability is consistently high (phased remediation plans, PRDs, file-level recommendations), and raw transcripts confirm substantive disagreement, corrections, and explicit position changes.

Weaknesses. Disagreements resolve by one party conceding rather than genuine synthesis (premature convergence). The first speaker sets the problem's ontology, with later speakers refining rather than challenging it (sequential anchoring). Convergence tracking is cosmetic — counting self-reported DISAGREEMENTS items rather than verifying semantic resolution. No discussion concluded with "do not build this," revealing a structural bias toward action.

Meta-analyst disagreement. The independent Codex meta-analysis challenged Claude's "100% ratification" claim (one discussion lacked ratifications in the JSON) and "monotonic convergence" pattern (some discussions showed non-monotonic trajectories).

Mitigations implemented. Based on these findings and Yao et al.'s recommendations, the Discussion Manager now implements: blind opening rounds (eliminating first-speaker anchoring), a structured issue register (replacing cosmetic convergence counting), explicit proposal ratification, a kill gate requiring a "do not build this" argument before proceeding, and full transcript preservation for reproducible meta-analysis.

9.4 Implementation and Open Challenges

The Discussion Manager is implemented as a Python orchestrator (discuss.py, open-sourced in the companion repository) that manages conversation creation, turn execution, convergence checking, ratification, and report generation. It is designed for use at three integration points: (1) the Ideate phase for architectural decisions, (2) code review in audit mode, and (3) a retrospective comparing implementation against agreed designs. In the current open-source release, the Discussion Manager is invoked manually; automated orchestrator integration is planned.
Despite the epistemic improvements, several challenges remain:

• Cost: Three models for 3–5 rounds at ~400 words per turn consumes 15,000–25,000 tokens per model per discussion. With the ~50% round efficiency observed in the corpus, approximately 40% of token spend is on convergence dynamics rather than novel insight. The cost implications for scaling are discussed in Section 11.4.
• Role calcification: Even with heterogeneous models, each model tends toward a fixed role (Gemini as breadth analyst, Codex as bug hunter, Claude as compromise broker). Random role assignment per discussion can help, but the underlying model tendencies persist.
• Evaluation circularity: When the moderator is Claude and one meta-analyst is also Claude, self-assessment bias is difficult to fully eliminate. The independent Codex meta-analysis mitigates this but does not resolve it entirely.
• Scaling beyond three: The current protocol is optimized for three debaters. Scaling to more participants increases round duration and convergence complexity without proportional improvement in perspective diversity.

These are active research questions. The framework's modular design — separating orchestration, moderation, and debater execution — allows incremental improvement without architectural overhaul.

10 Case Study A: DeFi Strategy Framework

10.1 The Product and Its Specification Surface

The first validation target is a production DeFi strategy framework built by Almanak (https://almanak.co), the Almanak SDK for DeFi Quants — a public, open-source repository.² A quant writes a strategy class, declares high-level intents (swap, LP open/close, borrow, supply), and the framework compiles those intents to on-chain transactions, executes them, parses receipts, and updates state.

Figure 3: The Almanak SDK repository: a production DeFi strategy framework supporting 14 chains and 30+ protocol connectors.
The specification surface:

Dimension | Count
Supported chains | 14 (13 EVM + Solana)
Protocol connectors | 30+ (13 core + Aave, Compound, GMX, Morpho, Lido, ...)
Intent types | 21 (Swap, LP Open/Close, Borrow, Supply, Perp, Stake, ...)
Cross-combinations | ~1,000 (chain × protocol × intent)

10.2 The Regression Oracle: Fork Execution

The regression oracle is a suite of demo strategies executed on Anvil (a local EVM chain fork). Each strategy runs against a fork of the target chain, compiles intents to transactions, executes them, and verifies on-chain state changes. A PASS requires at least one on-chain transaction with verified balance deltas. A PASS(HOLD) is valid (market conditions didn't trigger a trade). A FAIL is always a regression until proven environmental.

The 4-layer verification pattern applied:

Layer | Implementation
Compilation | compiler.compile(intent) → assert status SUCCESS
Execution | orchestrator.execute(bundle) → assert success on-chain
Receipt Parsing | Protocol-specific parser extracts amounts, position IDs
State Deltas | get_token_balance() BEFORE and AFTER, assert exact expected changes

² https://github.com/almanak-co/sdk

10.3 Results

Metric | Value
Loop iterations completed | 122+
Merged pull requests | 728+
Unique tickets resolved | 350+
Lines of code added | ~250,000
Unit tests (start) | ~6,400
Unit tests (current) | 10,913
Demo strategies | 62 (up from 13)
Incubating strategies | 183
Chains covered | 13 EVM + Solana
Protocols exercised | 30+ (Uniswap V3/V4, Aave, Morpho, Curve, Pendle, ...)
Intent types exercised | All 21 (SWAP, LP_OPEN, LP_CLOSE, BORROW, SUPPLY, ...)
Consecutive zero-bug iters | 16 (iterations 55–70)
Regressions (loop-merged) | 0

These results span 122+ iterations over 5 weeks (Feb 18 – Mar 23, 2026).

10.4 Phase Progression

Phase 1: Bug Discovery (Iterations 1–23).
Dominated by "obviously missing" bugs — table-stakes features every user would hit immediately:

Representative Finding | Impact | Iter
API method documented but not exposed to users | AttributeError on first call | 12
Gas price cap correct for one chain, blocks all txs | All strategies on one chain fail silently | 18
Protocol router config missing for entire protocol | Blocks swap compilation everywhere | 50
Native token symbol missing for two chains | Blocks ALL strategies on those chains | 51

Phase 2: Coverage Expansion (Iterations 24–54). Focus shifted to new protocol connectors, new chains, and multi-protocol composability. Notable: first lending protocol on a new chain (37 unit tests); discovery of a router interface V1 vs V2 mismatch causing silent reverts; first multi-protocol strategy (11 transactions across 4 intent types, zero bugs); first indicator-driven LP strategy validating the market data → position sizing pipeline.

Phase 3: Maturity (Iterations 55–71). The system reached maturity: 16 consecutive zero-bug ideation phases. Bug discovery shifted from "obviously missing" to configuration gaps and infrastructure reliability issues. New capabilities shipped autonomously: a direct-action CLI, support for a new blockchain, advanced demo strategies.

Phase 4: Deep Verification (Iterations 72–89).
The loop transitioned from coverage breadth to verification depth, finding production-severity bugs that traditional testing would miss:

Finding | Severity | Impact | Iter
Pendle PT sell uses 1:1 exchange rate, but PT trades at ~19% discount | HIGH | Every PT sell at normal slippage (0.5–10%) would revert with INSUFFICIENT_TOKEN_OUT | 89
BTC.b completely missing from Avalanche token resolver | HIGH | Any BTC.b strategy on Avalanche fails at compilation | 83
Uniswap V4 adapter silently falls back to 18 decimals | MEDIUM | Incorrect balance calculations for non-18-decimal tokens | 80–81

A single 7-iteration overnight batch (iterations 83–89) produced 18 PRs, 370+ new tests, and 5 bug discoveries — demonstrating sustained throughput even at high iteration counts. Independent multi-model code review (Claude pr-auditor + Codex/GPT-5.4) rated the merged code as STRONG across all quality principles.

Regression stability under load: Despite 20+ PRs merged between checkpoints, the demo strategy suite maintained a 100% pass rate across all recent iterations. Quick regression tests 4 chains in under 2 minutes.

Phase 5: Operational Maturity (Iterations 90–122). The loop entered a sustained production rhythm with improved strategy diversity, predictable throughput, and a stabilizing PR backpressure cycle:

• Cross-chain validation at scale: First BSC swap (iter 92), first Sonic execution (iter 100), first Polygon multi-protocol composition (iter 94). 13 EVM chains + Solana now covered.
• Safety-critical discoveries: Fabricated Uniswap V4 addresses (P0, iter 93), wrong Ethena unstake selector (iter 99), CryptoSwap zero-slippage protection (iter 95), price impact guard for zero-liquidity pools (iter 122). These are deep bugs that traditional testing misses.
• Infrastructure healing cycle matures: The loop diagnosed and fixed its own PR merger timeout (iter 101), stale loop-state detection (iter 107+), and demo regression blind spots (iter 74). Each infrastructure crisis resolved faster than the previous one.

• Zero-bug iterations become common: Iterations 94, 108, 115, 116, and 121 all passed with zero new bugs, validating connector portability across chains.

A single 3-iteration overnight batch (iterations 120–122) produced 9 PRs, 12 merges, ~249 tests, and a systemic price impact guard — demonstrating sustained throughput at high iteration counts.

10.5 The Self-Healing Property

The loop discovered and fixed its own infrastructure bugs through its standard process — no human diagnosis required:

• Apple Silicon memory bug: PR merge automation used the wrong memory page size (4096 instead of 16384), causing available memory to read 4x too low. This blocked merging for 5 consecutive iterations. The loop-review skill identified the pattern; Execute fixed it; Regress confirmed the fix.

• Codex feasibility checker: The external AI cooperation model (Section 12.3) matured through the same loop cycle — strict response contracts, timeout handling, and salvage paths were all refined by the loop itself.

• Ideate persistence bug (iter 13–19): Six iterations of lost deliverables (reports not committed to git) before the loop identified the pattern, filed a ticket, and shipped a fix that decoupled the commit/push lifecycle. Verified stable from iter 19 onward.

• PR Manager death spiral (iter 71–73): Permanent needs-attention labels caused a 62-hour deadlock with zero merges. The loop review diagnosed the root cause; subsequent iterations added TTL-based label recovery and a circuit breaker. Full recovery by iter 74 (5/5 PRs merged).
• Demo regression blind spot (iter 53–74): Twenty iterations of blind demo testing because ALCHEMY_API_KEY was missing in worktree environments. The loop detected the pattern, shipped a .env copy to worktrees with a pre-flight key check, and confirmed the fix in iter 75.

This property — the loop improving its own tooling through the same process it uses to improve the product — is further demonstrated in Case Study B (Section 9) and elaborated in Section 13.

11 Case Study B: Signal Intelligence Platform

11.1 The Product and Its Specification Surface

The second validation target is Edge, a DeFi signal intelligence platform built by Almanak (https://almanak.co), available at https://app.almanak.co. Edge comprises 46+ detection agents that scan on-chain and off-chain data sources, produce structured signals, and feed a synthesis pipeline that correlates signals into actionable intelligence. The specification surface:

| Dimension | Count |
|---|---|
| Detection agents | 46+ |
| Signal types | 77 (all verified) |
| Verifier families | 22 |
| Cross-combinations | ~1,078 (agent × signal type) |

11.2 The Regression Oracle: Quality Gates + Canaries

Signals are ephemeral and non-deterministic — they cannot be “replayed” like strategy executions.
The regression oracle combines deterministic quality gates with anti-signal canaries.

4-layer quality gate:

| Layer | Name | Method | What It Catches |
|---|---|---|---|
| L1 | Structural | Schema validation | Format errors, invalid enums, missing metadata |
| L2 | Factual | API cross-reference (DefiLlama, Snapshot, CoinGecko, GeckoTerminal, on-chain RPCs) | False claims, wrong numbers |
| L3 | Temporal | Dedup + timestamp analysis | Stale signals, zombie duplicates |
| L4 | Cognitive | LLM tribunal (3 models) | Bad reasoning, wrong conclusions from correct data |

4-tier anti-signal canaries (24 total + 3 API degradation):

| Tier | Canaries | Catch Rate (iter 1) | Catch Rate (iter 163) |
|---|---|---|---|
| Tier 1 (Obviously Bad) | 6 | 100% | 100% (zero escapes across 163 iters) |
| Tier 2 (Shadow) | 5 | 33% | 100% (fixed iter 124) |
| Tier 3 (Adversarial) | 5 | 67% | 100% |
| Tier 4 (Mixed True/False) | 5 | N/A (added ~iter 100) | 100% (fixed iter 124) |
| API Degradation | 3 | N/A (added later) | 100% (3/3) |

11.3 Results

| Metric | Value |
|---|---|
| Loop iterations completed | 163 |
| Tickets created | 200+ |
| Pull requests merged | 366 |
| Signal type coverage | 77/77 (100%) — up from 33/36 at iter 1 |
| Verifier coverage | 100% (22 verifier families) — every signal type has a dedicated L2 factual verifier |
| Non-functional agents identified | 18 (managed via exclusion list) |
| L1/L2/L3 pass rates | 100% / 100% / 100% (up from 85% / 76% / 85% at iter 1) |
| L2 unverifiable rate | 0% (down from 9–31% at iter 1) |
| Canary health | 100% all 4 tiers (zero Tier 1 escapes across 163 iterations) |
| Test cases | 2,171 across 152 test files |
| Average iteration speed | ~45 min avg (25–35 min signal-focused, 80–230 min full, 15–20 min circuit-breaker skip) |
| PR merge rate | 97% (366/377) |
| Median time to merge | 1.1 hours |
| Regressions from loop-merged code | 0 |

These results span 163 iterations over 2.5 weeks.
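The layered gate structure can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the signal fields and check function are hypothetical, and the real L2–L4 layers call external APIs and an LLM tribunal rather than local predicates. The key design choice shown is ordering: cheap deterministic checks run first, so expensive layers only see signals that already pass.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    agent: str
    signal_type: str
    payload: dict
    timestamp: float

# Hypothetical L1 check: structural validation of required payload fields.
def l1_structural(sig: Signal) -> tuple[bool, str]:
    required = {"asset", "value"}
    missing = required - sig.payload.keys()
    return (not missing, f"missing fields: {sorted(missing)}" if missing else "ok")

def run_gates(sig: Signal, layers) -> dict:
    """Run layers in order; stop at the first failure so cheap checks shield expensive ones."""
    for name, check in layers:
        passed, reason = check(sig)
        if not passed:
            return {"verdict": "FAIL", "layer": name, "reason": reason}
    return {"verdict": "PASS", "layer": None, "reason": "all layers passed"}

layers = [("L1-structural", l1_structural)]  # L2-L4 checks would be appended here
good = Signal("LPAgent", "pool_apr_spike", {"asset": "WETH", "value": 12.4}, 0.0)
bad = Signal("LPAgent", "pool_apr_spike", {"asset": "WETH"}, 0.0)
print(run_gates(good, layers)["verdict"])  # PASS
print(run_gates(bad, layers)["verdict"])   # FAIL (caught at L1)
```

Canaries plug into the same interface: a known-bad signal is injected and the oracle verifies that some layer returns FAIL for it.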
11.4 Quality Gate Maturity Trajectory

The most compelling evidence of the loop’s self-improving quality infrastructure:

| Metric | Iter 1–10 | Iter 50–60 | Iter 128–135 | Iter 163 |
|---|---|---|---|---|
| L1 pass rate | 85–100% | 100% | 100% | 100% |
| L2 pass rate | 76–91% | 90–97% | 100% | 100% |
| L2 unverifiable | 9–31% | 3–10% | 0% | 0% |
| L3 pass rate | 85–100% | 92–100% | 100% | 100% |
| Verifier coverage | ~92% (33/36) | ~100% | 100% (72/72) | 100% (77/77) |
| PR merge rate | ~0% | ~85% | 100% | 97% |
| Open PR backpressure | 0–3 | 3–10 | 0 | 0–3 |

Key insight: In this deployment, the quality gates improved monotonically. No regression in pass rates was detected by the oracle after any of the 366 merged PRs across 163 iterations. This suggests that autonomous code changes can maintain (and improve) quality invariants without continuous human oversight, given sufficient verification infrastructure.

11.5 Phase Progression

The Edge deployment traversed four distinct phases, mirroring the SDK’s progression but at higher iteration speed:

Phase 1: Bootstrap (Iterations 1–12, Days 1–2). Detection engine without self-correction capability. 12 PRs created, 0 merged — the loop could find bugs but couldn’t fix them. PR manager and polish phase activated; first merges in iter 11. Quality trajectory: L1 85 → 100%, L2 69 → 96%, L3 85 → 100%.

Phase 2: Steady-State Improvement (Iterations 13–87, Days 2–6). Productive self-improvement with monotonically improving quality. 200+ PRs merged. The loop identified and fixed 7 infrastructure bugs through its own Ideate → Triage → Execute cycle: chainValid permanently fixed, drift persistence migrated to file-backed storage, verifier families expanded, canary Tier 2 fixed, circuit breaker added, non-functional agent exclusion list created. L1/L2/L3 all reached and held 100%.

Phase 3: Starvation (Iterations 88–127, Days 6–9). All remaining tickets required SDK (Python) changes unreachable from the Edge (TypeScript) codebase.
The circuit breaker fired on 10+ consecutive iterations. Two of three reviewers recommended STOP THE LOOP. This was correct behavior: the loop did not degrade or invent work — it recognized exhaustion and recommended stopping. Quality gates maintained 100% throughout.

Phase 4: Recovery and Expansion (Iterations 128–163, Days 9–17). New work became available after SDK blockers resolved. 100+ PRs merged. 20+ new agents tested, LPOpportunityAgent fixed, ArbitrageAgent activated, TechnicalAgent added. Canaries reached 100% across all 4 tiers by iteration 124 and held for 40+ subsequent iterations.

11.6 Key Findings (Early Iterations)

• A critical cooldown bypass needed — 33+ agents unreachable due to dedup, blocking 75% of the fleet from quality testing

• A production agent using Math.random() simulation instead of real on-chain data — causing temporal failures with stale signals from tokens launched over a year ago

• A “dead” agent registered in config but not in the agent system — producing zero signals silently

• The loop’s own quality gate conflating “unverifiable” with “failed” — a false positive the loop discovered and fixed through its own triage process

11.7 The Self-Correction Flywheel (Iterations 128–135)

At maturity, the loop’s most powerful property is iterative self-correction — it doesn’t just find bugs, it fixes them, verifies the fixes, and catches when fixes are incomplete.

The OnChainSentinel Fix Chain:

| Iter | What Happened |
|---|---|
| 130 | Ideate discovers OnChainSentinel hangs the pipeline via WebSocket constructor blocking. Creates ticket. |
| 130 | Execute ships PR: 5-second timeout guard around getNetwork() + adds agent to exclusion list. |
| 131 | Ideate retests. Discovers the fix was incomplete — the WebSocketProvider constructor itself blocks before getNetwork() is ever called. Reopens ticket. |
| 131 | Execute ships PR: wraps entire WS provider init in a single timeout with provider cleanup on expiry. Adds destroy-on-timeout test. |
| 131 | Regress confirms: build passes, 27 signals produced, L1/L2/L3 100%, no regressions. |

This is a 3-iteration cycle from discovery → incomplete fix → complete fix → regression verification, with zero human intervention. The loop caught its own incomplete fix and iterated until the root cause was fully addressed.

The ArbitrageAgent Data Pipeline Resurrection: The ArbitrageAgent went through a 3-PR fix chain across multiple iterations:

1. PR #278 — Replaced dead The Graph hosted subgraph URLs with the GeckoTerminal API (The Graph deprecated its free hosted service — a real external platform change the loop detected)

2. PR #282 — Fixed incorrect base_token_price_usd vs quote_token_price_usd field usage that produced nonsensical 3,499× spreads

3. PR #286 — Lowered ARBITRAGE_PROFIT_MIN_PERCENT from 1% to 0.1% after observing 2 consecutive zero-signal runs (1% required a 1.6% gross spread on Ethereum — far above real market conditions of 0.1–0.5%)

Each iteration peeled back one more layer of the problem. The loop doesn’t give up after the first fix — it keeps testing until the agent actually produces signals.
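The retest-and-reopen behavior behind both fix chains can be sketched as a small state machine. The ticket states, hooks, and the single-ticket framing below are illustrative names of ours, not the production implementation; the point is only that a failed retest returns the ticket to OPEN, so an incomplete fix gets another attempt instead of being declared done.

```python
from enum import Enum

class TicketState(Enum):
    OPEN = "open"
    FIX_SHIPPED = "fix_shipped"
    VERIFIED = "verified"

class Ticket:
    def __init__(self, title: str):
        self.title = title
        self.state = TicketState.OPEN
        self.fix_attempts = 0

def iterate(ticket: Ticket, ship_fix, retest) -> None:
    """One loop iteration over a single ticket: ship a fix if it is open,
    otherwise retest the last fix. A failed retest reopens the ticket."""
    if ticket.state is TicketState.OPEN:
        ship_fix(ticket)
        ticket.fix_attempts += 1
        ticket.state = TicketState.FIX_SHIPPED
    elif ticket.state is TicketState.FIX_SHIPPED:
        ticket.state = TicketState.VERIFIED if retest(ticket) else TicketState.OPEN

# OnChainSentinel-style chain: first fix incomplete, second fix verifies.
t = Ticket("OnChainSentinel hangs pipeline")
results = iter([False, True])  # retest outcomes across successive iterations
for _ in range(4):
    iterate(t, ship_fix=lambda tk: None, retest=lambda tk: next(results))
assert t.state is TicketState.VERIFIED and t.fix_attempts == 2
```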
11.8 Triage Intelligence: The Classification System

The triage phase demonstrates sophisticated classification that goes beyond a binary “working/broken”:

| Classification | Example | Response |
|---|---|---|
| Non-functional (confirmed) | RWARiskAgent: hardcoded stub data (backingRatio=1.0) | Add to exclusion list immediately |
| Non-functional (suspected) | LPRotationAgent: 0 signals, 1st observation | Monitor — needs 2+ confirmations before excluding |
| Cold-start | DevActivityAgent: needs warm baseline from prior runs | Don’t exclude — works in scheduled mode |
| Timing-dependent | BribeTrackingAgent: no active vote rounds | Don’t exclude — market conditions, not a bug |
| Deprecated | YieldAgent: converted to return-empty stub | Add to exclusion list |

Self-correcting classification: In iteration 131, triage classified DevActivityAgent as “non-functional.” In iteration 132, it corrected this diagnosis — the agent has a working GitHub API integration and is actually cold-start, not broken. The ticket was updated, and the agent was NOT added to the exclusion list. Self-correction extends beyond code fixes to the classification system itself.
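The confirmation rule in the table, exclude only after repeated observations, can be sketched as a small ledger. The class name, threshold constant, and reset-on-healthy-run behavior are illustrative assumptions of ours; the production triage also weighs cold-start and timing-dependent evidence before excluding.

```python
from collections import defaultdict

CONFIRMATIONS_REQUIRED = 2  # assumed threshold, per the "2+ confirmations" rule above

class TriageLedger:
    """Tracks zero-signal observations per agent and promotes suspects to the exclusion list."""
    def __init__(self):
        self.zero_signal_counts = defaultdict(int)
        self.exclusion_list = set()

    def observe(self, agent: str, produced_signals: bool) -> str:
        if produced_signals:
            self.zero_signal_counts[agent] = 0  # a healthy run resets suspicion
            return "ok"
        self.zero_signal_counts[agent] += 1
        if self.zero_signal_counts[agent] >= CONFIRMATIONS_REQUIRED:
            self.exclusion_list.add(agent)
            return "non-functional (confirmed)"
        return "non-functional (suspected) - monitor"

ledger = TriageLedger()
print(ledger.observe("LPRotationAgent", produced_signals=False))   # suspected, monitor
print(ledger.observe("DevActivityAgent", produced_signals=False))  # suspected, monitor
print(ledger.observe("DevActivityAgent", produced_signals=True))   # cold-start resolved
print(ledger.observe("LPRotationAgent", produced_signals=False))   # confirmed, excluded
```

The DevActivityAgent trace mirrors the self-correcting classification above: one healthy observation clears the suspicion before exclusion triggers.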
11.9 Infrastructure Self-Healing (Iterations 128–135)

The loop identified and fixed 6 problems in its own infrastructure during a single 8-iteration batch:

| Finding | Fix | Impact |
|---|---|---|
| RESULTS_DIR undefined in pr-manager.sh | Grep prep log instead | Gate rejection memory completely non-functional |
| gh --jq --arg invalid syntax | Pipe to jq --arg | Shell command silently failing |
| Missing iter 127 row in loop-state.md | Backfill + verification step | History gap invisible to drift detection |
| synthesisPipelineDead false positive | localDev flag for drift suppression | 9 iterations of wasted CRITICAL escalation |
| Deployment drift (local main stale) | Pre-flight detection | All merged fixes untested for 5+ iterations |
| NON_FUNCTIONAL_AGENTS set out of sync | Sync with confirmed findings + expose via API | Ideate kept selecting dead agents |

As in Case Study A (Section 8.4), the loop improves at two levels: the product and its own infrastructure.

11.10 Speed Difference: Two Domains, Same Loop

| Phase | Strategy Framework | Signal Platform | Speedup |
|---|---|---|---|
| Ideate | 10–15 min | 1–2 min | ~8x |
| Triage | 5–8 min | <1 min | ~8x |
| Execute | 25–60 min | 10–30 min | ~2x |
| Regress | 40–150 min | 2–3 min | ~30x |
| Total | 80–230 min | ~15–35 min* | ~5–15x |

*Signal-only iterations (ideate + triage) complete in ~5 min; full 6-phase iterations take 80–230 min. The ~15–35 min figure reflects signal-focused iterations including execute and polish. Circuit-breaker skip iterations complete in 15–20 min; the average across all 163 iterations was ~45 min.

The signal platform iterates faster because signal quality verification is API-bound, not execution-bound. This speed advantage compounds over time: at 163 iterations, the signal platform has performed more quality gate evaluations than most human QA teams accomplish in a year.
12 Cost of Operation

12.1 Tooling Spend

A common concern: “How much does it cost to run an AI agent continuously?” The answer is surprisingly modest: the Kitchen Loop runs entirely on flat-rate subscriptions, not metered API calls.

| Cost Component | Monthly Cost | Notes |
|---|---|---|
| Claude Code Max (20x plan) | $200 | Primary agent for all 6 phases — unlimited usage within plan |
| Codex (subscription) | $20 | External feasibility checker + tribunal reviewer |
| Gemini (subscription) | $20 | Tribunal reviewer for multi-model audits |
| CodeRabbit | ~$15 | Automated PR code review |
| Anvil / Foundry | $0 | Open-source, runs locally |
| External data APIs | ~$0–50 | CoinGecko, DefiLlama (free tiers sufficient) |
| CI compute | ~$50–100 | GitHub Actions minutes |
| Total | ~$305–405/month | |

This is the total cost for running the Kitchen Loop across both production systems simultaneously — 285+ iterations, 1,094+ merged PRs, 700+ tickets resolved. A senior engineer costs $12,000–25,000/month fully loaded; the Kitchen Loop’s monthly tooling cost is ~2% of a single engineer’s cost. Critically, this includes the quality infrastructure (unbeatable tests, UAT gates, regression oracles) that prevents the complexity-driven velocity erosion documented by He et al. [11] and Becker et al. [12].

| Metric | Kitchen Loop | Single Engineer | Ratio |
|---|---|---|---|
| Monthly cost | ~$350 | ~$15,000 | ~43x cheaper |
| PRs merged/month | 600+ | ~15–25 | ~30x more |
| Coverage scenarios tested/month | 150+ | ~10–20 | ~10x more |
| Cost per merged PR | ~$0.38 | $600–1,000 | ~1,800x cheaper |

Because cost is fixed regardless of iteration count, the marginal cost of additional coverage is effectively zero.

12.2 Test Suite Growth and Runtime

As the test suite grows, regression time grows.
This is a real scaling concern:

| Metric | Start (iter 1) | Current (iter 122) | Growth Rate |
|---|---|---|---|
| Unit tests | ~6,400 | 10,913 | +70% over 122 iterations |
| Full regression time | ~90 min | ~150 min | ~0.5 min/iteration growth |
| Quick regression time | ~25 min | ~40 min | ~0.12 min/iteration growth |
| Demo strategies | 13 | 62 | ~0.40/iteration |
| Incubating strategies | 0 | 183 | ~1.5/iteration |

Mitigation strategies already in use:

• --regress-quick mode: one scenario per platform (~40 min vs ~150 min)

• Full regression runs weekly, quick runs every iteration

• Parallel test execution for unit tests (pytest-xdist)

• Test pruning: deprecated/redundant strategies archived, not accumulated

Projected ceiling: At current growth rates, full regression would reach ~4 hours by iteration 200. This is manageable with parallelization (dynamic port allocation for multiple test environments), sharding (split the coverage matrix across N parallel loops), or sampling (statistical coverage sampling rather than exhaustive runs every iteration).

12.3 Business Outcomes

For readers focused on outcomes: 50+ features shipped, 200+ bugs found before users (including fund-safety issues), coverage matrix fill rate improved from ~5% to ~50% of 1,800 combinations, and the test suite grew +70%. Detailed breakdowns are in Sections 8.3 and 9.3.

13 Production Safety Record

13.1 Incidents and Ill Effects

A critical question for any autonomous system: has it caused harm?

| Concern | Record | Details |
|---|---|---|
| Production downtime | 0 incidents | Loop operates on isolated branches + test environments (Anvil forks). No direct production access. |
| Security incidents | 0 incidents | Loop has no access to production keys, wallets, or user data. Test wallets use Anvil defaults. |
| Data loss | 0 incidents | Git worktree isolation prevents branch contamination. All changes are reversible PRs. |
| Regressions shipped to main | 0 | Quality gates (multi-model review + regression oracle) catch issues before merge. |
| Cost overruns | 0 | Monthly tooling cost is bounded and predictable (~$305–405/month across both systems, ~$1.50/iteration). |

13.2 Why Zero Incidents?

The clean safety record follows from architectural isolation: no production access (test environments only), branch isolation (git worktrees on feature branches), automated pause gates (Section 5.5), and backpressure control. These are structural properties, not luck.

13.3 Known Limitations (Honest Assessment)

| Limitation | Impact | Mitigation |
|---|---|---|
| Anvil-only testing | Loop has not been exercised with real mainnet transactions | Mainnet mode exists but requires explicit operator approval + real gas |
| API key dependency | Some scenarios require external API keys not in standard config | Produces PASS(caveat) instead of clean PASS — a known quality ceiling |
| Single-threaded execution | Gateway port constraint limits to sequential strategy runs | Future: dynamic port allocation for parallel execution |
| PR merge automation fragility | The Polish phase has been the most failure-prone component | Each failure mode has been fixed through the self-improvement cycle |
| Cooldown phantom failures (signal platform) | Agents with cooldown periods inflate dead-agent counts | Distinguishing “cooldown active” from “agent broken” is tracked as an open ticket |

This section is intentionally honest. The loop is not infallible. But its failure modes are operational (merge automation bugs, API key management) rather than safety-critical (data loss, fund loss, production outages). The architectural isolation ensures that operational failures waste iteration time, not user trust.

13.4 Human-in-the-Loop Cost Constraints

The loop is autonomous end-to-end: it generates scenarios from the specification surface, fills its own backlog, triages findings into tickets, implements fixes, and auto-merges PRs after multi-model review and CI. No human is in the critical path during normal operation.
The human role reduces to (1) initial specification surface definition, (2) occasional strategic steering, and (3) supervisory intervention when the loop encounters failure modes it cannot self-correct. In practice, the most common intervention trigger was merge conflict resolution gone wrong: LLMs resolving git conflicts would sometimes silently undo work from previous commits or PRs, requiring a human to notice the regression and revert. GitHub infrastructure issues (API rate limits, webhook failures, stale branch state) were the second most frequent cause. Section 12.2 frames the steady-state human cost as ~30–60 minutes per week once calibrated — but this assumes the loop is running smoothly. Intervention spikes during infrastructure instability.

The practical scaling constraint is environmental: test environment throughput (Anvil fork startup time, API rate limits), CI pipeline capacity, and the cost of multi-model review tokens per PR. Drain mode (Section 5.5) throttles output when PR backpressure exceeds a threshold, ensuring the loop does not outrun its own merge infrastructure.

The Discussion Manager adds a separate cost dimension. Three models debating for 3–5 rounds at ~400 words per turn consume 15,000–25,000 tokens per model per discussion. With the ~50% round efficiency observed in our 23-discussion corpus (Section 7.3), approximately 40% of deliberation token spend produces convergence dynamics rather than novel insight. At current subscription pricing this cost is absorbed into flat-rate plans, but metered API usage would make frequent deliberation expensive. A per-iteration compute cost breakdown beyond the aggregate $0.38/PR figure has not yet been instrumented — this is a gap we intend to address in future work.
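A back-of-envelope estimator shows where tokens go in such a debate. The conversion ratio (~4/3 tokens per word) and the assumption that each model re-reads the full transcript of prior rounds are ours, not instrumented measurements, and the sketch ignores system prompts and injected context, so it underestimates the true spend:

```python
def deliberation_tokens_per_model(models: int, rounds: int, words_per_turn: int = 400,
                                  tokens_per_word: float = 4 / 3) -> int:
    """Rough per-model token cost for one multi-round debate."""
    turn = round(words_per_turn * tokens_per_word)
    # Each round the model re-reads all turns from prior rounds, then writes one turn.
    reading = sum(models * r * turn for r in range(rounds))
    writing = rounds * turn
    return reading + writing

for r in (3, 4, 5):
    print(r, "rounds:", deliberation_tokens_per_model(models=3, rounds=r))
```

The quadratic reading term dominates: transcript re-reading, not writing, is what makes each additional debate round disproportionately expensive.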
These constraints do not invalidate the framework, but they bound its applicability: the Kitchen Loop is most efficient in domains where (a) the specification surface is well-defined upfront, (b) the regression oracle provides high-confidence automated verification, and (c) the human’s role can be genuinely asynchronous rather than synchronous with each iteration.

13.5 Open Problems

Our deployments expose four open problems that we believe warrant dedicated research:

• OP1: Oracle Transfer. The Kitchen Loop relies on a bespoke regression oracle per domain (chain fork execution for DeFi, quality-gate pipeline for signals). Automatic generation of regression oracles from natural-language specifications — eliminating the per-domain engineering cost — is an unsolved challenge.

• OP2: Specification Acquisition. Our method assumes an enumerable specification surface. For legacy codebases with implicit specifications, automating surface extraction from telemetry, documentation, and user behavior is a critical bottleneck for adoption.

• OP3: Multi-Objective Drift. Current drift metrics focus on functional correctness. Extending the framework to simultaneously monitor non-functional requirements (latency, security, fairness) without human intervention remains open. Our drift detection (Section 5.5) would need to compose multiple objective functions without one dominating the pause-gate signal.

• OP4: Sycophancy at Scale. The Discussion Manager mitigates sycophancy in 3-model debates (Section 7), but optimal model composition and debate protocols for larger, heterogeneous agent swarms are unknown. Our corpus of 23 discussions is too small to establish whether the observed SS < 20 threshold generalizes (cf. Yao et al., 2025 [14]).

13.6 When the Kitchen Loop Should NOT Be Used

The framework is not universally applicable.
It should not be applied when: (1) ground truth is weak or unobservable — without a reliable oracle, the verification layer provides false confidence; (2) success criteria are highly subjective (aesthetic quality, UX taste) — the oracle cannot arbitrate matters of judgment; (3) the specification surface is not enumerable, as in exploratory R&D where the goal is discovery rather than convergence; or (4) safety-critical domains lack robust external verification — the oracle’s bounded coverage (Section 4.1) is insufficient when failures have irreversible consequences.

14 The Human-AI Collaboration Model

14.1 Not Either/Or — Both

Chen et al. (2025) identify three automation points — human-only, copilot, and agent — finding agents achieve 60% task correctness vs. 25% for copilots, yet 60% of participants would not continue using agents due to a comprehension gap [13]. The Kitchen Loop operates at a fourth point: a fully autonomous loop with asynchronous human oversight, eliminating the idle-time and comprehension problems Chen et al. identify.
The Kitchen Loop is explicitly a complementary system, not a replacement:

| Concern | AI (Kitchen Loop) | Human |
|---|---|---|
| Coverage | Exhaustive — 1000x the scenarios a human would test | Strategic — focuses on what matters most given user context |
| Bug discovery | Bottom-up — finds what’s broken by trying everything | Top-down — knows what users are complaining about |
| Ticket writing | Precise, file-level, with reproduction steps | Context-rich, business-aware |
| Implementation | Fast, consistent, follows established patterns | Creative, architectural, can break patterns when needed |
| Code review | Multi-model parallel tribunals | Domain expertise, security intuition |
| Backlog curation | Deduplication, severity assessment, dependency graphs | Strategic priority, business value, external signals |

14.2 The Human’s Highest-Leverage Input

The human’s primary contribution is specification and backlog curation: understanding users, monitoring the competitive landscape, and converting those signals into tickets. In our deployments, this required ~30–60 minutes per week once the loop was calibrated — but this figure assumes a mature specification surface (see Section 11.4 for scaling constraints and the human bottleneck at high iteration speeds).

14.3 The External Feasibility Checker

The Kitchen Loop optionally uses an external AI (a different model from the loop’s primary agent) as a feasibility checker before committing to an idea. Empirical results (285+ combined iterations): ~78% PROCEED, ~15% REDIRECT, <5% REJECT + timeout. The low rejection rate suggests the loop’s scenario selection is well-calibrated. REDIRECT cases adjusted scope productively.

15 The Self-Improving Loop

15.1 Meta-Level Improvement

The loop-review skill audits loop behavior every N iterations and produces tickets for infrastructure improvements.
Sections 8.4 and 9.3 showed specific examples; the full catalog across both deployments:

• Platform-specific failures: Merge automation memory bug on Apple Silicon caused 5 consecutive stalls; loop review identified the pattern and the Execute phase fixed it.

• Resource contention: Process collisions when human and loop run concurrently; tool version drift breaking the review phase. Fixes added process tracking and version pinning.

• Retry and backpressure loops: PR Manager stuck retrying the same failing PR, wasting entire Polish phases. Fixed by skip-after-2-failures with needs-attention labeling. PR backlog growing unbounded during sustained runs prompted automatic drain mode.

• State management: Loop-state lost when worktree PRs weren’t merged promptly (fixed by decoupling state sync from the merge lifecycle). Backlog groomer promoting already-in-progress tickets (fixed by counting viable tickets only).

• Silent failures: Missing .env in worktrees causing strategy failures; loop-review reports silently lost when the output directory didn’t exist. Fixed by pre-flight checks and write verification guards.

• Budget management: Regress timeout consuming 100% of the iteration budget, prompting a quick-regression parameter.

The framework has also been applied to its own codebase (dogfooding). In its first iterations running against the KitchenLoop orchestrator and PR manager, the loop discovered and fixed multiple bugs — including race conditions in temporary file handling, missing agent-liveness detection, and incorrect timeout behavior on macOS. This validates that the self-improvement property extends to the meta level: the loop can improve the tool that runs the loop.

15.2 Pattern Consolidation

When the same pattern is confirmed across 2+ iterations, it is promoted from a session observation to a durable memory entry.
This creates institutional knowledge that persists across conversations and loop runs.

15.3 The Skill Layer

Starting from 5 basic phase skills, the validated deployments now have 30+ skills covering loop orchestration, quality gates, protocol integration, competitive intelligence, release management, and documentation maintenance. Each skill emerged from the loop’s own discovery of a repeatable workflow worth encoding.

Additionally, the loop developed a gate rejection memory system that prevents redundant LLM auditor calls — if a PR was rejected and no new commits were pushed, it is immediately marked NOT_MERGEABLE without wasting an audit cycle. This optimization, discovered and implemented by the loop itself, eliminated the zero-backpressure bottleneck observed in earlier iterations. Recent additions include /loop-review-meta — a macro-level analysis skill that aggregates findings across all loop review reports to surface systemic trends and strategic recommendations.

16 Generalization: Making the Kitchen Loop Portable

16.1 What Makes a Codebase Loop-Ready

The Kitchen Loop works on any codebase where:

1. The specification is enumerable: The product has a definable set of features, platforms, and action types that can be expressed as a coverage matrix.

2. Usage can be automated: “Using the product” can be performed by an LLM agent without physical interaction (APIs, CLIs, SDKs).

3. Quality is measurable by regression: There exists a test oracle that can answer “is the system still working?” in bounded time.
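The gate rejection memory described in Section 15.3 reduces to a cache keyed on a PR and its head commit. A minimal sketch with illustrative names (the production version persists this state; ours is in-memory only):

```python
class GateRejectionMemory:
    """Skip re-auditing a PR whose head commit was already rejected."""
    def __init__(self):
        self._rejected = {}  # pr_number -> head commit sha of the last rejected audit

    def should_audit(self, pr_number: int, head_sha: str) -> bool:
        # New commits invalidate the cached rejection; an identical head
        # means the PR can be marked NOT_MERGEABLE without an audit cycle.
        return self._rejected.get(pr_number) != head_sha

    def record_rejection(self, pr_number: int, head_sha: str) -> None:
        self._rejected[pr_number] = head_sha

mem = GateRejectionMemory()
assert mem.should_audit(101, "abc123")      # first look: run the auditor
mem.record_rejection(101, "abc123")
assert not mem.should_audit(101, "abc123")  # same head: skip, NOT_MERGEABLE
assert mem.should_audit(101, "def456")      # new commits pushed: audit again
```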
16.2 How to Adapt the Kitchen Loop to Your Domain

| Domain | Specification Surface | Regression Oracle | Example Adaptation |
|---|---|---|---|
| Web Application | Pages × User Flows × Browsers | Browser automation + visual regression (Playwright, Cypress) | Ideate generates user journeys; Regress runs screenshot diffing against known-good baselines |
| ML Pipeline | Models × Datasets × Metrics | W&B / MLflow run comparison + statistical tests | Ideate trains model variants; Regress compares against baseline metrics with significance tests |
| Smart Contracts | Functions × Chains × Edge Cases | Foundry/Anvil fork execution + invariant checks | Ideate writes fuzz test scenarios; Regress runs invariant test suites |
| Backend API | Endpoints × Methods × Auth Roles | Contract testing + live traffic shadow comparison | Ideate generates API call sequences; Regress replays against shadow traffic |
| Mobile App | Screens × Gestures × Devices | Appium automation + visual regression | Ideate scripts user flows; Regress runs on device farm |
| Compiler / Language | Grammar × Optimizations × Targets | Test suite execution + benchmark comparison | Ideate generates programs exercising edge cases; Regress runs conformance suite |

The two adaptation points are always the same: what is the specification surface? and what is the regression oracle? Everything else — backlog management, phase sequencing, drift control, self-improvement — transfers intact.

16.3 The Specification Layer

The Kitchen Loop works because the products it’s been applied to have well-defined specifications. Most codebases lack this. A specification layer — structured YAML or Markdown specs stored alongside code — solves this by making specs machine-readable, version-controlled, and auditable.
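What one machine-readable entry in such a layer might contain can be sketched as a typed record with a trivial validator. The field names and the "no oracle means no trust" check are hypothetical illustrations of ours; the paper prescribes the layer, not a schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpecClaim:
    """One row of the specification surface: a capability the product claims to support."""
    feature: str
    platform: str
    action: str
    oracle: str  # how the claim is verified, e.g. "fork-execution" or "schema+api-crossref"

def validate(claims: list) -> list:
    """Flag claims the loop cannot exercise: a claim without an oracle is untestable."""
    return [f"{c.feature}/{c.platform}/{c.action}: no oracle"
            for c in claims if not c.oracle]

claims = [
    SpecClaim("swap", "ethereum", "exact-in", "fork-execution"),
    SpecClaim("swap", "solana", "exact-in", ""),  # untestable claim: should be flagged
]
print(validate(claims))  # flags the solana entry
```

Storing such records alongside code gives Ideate an enumerable surface to sample from and gives reviewers a diffable artifact when the spec changes.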
16.4 The Scaffolding Layer

A scaffolding layer — templates for the six-phase orchestrator, ticket management stubs, oracle skeletons, and CI workflows — reduces time-to-first-loop from days to hours.

16.5 A Composable Stack

• Kitchen Loop Engine: Backlog → Ideate → Triage → Execute → Polish → Regress → (repeat)

• Specification Layer: What does the product promise? What should we test? YAML/Markdown specs versioned alongside code.

• Scaffolding Layer: Project setup, CI/CD, skill templates, oracle stubs. Reduces time-to-first-loop from days to hours.

17 Conclusion

17.1 The Shift

In our observation, software engineering is undergoing a phase transition. Code production — once the bottleneck — is becoming a commodity. Code review — once a human-only activity — is becoming automated. The competitive advantage is shifting to:

1. Specification: Knowing what to build (the AaU1000 method)

2. Verification: Proving it works (unbeatable tests)

3. Convergence: Ensuring it keeps working (regression and drift control)

17.2 The Evidence

Across two production systems and 285+ combined iterations (Sections 8.3 and 9.3), zero regressions were detected by the regression oracle, quality gates improved monotonically from 76–91% to 100%, and the loop autonomously fixed 17+ infrastructure bugs in its own tooling. The second system validated portability: the same architecture applied to a fundamentally different domain achieved comparable results with a 25x faster iteration cycle, requiring only two reimplemented components (specification surface and regression oracle). The unified trust model held under sustained autonomous operation without increasing human intervention.

17.3 The Invitation

The Kitchen Loop is not the future of software development. It is a practice available today, on a codebase of any size, with tools that already exist. The prerequisites are:

1. An enumerable specification surface — what does your product claim to do?
2. An automatable test environment — can an AI agent exercise your product?
3. A regression oracle — can you answer "is the system still working?" in bounded time?
4. The discipline to run it continuously — and to trust the output when the tests are unbeatable.

In our experience, the tests are the trust layer. The spec is the compass. The loop is the engine. Given sufficient verification infrastructure, the product evolves itself.

17.4 Testable Hypotheses

Our deployments suggest four empirical hypotheses that subsequent work could validate, replicate, or refute:

• H1: Coverage-exhaustion systems (as defined in the regime taxonomy, Section 2.9) discover more user-visible failures per iteration than task-completion systems in partially mature products with >50% specification-surface coverage.
• H2: Adversarial UAT gates — sealed test cards executed by a fresh evaluator with zero implementation context — reduce false-positive readiness assessments compared with implementer-authored test suites alone.
• H3: Tier-weighted scenario selection (Foundation 30% / Composition 50% / Frontier 20%) discovers more bugs per iteration than uniform random selection across the specification surface. The superlinear growth of the Composition tier (Section 3.6) predicts that the advantage increases with product maturity.
• H4: In user-facing verification tasks, weak-model evaluation (the least capable model available) is a better proxy for real-user verifiability than strong-model evaluation, because weak models fail on the same ambiguities that trip real users.

These hypotheses are intended to facilitate direct replication, extension, or refutation in future agentic-systems research.

17.5 Known Limitations & Future Work

The Kitchen Loop has five structural limitations that bound its applicability:

1. Single-threaded execution. The current orchestrator runs one iteration at a time.
Parallelization across multiple worktrees is architecturally straightforward but not yet implemented (Section 13.3).
2. Enumerable specification surface required. The method assumes the product's capabilities can be listed as a coverage matrix. Legacy monoliths with implicit specifications require a specification-extraction step (OP2, Section 13.5) before the loop can operate.
3. Oracle quality is the ceiling. The loop can only catch what the regression oracle can verify. If the oracle misses a failure mode, the loop is blind to it. Oracle transfer across domains (OP1) remains an open research problem.
4. Human role persists. Backlog grooming, specification design, and production promotion require human judgment. At scale, human merge capacity — not AI generation speed — becomes the binding constraint (Section 13.4).
5. Two-domain validation only. Our evidence spans two production systems in the DeFi domain. Generalization to other domains (Section 16) is architecturally supported but empirically unvalidated.

A The Coverage Matrix (Strategy Framework Example)

The strategy framework's intent-test coverage matrix illustrates the exhaustive approach. Every cell represents a claim: "Protocol X works on Chain Y for Intent Z." Empty cells are untested claims.
Protocol       | Chain       | Swap | LP Open | LP Close | Supply | Borrow | Repay | Withdraw
Aerodrome      | Base        | P0   | P0      | P0       | -      | -      | -     | -
TraderJoe V2   | Avalanche   | P0   | P0      | P0       | -      | -      | -     | -
Uniswap V3     | Ethereum    | P1   | P1      | P1       | -      | -      | -     | -
Uniswap V3     | Arbitrum    | P1   | P1      | P1       | -      | -      | -     | -
Uniswap V3     | Base        | P1   | P1      | P1       | -      | -      | -     | -
PancakeSwap V3 | BSC         | P1   | -       | -        | -      | -      | -     | -
Aave V3        | Ethereum    | -    | -       | -        | P0     | P0     | P0    | P0
Aave V3        | Arbitrum    | -    | -       | -        | P0     | P0     | P0    | P0
Aave V3        | Base        | -    | -       | -        | P0     | P0     | P0    | P0
Compound V3    | Ethereum    | -    | -       | -        | P1     | P1     | P1    | P1
Morpho Blue    | Ethereum    | -    | -       | -        | P2     | P2     | P2    | P2
Enso           | Multi-chain | P3   | -       | -        | -      | -      | -     | -
Curve          | Ethereum    | P3   | -       | -        | -      | -      | -     | -

P0 = critical path, P1 = breadth, P2 = depth, P3 = completeness.

B Representative Strategy Examples

B.1 Multi-Protocol Composability (Tier 2 — Composition)

A Tier 2 strategy exercised four intent types across two protocols in a single strategy:

Step 1: SUPPLY 0.5 WETH as collateral to Lending Protocol → 3 TXs (approve + supply + setCollateral), 270K gas
Step 2: BORROW 311.20 USDC at 30% LTV → 1 TX (borrow), 286K gas
Step 3: SWAP 155.60 USDC → 0.0749 WETH via DEX → 3 TXs (approve + approve_reset + swap), 213K gas
Step 4: LP_OPEN WETH/USDC range [1867–2282] → 4 TXs (approve + approve_reset + approve + lp_mint), 526K gas
Total: 11 transactions, 1.3M gas, zero bugs

All four verification layers passed on every intent. This was the first test of cross-protocol enrichment-data handoff — proving that the seams between components work, not just the components themselves.

B.2 Additional Foundation Examples (Tier 1)

Two Tier 1 scenarios illustrate the "obviously missing" signal. A basic DEX swap discovered that the protocol router configuration had no entry on ANY chain (compilation failed) and used a different interface version (8 vs. 7 parameters, causing silent reverts) — both fixed in the same iteration. Separately, the first-ever strategy on a new chain discovered a missing native-token symbol — a one-line omission that blocked ALL strategies on the entire chain.
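A coverage matrix like the one in Appendix A is straightforward to operationalize: represent each filled cell as a tested claim, and enumerate the empty cells as untested ones. A minimal sketch with abbreviated data (the dictionary shape and function names are our illustration, not the framework's API):

```python
# Coverage matrix as (protocol, chain) -> {intent: priority}.
# "-" cells from the table are simply absent; data abbreviated for illustration.
MATRIX = {
    ("Aerodrome", "Base"):   {"swap": "P0", "lp_open": "P0", "lp_close": "P0"},
    ("Aave V3", "Ethereum"): {"supply": "P0", "borrow": "P0",
                              "repay": "P0", "withdraw": "P0"},
    ("Curve", "Ethereum"):   {"swap": "P3"},
}
INTENTS = ["swap", "lp_open", "lp_close", "supply", "borrow", "repay", "withdraw"]

def untested_claims(matrix):
    """Every empty cell is an untested claim: enumerate them explicitly."""
    return [(proto, chain, intent)
            for (proto, chain), cells in matrix.items()
            for intent in INTENTS if intent not in cells]

def coverage(matrix):
    """Fraction of (protocol, chain, intent) cells that have a tested priority."""
    tested = sum(len(cells) for cells in matrix.values())
    return tested / (len(matrix) * len(INTENTS))
```

A scalar coverage figure plus an explicit list of empty cells is what makes "empty cells are untested claims" actionable for scenario selection.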
C Signal Platform Quality Gate Architecture

C.1 Verifier Families

Verifier Family | Signal Types Covered                                   | Data Source
Protocol TVL    | Discovery, TVL migration, undervalued protocol         | DeFi analytics API
Pool Yield      | Yield opportunity, LP opportunity, cross-DEX arbitrage | Pool analytics API
Lending Rate    | Lending arbitrage, utilization spike                   | Lending analytics API
Governance      | Governance catalyst, proposal tracking                 | Governance API
Exploit         | Exploit detection, exploit warning                     | Security feeds
Price Feed      | Whale alert, depeg risk                                | Price oracle API
Derivatives     | Funding extreme, OI divergence, liquidation cluster    | Derivatives aggregator
Emission        | Token unlock, emission change                          | Unlock calendars
Social          | Bribe opportunity, sentiment shift, narrative momentum | Social + on-chain
Stablecoin      | Stablecoin supply changes                              | Stablecoin analytics

Coverage: 72/72 signal types (100%) — up from 33/71 at iteration 1. The verifiers are not mocks — they call real APIs and cross-reference signal claims against live data. When The Graph deprecated its free hosted service, the loop detected the failure (agents producing 0 signals), diagnosed the root cause (stale subgraph URLs), and migrated to alternative data sources (GeckoTerminal, DefiLlama) — all autonomously.

C.2 Anti-Signal Canary Details

Tier 1 — Obviously Bad (6 canaries):
- Fabricated event for a non-existent entity
- Signal with an impossible metric value (e.g., 999,999% yield)
- False report for an entity that was never affected
- Signal with null/empty identifiers
- Signal with an empty title and description
- Signal with an out-of-range confidence score

Result: 100% caught across 163 iterations. Zero escapes.
Tier 2 — Shadow (5 canaries):
- Real event that already concluded
- Known trend reported weeks ago (stale)
- Opportunity that dropped below threshold since detection
- Valid signal for a deprecated/discontinued entity
- Accurate data from a non-authoritative source

Result: initially 33% caught by L1–L3 (iteration 1); now 100% (iteration 163) — improved as verifiers were added and L2 factual checkers gained cross-reference capabilities.

Tier 3 — Adversarial (5 canaries):
- Real data + fabricated interpretation
- Accurate raw data + incorrect derived conclusion
- Valid input data + wrong calculation
- Correct methodology applied to the wrong time window
- Signal mixing real and fabricated sub-claims

Result: initially 67% caught by L1–L3 (iteration 1); now 100% (iteration 163) — the remaining cases required cross-referencing conclusions against source data, which improved as verifier coverage reached 100%.

Tier 4 — Mixed True/False (5 canaries, added ~iteration 100):
- Signal with 3 true claims and 2 fabricated claims blended together
- Accurate quantitative data with a fabricated qualitative assessment
- Real protocol event attributed to the wrong protocol
- Valid historical data presented as current
- Correct analysis of real data with one inverted conclusion

Result: 100% caught. Tests the quality gate's ability to detect partial failures rather than binary pass/fail.

API Degradation Canaries (3 canaries):
- Simulated timeout from the primary data source
- Simulated error response from the price oracle
- Simulated partial data return (50% of expected fields)

Result: 100% resilience (3/3). Quality gates degrade gracefully — they flag signals as unverifiable rather than producing false passes when dependencies fail.
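The canary tiers above share one mechanism: inject signals with known defects alongside real ones, then verify the quality gate rejects every canary. A minimal sketch, covering only Tier 1 style checks (the Signal shape, field names, and thresholds are illustrative, not the released implementation):

```python
from dataclasses import dataclass

@dataclass
class Signal:
    entity: str
    title: str
    metric_pct: float   # e.g., a claimed yield, in percent
    confidence: float   # expected in [0, 1]
    is_canary: bool = False  # set when the harness injects a known-bad signal

def quality_gate(sig: Signal) -> bool:
    """Toy Tier 1 ("obviously bad") checks only.
    Real L1-L4 gates also cross-reference claims against live data."""
    if not sig.entity or not sig.title:
        return False            # null/empty identifiers
    if not 0.0 <= sig.confidence <= 1.0:
        return False            # out-of-range confidence score
    if sig.metric_pct > 10_000:
        return False            # impossible metric value (hypothetical cutoff)
    return True

def canary_escape_rate(signals: list[Signal]) -> float:
    """Fraction of injected canaries that passed the gate; should be 0.0."""
    canaries = [s for s in signals if s.is_canary]
    escaped = [s for s in canaries if quality_gate(s)]
    return len(escaped) / len(canaries) if canaries else 0.0

batch = [
    Signal("aave-v3", "Utilization spike", 12.0, 0.8),
    Signal("ghost-protocol", "Yield opportunity", 999_999.0, 0.9, is_canary=True),
    Signal("", "", 5.0, 0.5, is_canary=True),
]
assert canary_escape_rate(batch) == 0.0  # zero escapes on this toy batch
```

Shadow and adversarial tiers (2–4) additionally require cross-referencing claims against source data, which is why their catch rates improved only as verifier coverage grew.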
C.3 Key Lesson: Rapid Iteration Reveals Intermittent Bugs

One agent's validation failure rate fluctuated between 75% and 100% across iterations because its data-source API returns different formats at different times. A single test run would show "pass" or "fail" — 163 iterations reveal the ~85% steady-state rate. This class of bug, invisible to traditional CI, demonstrates the value of coverage through repetition, not just breadth.

C.4 Drift Detection at Scale

The drift-detection system monitors quality metrics over a sliding window (5 recent vs. 20 baseline iterations). The key insight: drift detection provides early warning before quality gates fail — a 5% drop in factual pass rate over 10 iterations triggers an alert before any individual signal fails hard. This is the mechanism that allows the loop to operate autonomously for 163+ iterations without human supervision.

D Skill Interface Reference

A Kitchen Loop deployment consists of domain-independent orchestration plus domain-specific skills. The following table summarizes the skill interfaces that a new deployment must implement:

Skill       | Phase   | Input              | Output                    | Domain-Specific?
backlog     | Backlog | Ticket state       | Promoted tickets          | Partially (labels)
ideate      | Ideate  | Scenario + spec    | Report + scenario         | Yes (defines "usage")
triage      | Triage  | Experience report  | Labeled, deduped tickets  | Partially (taxonomy)
execute     | Execute | Ranked tickets     | Branch + PR + tests       | Partially (test patterns)
pr-manager  | Polish  | Open PR list       | Reviewed, CI-passing PRs  | No
regress     | Regress | Codebase state     | Pass/fail + drift         | Yes (defines "oracle")
loop-review | Meta    | Logs + diffs       | Health + improvement      | No
review-meta | Meta    | All review reports | Trends + recommendations  | No

Skills marked "Yes" must be fully reimplemented for each new domain; the others transfer with minimal configuration (ticket labels, CI commands, review-tool selection).
E How to Cite

@misc{kitchenloop2026,
  title = {The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase},
  author = {Roy, Yannick},
  year = {2026},
  howpublished = {arXiv preprint},
  url = {https://github.com/0xagentkitchen/kitchenloop},
  note = {Companion repository with skills, oracles, and canary templates}
}
