Interpretable Context Methodology: Folder Structure as Agentic Architecture


Authors: Jake Van Clief, David McDermott

JAKE VAN CLIEF, DAVID MCDERMOTT, Eduba, University of Edinburgh, USA

Current approaches to AI agent orchestration typically involve building multi-agent frameworks that manage context passing, memory, error handling, and step coordination through code. These frameworks work well for complex, concurrent systems. But for sequential workflows where a human reviews output at each step, they introduce engineering overhead that the problem does not require. This paper presents Interpretable Context Methodology (ICM), a method that replaces framework-level orchestration with filesystem structure. Numbered folders represent stages. Plain markdown files carry the prompts and context that tell a single AI agent what role to play at each step. Local scripts handle the mechanical work that does not need AI at all. The result is a system where one agent, reading the right files at the right moment, does the work that would otherwise require a multi-agent framework. This approach applies ideas from Unix pipeline design, modular decomposition, multi-pass compilation, and literate programming to the specific problem of structuring context for AI agents. The protocol is open source under the MIT license.¹

CCS Concepts: • Human-centered computing → Interactive systems and tools; HCI design and evaluation methods; • Computing methodologies → Artificial intelligence; • Software and its engineering → Software design engineering.

Additional Key Words and Phrases: context engineering, human-AI interaction, AI agent orchestration, filesystem architecture, human-in-the-loop, mixed-initiative systems, workflow automation

1 Introduction

There are genuinely good agentic frameworks available today. CrewAI, LangChain, AutoGen, and others handle multi-step orchestration, memory management, tool use, and error recovery. They work.
But they work within their own structures, and adjusting those structures requires development work. Changing the order of steps, swapping a prompt, adding or removing a stage, skipping something that is not relevant today: these actions typically mean editing code, understanding abstractions, and redeploying. For practitioners whose workflows are sequential and need human review at each step, the control surface can be much simpler.

This paper describes Interpretable Context Methodology (ICM), a method for orchestrating AI agent workflows using folder structure, markdown files, and local scripts. The central observation is straightforward: if the prompts and context for each stage of a workflow already exist as files in a well-organized folder hierarchy, you do not need a coordination framework to manage multiple specialized agents. You need one orchestrating agent that reads the right files at the right moment. The folder structure tells it what to do at each step, and if the agent delegates sub-tasks, the same folder structure determines what context those sub-agents receive. Local Python scripts handle the parts that do not need AI: fetching data, moving files, formatting output, sending emails.

This is going backward before going forward. The principles that made Unix pipelines effective in the 1970s² and multi-pass compilers tractable in the 1980s apply directly to AI agent orchestration in the 2020s. ICM applies those principles to the specific challenge of structuring context for language models.

The central question this paper examines is how structuring the context delivery mechanism as a filesystem hierarchy affects practitioners' ability to control, inspect, and edit AI agent behavior across multi-step workflows, and what this structure means for the quality of the model's output at each stage.

¹ https://github.com/RinDig/Interpretable-Context-Methodology-ICM-
² Programs that do one thing. Output of one becomes input of another.
Plain text as universal interface. These ideas are over fifty years old and they hold up.

Author's Contact Information: Jake Van Clief, David McDermott, theceo@eduba.io, Eduba, University of Edinburgh, Palm Coast, Florida, USA.

Table 1. Comparison of control surfaces for sequential, human-reviewed workflows. The first six rows show dimensions where ICM's filesystem approach simplifies common operations. The last four rows show dimensions where framework-based approaches provide capabilities that ICM lacks or handles less well.

| Dimension | Framework approach | ICM approach |
|---|---|---|
| Change stage order | Edit orchestration code, redeploy | Rename or reorder folders |
| Modify a prompt | Edit agent configuration in code | Edit a markdown file |
| Add or remove a stage | Write new agent class, update orchestrator | Add or delete a folder |
| Inspect intermediate state | Add logging, build dashboard | Open the folder, read the files |
| Hand off to another person | Document environment, dependencies, setup | Copy the folder |
| Who can make changes | Developer | Anyone with a text editor |
| Error recovery mid-pipeline | Built-in retry, fallback, exception handling | Manual re-run of failed stage |
| Conditional branching | Programmatic routing based on agent output | Human decides between stages |
| Concurrent execution | Native parallel agent coordination | Sequential by design |
| External service integration | Programmatic API calls, auth management | Local scripts or MCP connections |

The paper is organized as follows. Section 2 traces the relevant background across software engineering, context engineering, and human oversight research. Section 3 describes the protocol itself. Section 4 walks through working implementations and reports on early practitioner experience. Section 5 discusses where this approach fits and where it does not, including implications for the design of interactive intelligent systems more broadly.
Section 6 explores future directions, drawing on the structural parallels between ICM and multi-pass compilation to propose semantic debugging and source-level traceability for AI workflows.

2 Background and Related Work

2.1 Composability and the Unix Tradition

In 1978, Doug McIlroy articulated the principles that would define Unix's design philosophy: make each program do one thing well, expect the output of every program to become the input to another, and use text streams as the universal interface between programs [1]. These principles were not theoretical. They were engineering decisions driven by constraints. The PDP-11 machines that ran early Unix had limited memory. Programs had to be small. The way to build powerful systems from small programs was to connect them through a common interface [2].

Kernighan and Pike later argued that the power of a Unix system comes more from the relationships among programs than from the programs themselves [5]. Eric Raymond codified this into explicit design rules: the Rule of Modularity (write simple parts connected by clean interfaces), the Rule of Transparency (design for visibility to make inspection and debugging easier), and the Rule of Composition (design programs to be connected to other programs) [4]. These principles were formalized in software architecture as the "pipe-and-filter" pattern by Shaw and Garlan [6]: a system of independent components, each reading from inputs and writing to outputs, connected by data streams. The pattern's strength is that any component can be replaced, inspected, or tested independently.

A related lineage runs through build systems. Stuart Feldman's Make (1979) established that workflows could be defined as dependency graphs between files using declarative specifications [7]. The key insight: files are both the artifacts of work and the coordination mechanism between stages.
You do not need a separate orchestration layer when the filesystem tracks what has been produced and what depends on it. Multi-pass compilers work on the same principle: source code transforms through a sequence of intermediate representations, each pass reading the output of the previous pass, with well-defined interfaces between them [52].

David Parnas argued in 1972 that systems should be decomposed based on what each module hides from the rest of the system, yielding components that can be modified independently [9]. Edsger Dijkstra coined the term "separation of concerns" to describe the discipline of addressing one thing at a time as the only available technique for effective ordering of one's thoughts [8]. These ideas appear across decades and contexts because they describe something real about how systems stay manageable as they grow. They are relevant here because the problem of orchestrating AI agents through multi-step workflows is, at its core, a problem of modular decomposition, clean interfaces, and readable intermediate representations.

2.2 Context Engineering and Agentic AI

The practitioner community has increasingly adopted the term "context engineering" to describe what building production AI systems actually involves. Andrej Karpathy gave the term its clearest articulation in June 2025, arguing that "prompt engineering" understates the work [16]. The distinction is useful. Prompt engineering suggests crafting a single instruction. Context engineering describes the broader discipline of filling the context window with the right information: instructions, retrieved knowledge, memory, tool descriptions, and prior outputs, all structured so the model can use them effectively. This paper uses the term in that sense.
Lance Martin at LangChain formalized this into a taxonomy of strategies: write (author instructions), select (choose relevant context), compress (reduce token waste), and isolate (keep unrelated context separate) [17]. Simon Willison argued that the entire information environment, including previous model responses and system state, is part of the context that needs engineering [18].

The current generation of agentic frameworks, LangChain [21], AutoGen [20], CrewAI, and others, handle context engineering through code-level abstractions. They define agents as objects, conversations as message arrays, and orchestration as programmatic control flow. This works well for systems that need dynamic multi-agent collaboration, concurrent execution, or complex branching logic.

But for sequential workflows, these frameworks solve a coordination problem that may not need to exist. If Agent A's job is to research, Agent B's job is to filter, and Agent C's job is to write, the framework's role is to pass the right context to the right agent at the right time. That coordination can also be achieved by putting the right files in the right folders. The orchestrating agent reads different instructions at each stage. If it delegates sub-tasks to smaller models (as current agent-team architectures allow), the folder structure provides the context for those delegations too. The coordination logic lives in the filesystem, not in application code.

This matters because of how language models handle context. Liu et al. demonstrated that LLMs perform significantly worse when relevant information is buried in the middle of long contexts [25]. The more irrelevant material in the context window, the worse the model performs on the material that matters. Jiang et al.
showed that prompt compression can achieve up to 20x token reduction with minimal performance loss [31], but a simpler approach is to avoid loading irrelevant context in the first place. Stage-specific context loading, where each stage only sees the files it needs, prevents the problem rather than treating it after the fact.

It is worth distinguishing ICM from Anthropic's Model Context Protocol (MCP) [23]. MCP standardizes how models access external tools and data sources, solving the integration problem between AI systems and the services they need to call. ICM addresses a different layer: how to structure and deliver context to an agent across a multi-stage workflow. The two are complementary. An ICM stage might use MCP connections to access external services, while the stage's folder structure determines what context the agent receives when doing so. This separation matters for efficiency as well. Jones and Kelly at Anthropic observed that loading all tool definitions upfront into the context window slows agents and increases costs [24]. ICM's stage-based architecture avoids this by scoping tool definitions to individual stages, loading only the tools relevant to the current step.

2.3 Human Oversight and Observability

The question of how humans should relate to automated systems has been studied for decades, and the findings are remarkably consistent. Fails and Olsen introduced the interactive machine learning paradigm in 2003: rapid cycles of system output, human feedback, and correction [35]. Amershi et al. argued that interactive ML must involve users at all stages, from training through evaluation, with interfaces that support steering and correction [34]. Dudley and Kristensson's review of interface design for interactive ML emphasized that transparent, inspectable representations are essential for effective human-AI collaboration [36].
Eric Horvitz's work on mixed-initiative systems established principles for coupling automated services with human control [37]. The key insight: systems should let users invoke, adjust, and terminate automated processes at natural breakpoints. This requires that the system's state be visible and its actions be reversible.

Parasuraman and Riley identified the failure modes that emerge when this goes wrong [41]. When automated outputs are opaque, people either trust them blindly (misuse) or stop using them entirely (disuse). Both failures stem from the same cause: the human cannot see what happened between input and output. Lee and See's work on trust calibration reinforced this: appropriate trust requires that system behavior be observable [40]. Parasuraman, Sheridan, and Wickens proposed a taxonomy of automation levels, noting that the right level of automation varies by task and that systems should support different levels at different stages [42].

Ben Shneiderman synthesized these threads into the Human-Centered AI framework, arguing that systems can achieve both high human control and high automation simultaneously [43]. The two are not in tension. They reinforce each other when the system is designed to be comprehensible, predictable, and controllable [44]. Cynthia Rudin made the most forceful version of this argument: stop building opaque systems and then trying to explain them after the fact. Build systems that are inherently interpretable [45]. This applies at the workflow level as much as at the model level. A production pipeline where every intermediate output is a readable file is inherently interpretable. There is nothing to explain because nothing was hidden.

This is also becoming a regulatory concern. The EU AI Act requires human oversight of high-risk AI systems, distinguishing between human-in-the-loop, human-on-the-loop, and human-in-command approaches [49]. Novelli et al.
argue that effective oversight requires institutional design, not just technical capability [50]. Systems with staged review points, audit trails, and defined intervention surfaces have a practical advantage as these requirements take effect.

[Figure 1: Layer 0, CLAUDE.md, "Where am I?", ~800 tokens; Layer 1, CONTEXT.md, "Where do I go?", ~300 tokens; Layer 2, stage CONTEXT.md, "What do I do?", 200–500 tokens; Layer 3, reference material, "What rules apply?", 500–2k tokens; Layer 4, working artifacts, "What am I working with?", varies. Layers 0–2: structural (routing); Layers 3–4: content (factory / product).]

Fig. 1. The five-layer context hierarchy. Layers 0–2 provide structural routing and stage instructions. Layers 3 and 4 carry content: Layer 3 holds reference material (the factory), stable across runs; Layer 4 holds working artifacts (the product), unique to each run.

3 Interpretable Context Methodology

3.1 Design Principles

ICM is built on five principles, each borrowed from established practice.

One stage, one job. Each stage in a workspace handles a single step of the workflow and writes its output to its own folder. This follows McIlroy's Unix principle and Parnas's information-hiding criterion [1, 9]. A stage that fetches data does not also filter it. A stage that filters does not also format the final output. Each stage reads a defined input, transforms it, and writes a defined output, the same structure that governs individual passes in a multi-pass compiler.

Plain text as the interface. Stages communicate through markdown and JSON files. No binary formats, no database connections, no proprietary serialization. This follows Kernighan and Pike's argument that text is the universal interface [5]. Any tool that can read a text file can participate in the workflow. Any human who can open a text editor can inspect or modify any artifact.

Layered context loading.
Agents load only the context they need for the current stage, following the principle that less irrelevant context means better model performance [25]. This is prevention rather than compression [31]. Within the content layers, ICM further distinguishes between reference material (stable rules and conventions that persist across runs) and working artifacts (per-run content that changes every time). The model receives these as structurally separate context, which matters because they require different kinds of attention: reference material should be internalized as constraints, while working artifacts should be processed as input.

Every output is an edit surface. The intermediate output of each stage is a file a human can open, read, edit, and save before the next stage runs. This implements Horvitz's mixed-initiative principles [37] and Shneiderman's direct manipulation paradigm [46]: the human works with visible, manipulable objects, and the system picks up whatever the human left there.

Configure the factory, not the product. A workspace is set up once with the user's preferences, brand, style, and structural decisions. After that, each run of the pipeline produces a new deliverable using the same configuration. This follows the continuous delivery principle that production pipelines should be repeatable [15].

3.2 Architecture

An ICM workspace is a folder. Inside it, agents navigate a five-layer context hierarchy (Figure 1).

Table 2. Layer 3 (reference material) versus Layer 4 (working artifacts).

| | Layer 3: Reference | Layer 4: Working |
|---|---|---|
| Changes between runs | No | Yes |
| Example files | voice.md, design-system.md, conventions.md | research-output.md, script-draft.md |
| Model should | Internalize as constraints | Process as input |
| Configured during | Workspace setup (once) | Pipeline execution (each run) |
| Folder location | references/, _config/, shared/ | output/ |
| Analogy | The recipe | The ingredients |

Layer 0 is the global identity file.
It tells the agent which workspace it is in, what the folder structure contains, and where to find things. Layer 1 is workspace-level task routing: given what the user wants to do, which stage handles it, and what shared resources exist across stages. Layer 2 is stage-specific: the contract that defines inputs, process, and outputs for one step of the workflow.

Layers 3 and 4 are both content that the agent loads while executing a stage, but they represent fundamentally different kinds of context. Layer 3 is reference material: design systems, voice rules, build conventions, style guides, domain knowledge bundled as skill files. These files are configured once during workspace setup and remain stable across every run of the pipeline. They are the factory.³ Layer 4 is working artifacts: the output of the previous stage, user-provided source material, anything specific to this particular run of the pipeline. These files are produced and consumed during execution and change every time.

The distinction matters for how the model processes context. Layer 3 material needs to be internalized as constraints and patterns: the model should write like this, use these colors, follow these conventions. Layer 4 material needs to be processed as input: the model should transform this research into a script, or convert this script into a visual specification. Mixing persistent rules with per-run artifacts in an undifferentiated context window forces the model to sort them on its own. Separating them in the folder structure means the model receives already-organized context.

A rendering agent might only need Layers 0 through 2. A script-writing agent reads down to Layer 4 to access both the voice rules (Layer 3) and the source material (Layer 4). No agent reads everything. This keeps token cost low and context focused, and it avoids the degradation that Liu et al. documented when models process long contexts full of irrelevant material [25].
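As a concrete illustration of this layered loading, the context window for one stage can be assembled by concatenating the layers in order. This sketch is not taken from the ICM repository; the function name and parameters are hypothetical, and only the file names follow the hierarchy described above:

```python
from pathlib import Path

def load_stage_context(workspace: Path, stage: Path,
                       reference: list[str], working: list[str]) -> str:
    """Assemble one stage's context window, layer by layer.

    Layers 0-2 are structural: identity, routing, and the stage
    contract. `reference` lists Layer 3 paths and `working` lists
    Layer 4 paths, both relative to the workspace root. Nothing
    outside these lists is ever loaded, which is what keeps the
    context scoped to the current stage.
    """
    parts = [
        (workspace / "CLAUDE.md").read_text(),   # Layer 0: identity
        (workspace / "CONTEXT.md").read_text(),  # Layer 1: routing
        (stage / "CONTEXT.md").read_text(),      # Layer 2: stage contract
    ]
    parts += [(workspace / p).read_text() for p in reference]  # Layer 3
    parts += [(workspace / p).read_text() for p in working]    # Layer 4
    return "\n\n".join(parts)
```

A script-writing stage would pass its voice guide as `reference` and the previous stage's output files as `working`; a rendering stage could pass two empty lists and receive only Layers 0 through 2.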
The folder structure for a typical workspace is shown in Figure 2. The numbering encodes execution order. The folder boundaries enforce separation of concerns. The output/ directories are the Layer 4 handoff points: the output of stage 01 becomes available as input to stage 02. If a human edits a file in 01_research/output/ before running stage 02, the agent picks up the edited version. The references/ directories and _config/ folder hold Layer 3 material: the stable knowledge and constraints that persist across runs.

³ This connects to the fifth design principle: configure the factory, not the product. Layer 3 is the factory configuration. Layer 4 is what the factory produces each time it runs.

    workspace/
        CLAUDE.md            (Layer 0)
        CONTEXT.md           (Layer 1)
        stages/
            01_research/
                CONTEXT.md   (Layer 2)
                references/  (Layer 3)
                output/      (Layer 4)
            02_script/
                CONTEXT.md   (Layer 2)
                references/  (Layer 3)
                output/      (Layer 4)
            03_production/
                CONTEXT.md   (Layer 2)
                references/  (Layer 3)
                output/      (Layer 4)
        _config/             (Layer 3)
        shared/              (Layer 3)
        setup/
            questionnaire.md

Fig. 2. Folder structure of a typical ICM workspace, with layer annotations. Files and folders are color-coded by their role in the context hierarchy. Layer 3 material (reference) persists across runs. Layer 4 material (working artifacts) changes each time the pipeline executes.

Layer 2 is the control point of the entire system. Each stage contract includes an Inputs table that specifies exactly which files from Layers 3 and 4 the agent should load, and which sections of those files are relevant.⁴ Without this scoping mechanism, an agent would either load everything in the workspace or rely on its own judgment about what matters. The Inputs table makes the selection explicit, editable, and auditable.

This is the filesystem doing the work that a framework would otherwise do in code. Stage sequencing is the folder numbering. Context scoping is the folder hierarchy.
State management is the les on disk. Coordination between stages is one folder’s output being another folder’s input. From the model’s perspective, this lay ered loading changes the composition of the context window at each stage. Layers 0 through 2 together contribute roughly 1,300 to 1,600 tokens of identity , routing, and stage-specic instruction. Layer 3 adds reference material scoped to the current stage, typically 500 to 2,000 tokens depending on how many conventions and guidelines apply . Layer 4 adds the working material for this run, a research document, a script, a specication, which varies with the content but rarely e xceeds a few thousand tokens when the previous stage has done its job of condensing and structuring. The total context deliver ed to the model at any given stage typically ranges from 2,000 to 8,000 tokens, well within the range where curr ent models perform at 4 Larger reference collections can include their own r outing les, a CON TEXT .md within a conguration or design system folder , that help agents navigate to the right content within the collection. This is the routing pattern from Layer 1 applied recursively within Layer 3. 8 • Jake V an Clief, David McDermo Research Script Production Monolithic 0 0 . 5 1 1 . 5 2 3 4 5 · 10 4 ~ 42k ~ 5.6k ~ 5.5k ~ 4.9k T okens in context window Layers 0–2 (structural) Layer 3 (reference) Layer 4 (w orking) Unused/irrelevant context Fig. 3. Context window comp osition by stage (representativ e token counts from the script-to-animation workspace). The three ICM stages each deliver 2,000–8,000 focused tokens. A monolithic approach loading all stages’ instructions, all reference material, and all prior outputs produces a context window exceeding 40,000 tokens, most of it irrelevant to the current task. their best. Figure 3 illustrates this composition acr oss three example stages and contrasts it with a monolithic approach. 
Contrast this with a monolithic approach where all stage instructions, all reference files, and all prior outputs are loaded into a single prompt. That approach can easily reach 30,000 to 50,000 tokens, pushing into the range where Liu et al. found significant performance degradation on information retrieval tasks [25]. The "unused/irrelevant context" segment in the monolithic bar of Figure 3 represents tokens from stages other than the one currently executing: instructions the agent will not follow during this step, reference material that applies to a different stage, and prior outputs already consumed by earlier stages. In ICM, these tokens are never loaded. In a monolithic prompt, they occupy context space without contributing to the current task. The compression research by Jiang et al. [31, 32] addresses this problem after the fact. ICM's architecture avoids it by construction: each stage receives a focused, appropriately sized context window because the folder structure determines what gets loaded.

Richard Gabriel argued that systems prioritizing simplicity of implementation over feature completeness tend to survive and spread, because they are easier to port, easier to understand, and easier to improve incrementally [11]. ICM trades the flexibility of a programmatic orchestrator for the portability, inspectability, and editability of plain files. That tradeoff is the point. In the same spirit, Plan 9 from Bell Labs extended Unix's "everything is a file" principle to its full conclusion, representing all system resources as files in per-process namespaces [12]. ICM applies the same idea to AI workflows: all state, all context, all instructions exist as files in a folder namespace.

3.3 Stage Contracts and Handoffs

Figure 4 illustrates the flow between stages. Each stage reads from the previous stage's output folder, processes it according to its own contract, and writes to its own output folder.
At each boundary, the human can inspect and edit the output before the next stage runs.

[Figure 4: Stage 1 Research → review gate (human edits here) → Stage 2 Script → review gate → Stage 3 Production; each stage receives Layers 0–2 + 3 + 4 and writes to its own output/ folder.]

Fig. 4. Pipeline flow through three stages with review gates. Each stage receives its own context (Layers 0–4), writes output to its folder, and the human reviews and optionally edits before the next stage reads it. The same model executes every stage; the folder structure controls what context it receives.

Each stage in an ICM workspace defines a contract with three parts: what it reads (inputs), what it does (process), and what it writes (outputs). This contract is spelled out in the stage's CONTEXT.md file. A typical stage contract looks like this:

    ## Inputs
    - Layer 4 (working): ../01_research/output/
    - Layer 3 (reference): ../../_config/voice.md
    - Layer 3 (reference): references/structure.md

    ## Process
    Write a script based on the research output.
    Follow the structure in structure.md.
    Match the tone described in voice.md.

    ## Outputs
    - script_draft.md -> output/

The Inputs table distinguishes between Layer 3 files (reference material that stays the same every run) and Layer 4 files (working artifacts from this specific run). The agent reads the CONTEXT.md, follows the instructions, and writes its output. The human reviews what landed in output/. If it needs adjustment, the human edits the file directly. The next stage reads whatever is there.

This implements prompt chaining at the filesystem level.
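An Inputs list in this shape is simple enough to scope mechanically. The sketch below is a hypothetical helper, not part of the ICM protocol; the bullet format it matches is an assumption modeled on the contract excerpt above:

```python
import re

# Assumed bullet format: "- Layer N (working|reference): path"
INPUT_LINE = re.compile(
    r"-\s*Layer\s*(?P<layer>[34])\s*\((?:working|reference)\)\s*:\s*(?P<path>\S+)")

def parse_inputs(contract: str) -> dict[int, list[str]]:
    """Map layer number -> file paths named in a stage's Inputs list.

    Everything the stage is allowed to load is listed here, so the
    returned dict is the stage's entire context scope: explicit,
    editable, and auditable, exactly as the contract intends.
    """
    scoped: dict[int, list[str]] = {3: [], 4: []}
    for match in INPUT_LINE.finditer(contract):
        scoped[int(match.group("layer"))].append(match.group("path"))
    return scoped
```

A local script could use this to verify, before a run, that every path a stage's contract names actually exists on disk.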
Wu, Terry, and Cai introduced AI Chains as a method for creating transparent, controllable multi-step LLM workflows where each step's output becomes the next step's input [26]. ICM does the same thing, but the chain is a sequence of folders and the links between them are plain files. The stage outputs serve as intermediate representations: each one is a complete, readable artifact that captures the work done so far and provides everything the next stage needs to continue.

There is also something of Knuth's literate programming in this design [10]. The markdown files that instruct the agent are simultaneously the documentation that tells a human what the stage does, what it expects, and what it produces. The instruction set and the documentation are the same artifact. This is useful in practice because it means the workspace is self-documenting. A new team member can read the CONTEXT.md files top to bottom and understand the entire pipeline without running it.

Wei et al. demonstrated that breaking complex reasoning into intermediate steps dramatically improves LLM performance [27]. ICM applies this finding architecturally: complex workflows are decomposed into stages with explicit boundaries, and each stage receives focused, stage-appropriate context. The model gets a clear, scoped task at each step rather than a monolithic instruction to do everything in a single pass.

3.4 Portability and Reproducibility

A workspace is a folder. It can be copied to another machine, committed to Git, emailed as a zip file, or synced through any cloud storage service. It carries its own prompts, its own context structure, its own stage definitions. There is no server to configure, no environment to replicate, no deployment step.

ICM workspaces are Git-compatible by default [13]. Every change to a prompt, every edit to a stage output, every configuration adjustment is diffable and reversible.
Stage outputs can be committed after each run, creating a version history of the entire production pipeline's behavior over time. This is infrastructure as code [14] applied to AI workflows: the workspace definition is the system. There is no separate deployment artifact.

This portability matters for a practical reason. If a consultant builds a workspace for a client's weekly reporting workflow, handing it over means copying a folder. The client can run it, edit the prompts to match their evolving needs, and adjust stages without involving a developer. The same handoff with a framework-based solution typically requires documentation, environment setup, dependency management, and ongoing technical support.

4 Working Implementations

ICM is not a theoretical proposal. The protocol has been implemented and tested across several production workflows.⁵

4.1 Model and Environment

All workspaces described here were developed and run using Claude Code with Claude Opus 4.6 as the primary agent [54]. For sub-agent tasks within stages, Opus 4.6 delegates to Claude Sonnet 4.6 through its Agent Teams capability, which coordinates multiple agents working in parallel from a single orchestrator.

A detail worth noting: Opus 4.6 uses the workspace's own context files, the CONTEXT.md hierarchy and Layer 3 reference material, to fill prompts for its sub-agents. The model reads the folder structure to determine what context each sub-agent should receive and what task it should perform. This means the ICM architecture is doing double duty. It structures context for the primary agent, and it provides the specification that the primary agent uses to delegate work. The folder hierarchy is both the human's control surface and the model's orchestration logic.

ICM is designed to be model-agnostic. The protocol specifies folder structure, file formats, and naming conventions. It does not depend on any model-specific capability.
A workspace built for Claude could be run with a different model by pointing that model at the same files. Whether the results would be equivalent is an empirical question that depends on how different models handle the same context, but the protocol itself imposes no vendor lock-in. The workspaces described below were tested with the models listed above.

4.2 Script-to-Animation Pipeline

The first workspace built on ICM takes a content idea through three stages to produce a working animated video. (All workspaces referenced here are available or buildable through the ICM repository at https://github.com/RinDig/Interpretable-Context-Methodology-ICM-.) Stage 1 (01_research) takes a topic and produces structured research output: key points, narrative angles, supporting data. The agent reads a research brief from the user and writes a research document to its output folder. Stage 2 (02_script) reads the research output and writes a script. The stage's CONTEXT.md points the agent to a voice guide and structural template in the _config/ folder. The script follows the user's established tone and format. Stage 3 (03_production) reads the finished script and produces animation specifications and working Remotion code (Remotion is a React-based framework for creating videos programmatically). The stage's context includes design guidelines, color palettes, and animation conventions from setup. At each stage boundary, the human reviews the output. A research document that misses an important angle gets edited before the script stage runs. A script that runs too long gets trimmed before the production stage sees it. The agent at each stage works with whatever the human left in the previous output folder. This workspace runs on a single Claude Code session. One orchestrating agent (Opus 4.6) manages the pipeline, delegating sub-tasks within stages to faster sub-agents (Sonnet 4.6) as described in Section 4.1.
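As a concrete illustration of the layout this three-stage pipeline uses, the following Python sketch scaffolds the numbered folders and per-stage contracts. It is a hedged sketch, not part of the protocol specification: the `output` subfolder name and the `scaffold` function are illustrative assumptions, while the numbered stage folders, `_config/`, `CONTEXT.md`, and the Inputs/Process/Outputs sections come from the workspaces described in the text.

```python
import os
import tempfile

STAGES = ["01_research", "02_script", "03_production"]

def scaffold(root):
    """Create a minimal three-stage ICM-style workspace skeleton.

    Numbered folders are the stages; each carries a plain-markdown
    CONTEXT.md contract and a folder for its output artifact.
    """
    os.makedirs(os.path.join(root, "_config"), exist_ok=True)
    for stage in STAGES:
        os.makedirs(os.path.join(root, stage, "output"), exist_ok=True)
        with open(os.path.join(root, stage, "CONTEXT.md"), "w") as f:
            # The contract's section names follow the stage-contract
            # convention described later in the paper.
            f.write(f"# {stage}\n\n## Inputs\n\n## Process\n\n## Outputs\n")
    return sorted(os.listdir(root))

print(scaffold(tempfile.mkdtemp()))
# ['01_research', '02_script', '03_production', '_config']
```

Nothing here requires a framework: the resulting folder is the entire deployment artifact, which is why handing a workspace to a client amounts to copying a directory.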
The delegation is itself driven by the folder structure: the orchestrating agent reads the stage's CONTEXT.md to determine what work to delegate and what context to provide. There is no separate orchestration framework. The same folder hierarchy that tells the human what each stage does tells the agent how to coordinate its sub-agents. In compiler terms, the workspace performs multi-pass compilation: the processing engine runs multiple times, producing a different intermediate representation at each pass, with the folder structure determining which pass runs next.

4.3 Course Deck Production

A second workspace takes unstructured source material (PDFs, papers, lecture notes, rough outlines) and produces polished PowerPoint slide decks through five stages: content extraction, structural planning, slide drafting, visual design specification, and final assembly. The five-stage structure matters because slide deck production is a process where human judgment is essential at several points. The structural plan (stage 2 output) determines the entire arc of the presentation. Getting it wrong means everything downstream is wrong. By surfacing the structural plan as an editable markdown file before any slides are drafted, ICM lets the human course-correct at the point where correction is cheapest and most effective.

4.4 Building New Workspaces

ICM includes a workspace-builder: a five-stage workspace whose output is a new workspace. It walks through discovery (what is the domain, what is the workflow), stage mapping (where are the natural breakpoints), scaffolding (creating the folder structure), questionnaire design (what setup questions should the workspace ask), and validation (does the pipeline run end to end). The workspace-builder itself follows ICM conventions. The workspaces it produces are consistent because the builder enforces the same structural rules it was built with.
This means practitioners can create new workspaces for their own domains without understanding the underlying conventions in detail. The builder encodes the conventions into its process. A marketing team can build a workspace for campaign production. A research group can build one for literature review and synthesis. A consultancy can build one for client deliverable pipelines. Each workspace is a folder they own and control. ICM workspaces have been adopted by groups outside the authors' organization. Researchers at the University of Edinburgh's Neuropolitics Lab have built workspaces for their domain, and teams at ICR Research and the Academy of International Affairs in Bonn are developing workspaces for their own workflows. The details of these implementations are limited by nondisclosure agreements, but their existence is noted here because the reviewer's natural question, does ICM work when someone other than its designer builds and operates the workspace, has at least a preliminary answer: yes, across academic research, policy analysis, and content production. A structured study of these external deployments is a clear next step.

Fig. 5. Observed frequency of human edits at each stage boundary (approximately 92% at stage 1, research; 30% at stage 2, script; 78% at stage 3, production), reported by 33 practitioners using multi-stage ICM workspaces. Intervention follows a U-shaped pattern: heavy at stage 1 (direction-setting), light at middle stages (constrained execution), heavy again at the final stage (aligning output with earlier decisions). Stage 1 editing is creative judgment. Final-stage editing is closer to debugging. Values are approximate and based on practitioner self-report through conversation, not instrumented measurement.
4.5 Early Practitioner Experience

ICM has been used in production across content creation, training material development, research analysis, and policy workflows. The observations reported here are drawn from an invite-only practitioner community of 52 members whose backgrounds range from AI engineers and software developers to business owners, content creators, and academic researchers. These observations come from ongoing conversations with community members rather than from formal data collection protocols. They should be read as practitioner reports rather than controlled findings, but they reflect a broader base of experience than the authors' own use alone.

The most consistent observation is where people choose to intervene (Figure 5). Across 33 community members who have used the script-to-animation workspace or structurally similar multi-stage workspaces, 30 report an intervention pattern consistent with a U-shape: heavy editing at stage 1 (direction-setting), light editing at the middle stages, and heavy editing again at the final stage (aligning output with earlier decisions). The remaining three report roughly equal editing across all stages. These numbers come from practitioner conversations, not from instrumented measurement, and should be interpreted accordingly.

The two peaks reflect different kinds of editing. Stage 1 editing is directional: the user is narrowing from broad possibilities to a specific angle, deciding what the piece is about. This is creative judgment. Final-stage editing is alignment work: the user is checking that the output faithfully represents decisions made in earlier stages. This is closer to debugging. The practitioner traces a misalignment in the output back through the pipeline to find where it diverged from the source material. Section 6 explores what tooling for this kind of traceability might look like. The middle stages get the lightest touch because they sit between well-defined anchors.
The earlier stage output sets the direction. The reference material (Layer 3 voice guides, structural templates) constrains the execution. With both anchors in place, the middle stages have less room to go wrong, and practitioners tend to trust them. This aligns with Parasuraman, Sheridan, and Wickens's observation that appropriate automation levels vary by task function [42].

A second pattern involves prompt editing. Non-technical users, people without development experience, have successfully modified stage behavior by editing the markdown CONTEXT.md files. Changes include adjusting tone instructions, adding constraints ("keep scripts under 90 seconds"), and reordering the emphasis within a stage's process description. These edits would be equivalent to modifying agent configuration in a framework-based system, a task that typically requires a developer. The plain-text interface lowers this barrier in practice.

A third pattern is worth noting for its implications about accessibility. Three community members with no prior coding experience and no previous exposure to Claude Code used the ICM workspace-builder's questionnaire and setup process to create and run workspaces that produced ten-minute animated videos from scripts. They edited CONTEXT.md files, reviewed stage outputs, and iterated on their workspaces without developer assistance. This is a single data point from a small group, but it suggests that the filesystem interface can make AI agent orchestration accessible to people who would not be able to use a framework-based system at all.

A fourth pattern is workspace duplication. Users who have a working workspace for one content format (say, short explainer videos) duplicate the folder, modify the stage prompts to target a different format (say, long-form essays), and run the new workspace without rebuilding from scratch.
The workspace-builder supports creating workspaces from nothing, but in practice people often prefer to copy and adapt an existing one. This mirrors how Unix users build new shell scripts by modifying existing ones rather than starting from a blank file.

These observations are drawn from a community of varied backgrounds across a growing but still limited set of workflow types. A structured evaluation with formal data collection, systematic interviews, and controlled comparisons would be needed to draw firm conclusions about the generality of these patterns. The observations are reported here because they informed the protocol's design evolution and because they suggest directions for future study.

4.6 Threats to Validity

Several limitations constrain the conclusions that can be drawn from the current work, and naming them is important for interpreting the results above. The practitioner community provides a broader evidence base than single-author observation, but data collection has been informal: observations come from ongoing conversations rather than structured interviews, diary studies, or instrumented usage logging. The community is invite-only and self-selected, introducing both selection bias and potential enthusiasm bias. The reported intervention patterns (30 of 33 practitioners observing a U-shape) are self-reported through conversation and have not been verified through controlled measurement. While ICM has been adopted across content production, academic research, and policy analysis workflows, the majority of active use remains concentrated in content production. The academic and policy deployments are early-stage, and their outcomes cannot yet be reported in detail. All testing was conducted using a single model family (Claude Opus 4.6 and Sonnet 4.6).
Cross-model evaluation is a natural next step but falls outside the scope of this paper, which focuses on the architectural pattern and its interaction properties rather than model-specific performance. Output quality may vary with other models, particularly those with different context-handling characteristics. No controlled comparison has been conducted between ICM's staged context loading and a monolithic prompting approach on the same tasks, so the claim that scoped context improves output quality rests on the theoretical support from the "lost in the middle" literature [25] and practitioner judgment rather than measured effect sizes. A formal user study with systematic data collection, structured interviews, and controlled comparisons across varied workflow types and participant backgrounds would substantially strengthen the empirical foundations of this work.

5 Discussion

5.1 Where This Works

ICM handles sequential multi-step workflows where a human reviews output at each stage. In practice, the protocol has been applied to content production pipelines (script-to-animation, short-form video), training material development (slide deck generation from source material), academic research workflows (at the University of Edinburgh and ICR Research), and policy analysis (at the Academy of International Affairs, Bonn). The common thread across these deployments is that the workflows are sequential (step 2 follows step 1), reviewable (a human should check each step's output), and repeatable (the same pipeline runs weekly or daily with different input). For this class of workflow, ICM provides full orchestration capability with no framework code, no server infrastructure, and no developer dependency for day-to-day operation.
5.2 Where This Does Not Work

ICM is not a replacement for multi-agent frameworks in every context. Real-time multi-agent collaboration, where agents need to communicate dynamically and respond to each other's outputs in tight loops, requires the kind of message-passing infrastructure that AutoGen [20] and similar frameworks provide. ICM's sequential, file-based handoffs are too slow for this. High-concurrency systems where many users hit the same pipeline simultaneously need proper queueing, state isolation, and deployment infrastructure. ICM is local-first by design. Scaling it to concurrent users would require building the infrastructure ICM was designed to avoid. Workflows that require complex branching logic based on AI decisions mid-pipeline are awkward in ICM. A human can make branching decisions between stages (run stage 3a instead of 3b based on what they see in the stage 2 output), but automated branching would require scripting that moves ICM toward being a framework itself. These boundaries matter. The claim is not that ICM replaces existing tools across the board. The claim is that for a large and common class of workflows, the existing tools provide more complexity than the problem requires, and that complexity has real costs: opacity, fragility, developer dependency, and overhead that slows iteration.

5.3 Observability as a Side Effect

The most useful property of ICM may be one that was not designed as a feature. Because every intermediate output is a plain file, the system is observable by default. There is no logging layer to build, no dashboard to configure, no special tooling to inspect pipeline state. You open a folder and read the files. Rudin argued that inherently interpretable systems should be preferred over post-hoc explanations of opaque ones [45]. ICM is a glass-box AI workflow. It did not become transparent through the addition of an explanation layer.
It was never opaque in the first place, because every artifact is a plain-text file that a human can read. Amershi et al.'s guidelines for human-AI interaction include "make clear what the system can do," "support efficient correction," and "support efficient dismissal" [47]. Stage contracts make capabilities explicit. Markdown files support efficient correction (open, edit, save). Review gates at every stage boundary support dismissal (decide not to proceed, re-run the previous stage with different input, or abandon the run entirely). The regulatory landscape may also be relevant here. The EU AI Act's human oversight requirements [49, 50] emphasize staged review, audit trails, and defined intervention points. ICM produces these as a byproduct of its architecture: there is no way to run an ICM pipeline without generating inspectable intermediate artifacts, because the intermediate artifacts are how the stages communicate. Whether this constitutes compliance with specific regulatory requirements is a legal question this paper does not attempt to answer, but the structural alignment is worth noting.

5.4 Implications for Intelligent System Design

The discussion so far has focused on how ICM structures the human side of human-AI interaction: edit surfaces, review gates, observability. But the architecture also has implications for how the intelligent system itself performs, and these are worth examining. The core mechanism is context scoping. By delivering different context to the same model at each stage, ICM changes the task the model is performing. A model that receives research instructions, source material, and a topic brief behaves differently from the same model receiving a script template, a voice guide, and a research summary. The model's capabilities do not change between stages. What changes is the information it has available when generating output.
This is context engineering in practice: the performance of the system depends on what context is delivered, in what structure, and at what moment. The Layer 3/Layer 4 distinction adds a further dimension. Reference material (Layer 3) and working artifacts (Layer 4) ask different things of the model. Reference material says: here are the rules, follow them. Working artifacts say: here is the input, transform it. Delivering these as structurally separate context, rather than mixing them in a single undifferentiated prompt, gives the model clearer signals about which information constrains its behavior and which information it should act on. Whether this structural separation measurably improves output quality compared to a flat context of equivalent content is an open empirical question, but early practitioner experience suggests that stages where reference and working material are clearly separated produce more consistent adherence to style and format guidelines.

This raises a question about the relationship between context structure and output quality. In early use, a pattern emerged: stages with tightly scoped context (clear instructions, limited reference material, a specific output format) produced more consistent results than stages with broad context (open-ended instructions, large volumes of reference material, loosely defined output expectations). This is consistent with the "lost in the middle" findings [25] and with the chain-of-thought literature showing that decomposed tasks outperform monolithic ones [27], but it suggests something more specific. The structure of the context delivery, how information is organized and bounded, may matter as much as the content of the context itself. ICM's folder-based scoping enforces this structure by default: each stage folder contains only what that stage needs, and the boundaries are visible and editable. There are open questions here that the current work does not answer.
First, does the ve-layer hierarchy (workspace identity , task routing, stage contracts, reference material, working artifacts) generalize across model families, or is it tuned to the spe cic attention patterns of the models tested? The protocol is designe d to be model-agnostic (Section 4.1), but all current testing has been conducted on a single model family . Cross-model evaluation, running the same workspace on Claude , GPT , Gemini, and open-weight models such as Llama, is a clear next step. This paper scop es that question as future work be cause the present contribution is the architectural pattern and its interaction properties, not a model-spe cic performance claim. Se cond, as context windows grow larger , does selective loading become less important? If a model can reliably attend to 200,000 tokens without degradation, the engineering argument for ICM’s scoping weakens, though the human-interaction arguments (observability , editability , review gates) remain. Third, how sensitiv e is stage output quality to the ordering and formatting of context within a layer? The curr ent protocol species what les a stage should load but does not prescribe the order in which they appear in the context window . Whether ordering matters at the scale of ICM’s typical context sizes (2,000 to 8,000 tokens per stage) is an empirical question worth investigating. These questions point toward a research program that sits at the intersection of context engine ering and interaction design: understanding how the structure of information delivery to language models aects both the 16 • Jake V an Clief, David McDermo model’s output quality and the human’s ability to steer , inspe ct, and correct that output. ICM provides a concrete platform for investigating these questions because its architecture makes the context structure explicit, editable, and observable at every stage. 
6 Future Directions: Compilation, Debugging, and Source Integrity

The previous sections describe ICM as it currently works in production. This section describes where it should go next. The ideas here are informed by early practitioner experience and by a structural analogy that the paper has not yet drawn: the relationship between ICM workspaces and multi-pass compilers.

6.1 ICM as Multi-Pass Incremental Compilation

The paper has grounded ICM in Unix pipelines, Make, and the pipe-and-filter pattern. There is a closer analogy that deserves attention: multi-pass compilation [52]. A multi-pass compiler transforms source code through a sequence of discrete passes. The lexer produces tokens. The parser produces a syntax tree. Semantic analysis annotates the tree. Optimization passes rewrite it. Code generation produces the final output. Each pass reads the output of the previous pass, transforms it according to its own rules, and writes an intermediate representation that the next pass can consume. The intermediate representations are well-defined, inspectable, and (in debugging builds) preserved for examination. ICM does the same thing with content. Stage 1 (research) transforms a topic brief into structured research output. Stage 2 (script) transforms the research output into a script. Stage 3 (production) transforms the script into animation specifications and code. Each stage reads the previous stage's output, applies its own context and instructions, and writes an intermediate artifact that the next stage consumes. The intermediate artifacts are plain files that can be opened, read, and edited. The analogy extends further. Incremental compilation means recompiling only the parts of the program that changed, rather than rebuilding from scratch. ICM supports this by default: if the research output is fine but the script needs rework, the practitioner re-runs stage 2 without touching stage 1.
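The two scheduling rules the analogy implies, run the first stage that has produced no output, and re-run any stage whose declared inputs are newer than its output, can be sketched in a few lines. This is a hedged illustration, not part of the protocol: the function names and the `output` subfolder convention are assumptions introduced here, with only the numbered-folder naming and the Inputs-table idea taken from the text.

```python
import os
import tempfile

def next_stage(workspace):
    """Return the first numbered stage folder whose output folder is
    still empty -- the next 'pass' to run. None means the pipeline is done."""
    stages = sorted(d for d in os.listdir(workspace)
                    if len(d) > 2 and d[:2].isdigit() and d[2] == "_")
    for stage in stages:
        out = os.path.join(workspace, stage, "output")
        if not os.path.isdir(out) or not os.listdir(out):
            return stage
    return None

def is_stale(stage_output, declared_inputs):
    """Incremental re-run check: a stage's output is stale if any file
    declared in its Inputs table was modified after the output was written."""
    if not os.path.exists(stage_output):
        return True
    out_mtime = os.path.getmtime(stage_output)
    return any(os.path.getmtime(src) > out_mtime for src in declared_inputs)

# Demo: stage 1 has produced output, stage 2 has not, so stage 2 runs next.
ws = tempfile.mkdtemp()
for s in ["01_research", "02_script"]:
    os.makedirs(os.path.join(ws, s, "output"))
with open(os.path.join(ws, "01_research", "output", "research.md"), "w") as f:
    f.write("key points\n")
print(next_stage(ws))  # 02_script
```

Make performs exactly this mtime comparison over declared prerequisites, which is why the paper's earlier grounding in Make carries over directly to stage-level recompilation.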
If a voice guide in the reference material changes, only the stages that load that file need to run again. The folder structure tracks these dependencies implicitly: a stage's Inputs table declares which files it reads, and a change to any of those files signals that the stage's output may be stale. This is worth naming because it connects ICM to a body of compiler engineering that has spent fifty years solving the problems of pass decomposition, intermediate representation design, and selective recompilation. The current paper draws from Unix and software architecture. Future work should draw from compiler theory as well, particularly around dependency tracking, change propagation, and the formal properties of intermediate representations.

6.2 Toward Semantic Debugging

Traditional debugging rests on a simple principle: when the output is wrong, trace the failure back through the program's execution to find the instruction that caused it [53]. Debuggers provide tools for this: breakpoints that pause execution at specific instructions, stack traces that show the call chain, variable inspection that shows state at each point, and source maps that connect compiled output back to the original source. ICM currently provides observability but not traceability. A practitioner can open any stage's output folder and read what the agent produced. But if a phrase in the stage 3 output sounds wrong, there is no direct way to trace that phrase back to the specific instruction, reference file, or previous stage output that caused it. The practitioner has to do this manually: read the stage 3 contract, check which files it loaded, read those files, and form a judgment about which source is responsible. This works. It is how most ICM debugging happens today. But it is the equivalent of debugging a program by reading the source code and thinking hard, without a debugger.
The question is what a debugger for semantic content would look like. Several directions are worth exploring.

Output provenance through identifiers. If each section of a stage's output carried an identifier linking it to the source instruction or reference file that produced it, a practitioner could trace backward from any part of the output to the context that generated it. In compiler terms, this is the equivalent of debug symbols or source maps: metadata that connects output back to source without changing the output itself. In practice, this could mean embedding lightweight markers (GUIDs, section tags, or comment annotations) in stage output files that reference specific sections of the stage's CONTEXT.md or Layer 3 reference files.

Cross-stage trace verification. In the script-to-animation workspace described in Section 4, one recurring problem has been misalignment between the animation specification (stage 3 output) and the script (stage 2 output). Timing drifts. Animations reference phrases that were revised. Visual density does not match pacing. This is the source of the U-shaped intervention pattern observed in Section 4.5: the stage 3 editing that brings intervention back up is almost entirely alignment work, tracing the final output back through the pipeline to find where it diverged from earlier decisions. The current solution is an audit file that forces the agent to trace back from the specification to the original script, re-verifying timing for each phrase and flagging inconsistencies. This works well enough that it catches most alignment errors, and the errors it catches are remarkably consistent in kind: frame count discrepancies, visual density mismatches, and pacing breaks at scene boundaries. This audit file is a proto-debugger.
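One narrow slice of the audit file's job, catching spec lines that quote script phrases which were later revised, can be sketched mechanically. The quoting convention and the `orphaned_phrases` function are illustrative assumptions; the production audit file also re-verifies timing, which is not modeled here.

```python
import re

def orphaned_phrases(spec_text, script_text):
    """Cross-stage consistency check: every phrase the stage 3 animation
    spec quotes should still exist in the stage 2 script it was derived
    from. Returns the quoted phrases with no match -- likely revisions
    that never propagated forward."""
    quoted = re.findall(r'"([^"]+)"', spec_text)
    return [p for p in quoted if p not in script_text]

script = 'NARRATOR: "Folders are the framework." A beat, then the title card.'
spec = ('Scene 1: animate the phrase "Folders are the framework".\n'
        'Scene 2: animate "Files replace message passing".')
print(orphaned_phrases(spec, script))  # ['Files replace message passing']
```

Because both artifacts are plain files, a check like this needs no framework hooks: it reads the same intermediate representations the human reviews.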
The audit file implements a specific kind of cross-stage verification: checking that the output of stage n is consistent with the output of stage n−2 by re-reading both and comparing them against defined criteria. The pattern could be generalized. A stage contract could include a Verify section alongside its Inputs, Process, and Outputs sections, specifying which earlier stage outputs should be checked for consistency and what criteria to check against. The agent would run these verification checks as part of the stage's execution and flag discrepancies before the human reviews.

Breakpoints in markdown. The most speculative direction involves something like breakpoints for markdown files. In a traditional debugger, a breakpoint says "pause here and let me inspect the state." In ICM, a breakpoint in a CONTEXT.md file might say "after the agent processes this instruction, show me what it produced before continuing." This would be particularly useful in stages with complex instructions where the practitioner wants to verify that the agent interpreted a specific constraint correctly before it finishes the rest of the stage's work. It turns a single-pass stage into a sequence of verifiable sub-steps. These ideas are not yet implemented. They are described here because the gap they address, the ability to trace output back to source, is a gap that compiler engineering solved decades ago and that ICM will need to solve as workspaces grow more complex.

6.3 Source Integrity and the Edit-Source Principle

The current paper describes ICM's review gates as places where practitioners edit stage output. This is useful and it works. But there is an argument, drawn directly from software engineering practice, that the source files should be what improves over time, and that editing output is treating symptoms rather than causes. The argument is straightforward. If a script sounds wrong at stage 2, there are two possible responses.
The rst is to edit the script directly: x the tone, adjust the phrasing, mov e on. The second is to ask why the script sounds wrong and trace the problem back to the source that produced it. Mayb e the voice guide in the reference material is underspe cied. Mayb e the stage contract’s instructions emphasize the wrong quality . Mayb e the research output from stage 1 framed the topic in a way that led the script in the wrong direction. Editing the output xes this run. Editing the source xes every future run. In compiler terms, editing the output is patching the binary . It works, but it does not improve the compiler . A developer who nds a bug in compiled code traces it back to the source and xes it there, so that every subsequent build is correct. 18 • Jake V an Clief, David McDermo For ICM, the tension is real. Creative content is fuzzier than compiled code. Sometimes the output ne eds a human touch that cannot be reduced to a source-level rule. A script might benet from a turn of phrase that no amount of voice guide renement w ould have produced. Editing the output in that case is the right mov e. The practitioner is adding value that the system cannot generate on its own. But there is a class of output edits that are diagnostic. If the practitioner consistently tightens the opening paragraph, that is a signal that the stage contract should say “keep the opening under three sentences. ” If the tone drifts formal every time, that is a signal that the voice guide needs a stronger example of the target register . These recurring edits are debugging information. They point to xable source-level pr oblems. A future version of ICM could support this by tracking output edits across runs. If a practitioner edits the same kind of thing in the same stage’s output three runs in a row , the system could surface that pattern and suggest a source-level change: a contract amendment, a reference le update, a new constraint. 
This would close the loop between output editing and source improvement, turning one-off fixes into durable system improvements. The principle matters because it addresses a question about ICM's long-term trajectory. If workspaces are only as good as the last human edit of their output, they remain tools. If workspaces improve their own source files over time, incorporating the patterns they learn from human corrections, they become systems that get better with use. The debugging and traceability infrastructure described in the previous subsection is a prerequisite for this: you cannot improve the source if you cannot trace the problem back to it.

7 Conclusion

The principles that made Unix pipelines effective in the 1970s apply to AI agent orchestration in the 2020s. Programs that do one thing. Output of one becomes input of another. Plain text as universal interface. Human-readable intermediate state. ICM applies these principles to a specific problem: structuring context for AI agents across multi-step workflows. The result is a system where the folder structure replaces the framework. One agent reads different context at each stage rather than multiple agents coordinating through code. Local scripts handle the mechanical work that does not need AI. Every intermediate output is a file a human can read and edit. For practitioners whose AI workflows are sequential, reviewable, and repeatable, this means full pipeline capability with no framework to learn, no server to maintain, and no developer needed for day-to-day operation. The workspace is a folder. It can be copied, versioned, shared, and edited with a text editor. The simplest viable architecture for this class of problem is one that already exists on every computer: the filesystem. The protocol is open source under the MIT license and includes a workspace-builder for creating new workspaces across any domain.

References

[1] M. D. McIlroy, E. N. Pinson, and B. A.
T ague, “Unix Time-Sharing System: Fore word, ” The Bell System T echnical Journal , vol. 57, no. 6, part 2, pp. 1902–1903, 1978. [2] D. M. Ritchie and K. Thompson, “The UNIX Time-Sharing System, ” Communications of the ACM , vol. 17, no . 7, pp. 365–375, 1974. DOI: https://doi.org/10.1145/361011.361061 [3] P. H. Salus, A Quarter Centur y of Unix . Addison- W esley , 1994. ISBN: 0-201-54777-5. [4] E. S. Raymond, The Art of Unix Programming . Addison- W esley Professional, 2003. ISBN: 0-13-142901-9. A vailable: http://ww w .catb.org/esr/writings/taoup/html/ [5] B. W . Kernighan and R. Pike, The UNIX Programming Environment . Prentice Hall, 1984. ISBN: 0-13-937681-X. [6] M. Shaw and D. Garlan, Software A rchitecture: Perspe ctives on an Emerging Discipline . Prentice Hall, 1996. ISBN: 0-13-182957-2. [7] S. I. Feldman, “Make — A Program for Maintaining Computer Programs, ” Software: Practice and Experience , vol. 9, no. 4, pp. 255–265, 1979. DOI: https://doi.org/10.1002/spe.4380090402 [8] E. W . Dijkstra, “On the Role of Scientic Thought, ” Manuscript EWD447, 1974. Reprinte d in Selected Writings on Computing: A Personal Perspective , pp. 60–66. Springer- V erlag, 1982. Interpretable Context Methodology: Folder Structure as Agent Architecture • 19 A vailable: https://ww w .cs.utexas.edu/~EWD/transcriptions/EWD04x x/EWD447.html [9] D. L. Parnas, “On the Criteria T o Be Used in Decomposing Systems into Modules, ” Communications of the ACM , vol. 15, no . 12, pp. 1053– 1058, 1972. DOI: https://doi.org/10.1145/361598.361623 [10] D. E. Knuth, “Literate Programming, ” The Computer Journal , vol. 27, no. 2, pp. 97–111, 1984. DOI: https://doi.org/10.1093/comjnl/27.2.97 [11] R. P. Gabriel, “The Rise of ‘W orse is Better’ , ” Originally part of “Lisp: Good News, Bad News, Ho w to Win Big. ” AI Expert , vol. 6, no. 6, pp. 33–35, 1991. A vailable: https://ww w .dreamsongs.com/W orseIsBetter .html [12] R. Pike, D. Presotto, S. Dorward, B. F landrena, K. Thompson, H. 
Trickey , and P. Winterbottom, “P lan 9 from Bell Labs, ” Computing Systems , vol. 8, no. 3, pp. 221–254, 1995. A vailable: https://css.csail.mit.edu/6.824/2014/papers/plan9.p df [13] S. Chacon and B. Straub, Pro Git , 2nd ed. Apress, 2014. ISBN: 978-1-4842-0076-6. A vailable: https://git- scm.com/book [14] K. Morris, Infrastructure as Code: Dynamic Systems for the Cloud Age , 2nd ed. O’Reilly Me dia, 2021. ISBN: 978-1-098-11467-1. [15] J. Humble and D . Farle y , Continuous Delivery: Reliable Software Releases through Build, T est, and Deployment Automation . Addison- W esley Professional, 2010. ISBN: 978-0-321-60191-9. [16] A. Karpathy , “+1 for ‘context engineering’ over ‘prompt engineering’. . . , ” X (formerly Twitter ), June 25, 2025. A vailable: https://x.com/karpathy/status/1937902205765607626 [17] L. Martin, “Context Engineering, ” LangChain Blog, July 2, 2025. A vailable: https://blog.langchain.com/context- engineering- for- agents/ [18] S. Willison, “Context Engineering, ” Simon Willison’s W eblog , June 27, 2025. A vailable: https://simonwillison.net/2025/jun/27/context- engineering/ [19] DAIR.AI, “Context Engineering Guide, ” Prompting Guide , 2025. A vailable: https://ww w .promptingguide.ai/guides/context- engineering- guide [20] Q. W u, G. Bansal, J. Zhang, Y. W u, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, A. A wadallah, R. W . White, D. Burger , and C. Wang, “ AutoGen: Enabling Next-Gen LLM Applications via Multi- Agent Conv ersation, ” COLM 2024 , arXiv:2308.08155, August 2023. A vailable: https://arxiv .org/abs/2308.08155 [21] H. Chase, LangChain [open-source framework]. First released October 2022. A vailable: https://github.com/langchain- ai/langchain [22] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative Agents: Interactive Simulacra of Human Behavior , ” Proceedings of UIST ’23 . ACM, 2023. 
DOI: https://doi.org/10.1145/3586183.3606763 [23] Anthropic, “Introducing the Model Context Protocol, ” Anthropic Blog, November 25, 2024. A vailable: https://ww w .anthropic.com/news/model- context- protocol [24] A. Jones and C. Kelly , “Code Exe cution with MCP, ” Anthropic Engineering Blog, 2025. A vailable: https://ww w .anthropic.com/engineering/co de- execution- with- mcp [25] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Be vilacqua, F. Petroni, and P . Liang, “Lost in the Middle: How Language Models Use Long Contexts, ” T ransactions of the Association for Computational Linguistics , vol. 12, pp. 157–173, 2024. A vailable: https://arxiv .org/abs/2307.03172 [26] T . Wu, M. T err y , and C. J. Cai, “ AI Chains: T ransparent and Controllable Human- AI Interaction by Chaining Large Language Model Prompts, ” CHI Conference on Human Factors in Computing Systems (CHI ’22) . ACM, 2022. DOI: https://doi.org/10.1145/3491102.3517582 [27] J. W ei, X. W ang, D. Schuurmans, M. Bosma, B. Ichter , F. Xia, E. Chi, Q. V . Le, and D. Zhou, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, ” NeurIPS 2022 . A vailable: https://arxiv .org/abs/2201.11903 [28] T . Schick, J. D wivedi- Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer , N. Cancedda, and T . Scialom, “T oolformer: Language Models Can T each Themselves to Use T o ols, ” NeurIPS 2023 . A vailable: https://arxiv .org/abs/2302.04761 [29] S. G. Patil, T . Zhang, X. W ang, and J. E. Gonzalez, “Gorilla: Large Language Model Connected with Massive APIs, ” NeurIPS 2024 , arXiv:2305.15334, 2023. A vailable: https://arxiv .org/abs/2305.15334 [30] P. Lewis, E. Per ez, A. Piktus, F. Petr oni, V . Karpukhin, N. Goyal, H. Küttler , M. Lewis, W .-t. Yih, T . Ro cktäschel, S. Riedel, and D. Kiela, “Retrieval- A ugmented Generation for Knowledge-Intensive NLP T asks, ” NeurIPS 2020 , pp. 9459–9474. A vailable: https://arxiv .org/abs/2005.11401 20 • Jake V an Clief, David McDermo [31] H. Jiang, Q . 
Wu, C.- Y. Lin, Y. Y ang, and L. Qiu, “LLMLingua: Compressing Pr ompts for Accelerated Inference of Large Language Models, ” EMNLP 2023 , pp. 13358–13376. DOI: https://doi.org/10.18653/v1/2023.emnlp- main.825 [32] H. Jiang, Q. W u, X. Luo, D. Li, C.- Y. Lin, Y . Y ang, and L. Qiu, “LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression, ” ACL 2024 , pp. 1658–1677. A vailable: https://arxiv .org/abs/2310.06839 [33] Addyo, “Context Engineering: Bringing Engineering Discipline to Prompts, ” Substack, 2025. A vailable: https://addyo.substack.com/p/context- engineering- bringing- engineering [34] S. Amershi, M. Cakmak, W . B. Knox, and T . Kulesza, “Power to the People: The Role of Humans in Interactive Machine Learning, ” AI Magazine , vol. 35, no. 4, pp. 105–120, 2014. DOI: https://doi.org/10.1609/aimag.v35i4.2513 [35] J. A. Fails and D. R. Olsen, Jr ., “Interactive Machine Learning, ” Proceedings of IUI ’03 , pp. 39–45. ACM, 2003. DOI: https://doi.org/10.1145/604045.604056 [36] J. J. Dudley and P . O . Kristensson, “ A Review of User Interface Design for Interactive Machine Learning, ” A CM Transactions on Interactive Intelligent Systems , vol. 8, no. 2, Article 8, pp. 1–37, 2018. DOI: https://doi.org/10.1145/3185517 [37] E. Hor vitz, “Principles of Mixed-Initiative User Interfaces, ” CHI ’99 , pp. 159–166. ACM, 1999. DOI: https://doi.org/10.1145/302979.303030 [38] M. T . Rib eiro, S. Singh, and C. Guestrin, “‘Why Should I Trust Y ou?’: Explaining the Predictions of Any Classier , ” KDD ’16 , pp. 1135–1144. A CM, 2016. DOI: https://doi.org/10.1145/2939672.2939778 [39] S. M. Lundberg and S.-I. Le e, “ A Unied Approach to Interpreting Model Predictions, ” NeurIPS 2017 , pp. 4765–4774. A vailable: https://papers.nips.cc/paper/7062- a- unied- approach- to- interpreting- model- predictions [40] J. D. Lee and K. A. See, “Trust in A utomation: Designing for Appropriate Reliance, ” Human Factors , vol. 46, no. 1, pp. 50–80, 2004. 
DOI: https://doi.org/10.1518/hfes.46.1.50_30392 [41] R. Parasuraman and V . Riley , “Humans and A utomation: Use, Misuse, Disuse, Abuse , ” Human Factors , vol. 39, no. 2, pp. 230–253, 1997. DOI: https://doi.org/10.1518/001872097778543886 [42] R. Parasuraman, T . B. Sheridan, and C. D. Wickens, “ A Mo del for T ypes and Levels of Human Interaction with Automation, ” IEEE Transactions on Systems, Man, and Cybernetics — Part A , vol. 30, no. 3, pp. 286–297, 2000. DOI: https://doi.org/10.1109/3468.844354 [43] B. Shneiderman, “Human-Centered Articial Intelligence: Reliable, Safe & Trustw orthy , ” International Journal of Human–Computer Interaction , vol. 36, no. 6, pp. 495–504, 2020. DOI: https://doi.org/10.1080/10447318.2020.1741118 [44] B. Shneiderman, Human-Centered AI . Oxford University Press, 2022. ISBN: 978-0192845290. [45] C. Rudin, “Stop Explaining Black Box Machine Learning Mo dels for High Stakes Decisions and Use Interpretable Models Instead, ” Nature Machine Intelligence , vol. 1, pp. 206–215, 2019. DOI: https://doi.org/10.1038/s42256- 019- 0048- x [46] B. Shneiderman, “Direct Manipulation: A Step Beyond Programming Languages, ” IEEE Computer , vol. 16, no. 8, pp. 57–69, 1983. DOI: https://doi.org/10.1109/MC.1983.1654471 [47] S. Amershi, D. W eld, M. V orvoreanu, A. Fourney , B. Nushi, P. Collisson, J. Suh, S. Iqbal, P. N. Bennett, K. Inkpen, J. T eevan, R. Kikin-Gil, and E. Horvitz, “Guidelines for Human-AI Interaction, ” CHI 2019 , Article 3, pp. 1–13. ACM, 2019. DOI: https://doi.org/10.1145/3290605.3300233 [48] M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. A. Hong, A. K onwinski, S. Murching, T . Nykodym, P. Ogilvie, M. Parkhe, F . Xie, and C. Zumar , “A ccelerating the Machine Learning Lifecycle with MLow , ” IEEE Data Engine ering Bulletin , vol. 41, no. 4, pp. 39–45, 2018. A vailable: https://people.ee cs.berkeley .edu/~matei/pap ers/2018/ieee_mlow .p df [49] L. 
Enqvist, “‘Human Oversight’ in the EU Articial Intelligence Act, ” The The ory and Practice of Legislation , vol. 11, no. 3, 2023. DOI: https://doi.org/10.1080/17579961.2023.2245683 [50] C. Novelli, F . Casolari, A. Rotolo, M. T addeo, and L. Floridi, “Institutionalised Distrust and Human O versight of Articial Intelligence, ” Digital Society , vol. 3, no. 8, 2024. A vailable: https://pmc.ncbi.nlm.nih.gov/articles/PMC11614927/ [51] M. Fink, “Human O versight under Article 14 of the EU AI Act, ” SSRN: 5147196, 2025. Forthcoming in Malgieri et al. (eds.), AI Act Commentary . Hart-Bloomsbur y , 2026. DOI: https://doi.org/10.2139/ssrn.5147196 [52] A. V . Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, T echniques, and T o ols , 2nd ed. Addison- W esley , 2006. ISBN: 978-0-321-48681-3. Interpretable Context Methodology: Folder Structure as Agent Architecture • 21 [53] A. Zeller , Why Programs Fail: A Guide to Systematic Debugging , 2nd ed. Morgan Kaufmann, 2009. ISBN: 978-0-12-374515-6. [54] Anthropic, “Introducing Claude Opus 4.6, ” https://www.anthr opic.com/news/claude- opus- 4- 6 , February 2026.
