ViviDoc: Generating Interactive Documents through Human-Agent Collaboration



Yinghao Tang (State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China; tangyinghao@zju.edu.cn), Yupeng Xie (HKUST(GZ), Guangzhou, China; yxie686@connect.hkust-gz.edu.cn), Yingchaojie Feng (National University of Singapore, Singapore; yingchaojie@nus.edu.sg), Tingfeng Lan (University of Virginia, Charlottesville, VA, USA; tingfeng@virginia.edu), Jiale Lao (Cornell University, Ithaca, NY, USA; jiale@cs.cornell.edu), Yue Cheng (University of Virginia, Charlottesville, VA, USA; mrz7dp@virginia.edu), Wei Chen (State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China; chenwei@zju.edu.cn)

Abstract

Interactive documents help readers engage with complex ideas through dynamic visualization, interactive animations, and exploratory interfaces. However, creating such documents remains costly, as it requires both domain expertise and web development skills. Recent Large Language Model (LLM)-based agents can automate content creation, but directly applying them to interactive document generation often produces outputs that are difficult to control. To address this, we present ViviDoc, to the best of our knowledge the first work to systematically address interactive document generation. ViviDoc introduces a multi-agent pipeline (Planner, Styler, Executor, Evaluator). To make the generation process controllable, we provide three levels of human control: (1) the Document Specification (DocSpec) with SRTC Interaction Specifications (State, Render, Transition, Constraint) for structured planning, (2) a content-aware Style Palette for customizing writing and interaction styles, and (3) chat-based editing for iterative refinement. We also construct ViviBench, a benchmark of 101 topics derived from real-world interactive documents across 11 domains, along with a taxonomy of 8 interaction types and a 4-dimensional automated evaluation framework validated against human ratings (Pearson r > 0.84). Experiments show that ViviDoc achieves the highest content richness and interaction quality in both automated and human evaluation. A 12-person user study confirms that the system is easy to use, provides effective control over the generation process, and produces documents that satisfy users. Our project homepage is available at https://vividoc-homepage.vercel.app/.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
XXX, XXX, XXX
© 2026 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/2026/03
https://doi.org/XXXXXXX.XXXXXXX

CCS Concepts

• Human-centered computing → Human computer interaction (HCI); • Computing methodologies → Artificial intelligence.

Keywords

interactive documents, multi-agent systems, human-agent collaboration, LLM-based generation

ACM Reference Format:
Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Jiale Lao, Yue Cheng, and Wei Chen. 2026. ViviDoc: Generating Interactive Documents through Human-Agent Collaboration. In Proceedings of ACM Conference (XXX). ACM, New York, NY, USA, 13 pages. https://doi.org/XXXXXXX.XXXXXXX

1 Introduction

Interactive documents are an emerging communication medium that leverages the dynamic capabilities of the web to help readers engage with complex ideas [Victor(2011)].
Through interactive elements such as sliders, dropdowns, and direct manipulation controls, readers can actively explore concepts, observe cause-and-effect relationships, and build intuition through experimentation. This form of communication has been adopted across diverse domains including education [Chi and Wylie(2014)], data-driven journalism [Branch(2012)], and scientific publishing [Hohman et al.(2020)]. Studies show that interactive documents improve reader engagement and learning compared to static alternatives [Hohman et al.(2020)].

Despite their promise, creating interactive documents remains prohibitively costly. Producing a single piece, such as those on Explorable Explanations [exp([n. d.])] or Distill.pub [Hohman et al.(2020)], requires expertise in both domain knowledge and web development, often demanding days or weeks of effort per article.

Figure 1: Eight interactive visualization examples generated by ViviDoc, covering all 8 interaction types in our taxonomy, with styles automatically adapted to each topic's content. (a) Parameter Exploration: sliders adjust flow rate and elevation parameters in real time. (b) Direct Manipulation: drag the object or focal points to update the lens equation live. (c) Inspection: hover to reveal a Voronoi cell and nearest-neighbor envelope. (d) Freeform Construction: click to place neurons and trigger animated signal propagation. (e) Scroll-driven Narrative: scroll to mix two particle gases and watch entropy rise. (f) Spatial Navigation: drag to rotate a 3D Möbius strip freely in space. (g) State Switching: switch quantum orbital states to redraw the electron probability cloud. (h) Temporal Control: play/pause and tune harmonics to build a Fourier series.
This bottleneck severely limits the production of high-quality interactive content.

In this paper, we aim to address this by leveraging the reasoning and code generation capabilities of LLM-based agents. To the best of our knowledge, this is the first work to systematically address interactive document generation. Although end-to-end LLM-based agents have achieved promising results in content creation tasks such as video generation [Chen et al.(2025)], slide design [Zheng et al.(2025), Liang et al.(2025)], and infographic production [Tang et al.(2026)], naively applying them to interactive document generation faces three key limitations. First, the generation process is largely uncontrollable: there is a fundamental gap between authoring intent and the executable code that realizes it, and the agent resolves this gap based on its own implicit preferences. Second, humans cannot effectively participate in the process: the agent operates as a black box, leaving educators no meaningful way to review or adjust intermediate decisions that shape the final output. Third, evaluation resources are scarce: no grounded dataset or systematic evaluation framework exists for this task, making it difficult to compare approaches.

To address these limitations, we present ViviDoc, a human-agent collaborative system for generating interactive documents. ViviDoc is built on a multi-agent pipeline consisting of a Planner, Styler, Executor, and Evaluator. To make the generation process controllable and transparent, we provide three levels of human control: (1) The Document Specification (DocSpec), a structured intermediate representation that organizes a document into knowledge units, each containing a text description for content generation and an Interaction Specification for visualization generation.
The Interaction Specification uses a four-component decomposition of State, Render, Transition, and Constraint (SRTC), inspired by interactive visualization theory [Munzner(2014)]. SRTC can express all common interaction types found in real-world educational documents. Educators can review, modify, and refine both the content plan and the interaction design before any code is produced. (2) A Style Palette, where the LLM analyzes the spec content and generates style options for users to customize the writing and interaction style of the document. (3) Chat-based editing, which allows users to refine both the spec and the generated document through natural language conversation.

To enable systematic evaluation, we construct ViviBench, a benchmark grounded in 101 real-world interactive documents collected from over 60 websites across 11 domains. We reverse-engineer teaching topics from these documents, ensuring that evaluation is grounded in topics for which high-quality interactive content already exists. From 482 interaction instances in these documents, we derive a taxonomy of 8 interaction types (e.g., Parameter Exploration, State Switching, Direct Manipulation). We also design a 4-dimensional automated evaluation framework combining rule-based checks (Interaction Functionality, Efficiency) with LLM-as-Judge assessment (Content Richness, Interaction Quality), and validate its reliability through human evaluation alignment (Pearson r > 0.84).

We evaluate ViviDoc through both automated evaluation and human evaluation with 12 raters. Results show that ViviDoc achieves the highest content richness and interaction quality among all methods.
A 12-person user study further confirms that the system is easy to use (5.0/5), that DocSpec, Style Palette, and chat editing provide effective control (all rated above 4.0/5), and that participants are satisfied with the generated documents (4.58/5).

We summarize our contributions as follows:
• We introduce the task of interactive document generation and identify three key challenges of naively applying LLM-based agents to this task: uncontrollability, absence of human participation, and lack of evaluation resources.
• We propose ViviDoc, a multi-agent system that provides three levels of human control: the Document Specification (DocSpec) with SRTC Interaction Specifications for structured planning, a content-aware Style Palette for style customization, and chat-based editing for iterative refinement, making the generation process controllable and transparent.
• We construct ViviBench, a benchmark grounded in 101 real-world interactive documents from over 60 websites across 11 domains. We derive a taxonomy of 8 interaction types from 482 interaction instances and design an automated evaluation framework validated against human ratings.
• We conduct comprehensive experiments including automated evaluation, human evaluation, and a user study, demonstrating that ViviDoc achieves the highest document quality and provides an effective and satisfying authoring experience.

2 Related Work

2.1 Interactive Document

Interactive documents, often conceptualized as explorable explanations [Victor(2011)], are a transformative medium that leverages dynamic web capabilities to foster deep engagement with complex ideas.
The theoretical foundation for such active reading environments traces back to early human-computer interaction paradigms, including Engelbart's framework for augmenting human intellect [Engelbart(2023)] and foundational hypertext structures [Nelson(1965)], which were operationalized in early systems like PLATO [Bitzer et al.(2007)]. Today, empirical studies confirm that interactive documents significantly improve reader engagement and comprehension compared to static alternatives [Chi and Wylie(2014), Hohman et al.(2020)]. The medium has been successfully adopted across diverse fields, from immersive data journalism (e.g., The New York Times' Snow Fall [Branch(2012)] and Bloomberg's climate visualizations [Roston and Migliozzi(2015)]) to scholarly communications exploring machine learning fairness [Wattenberg et al.(2016)].

Despite their efficacy, creating these artifacts remains prohibitively costly, demanding a rare intersection of deep domain knowledge and web development expertise. This bottleneck severely limits the availability of high-quality interactive content. To overcome these authoring barriers, we introduce ViviDoc, a human-agent collaborative system designed to systematically automate and control the generation of interactive documents.

2.2 LLM Agents for Content Creation

Recent advancements increasingly rely on LLMs and multi-agent systems to automate complex content creation by decomposing tasks across specialized agents [Lin et al.(2025), Zou et al.(2025)]. These frameworks have achieved success across various domains, including data visualization systems like HAIChart [Xie et al.(2024)], VisPilot [Wen et al.(2025)], and DataLab [Weng et al.(2025)]. Similarly, presentation agents like PPTAgent [Zheng et al.(2025)] and SlideGen [Liang et al.(2025)] use structural schemas for coherent slide decks, while systems like LAVES [Yan et al.(2026)] extend multi-agent orchestration to generate synchronized educational videos.

However, applying such agentic approaches to interactive documents remains largely unexplored. Naive implementations face critical limitations: uncontrollable generation that lacks intent alignment, a black-box agent process that precludes meaningful human intervention, and the absence of specialized datasets and systematic evaluation methods. Addressing these fundamental bottlenecks, ViviDoc establishes a controllable, human-in-the-loop generation paradigm by structuring the collaborative process around an interpretable intermediate representation.

3 System Overview

As illustrated in Figure 2, ViviDoc generates interactive documents through a four-agent pipeline: Planner, Styler, Executor, and Evaluator, coordinated by a structured intermediate representation called the Document Specification (DocSpec). The pipeline exposes three points of human control: editing the DocSpec after planning, customizing style preferences through the Style Palette, and refining the generated document through chat-based editing.

Figure 2: The ViviDoc pipeline. Given a topic, the Planner generates a DocSpec consisting of knowledge units with text descriptions and SRTC Interaction Specifications. The Styler analyzes the DocSpec and generates a Style Palette for users to customize writing and interaction styles. The Executor generates the document code guided by the DocSpec and style instructions. The Evaluator checks the output for correctness. Users can intervene at three points: editing the DocSpec, customizing the Style Palette, and refining the document through chat.

Planner. The Planner agent takes a topic (e.g., "What is Bernoulli's equation?") and produces a DocSpec. It decomposes the topic into a sequence of knowledge units, each containing a text description that guides content generation and an SRTC Interaction Specification that defines the interactive visualization. The Planner uses an LLM with structured output to ensure the DocSpec conforms to a predefined schema. We describe the DocSpec structure in detail in Section 4.

Styler. The Styler agent analyzes the DocSpec content and generates a Style Palette, a set of style dimensions with multiple options for the user to choose from. The dimensions are organized into two categories: writing style (e.g., narrative tone, terminology level) and interaction style (e.g., visual complexity, animation intensity).
Each dimension offers 2–3 LLM-generated options along with an "Auto" option (delegating the choice to the LLM) and a "Custom" option (free-text input). The selected options are compiled into natural language instructions and injected into the Executor prompts: writing style instructions for Step 1 (text generation) and interaction style instructions for Step 2 (visualization generation).

Executor. The Executor agent takes the DocSpec and style instructions and generates the final HTML document. It processes each knowledge unit in two steps. In Step 1, it generates the text content as an HTML fragment, guided by the text description, the writing style instructions, and the context of previously generated sections to maintain consistency. In Step 2, it generates the interactive visualization as HTML, CSS, and JavaScript, guided by the SRTC Interaction Specification and the interaction style instructions.

Evaluator. The Evaluator agent checks the generated document for correctness. It validates the HTML structure, verifies that all knowledge units have been successfully generated, and checks that each section passes HTML validation. If issues are found, the Evaluator provides feedback that can trigger re-execution of specific components.

Human-in-the-Loop Control. Human control spans three points in the pipeline. First, after the Planner produces the DocSpec, users can reorder knowledge units, modify text descriptions, adjust interaction parameters (e.g., variable ranges, control types), or add and remove units entirely. Because the DocSpec is structured rather than free-form, edits are targeted and predictable: changing a slider range in the Interaction Specification directly affects the generated visualization without requiring the user to write code. Second, the Style Palette allows users to customize the writing and interaction styles of the document by selecting from LLM-generated options or providing custom instructions.
Third, after the Executor produces the document, users can refine the output through a chat-based interface, describing desired changes in natural language. The system parses these requests and applies corresponding edits to the document.

The DocSpec serves as a contract between pipeline stages: the Planner expresses intent in a structured form, the Styler translates user preferences into generation instructions, the Executor implements the plan as code, and the Evaluator verifies the result. This decomposition isolates the most error-prone step, translating intent into code, and constrains it with a structured specification rather than ambiguous natural language.

4 Document Specification

The Document Specification (DocSpec) is the structured intermediate representation at the core of ViviDoc. It bridges the gap between a high-level topic and the generated interactive document by decomposing the content into a sequence of knowledge units, each with explicit instructions for both text and interaction generation.

4.1 Structure

A DocSpec consists of a topic and an ordered list of knowledge units. Each knowledge unit contains three components:
• A unit summary that briefly states the concept covered.
• A text description that provides a self-contained guide for generating the explanatory text of the section. It specifies what the reader should understand after reading, without prescribing exact wording.
• An Interaction Specification that defines the interactive visualization using the SRTC decomposition described below.

The text description and Interaction Specification serve different steps of the Executor: the former guides Step 1 (text generation) and the latter guides Step 2 (visualization generation). This separation allows each step to operate independently with minimal ambiguity.
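To make this structure concrete, a single knowledge unit can be written down as plain data. The following is an illustrative sketch based on the Bernoulli example shown in Figure 2; the field names are ours, not the system's actual schema:

```python
# Hypothetical sketch of one DocSpec knowledge unit, loosely following
# the Bernoulli example in Figure 2. Field names are illustrative only.
knowledge_unit = {
    "unit_summary": "Introduction to Bernoulli's equation.",
    "text_description": (
        "In a closed fluid system, the total energy remains constant: "
        "a balance of pressure, velocity, and height."
    ),
    "interaction_spec": {  # the SRTC decomposition
        "state": {"velocity": {"range": [0, 100], "default": 30}},
        "render": "A slider to adjust fluid velocity (v).",
        "transition": (
            "Dragging the velocity slider increases kinetic energy "
            "and decreases static pressure."
        ),
        "constraint": (
            "static pressure + kinetic energy + potential energy "
            "= total energy"
        ),
    },
}

# A DocSpec is a topic plus an ordered list of such units.
doc_spec = {"topic": "What is Bernoulli's equation?", "units": [knowledge_unit]}
```

Because the representation is structured, a targeted edit (e.g., changing the slider range under `state`) maps directly to a change in the generated visualization.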
The Interaction Specification supports all common interaction patterns such as parameter exploration, direct manipulation, and mode switching (see Figure 1). Due to space limitations, we provide the corresponding specifications for these cases in the supplementary material.

4.2 Interaction Specification: SRTC

Munzner's What-Why-How framework [Munzner(2014)] provides a foundational decomposition for visualization design: What data is being visualized, How it is visually encoded and interacted with, and Why the user engages with it. We adapt this framework to the setting of LLM-based generation by introducing the SRTC Interaction Specification, which decomposes each interactive visualization into four components:
• State (S): The variables underlying the visualization, including their types, domains, default values, and derivation rules. Variables are either user-controllable (e.g., a slider with a specified range) or derived from other variables (e.g., a formula). This corresponds to Munzner's What.
• Render (R): A description of how the state maps to visual elements on screen, such as geometric shapes, labels, or charts. This corresponds to the visual encoding aspect of How.
• Transition (T): A description of how user actions modify the state, specifying the cause-and-effect relationship between input events and state changes. This corresponds to the interaction idiom aspect of How.
• Constraint (C): The key invariant that the visualization is designed to demonstrate. This is the core insight the reader should discover through interaction. This corresponds to Munzner's Why, adapted from "what task is the user performing" to "what should the reader observe."

Table 1: Comparison of a natural language interaction description and its SRTC Interaction Specification for a visualization about π.

Natural language description: "The reader can adjust the radius of a circle using a slider, and as the radius changes, the visualization updates the circle while recalculating and displaying the circumference, diameter, and ratio, allowing the reader to see that the ratio stays roughly the same."

SRTC Interaction Specification:
  S: r: slider [0.5, 5], default 1; C: derived 2πr; D: derived 2r; ratio: derived C/D
  R: A circle whose size reflects r; labels showing C, D, and ratio
  T: Dragging the slider changes r; C, D, ratio update automatically
  C: ratio ≈ 3.14159 regardless of r

Example. Table 1 contrasts a natural language interaction description with its corresponding SRTC specification for a visualization about π. The natural language version leaves implicit what variables exist, how they relate, what appears on screen, and what invariant the reader should notice. The SRTC version makes each of these explicit. The State defines a user-controllable variable r (a slider over [0.5, 5]) and three derived variables: circumference C = 2πr, diameter D = 2r, and their ratio C/D. The Render specifies that the visualization displays a circle whose size reflects r, along with labels for C, D, and the ratio. The Transition links the slider interaction to the state: dragging the slider changes r, and all derived variables update automatically. Finally, the Constraint encodes the key invariant that the reader should discover: C/D ≈ 3.14159 regardless of r.

5 Benchmark

To systematically evaluate the quality of generated interactive documents, we construct ViviBench, a benchmark consisting of a curated topic dataset grounded in real-world interactive documents and an automated evaluation framework covering four complementary dimensions.
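A useful property of the SRTC decomposition in Section 4.2 is that the Constraint is a machine-checkable invariant over the State's domain. As a minimal sketch (our code, not part of ViviDoc), the π example's derived variables and constraint can be verified by sweeping the slider range:

```python
import math

# Check of the SRTC constraint from the pi example: for every radius r
# in the slider's declared domain, the derived ratio C/D stays near pi.
def derive(r: float) -> dict:
    C = 2 * math.pi * r   # circumference, a derived State variable
    D = 2 * r             # diameter, a derived State variable
    return {"C": C, "D": D, "ratio": C / D}

# Sweep the slider range [0.5, 5] from the State component.
for step in range(46):
    r = 0.5 + step * 0.1
    assert abs(derive(r)["ratio"] - 3.14159) < 1e-4  # Constraint holds
```

This is exactly the invariant the reader is meant to discover interactively; the Executor's generated JavaScript realizes the same dependency graph in the browser.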
5.1 Dataset

A key challenge in benchmarking interactive document generation is obtaining topics that reflect genuine needs rather than synthetic or contrived examples. To address this, we adopt a reverse-engineering approach: we collect 101 real-world interactive documents from 63 distinct websites across 11 subject areas, including platforms such as setosa.io [Powell and Lehe(2023)] and distill.pub [Hohman et al.(2020)]. For each document, we use an LLM to extract the core topic as a concise natural-language description (e.g., "Exponential growth as repeated multiplication"). These extracted topics serve as the input prompts for all generation methods. Due to space limitations, we provide the distribution of topics across subject areas in the supplementary material.

From these 101 documents, we also extract 482 interaction instances. Three visualization experts collaboratively classify these instances into 8 interaction types based on interaction intent, such as Parameter Exploration and State Switching (see Figure 1). This taxonomy inspires the design of the SRTC Interaction Specification (Section 4), which can express all 8 identified types.

5.2 Evaluation Metric

We evaluate generated documents along four dimensions using a two-layer evaluation framework. The first layer applies deterministic, rule-based checks to verify functional correctness and generation efficiency. The second layer employs LLM-as-Judge to assess higher-level content and interaction quality [Zheng et al.(2023), Xie et al.(2025)].

Layer 1: Rule-based Evaluation. Two dimensions are evaluated through automated checks, providing reproducible and deterministic assessments:
• Interaction Functionality (IF): Tests whether interactive elements (buttons, sliders, checkboxes, dropdowns) respond to programmatic interaction using Playwright browser automation.
For each element, we record the DOM state before and after triggering the interaction; elements that produce a DOM change are counted as functional. The score is the ratio of responsive elements to total interactive elements found.
• Efficiency (E): Measures the generation throughput as the ratio of output HTML length (in characters) to generation time (in seconds). This metric captures how efficiently each method utilizes its LLM calls to produce content.

Layer 2: LLM-as-Judge Evaluation. Two dimensions are assessed by prompting an LLM to score each document on a 1–5 Likert scale with detailed rubrics:
• Content Richness (CR): Whether the document covers the topic with sufficient breadth (e.g., multiple sub-concepts) and depth (e.g., explanations, examples, connections between ideas). We extract the main textual content from the HTML, stripping scripts and styling, so the judge focuses on the educational substance rather than code volume.
• Interaction Quality (IQ): A composite metric defined as IQ = ID × IF, where ID (Interaction Design) is the LLM-judged score for whether interactive elements serve a clear purpose, and IF is the rule-based Interaction Functionality score. This formulation ensures that only functionally working interactions contribute to the quality score.

6 User Interface

Figure 3 shows the ViviDoc user interface, which consists of three panels: a sidebar (A), a center panel (B), and an AI assistant panel (C). The sidebar (A) provides a "New Doc" button (A1) to start a new generation and a history list (A2) of previously generated documents. The center panel (B) guides users through a four-stage workflow indicated by the top navigation bar (B1): Topic, Spec, Style, and Doc. A download button (B2) allows exporting the final document. The AI assistant panel (C) provides a chat interface (C1) for refining the spec or the generated document through natural language conversation.

Stage 1: Topic.
The user enters a topic of interest in the input field (B3) and clicks "Generate Spec" to start the pipeline.

Stage 2: Spec. The Planner generates a DocSpec, displayed as a list of knowledge units (Figure 3b). Each unit shows its title, text description, and SRTC Interaction Specification. Users can edit, reorder, add, or delete knowledge units before proceeding.

Stage 3: Style. The Styler generates a Style Palette with two columns: Writing Style and Interaction Style (Figure 3c). Each column contains several dimensions with LLM-generated options. Users can select an option, choose "Auto" to let the LLM decide, or provide custom instructions.

Stage 4: Doc. The Executor generates the final interactive document, rendered in the center panel (Figure 3d). Users can interact with and download the document. The chat assistant (C1) allows further refinement through natural language requests.

7 Experiments

7.1 Effectiveness of ViviDoc

Setup. We generated interactive documents for all topics in ViviBench using ViviDoc and three multi-agent baselines: AutoGen [Wu et al.(2024)], CAMEL [Li et al.(2023)], and MetaGPT [Hong et al.(2023)]. Each method was run with three backbone LLMs (Gemini 3 Flash [Team et al.(2023)], Mistral Small [Team(2024)], and Qwen 3.5-35B [Bai et al.(2023)]) to assess robustness across model capabilities. For fair comparison in the quantitative evaluations (Sections 7.1–7.3), we remove human control from ViviDoc and operate its full pipeline (Planner → Styler → Executor → Evaluator) in fully automated mode with the style set to Auto. For the three baselines, we implement the equivalent roles from our pipeline within their respective frameworks, using the same topics as input. All LLM-as-Judge evaluations use Gemini 3.1 Pro [Team et al.(2023)] as the judge model.

Results. Figure 4 reports the four metrics per method-backbone combination.
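The four reported metrics follow the definitions in Section 5.2. As a small illustrative sketch (our code, with made-up numbers; the real IF probe drives a browser via Playwright), the scoring reduces to:

```python
# Illustrative computation of the ViviBench metrics (sketch only).
def interaction_functionality(responsive: int, total: int) -> float:
    # IF: fraction of interactive elements whose DOM changed when triggered.
    return responsive / total if total else 0.0

def efficiency(html_chars: int, seconds: float) -> float:
    # E: output HTML length per second of generation time (chars/s).
    return html_chars / seconds

def interaction_quality(id_score: float, if_score: float) -> float:
    # IQ = ID * IF: the LLM-judged design score, gated by functionality,
    # so non-working interactions contribute nothing.
    return id_score * if_score

# Made-up measurements for one document:
if_score = interaction_functionality(responsive=11, total=12)
e = efficiency(html_chars=45_450, seconds=90.0)  # 505.0 chars/s
iq = interaction_quality(id_score=5.0, if_score=if_score)
```

The gating in IQ is the design choice that drives the near-zero scores reported below: a beautifully described but non-functional interaction earns IF ≈ 0 and therefore IQ ≈ 0.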
ViviDoc consistently achieves the highest Content Richness and Interaction Quality across all backbones, substantially outperforming all baselines. For example, with Gemini 3 Flash as the backbone, ViviDoc achieves a normalized CR of 1.00 and IQ of 0.92, compared to 0.53 and 0.64 for the strongest baseline, AutoGen. Notably, CAMEL and MetaGPT produce documents with near-zero IQ scores (< 0.05 across all backbones), indicating that their generated interactions rarely function correctly; the rule-based IF component effectively penalizes non-functional interactions. Furthermore, ViviDoc achieves significantly higher end-to-end Efficiency. For instance, when using Gemini 3 Flash as the backbone, the efficiency of ViviDoc (505 chars/s) is 3.3×, 9.9×, and 2.0× higher than AutoGen, CAMEL, and MetaGPT, respectively. The above results demonstrate that ViviDoc, as a curated pipeline for the interactive document generation task, vastly outperforms general-purpose multi-agent frameworks in both document generation quality and efficiency.

Figure 3: The ViviDoc user interface. Top: the main view with (A) sidebar for history and new document creation, (B) center panel with topic input and four-stage navigation bar, and (C) AI chat assistant. Bottom (left to right): the Spec stage showing editable knowledge units, the Style stage with writing and interaction style options, and the Doc stage displaying the generated interactive document.

7.2 Effectiveness of DocSpec

To isolate the contribution of DocSpec, we compare ViviDoc against a Naive Agent baseline that uses our generation pipeline but removes the DocSpec generated by the Planner as guidance, instead generating the interactive document directly end-to-end based on the topic.

Results.
Table 2 shows the comparison across all three backbones. ViviDoc consistently outperforms Naive Agent on CR, IQ, and IF across all backbone LLMs, confirming that DocSpec's structured planning leads to richer content and more functional interactions. The improvement is most pronounced on IQ (+41% with Gemini 3 Flash), highlighting that generation without interaction specifications struggles to produce coherent interactive elements even when content quality is reasonable. In terms of efficiency, ViviDoc introduces additional planning overhead, resulting in modest efficiency reductions for Gemini and Mistral (6% and 19%, respectively). Qwen exhibits a larger drop (50%), as the additional planning calls consume a greater share of total time on a lower-throughput model. Overall, the efficiency trade-off is well justified by the consistent gains in content richness and interaction quality.

Table 2: DocSpec effectiveness: ViviDoc vs. Naive Agent across three backbone LLMs. CR: Content Richness (1–5), IQ: Interaction Quality (0–5), IF: Interaction Functionality (0–1), E: Efficiency (chars/s). Yellow: best per backbone.

Backbone        Method       CR↑   IQ↑   IF↑   E↑
Gemini 3 Flash  Naive Agent  4.00  3.24  0.85  539
                ViviDoc      5.00  4.58  0.92  505
Mistral Small   Naive Agent  4.44  2.36  0.61  695
                ViviDoc      5.00  3.24  0.65  562
Qwen3.5-35B     Naive Agent  4.11  2.76  0.70  355
                ViviDoc      4.56  3.57  0.73  178

Figure 4: Automated evaluation results for ViviDoc vs. three multi-agent baselines across three backbone LLMs. Content Richness (CR) and Interaction Quality (IQ) are normalized to [0, 1]. Interaction Functionality is on a 0–1 scale; Efficiency is measured in characters per second (chars/s).

7.3 Human Evaluation

To validate the automated evaluation, we conducted a human evaluation study with 12 raters.

Setup.
We randomly selected 9 topics from ViviBench, ensuring coverage across different subject areas, and generated documents using all four methods with the Gemini 3 Flash backbone, yielding 36 documents. We recruited 12 human raters and divided them into 3 groups of 4, with each group evaluating 12 documents (3 topics × 4 methods) in a blind setting. Raters scored each document on Content Richness (CR) and Interaction Design (ID) using a 1–5 Likert scale.

Human Scores. Table 3 shows the mean human scores per method. ViviDoc achieves the highest scores on both dimensions (CR: 4.43, ID: 4.33), followed by AutoGen. CAMEL receives the lowest scores, consistent with the automated evaluation results.

Table 3: Mean human evaluation scores by method (Gemini 3 Flash backbone), averaged across 12 raters and 9 topics. Yellow: best per metric.

      AutoGen  MetaGPT  CAMEL  ViviDoc
CR↑   2.79     2.51     1.60   4.43
ID↑   3.55     1.85     1.76   4.33

LLM-Human Alignment. To assess whether the LLM judge can serve as a reliable proxy for human evaluation, we compute the correlation between LLM judge scores and averaged human scores across all 36 paired items. Content Richness shows strong alignment (Pearson r = 0.843, Spearman ρ = 0.835), and Interaction Quality likewise (Pearson r = 0.870, Spearman ρ = 0.796). These results suggest that the LLM-as-Judge framework provides a reliable automated proxy for human assessment on these dimensions.

7.4 User Study

We conducted a user study to evaluate ViviDoc as an end-to-end authoring tool, combining a free-form usage session with a follow-up semi-structured interview.

7.4.1 System Use. Participants and Procedure. We recruited 12 participants (P1–P12) with backgrounds in visualization, educational technology, or computer science.
After a 10-minute introduction, each participant used ViviDoc to generate two interactive documents on self-chosen topics, experiencing the full workflow from topic input through DocSpec editing, style customization, and generation. Participants then rated the system on 9 items using a 5-point Likert scale, covering usability (Q1–Q2), controllability (Q3–Q5), output quality (Q6–Q8), and intent to reuse (Q9).

Results. Table 4 summarizes the ratings. All items received mean scores above 4.0 on the 5-point scale, with Usability items (Q1–Q2) achieving a perfect 5.00, indicating that the system is easy to learn and intuitive to use. For Controllability, DocSpec (Q3: 4.50 ± 0.76) and chat-based editing (Q5: 4.67 ± 0.47) were both rated highly, while the Style Palette received a comparatively lower score (Q4: 4.17 ± 0.49). Some participants noted that the lack of inline previews made it difficult to anticipate how style choices would affect the final output, a point we revisit in the qualitative feedback below. Output quality was rated highly across all dimensions (Q6–Q8 ≥ 4.58), with text content and visualizations both considered informative and engaging. The strong intent-to-reuse score (Q9: 4.75 ± 0.43) further suggests overall satisfaction with the system.

Table 4: User study ratings (5-point Likert scale, n = 12).

Item                                           Mean ± SD
Usability
  Q1 Easy to learn and use                     5.00 ± 0.00
  Q2 Interface intuitive and satisfying        5.00 ± 0.00
Controllability
  Q3 DocSpec gives meaningful control          4.50 ± 0.76
  Q4 Style Palette gives meaningful control    4.17 ± 0.49
  Q5 Chat editing refines output effectively   4.67 ± 0.47
Output Quality
  Q6 Text content informative and well-written 4.75 ± 0.43
  Q7 Visualizations engaging and functional    4.67 ± 0.47
  Q8 Overall document satisfying               4.58 ± 0.49
Intent to Reuse
  Q9 Would use again                           4.75 ± 0.43

Interview. Following the system-use session, we conducted a brief semi-structured interview with each participant to gather qualitative feedback. Participants highlighted three main strengths: (1) DocSpec provides transparent, fine-grained control over document structure and interaction design "without writing any code" (P4); (2) the Style Palette and chat-based editing work in a complementary fashion, with the former setting the overall tone and interaction style upfront and the latter handling targeted post-generation refinements; and (3) the Topic → Spec → Style → Doc workflow felt natural, with each stage having a clear and distinct role. Two suggestions for improvement were raised: adding inline previews within the Style Palette so that style options are easier to evaluate before generation, and supporting retrieval-augmented generation for topics requiring specialized or up-to-date knowledge (e.g., recent research papers or live datasets). We leave the exploration of these features to future work.

8 Conclusion

We presented ViviDoc, a multi-agent framework for controllable interactive document generation. It supports meaningful human collaboration through structured planning (DocSpec), stylistic customization, and chat-based editing. We also introduced ViviBench, a comprehensive evaluation benchmark. Extensive experiments and user studies confirm that ViviDoc significantly outperforms existing baselines, offering a highly effective and intuitive authoring experience. We hope this work lays a foundation for further research on human-agent collaboration in interactive content authoring.

References

[exp([n. d.])] [n. d.]. Explorable Explanations. https://explorabl.es/.
[Bai et al.(2023)] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report.
arXiv preprint arXiv:2309.16609 (2023).
[Bitzer et al.(2007)] Donald Bitzer, Peter Braunfeld, and Wayne Lichtenberger. 2007. PLATO: An automatic teaching device. IRE Transactions on Education 4, 4 (2007), 157–161.
[Branch(2012)] John Branch. 2012. Snow Fall: The Avalanche at Tunnel Creek. The New York Times. https://www.nytimes.com/projects/2012/snow-fall/index.html.
[Chen et al.(2025)] Yanzhe Chen, Kevin Qinghong Lin, and Mike Zheng Shou. 2025. Code2Video: A Code-centric Paradigm for Educational Video Generation. arXiv preprint arXiv:2510.01174 (2025).
[Chi and Wylie(2014)] Michelene T. H. Chi and Ruth Wylie. 2014. The ICAP Framework: Linking Cognitive Engagement to Active Learning Outcomes. Educational Psychologist 49, 4 (2014), 219–243. doi:10.1080/00461520.2014.965823
[Engelbart(2023)] Douglas C Engelbart. 2023. Augmenting human intellect: A conceptual framework. In Augmented education in the global age. Routledge, 13–29.
[Hohman et al.(2020)] Fred Hohman, Matthew Conlen, Jeffrey Heer, and Duen Horng Chau. 2020. Communicating with Interactive Articles. Distill (2020). doi:10.23915/distill.00028
[Hong et al.(2023)] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. 2023. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations.
[Li et al.(2023)] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems 36 (2023), 51991–52008.
[Liang et al.(2025)] Xin Liang, Xiang Zhang, Yiwei Xu, Siqi Sun, and Chenyu You. 2025. SlideGen: Collaborative multimodal agents for scientific slide generation. arXiv preprint arXiv:2512.04529 (2025).
[Lin et al.(2025)] Yi-Cheng Lin, Kang-Chieh Chen, Zhe-Yan Li, Tzu-Heng Wu, Tzu-Hsuan Wu, Kuan-Yu Chen, Hung-yi Lee, and Yun-Nung Chen. 2025. Creativity in LLM-based multi-agent systems: A survey. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 27572–27595.
[Munzner(2014)] Tamara Munzner. 2014. Visualization Analysis and Design. A K Peters/CRC Press.
[Nelson(1965)] Theodor Holm Nelson. 1965. Complex information processing: a file structure for the complex, the changing and the indeterminate. In Proceedings of the 1965 20th National Conference. 84–100.
[Powell and Lehe(2023)] Victor Powell and Lewis Lehe. 2023. Setosa: Explained Visually. https://setosa.io/. Accessed: 2026-03-29.
[Roston and Migliozzi(2015)] Eric Roston and Blacki Migliozzi. 2015. What's really warming the world. Bloomberg Business. https://www.bloomberg.com/graphics/2015-whats-warming-the-world/.
[Tang et al.(2026)] Yinghao Tang, Xueding Liu, Boyuan Zhang, Tingfeng Lan, Yupeng Xie, Jiale Lao, Yiyao Wang, Haoxuan Li, Tingting Gao, Bo Pan, et al. 2026. IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation. arXiv preprint arXiv:2601.04498 (2026).
[Team et al.(2023)] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
[Team(2024)] Mistral AI Team. 2024. Mistral Small. https://mistral.ai/news/september-24-release/.
[Victor(2011)] Bret Victor. 2011. Explorable Explanations. http://worrydream.com/ExplorableExplanations/. Accessed: 2026-02-27.
[Wattenberg et al.(2016)] Martin Wattenberg, Fernanda Viégas, and Moritz Hardt. 2016. Attacking discrimination with smarter machine learning. Google Research. https://research.google.com/bigpicture/attacking-discrimination-in-ml/.
[Wen et al.(2025)] Zhen Wen, Luoxuan Weng, Yinghao Tang, Runjin Zhang, Yuxin Liu, Bo Pan, Minfeng Zhu, and Wei Chen. 2025. Exploring multimodal prompt for visualization authoring with large language models. arXiv preprint (2025).
[Weng et al.(2025)] Luoxuan Weng, Yinghao Tang, Yingchaojie Feng, Zhuo Chang, Ruiqin Chen, Haozhe Feng, Chen Hou, Danqing Huang, Yang Li, Huaming Rao, et al. 2025. DataLab: A unified platform for LLM-powered business intelligence. In 2025 IEEE 41st International Conference on Data Engineering (ICDE). IEEE, 4346–4359.
[Wu et al.(2024)] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling.
[Xie et al.(2024)] Yupeng Xie, Yuyu Luo, Guoliang Li, and Nan Tang. 2024. HAIChart: Human and AI paired visualization system. arXiv preprint arXiv:2406.11033 (2024).
[Xie et al.(2025)] Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang, Zhaoyang Yu, Jinlin Wang, Sirui Hong, Bang Liu, Chenglin Wu, et al. 2025. VisJudge-Bench: Aesthetics and quality assessment of visualizations. arXiv preprint arXiv:2510.22373 (2025).
[Yan et al.(2026)] Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, and Jizhou Huang. 2026. Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation. arXiv preprint (2026).
[Zheng et al.(2025)] Hao Zheng, Xinyan Guan, Hao Kong, Wenkai Zhang, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. 2025. PPTAgent: Generating and evaluating presentations beyond text-to-slides. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 14413–14429.
[Zheng et al.(2023)] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023.
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36 (2023), 46595–46623.
[Zou et al.(2025)] Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, and Philip S. Yu. 2025. LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey. arXiv:2505.00753 [cs.CL] https://arxiv.org/abs/2505.00753

A Ethical Considerations

The dataset of 101 interactive documents was collected and utilized in strict compliance with applicable copyright regulations. To respect potential copyright concerns, we will only release the URLs of the collected documents rather than distributing their content. For our user study, we obtained explicit informed consent from all participants and rigorously anonymized all interview records to protect personal privacy. Although ViviDoc employs LLMs to assist in generation, we mitigate potential risks of AI hallucinations and uncontrollable outputs through a human-in-the-loop paradigm. By allowing users to review and edit the Document Specification (DocSpec) prior to code synthesis, customize styles via the Style Palette, and refine output through chat-based editing, the system ensures that creators retain full control over the authoring intent.

Table 5: Distribution of topics in ViviBench by subject area.

Subject Area                   # Topics
Algorithms                     25
Mathematics                    24
Tools & Resources              13
Physics                        10
Science                        9
Other                          6
Explorable Explanations        5
Systems & Thought Experiments  4
Psychology                     2
Creativity                     2
Books & Essays                 1
Total                          101

B Interaction Taxonomy

From the 101 collected documents, we extracted 482 interaction instances.
Three visualization experts collaboratively classified these instances into 8 types based on interaction intent, inspired by Munzner's What-Why-How framework [Munzner(2014)]:

• State Switching (181, 37.6%): Switching between discrete options, such as selecting a dataset, algorithm, or display mode.
• Parameter Exploration (121, 25.1%): Adjusting continuous parameters to observe changes, such as tuning a slider for radius, frequency, or threshold.
• Freeform Construction (53, 11.0%): Freely creating content, such as drawing shapes, writing code, editing values, or uploading files.
• Direct Manipulation (45, 9.3%): Dragging objects within a visualization, such as data points, control points, or graph nodes.
• Temporal Control (32, 6.6%): Controlling the time dimension, such as play/pause, stepping, speed adjustment, or timeline scrubbing.
• Inspection (24, 5.0%): Exploring details on demand, such as hover tooltips or cursor tracking.
• Spatial Navigation (24, 5.0%): Navigating in space, such as zooming, panning, or rotating 3D views.
• Scroll-driven Narrative (2, 0.4%): Scrolling drives the narrative progression.

C SRTC Specifications for Case Study Examples

The following structured SRTC (State, Render, Transition, Constraint) specifications correspond to the eight interactive visualizations shown in Figure 1. Each specification was produced by the ViviDoc Planner agent and served as the code-generation contract for the Executor.
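For concreteness, one way such an SRTC contract could be passed from the Planner to the Executor is as plain structured data. The sketch below is illustrative only: the class and field names are our assumptions, not ViviDoc's published schema, and the content mirrors the Lorenz example in spec (a).

```python
from dataclasses import dataclass

@dataclass
class SRTCSpec:
    """Illustrative container for one SRTC interaction specification.

    The four fields mirror the paper's State/Render/Transition/Constraint
    structure; the concrete schema used by ViviDoc is not published here.
    """
    state: dict        # interactive state variables with widgets and ranges
    render: list       # declarative list of visual elements to draw
    transition: list   # how user actions update the state
    constraint: str    # invariant the generated code must preserve

# Hypothetical serialization of spec (a), the Lorenz attractor
lorenz = SRTCSpec(
    state={
        "sigma": {"widget": "slider", "range": [1, 30], "default": 10},
        "rho": {"widget": "slider", "range": [10, 60], "default": 28},
        "beta": {"widget": "constant", "value": 2.667},
    },
    render=["3D phase-space trajectory projected onto a 2D canvas",
            "fading trail", "slow auto-rotation", "two sliders"],
    transition=["adjusting the sigma slider resets the trajectory",
                "adjusting the rho slider resets the trajectory"],
    constraint="classical values trace the non-repeating butterfly attractor",
)

# A minimal contract check: every SRTC field must be populated
assert all([lorenz.state, lorenz.render, lorenz.transition, lorenz.constraint])
```

A structured contract like this is what lets the Evaluator check generated code against the plan field by field, rather than against free-form prose.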
(a) Parameter Exploration — The Lorenz Attractor

S sigma: slider [1, 30], default 10; rho: slider [10, 60], default 28; beta: constant 2.667; trajectory: derived numerical integration of dx/dt = σ(y − x), dy/dt = x(ρ − z) − y, dz/dt = xy − βz

R
• A continuously growing 3D phase-space trajectory projected onto a 2D canvas
• The trail fades over time to emphasize recent motion
• A slow auto-rotation of the view around the vertical axis
• Two sliders for σ and ρ displayed below the canvas

T
• Adjusting the σ slider resets the trajectory and restarts integration from the same initial point
• Adjusting the ρ slider resets the trajectory, causing the attractor shape to deform or collapse into a stable orbit

C For classical values (σ = 10, ρ = 28, β = 8/3), the trajectory never repeats and draws a distinctive butterfly shape, demonstrating sensitive dependence on initial conditions.

(b) Direct Manipulation — Geometric Optics Ray Tracing

S object_x: drag-x [20, lens_x − 10]; object_y: drag-y [0, canvas_height]; f: drag-x [40, 300]; u: derived lens_x − object_x; v: derived (u · f) / (u − f); M: derived v/u; image_type: derived if v > 0: Real Inverted, if v < 0: Virtual Upright, if v = ∞: Undefined

R
• A central convex lens whose thickness scales with focal length
• An orange draggable object arrow on the left of the lens
• Two green draggable focal points F and F′ on the optical axis
• Three principal rays drawn from the object tip through the lens
• A colored image arrow on the right (green for real, blue for virtual)
• An instrument dashboard showing live values of f, u, v, and M
• A status tag indicating image type and orientation

T
• Dragging the orange object arrow horizontally changes u and updates all derived optical values
• Dragging the object arrow vertically changes the object height and redraws the ray diagram
• Dragging a focal point (F or F′) changes f, reshaping both the lens
and all derived values simultaneously

C The thin lens equation 1/u + 1/v = 1/f is always satisfied. When u < f, the image distance v becomes negative (virtual image). When u = f, the image forms at infinity.

(c) Inspection — Voronoi Tessellation

S seeds: 15 points initialized at random positions, moving with slow random velocity; mouse_pos: hover; nearest: derived argmin_k d(mouse_pos, seeds[k]); min_dist: derived d(mouse_pos, seeds[nearest]); cell_region: derived all pixels closest to seeds[nearest] within 150px from mouse_pos

R
• An animated canvas of 15 slowly drifting seed points on a dark background
• On hover: the hovered Voronoi cell illuminated with a purple radial gradient
• On hover: a dashed gold line connecting the cursor to its nearest seed
• On hover: a transparent circle of radius min_dist visualizing the nearest-neighbor envelope
• The nearest seed highlighted in gold; all others remain purple

T
• Moving the mouse over the canvas updates mouse_pos continuously
• Each frame recalculates the nearest seed and updates the illuminated cell, connecting line, and envelope circle in real time

C The illuminated region always corresponds exactly to the Voronoi cell of the nearest seed. Every point within the highlighted region is provably closer to that seed than to any other, demonstrating the core definition of Voronoi partitioning.
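The nearest derivation in spec (c) is an arg-min over point-to-seed distances, recomputed every frame. A minimal sketch (the seed coordinates below are hypothetical):

```python
import math

def nearest_seed(mouse_pos, seeds):
    """Return (index, distance) of the seed closest to the cursor,
    i.e. the Voronoi cell the cursor currently occupies."""
    dists = [math.dist(mouse_pos, s) for s in seeds]
    k = min(range(len(seeds)), key=dists.__getitem__)
    return k, dists[k]

seeds = [(10, 10), (80, 20), (40, 70)]  # hypothetical seed positions
k, d = nearest_seed((75, 25), seeds)
# every point in the highlighted cell is closer to seeds[k] than to any other
assert k == 1
```

The same arg-min also yields min_dist for the nearest-neighbor envelope circle, so one pass per frame covers all three derived values.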
(d) Freeform Construction — Neural Network Forward Propagation

S hidden_nodes: click-to-place list of {x, y}, default []; weights: derived random initialization; activations: derived forward pass (sigmoid); output_val: derived average of output neuron activations

R
• Three fixed red input nodes (x1, x2, x3) on the left
• Two fixed blue output nodes (y1, y2) on the right
• User-placed yellow hidden nodes in the central region
• Directed arrows connecting all layers; animated green (input → hidden) then orange (hidden → output) during forward pass
• Activation values displayed on each node
• An output activation progress bar and percentage readout
• A 'Send Signal' button and a 'Clear' button

T
• Clicking in the central canvas zone places a new hidden neuron and immediately triggers an animated forward pass
• Pressing 'Send Signal' replays the forward pass with newly randomized weights
• Pressing 'Clear' removes all hidden nodes and resets the output bar to 50%

C Adding more hidden neurons generally introduces more non-linearity. The output activation always falls in (0, 1) due to the sigmoid function, regardless of network topology.

(e) Scroll-driven Narrative — Thermodynamic Entropy

S scroll_progress: scroll-wheel [0, 1], default 0; partition_y: derived scroll_progress × canvas_height; particles: 150 red + 150 blue bouncing particles; entropy_S: derived fraction of particles that have crossed to the other side times 100

R
• A dark split-canvas with 300 bouncing particles: red on left, blue on right
• A central vertical wall separating the two halves, present only from partition_y downward
• A live entropy counter (S = …
) displayed in red in the top-left corner
• A scroll-progress fill bar on the left edge

T
• Scrolling downward increases scroll_progress, raising partition_y and shortening the wall from the top, allowing particles to mix
• Scrolling upward decreases scroll_progress, lowering partition_y and re-blocking particle passage

C As the wall is removed (scroll_progress → 1), particles irreversibly mix; entropy S increases monotonically. The constraint ΔS ≥ 0 maps directly to the unidirectional scroll interaction, visualizing time's arrow.

(f) Spatial Navigation — The Möbius Strip

S rotX: drag-y [−3.14, 3.14], default 0.5; rotY: drag-x [−3.14, 3.14], default −0.5; surface: derived parametric mesh x(u, v) = (R + v cos(u/2)) cos(u), y(u, v) = v sin(u/2), z(u, v) = (R + v cos(u/2)) sin(u)

R
• A 3D polygon mesh of the Möbius strip rendered using the painter's algorithm
• Faces colored with a teal-to-sky-blue gradient mapped to the u parameter
• Depth shading simulating a diffuse light source
• Numeric overlays showing the current rotX and rotY viewing angles

T
• Clicking and dragging horizontally on the canvas updates rotY, rotating the strip around the vertical axis
• Clicking and dragging vertically updates rotX, tilting the strip forward or backward
• Releasing the mouse locks the current orientation

C No matter how the strip is rotated, a continuous path along its surface always returns to the starting point in a mirrored orientation, demonstrating the strip's single-sidedness.
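The surface mesh in spec (f) can be sampled directly from the given parametrization. The sketch below (the radius R and strip half-width are assumed values, not taken from the spec) also checks the mirrored-return property stated in the constraint: after one full turn in u, the point at offset v coincides with the point at offset −v.

```python
import math

def mobius_point(u, v, R=2.0):
    """Vertex of the Möbius strip for parameters u (angle) and v
    (offset across the strip), using the parametrization from spec (f)."""
    x = (R + v * math.cos(u / 2)) * math.cos(u)
    y = v * math.sin(u / 2)
    z = (R + v * math.cos(u / 2)) * math.sin(u)
    return x, y, z

# After one full turn in u, the edge returns mirrored: v maps to -v.
p_start = mobius_point(0.0, 0.5)
p_loop = mobius_point(2 * math.pi, -0.5)
assert all(abs(a - b) < 1e-9 for a, b in zip(p_start, p_loop))
```

Sampling u over [0, 2π) and v over [−w, w] on a grid and connecting neighboring points yields the polygon mesh that the painter's algorithm then depth-sorts.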
(g) State Switching — Quantum Electron Orbitals

S orbital_state: segmented-button [1s, 2p, 3d], default 1s; wavefunction: derived ψ(r, θ) = radial_part(n, l) × angular_part(l, m); density_cloud: derived Monte Carlo sampling, accept point (x, y) with probability ∝ |ψ(x, y)|²

R
• A 2D canvas with coordinate axes and a nucleus dot at the origin
• Points progressively sampled and plotted, accumulating into a probability density cloud
• A wavefunction equation display that updates to reflect the active orbital state
• A segmented button control with options 1s, 2p, and 3d

T
• Clicking a segment button sets orbital_state to the selected orbital
• Switching state clears all existing sample points and triggers a new Monte Carlo sampling run

C Each orbital state produces a distinct, characteristic spatial density pattern: 1s is spherically symmetric, 2p is a dumbbell shape with a nodal plane, and 3d forms a four-lobed clover pattern, matching theoretical predictions.
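The density_cloud derivation in spec (g) is standard rejection sampling: draw a candidate point and accept it with probability proportional to |ψ|². A sketch with a simplified 1s-like density (the exact wavefunctions used by the generated document are not specified here, and the density is assumed bounded by 1):

```python
import math
import random

def sample_orbital(psi2, n=500, extent=5.0, rng=random.Random(0)):
    """Rejection-sample n points in [-extent, extent]^2, accepting each
    candidate with probability psi2(x, y) (assumed to lie in [0, 1])."""
    pts = []
    while len(pts) < n:
        x = rng.uniform(-extent, extent)
        y = rng.uniform(-extent, extent)
        if rng.random() < psi2(x, y):
            pts.append((x, y))
    return pts

def psi2_1s(x, y):
    """1s-like density: spherically symmetric, peaked at the nucleus."""
    r = math.hypot(x, y)
    return math.exp(-2 * r)

cloud = sample_orbital(psi2_1s)
# the accepted cloud concentrates near the origin
mean_r = sum(math.hypot(x, y) for x, y in cloud) / len(cloud)
assert mean_r < 2.0
```

Switching orbital_state then simply swaps psi2 for a 2p- or 3d-shaped density and restarts the sampling loop, which is why the accumulated cloud must be cleared on each state switch.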
(h) Temporal Control — Fourier Series Epicycles

S time: playback [0, ∞], default 0; is_playing: toggle, default true; n_harmonics: slider [1, 15] (step 2), default 5; wave: derived last 500 y-values of the epicycle tip

R
• A chain of rotating circles (epicycles), each at a frequency proportional to its harmonic index
• A dot tracing the tip of the outermost epicycle
• A reconstructed square wave drawn on the right by recording the tip's y-position over time
• A dashed line connecting the epicycle tip to the leading edge of the wave
• A Play/Pause button and a harmonic count slider

T
• Pressing Play/Pause starts or freezes the rotation of all epicycles
• Dragging the harmonic slider adds or removes outer epicycles in odd increments, immediately reshaping the output wave

C As n_harmonics increases toward infinity, the reconstructed wave converges to a perfect square wave, demonstrating Fourier's theorem that any periodic signal decomposes into sinusoids.
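The convergence claim in spec (h) can be checked numerically: the epicycle chain for a square wave corresponds to the odd-harmonic partial Fourier sum, so adding harmonics should shrink the reconstruction error. A sketch (the unit-amplitude square wave is an assumption about the visualization's scaling):

```python
import math

def square_wave_partial(theta, n_harmonics):
    """Partial Fourier sum for a unit square wave using only odd
    harmonics, mirroring the epicycle construction in spec (h)."""
    return (4 / math.pi) * sum(
        math.sin(k * theta) / k for k in range(1, 2 * n_harmonics, 2))

# More harmonics bring the sum closer to the ideal value of 1 at theta = pi/2
err5 = abs(1 - square_wave_partial(math.pi / 2, 5))
err50 = abs(1 - square_wave_partial(math.pi / 2, 50))
assert err50 < err5
```

Each term sin(kθ)/k is exactly one rotating circle in the chain, with radius 1/k and angular frequency k, which is why dragging the harmonic slider reshapes the traced wave immediately.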
