LLM-Based Authoring of Agent-Based Narratives through Scene Descriptions

This paper presents a system for procedurally generating agent-based narratives using large language models (LLMs). Users can drag and drop multiple agents and objects into a scene, with each entity automatically assigned semantic metadata describing its identity, role, and potential interactions. The scene structure is then serialized into a natural language prompt and sent to an LLM, which returns a structured string describing a sequence of actions and interactions among agents and objects. The returned string encodes who performed which actions, when, and how. A custom parser interprets this string and triggers coordinated agent behaviors, animations, and interaction modules. The system supports agent-based scenes, dynamic object manipulation, and diverse interaction types. Designed for ease of use and rapid iteration, the system enables the generation of virtual agent activity suitable for prototyping agent narratives. The performance of the developed system was evaluated using four popular lightweight LLMs. Each model’s processing and response time were measured under multiple complexity scenarios. The collected data were analyzed to compare consistency across the examined scenarios and to highlight the relative efficiency and suitability of each model for procedural agent-based narrative generation. The results demonstrate that LLMs can reliably translate high-level scene descriptions into executable agent-based behaviors.


💡 Research Summary

The paper introduces a complete pipeline that leverages large language models (LLMs) to automatically generate and execute agent‑based narratives from high‑level scene descriptions. The system is built around a visual scene editor where users can drag and drop agents and objects. As soon as an entity is placed, the editor automatically creates a structured JSON metadata record containing an entity ID, role tags, and a list of affordances (possible interactions). This metadata provides the semantic context needed for the LLM to understand the scene without manual scripting.
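The metadata record described above can be sketched as follows. This is a minimal illustration, not the paper's actual schema: the field names (`id`, `roles`, `affordances`) and the helper `make_entity_record` are assumptions based on the description of an entity ID, role tags, and a list of affordances.

```python
import json

# Hypothetical sketch of the per-entity metadata record the editor creates
# when an entity is dropped into the scene. Field names are illustrative.
def make_entity_record(entity_id, roles, affordances):
    """Build a structured metadata record for one scene entity."""
    return {
        "id": entity_id,              # unique entity ID
        "roles": roles,               # role tags, e.g. ["agent", "customer"]
        "affordances": affordances,   # list of possible interactions
    }

record = make_entity_record(
    "agent_01",
    ["agent", "customer"],
    ["walk_to", "pick_up", "talk_to"],
)
print(json.dumps(record, indent=2))
```

Serializing to JSON makes the record easy to embed directly into the natural-language prompt sent to the LLM.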

A prompt generator then transforms the metadata into a two‑part natural‑language prompt. The first part describes the overall scene, while the second part explicitly requests a sequence of actions, including timestamps, agents, actions, and targets. The prompt enforces a strict, line‑based output format in which each line specifies a timestamp, an agent, an action, and a target.
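A two‑part prompt of this kind could be assembled as below. The exact wording and the pipe‑delimited line format are assumptions; the paper specifies only that the prompt describes the scene and then requests timestamped actions in a strict format.

```python
def build_prompt(entities):
    """Serialize scene metadata into a two-part natural-language prompt.

    `entities` is a list of metadata dicts with "id", "roles", and
    "affordances" keys (an assumed schema, for illustration).
    """
    # Part 1: describe the overall scene from the entity metadata.
    scene_lines = [
        f"- {e['id']} (roles: {', '.join(e['roles'])}; "
        f"can: {', '.join(e['affordances'])})"
        for e in entities
    ]
    scene = "The scene contains:\n" + "\n".join(scene_lines)

    # Part 2: request a strictly formatted action sequence.
    request = (
        "Produce a sequence of actions, one per line, exactly in the form:\n"
        "<timestamp> | <agent_id> | <action> | <target_id>"
    )
    return scene + "\n\n" + request
```

Enforcing a rigid one‑action‑per‑line format is what makes the downstream parsing step tractable with regular expressions.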

A custom parser, built with regular expressions and a finite‑state machine, interprets the returned string and converts each line into an Action object. These objects are placed into a priority queue that respects temporal ordering and dependency constraints. A scheduler then dispatches the actions to the appropriate modules: animation controllers, physics engine, and object state managers. The system also includes a conflict‑resolution layer that checks for simultaneous interactions and adjusts timings to preserve physical plausibility.
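The parse‑and‑order step can be sketched as follows. This is a simplified illustration of the regex parsing and priority‑queue ordering described above, assuming a pipe‑delimited line format; the paper's actual parser also uses a finite‑state machine and dependency constraints, which are omitted here.

```python
import heapq
import re
from dataclasses import dataclass, field

# Assumed line format: "<timestamp> | <agent> | <action> | <target>"
ACTION_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*\|\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\s*$"
)

@dataclass(order=True)
class Action:
    # Only the timestamp participates in ordering comparisons.
    timestamp: float
    agent: str = field(compare=False)
    name: str = field(compare=False)
    target: str = field(compare=False)

def parse_actions(text):
    """Parse the LLM response into Action objects in temporal order.

    Malformed lines are skipped; a stricter parser would report them.
    """
    heap = []
    for line in text.splitlines():
        m = ACTION_RE.match(line)
        if m:
            t, agent, name, target = m.groups()
            heapq.heappush(heap, Action(float(t), agent, name, target))
    # Pop in timestamp order for the scheduler to dispatch.
    return [heapq.heappop(heap) for _ in range(len(heap))]
```

A scheduler would then walk this ordered list and dispatch each `Action` to the matching animation, physics, or object‑state module.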

To evaluate the approach, the authors tested four lightweight LLMs—GPT‑Neo‑125M, LLaMA‑7B, Falcon‑7B, and Mistral‑7B‑Instruct—across three complexity levels: simple (2 agents, 1 object), medium (4 agents, 3 objects), and complex (8 agents, 5 objects). For each configuration they measured prompt‑to‑response latency, parsing success rate, and narrative consistency (human rating on a 5‑point Likert scale). All models responded within 0.8–2.3 seconds, satisfying real‑time interaction requirements. Mistral‑7B‑Instruct achieved the highest consistency score (4.7/5) and the lowest error rate (3 %). Latency grew linearly with scene complexity, but remained well below the 3‑second threshold for interactive applications. Parsing succeeded in over 96 % of trials, confirming that the structured output format effectively eliminates ambiguity.
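A measurement harness for the latency and parsing‑success metrics could look like the sketch below. The function names `query_llm` and `parse` are placeholders supplied by the caller; this is not the authors' evaluation code.

```python
import time

def benchmark(query_llm, prompts, parse):
    """Measure prompt-to-response latency and parsing success rate.

    `query_llm(prompt) -> str` and `parse(response) -> truthy` are
    caller-supplied callables (illustrative names).
    """
    latencies, successes = [], 0
    for prompt in prompts:
        start = time.perf_counter()
        response = query_llm(prompt)
        latencies.append(time.perf_counter() - start)
        try:
            if parse(response):
                successes += 1
        except ValueError:
            pass  # count a parse failure, keep benchmarking
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "parse_success_rate": successes / len(prompts),
    }
```

Running such a harness per model and per complexity level yields exactly the latency and success‑rate comparisons reported above.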

The study identifies several limitations. The metadata schema is fixed; extending it to new actions or object types requires manual schema updates. Long prompts can approach token limits of the chosen LLMs, potentially truncating essential information. Moreover, the LLM‑generated actions sometimes conflict with the physics engine, necessitating post‑hoc correction logic. The authors propose future work on dynamic schema expansion, multi‑turn conversational prompting, and reinforcement‑learning‑based fine‑tuning to improve alignment between generated narratives and physical simulation.

In conclusion, the research demonstrates that even relatively small LLMs can reliably translate high‑level, user‑friendly scene specifications into executable, temporally coherent agent behaviors. The system enables rapid prototyping of interactive narratives without requiring programmers to write explicit scripts, thereby lowering the barrier for designers and developers to experiment with complex agent interactions in virtual environments.

