
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.22876
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Modern AI systems often comprise multiple learnable components that can be naturally organized as graphs. A central challenge is the end-to-end training of such systems without restrictive architectural or training assumptions. Such tasks fit the theory and approaches of the collaborative Multi-Agent Reinforcement Learning (MARL) field. We introduce Reinforcement Networks, a general framework for MARL that organizes agents as vertices in a directed acyclic graph (DAG). This structure extends hierarchical RL to arbitrary DAGs, enabling flexible credit assignment and scalable coordination while avoiding strict topologies, fully centralized training, and other limitations of current approaches. We formalize training and inference methods for the Reinforcement Networks framework and connect it to the LevelEnv concept to support reproducible construction, training, and evaluation. We demonstrate the effectiveness of our approach on several collaborative MARL setups by developing Reinforcement Networks models that achieve improved performance over standard MARL baselines. Beyond empirical gains, Reinforcement Networks unify hierarchical, modular, and graph-structured views of MARL, opening a principled path toward designing and training complex multi-agent systems. We conclude with theoretical and practical directions: richer graph morphologies, compositional curricula, and graph-aware exploration. This positions Reinforcement Networks as a foundation for a new line of research in scalable, structured MARL.

Full Content

Modern AI systems increasingly comprise multiple learnable components that must coordinate to solve complex tasks. Prominent examples include large-language-model (LLM) workflows for tool use and orchestration, Retrieval-Augmented Generation (RAG) pipelines, and multi-agent LLM systems. These systems are naturally expressed as graphs of interacting modules: nodes represent learnable components (e.g., retrievers, planners, controllers), and edges encode information flow and control dependencies. This view is well-grounded in the literature on graph-structured learning and computation, where graphs serve as a unifying abstraction for modular architectures and message passing Zhou et al. [2020a], Khemani et al. [2024]. In practice, real-world workflows are frequently modeled as directed acyclic graphs (DAGs), which make dependencies explicit, enable topological scheduling, and support scalable orchestration in scientific and data-intensive computing Verucchi et al. [2023]. Recent LLM systems adopt precisely these ideas: LLM-driven workflow generators and orchestrators (e.g., WorkflowLLM; automated DAG construction for enterprise workflows) formalize complex LLM pipelines as DAGs for reliable execution and optimization Fan et al. [2025], Xu et al. [2024], and DAG-structured plans have been shown to improve task decomposition and concurrency in embodied LLM-agent settings Gao et al. [2024]. RAG, now a standard pattern for knowledge-intensive tasks, is also naturally modular -combining retrievers, rerankers, and generators in a graph that can be adapted and optimized end-to-end Gao et al. [2023], Gupta et al. [2024], Zhao et al. [2024].

From learnable systems to collaborative MARL. Graph-structured AI pipelines consist of interacting modules whose objectives and behaviors naturally motivate viewing them as agents with (partially) aligned goals under partial information and non-stationarity-the core setting of cooperative multi-agent reinforcement learning (MARL). Foundational work in cooperative MARL formalizes centralized training for decentralized execution (CTDE), credit assignment, and coordination as central challenges Amato [2024], Huh and Mohapatra [2024]. Credit-assignment methods such as LICA show that joint optimization is possible without fully centralized execution Zhou et al. [2020b], while hierarchical reinforcement learning (HRL) offers temporal abstraction and multi-level decision-making Sutton et al. [1999]. Together, these insights suggest a unified view: graph-structured, learnable systems as collaborative MARL problems where nodes (modules/agents) coordinate over a computation graph.

Leveled architectures and the LevelEnv abstraction. The LevelEnv abstraction from the TAG framework Paolo et al. [2025] operationalizes this view for decentralized hierarchical MARL by treating each hierarchy level as the “environment” for the level above, standardizing information exchange while preserving loose coupling. This supports deep, heterogeneous hierarchies, improves sample efficiency and performance on standard MARL benchmarks, and avoids rigid two-level manager-worker designs, aligning closely with modern LLM workflows and RAG pipelines Paolo et al. [2025].

Why DAGs for multi-agent systems? Modeling multi-agent, multi-module systems as DAGs enables topological ordering and parallelization of independent subgraphs Verucchi et al. [2023], clarifies credit-assignment paths along directed edges and aligns with hierarchical task/dependency and HTN formulations Georgievski and Aiello [2015], Chen et al. [2021], and avoids deadlocks and circular dependencies in training and execution Verucchi et al. [2023]. DAGs also generalize trees, supporting shared substructures and multi-parent dependencies common in tool-using LLM agents and modular RAG systems Fan et al. [2025], Xu et al. [2024], Gao et al. [2023].

Our research. We build on these observations to introduce Reinforcement Networks, a general framework that organizes collaborating agents as nodes in a DAG. Our formulation unifies hierarchical, modular, and graph-structured views of MARL: it supports flexible credit assignment and scalable coordination without strict topologies or fully centralized training. Moreover, it connects directly to the LevelEnv interface Paolo et al. [2025] for reproducible construction, training, and evaluation. We show that DAGs outperform the tree structures used in TAG for collaborative tasks. We highlight future directions-richer graph structures, compositional curricula, and graph-aware exploration-for scalable, structured MARL.

Multi-Agent Reinforcement Learning Multi-agent systems have seen rapid growth, driven by autocurricula emerging from interacting learning agents and enabling continual improvement [Nguyen et al., 2020, Oroojlooy and Hajinezhad, 2023]. Tooling such as PettingZoo [Terry et al., 2021] and BenchMARL [Bettini et al., 2024] standardizes environments and benchmarks, improving comparability and reproducibility. Independent learning methods treat each agent as solving a partially observable RL problem where others are part of the environment. Examples include IPPO [De Witt et al., 2020], IQL [Tan, 1997], and ISAC [Bettini et al., 2024], extending PPO [Schulman et al., 2017], Q-Learning [Watkins and Dayan, 1992], and SAC [Haarnoja et al., 2018]. Parameter-sharing architectures rely on shared critics or value functions, as in MAPPO [Yu et al., 2022], MASAC [Bettini et al., 2024], and MADDPG [Lowe et al., 2017]. Explicit communication methods enable inter-agent information exchange via consensus schemes [Cassano et al., 2020] or learned communication protocols [Foerster et al., 2016, Jorge et al., 2016], directly addressing coordination. A core challenge in MARL is non-stationarity from simultaneous policy updates, which quickly invalidates replay data [Foerster et al., 2016]. Centralized Training with Decentralized Execution (CTDE) partly mitigates this via shared training components [Oroojlooy and Hajinezhad, 2023], but its constraints limit applicability in lifelong learning settings requiring continuous adaptation.

Hierarchical Reinforcement Learning Hierarchy enables abstraction-based value propagation and temporally and spatially extended behavior, improving exploration and efficiency over flat RL [Hutsebaut-Buysse et al., 2022, Nachum et al., 2019]. Decomposing tasks also reduces computational complexity and promotes sub-problem reuse, accelerating learning. Various approaches instantiate HRL. The Options framework models temporally extended actions via SMDPs ("options") with policies, termination conditions, and initiation sets [Sutton et al., 1999], later trained end-to-end in Option-Critic [Bacon et al., 2017]. Feudal RL instead uses a manager-worker hierarchy, where high-level managers issue intrinsic goals to lower-level workers [Dayan and Hinton, 1992, Vezhnevets et al., 2017]. A core challenge is that changing low-level policies induces non-stationarity in higher-level value estimation, motivating model-based methods such as CSRL [Li et al., 2017]. Combining HRL with MARL adds further non-stationarity through multi-agent dependencies. The TAG framework [Paolo et al., 2025] proposes methods to mitigate these issues but remains closely tied to the TAME formulation [Levin, 2021], constraining architectures to layered digraphs. Our work extends this line by relaxing these structural limitations.

Multi-Agent Reinforcement Learning A MARL setting is defined by an environment with state space $S_{\mathrm{env}}$, agents $\{w_i\}_{i=1}^{k}$, a joint action space $A$ with individual spaces $A_i$, a transition function $p : S_{\mathrm{env}} \times A \to \Delta(S_{\mathrm{env}})$, an initial-state distribution $p_0 \in \Delta(S_{\mathrm{env}})$, and rewards $\{R_i\}_{i=1}^{k}$, where $r^{i}_{\mathrm{env}} : S_{\mathrm{env}} \times A_i \to \mathbb{R}$ is the reward of the $i$-th agent.

Figure 1: Considering agent $w_2$ (colored yellow), its subordinate agents $V^-_2$ are marked blue and the superior agents are marked red. Edges of the DAG represent pairs of agents between which direct communication is present. Red arrows depict the interaction between motors and the real environment.

Reinforcement Networks Consider a directed acyclic graph $G = (V, E)$, where $V = \{w_1, \ldots, w_N\}$ stands for the set of vertices representing the agents, and $E$ is the set of directed edges. We introduce the following notation to formally characterize the local connectivity of each vertex $w_i \in V$:

Here, $I^+_i$ denotes the index list of vertices that have an outgoing edge incident to $w_i$, whereas $I^-_i$ denotes the index list of vertices that have an incoming edge incident to $w_i$. The corresponding cardinalities are designated as $l^+_i := |I^+_i|$ and $l^-_i := |I^-_i|$; where no ambiguity arises, we write $l_i$ for the number of subordinate agents $l^-_i$.

Furthermore, we define the sets of agents associated with these indices as $V^+_i := \{w_j : j \in I^+_i\}$ and $V^-_i := \{w_j : j \in I^-_i\}$.

The sink nodes of the graph, for which $V^-_i = \emptyset$, interact directly with the external environment and consequently receive rewards and observations from it. We denote the set of these agents by $V_0 \subseteq V$ and refer to them as motors.

In this formulation, we refer to a vertex $w_j$ as a superior agent of $w_i$ if there exists an edge $(w_j, w_i) \in E$, and as a subordinate agent if there exists an edge $(w_i, w_j) \in E$. This terminology enables a natural hierarchical interpretation of the directed acyclic structure.

For the traversal along an edge, we use the verb pass, whereas for the reverse direction we use the verb return. Similar to TAG [Paolo et al., 2025], for a given agent $w_i$ the set of subordinate agents $V^-_i$ plays the role of the environment, returning both the reward and the observation in response to the agent's actions.
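
To make the connectivity notation concrete, the following minimal Python sketch (illustrative only, not the authors' code) computes the index lists $I^+_i$, $I^-_i$ and the motor set from an explicit edge list; the seven-vertex example mirrors the 3PPO-style tree used later in the experiments.

```python
# Minimal sketch (illustrative, not the paper's implementation): local connectivity
# of a Reinforcement Network DAG from an edge list. Edge (u, v) means w_u -> w_v,
# i.e. w_u is a superior of w_v and w_v is a subordinate of w_u.
from collections import defaultdict

def local_connectivity(num_vertices, edges):
    I_plus = defaultdict(list)   # I+_i: superiors of w_i (edges ending at w_i)
    I_minus = defaultdict(list)  # I-_i: subordinates of w_i (edges leaving w_i)
    for u, v in edges:
        I_plus[v].append(u)
        I_minus[u].append(v)
    # Motors (sinks): vertices with no subordinate agents, i.e. V-_i is empty.
    motors = [i for i in range(num_vertices) if not I_minus.get(i)]
    return dict(I_plus), dict(I_minus), motors

# Example: a three-level tree (one top controller, two mid-level controllers,
# four motors), roughly the 3PPO hierarchy described in the experiments.
I_plus, I_minus, motors = local_connectivity(
    num_vertices=7,
    edges=[(6, 4), (6, 5), (4, 0), (4, 1), (5, 2), (5, 3)],
)
print(motors)  # -> [0, 1, 2, 3]
```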

Each agent $w_i$ is specified by the following collection of spaces and functions:

• $M_i$ - the message space, i.e., the set of messages that the agent may return to its superior agents.

• $O^-_i$ - the observation space, defined as $O^-_i := \prod_{j \in I^-_i} M_j$, corresponding to the list of messages passed by subordinate agents.

• $A_i$ - the action space used to influence subordinate agents.

• $O^+_i$ - the space of instructions, i.e., the actions passed by superior agents, defined as $O^+_i := \prod_{j \in I^+_i} A_j$.

• $\pi_i : O^-_i \times O^+_i \to \Delta(A_i)$ - the policy, which depends both on the observations obtained from subordinate agents and the instructions provided by superior agents. Introducing the notation $O_i := O^-_i \times O^+_i$, the policy can equivalently be written as $\pi_i : O_i \to \Delta(A_i)$.

• $\varphi_i$ - the communication function. It determines the message describing the current environment state to be transmitted upward, based on the observations and rewards received from subordinate agents. Here, $\mathbb{R}^{l_i}$ represents the vector of rewards returned by subordinate agents. Thus, $\varphi_i : O^-_i \times \mathbb{R}^{l_i} \to M_i$. For source vertices with $V^+_i = \emptyset$, the choice of $\varphi_i$ has no impact on the overall system dynamics.


• $\psi_i : O^-_i \times \mathbb{R}^{l_i} \to \mathbb{R}$ - the proxy-reward function. The value of this function is returned to superior agents as an element of the reward list. For source vertices with $V^+_i = \emptyset$, the choice of $\psi_i$ has no impact on the overall system dynamics.

• $R_i : \mathbb{R}^{l_i} \to \mathbb{R}$ - the aggregation function, which interprets the list of received rewards.

We use the following notation throughout: $o^-_i := (m_j)_{j \in I^-_i} \in O^-_i$ and $o^+_i := (a_j)_{j \in I^+_i} \in O^+_i$ denote the lists of messages and instructions received by $w_i$, where the elements are assumed to be ordered according to the related lists $I^+_i$, $I^-_i$. Finally, $\Delta(A_i)$ denotes the probability simplex over the action space $A_i$.
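
As a reading aid, the per-agent components listed above can be summarized by a small container type. The sketch below is an assumption-laden illustration (field names and signatures are ours, not the paper's): each agent bundles a policy, a communication function, a proxy-reward function, and a reward aggregator.

```python
# Illustrative sketch of the per-agent components: policy pi_i, communication
# function phi_i, proxy-reward function psi_i, and aggregation function R_i.
from dataclasses import dataclass
from typing import Any, Callable, Sequence

Message = Any  # element of M_i
Action = Any   # element of A_i

@dataclass
class AgentNode:
    # pi_i: maps (messages from subordinates, instructions from superiors) to an action
    policy: Callable[[Sequence[Message], Sequence[Action]], Action]
    # phi_i: builds the message returned to superior agents
    communicate: Callable[[Sequence[Message], Sequence[float]], Message]
    # psi_i: builds the scalar proxy reward returned to superior agents
    proxy_reward: Callable[[Sequence[Message], Sequence[float]], float]
    # R_i: aggregates the list of rewards returned by subordinates into one scalar
    aggregate: Callable[[Sequence[float]], float]

# A trivial "identity-like" instantiation for a node with a single subordinate:
node = AgentNode(
    policy=lambda obs, instr: instr[0] if instr else 0,
    communicate=lambda obs, rewards: obs[0],
    proxy_reward=lambda obs, rewards: rewards[0],
    aggregate=lambda rewards: sum(rewards) / max(len(rewards), 1),
)
```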

We describe the exchange of information both among agents and between the environment and the agent system. Figure 2 illustrates the information flow across the entire system, while Figure 3 provides a visualization from the perspective of an individual agent. This mechanism generalizes the standard agent-environment interaction in reinforcement learning to hierarchical multi-agent systems. A single timestep of an agent corresponds to one cycle of interaction, beginning with processing observations, followed by sampling an action, and concluding with receiving a new observation as the environment's response to the chosen action. At each timestep, information flow is decomposed into two phases: upstream inference and downstream inference. Information from the environment, namely observations and rewards, is transmitted to an agent against the direction of outgoing edges, whereas the agent's action is propagated downwards along the edges of the graph. It is important to note that subordinate agents function as an environment for a given agent. For motors, since no subordinate agents are present, the external environment assumes this role.

Sharing experience. During upstream inference, agents' observations and rewards are propagated upward through the graph. The functions $\varphi$ and $\psi$ generate the messages and rewards that are passed to superior agents. This design enables efficient knowledge sharing. At the same time, it allows the environment state to be partially hidden, depending on the recipient's abstraction level or task perspective.

At the beginning of a timestep $t$, agent $w_i$ possesses the information received at the previous step as a response from the environment: the list of messages $o^-_i(t)$ and the list of rewards $\big(r_j(t-1)\big)_{j \in I^-_i}$ returned by its subordinate agents. Given this information, agent $w_i$ generates a message $m_i(t) = \varphi_i\big(o^-_i(t), (r_j(t-1))_{j \in I^-_i}\big)$ and a reward $\psi_i\big(o^-_i(t), (r_j(t-1))_{j \in I^-_i}\big)$ to be passed to its superior agents. These values then become elements of the input received by the agents of $V^+_i$. This process propagates experience upward in the hierarchy, enabling higher-level agents to incorporate information from subordinate agents.

Interaction with the environment. For a given vertex $w_i$, the set $V^-_i$ serves as the environment for this agent. For motors, the external environment is used in place of the empty set of subordinate agents.

At the start of the second phase of step $t$, which is considered in this paragraph, agent $w_i$ possesses the information received at the end of the previous step from the environment and during the previous phase from its superior agents, respectively: the observations $o^-_i(t)$ with rewards $\big(r_j(t-1)\big)_{j \in I^-_i}$, and the instructions $o^+_i(t)$. It then selects an action according to $a_i(t) \sim \pi_i\big(\cdot \mid o^-_i(t), o^+_i(t)\big)$. In response to this action, the environment represented by the set $V^-_i$ returns two vectors: new observations $\big(m_j(t+1)\big)_{j \in I^-_i}$ and rewards $\big(r_j(t)\big)_{j \in I^-_i}$.

All subordinate agents must perform at least one step of execution to generate $m_j(t+1)$ and $r_j(t)$ for all $j \in I^-_i$.

To interpret the resulting list of rewards, we utilize the agent's reward aggregation function $R_i : \mathbb{R}^{l_i} \to \mathbb{R}$. The reward of agent $w_i$ for executing action $a_i(t)$ is then defined as $r_i(t) := R_i\big((r_j(t))_{j \in I^-_i}\big)$. The function $R_i$ may be specified in various ways (e.g., mean, maximum, weighted sum), providing flexibility and opening avenues for further investigation.
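
Since the paper leaves $R_i$ open, here are a few concrete aggregators (mean, maximum, weighted sum) as a sketch; the numbers are arbitrary.

```python
# A few possible choices for the aggregation function R_i (illustrative only).
import numpy as np

def mean_aggregate(rewards):
    return float(np.mean(rewards))

def max_aggregate(rewards):
    return float(np.max(rewards))

def weighted_aggregate(rewards, weights):
    rewards, weights = np.asarray(rewards, float), np.asarray(weights, float)
    return float(rewards @ weights)

subordinate_rewards = [0.5, -1.0, 2.0]
print(mean_aggregate(subordinate_rewards))                       # 0.5
print(max_aggregate(subordinate_rewards))                        # 2.0
print(weighted_aggregate(subordinate_rewards, [0.2, 0.3, 0.5]))  # 0.8
```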

Figure 3: Information flow from the perspective of agent $w_i$. Rounded rectangles indicate $V^+_i$ (red) and $V^-_i$ (blue). Actions propagate downstream (red), while messages and rewards flow upstream (blue). One timestep corresponds to a counterclockwise cycle starting at the upper-right corner.

The system inference begins by sampling an initial state $s_0 \sim p_0$ from the external environment. This state is provided as $o^-_i(0)$ to all motors $w_i \in V_0$. We initialize $r_i(-1) = 0$ for all motors. With these initial values, the motors possess all the information required to perform upstream inference at $t = 0$. Once the motors execute, they trigger a cascade of observations and rewards through the superior agents, which in turn return their directives. Using these signals, the motors interact with the external environment and advance to the next timestep.
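
The following sketch strings the pieces together into one illustrative inference loop over an episode. It assumes agents shaped like the AgentNode sketch above and a hypothetical environment object with reset/step methods; the timing conventions are simplified relative to the formal description.

```python
# Sketch of episode inference in a Reinforcement Network (illustrative, not the
# paper's code). `env` is a hypothetical external environment with
# reset() -> obs and step(actions) -> (obs, rewards, done).
from collections import defaultdict

def run_episode(env, nodes, I_minus, motors, depth, horizon=100):
    """nodes: dict {i: AgentNode-like object}; I_minus[i]: subordinates of i;
    depth[i]: outgoing depth of i (0 for motors)."""
    order_up = sorted(nodes, key=lambda i: depth[i])                  # motors first
    order_down = sorted(nodes, key=lambda i: depth[i], reverse=True)  # sources first

    obs = {i: None for i in nodes}   # o^-_i: messages (env observation for motors)
    rew = {i: 0.0 for i in nodes}    # aggregated reward, initialized to 0
    env_obs = env.reset()
    for i in motors:
        obs[i] = [env_obs]           # the external environment plays the role of V^-_i

    for t in range(horizon):
        # Upstream inference: messages and proxy rewards flow toward the sources.
        messages, proxy = {}, {}
        for i in order_up:
            subs = I_minus.get(i, [])
            if subs:  # non-motor: fresh messages from subordinates form o^-_i
                obs[i] = [messages[j] for j in subs]
                rew[i] = nodes[i].aggregate([proxy[j] for j in subs])
            messages[i] = nodes[i].communicate(obs[i], [rew[i]])
            proxy[i] = nodes[i].proxy_reward(obs[i], [rew[i]])

        # Downstream inference: actions flow toward the motors.
        actions, instructions = {}, defaultdict(list)
        for i in order_down:
            actions[i] = nodes[i].policy(obs[i], instructions[i])
            for j in I_minus.get(i, []):
                instructions[j].append(actions[i])

        # Motors act on the external environment and receive its response.
        env_obs, env_rewards, done = env.step({i: actions[i] for i in motors})
        for i in motors:
            obs[i], rew[i] = [env_obs], env_rewards[i]
        if done:
            break
```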

Multiple time scales. It is worth noting that the hierarchical graph structure naturally allows different agents to operate on distinct though consistent time scales, reflecting the varying reaction speeds of agents at different depths. Specifically, for a single time step of agent $w_i$, one can execute $T$ steps for the subordinate levels ($s = 1, \ldots, T$), producing sequences $(m_j(s))_{s=1}^{T}$, $(r_j(s))_{s=1}^{T}$ for all $j \in I^-_i$. At each internal step, the actions from superior agents $a^+_j(1)$ are reused, keeping the top-level instructions constant throughout the execution of the internal sequence of steps.

The resulting sequences of rewards and observations are then aggregated over time to form the inputs for $w_i$, for example by averaging the messages and summing the rewards over the internal steps.
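
A sketch of this multi-timescale aggregation, assuming vector-valued messages and choosing mean-of-messages and sum-of-rewards as the (non-prescribed) aggregation:

```python
# Sketch: for one step of agent w_i, subordinates run T internal steps under a
# fixed instruction, and their sequences are aggregated over time.
import numpy as np

def run_subordinate(j, instruction, T):
    """Placeholder for T internal steps of subordinate j under a fixed instruction."""
    messages = [np.random.randn(4) for _ in range(T)]        # m_j(1..T), assumed vectors
    rewards = [float(np.random.randn()) for _ in range(T)]   # r_j(1..T)
    return messages, rewards

def aggregate_over_time(subordinates, instruction, T=4):
    obs, rewards = [], []
    for j in subordinates:
        m_seq, r_seq = run_subordinate(j, instruction, T)
        obs.append(np.mean(m_seq, axis=0))   # time-averaged message from subordinate j
        rewards.append(sum(r_seq))           # accumulated reward from subordinate j
    return obs, rewards

obs_i, rewards_i = aggregate_over_time(subordinates=[0, 1], instruction=None, T=4)
```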


We propose using reinforcement learning to train both the agents' policies and their communication and proxy-reward functions. Accordingly, the following Markov Decision Processes (MDPs) are defined for each agent. In what follows, we take agent $w_i$ as a reference.

Each component of the agent -policy, communication, and proxy-reward function -can be formalized as a separate MDP. This formulation allows us to treat learning each component consistently within the reinforcement learning framework, while capturing the interactions among them.

Policy MDP. The MDP for the agent's policy is defined as $\langle O_i, A_i, p_{\pi_i}, p^R_{\pi_i}, p^0_{\pi_i} \rangle$, where:

• $O_i = O^-_i \times O^+_i$ is the state space.

• $A_i$ is the action space.

• $p^R_{\pi_i} : O_i \times A_i \to \Delta(\mathbb{R})$ is the reward distribution, determined by $R_i$, subordinate agents, and the external environment.

• $p^0_{\pi_i} \in \Delta(O_i)$ is the initial state distribution, derived from the external environment and the communication functions of subordinate agents.

A trajectory for learning the policy of agent $w_i$ is the standard sequence of states, actions, and rewards $\big(o_i(0), a_i(0), r_i(0), o_i(1), a_i(1), r_i(1), \ldots\big)$ collected in this MDP.

Communication MDP. The MDP for the agent's communication function is defined analogously as $\langle D_i, M_i, p_{\varphi_i}, p^R_{\varphi_i}, p^0_{\varphi_i} \rangle$, with:

• $D_i$ as the state space and $M_i$ as the action space;

• $p^R_{\varphi_i} : D_i \times M_i \to \Delta(\mathbb{R})$ as the reward distribution, determined by $R_i$, proxy-reward functions, and the external environment.

• $p^0_{\varphi_i} \in \Delta(D_i)$ as the initial state distribution, derived from the external environment and subordinate agents' communication functions.

A trajectory for learning the communication function of agent $w_i$ is defined analogously, over states in $D_i$ and actions (messages) in $M_i$.

Proxy-Reward Function. Finally, for learning the proxy-reward function, the trajectory is defined in the same way, with $D_i$ as the state space and $\mathbb{R}$ as the action space.

This MDP-based formulation enables the use of various RL and MARL algorithms for learning each component while accounting for interactions between the agent's policy, communication, and proxy-reward function.
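
In practice, this separation can be reflected by keeping one transition buffer per component, so that each MDP is trained by an off-the-shelf RL algorithm. A minimal sketch, with illustrative field names:

```python
# Sketch: three separate transition buffers per agent, one for each MDP defined
# above (policy, communication function, proxy-reward function).
from collections import namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

class AgentBuffers:
    """One replay buffer per learnable component of a single agent."""
    def __init__(self):
        self.policy = []         # states in O_i, actions in A_i
        self.communication = []  # states in D_i, actions in M_i
        self.proxy_reward = []   # states in D_i, actions in R

    def add(self, component, **fields):
        getattr(self, component).append(Transition(**fields))

buffers = AgentBuffers()
buffers.add("policy", state="o_i(0)", action="a_i(0)", reward=1.0,
            next_state="o_i(1)", done=False)
```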

In the previous section, we formulated the Reinforcement Networks framework and described the inference procedure of the model. We now present one possible approach -among the wide variety of potential implementations -for realizing this framework. Our method builds upon the LevelEnv framework introduced in [Paolo et al., 2025].


The notion of the outgoing depth of a vertex in the graph $G$ is defined by induction. First, for each sink vertex $w_s$, we set $d_s := 0$. Next, consider a vertex $w_i$. Assume that the outgoing depths $d_j$ have already been defined for all $w_j \in V^-_i$. We then define $d_i := \max_{j \in I^-_i} d_j + 1$. In other words, the outgoing depth $d_i$ corresponds to the maximum length of a directed path from $w_i$ to any sink of the acyclic graph $G$.
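
A direct transcription of this inductive definition (illustrative code, reusing the index-list convention from the earlier sketch):

```python
# Outgoing depth: 0 for sinks, otherwise one plus the maximum depth over subordinates.
from functools import lru_cache

def outgoing_depths(vertices, I_minus):
    """I_minus[i] lists the subordinates of vertex i (targets of its outgoing edges)."""
    @lru_cache(maxsize=None)
    def depth(i):
        subs = I_minus.get(i, [])
        return 0 if not subs else 1 + max(depth(j) for j in subs)
    return {i: depth(i) for i in vertices}

# 3PPO-style tree: vertex 6 supervises 4 and 5, which supervise the motors 0..3.
print(outgoing_depths(range(7), {6: [4, 5], 4: [0, 1], 5: [2, 3]}))
# -> {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 2}
```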

We now introduce the notion of an identity node. Such nodes effectively act as transparent intermediaries within the graph, simply forwarding information and actions without modification. Formally, an identity node $w_n$ is a vertex satisfying the following conditions:

• the communication and proxy-reward functions simply forward the observations and reward from the subordinate agent;

• the policy is deterministic and directly replicates the actions received from superior agents;

• $M_n = O^-_n$ and $A_n = A^+_n$, so that the message and action spaces coincide with the observation and instruction spaces, respectively.

We define the edge expansion (by one node) operation for an edge $(w_i, w_j) \in E$ within a graph $G = (V, E)$, where $G \in \mathcal{G}$ and $\mathcal{G}$ denotes the set of DAGs. Formally, let $N = |V|$ and define a new identity node $w_{N+1}$ such that $V' = V \cup \{w_{N+1}\}$, and the edge set is modified as $E' = \big(E \setminus \{(w_i, w_j)\}\big) \cup \{(w_i, w_{N+1}), (w_{N+1}, w_j)\}$.

The expansion of an edge $(w_i, w_j)$ by $k \geqslant 2$ nodes is defined recursively as a sequence of single-node expansions. Specifically, we first expand the edge $(w_i, w_j)$ with the identity node $w_{N+1}$; then, at each step $t = 1, \ldots, k-1$, the edge $(w_{N+t}, w_j)$ is expanded by inserting an identity node.

Namely, the operation inserts a new vertex $w_{N+1}$ along the edge $(w_i, w_j)$ that merely forwards received messages and actions without modification. See Figure 4 for an illustration.
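
A possible implementation sketch of edge expansion on an index-based graph representation (our own encoding, not the paper's):

```python
# Sketch: insert k identity nodes on the edge (i, j), forming the chain
# i -> new_1 -> ... -> new_k -> j.
def expand_edge(vertices, edges, i, j, k=1):
    """Expand the edge (i, j) by k identity nodes; returns new vertex and edge sets."""
    assert (i, j) in edges and k >= 1
    vertices, edges = list(vertices), set(edges)
    edges.remove((i, j))
    prev = i
    for _ in range(k):
        new = len(vertices)          # index of the freshly added identity node
        vertices.append(new)
        edges.add((prev, new))
        prev = new
    edges.add((prev, j))
    return vertices, edges

V, E = expand_edge(range(3), {(0, 1), (1, 2), (0, 2)}, 0, 2, k=1)
print(sorted(E))  # the edge (0, 2) is replaced by the chain (0, 3), (3, 2)
```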

A layered digraph is a directed acyclic graph (DAG) in which vertices are assigned to discrete layers such that every edge connects a vertex on one layer to a vertex on the immediately lower layer. Any DAG can be transformed into a functionally equivalent layered digraph. To achieve this, it suffices to apply the edge expansion operation to every edge $(w_i, w_j) \in E$ for which $d_i - d_j > 1$, expanding it by $d_i - d_j - 1$ identity nodes.

The partitioning of a directed acyclic graph into layers can be accomplished using algorithms such as the Longest Path algorithm, the Coffman-Graham algorithm, or the ILP formulation of Gansner et al. (for an overview, see [Healy and Nikolov, 2001]). In this context, the edge expansion operation serves as an instrument to convert any DAG into a layered form, ensuring that each edge connects vertices on consecutive layers. The aforementioned algorithms can then be applied to determine an optimal placement of vertices across the layers, minimizing the total width or other layout-related objectives. For the application below, however, all graph sinks are required to be placed on the same layer.
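
Combining outgoing depths with edge expansion yields a simple layering routine; the sketch below assigns each vertex to the layer given by its outgoing depth and inserts identity nodes on edges that would otherwise skip layers (one plausible realization, not the paper's exact algorithm):

```python
# Sketch: convert a DAG into a layered digraph by expanding every edge whose
# endpoints differ in outgoing depth by more than one.
def layer_dag(num_vertices, edges):
    subs = {}
    for a, b in edges:
        subs.setdefault(a, []).append(b)

    def depth(i, memo={}):
        if i not in memo:
            memo[i] = 0 if i not in subs else 1 + max(depth(j) for j in subs[i])
        return memo[i]

    vertices, new_edges = list(range(num_vertices)), set()
    for a, b in edges:
        gap = depth(a) - depth(b)
        prev = a
        for _ in range(max(gap - 1, 0)):   # insert gap-1 identity nodes
            node = len(vertices)
            vertices.append(node)
            new_edges.add((prev, node))
            prev = node
        new_edges.add((prev, b))
    return vertices, new_edges

# A skip edge 2 -> 0 over the chain 2 -> 1 -> 0 gets one identity node inserted.
print(layer_dag(3, {(2, 1), (1, 0), (2, 0)}))
```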


Transforming the DAG into a layered digraph enables a direct implementation of the LevelEnv abstraction [Paolo et al., 2025], where each layer acts as an environment for the superior level while functioning as a set of agents with respect to the subordinate one.

At layer $L_l$, each agent selects its action based on observations and rewards received from subordinate agents in $V^-_i$ together with directives provided by superior agents in $V^+_i$. The individual actions are then assembled into a joint action vector and passed to $L_{l-1}$, where masking ensures that each subordinate agent processes only the relevant components. Using its local observations and rewards, each subordinate agent produces outputs via the communication function $\varphi$ (messages) and the function $\psi$ (rewards). These outputs are aggregated into layer-wide vectors and returned upward. At $L_l$, masking redistributes the returned signals to the appropriate agents, after which $\varphi$ and $\psi$ are applied again to produce the aggregated outputs for the superior layer.

Thus, layers exchange information exclusively through unified vectors, while masking enforces the graph structure by restricting each agent to signals from its designated neighbors. This design simultaneously preserves agent-level independence and allows each layer to operate on its own temporal and spatial scale.
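
A small sketch of the masking mechanism, assuming a boolean connectivity matrix between two adjacent layers and purely numeric actions and messages for illustration:

```python
# Sketch of layer-wise exchange with masking: one joint vector per direction,
# with boolean masks restricting each agent to its designated neighbors.
import numpy as np

def build_mask(num_upper, num_lower, edges):
    """mask[u, l] = True iff upper-layer agent u is connected to lower-layer agent l."""
    mask = np.zeros((num_upper, num_lower), dtype=bool)
    for u, l in edges:
        mask[u, l] = True
    return mask

mask = build_mask(2, 4, edges=[(0, 0), (0, 1), (1, 2), (1, 3)])

# Downstream: joint action vector from the upper layer, one action per upper agent.
joint_actions = np.array([10.0, 20.0])
for l in range(4):
    relevant = joint_actions[mask[:, l]]   # directives addressed to lower agent l
    print(f"lower agent {l} receives {relevant}")

# Upstream: messages from the lower layer are redistributed to upper agents.
joint_messages = np.array([1.0, 2.0, 3.0, 4.0])
for u in range(2):
    print(f"upper agent {u} observes {joint_messages[mask[u]]}")
```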

To assess the benefits of Reinforcement Networks relative to TAG-based systems with policies trained at each vertex, we perform experiments on two MARL tasks: MPE Simple Spread and a discrete version of VMAS Balance. All experiments use four agents. We compare our method with two baselines: IPPO De Witt et al. [2020] and 3PPO Paolo et al. [2025]. IPPO corresponds to a four-node DAG with no edges, in which each agent is trained independently via PPO; this architecture is expressible within the Reinforcement Network framework. 3PPO comprises a three-level hierarchy: four motor nodes, two mid-level controllers (each supervising two motors), and a single top-level controller. This structure forms a full binary tree of height three and can also be encoded as a Reinforcement Network.

Our proposed bridged-3PPO architecture augments 3PPO with skip-connections linking the top and bottom-level nodes. Representing these skip-connections in TAG requires introducing identity vertices. For all multi-level configurations, higher layers act only every two timesteps. The reward-proxy function is instantiated as an averaging operator, and communication is implemented via the identity map. We further evaluate bridged-3PPO-comm, a variant in which the communication function ฯ• is learned using the autoencoder training method from [Paolo et al., 2025]. Training curves are reported in Figure 5. Results are averaged over five random seeds.

In the MPE Simple Spread environment, agents must coordinate in a 2D plane to cover all landmarks while avoiding collisions. This task requires cooperative spatial reasoning and benefits strongly from structured information flow.

IPPO serves as a non-hierarchical baseline: although each agent observes the relative positions of others, the absence of a coordinating structure yields unstable and low-reward behavior. All hierarchical variants substantially outperform IPPO, demonstrating the importance of multi-level reasoning even with simple directive and communication mechanisms.

Across hierarchical methods, bridged-3PPO exhibits a brief warm-up dip and slightly increased variance relative to 3PPO. This effect is expected: adding bridges increases the state and directive dimensionality at higher levels, temporarily complicating credit assignment early in training. However, bridged-3PPO consistently converges faster, reaching its performance plateau in fewer episodes than 3PPO. This suggests that skip-connections provide useful shortcuts for information flow, allowing high-level agents to react more quickly to bottom-level dynamics.

Introducing learned communication in bridged-3PPO-comm further reduces the warm-up phase by compressing observations into a learned latent representation. While the additional learning burden slightly slows the final convergence rate, the improved early-training stability indicates that learned feature extraction can mitigate part of the state explosion introduced by bridging.

In the VMAS Balance task, a team of agents must cooperatively stabilize and lift a spherical package supported by a line under gravity, delivering it to a goal region. All agents receive an identical reward based on changes in the package-goal distance, and incur a large penalty if the package or the line touches the floor. The task is highly coordination-sensitive, as even a single misaligned action can destabilize the system. Surprisingly, IPPO reaches higher final returns than 3PPO, indicating that excessive hierarchical abstraction may restrict the fine-grained corrective behavior needed for this domain. However, bridged-3PPO closes the performance gap and reaches the same reward plateau using only about two-thirds of the interaction steps required by IPPO, highlighting a sample-efficiency benefit of bridging even when pure hierarchical decomposition underperforms.

The communication-enabled variant bridged-3PPO-comm further compresses each agent’s observation space, which reduces the training burden on higher-level agents without degrading final performance.

The most robust effect across both variants is a marked reduction in reward variance (see Fig. 5). While 3PPO shows oscillatory learning, both bridged architectures exhibit substantially smoother reward trajectories.¹ Lower variance indicates that links across non-adjacent layers help stabilize training.

Overall, even when bridges do not improve peak performance, they consistently improve stability and sample efficiency, two properties critical for scaling cooperative multi-agent systems. These findings support our hypothesis that structured skip-connections across hierarchical levels are a helpful architectural feature for hierarchical MARL.

The research was carried out using the MSU-270 supercomputer of Lomonosov Moscow State University.

In this work, we propose a novel, flexible, and scalable approach to constructing solutions for collaborative MARL tasks. Our main contribution is a unified framework for hierarchical MARL that generalizes a wide range of existing models and substantially simplifies the use of DAG-structured hierarchies compared to the TAG formalism. Reinforcement Networks introduce additional degrees of freedom in inter-agent connectivity while preserving well-defined intra-agent MDPs. We argue that connections across vertices at different depths can improve both training stability and sample efficiency. This framework also opens multiple avenues for advancing the theory and practical design of Reinforcement Networks.

Below, we outline several promising avenues for future work.

Optimal topology construction. Developing algorithms and methods to identify optimal DAG topologies tailored to specific MARL tasks is a key challenge. Progress in this area would support both researchers and practitioners in designing more effective systems.

Proxy-reward and communication function training. Advancing the theory and practice of training proxy-reward and communication functions is, in our view, one of the most promising directions for enhancing both the robustness and performance of our approach.

LLM-based agents. Another important direction is to investigate how our methods can be applied to the tuning of LLM-based agents. This raises a number of open questions, ranging from the design of communication and proxy-reward functions to optimizing training and inference efficiency.

¹ Reward curves are reproduced from the training logs.
