A Multi-Agent Reinforcement Learning Embedding Framework for Embedded Safety-Aligned Intelligence


📝 Abstract

We introduce Embedded Safety-Aligned Intelligence (ESAI), a theoretical framework for multi-agent reinforcement learning that embeds alignment constraints directly into agents’ internal representations via differentiable internal alignment embeddings (IAE). Unlike external reward shaping or post-hoc safety constraints, IAE are learned latent variables that predict externalized harm through counterfactual reasoning and modulate policy gradients toward harm reduction via attention gating and graph diffusion. We formalize the ESAI framework through four integrated mechanisms: (1) differentiable counterfactual alignment penalties computed via softmin reference distributions, (2) IAE-weighted attention biasing perceptual salience toward alignment-relevant features, (3) Hebbian affect-memory coupling supporting temporal credit assignment, and (4) similarity-weighted graph diffusion with bias-mitigation controls. We derive conditions for bounded internal embeddings under Lipschitz constraints and spectral radius bounds, analyze computational complexity as O(Nkd) for N agents with k-dimensional embeddings, and discuss theoretical properties including contraction dynamics and fairness-performance tradeoffs. This work positions ESAI as a conceptual contribution to differentiable alignment mechanisms in multi-agent systems. We identify open theoretical questions regarding convergence guarantees, optimal embedding dimensionality, and extension to high-dimensional state spaces. Empirical validation remains future work.


📄 Content

Contemporary multi-agent reinforcement learning (MARL) optimizes explicit task objectives but typically lacks internal differentiable regulators that encourage prosocial behavior and stable coordination under distribution shift. Standard approaches to alignment, such as reward shaping (Ng et al., 1999), constrained optimization (Achiam et al., 2017), or inverse RL (Hadfield-Menell et al., 2016), rely on external supervision signals that are either hand-designed, require human preference data, or operate as non-differentiable constraints decoupled from policy learning dynamics.

We investigate whether agents can instead learn an internal alignment embedding (IAE): a differentiable latent variable that tracks predicted externalized harm and shapes policy gradients toward harm reduction through three key properties:

  1. Predictive: IAE forecasts alignment-relevant outcomes via counterfactual reasoning over candidate actions.

  2. Regulatory: IAE modulates perception and action selection through attention gating and gradient coupling.

  3. Distributed: IAE propagates across agent neighborhoods via graph diffusion with controllable similarity weighting.
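As a minimal sketch of the first two properties, the following numpy snippet shows one IAE update step in which the embedding both forecasts harm from percepts and gates perceptual salience. All names (`W_pred`, `W_gate`, `iae_step`) and the random stand-ins for trained weights are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 4, 8          # embedding dim k, feature dim d (illustrative sizes)

# Hypothetical learned parameters (random stand-ins for trained weights).
W_pred = rng.normal(scale=0.1, size=(k, d))   # maps percepts to a predicted-harm embedding
W_gate = rng.normal(scale=0.1, size=(d, k))   # maps IAE to attention gates over features

def iae_step(z, x):
    """One IAE update: gate perception by z, then move z toward a harm forecast."""
    # Regulatory: IAE biases perceptual salience via a soft attention gate.
    gate = 1.0 / (1.0 + np.exp(-(W_gate @ z)))    # sigmoid gate in [0, 1]^d
    x_gated = gate * x
    # Predictive: embedding updated from gated percepts; tanh keeps z bounded.
    z_next = np.tanh(z + W_pred @ x_gated)
    return z_next, x_gated

z = np.zeros(k)                  # initial alignment embedding
x = rng.normal(size=d)           # one percept
z, x_gated = iae_step(z, x)
print(z.shape, x_gated.shape)    # (4,) (8,)
```

The tanh saturation is one simple way to keep the embedding bounded, consistent with the boundedness conditions the abstract derives under Lipschitz and spectral constraints.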

Graph-based MARL architectures have demonstrated improved performance through learned relational representations. Wang et al. (2021) propose QPLEX with duplex dueling and attention mechanisms for value factorization, demonstrating improved stability. Liu et al. (2023) develop NA2Q using neural attention additive models for interpretable multi-agent Q-learning, showing that attention can provide both performance and interpretability.

However, existing work does not integrate graph mechanisms with internal alignment states that track predicted harm.

ESAI integrates graph diffusion with IAE dynamics, enabling similarity-weighted propagation of alignment-relevant information. The diffusion operator propagates IAE across neighborhoods, while similarity weighting modulates edge strengths based on learned agent identities. The bias-mitigation regularizer provides interpretable fairness-performance tradeoffs not present in standard graph communication architectures.
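A minimal sketch of one similarity-weighted diffusion step over stacked agent embeddings, assuming edge strengths come from a softmax over identity-vector similarities (the identity vectors and the diffusion rate `alpha` are illustrative stand-ins, not the paper's parameterization):

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 5, 4                        # N agents, k-dim embeddings

Z = rng.normal(size=(N, k))        # stacked IAE, one row per agent
ids = rng.normal(size=(N, 3))      # learned agent-identity vectors (stand-ins)

# Similarity-weighted adjacency: edge strength from identity similarity.
sim = ids @ ids.T
np.fill_diagonal(sim, -np.inf)     # exclude self-edges before normalization
W = np.exp(sim)
A = W / W.sum(axis=1, keepdims=True)   # row-stochastic neighbor weights

alpha = 0.3                        # diffusion rate; alpha < 1 mixes rather than replaces
Z_next = (1 - alpha) * Z + alpha * A @ Z   # one diffusion step over neighborhoods

print(Z_next.shape)                # (5, 4)
```

With a dense graph this step costs O(N^2 k); restricting A to sparse neighborhoods is one way to recover near-linear scaling in N, in the spirit of the O(Nkd) complexity the abstract states.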

Differentiable memory systems such as Neural Turing Machines (Graves et al., 2014) enable learned read/write operations for temporal credit assignment, showing that memory mechanisms can be trained end-to-end via gradient descent. Miconi et al. (2018) demonstrate that backpropagation can train differentiable Hebbian plasticity rules that adapt network weights during deployment, showing benefits for continual learning and adaptation. This work proves that associative learning rules can be optimized jointly with network parameters.

However, no existing work couples Hebbian learning with alignment-specific internal states or uses Hebbian traces to support counterfactual forecasting of harm. Standard memory architectures (attention over episodic buffers, recurrent states) do not encode associative traces between internal alignment embeddings and perceptual features.

ESAI couples Hebbian traces to IAE dynamics via differentiable read operations, enabling alignment-aware memory updates that support counterfactual forecasting. The Hebbian matrix encodes coactivation history of IAE and percepts, providing context for harm prediction while preserving end-to-end trainability.
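The coupling described above can be sketched as a decaying outer-product trace between the embedding and percepts, with a percept-keyed read feeding back into the embedding update. The hyperparameters (`eta`, `lam`, the 0.1 read gain) are assumed values for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
k, d = 4, 8
eta, lam = 0.1, 0.95               # Hebbian learning rate and trace decay (assumed)

H = np.zeros((k, d))               # Hebbian trace: IAE-percept co-activation history
z = np.tanh(rng.normal(size=k))    # current alignment embedding
for _ in range(10):
    x = rng.normal(size=d)                 # percept at this step
    H = lam * H + eta * np.outer(z, x)     # decaying outer-product Hebbian update
    context = H @ x                        # differentiable read: percept-keyed recall
    z = np.tanh(z + 0.1 * context)         # memory-conditioned embedding update

print(H.shape, context.shape)              # (4, 8) (4,)
```

Because both the trace update and the read are smooth functions of z and x, gradients can flow through the whole loop, which is what preserves end-to-end trainability.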

Recent work explores using vision-language models to provide semantic guidance for reinforcement learning. Wu et al. (2025) propose using VLMs as action advisors for online RL, demonstrating that pre-trained models can provide interpretable action suggestions. However, this approach requires external model queries at each timestep rather than learning intrinsic alignment representations embedded in the agent’s parameters.

ESAI differs by learning alignment representations end-to-end without requiring external model queries, pre-trained components, or oracle access to semantic models during deployment.

Potential-based reward shaping (Ng et al., 1999) adds γΦ(s_{t+1}) − Φ(s_t) to rewards, preserving optimal policies under certain conditions and providing theoretical guarantees.
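For contrast with ESAI's learned dynamics, here is a minimal sketch of the hand-designed baseline: the shaping bonus F(s, a, s') = γΦ(s') − Φ(s) from Ng et al. (1999), using an illustrative distance-to-goal potential on a 1-D state:

```python
gamma = 0.99

def phi(s):
    """Hand-designed potential; negative distance to a goal at s = 10 (illustrative)."""
    return -abs(s - 10.0)

def shaped_reward(r, s, s_next):
    # Ng et al. (1999): adding gamma * Phi(s') - Phi(s) preserves optimal policies.
    return r + gamma * phi(s_next) - phi(s)

# Moving toward the goal (s = 4 -> s = 5) earns a positive shaping bonus.
print(shaped_reward(0.0, 4.0, 5.0))   # 0.99 * (-5) - (-6) = 1.05
```

Note that Φ is fixed by the designer; the contrast drawn below is that ESAI replaces this hand-designed function with learned internal dynamics.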

ESAI differs in three fundamental ways:

  1. Learned dynamics: IAE evolves via a learned function g_φ and graph diffusion operator L, not hand-designed potential functions. This enables adaptation to environment-specific harm structures without manual redesign.

  2. Counterfactual supervision: IAE is trained to predict harm outcomes via counterfactual forecasting, not approximate task value. This separates alignment objectives from task rewards.

  3. Perceptual gating: IAE modulates perception via attention, enabling non-Markovian salience modulation unavailable to potential-based methods that only augment rewards.

ESAI trades the theoretical guarantees of potential-based shaping (policy invariance) for adaptive capacity and richer internal dynamics, though this tradeoff requires empirical investigation.
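The counterfactual supervision in point 2, together with the softmin reference distributions named in the abstract, can be sketched as follows: candidate actions are weighted by a softmin over their predicted harm, yielding a differentiable expected-harm penalty. The harm values, temperature `tau`, and function names are illustrative assumptions:

```python
import numpy as np

def softmin_weights(h, tau=1.0):
    """Soft reference distribution concentrating probability on low-harm candidates."""
    w = np.exp(-(h - h.min()) / tau)   # subtract min for numerical stability
    return w / w.sum()

# Hypothetical predicted harm for 4 counterfactual candidate actions.
h = np.array([0.2, 1.5, 0.9, 3.0])
probs = softmin_weights(h, tau=0.5)

# Differentiable penalty: expected harm under the softmin reference distribution;
# gradients flow through both h and the weights.
penalty = float(probs @ h)
print(round(penalty, 3))               # 0.411
```

As tau → 0 the distribution concentrates on the least harmful candidate (a hard min), while larger tau spreads weight more evenly; the smooth version is what makes the penalty usable inside gradient-based policy updates.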

While individual components exist in isolation, such as counterfactual reasoning for credit assignment (Foerster et al., 2018; Zhou

This content is AI-processed based on ArXiv data.
