Dynamic Alignment for Collective Agency: Toward a Scalable Self-Improving Framework for Open-Ended LLM Alignment

February 23, 2026

Reading time: 6 minute

...

📝 Original Info

Title: Dynamic Alignment for Collective Agency: Toward a Scalable Self-Improving Framework for Open-Ended LLM Alignment
ArXiv ID: 2512.05464
Date: 2025-12-05
Authors: Researchers from original ArXiv paper

📝 Abstract

Large Language Models (LLMs) are typically aligned with human values using preference data or predefined principles such as helpfulness, honesty, and harmlessness. However, as AI systems progress toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI), such value systems may become insufficient. In addition, human feedback-based alignment remains resource-intensive and difficult to scale. While AI-feedback-based self-improving alignment methods have been explored as a scalable alternative, they have largely remained constrained to conventional alignment values. In this work, we explore both a more holistic alignment objective and a scalable, self-improving alignment approach. Aiming to transcend conventional alignment norms, we introduce Collective Agency (CA)-a unified and open-ended alignment value that encourages integrated agentic capabilities. We also propose Dynamic Alignment-an alignment framework that enables an LLM to iteratively align itself. Dynamic Alignment comprises two key components: (1) automated training dataset generation with LLMs, and (2) a self-rewarding mechanism, where the policy model evaluates its own output candidates and assigns rewards for GRPO-based learning. Experimental results demonstrate that our approach successfully aligns the model to CA while preserving general NLP capabilities.

💡 Deep Analysis

Deep Dive into Dynamic Alignment for Collective Agency: Toward a Scalable Self-Improving Framework for Open-Ended LLM Alignment.

📄 Full Content

Dynamic Alignment for Collective Agency: Toward a Scalable Self-Improving Framework for Open-Ended LLM Alignment Panatchakorn Anantaprayoon1, Nataliia Babina1,2*, Jad Tarifi1, Nima Asgharbeygi1 1Integral AI 2The University of Tokyo {panatchakorn, nataliia, jad, nima}@integral.ai Abstract Large Language Models (LLMs) are typically aligned with human values using preference data or predefined principles such as helpfulness, honesty, and harmlessness. However, as AI systems progress toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI), such value sys- tems may become insufficient. In addition, human feedback- based alignment remains resource-intensive and difficult to scale. While AI-feedback-based self-improving alignment methods have been explored as a scalable alternative, they have largely remained constrained to conventional alignment values. In this work, we explore both a more holistic align- ment objective and a scalable, self-improving alignment ap- proach. Aiming to transcend conventional alignment norms, we introduce Collective Agency (CA)—a unified and open- ended alignment value that encourages integrated agentic ca- pabilities. We also propose Dynamic Alignment—an align- ment framework that enables an LLM to iteratively align it- self. Dynamic Alignment comprises two key components: (1) automated training dataset generation with LLMs, and (2) a self-rewarding mechanism, where the policy model evaluates its own output candidates and assigns rewards for GRPO- based learning. Experimental results demonstrate that our ap- proach successfully aligns the model to CA while preserving general NLP capabilities. Code and datasets — https://github.com/integral- ai/dynamic-alignment-for-collective-agency 1 Introduction As Large Language Models (LLMs) become increasingly capable, aligning their behavior has emerged as a cen- tral challenge in building safe and trustworthy AI sys- tems. Most existing alignment efforts focus on human- centric values such as helpfulness, honesty, and harmless- ness (HHH) (Askell et al. 2021; Bai et al. 2022a; Gan- guli et al. 2022). While aligning models to these values has proven effective for current LLMs, such objectives remain vulnerable to reward hacking. For example, a model may learn to produce persuasive and seemingly correct responses that convince human evaluators of their validity, even when the content is factually incorrect or misleading (Wen et al. 2025). As models grow in scale and sophistication, these *Work done during internship at Integral AI. Figure 1: Interrelationship among the four pillars of Collec- tive Agency, our proposed open-ended alignment value. forms of behavioral hacking become harder to detect and control, suggesting that current alignment paradigms may be insufficient for governing the behavior of more advanced systems approaching Artificial General Intelligence (AGI) or Artificial Superintelligence (ASI). Moreover, traditional approaches to AI alignment often attempt to compress di- verse human values into a single optimizable objective. Al- though well-intentioned, this risks epistemic capture (Hall- gren 2025), a state in which one perspective or value system dominates, marginalizing others. Such optimization could inadvertently steer society toward a monoculture of thought and behavior, undermining the pluralism essential to human progress. A more robust alignment framework should there- fore aim not to fixate on static values but to preserve and enhance the capacity of diverse agents to realize their own. As an alignment approach, Reinforcement Learning from Human Feedback (RLHF; Christiano et al. (2017)) has demonstrated strong empirical performance in aligning models to human preferences (Ouyang et al. 2022; Bai et al. 2022a; Rafailov et al. 2023). However, it remains labor-intensive, slow to iterate, and increasingly difficult to scale with increasing model size and generality. To ad- dress the issue, Recent work has explored more scalable ap- proaches based on AI-generated feedback (Bai et al. 2022b; Lee et al. 2024). In particular, self-rewarding mechanisms, where models evaluate their own outputs, have shown early promise in helpfulness alignment (Yuan et al. 2024). Still, their effectiveness in aligning models to more abstract val- ues remains underexplored. In this work, we propose Collective Agency (CA), a new arXiv:2512.05464v1 [cs.CL] 5 Dec 2025 Figure 2: Dynamic alignment framework. Prompts used in each step are in Appendix A. open-ended alignment value designed to scale with model autonomy and capability. Rather than aligning to static be- haviors or fixed outcomes, CA encourages continual growth across core agentic capacities. We provide a detailed formu- lation and justification in Section 2.1. Then, we propose Dynamic Alignment framework, a self-improving alignment method that enables the LLM to iteratively align itself to CA without relying on human- labeled data. Dynamic Alignment consists

…(Full text truncated)…

📄 Read Full PDF on ArXiv