Research on Superalignment Should Advance Now with Alternating Competence and Conformity Optimization
The recent leap in AI capabilities, driven by large generative models, has raised the prospect of achieving Artificial General Intelligence (AGI) and, in turn, triggered discussion of Artificial Superintelligence (ASI): a system surpassing all humans across measured domains. This gives rise to a critical research question: as we approach ASI, how do we align it with human values so that it benefits rather than harms human society? This is known as the superalignment problem. Although ASI is regarded by many as a hypothetical concept, in this position paper we argue that superalignment is achievable and that research on it should advance immediately, through simultaneous and alternating optimization of task competence and value conformity. We posit that superalignment is not merely a safeguard for ASI but also necessary for its responsible realization. To support this position, we first provide a formal definition of superalignment rooted in the gap between capability and capacity, examine its perceived infeasibility by analyzing the limitations of existing paradigms, and then illustrate a conceptual path toward superalignment, centered on two fundamental principles, to support its achievability. This work frames a potential initiative for developing value-aligned next-generation AI, which promises greater benefits and reduced harm to humanity.
💡 Research Summary
The paper argues that research on “superalignment” – the alignment of future artificial superintelligence (ASI) with human values – should begin immediately rather than waiting for ASI to materialize. The authors first formalize AI models (A) and humans (H) with a utility function U that captures both task performance (competence) and adherence to human values (conformity). They distinguish three levels of intelligence: Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI), and Artificial Superintelligence (ASI), defining each in terms of the expected utility U(A(x)) relative to U(H(x)). Traditional alignment is defined as minimizing the absolute difference |U(A) – U(H)|, which implicitly assumes that the model’s capacity C(A) is at least comparable to human capacity C(H).
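The definitions above can be illustrated with a minimal sketch. This is not the paper's code: the weighting of competence against conformity, the class names, and the numeric thresholds separating ANI, AGI, and ASI are all placeholders, since the summary defines the levels only in terms of the expected utility E[U(A(x))] relative to U(H(x)).

```python
# Illustrative sketch of the summary's formalism; all names and
# thresholds are assumptions, not taken from the paper.
from dataclasses import dataclass


@dataclass
class Utility:
    """U combines task performance and adherence to human values."""
    competence: float   # task-performance term
    conformity: float   # value-adherence term
    weight: float = 1.0 # assumed trade-off weight between the two terms

    def value(self) -> float:
        return self.competence + self.weight * self.conformity


def intelligence_level(u_model: float, u_human: float,
                       super_margin: float = 2.0) -> str:
    """Classify a model by its expected utility relative to humans.

    The `super_margin` cutoff is a placeholder: the summary only says
    ASI surpasses all humans across measured domains.
    """
    if u_model < u_human:
        return "ANI"   # below human-level on the task distribution
    if u_model < super_margin * u_human:
        return "AGI"   # roughly human-level across domains
    return "ASI"       # surpasses humans across domains


def alignment_gap(u_model: float, u_human: float) -> float:
    """Traditional alignment objective: minimize |U(A) - U(H)|."""
    return abs(u_model - u_human)
```

For example, `intelligence_level(Utility(1.0, 0.8).value(), Utility(0.9, 0.9).value())` compares a model's combined utility against a human baseline; traditional alignment, as the summary notes, drives `alignment_gap` toward zero, which only makes sense while the model's capacity C(A) is comparable to human capacity C(H).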
Superalignment, however, is defined differently: when a model’s capacity far exceeds human capacity (C(A) ≫ C(H)), the goal is to minimize the gap between the model’s capability (its achieved utility) and its superhuman capacity. Formally, superalignment seeks to minimize |D_U|, where D_U denotes this capability–capacity gap.
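One plausible way to write the two objectives side by side, in the summary's own notation, is the following. This is an assumed reconstruction, not the paper's exact formula: it takes capability to be measured by the achieved utility U(A) and capacity by C(A), so that D_U is their difference.

```latex
% Assumed reconstruction; D_U and the exact form are not given in the summary.
% Traditional alignment: match human utility (implicitly assumes C(A) comparable to C(H)).
\min_{A} \; \bigl|\, U(A) - U(H) \,\bigr|

% Superalignment: close the capability--capacity gap when C(A) \gg C(H).
\min_{A} \; \bigl|\, D_U \,\bigr|, \qquad D_U := U(A) - C(A)
```

The contrast makes the summary's point concrete: traditional alignment anchors the model to the human baseline U(H), whereas superalignment anchors the model's realized behavior to its own (superhuman) capacity.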