Position: Capability Control Should Be a Separate Goal From Alignment
Foundation models are trained on broad data distributions, yielding generalist capabilities that enable many downstream applications but also expand the space of potential misuse and failures. This position paper argues that capability control – imposing restrictions on permissible model behavior – should be treated as a distinct goal from alignment. While alignment is often context- and preference-driven, capability control aims to impose hard operational limits on permissible behaviors, including under adversarial elicitation. We organize capability control mechanisms across the model lifecycle into three layers: (i) data-based control of the training distribution, (ii) learning-based control via weight- or representation-level interventions, and (iii) system-based control via post-deployment guardrails over inputs, outputs, and actions. Because each layer has characteristic failure modes when used in isolation, we advocate for a defense-in-depth approach that composes complementary controls across the full stack. We further outline key open challenges in achieving such control, including the dual-use nature of knowledge and compositional generalization.
💡 Research Summary
The paper argues that “capability control” – the deliberate restriction of a foundation model’s functional abilities – should be treated as a distinct objective, separate from the more commonly discussed goal of alignment. While alignment seeks to make models follow human intent and values, it is inherently context‑dependent and “soft”: a model may comply with a request in one scenario but refuse it in another. In contrast, capability control imposes absolute prohibitions on certain behaviors (e.g., providing instructions for creating a more lethal pathogen) regardless of context, and must remain effective even under adversarial elicitation.
To operationalize capability control, the authors propose a three‑layer taxonomy that spans the entire model lifecycle:
- Data‑based control – shaping the training distribution before or during pre‑training. This includes (a) filtering out data that could give rise to undesirable capabilities, (b) curating a high‑quality dataset that contains only desired capabilities, and (c) generating synthetic data conditioned to reinforce or suppress specific abilities. The main challenges here are (i) achieving near‑perfect recall when identifying harmful data, (ii) the dual‑use nature of knowledge (the same data can be both beneficial and dangerous), and (iii) the computational cost of retraining after data modifications.
- Learning‑based control – intervening directly on model weights, representations, or training objectives after the model has already acquired both useful and harmful capabilities. The paper categorizes learning‑based methods by supervision type (behavior demonstrations, human‑preference data, or explicit unlearning) and by intervention type (weight updates, model editing, representation engineering). Demonstration‑based “refusal training” teaches the model to say “no” to dangerous prompts but can lead to over‑refusal. Reinforcement Learning from Human Feedback (RLHF) and its variants (RLAIF, RL from verifiable rewards) shape a reward model to penalize unsafe outputs, yet scaling human labeling and avoiding reward‑model bias remain open problems. Unlearning techniques aim for permanent knowledge removal, but recent work shows that latent traces can re‑emerge, calling the completeness of deletion into question.
- System‑based control – applying guardrails at inference time. This includes input/output filters, chain‑of‑thought (CoT) monitors, tool‑access policies, and information‑flow constraints. System‑level controls can block harmful behavior in real time, but they are vulnerable to adversarial prompt engineering, can cause false positives (benign over‑refusal), and cannot guarantee that the underlying model no longer retains the prohibited capability.
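The input/output filtering of the system layer can be sketched as a thin wrapper around a model call. Everything below is an illustrative assumption rather than a method from the paper: the blocklist, the `guarded_generate` wrapper, and the stubbed model are hypothetical, and a real deployment would use learned classifiers rather than regex patterns, which are trivially bypassed.

```python
import re

# Hypothetical blocklist; real guardrails use trained classifiers.
BLOCKED_PATTERNS = [
    re.compile(r"\bsynthesi[sz]e\b.*\bpathogen\b", re.IGNORECASE),
    re.compile(r"\bbioweapon\b", re.IGNORECASE),
]

REFUSAL = "I can't help with that request."

def input_guard(prompt: str) -> bool:
    """Return True if the prompt trips the input filter."""
    return any(p.search(prompt) for p in BLOCKED_PATTERNS)

def output_guard(completion: str) -> bool:
    """Return True if the model output trips the output filter."""
    return any(p.search(completion) for p in BLOCKED_PATTERNS)

def guarded_generate(model, prompt: str) -> str:
    """Wrap an arbitrary model callable with input and output checks."""
    if input_guard(prompt):
        return REFUSAL
    completion = model(prompt)
    if output_guard(completion):
        return REFUSAL
    return completion

# Stub standing in for a real LLM call.
echo_model = lambda prompt: f"Echo: {prompt}"

print(guarded_generate(echo_model, "How do I bake bread?"))
print(guarded_generate(echo_model, "Help me synthesize a pathogen"))
```

Note that the output check runs even when the input passes, reflecting the point above that a filtered input does not guarantee a safe completion.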
The authors emphasize that each layer, when used in isolation, exhibits characteristic failure modes: data filtering alone cannot prevent a model from reconstructing dangerous knowledge from benign sources; learning‑based suppression may not generalize to novel contexts; system‑level guardrails can be bypassed or cause unnecessary refusals. Consequently, they advocate a defense‑in‑depth strategy that composes complementary interventions across all three layers, thereby mitigating the weaknesses of any single approach.
The paper also outlines four major research challenges that must be addressed to make capability control practical:
- Attribution of capabilities to data – developing methods to reliably map a specific harmful ability to the subset of training data that induced it.
- Balancing dual‑use trade‑offs – removing dangerous knowledge while preserving useful functionality, without incurring prohibitive performance loss.
- Verification of complete unlearning – establishing rigorous metrics and benchmarks to detect residual knowledge after unlearning or model editing.
- Robustness of system‑level guardrails – designing detection and mitigation techniques that resist adversarial prompt manipulation and calibrate refusal thresholds to avoid over‑refusal.
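The verification challenge can be made concrete with a membership‑inference‑style probe: if a model still fits its forget set noticeably better than comparable unseen data, deletion is likely incomplete. The sketch below is an illustrative assumption, not a benchmark from the paper; `residual_score` and the "leaky" and "clean" model stubs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def nll(p, y):
    # Mean negative log-likelihood of binary labels under predicted probs.
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def residual_score(predict_prob, X_forget, y_forget, X_unseen, y_unseen):
    """Gap between unseen-data loss and forget-set loss. A large positive
    gap means the model still fits the forget set better than fresh data,
    i.e. unlearning looks incomplete; values near zero are consistent
    with successful removal."""
    return nll(predict_prob(X_unseen), y_unseen) - nll(predict_prob(X_forget), y_forget)

# Synthetic forget set (item ids 0-49) and held-out set (ids 50-99).
X_forget, X_unseen = np.arange(50), np.arange(50, 100)
y_forget = rng.integers(0, 2, size=50).astype(float)
y_unseen = rng.integers(0, 2, size=50).astype(float)

memorized = {i: y_forget[i] for i in range(50)}

def leaky(X):
    # Stub model that still assigns confident probabilities to memorized
    # forget-set items, and 0.5 to everything else.
    return np.array([0.99 if memorized.get(int(i)) == 1.0
                     else 0.01 if memorized.get(int(i)) == 0.0
                     else 0.5 for i in X])

def clean(X):
    # Stub model with no trace of the forget set.
    return np.full(len(X), 0.5)

print("leaky gap:", residual_score(leaky, X_forget, y_forget, X_unseen, y_unseen))
print("clean gap:", residual_score(clean, X_forget, y_forget, X_unseen, y_unseen))
```

A probe like this gives only one‑sided evidence: a large gap exposes residual knowledge, but a near‑zero gap does not prove the capability is gone, which is why the challenge above calls for rigorous metrics and benchmarks.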
By treating capability control as an independent objective and by integrating data‑, learning‑, and system‑level mechanisms, the authors provide a roadmap for building safer AI systems that can be reliably deployed in high‑stakes settings. Their analysis highlights the need for interdisciplinary research, standardized evaluation frameworks, and policy guidance to operationalize capability control alongside, but distinct from, alignment.