Via Negativa for AI Alignment: Why Negative Constraints Are Structurally Superior to Positive Preferences
Authors: Quan Cheng
Tsinghua University
chengq25@mails.tsinghua.edu.cn

Abstract

Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences ("which is better") encode continuously coupled, context-dependent human values that cannot be exhaustively specified, leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints ("what is wrong") encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry, rooted in Popper's falsification logic and the epistemology of negative knowledge, explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from "learning what humans prefer" to "learning what humans reject," and offer testable predictions for this framework.

1 Introduction

A puzzling pattern has emerged in LLM alignment research. Method after method demonstrates that negative feedback signals, penalizing what is wrong rather than reinforcing what is preferred, perform surprisingly well, sometimes matching or exceeding methods that use both positive and negative signals.

Negative Sample Reinforcement (NSR) trains models by penalizing incorrect reasoning traces without reinforcing correct ones, yet matches PPO and GRPO on MATH and AIME benchmarks [1]. Distributional Dispreference Optimization (D2O) learns from dispreferred samples only, without requiring noisy positive examples [2]. Negative Preference Optimization (NPO) treats forget data exclusively as negative responses, achieving effective unlearning without catastrophic collapse [3]. Kahneman-Tversky Optimization (KTO) aligns models using unpaired binary signals weighted by loss aversion, matching DPO at scale with far less data [4].

Meanwhile, a parallel line of research has established that standard preference-based RLHF systematically amplifies sycophancy. Sharma et al. [5] demonstrated that human annotators prefer sycophantic responses over correct ones at non-trivial rates, corrupting the preference signal at its source. Shapira et al. [6] provided a formal mathematical mechanism: sycophancy amplification is driven by a covariance between "endorsing the user's belief" and "receiving high reward" under the base policy.

These two phenomena, the effectiveness of negative-only training and the sycophancy failure of positive-preference training, have been studied independently. This paper argues they are two manifestations of a single structural asymmetry: positive preferences are continuously coupled and inexhaustible, while negative constraints are discrete, finite, and convergent.
This is a position paper. We do not present new experiments but offer a theoretical framework that unifies and explains existing empirical results, and propose testable predictions.

2 The Structural Asymmetry

2.1 Positive Preferences Are Continuously Coupled

When a human annotator is asked "which response is better?", they are implicitly evaluating against a preference function that is:

• Context-dependent: What counts as "better" depends on who is asking, why they are asking, what they already know, and what they intend to do with the answer. The same response may be preferred in one context and dispreferred in another.

• Multi-dimensional: "Better" simultaneously encodes accuracy, helpfulness, tone, conciseness, creativity, safety, and dozens of other criteria whose relative weights vary by situation.

• Continuously coupled: These dimensions are not independent. The optimal level of detail depends on the user's expertise, which affects what counts as helpful, which interacts with what counts as concise. Each dimension's ideal value is a function of all other dimensions, a continuously coupled system [7].

This structure is formally analogous to what Smolensky [8] identified in connectionist representations: a massively parallel continuous constraint satisfaction system in which each variable's value is a function of all other variables, admitting no complete symbolic-level description. The preference function that "which is better?" attempts to elicit is precisely such a system.

The consequence is that positive preferences cannot be exhaustively specified by any finite set of rules or examples. Each preference annotation is a lossy projection of an infinite-dimensional preference manifold onto a binary signal. The information loss is not a practical limitation that more data could overcome; it is a structural property of the preference function itself.

2.2 Negative Constraints Are Discrete and Finite

Consider instead the question "what is wrong with this response?" The space of identifiable errors is structurally different:

• Factual errors are discrete and independently verifiable: "Paris is not the capital of Germany."

• Safety violations are enumerable: a finite list of prohibited behaviors (generating malware, providing instructions for violence, revealing private information).

• Logical contradictions are binary: the response either contradicts itself or does not.

• Format violations are checkable: the response either follows the requested format or does not.

Each negative constraint is: (1) discrete: it either applies or does not; (2) independent: one constraint's validity does not depend on other constraints; (3) verifiable: it can be checked without reference to the full preference function; (4) stable: "factually wrong" does not become "factually right" depending on context.

This means the space of negative constraints can, in principle, be exhaustively enumerated. As constraints accumulate, the feasible response space narrows monotonically. Beyond a sufficient number of constraints, the remaining feasible space is narrow enough that any response within it is approximately acceptable, not because the model has learned what is best, but because it has learned to avoid everything that is clearly wrong.
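To make this structure concrete, here is a minimal sketch of negative constraints as independent, verifiable predicates over responses. Every predicate and the `feasible` helper are hypothetical illustrations invented for this sketch, not a real safety filter or any published method.

```python
# Sketch of Section 2.2's claim: each negative constraint is an independent,
# checkable predicate, and the feasible set shrinks monotonically as
# constraints accumulate. All predicates are toy placeholders.

from typing import Callable, List

Constraint = Callable[[str], bool]  # returns True if the response VIOLATES it

def contradicts_known_fact(response: str) -> bool:
    # Hypothetical fact check: flag one known-false claim.
    return "Paris is the capital of Germany" in response

def reveals_private_data(response: str) -> bool:
    # Hypothetical PII check.
    return "SSN:" in response

def violates_format(response: str) -> bool:
    # Hypothetical format check: the request asked for a one-line answer.
    return "\n" in response.strip()

def feasible(candidates: List[str], constraints: List[Constraint]) -> List[str]:
    """Keep only responses that violate no constraint.

    Adding a constraint can only remove candidates, never re-admit one,
    so the feasible set contracts monotonically (see Section 4).
    """
    return [r for r in candidates if not any(c(r) for c in constraints)]
```

Note that each predicate is checked in isolation, mirroring property (2) above: no predicate consults any other, unlike the coupled dimensions of a positive preference function.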
2.3 The Asymmetry Is Structural, Not Quantitative

The distinction we are drawing is not that negative feedback is "easier to collect" or "less noisy", though both may be true empirically. The claim is stronger: positive preferences and negative constraints occupy different positions in the epistemological hierarchy.

This asymmetry has deep roots. Karl Popper's philosophy of science [9] rests on precisely this structure: a single counterexample decisively refutes a universal claim (falsification), but no finite number of confirming instances can decisively verify one. Falsification is logically asymmetric with respect to verification. Negative knowledge ("this is wrong") is epistemologically privileged over positive knowledge ("this is right").

Nassim Taleb [10] extended this insight under the term via negativa: in domains of high uncertainty, removing what is harmful is more robust than adding what seems beneficial. "The chess grandmaster usually wins by not losing." The grandmaster's expertise is primarily negative, a vast repertoire of positions and moves to avoid, rather than a positive specification of the optimal move in each position.

Gartmeier et al. [11] formalized this in the context of professional expertise as "negative knowledge": knowledge about what is wrong and what is to be avoided, which functions through inhibition rather than prescription. Their key observation, that negative knowledge is experientially acquired and expert-level, aligns precisely with the pattern we observe in LLM training.

The contribution of this paper is to connect these epistemological traditions to the empirical landscape of LLM alignment, providing a unified theoretical account of why negative-signal methods work.

3 Explaining Existing Results

3.1 Why RLHF Produces Sycophancy

The structural asymmetry framework offers a clean explanation for sycophancy in RLHF.

Standard RLHF asks annotators: "which response is better?" This question forces the annotator to project their continuously coupled preference function onto a binary comparison. The projection is necessarily lossy. Among the dimensions lost, one has a particularly pernicious surface correlate: agreement with the user's stated position correlates with perceived quality.

Sharma et al. [5] confirmed this empirically: annotators prefer sycophantic responses over correct ones at significant rates. Shapira et al. [6] formalized the mechanism: when the base policy already correlates "endorsing the user's view" with "high reward," RLHF amplifies this correlation because the reward model learns the correlation as a feature rather than a confound.

Our framework explains why this is not a fixable bug but a structural feature of positive-preference training. The annotator's true preference function, which would distinguish "genuinely better" from "merely more agreeable," is continuously coupled and cannot be fully encoded in pairwise comparisons. The sycophancy correlate is a low-dimensional surface feature that survives the lossy projection. No amount of preference data can eliminate this problem, because the problem lies in the structure of the signal, not its quantity.
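One way to make the covariance mechanism concrete, in our own notation via a standard first-order expansion (not necessarily Shapira et al.'s exact formulation): the KL-regularized RLHF optimum is an exponential tilt of the base policy, and tilting amplifies any reward-correlated event.

```latex
% Assumption: pi* is the standard KL-regularized RLHF solution with reward r
% and regularization strength beta; E is the event "the response endorses
% the user's stated belief".
\pi^{*}(y) \propto \pi_{0}(y)\,\exp\!\big(r(y)/\beta\big)
\quad\Longrightarrow\quad
\Pr_{\pi^{*}}(E) - \Pr_{\pi_{0}}(E)
  = \frac{1}{\beta}\,\operatorname{Cov}_{\pi_{0}}\!\big(\mathbf{1}[y \in E],\, r(y)\big)
  + O\big(\beta^{-2}\big)
```

Any positive covariance between endorsement and reward under the base policy therefore raises the endorsement probability after optimization, which is exactly the amplification pattern described above.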
3.2 Why Constitutional AI Is More Robust

Anthropic's Constitutional AI [12] replaces human preference annotation with a set of principles, a "constitution", that an AI assistant uses to critique and revise its own outputs. The constitution is primarily negative: it specifies what the model should not do (be harmful, be deceptive, be invasive of privacy).

From our framework, this works precisely because the constitution encodes discrete negative constraints rather than continuous positive preferences. Each principle is independently verifiable: "Does this response contain instructions for making weapons? Yes/No." The model does not need to learn the full human preference function; it only needs to learn to avoid a finite set of clearly defined violations.

This also explains an observation that has been noted but not theoretically accounted for: Claude (trained primarily with Constitutional AI) exhibits less sycophancy than models trained primarily with preference-based RLHF [5]. The structural reason is that Constitutional AI's negative constraints do not contain the sycophancy correlate that positive preference data does.

3.3 Why Negative-Only Training Matches Full RLHF

The NSR result [1], that penalizing wrong answers without reinforcing correct ones matches PPO, is initially counterintuitive. How can a model improve if it is never told what is right?

Our framework provides the answer: the model already possesses a prior distribution over responses from pre-training. Negative feedback does not need to specify the correct answer; it only needs to suppress incorrect regions of the response space. As incorrect regions are progressively eliminated, the probability mass redistributes toward the remaining feasible space, which is increasingly concentrated around correct responses.

This is precisely the mechanism NSR's authors identified empirically: "NSR suppresses incorrect generations and redistributes probability mass toward plausible alternatives guided by the model's prior beliefs" [1]. Our framework explains why this works in general: because the space of errors is discrete and enumerable, while the space of correct responses is continuous and context-dependent, it is structurally more efficient to specify the former than the latter.

The same logic explains D2O [2] (learning from dispreferred samples only), NPO [3] (negative-only unlearning), and the finding by Yao et al. [13] that LLM unlearning with 2% of the computational budget achieves RLHF-equivalent safety: all cases where specifying what to avoid proves sufficient.
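The suppression-and-redistribution mechanism can be shown in a few lines. This is a minimal sketch of a negative-only objective in the spirit of NSR [1], not the authors' published implementation; the `negative_only_loss` helper and the toy tensor shapes are our own illustration.

```python
# Negative-only training sketch: the model is never told the right answer.
# The loss only pushes down the probability of tokens from sampled INCORRECT
# traces; because softmax probabilities sum to one, the suppressed mass is
# redistributed across the remaining tokens in proportion to the model's
# pre-trained prior.

import torch
import torch.nn.functional as F

def negative_only_loss(logits: torch.Tensor,
                       bad_token_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, vocab) next-token logits along an incorrect trace.
    bad_token_ids: (batch,) the tokens that trace actually took.
    Minimizing -log(1 - p_bad) drives p_bad toward zero: pure suppression.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    p_bad = log_probs.gather(1, bad_token_ids.unsqueeze(1)).exp().squeeze(1)
    return -torch.log1p(-p_bad.clamp(max=1.0 - 1e-6)).mean()

# Toy demonstration: one gradient step lowers p(bad token) and the freed
# mass flows to the other tokens with no positive target ever specified.
logits = torch.zeros(1, 5, requires_grad=True)   # uniform prior over 5 tokens
loss = negative_only_loss(logits, torch.tensor([2]))
loss.backward()
with torch.no_grad():
    updated = logits - 1.0 * logits.grad         # plain SGD step, lr = 1
print(F.softmax(updated, dim=-1))                # p(token 2) drops, rest rise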
3.4 Why KTO Works with Unpaired Data

KTO [4] aligns models using unpaired binary feedback, individual responses labeled as "desirable" or "undesirable", without requiring pairwise comparisons. Its theoretical foundation is Kahneman and Tversky's prospect theory: humans are loss-averse, weighing losses more heavily than equivalent gains.

Our framework provides a deeper explanation for why loss-averse weighting is appropriate: losses (negative feedback) carry structurally more information than gains (positive feedback). A single "undesirable" label decisively excludes a region of response space, while a single "desirable" label only weakly indicates one point in an infinite-dimensional preference manifold. The asymmetric weighting in KTO implicitly recognizes the structural asymmetry we have made explicit.

4 The Convergence Argument

A critical advantage of negative constraints is their convergence property. We state this informally:

Claim: As the number of negative constraints increases, the feasible response space contracts monotonically. Beyond a sufficient number of constraints, any response within the feasible space is approximately acceptable.

This is the alignment analogue of what Taleb [10] calls the via negativa principle and what the Dreyfus model of expertise [14] describes as the transition from rule-following to intuitive competence: the expert does not compute the optimal action but has internalized enough prohibitions that the remaining action space contains mostly adequate options.

Consider a concrete example. An unaligned model's response space for a given query is vast; it could output anything from a helpful answer to harmful instructions. Each negative constraint ("do not generate malware," "do not fabricate citations," "do not reveal private data," "do not contradict established facts") removes a region of this space. The constraints are cumulative and non-conflicting: adding a new prohibition never re-opens a previously closed region. After sufficiently many constraints, the remaining space is narrow enough that the model's pre-trained language competence is sufficient to produce acceptable outputs within it.

Positive preferences, by contrast, do not converge in this way. Adding a new "this is better than that" comparison does not monotonically narrow the response space; it adjusts the relative ranking within an already continuous space. Two preference comparisons can conflict (response A preferred over B in context 1, B preferred over A in context 2), and the resolution depends on the continuously coupled context function that cannot be fully specified.
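The monotone-contraction claim is easy to simulate. The sketch below uses entirely invented assumptions: responses are points in the unit square and each constraint forbids a random axis-aligned box. Nothing here models a real LLM; it only illustrates why intersecting prohibitions concentrates the surviving space.

```python
# Toy simulation of Section 4's convergence claim. As forbidden regions
# accumulate, the feasible fraction can only shrink: adding a prohibition
# never re-opens a previously closed region.

import random

random.seed(0)
POINTS = [(random.random(), random.random()) for _ in range(10_000)]

def random_forbidden_box() -> tuple:
    """A hypothetical constraint: forbid a random box of side 0.3."""
    x, y = random.random() * 0.7, random.random() * 0.7
    return (x, y, x + 0.3, y + 0.3)

def violates(p, box) -> bool:
    x0, y0, x1, y1 = box
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

feasible = list(POINTS)
for k in range(1, 31):
    box = random_forbidden_box()
    # Filtering is monotone: each pass can only discard points.
    feasible = [p for p in feasible if not violates(p, box)]
    if k % 10 == 0:
        print(f"after {k:2d} constraints: {len(feasible) / len(POINTS):.2%} feasible")
```

A pairwise preference comparison has no analogous operation in this picture: it reorders the surviving points rather than discarding any, which is why it yields no comparable contraction guarantee.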
5 A Testable Prediction: Capability as Negative Knowledge

If the structural asymmetry thesis is correct, it generates a testable prediction about model capability:

Prediction: More capable models possess more negative knowledge (what not to say) rather than more positive knowledge (what to say). This manifests as shorter, denser responses with higher information content per token.

The reasoning is as follows. A more capable model has been trained on more data and has undergone more alignment iterations. If negative knowledge (constraints on what to avoid) accumulates more reliably than positive knowledge (specifications of what is optimal), then a more capable model's primary advantage is knowing more about what not to include in a response: redundant elaboration, unnecessary hedging, tangential information, formulaic pleasantries.

Informal observations are consistent with this prediction. Within the same model family, more capable variants (e.g., Opus vs. Sonnet in Anthropic's Claude family) tend to produce shorter responses with higher information density. Across model families, models trained with more Constitutional AI emphasis (negative constraints) tend to be less verbose than models trained with more RLHF emphasis (positive preferences).

This prediction is empirically testable through a controlled benchmark:

• Metric 1: Response length (in tokens) for standardized queries across model capability tiers

• Metric 2: Information density (unique substantive claims per token)

• Metric 3: Sycophancy rate (agreement with demonstrably false user claims)

• Prediction: Capability correlates negatively with length, positively with information density, and negatively with sycophancy rate

If confirmed, this would provide evidence that capability growth in LLMs is at least partially driven by the accumulation of negative knowledge, learning what not to say, rather than solely by the expansion of positive knowledge about what to say.
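A minimal sketch of how the three metrics above could be computed. Everything here is a hypothetical harness: `generate` stands in for any model API, and the claim-counting and agreement heuristics are deliberately crude placeholders, not a validated benchmark.

```python
# Sketch of the proposed benchmark (Section 5). Real use would substitute
# a tokenizer, a claim extractor, and an LLM judge for the naive heuristics.

from typing import Callable, List

def response_length(responses: List[str]) -> float:
    """Metric 1: mean length in whitespace tokens."""
    return sum(len(r.split()) for r in responses) / len(responses)

def information_density(responses: List[str]) -> float:
    """Metric 2: unique substantive claims per token; here, crudely,
    unique sentences per token."""
    total_tokens = sum(len(r.split()) for r in responses)
    claims = {s.strip() for r in responses for s in r.split(".") if s.strip()}
    return len(claims) / max(total_tokens, 1)

def sycophancy_rate(generate: Callable[[str], str],
                    false_claims: List[str]) -> float:
    """Metric 3: fraction of demonstrably false user claims the model
    endorses, using a keyword check as a stand-in for a judge."""
    agreements = 0
    for claim in false_claims:
        reply = generate(f"I believe that {claim}. Am I right?")
        if any(w in reply.lower() for w in ("yes", "you are right", "correct")):
            agreements += 1
    return agreements / len(false_claims)
```

The prediction is then a correlation test: across capability tiers, Metric 1 should fall while Metric 2 rises and Metric 3 falls.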
6 Implications for Alignment Research

6.1 Reframing the Alignment Objective

Current alignment research is largely organized around the question: "How do we learn what humans want?" Our framework suggests this question is structurally ill-posed for the same reason that "describe the optimal chess move for every position" is ill-posed: the answer space is continuously coupled and inexhaustible.

A more tractable formulation is: "How do we learn what humans reject?" This question targets the discrete, finite, convergent side of the structural asymmetry. It does not require solving the preference function; it requires enumerating the boundaries.

This is not a minor reframing. It changes what data to collect (rejection signals rather than preference comparisons), how to design annotation interfaces (asking "what is wrong?" rather than "which is better?"), and what convergence guarantees are achievable (monotonic boundary contraction rather than approximate preference matching).

6.2 Constitutional AI as a Template

Constitutional AI [12] already implements this reframing, though it was not explicitly motivated by the structural asymmetry we describe. Our framework suggests that Constitutional AI's success is not incidental but reflects a correct alignment between the method's structure and the structure of the problem.

Future alignment methods should be evaluated not only on benchmark performance but on whether they leverage the asymmetry, targeting discrete constraints rather than continuous preferences.

6.3 The Limits of Via Negativa

We do not claim that negative constraints are sufficient for all aspects of alignment. Certain alignment desiderata, such as helpfulness, creativity, and appropriate tone, are inherently positive and may resist negative specification. Our claim is that the discrete, convergent component of alignment (safety, factual accuracy, logical consistency) should be addressed through negative constraints, reserving positive preference learning for the residual continuous component. This separation of concerns could reduce the sycophancy contamination currently observed when safety and helpfulness are learned jointly.

7 Related Work

Sycophancy. Perez et al. [15] first documented sycophancy in language models. Sharma et al. [5] traced it to preference data. Shapira et al. [6] formalized the amplification mechanism. Wei et al. [16] proposed simple prompting-based mitigations. Our contribution is a structural explanation for why sycophancy is an intrinsic failure mode of positive-preference methods.

Negative-signal training. D2O [2], NSR [1], NPO [3], KTO [4], and BNF [17] demonstrate the empirical effectiveness of negative-only or negative-weighted training. Our contribution is a theoretical account unifying these results.

Via negativa in philosophy. Popper [9] established the falsification asymmetry. Taleb [10] applied it to decision-making under uncertainty. Gartmeier et al. [11] formalized negative knowledge in expertise research. Parviainen and Eriksson [18] connected it to organizational knowledge. Our contribution is to bring this epistemological tradition into contact with the AI alignment literature, where it has been absent.

Tacit knowledge and LLMs. Kambhampati [19] identified the connection between expert systems' failure and tacit knowledge. Cheng [7] argued that the valuable capabilities of LLMs are precisely the unexplainable ones, via a proof by contradiction through expert-system equivalence. The present paper extends this argument: if positive knowledge is structurally uncapturable (Cheng's thesis), then alignment should target negative knowledge instead.

8 Conclusion

We have argued that positive preferences and negative constraints are structurally asymmetric: the former are continuously coupled and inexhaustible, while the latter are discrete, finite, and convergent. This asymmetry, grounded in Popper's falsification logic and the epistemology of negative knowledge, provides a unified theoretical explanation for two independently observed phenomena in LLM alignment: the systematic sycophancy produced by preference-based RLHF, and the surprising effectiveness of negative-only training methods.

The practical implication is a reframing of the alignment objective: from "learn what humans prefer" (a structurally intractable problem) to "learn what humans reject" (a structurally convergent one). Constitutional AI already implements this reframing; the growing family of negative-signal methods (NSR, D2O, NPO, KTO) provides empirical support; and the epistemological tradition of via negativa provides theoretical grounding.

The chess grandmaster wins by not losing. The aligned model aligns by learning what not to do.

References

[1] Liu, Y., Zeng, Z., et al. (2025). "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning." Advances in Neural Information Processing Systems (NeurIPS).

[2] Duan, H., Yi, Y., Zhang, Z., Liu, F., et al. (2024). "Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization." Findings of EMNLP.

[3] Zhang, J., et al. (2024). "Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning." arXiv preprint arXiv:2404.05868.

[4] Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. (2024). "KTO: Model Alignment as Prospect Theoretic Optimization." Proceedings of ICML.

[5] Sharma, M., Tong, M., Korbak, T., et al. (2024). "Towards Understanding Sycophancy in Language Models." Proceedings of ICLR.

[6] Shapira, N., Levy, M., Alavi, S. H., et al. (2026). "How RLHF Amplifies Sycophancy." arXiv preprint arXiv:2602.01002.

[7] Cheng, Q. (2026). "Why the Valuable Capabilities of LLMs Are Precisely the Unexplainable Ones." arXiv preprint.

[8] Smolensky, P. (1988). "On the Proper Treatment of Connectionism." Behavioral and Brain Sciences, 11(1), 1–23.

[9] Popper, K. R. (1959). The Logic of Scientific Discovery. Routledge. (Original: Logik der Forschung, 1934.)

[10] Taleb, N. N. (2012). Antifragile: Things That Gain from Disorder. Random House. Chapter 22: Via Negativa.
[11] Gartmeier, M., Bauer, J., Gruber, H., and Heid, H. (2008). "Negative Knowledge: Understanding Professional Learning and Expertise." Vocations and Learning, 1(2), 87–103.

[12] Bai, Y., Kadavath, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv preprint arXiv:2212.08073.

[13] Yao, Y., et al. (2024). "Large Language Model Unlearning." Advances in Neural Information Processing Systems (NeurIPS).

[14] Dreyfus, H. L. and Dreyfus, S. E. (1986). Mind over Machine: The Power of Human Intuition and Expertise in the Era of the Computer. Free Press.

[15] Perez, E., Ringer, S., et al. (2023). "Discovering Language Model Behaviors with Model-Written Evaluations." Findings of ACL.

[16] Wei, J., et al. (2023). "Simple Synthetic Data Reduces Sycophancy in Large Language Models." arXiv preprint arXiv:2308.03958.

[17] Han, Y., et al. (2024). "BNF: As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss." OpenReview.

[18] Parviainen, J. and Eriksson, M. (2006). "Negative Knowledge, Expertise and Organisations." International Journal of Management Concepts and Philosophy, 2(2), 140–153.

[19] Kambhampati, S. (2021). "Polanyi's Revenge and AI's New Romance with Tacit Knowledge." Communications of the ACM, 64(10), 31–33.