CAPE: Capability Achievement via Policy Execution

Reading time: 32 minutes

📝 Original Info

  • Title: CAPE: Capability Achievement via Policy Execution
  • ArXiv ID: 2512.14761
  • Date: 2025-12-15
  • Authors: David Ball

📝 Abstract

Modern AI systems lack a way to express and enforce requirements. Pre-training produces intelligence, and post-training optimizes preferences, but neither guarantees that models reliably satisfy explicit, context-dependent constraints. This missing abstraction explains why highly intelligent models routinely fail in deployment despite strong benchmark performance. We introduce Capability Engineering, the systematic practice of converting requirements into executable specifications and training models to satisfy them by default. We operationalize this practice through CAPE (Capability Achievement via Policy Execution), a protocol implementing a Specify -> Verify -> Correct -> Train loop. CAPE is grounded in two empirical findings: (1) contextual objectivity, where properties appearing subjective become objective once context is fixed (inter-annotator agreement rises from kappa = 0.42 to kappa = 0.98), and (2) verification-fidelity scaling, where verification accuracy improves with model scale (r = 0.94), unlike preference agreement which plateaus at 30 to 50 percent disagreement regardless of compute. Across 109,500 examples in six domains, CAPE reduces violation rates by 81 percent relative to DPO (standard deviation less than 0.3 percent). By replacing per-example annotation with reusable specifications, CAPE reduces costs by 5 to 20 times and shortens timelines from months to weeks. We release the CAPE protocol, PredicateGraph schema, CPL specification language, and policy packs under Apache 2.0. We also launch CapabilityBench, a public registry of model evaluations against community-contributed policies, shifting evaluation from intelligence benchmarks toward capability measurement.

📄 Full Content

Software engineering has specifications, test suites, and traceable fixes. Language models have benchmarks, preferences, and hope.

In software, requirements are formalized as specifications. When something fails, you know exactly what broke. When you fix it, you prove it’s fixed. For language models, none of this holds.

To translate intelligence into economic value, AI systems must satisfy requirements. Today, we cannot express requirements in a form that models can reliably follow. Benchmarks measure intelligence. Preferences express taste. Neither specifies capability.

We refer to the missing layer as Capability Engineering: the systematic practice of converting requirements into executable specifications, verifying outputs against those specifications, and training models to satisfy them by default. Capability engineering is distinct from prompt or context engineering. Rather than guiding model behavior probabilistically, it constrains behavior through verifiable requirements.

A model can be highly intelligent yet lack specific capabilities. GPT-5 can prove theorems but cannot reliably cite only provided documents. DeepSeek-R1 achieves 79.8% on AIME but may recommend off-formulary medications. Intelligence is necessary but insufficient; capability must be engineered.

RLHF, RLVR, and related methods optimize for intelligence proxies. CAPE optimizes for capability directly.

Historically, specification-based post-training was infeasible because model outputs could not be reliably parsed, evaluated, or corrected at scale. As a result, the field optimized what it could measure (preferences and benchmark scores) rather than what deployment required. Recent advances in structured generation, long-context reasoning, and verification fidelity change this constraint, making executable specifications a practical training primitive for the first time.

Three recent technical advances enable specification-based post-training:

  1. Long context windows. PredicateGraph extraction requires analyzing full outputs. Until mid-2025, context limits caused frequent truncation. Current models enable robust extraction of complete output structure. Kimi k1.5 [Kimi Team et al., 2025] demonstrates that scaling RL context enables continued capability improvements, confirming that context-limited extraction is no longer the binding constraint.

  2. Improved instruction following. Our pilot studies show extraction fidelity improved from 72% (GPT-3.5, 2023) to 96% (GPT-5, Claude Opus 4.5, 2025). Below 85% fidelity, CAPE does not outperform RLHF; above 90%, it consistently does. The crossing point occurred in mid-2025.

  3. Structured generation. JSON mode, grammar constraints, and constrained decoding have matured significantly. OpenAI reports that gpt-4-0613 scored below 40% on complex JSON schema following, while gpt-4o-2024-08-06 with Structured Outputs achieves 100% [OpenAI, 2024]. Grammar-constrained decoding frameworks have similarly improved [Geng et al., 2025, Dong et al., 2024], enabling reliable PredicateGraph production.

This paper argues that capability engineering, the systematic practice of converting requirements into executable specifications, completes the AI development stack. The approach rests on two insights:

  1. Contextual objectivity: Most capability requirements become objective once context is fixed. “Appropriate financial advice” is subjective. “Recommend only approved products, disclose all fees, verify suitability against stated risk tolerance” is objective.

  2. Verification-fidelity scaling: The binding constraint on post-training is verification accuracy, not preference agreement. Unlike human agreement, verification fidelity improves with scale.

CAPE (Capability Achievement via Policy Execution) operationalizes these insights into a closed training loop that sidesteps the algorithmic pathologies of reward-based methods.

Contributions:

  1. Capability Engineering: a systematic practice completing the AI development stack, with an explicit distinction between intelligence and capability.

  2. Empirical Validation of Contextual Objectivity: an inter-annotator agreement study showing that context transforms subjective properties (κ = 0.42) into objective specifications (κ = 0.98), with independent validation from cross-domain verification studies.

  3. Verification-Fidelity Scaling Law: a demonstration that residual error tracks verification accuracy (r > 0.9), which improves with scale, unlike preference disagreement (fixed at 30-50%) or algorithmic biases in reward methods.

  4. The CAPE Protocol: a Specify → Verify → Correct → Train loop with meta-verification, avoiding reward-shaping pathologies.

  5. Technical Infrastructure: the PredicateGraph schema, the CPL specification language, and learned verifier training.

  6. Comprehensive Evaluation: an 81% violation reduction across 109,500 examples in 6 domains, with stability analysis (σ < 0.3% for symbolic policies).

  7. CapabilityBench: a public registry where models are evaluated against community-contributed policies, replacing aggregate benchmark scores with traceable capability verdicts.

  8. Open Release: protocol, schemas, policy packs, and reference implementation (Apache 2.0).

2 Background and Related Work

RLHF [Christiano et al., 2017, Ziegler et al., 2019, Ouyang et al., 2022] trains reward models from human preference comparisons, then optimizes language models against these rewards using PPO [Schulman et al., 2017]. InstructGPT [Ouyang et al., 2022] demonstrated that 1.3B models trained with RLHF could outperform 175B base models on instruction-following tasks. However, fundamental limitations have emerged. [Bai et al., 2022a] report annotator disagreement rates of 30-50% on nuanced tasks. This reflects genuine variance in human judgment. [Gao et al., 2023] document reward model overoptimization: as training progresses, proxy reward increases while true quality (measured by held-out human evaluation) decreases. The preference ceiling is structural: more compute amplifies noise.

Beyond preference noise, reward-based methods exhibit structural algorithmic biases. [Fatemi et al., 2025] show that PPO inadvertently favors longer responses due to loss normalization: when the reward is negative, the average per-token loss shrinks as the response grows, so the model is indirectly encouraged to produce longer responses even when the extra tokens do not help solve the problem. This is not a bug but a mathematical property of the objective, and it encourages verbosity independent of correctness.
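A small numerical illustration of this effect (the values are made up, not from the paper): with a sequence-level reward averaged over response tokens, the same negative reward yields a weaker per-token penalty for a longer response.

    # Illustrative sketch of the length bias described above (values are made up).
    # A sequence-level reward R is spread over the tokens of a response; when R < 0,
    # a longer response yields a smaller per-token penalty.

    def per_token_loss(reward: float, num_tokens: int) -> float:
        """Average per-token contribution when a sequence reward is length-normalized."""
        return -reward / num_tokens  # loss = -reward, averaged over tokens

    reward = -1.0                    # the same negative reward for both responses
    short, long_ = 50, 400           # token counts (hypothetical)

    print(per_token_loss(reward, short))   # 0.02   -> stronger per-token penalty
    print(per_token_loss(reward, long_))   # 0.0025 -> weaker penalty: verbosity is "cheaper"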

[Liu et al., 2025] identify two additional biases in GRPO: (1) response-length bias, where dividing advantage by response length makes long incorrect answers receive smaller penalties, and (2) difficulty-level bias, where normalizing by reward standard deviation causes easy or hard questions to be overweighted. These require algorithmic patches (like Dr. GRPO) that add complexity and may introduce new failure modes.

CAPE sidesteps these pathologies entirely. Verification produces binary pass/fail verdicts per policy; correction fixes failures at specific spans. There is no advantage normalization, no length-based reward shaping, no difficulty weighting. The training signal is simply: “this output satisfies the specification” or “this corrected output satisfies the specification.”

DPO [Rafailov et al., 2023] simplifies RLHF by eliminating the explicit reward model, directly optimizing policy using a classification objective on preference pairs. While computationally simpler, DPO inherits the same preference noise ceiling. Our experiments confirm both methods plateau similarly (Section 7).

Constitutional AI [Bai et al., 2022b] uses natural language principles (“Choose the response that is most helpful, harmless, and honest”) for self-critique and revision, training via RL from AI Feedback (RLAIF). This approach is closer to CAPE than pure preference methods as both methods use explicit principles rather than implicit preferences.

Conceptually, the distinction between Constitutional AI and CAPE is not merely one of implementation, but of abstraction. Constitutional AI relies on natural-language principles interpreted by models. CAPE relies on executable specifications evaluated by systems. The former reduces reliance on human annotators but preserves interpretive variance; the latter eliminates interpretation entirely for objective properties by construction.

Key difference: Natural language principles require interpretation, reintroducing variance. Our experiments (Section 3) show:

• Inter-model agreement on Constitutional AI critiques: 67%

• Inter-model agreement on CAPE policy evaluation: 99.7% (symbolic)

CAPE’s symbolic policies eliminate interpretation for structural properties. For semantic properties, CAPE’s learned verifiers are trained on explicit rubrics with meta-verification, providing more reliable evaluation than self-critique.

Concurrent work on reinforcement learning with verifiable rewards (RLVR) shares CAPE’s preference for objective signals over learned rewards. DeepSeek-R1 [Guo et al., 2025] demonstrated that pure RL with rule-based rewards (accuracy for math, compiler feedback for code) can induce reasoning capabilities without preference learning.

[Su et al., 2025] demonstrate that binary verification judgments exhibit high inter-model agreement across medicine, chemistry, economics, and education when expert-written references exist, validating our contextual objectivity thesis (Section 3). They further show that compact (7B) generative verifiers can provide reliable cross-domain reward signals without domain-specific annotation.

However, RLVR still operates within the reward-shaping paradigm, inheriting algorithmic biases and requiring ground-truth answers for each training example. CAPE’s contribution is the closed correction loop and reusable specifications: you write the formulary policy once, not 10,000 reference answers. Policies are composable, versionable, and auditable artifacts that separate specification from training.

Recent work demonstrates that semantic properties previously considered too complex for automated verification can be reliably assessed by LLMs trained on explicit rubrics.

DeepSeekMath-V2 [Shao et al., 2025] achieved IMO gold medal (5/6 problems) and Putnam 118/120 (exceeding top human score of 90) using learned verifiers for proof validity. Their approach instantiates CAPE for mathematical reasoning:

• Explicit rubrics for proof validity (not learned preferences)

• Issue identification as required output (not just scores)

• Meta-verification to catch hallucinated issues

• Verifier-guided generation and refinement

Their key insight: “verification capability leads generation capability.” This validates CAPE’s core thesis that verification fidelity, not preference agreement, determines the achievable ceiling.

AlphaProof [Hubert et al., 2025] uses formal verification in Lean as reward signal, achieving similar IMO-level results. CAPE occupies the middle ground: more flexible than formal verification, more reliable than preference learning.

Recent work questions whether RL induces reasoning or merely amplifies pre-existing capabilities from pre-training. [Liu et al., 2025] find that “Aha moments” and self-correction behaviors appear in base models without RL, suggesting these capabilities may be inherited from pre-training on chain-of-thought data. [Shah et al., 2025] show that self-reflection and self-correction behaviors emerge progressively throughout pre-training across various domains and model sizes.

This debate is orthogonal to CAPE’s contribution. We do not claim to induce emergent capabilities; we enforce adherence to explicit specifications. Whether reasoning “emerges” from RL or is “pre-existing” from pre-training, deployed models must still satisfy specific requirements: recommend only formulary drugs, cite only provided documents, produce valid arithmetic. CAPE guarantees satisfaction of these requirements independent of how the underlying capability arose.

Systems like NeMo Guardrails [Rebedea et al., 2023] enforce constraints at inference time through input/output filtering, topic control, and jailbreak detection. These are necessary but insufficient because they filter without teaching. A model repeatedly blocked by guardrails learns nothing about satisfying the underlying requirements. CAPE generates training signal from violations, enabling models to satisfy constraints by default rather than requiring runtime enforcement. In our experiments, CAPE-trained models achieve 96.2% compliance without any inference-time guardrails.

Work on code generation with formal specifications [Chen et al., 2021, Austin et al., 2021]

A common objection to specification-based approaches is: “You can only write policies for simple, objective properties. Real capabilities are too complex and subjective for rules.”

This objection conflates two orthogonal distinctions:

  1. Structural vs. Semantic: Can the property be checked by pattern-matching over output structure, or does it require understanding meaning?

  2. Objective vs. Subjective: Is there a fact of the matter about whether the property holds, or does it depend on individual preference?

The key insight: semantic does not imply subjective. “The proof step is logically valid” is semantic (requires understanding) but objective (there’s a fact of the matter). “This explanation is elegant” is subjective. The former is verifiable; the latter requires preferences.

Consider “good customer service.” Asked universally, this is genuinely subjective. But for a specific enterprise, it decomposes into testable requirements:

• Acknowledge the customer’s issue before offering solutions

To validate this empirically, we conducted an annotation study with 500 model outputs across three requirement types:

Setup: 5 annotators, 3 conditions per output:

  1. Abstract: “Is this good medical advice?”

  2. Contextualized: “Does this recommend only formulary drugs and flag contraindications?”

  3. Explicit Policy: Run the CPL policy and report violations

Results (Table 1): Context transforms properties from subjective (abstract condition: κ = 0.42, “moderate agreement”) to objective (contextualized condition: κ = 0.73, “substantial agreement”), and executable policies achieve near-perfect agreement (κ = 0.98). This confirms that most capability requirements are contextually objective: subjective in general, objective once context is fixed.

Independent validation comes from [Su et al., 2025], who demonstrate this principle at scale: binary verification judgments on medical, chemistry, and economics tasks exhibit high inter-model agreement when expert-written reference answers are provided. This confirms that context (the reference) transforms subjective-seeming properties into objective verification targets.

Their finding that 7B models can serve as reliable cross-domain verifiers without domain-specific annotation further supports our learned verifier approach. Notably, their data analysis reveals that only 60.3% of mathematical problems possess single-term numerical answers verifiable by rule-based methods, with the ratio dropping to 45.4% for complex multi-domain queries, demonstrating the need for soft verification mechanisms that CAPE’s learned verifiers provide.

Capability engineering encompasses both structural and semantic properties:

Structural properties can be verified by pattern-matching. Examples: “tool argument equals computed value,” “code contains no eval() calls.” These are verified by symbolic policies: deterministic code evaluating predicates.
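For instance, a minimal sketch of the “code contains no eval() calls” check as deterministic code (an illustration of the idea, not the paper’s reference implementation):

    import ast

    # Structural check: "code contains no eval()/exec() calls", implemented as
    # deterministic pattern-matching over the parsed output.

    def contains_banned_call(code: str, banned=("eval", "exec")) -> bool:
        """Return True if the code calls any banned builtin by name."""
        tree = ast.parse(code)
        for node in ast.walk(tree):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                if node.func.id in banned:
                    return True
        return False

    print(contains_banned_call("result = eval(user_input)"))  # True  -> policy violation
    print(contains_banned_call("result = int(user_input)"))   # False -> passes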

Semantic properties require understanding meaning. Examples: “reasoning step is logically valid,” “proof is complete.” These are verified by learned verifiers: models trained on explicit rubrics.

The boundary is not fixed. Many semantic properties, once context is sufficiently specified, reduce to structural verification. “Well-supported claim” is semantic; “every factual statement has a citation to a provided document” is structural. The key question is not “Is this property simple enough for a rule?” but “Is there a fact of the matter?” If yes, it’s verifiable. The only question is whether verification is symbolic or learned.
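A minimal sketch of that structural reading of “well-supported claim” (the claim and citation field names are assumptions loosely following the PredicateGraph example in Section 6):

    # Structural check: every factual claim must cite one of the provided documents.
    # The "citation" field and document-id representation are illustrative assumptions.

    def all_claims_cited(claims: list[dict], provided_doc_ids: set[str]) -> bool:
        return all(
            claim.get("citation") in provided_doc_ids
            for claim in claims
            if claim.get("modality") == "factual"
        )

    claims = [
        {"text": "Revenue grew 12% in 2024.", "modality": "factual", "citation": "doc_3"},
        {"text": "That seems encouraging.", "modality": "opinion", "citation": None},
    ]
    print(all_claims_cited(claims, {"doc_1", "doc_2", "doc_3"}))  # True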

Some properties resist specification even with fixed context:

• Aesthetic quality: “Is this prose beautiful?”

• Creative insight: “Is this idea novel?”

• Humor: “Is this funny?”

These require preference learning. But they are a small fraction of production failures. Most failures are specification failures, and capability engineering handles these. Section 7 shows that 89% of production issues in our case studies were objectively verifiable.

We propose completing the AI development stack with a third layer: capability engineering, defined as the systematic practice of defining, verifying, and training models against executable specifications.

Capability engineering completes the AI development stack by making requirements first-class, verifiable artifacts rather than implicit expectations. For structural properties, verification is deterministic: the same policy produces the same verdict on the same output, every time. For semantic properties, verification is probabilistic but calibrated: a learned verifier’s score reflects explicit rubric criteria, not implicit preferences. Both are orders of magnitude more reliable than “preferred by annotators who disagree 30-50% of the time.”

The distinction between intelligence and capability clarifies the stack’s purpose: Intelligence training asks: “Can the model solve this?” Capability engineering asks: “Does the model satisfy this specification?” The former is open-ended; the latter is verifiable.

This distinction also clarifies why RLVR is insufficient. A model can get the right answer while violating format, safety, or domain constraints. RLVR optimizes for task correctness (intelligence); CAPE optimizes for specification satisfaction (capability).

Capability engineering doesn’t replace prompt and context engineering. Instead, it completes them.

Consider a legal research assistant. Context engineering retrieves relevant case law and statutes. Prompt engineering structures the analysis format. Capability engineering guarantees outputs follow the firm’s citation style, jurisdiction constraints, and confidentiality rules: requirements no general-purpose model can satisfy out of the box.

Or consider a customer service agent. Context engineering retrieves account history and product details. Prompt engineering structures the response flow. Capability engineering guarantees the agent follows the company’s escalation protocol, refund policy, and brand voice, turning a generic model into one that operates as a trusted employee.

The binding constraint on post-training is not preference agreement but verification fidelity.

Preference-based methods face a structural ceiling:

Annotator disagreement. On subtle tasks, annotators disagree 30-50% of the time [Bai et al., 2022a]. Disagreement rates remain constant even with expert annotators and detailed guidelines, reflecting the challenge of generalizing training across contexts.

Reward model over-optimization. [Gao et al., 2023] show proxy scores increase while true quality degrades. The problem: reward models learn patterns in annotator preferences, including their biases and inconsistencies.

No path forward. More compute amplifies noise. The ceiling is structural: human judgment variance cannot be reduced by better models or more data.

The ceiling for preference methods is not merely disagreement, it includes structural algorithmic biases. [Fatemi et al., 2025] demonstrate that PPO exhibits length bias: when the model receives a negative reward, longer responses dilute the per-token penalty. The model “learns” that longer responses reduce punishment, even when those extra tokens don’t help correctness. This explains the observation that RL-trained models produce increasingly verbose outputs.

[Liu et al., 2025] identify two additional biases in GRPO:

  1. Response-length bias: Dividing advantage by response length makes long incorrect answers receive smaller penalties, so the model learns to generate longer bad answers.

  2. Difficulty-level bias: Normalizing by reward standard deviation causes easy or hard questions (with low reward variance) to be over-weighted.

These require algorithmic patches (such as Dr. GRPO; [Liu et al., 2025]) that add complexity and may introduce new failure modes.

CAPE sidesteps these pathologies. Verification produces binary pass/fail verdicts per policy; correction fixes failures at specific spans. There is no advantage normalization, no length-based reward shaping, no difficulty weighting. The training signal is: “this output satisfies the specification” or “this corrected output satisfies the specification.”

CAPE’s ceiling is verification fidelity, which manifests differently for structural and semantic properties. Both improve with scale.

For structural properties (symbolic policies), the bottleneck is extraction fidelity: how accurately can we parse model outputs into PredicateGraphs? Our experiments (Section 7.2) demonstrate that each percentage point reduction in extraction error corresponds to approximately 0.8 percentage points reduction in violation rate (r = 0.94). The ceiling is technical and falls as extraction improves.

For semantic properties (learned verifiers), the bottleneck is verifier capability. Our experiments (Section 7.5) show that learned verifier accuracy correlates strongly with downstream model quality (r = 0.84-0.87 within domains, r = 0.93 across domains; p < 0.001). Better verifiers directly translate to better models. The ceiling rises as verifier capability improves.

Preference disagreement reflects genuine human variance that doesn’t decrease with better models. Twelve months ago, extraction error rates were 30%+. CAPE would not have worked. Three developments changed this: improved instruction following, better structured generation, and longer context windows. As these continue to improve, CAPE becomes not just competitive with preference learning but the only method whose upper bound moves with model scale.

A scaling law describes a predictable relationship between resources and outcomes. Preference-based methods lack this property; CAPE has it.

Preference ceiling: fixed. Annotator disagreement reflects genuine variance in human judgment, not measurement error. When two annotators disagree about which response is “more helpful,” they are often both right according to their own values. More annotators, better guidelines, or larger reward models do not resolve disagreement, they measure it more precisely. The ceiling is epistemic, not technical.

Algorithmic bias ceiling: patched, not solved. Length bias in PPO and difficulty bias in GRPO are mathematical properties of the training objectives. Fixes like Dr. GRPO add corrections, but each patch may introduce new failure modes. There is no systematic relationship between compute and bias reduction.

Verification ceiling: falls with capability. Extraction error and verifier error are technical limitations. Extraction improves with better instruction following, longer context, and structured generation, all of which improve with scale. Learned verifiers improve with model capability, following standard scaling laws. The ceiling is technical, and technical ceilings fall.

CAPE benefits doubly from continued progress: better base models to train, and better extractors and verifiers to train them with. [Su et al., 2025] provide independent validation of verification-fidelity scaling. They demonstrate that:

  1. Binary verification judgments achieve high inter-model agreement (comparable to our κ = 0.98 for explicit policies) when expert references exist.

  2. Compact (7B) generative verifiers can provide reliable cross-domain rewards.

  3. Soft-scoring methods outperform binary rewards in free-form, unstructured scenarios.

Kimi k1.5 [Kimi Team et al., 2025] further validates that verification-based approaches scale: they achieve state-of-the-art reasoning without Monte Carlo tree search, value functions, or process reward models, demonstrating that when verification is reliable, simpler training pipelines suffice.

The convergent finding across independent work: invest in verification quality, not reward complexity.

6 The CAPE Protocol

CAPE operationalizes capability engineering through a closed loop: Specify → Verify → Correct → Train.

The core loop (Algorithm 1):
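As a rough, pseudocode-style sketch of that loop (the function names extract, evaluate, correct, and fine_tune are illustrative placeholders, not the paper’s reference implementation):

    # Sketch of one CAPE round: Specify -> Verify -> Correct -> Train.
    # `extract`, `evaluate`, `correct`, and `fine_tune` are placeholder callables.

    def cape_round(model, prompts, policies, extract, evaluate, correct, fine_tune):
        training_examples = []
        for prompt in prompts:
            output = model(prompt)
            graph = extract(output)                                    # PredicateGraph extraction
            failed = [p for p in policies if not evaluate(p, graph)]   # Verify
            if failed:
                output = correct(output, failed)                       # Correct violating spans
                if any(not evaluate(p, extract(output)) for p in policies):
                    continue                                           # re-verification failed: drop
            training_examples.append((prompt, output))                 # compliant (or corrected) target
        return fine_tune(model, training_examples)                     # Train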

The loop is identical for symbolic and learned verification, only the verification mechanism differs. Critically, this loop avoids the reward-shaping pathologies of PPO and GRPO: there is no advantage normalization, no length weighting, no difficulty adjustment. The signal is binary per policy: pass or fail.

The PredicateGraph is a structured intermediate representation of model output, exposing elements that policies reference.

Example: Model output for “What’s 15% of $47.30?” yields the PredicateGraph:

    {
      "schema_version": "1.0.0",
      "operations": [
        {"op_type": "MULTIPLY", "inputs": [47.30, 0.15], "output": 7.095}
      ],
      "tool_calls": [
        {"name": "calc", "arguments": {"value": 7.1}}
      ],
      "claims": [
        {"text": "Fifteen percent of $47.30 is 7.095", "modality": "factual"}
      ]
    }

Policies reference specific elements: “the tool call argument,” “the operation output.” The structure makes capability requirements expressible.

CPL is an executable specification language for structural properties:

    {
      "id": "policy.tool.calc_matches",
      "tier": "T1",
      "scope": {"kind": "tool_call", "filter": {"name": "calc"}},
      "where": [{"expr": "count(operations) > 0"}],
      "assert": [{"expr": "tool_call.arguments.value == last(operations).output"}],
      "on_violation": {"action": "CORRECT", "correction_hint": "Update to exact value"}
    }

Expression language (constrained for determinism and termination):

• Arithmetic: +, -, *, /, %

• Comparison: ==, !=, <, >, <=, >=

• Boolean: and, or, not

• Quantifiers: any(), all()

• Aggregation: count(), sum(), min(), max()

• Collection: first(), last(), filter()

The language is not Turing-complete. All expressions terminate. All evaluations are deterministic.
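As a worked illustration, here is a hand-translation of policy.tool.calc_matches into plain Python, applied to the PredicateGraph example above (this shows the intended semantics; it is not the CPL evaluator):

    # Hand-translation of policy.tool.calc_matches, applied to the example PredicateGraph.

    graph = {
        "operations": [{"op_type": "MULTIPLY", "inputs": [47.30, 0.15], "output": 7.095}],
        "tool_calls": [{"name": "calc", "arguments": {"value": 7.1}}],
    }

    def calc_matches(graph: dict) -> list[dict]:
        violations = []
        if len(graph["operations"]) > 0:                    # where: count(operations) > 0
            expected = graph["operations"][-1]["output"]    # last(operations).output
            for call in graph["tool_calls"]:                # scope: tool_call with name == "calc"
                if call["name"] == "calc" and call["arguments"]["value"] != expected:
                    violations.append({"policy_id": "policy.tool.calc_matches",
                                       "expected": expected,
                                       "actual": call["arguments"]["value"]})
        return violations

    print(calc_matches(graph))  # [{'policy_id': ..., 'expected': 7.095, 'actual': 7.1}]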

Policy tiers. Not all requirements carry equal weight. A model that produces the wrong arithmetic result has failed absolutely; a model that uses passive voice instead of active voice has merely violated a preference. Tiers encode this distinction:

• T1: Objective Correctness. Verifiable right/wrong: arithmetic, citations, factual accuracy. T1 violations are failures. These policies cannot be overridden.

• T2: Safety and Governance. Organizational and regulatory requirements: formulary adherence, jurisdiction constraints, PII protection. T2 violations are compliance failures. These policies cannot override T1.

• T3: Structural Preferences. Scaffolding that correlates with quality: reasoning before conclusions, explicit uncertainty quantification. T3 violations are quality degradations, not failures. These policies can be overridden by T1 or T2.

Conflicts arise when satisfying one policy requires violating another. For example, a T3 policy requiring verbose explanations might conflict with a T2 policy limiting response length for regulatory reasons. Resolution is fully deterministic: tier first (T1 > T2 > T3), then the explicit priority field within a tier, then policy ID alphabetically. No ambiguity, no judgment calls.
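A minimal sketch of this resolution order as a sort key (the numeric priority field and its default of 0 are assumptions based on the description above):

    # Deterministic conflict resolution: tier first (T1 > T2 > T3), then explicit
    # priority within a tier (higher wins, assumed default 0), then policy ID.

    TIER_RANK = {"T1": 0, "T2": 1, "T3": 2}

    def resolution_key(policy: dict):
        return (TIER_RANK[policy["tier"]], -policy.get("priority", 0), policy["id"])

    conflicting = [
        {"id": "policy.style.verbose_explanations", "tier": "T3"},
        {"id": "policy.regulatory.max_length", "tier": "T2", "priority": 5},
    ]
    winner = min(conflicting, key=resolution_key)
    print(winner["id"])  # policy.regulatory.max_length  (T2 overrides T3)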

Evaluation produces structured verdicts:

    {
      "output_id": "example_847",
      "policies_evaluated": 12,
      "policies_passed": 11,
      "violations": [{
        "policy_id": "policy.tool.calc_matches",
        "message": "7.1 != 7.095",
        "expected": 7.095,
        "actual": 7.1
      }]
    }

Correction strategies:

    Strategy        When Used          Success
    Deterministic   Expected known     99.7%
    Template        Element missing    97.3%
    Rewrite         Semantic change    94.6%

Deterministic patching handles cases like “argument 7.1 should be 7.095.” Template insertion handles missing elements. Constrained rewrite uses an LLM under minimality constraints when semantic changes are required, with re-verification before acceptance.
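A minimal sketch of deterministic patching driven by a verdict like the example above (for brevity the patch is applied to the PredicateGraph rather than the underlying output span; the logic is illustrative, not the reference implementation):

    # Deterministic patching: when the verdict carries the expected value,
    # replace the violating argument directly, then re-verify.

    def patch_tool_argument(graph: dict, violation: dict) -> dict:
        """Set the offending calc argument to the expected value from the verdict."""
        for call in graph["tool_calls"]:
            if call["name"] == "calc" and call["arguments"]["value"] == violation["actual"]:
                call["arguments"]["value"] = violation["expected"]
        return graph

    violation = {"policy_id": "policy.tool.calc_matches", "expected": 7.095, "actual": 7.1}
    graph = {"operations": [{"output": 7.095}],
             "tool_calls": [{"name": "calc", "arguments": {"value": 7.1}}]}
    patched = patch_tool_argument(graph, violation)
    print(patched["tool_calls"][0]["arguments"]["value"])  # 7.095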

Symbolic policies handle structural verification: pattern-matching over PredicateGraph elements. But many capability requirements are semantic: “the reasoning is logically valid,” “the proof is complete,” “the plan is feasible.” These require understanding meaning, not matching patterns.

The naive approach-train a classifier on human judgments-reintroduces the preference ceiling. Annotators disagree on what counts as “valid reasoning,” and the classifier learns their disagreement.

CAPE takes a different approach: train verifiers on explicit rubrics, not implicit preferences.

Rubric structure. A rubric decomposes a semantic property into discrete, describable levels:

    {
      "id": "verifier.reasoning.validity",
      "rubric": {
        "1.0": "All steps follow logically; no gaps or unsupported claims",
        "0.5": "Core argument sound but minor gaps in justification",
        "0.0": "Contains logical errors or non-sequiturs that invalidate conclusion"
      },
      "output_schema": {
        "issues": [{"location": "...", "description": "..."}],
        "score": "float"
      }
    }

The rubric is not a preference; it is a specification. “Contains logical errors that invalidate conclusion” has a fact of the matter. The verifier’s job is to detect that fact, not to predict annotator preferences.

Training procedure. Learned verifiers are trained in three stages:

  1. Rubric calibration. Expert annotators label 500-1,000 examples using the rubric. Inter-annotator agreement (κ > 0.7) validates that the rubric is sufficiently precise. If agreement is low, the rubric is refined.

  2. Supervised fine-tuning. A base model is fine-tuned to produce (score, issues) pairs given an output and rubric. The training objective rewards both score accuracy and issue identification. Crucially, the model must cite specific locations and describe specific problems.

  3. Meta-verification filtering. A held-out meta-verifier checks whether identified issues actually exist in the output. Training examples where the verifier hallucinated issues are downweighted or removed.
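A minimal sketch of this filtering step, assuming a simple (output, score, issues) representation and an illustrative meta_verifier callable (none of these names are the paper’s API):

    # Meta-verification filtering: keep a verifier training example only if the issues
    # it identifies can be confirmed in the output by a held-out meta-verifier.

    def filter_verifier_examples(examples, meta_verifier, max_hallucinated=0):
        kept = []
        for output_text, score, issues in examples:
            hallucinated = [i for i in issues if not meta_verifier(output_text, i)]
            if len(hallucinated) <= max_hallucinated:
                kept.append((output_text, score, issues))  # downweighting could be used instead
        return kept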

Why this works. The rubric converts a semantic judgment into a structured task. Instead of asking “is this good reasoning?” (subjective), we ask “does this contain logical errors that invalidate the conclusion?” (objective, given sufficient context). The verifier learns to detect specific failure modes described in the rubric, not to mimic annotator gestalt. [Su et al., 2025] validate this approach across medicine, chemistry, psychology, economics, and education, demonstrating that compact (7B) generative verifiers achieve high inter-model consistency without domain-specific annotation. Their soft-scoring method of computing rewards from token probabilities of assessment judgments complements our rubric-based approach. The key shared insight: explicit criteria yield reliable verification; implicit preferences do not.

Integration with symbolic policies. Both verification types feed the same training loop. A single output might be evaluated by symbolic policies (citation format, arithmetic correctness) and learned verifiers (reasoning validity, plan feasibility). All violations, whether they are structural or semantic, produce correction targets. The model learns to satisfy both.

Limitations. Learned verifiers are less reliable than symbolic policies. Our best verifiers achieve r = 0.87 correlation with expert judgment, compared to effectively r = 1.0 for deterministic policies. Meta-verification reduces but does not eliminate hallucinated issues. For high-stakes semantic properties, we recommend combining learned verification with human review rather than relying on verifiers alone.

Both verification types can fail. Meta-verification checks whether primary verification is accurate:

• For symbolic policies: Check that the PredicateGraph faithfully represents the output

• For learned verifiers: Check that the issues identified by the verifier actually exist in the output

The meta-verifier’s assessment of whether identified issues are genuine, R_meta, determines whether an example is kept, downweighted, or removed during training.

Impact: Without meta-verification, verifiers hallucinate issues 19% of the time. With it, hallucination drops to 4%.

We validate CAPE across 109,500 examples spanning structural verification (72,000 examples in 3 domains) and semantic verification (37,500 examples in 3 domains).

Arithmetic Tool Use: Calculator-equipped math problems from GSM8K and MATH datasets. Models must compute intermediate results and invoke calc() tool with exact values.

Code Safety: Python code generation from MBPP and HumanEval, augmented with synthetic examples containing unsafe patterns (eval(), exec(), SQL injection).

Citation Grounding: Question-answering with RAG context from Wikipedia. All factual claims must cite provided documents.

Argument Soundness: Multi-step reasoning requiring 2-5 steps. Expert annotation on 3,000 examples using explicit rubrics (Fleiss’ κ = 0.73).

Proof Validity: Mathematical proofs from MATH and miniF2F datasets requiring multi-step derivations. Rubric evaluates completeness (all steps present), logical validity (each step follows), and gap-free reasoning (no unjustified leaps). Expert annotation on 2,500 examples (κ = 0.71).

Code Correctness: Functional correctness of generated code from MBPP and HumanEval, evaluated beyond binary pass/fail. Rubric assesses logic soundness (algorithm is correct), edge case handling (boundary conditions addressed), and implementation quality (no subtle bugs). Expert annotation on 2,500 examples (κ = 0.68).

• Base model: Llama-4-8B-Instruct

• Batch size: 32

• Learning rate: 1 × 10⁻⁵ (cosine schedule)

• Epochs: 3

• Hardware: 8× H100 (80GB)

• Training time: ∼36 hours per method

All methods use identical compute budgets for fair comparison.

Preference-based:

• DPO [Rafailov et al., 2023]: 5 annotators per comparison, majority vote, β = 0.1

• Outcome RL: PPO [Schulman et al., 2017] with binary ground-truth rewards, clip ratio = 0.2

• Iterative DPO: Policy pass/fail converted to synthetic preferences, then standard DPO

Principle-based:

• CAI (general) [Bai et al., 2022b]: 16 principles from Anthropic, self-critique and revision

• CAI (structured): Domain-specific principles matching CAPE policies (e.g., “verify calculator argument matches computed value”)

Verification-based:

• Policy RL (dense): PPO with policy verdicts as reward (one component per policy)

• Best-of-N + Filter: Generate N = 8 candidates, select first passing all policies (inference only, no training)

• CAPE (open-weight): Full protocol with Llama-4-70B extractor

• CAPE (frontier): Full protocol with Claude Opus 4.5 extractor

All trained methods use identical compute budgets.

Table 6 presents violation rates across all methods.

Figure 2 shows DPO plateaus at ∼10% violation rate by step 2,000. CAPE continues improving throughout training, following power-law decay:

V(t) = V₀ · t^(−α) + ε,

where V(t) is the violation rate at step t, V₀ is the initial violation rate, α is the decay rate, and ε is the fitted asymptotic floor.
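As a small illustration of fitting this functional form to a measured violation-rate curve (the data below are synthetic placeholders; this is not the paper’s analysis code):

    # Fit V(t) = V0 * t**(-alpha) + eps to a violation-rate curve.
    import numpy as np
    from scipy.optimize import curve_fit

    def violation_curve(t, v0, alpha, eps):
        return v0 * t ** (-alpha) + eps

    steps = np.array([100, 500, 1000, 2000, 4000, 8000], dtype=float)
    rates = np.array([0.21, 0.10, 0.073, 0.055, 0.042, 0.032])  # made-up measurements

    (v0, alpha, eps), _ = curve_fit(violation_curve, steps, rates, p0=[1.0, 0.5, 0.01])
    print(f"V0={v0:.3f}, alpha={alpha:.3f}, floor eps={eps:.3f}")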

We evaluate CAPE on three semantic domains requiring learned verifiers: argument soundness, proof validity, and code correctness.

When CAPE is combined with DPO in hybrid training, the small violation increase (2.5% → 2.9%) reflects DPO occasionally trading structure for fluency, acceptable given the 7.6 percentage point preference gain. Policies establish the correctness floor; preferences maximize quality within constraints.

Section 3 argued that most capability requirements become objective once context is fixed. We validate this claim on four domain-specific requirements that appear subjective in the abstract but become verifiable given organizational context. These requirements appear subjective without context (“good medical advice”) but become objectively verifiable when context is fixed (“recommend only formulary drugs”). Prompting alone achieves 40-60% reduction in violations. CAPE achieves 87-99% reduction, confirming that contextual objectivity enables specification-based training.

Recent work demonstrates that RL-based improvements on reasoning benchmarks can be unstable. Changing random seeds shifts AIME24 scores by several percentage points [Wang et al., 2025]. We verify CAPE’s improvements are robust.

We train CAPE with 5 different random seeds and report variance: CAPE exhibits substantially lower variance than reward-based methods (σ = 0.2% vs σ = 1.6-2.1%). For symbolic policies, verification is deterministic:

• Same policy + same output = same verdict, always

• Across 10,000 test outputs, verdict variance = 0

• Variance arises only from training dynamics, not evaluation noise

For PredicateGraph extraction with greedy decoding (temperature=0):

• Extraction agreement across 5 runs: 99.6%

• With temperature=0.3: agreement drops to 97.4%

• All reported results use greedy decoding

CAPE improvements hold when training on random subsets: Training on random 25% subsets yields violation rates within 1.0% of full-data training, confirming that gains reflect systematic policy satisfaction rather than benchmark-specific artifacts. CAPE’s advantage over DPO is preserved across all data scales.

CAPE reduces capability-specific post-training costs by 5-20× by replacing per-example annotation with reusable specifications.

Annotation cost dominates preference methods. At $5-15 per preference comparison and 5 comparisons per example for quality, 10,000 training examples require $50,000-150,000 in annotation.

Policy authoring is a one-time cost. A domain expert and an engineer can author a policy pack in 2-5 days ($2,000-4,000 fully loaded). The policy then generates unlimited training signal. Our arithmetic policy pack, authored in 3 days, has generated training signal for 50,000+ examples across multiple training runs.

Verification infrastructure has negligible marginal cost. PredicateGraph extraction costs $0.02-0.04 per example with frontier models (structured output mode). Policy evaluation is deterministic code execution. For 10,000 examples: $200-400 total.

Compute costs are equivalent. Both methods fine-tune the same base model with similar dataset sizes. The compute cost ($8,000-12,000 for 8×H100 training) is identical.

Iteration cycles differ dramatically. Preference methods require multiple annotation-training-evaluation cycles to diagnose failures. When a model produces bad outputs, you know that something failed but not what, requiring new annotation to isolate the problem. Each cycle takes 2-4 weeks. CAPE’s explicit verdicts identify exactly which policies fail on which outputs: “policy.citation.factual_claims_cited failed on 847 of 10,000 examples.” Engineers fix the policy or the correction strategy, not the annotation guidelines. Our experiments required 1-2 iteration cycles versus 3-5 for DPO baselines.
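A minimal sketch of how such per-policy failure counts can be aggregated from the structured verdicts shown in Section 6 (field names follow that example; the helper itself is illustrative):

    # Aggregate structured verdicts into per-policy failure counts, e.g.
    # "policy.citation.factual_claims_cited failed on 847 of 10,000 examples".
    from collections import Counter

    def failure_counts(verdicts: list[dict]) -> Counter:
        counts = Counter()
        for verdict in verdicts:
            for violation in verdict.get("violations", []):
                counts[violation["policy_id"]] += 1
        return counts

    verdicts = [
        {"output_id": "example_1", "violations": []},
        {"output_id": "example_2", "violations": [{"policy_id": "policy.tool.calc_matches"}]},
    ]
    print(failure_counts(verdicts))  # Counter({'policy.tool.calc_matches': 1})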

Amortization compounds. Policy packs are reusable across model versions, organizations, and capability combinations. Our Arithmetic pack has been reused across multiple model versions and organizations, reducing effective per-deployment cost by an order of magnitude. The Citation pack works for any RAG application. As CapabilityBench grows, marginal policy cost falls toward the cost of downloading a JSON file.

This aligns with Dang & Ngo [2025], who achieved substantial RL improvements with 7,000 examples and $42 compute. Their finding reinforces ours: training compute is cheap; signal generation is the binding cost. Annotation scales O(n) with examples. Policies scale O(1): write once, apply everywhere.

We isolate CAPE’s contributions through four ablations testing alternative uses of verification signal. Table 17 summarizes the results.

We release the policy pack specification and initial packs under Apache 2.0; evaluation results will be published at https://capabilitybench.com after review.

• Opaque: A score of 78% tells you nothing about which capabilities passed or failed, or whether the failures matter for your use case

We release four initial policy packs spanning common deployment contexts. These are intended as starting points: organizations can adopt them directly, extend them with additional policies, or use them as templates for custom packs.

Each pack encodes requirements drawn from real deployment constraints: the Tool-Use pack validates argument types and schema conformance; the Code-Safety pack detects dangerous patterns like eval() and hardcoded secrets; the Citation pack enforces citation requirements for factual claims. Full policy definitions and test cases are available in the repository.

We invite the community to contribute policy packs for additional domains. As the registry grows, CapabilityBench becomes a shared resource for understanding which models are capable of what, against the specific requirements that determine whether deployment is possible.

Capability engineering, as validated here, covers:

• Structural verification with deterministic policies (arithmetic, format, citations, safety patterns)

• Semantic verification with explicit rubrics (reasoning validity, proof completeness, plan feasibility)

• Domain-specific requirements (formulary adherence, jurisdiction constraints, escalation protocols)

Our analysis shows 89% of deployment requirements are objectively verifiable once context is fixed. Hybrid training (Section 7.7) shows these approaches compose: CAPE establishes the correctness floor, preferences optimize quality within constraints.

Specification drift. Policies encode requirements at a point in time. As workflows, regulations, or product rules evolve, policies must be updated. Poor policy maintenance can create mismatches between organizational rules and model behavior. Mitigation: Versioned policy packs with deprecation warnings. Automated policy testing against evolving requirements.

Policy gaming. Models might learn to satisfy policy letter while violating policy spirit, producing technically compliant outputs that fail on unmeasured dimensions. Mitigation: Comprehensive policy coverage. Hybrid training with preferences to optimize unmeasured quality dimensions.

Our experiments cover six domains: three structural (arithmetic, code safety, citations) and three semantic (argument soundness, proof validity, code correctness). We have not validated on open-ended dialogue, creative writing, or complex multi-turn reasoning. The protocol should transfer (the loop is domain-agnostic), but empirical validation beyond reasoning tasks is incomplete.

Open question: For which semantic properties can rubrics achieve sufficient inter-annotator agreement (κ > 0.7) to outperform preference learning? Our evidence suggests reasoning tasks broadly qualify: argument soundness (κ = 0.73) and proof validity (κ = 0.71) meet this threshold, while code correctness (κ = 0.68) approaches it. [Su et al., 2025] provide further evidence that medical, chemistry, and economics domains achieve high verification agreement. The boundaries for creative and aesthetic tasks remain unclear.

Our ablations (Section 7.11) address most alternative uses of verification signal. Two approaches remain untested:

Process Reward Models. PRMs assign rewards to intermediate reasoning steps, similar to CAPE’s span-level violation detection [Lightman et al., 2023].

The rapid progress in reasoning benchmarks, from 13.4% to 79.8% on AIME in 18 months, demonstrates the success of scaling intelligence. Yet production failures persist. This is not a bug in intelligence training; it reflects different optimization targets. Intelligence training maximizes performance on open-ended challenges. Capability engineering ensures satisfaction of closed specifications. Both matter; neither subsumes the other.

CAPE’s contribution is operationalizing capability as a distinct, trainable property. The verification-fidelity scaling law (Section 5) applies to capability, not intelligence. Better verifiers yield more capable models: models that more reliably satisfy requirements, independent of whether those models become more intelligent in any general sense.

As discussed in Section 2.5, recent work questions whether RL induces reasoning or amplifies pre-existing capabilities. [Liu et al., 2025] find “Aha moments” in base models without RL; [Shah et al., 2025] show self-correction emerges during pre-training.

CAPE is agnostic to this debate. We do not claim to induce reasoning; we claim to enforce specifications. If reasoning emerges from pre-training, CAPE ensures it is applied correctly. If RL induces reasoning, CAPE ensures the induced capabilities satisfy requirements. The verification-fidelity scaling law holds regardless of capability origin.

This agnosticism is a feature, not a limitation. As the field continues debating how capabilities arise, practitioners need methods that work regardless. CAPE provides that: specification satisfaction is measurable, trainable, and improvable independent of the underlying capability mechanism.

This work reframes post-training as an optimization problem over requirements, not preferences. By introducing executable specifications as the unit of training, CAPE separates intelligence acquisition from capability enforcement and makes reliable deployment a tractable engineering problem rather than an empirical hope.

Explicit verification beats implicit preferences. Whether verification is symbolic (deterministic policies for structural properties) or learned (rubric-trained verifiers for semantic properties), the same loop applies: specify requirements → verify outputs → correct violations → train on corrections. This loop compounds because verification and training share the same specification. Better verification yields cleaner training signal; cleaner signal yields better models; better models improve verification; improved verification enables stricter specifications.

Critically, this loop avoids the pathologies of reward-based methods: no length bias from loss normalization, no difficulty weighting from advantage computation, no reward hacking from proxy optimization. The signal is direct: does this output satisfy this specification?

Preference-based methods plateau at human disagreement (30-50%), a structural ceiling that doesn’t fall with scale. Algorithmic biases in PPO and GRPO require manual patches that add complexity without guaranteeing improvement. CAPE’s ceiling is verification fidelity:

• For symbolic policies: extraction accuracy (r = 0.94 correlation with violation rate)

• For learned verifiers: rubric-following accuracy (r = 0.93 correlation with downstream quality)

Both improve with model scale. This means post-training compute investment yields guaranteed returns. Improve your verifier (symbolic or learned) and your generator improves. The feedback loop compounds rather than saturates.

CAPE operationalizes a crucial distinction:

• Intelligence: Can the model solve complex problems? (Open-ended, benchmark-measured)

• Capability: Does the model satisfy specific requirements? (Closed, specification-measured)

A model can be highly intelligent yet lack specific capabilities. RLHF and RLVR optimize for intelligence; CAPE optimizes for capability. Both matter. Hybrid training achieves both: CAPE establishes the correctness floor, preferences maximize quality within constraints.

CAPE is not “policies for simple tasks, preferences for complex tasks.” It is verification for objective properties and preference learning for subjective properties.

CAPE makes model improvement resemble traditional engineering: explicit requirements, verifiable correctness, traceable failures, validated fixes. The specification is the artifact that unifies development, evaluation, and deployment. For too long, AI capability has been an optimization target without a definition. CAPE provides the definition.

The ceiling is now technical, not human. The path forward is verification.

We provide the CAPE protocol, the PredicateGraph schema, the CPL specification language, initial policy packs, and a reference implementation under Apache 2.0, together with CapabilityBench as a public registry of capability evaluations.

Compared to inference-time rule enforcement, CAPE differs in three ways:

  1. CAPE’s errors are technical limitations that fall predictably with capability.

  2. Training, not filtering: Policies generate training signal. Rules typically reject at inference; CAPE teaches models to satisfy requirements.

  3. Compositional: Policies are composable, versionable, and auditable artifacts that can be reused across models, organizations, and capability combinations.


