Democratic Preference Alignment via Sortition-Weighted RLHF
Whose values should AI systems learn? Preference-based alignment methods such as RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systematically over-represent some demographics and under-represent others. We introduce Democratic Preference Optimization (DemPO), a framework that applies algorithmic sortition, the same mechanism used to construct citizens' assemblies, to preference-based fine-tuning. DemPO offers two training schemes. Hard Panel trains exclusively on preferences from a quota-satisfying mini-public sampled via sortition. Soft Panel retains all data but reweights each rater by their inclusion probability under the sortition lottery. We prove that Soft Panel weighting recovers the expected Hard Panel objective in closed form. Using a public preference dataset that pairs human judgments with rater demographics, together with a 75-clause constitution independently elicited from a representative United States panel, we evaluate Llama models from 1B to 8B parameters fine-tuned under each scheme. Across six aggregation methods, the Hard Panel consistently ranks first and the Soft Panel consistently outperforms the unweighted baseline, with effect sizes growing with model capacity. These results show that enforcing demographic representativeness at the preference-collection stage, rather than correcting post hoc, yields models whose behavior better reflects values elicited from representative publics.
💡 Research Summary
The paper tackles a fundamental governance question in preference‑based AI alignment: whose values should be learned when human feedback is used to fine‑tune large language models? Existing RLHF pipelines typically rely on convenience samples of annotators, which are demographically skewed and therefore embed systematic biases into the learned reward model. To address this, the authors introduce Democratic Preference Optimization (DemPO), a framework that imports the political science tool of algorithmic sortition—random selection subject to quota constraints—into the RLHF workflow.
DemPO defines two concrete training regimes. The “Hard Panel” approach draws a single quota‑feasible mini‑public (a panel) from the pool of raters using the LEXIMIN sortition algorithm, then restricts training to the preference data contributed by raters in that panel. Each panel member’s contributions are normalized by the number of comparisons they supplied, ensuring a “one person, one voice” principle. The “Soft Panel” approach retains the full dataset but re‑weights each rater’s loss by their inclusion probability under the same sortition lottery (π_i). The authors prove that, when the weight function is the identity (w_i = π_i), the expected Soft‑Panel objective exactly matches the expectation of the Hard‑Panel objective, providing a clean theoretical link between the two schemes.
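The Soft-Panel objective described above can be sketched as a weighted pairwise preference loss. This is a minimal illustration, not the paper's exact implementation: the function name, the Bradley-Terry-style negative log-sigmoid loss, and the specific weight form `w_i = π_i / n_i` (inclusion probability divided by the rater's comparison count, combining the sortition weighting with the "one person, one voice" normalization) are assumptions for the sketch.

```python
import numpy as np

def soft_panel_loss(logit_margins, rater_ids, pi, counts):
    """Soft-Panel objective: a weighted Bradley-Terry-style preference loss.

    logit_margins[k] is the reward margin r(chosen_k) - r(rejected_k) for
    comparison k; rater_ids[k] identifies the rater who supplied it.
    Each comparison is weighted by w_i = pi[i] / counts[i]: the rater's
    sortition inclusion probability divided by their number of comparisons,
    so every rater contributes one expected "voice" regardless of how many
    annotations they provided. (Sketch under assumed names; the paper's
    exact loss may differ.)
    """
    w = np.array([pi[i] / counts[i] for i in rater_ids])
    nll = np.log1p(np.exp(-np.asarray(logit_margins)))  # -log sigmoid(margin)
    return float(np.sum(w * nll) / np.sum(w))           # self-normalized mean
```

Because the mean is self-normalized by the total weight, rescaling all inclusion probabilities by a constant leaves the loss unchanged; only the relative weighting across raters matters.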
The empirical evaluation uses the PRISM alignment dataset, which contains multi‑turn conversational prompts, model completions, human preference judgments, and detailed demographic attributes for each rater. In addition, a 75‑clause constitution elicited from a representative U.S. panel serves as an external benchmark of democratic values. The authors fine‑tune LLaMA models of four sizes (1B, 2B, 4B, and 8B parameters) under four conditions: (1) Full – standard unweighted RLHF on the entire PRISM pool, (2) US‑Rep – training only on the subset of raters who belong to a pre‑selected representative panel, (3) Hard Panel – training on a single sortition‑drawn panel, and (4) Soft Panel – training on the full pool with inclusion‑probability weighting.
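The four conditions differ only in which raters are kept and how they are weighted, which can be summarized in one small helper. All names here (the function, the condition labels, the `panel` and `us_rep` sets) are hypothetical; the PRISM dataset's actual fields differ in practice.

```python
def condition_weights(raters, condition, pi, panel, us_rep):
    """Per-rater training weights for the four conditions compared in the
    paper (names hypothetical, for illustration only).

    Full:       all raters, unit weight (standard unweighted RLHF).
    US-Rep:     only raters in a pre-selected representative panel.
    Hard Panel: only raters in one sortition-drawn panel.
    Soft Panel: all raters, weighted by sortition inclusion probability pi_i.
    """
    if condition == "full":
        return {r: 1.0 for r in raters}
    if condition == "us_rep":
        return {r: 1.0 for r in raters if r in us_rep}
    if condition == "hard_panel":
        return {r: 1.0 for r in raters if r in panel}
    if condition == "soft_panel":
        return {r: pi[r] for r in raters}
    raise ValueError(f"unknown condition: {condition}")
```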
Performance is measured by how well the fine‑tuned models align with the constitutional criteria, using six distinct rank‑aggregation methods (Bradley‑Terry, Plackett‑Luce, Borda, Copeland, Kemeny‑Young, and Mallows). Across all model sizes and aggregation families, Hard Panel consistently achieves the highest alignment scores, and the advantage grows with model capacity. Soft Panel uniformly outperforms the Full baseline, confirming that demographic weighting improves alignment without discarding data. US‑Rep performs comparably to Soft Panel but falls short of Hard Panel, indicating that the random‑draw nature of sortition adds robustness beyond simply selecting a pre‑identified representative subset.
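To make the aggregation step concrete, here is a sketch of one of the six methods, the Borda count, applied to per-clause rankings of the training conditions. The candidate names and example rankings below are illustrative, not results from the paper.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Borda count, one of the six rank-aggregation methods evaluated in
    the paper. Each ranking lists candidates best-first; a candidate at
    position p in a ranking of length m earns m - 1 - p points. Returns
    candidates ordered by total score, best first (ties broken by name).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for p, cand in enumerate(ranking):
            scores[cand] += m - 1 - p
    return sorted(scores, key=lambda c: (-scores[c], c))
```

A usage example with hypothetical per-criterion rankings: `borda_aggregate([["HardPanel", "SoftPanel", "Full"], ["HardPanel", "Full", "SoftPanel"], ["SoftPanel", "HardPanel", "Full"]])` places HardPanel first. The other five methods (Bradley-Terry, Plackett-Luce, Copeland, Kemeny-Young, Mallows) aggregate the same pairwise or ranking evidence under different statistical or voting-theoretic models.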
The paper also details practical engineering choices that stabilize training: per‑rater loss normalization by the number of comparisons, self‑normalized weighted means within each minibatch, and distributed aggregation of weight denominators. Diagnostics of inclusion probabilities (π_i) show that the LEXIMIN procedure provides near‑max‑min protection for scarce demographic groups and symmetry across indistinguishable raters.
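The self-normalized weighted mean with distributed denominator aggregation can be sketched as follows. This is an assumption-laden simplification: real training would use an all-reduce primitive (e.g. in a distributed framework) rather than a Python loop over worker shards, but the arithmetic is the same.

```python
def global_weighted_mean(shards):
    """Self-normalized weighted mean across distributed workers.

    Each shard holds (losses, weights) for one worker's minibatch slice.
    Rather than normalizing locally, each worker computes partial numerator
    and denominator sums; these are aggregated globally (here: a plain sum
    standing in for an all-reduce), so the result matches what a single
    worker holding the full batch would compute.
    """
    num = 0.0
    den = 0.0
    for losses, weights in shards:
        num += sum(l * w for l, w in zip(losses, weights))
        den += sum(weights)
    return num / den
```

The key property is invariance to how the batch is sharded: summing numerators and denominators before dividing avoids the bias that would arise from averaging locally normalized means across workers with unequal total weight.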
Key insights include: (1) Enforcing representativeness at the data‑collection stage yields stronger and more consistent alignment than post‑hoc re‑weighting; (2) Soft Panel offers a data‑efficient alternative that retains all annotations while achieving the same expected objective as Hard Panel; (3) The benefits of democratic weighting amplify with larger model capacities, suggesting that as models become more expressive they are better able to internalize the nuanced preferences of a demographically balanced electorate.
Limitations are acknowledged: the quota constraints are applied only to marginal distributions (one attribute at a time), ignoring joint demographic intersections; only a single panel is sampled for Hard Panel, which may introduce variance; and the study focuses on U.S. demographic categories and a specific constitutional benchmark, leaving open the question of cross‑cultural generalization.
Future work is proposed along several axes: extending sortition to enforce joint demographic constraints, developing dynamic or sequential panel selection to capture evolving public values, testing the framework in high‑stakes domains such as healthcare or law, and integrating sortition‑based feedback collection into formal AI governance policies.
In sum, DemPO demonstrates that algorithmic sortition can be operationalized within RLHF to produce language models whose behavior more faithfully reflects the values of a representative public, offering both a theoretically grounded and empirically validated path toward more democratic AI alignment.