Aligning Large Language Model Behavior with Human Citation Preferences

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

Most services built on powerful large-scale language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of which reference documents to link to outputs. However, how LLMs recognize cite-worthiness, and how this process should be controlled, remain underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar tendency. We also find that current models are as much as 27% more likely than humans to add citations to text that is explicitly marked as needing citations on sources such as Wikipedia, and this overemphasis reduces alignment accuracy. Conversely, models systematically underselect numeric sentences (by −22.6% relative to humans) and sentences containing personal names (by −20.1%), categories for which humans typically demand citations. Furthermore, experiments with Direct Preference Optimization demonstrate that model behavior can be calibrated to better match human citation preferences. We expect this study to provide a foundation for more fine-grained investigations into LLM citation preferences.


💡 Research Summary

The paper investigates whether large language models (LLMs) generate citations in a way that matches human preferences for “cite‑worthiness.” While many recent works focus on which external documents should be linked, this study asks which pieces of generated text deserve a citation at all, and how controllable that behavior is.

To answer these questions, the authors first built a human‑annotated dataset. They harvested 6,000 sentences from Wikipedia that carry inline‑template tags (e.g., “Citation needed”, “Sic”, “Doubt”). After reviewing 19 raw tags, they regrouped them into eight high‑level citation‑motivation categories: Missing Information, Sic, Doubt, Vague, POV, Medical Content, Jargon, and Unclear. Each category contributed 750 sentences, and all possible unordered pairs of distinct categories (28 combinations) were balanced, yielding 2,596 valid sentence pairs after quality control. A pool of 402 annotators performed pairwise preference judgments, indicating which sentence in each pair should receive a citation. The resulting human preference matrix shows that medical content is most frequently preferred for citation (e.g., 75.9% over Vague), while Unclear and Jargon also attract citations because they serve as “anchors of meaning.”
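The pair-construction step can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the function name and the judgment-tuple layout are assumptions. It enumerates the 28 unordered category pairs and tallies pairwise wins into a preference matrix of the kind the paper reports:

```python
from itertools import combinations

# The eight citation-motivation categories described in the paper.
CATEGORIES = [
    "Missing Information", "Sic", "Doubt", "Vague",
    "POV", "Medical Content", "Jargon", "Unclear",
]

# All unordered pairs of distinct categories: C(8, 2) = 28 combinations.
pairs = list(combinations(CATEGORIES, 2))
assert len(pairs) == 28

def preference_matrix(judgments):
    """Tally pairwise wins into a {winner: {loser: count}} matrix.

    `judgments` is an iterable of (category_a, category_b, winner)
    tuples, where `winner` is whichever of the two categories the
    annotator judged more citation-worthy.
    """
    wins = {c: {d: 0 for d in CATEGORIES if d != c} for c in CATEGORIES}
    for cat_a, cat_b, winner in judgments:
        loser = cat_b if winner == cat_a else cat_a
        wins[winner][loser] += 1
    return wins
```

Normalizing each cell by the number of judgments for that pair would give preference rates such as the 75.9% Medical-Content-over-Vague figure quoted above.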

With this benchmark the authors evaluated 11 LLMs spanning open‑source (Mistral Small/Large, Llama‑1B/3B/70B, DeepSeek Chat) and closed‑source (GPT‑5, Claude Sonnet 4, Gemini 2.5 Flash, Qwen Max, Command R+) families. Models were prompted to choose the more citation‑worthy sentence in each pair, and agreement with human choices was measured. Overall agreement hovered around 60%, indicating only modest alignment. Larger models performed better (e.g., DeepSeek Chat 62.7%, Llama‑70B 61.6%), while the smallest, Llama‑1B, performed at the random baseline (~50%).
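The agreement metric itself is straightforward. As a minimal sketch (the encoding of choices as 1/2 is an assumption, not the paper's format), agreement is simply the fraction of pairs on which the model picks the same sentence as the human judgment:

```python
def agreement_rate(human_choices, model_choices):
    """Fraction of pairs on which the model picked the same sentence
    as the human judgment (1 = first sentence, 2 = second sentence)."""
    matches = sum(h == m for h, m in zip(human_choices, model_choices))
    return matches / len(human_choices)

# A model agreeing on 3 of 4 pairs scores 0.75; random guessing on a
# binary choice converges to ~0.5, the baseline the smallest model hits.
rate = agreement_rate([1, 2, 1, 2], [1, 2, 2, 2])
```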

Two systematic biases emerged. First, sentences explicitly marked “Citation needed” on Wikipedia were over‑cited by all models except the smallest: selection rates were 10–27% higher than human rates, suggesting that the prevalence of this tag in pre‑training data leads models to overestimate cite‑worthiness. Second, models under‑cite sentences containing numeric expressions (−9.8% to −22.6% relative to humans) and those containing personal names (−3.9% to −20.1%). These are precisely the categories in which humans demand citations for precision or verification, indicating a gap in the models’ risk assessment.
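These per-attribute gaps can be computed as model-minus-human selection rates. A minimal sketch, assuming pairs in which exactly one sentence carries the attribute of interest (e.g., a numeric expression), with each entry recording whether that sentence was chosen:

```python
def selection_rate(picked_has_attr):
    """Share of pairs in which the attribute-bearing sentence was picked.

    Each boolean covers one pair where exactly one sentence has the
    attribute (numeric expression, personal name, "Citation needed"
    tag, ...) and is True if that sentence was chosen.
    """
    return sum(picked_has_attr) / len(picked_has_attr)

# Toy data: humans pick the numeric sentence in 3 of 4 pairs (75%),
# the model in only 1 of 4 (25%), so the gap is negative (under-citing).
human = [True, True, True, False]
model = [True, False, False, False]
gap = selection_rate(model) - selection_rate(human)
```

A negative gap corresponds to the under-citation of numeric and name-bearing sentences reported above; a positive gap to the over-citation of “Citation needed” sentences.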

The authors then applied Direct Preference Optimization (DPO), a fine‑tuning method that optimizes the model directly on pairwise preference data, without training a separate reward model or running reinforcement learning. After DPO training, overall agreement improved by 11.8 percentage points, and the over‑citation of “Citation needed” sentences dropped by 5–7% relative to human rates. This demonstrates that LLM citation behavior can be calibrated post‑hoc to better reflect user expectations, mitigating biases inherited from the pre‑training corpus.
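The DPO objective can be illustrated per preference pair. This is the standard DPO loss, not code from the paper; the variable names and the beta value are illustrative. Given the policy's and a frozen reference model's log-probabilities of the chosen (w) and rejected (l) responses, the loss pushes the policy's chosen-over-rejected margin above the reference's:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected).

    When the policy's margin exceeds the reference's, the loss falls
    below log(2); beta controls how strongly deviations are rewarded.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Averaging this loss over the human-annotated pairs and backpropagating through the policy's log-probabilities is what nudges the model's citation choices toward the human preference matrix.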

Key contributions are: (1) the first dataset explicitly targeting human citation‑worthiness judgments across diverse content types; (2) a systematic comparison of many state‑of‑the‑art LLMs against human preferences, revealing both scaling benefits and persistent biases; (3) empirical evidence that training‑data artifacts (e.g., Wikipedia “Citation needed” tags) skew model behavior; (4) a proof‑of‑concept that DPO can align LLMs with human citation preferences.

Limitations include reliance on Wikipedia‑derived sentences, which may not generalize to other domains (legal, scientific literature) or to real‑world user interactions where citation needs are context‑dependent. Future work should expand the taxonomy to more domains, incorporate live user feedback loops, and explore optimal trade‑offs between citation density and user satisfaction. The study lays groundwork for fine‑grained control of LLM citation policies, an essential step toward trustworthy, user‑aligned generative AI.

