AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights
As artificial intelligence (AI) tools become widely adopted, large language models (LLMs) are increasingly involved on both sides of decision-making processes, ranging from hiring to content moderation. This dual adoption raises a critical question: do LLMs systematically favor content that resembles their own outputs? Prior research in computer science has identified self-preference bias – the tendency of LLMs to favor their own generated content – but its real-world implications have not been empirically evaluated. We focus on the hiring context, where job applicants often rely on LLMs to refine resumes, while employers deploy them to screen those same resumes. Using a large-scale controlled resume correspondence experiment, we find that LLMs consistently prefer resumes generated by themselves over those written by humans or produced by alternative models, even when content quality is controlled. The bias against human-written resumes is particularly substantial, with self-preference bias ranging from 67% to 82% across major commercial and open-source models. To assess labor market impact, we simulate realistic hiring pipelines across 24 occupations. These simulations show that candidates using the same LLM as the evaluator are 23% to 60% more likely to be shortlisted than equally qualified applicants submitting human-written resumes, with the largest disadvantages observed in business-related fields such as sales and accounting. We further demonstrate that this bias can be reduced by more than 50% through simple interventions targeting LLMs’ self-recognition capabilities. These findings highlight an emerging but previously overlooked risk in AI-assisted decision making and call for expanded frameworks of AI fairness that address not only demographic-based disparities, but also biases in AI-AI interactions.
💡 Research Summary
This paper introduces and empirically validates a novel form of algorithmic bias termed “self‑preference bias,” which arises when large language models (LLMs) are used both to generate and to evaluate content in the same decision‑making pipeline. The authors focus on algorithmic hiring, a domain where job seekers increasingly rely on LLMs to draft or polish resumes, while employers deploy comparable LLMs to screen those resumes.
Research Design
The authors conduct a two‑stage study. First, they assemble a real‑world dataset of 2,245 human‑written resumes collected before the widespread adoption of generative AI. For each resume, they generate one or two counterfactual versions using seven state‑of‑the‑art LLMs (GPT‑4o, GPT‑4‑turbo, GPT‑4o‑mini, LLaMA 3.3‑70B, Mistral‑7B, Qwen‑2.5‑72B, DeepSeek‑V3). Content quality is controlled through pre‑screening metrics such as job relevance, core‑skill coverage, and grammatical correctness, ensuring that the human and AI‑generated versions are substantively equivalent.
Next, each LLM is asked to act as an evaluator, comparing its own generated resume against the human version (LLM‑vs‑Human) and against resumes generated by other LLMs (LLM‑vs‑LLM). Preference is measured as the proportion of times the evaluator ranks its own output higher.
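The preference metric described above can be sketched as a simple pairwise-comparison tally. This is an illustrative reconstruction, not the authors' actual evaluation code: the `evaluate` callable stands in for a hypothetical LLM judge, and presentation order is randomized to control for position bias (a standard precaution in pairwise LLM evaluation).

```python
import random

def self_preference_rate(pairs, evaluate, rng=random.Random(0)):
    """Fraction of pairwise comparisons in which the evaluator picks
    its own output. `evaluate(resume_a, resume_b)` returns 0 or 1 for
    the preferred position; the order in which the own vs. other
    resume is shown is randomized to control for position bias."""
    own_wins = 0
    for own, other in pairs:
        if rng.random() < 0.5:
            a, b, own_pos = own, other, 0
        else:
            a, b, own_pos = other, own, 1
        if evaluate(a, b) == own_pos:
            own_wins += 1
    return own_wins / len(pairs)

# Toy stand-in judge that always prefers the longer resume.
pairs = [("long own resume text", "short"), ("own v2 text here", "hr")]
rate = self_preference_rate(pairs, lambda a, b: 0 if len(a) >= len(b) else 1)
```

A rate near 0.5 would indicate no systematic preference; the paper's reported 65%–82% rates correspond to values of 0.65–0.82 on this scale.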
Key Findings – Preference Experiments
Across most large models, self-preference is strong: the probability of an evaluator preferring its own resume ranges from 65% to 82% after quality controls, with GPT-4o exceeding 80%. The bias is less consistent in cross-model comparisons: DeepSeek-V3 shows the strongest LLM-vs-LLM self-preference (up to 69% against LLaMA, 28% against GPT-4o), while GPT-4o and LLaMA display no systematic advantage over other models. This pattern suggests the bias is driven not merely by model size but by architectural or training differences.
Simulation of Hiring Pipelines
To assess labor-market impact, the authors simulate capacity-constrained shortlisting pipelines for 24 occupations spanning business (accounting, sales, finance), technical (automotive, agriculture), and creative (arts) fields. In each simulation, candidates using the same LLM as the evaluator are 23%–60% more likely to be shortlisted than equally qualified candidates submitting human-written resumes. The effect is most pronounced in business-related occupations, suggesting that industries that already rely heavily on LLM-driven analytics may experience amplified inequities.
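A capacity-constrained shortlisting pipeline of this kind can be illustrated with a small Monte Carlo sketch. The details here are assumptions for illustration, not the paper's actual simulation: candidates have i.i.d. true quality, half submit resumes written with the evaluator's own LLM, and the evaluator adds a fixed score bonus to same-LLM resumes before selecting the top k.

```python
import random

def shortlist_rates(n_sims=2000, pool=20, k=5, bias=0.3,
                    rng=random.Random(1)):
    """Monte Carlo sketch of a capacity-constrained shortlist.
    Half the pool uses the evaluator's own LLM and receives a
    small score bonus `bias`; all candidates draw i.i.d. true
    quality. Returns (llm_rate, human_rate): per-group
    probabilities of making the top-k shortlist."""
    llm_hits = human_hits = 0
    for _ in range(n_sims):
        cands = [("llm", rng.random() + bias) for _ in range(pool // 2)]
        cands += [("human", rng.random()) for _ in range(pool // 2)]
        shortlisted = sorted(cands, key=lambda c: c[1], reverse=True)[:k]
        llm_hits += sum(1 for g, _ in shortlisted if g == "llm")
        human_hits += sum(1 for g, _ in shortlisted if g == "human")
    n = n_sims * (pool // 2)
    return llm_hits / n, human_hits / n

llm_rate, human_rate = shortlist_rates()
```

Even a modest per-resume bonus translates into a large gap in shortlisting probability, because only k of the pool advance; this is the mechanism behind the 23%–60% advantage reported above.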
The authors argue that repeated cycles of such bias could create a "lock-in" effect: resume styles aligned with the dominant LLM become de facto standards, narrowing the diversity of applicant signals and potentially reinforcing existing socioeconomic disparities.
Mitigation Strategies
Two low-cost interventions are evaluated: (1) a system prompt that explicitly instructs the evaluator LLM to ignore the resume's origin and judge only its substantive content; (2) a majority-vote ensemble that combines the primary evaluator with several smaller LLMs exhibiting weaker self-recognition. Both approaches substantially reduce LLM-vs-Human self-preference, achieving relative reductions of 17%–63% across models. The ensemble method, in particular, dilutes the dominant model's stylistic bias by leveraging the diversity of smaller models.
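The majority-vote ensemble can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: each evaluator is a hypothetical callable returning 0 (prefers resume A) or 1 (prefers resume B), and the ensemble takes the majority, so one model's stylistic self-preference is outvoted when the other evaluators disagree.

```python
from collections import Counter

def ensemble_preference(resume_a, resume_b, evaluators):
    """Majority vote over several evaluator models. Each evaluator
    returns 0 (prefers A) or 1 (prefers B); ties break toward 0.
    Diverse smaller models dilute any single model's self-preference."""
    votes = Counter(ev(resume_a, resume_b) for ev in evaluators)
    return 1 if votes[1] > votes[0] else 0

# Toy evaluators: one blindly biased toward A, two judging on length.
evals = [lambda a, b: 0,
         lambda a, b: 0 if len(a) >= len(b) else 1,
         lambda a, b: 0 if len(a) >= len(b) else 1]
choice = ensemble_preference("short", "a much longer resume", evals)
```

Here the biased evaluator's vote for A is overridden by the two length-based judges, so the ensemble prefers B; the same dynamic underlies the reported 17%–63% reductions.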
Implications
The study expands the fairness literature beyond demographic bias, highlighting interactional bias that emerges from AI‑AI dynamics. For practitioners, the findings suggest that firms should avoid using the same LLM for both resume creation and screening, or at minimum implement the proposed prompting or ensemble techniques. Regulators may need to incorporate guidance on AI self‑preference into existing AI‑fairness frameworks. Model developers are urged to incorporate self‑recognition controls into training pipelines to prevent inadvertent favoritism toward self‑generated text.
Limitations and Future Work
The analysis is limited to the specific set of LLMs available in 2025; future work should test newer architectures and multimodal models. Longitudinal field studies using actual hiring data would help quantify real‑world lock‑in effects. Additionally, exploring hybrid human‑AI evaluation processes could reveal whether human oversight mitigates or amplifies self‑preference bias.
In sum, the paper provides the first large‑scale empirical evidence that LLMs exhibit a systematic self‑preference bias in resume screening, quantifies its potential market impact, and demonstrates practical, scalable interventions to curb the bias, thereby contributing a crucial new dimension to AI fairness research and practice.