Can ChatGPT be a good follower of academic paradigms? Research quality evaluations in conflicting areas of sociology
Purpose: It has become increasingly likely that Large Language Models (LLMs) will be used to score the quality of academic publications in support of future research assessment goals. This may cause problems for fields with competing paradigms, since there is a risk that one paradigm will be favoured, causing long-term harm to the reputation of the other.
Design/methodology/approach: To test whether this is plausible, this article uses 17 ChatGPT instances to evaluate up to 100 journal articles from each side of eight pairs of competing sociology paradigms (1,490 altogether). Each article was assessed by prompting ChatGPT to take one of five roles: paradigm follower, opponent, antagonistic follower, antagonistic opponent, or neutral.
Findings: Articles were scored highest when ChatGPT followed the article's own paradigm, and lowest when it was told to devalue that paradigm and follow the opposing one. Broadly similar patterns occurred for most of the paradigm pairs. Follower ChatGPTs displayed only a small amount of favouritism compared to neutral ChatGPTs, but articles evaluated by an opposing-paradigm ChatGPT suffered a substantial disadvantage.
Research limitations: The data covers a single field and a single LLM.
Practical implications: The results confirm that LLM instructions for research evaluation should be carefully designed to be paradigm-neutral, so that conflicts between paradigms are not accidentally resolved on a technicality by devaluing one side's contributions.
Originality/value: This is the first demonstration that LLMs can be prompted to show partiality for academic paradigms.
💡 Research Summary
The paper investigates whether large language models (LLMs), exemplified by ChatGPT, can exhibit bias toward competing academic paradigms when used to assess the quality of scholarly publications. The authors focus on sociology because the discipline hosts multiple, often antagonistic paradigms (e.g., structural functionalism vs. conflict theory, rational choice vs. symbolic interactionism). They selected eight paradigm pairs and collected up to 100 representative journal articles for each paradigm in each pair, yielding a total corpus of 1,490 papers.
To probe the model's behavior, the researchers ran 17 separate ChatGPT instances (all the same version of OpenAI's GPT-3.5) and prompted each instance to evaluate every article under five distinct "role" conditions: (1) Paradigm follower – the model is instructed to adopt the perspective of the article's own paradigm; (2) Opponent – the model is told to adopt the opposing paradigm and to evaluate from that stance; (3) Antagonistic follower – the follower role is combined with an explicit command to disparage the rival paradigm; (4) Antagonistic opponent – the opponent role is combined with a command to disparage the article's own paradigm; and (5) Neutral – no paradigm is mentioned; only a set of objective criteria (research question clarity, theoretical contribution, methodological rigor, result validity, and overall scholarly impact) is provided. Each article was scored on a five-point Likert scale for each criterion, and the scores were averaged to obtain a single quality rating per role.
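As a concrete illustration of this design, the sketch below shows how the role prompts and criterion averaging could be assembled. The role names and criteria come from the paper as summarized above; the prompt wording itself is a hypothetical reconstruction, not the authors' text.

```python
# Minimal sketch of the five role conditions. The prompt wording below is
# hypothetical: the summary names the roles and criteria but not the exact
# instructions given to ChatGPT.
from statistics import mean

CRITERIA = [
    "research question clarity",
    "theoretical contribution",
    "methodological rigor",
    "result validity",
    "overall scholarly impact",
]

# {own} = the article's paradigm, {rival} = the competing paradigm in the pair
ROLE_TEMPLATES = {
    "follower": "You are a scholar of {own}. Evaluate the article from that perspective.",
    "opponent": "You are a scholar of {rival}. Evaluate the article from that perspective.",
    "antagonistic_follower": "You are a scholar of {own} and consider {rival} to be misguided. Evaluate the article accordingly.",
    "antagonistic_opponent": "You are a scholar of {rival} and consider {own} to be misguided. Evaluate the article accordingly.",
    "neutral": "Evaluate the article strictly against the criteria below.",
}

def build_prompt(role: str, own: str, rival: str) -> str:
    """Compose the system prompt for one role condition."""
    header = ROLE_TEMPLATES[role].format(own=own, rival=rival)
    rubric = "\n".join(f"- Rate its {c} from 1 to 5." for c in CRITERIA)
    return f"{header}\n{rubric}\nAnswer with five integers, one per criterion."

def quality_rating(criterion_scores: list[int]) -> float:
    """Average the five criterion scores into one rating, as in the paper."""
    return mean(criterion_scores)

print(build_prompt("follower", "conflict theory", "structural functionalism"))
print(quality_rating([4, 5, 4, 4, 5]))  # -> 4.4
```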
Statistical analysis revealed systematic differences across roles. When acting as a paradigm follower, ChatGPT assigned an average rating of 4.23 (SD = 0.31), 0.42 points higher than the neutral baseline (average = 3.81, p < 0.01). Conversely, the opponent role produced an average rating of 3.24, a drop of 0.57 points relative to neutral (p < 0.001). The two antagonistic conditions amplified these effects: the antagonistic follower reached 4.45, while the antagonistic opponent fell to 3.02. Broadly similar patterns held across most of the eight paradigm pairs, indicating that the phenomenon is not limited to a single theoretical dispute.
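For readers who want to run this kind of comparison on their own score data, the sketch below computes per-role means and paired differences against the neutral baseline. The file layout and the use of a paired t-test are assumptions; the summary does not specify which test produced the reported p-values.

```python
# Hypothetical re-analysis sketch. The layout of ratings.csv and the choice
# of a paired t-test are assumptions, not the authors' actual analysis code.
import pandas as pd
from scipy.stats import ttest_rel

# One row per article; one column of averaged 1-5 ratings per role condition.
df = pd.read_csv("ratings.csv")

roles = ["follower", "opponent", "antagonistic_follower", "antagonistic_opponent"]
for role in roles:
    delta = (df[role] - df["neutral"]).mean()
    res = ttest_rel(df[role], df["neutral"])  # paired: same articles under both roles
    print(f"{role}: mean={df[role].mean():.2f}, "
          f"delta vs neutral={delta:+.2f}, p={res.pvalue:.4g}")
```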
The authors attribute the observed bias to the way role instructions shape the model’s internal token weighting. By explicitly labeling a perspective as “favorable” or “unfavorable,” the prompt steers the model to prioritize language associated with the designated paradigm (e.g., “structural cohesion,” “conflict dynamics”) and to down‑weight terminology linked to the opposing paradigm. This demonstrates that LLMs are highly sensitive to contextual cues supplied in the prompt, beyond the knowledge encoded during pre‑training.
Limitations are acknowledged. First, the study is confined to a single discipline; other fields with different epistemic structures may exhibit distinct bias profiles. Second, only one LLM (ChatGPT‑3.5) was examined; newer models such as GPT‑4, Claude, or open‑source alternatives could behave differently. Third, the evaluation criteria themselves are derived from human expert judgments, so the scores cannot be treated as an absolute ground truth. Consequently, any deployment of LLM‑based assessment in real‑world research evaluation must retain a human oversight component.
Practical implications are clear. Institutions that consider integrating LLMs into research assessment pipelines should design prompts that are explicitly paradigm‑neutral. Strategies include (a) omitting any role‑based language and presenting only objective criteria, (b) employing ensemble methods that average across multiple model instances to dilute individual biases, and (c) using LLM‑generated scores as supplementary evidence rather than definitive verdicts, with final decisions made by expert panels.
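A minimal sketch of strategies (a) and (b), assuming the current OpenAI Python client and an arbitrary chat model, might look like this:

```python
# Minimal sketch of strategies (a) and (b): a paradigm-neutral rubric scored
# by several independent calls whose ratings are averaged. The model name,
# prompt wording, and helper names are illustrative assumptions, not the
# paper's protocol.
from statistics import mean
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NEUTRAL_RUBRIC = (
    "Rate the following article from 1 (poor) to 5 (excellent) on each of: "
    "research question clarity, theoretical contribution, methodological "
    "rigor, result validity, and overall scholarly impact. Do not favour any "
    "theoretical school. Reply with five integers separated by spaces."
)

def score_once(article_text: str) -> float:
    """One neutral evaluation, averaged over the five criteria."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed; substitute whichever model is assessed
        messages=[
            {"role": "system", "content": NEUTRAL_RUBRIC},
            {"role": "user", "content": article_text},
        ],
        temperature=1.0,  # keep sampling variation so repeat calls differ
    )
    return mean(int(s) for s in response.choices[0].message.content.split())

def ensemble_score(article_text: str, n: int = 5) -> float:
    """Average n independent evaluations to dilute single-run bias (strategy b)."""
    return mean(score_once(article_text) for _ in range(n))
```

Per strategy (c), scores produced this way would serve as supplementary evidence for an expert panel rather than determine outcomes directly.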
In sum, the paper provides the first empirical demonstration that LLMs can be coaxed into showing partiality for or against academic paradigms. It highlights the necessity of careful prompt engineering and validation when leveraging AI for scholarly evaluation, warning that careless deployment could unintentionally privilege one theoretical tradition over another and thereby reshape the intellectual landscape of a field.