On the Concept of Violence: A Comparative Study of Human and AI Judgments
Background: What counts as violence is neither self-evident nor universally agreed upon. While physical aggression is prototypical, contemporary societies increasingly debate whether exclusion, humiliation, online harassment, or symbolic acts should be classified within the same moral category. At the same time, Large Language Models (LLMs) are being consulted in everyday contexts to interpret and label complex social behaviors. Whether these systems reproduce, reshape, or simplify human conceptions of violence remains an open question. Methods: Here we present a systematic comparison between human judgments and LLM classifications across 22 scenarios deliberately designed to be morally divisive, spanning physically and verbally aggressive behavior, relational dynamics, marginalization, symbolic acts, and verbal expressions. Human responses were compared with outputs from multiple instruction-tuned models of varying sizes and architectures. We conducted global, sentence-level, and thematic-domain analyses, and examined variability across models to assess patterns of convergence and divergence. Findings: This study treats violence as a strategically chosen proxy through which broader dynamics of belief formation can be observed: violence is not the object of inquiry in itself, but a lens for examining how LLMs operationalize ambiguous moral constructs, negotiate conceptual boundaries, and transform plural human interpretations into singular outputs. More broadly, the findings contribute to ongoing debates about the epistemic role of conversational AI in shaping everyday interpretations of harm, responsibility, and social norms, highlighting the importance of transparency and critical engagement as these systems increasingly mediate public reasoning.
💡 Research Summary
The paper investigates how contemporary large language models (LLMs) classify statements that may or may not be considered violent, comparing their judgments to those of a large sample of human respondents. The authors designed a set of 22 deliberately provocative sentences covering four thematic domains: verbal and linguistic expressions, symbolic acts, interpersonal/relational dynamics, and omission/exclusion/indifference. Each sentence was presented to the public via a radio broadcast and social‑media channels, and participants were asked to label it as “violence,” “non‑violence,” or “depend‑on” (i.e., context‑sensitive). Over 3,000 respondents provided categorical answers for each sentence, yielding a total of 73,335 human judgments.
In parallel, the authors selected 18 publicly available LLMs that differ in architecture (LLaMA‑derived, Mistral, Qwen, Phi‑3, Nous, etc.), parameter count (≈1 B to >10 B), training data, instruction‑tuning, reinforcement learning from human feedback, and safety alignment mechanisms. Using a uniform prompt template that forced a single JSON output containing only the “category” field, each model was asked to classify the same 22 sentences and to provide a confidence score. Two models (phi3:mini and gemma3:4b) failed to produce any label and were excluded from the statistical analysis.
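The paper does not reproduce its querying code; the sketch below is only an illustration of such a protocol, assuming an Ollama-style local endpoint (`http://localhost:11434/api/generate`), hypothetical model tags, and a placeholder sentence rather than the paper's actual items or prompt wording.

```python
import json
import requests

# Hypothetical subset of model tags (Ollama-style names); the paper evaluated 18 models.
MODELS = ["llama3:8b", "mistral:7b", "qwen2:7b", "phi3:mini"]

# Illustrative scenario only; the paper's 22 sentences are not reproduced here.
SENTENCES = ["A colleague is deliberately excluded from every team meeting."]

PROMPT = (
    "Classify the following situation as exactly one of: "
    "'violence', 'non-violence', or 'depend-on'. "
    "Answer with a single JSON object of the form "
    '{"category": "...", "confidence": 0.0}.\n\n'
    "Situation: {sentence}"
)

def classify(model: str, sentence: str) -> dict:
    """Query one local model and parse its single-JSON answer."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": PROMPT.replace("{sentence}", sentence),
            "format": "json",   # ask the server to constrain output to valid JSON
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    # Models that fail to emit valid JSON would raise here; the paper excluded two such models.
    return json.loads(resp.json()["response"])

if __name__ == "__main__":
    for model in MODELS:
        for sentence in SENTENCES:
            print(model, classify(model, sentence))
```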
The authors performed global and sentence‑level comparisons using chi‑square tests of independence on 2 × 3 contingency tables, applying Benjamini–Hochberg false discovery rate correction for multiple testing. Overall, human responses labeled 72.3 % of the cases as violence, 13.8 % as non‑violence, and 13.9 % as depend‑on. The LLM ensemble labeled 71.9 % as violence, 18.8 % as non‑violence, and 9.4 % as depend‑on. The overall distribution differed significantly (χ²(2)=11.35, p=0.0034), driven mainly by a shift from “depend‑on” in humans toward “non‑violence” in the models.
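As a rough illustration of this statistical pipeline (not the authors' own code), the sketch below runs a chi-square test of independence on one hypothetical 2 × 3 human-vs-model contingency table per sentence and then applies Benjamini–Hochberg correction across sentences; the counts are invented for demonstration.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# Hypothetical counts: rows = (humans, models), columns = (violence, non-violence, depend-on).
# One 2x3 table per sentence; values are illustrative, not the paper's data.
tables = {
    "sentence_10": np.array([[3050, 120, 163], [8, 7, 1]]),
    "sentence_20": np.array([[900, 1800, 633], [13, 2, 1]]),
}

pvals, stats = [], {}
for name, table in tables.items():
    chi2, p, dof, _ = chi2_contingency(table)
    stats[name] = (chi2, dof)
    pvals.append(p)

# Benjamini-Hochberg false discovery rate correction across all sentence-level tests.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for (name, (chi2, dof)), p, sig in zip(stats.items(), p_adj, reject):
    print(f"{name}: chi2({dof})={chi2:.2f}, FDR-adjusted p={p:.4f}, significant={sig}")
```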
At the sentence level, nine of the 22 items showed significant human‑model distribution differences after FDR correction. The most pronounced divergences occurred in verbal expression items: for private and public online insults (sentences 10, 11) and coordinated mass‑insult campaigns (sentence 13), models classified only about half the cases as violence, whereas humans classified more than 90 % as violent. Conversely, for a scenario in which a TV host interrupts a speaker advocating physical elimination (sentence 20), models labeled 81 % of cases as violence compared with only 27 % of humans. In relational dynamics, persistent staring on a bus (sentence 7) was more often deemed non‑violent by the models (50 % vs 17 % of humans), while non‑consensual touching (sentence 8) was labeled violence by both groups, though slightly less often by the models (87.5 % vs 97.9 %).
Inter‑model agreement was quantified with Fleiss’ κ; sentences with significant human‑model disagreement also exhibited lower AI consensus (mean κ ≈ 0.64 vs 0.80 for non‑significant items). Correlations between model size and alignment with human majority labels were weakly positive, but Kruskal‑Wallis tests revealed significant differences across model families, indicating that architecture and alignment strategy matter more than sheer parameter count.
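A minimal sketch of how these agreement and family-level comparisons could be computed, assuming model labels arranged as one row per sentence and one column per model, with illustrative values rather than the paper's data; it uses the Fleiss' κ implementation in `statsmodels` and the Kruskal–Wallis test in `scipy`.

```python
import numpy as np
from scipy.stats import kruskal
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Illustrative label matrix: rows = sentences, columns = models (the "raters").
# 0 = violence, 1 = non-violence, 2 = depend-on; values are made up.
labels = np.array([
    [0, 0, 0, 1, 0, 0],   # high-consensus sentence
    [0, 1, 2, 1, 0, 2],   # low-consensus sentence
    [0, 0, 1, 0, 0, 0],
])

# aggregate_raters converts the raters-as-columns matrix into per-category counts per sentence.
counts, _ = aggregate_raters(labels)
print(f"Fleiss' kappa across models: {fleiss_kappa(counts):.2f}")

# Hypothetical per-model alignment-with-human-majority scores, grouped by model family.
alignment_by_family = {
    "llama": [0.71, 0.68, 0.74],
    "mistral": [0.62, 0.60],
    "qwen": [0.77, 0.80, 0.79],
}
h_stat, p_val = kruskal(*alignment_by_family.values())
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_val:.4f}")
```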
The authors interpret these findings as evidence that LLMs tend to simplify or re‑shape the pluralistic moral landscape humans exhibit. Safety‑alignment filters appear to push models toward “non‑violence” labels in many ambiguous cases, while in other contexts (e.g., censored speech) the same filters may cause over‑classification as violent. The variability among models underscores that AI systems are not neutral arbiters of social norms; rather, they inherit biases from training data, instruction‑tuning, and policy layers.
In the broader discussion, the paper argues that as conversational AI increasingly mediates ethical judgments in everyday life—moderating content, advising users, or even acting as a “conscience”—understanding how these systems operationalize ambiguous constructs like violence is crucial. Transparency, explainability, and continuous human oversight are recommended to prevent the inadvertent reinforcement of narrow or distorted normative positions. The study contributes empirical evidence to ongoing debates about the epistemic role of AI in shaping public reasoning about harm, responsibility, and social norms.