Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment


Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payments-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM reasoning in Merchant Category Code (MCC)-based merchant risk assessment, combining a five-criterion rubric with Monte-Carlo scoring to evaluate rationale quality and evaluator stability. Five frontier LLMs generate and cross-evaluate MCC risk rationales under attributed and anonymized conditions. To establish a judge-independent reference, we introduce a consensus-deviation metric that eliminates circularity by comparing each judge’s score to the mean of all other judges, yielding a theoretically grounded measure of self-evaluation and cross-model deviation. Results reveal substantial heterogeneity: GPT-5.1 and Claude 4.5 Sonnet show negative self-evaluation bias (-0.33, -0.31), while Gemini-2.5 Pro and Grok 4 display positive bias (+0.77, +0.71), with bias attenuating by 25.8 percent under anonymization. Evaluation by 26 payment-industry experts shows LLM judges assign scores averaging +0.46 points above human consensus, and that the negative bias of GPT-5.1 and Claude 4.5 Sonnet reflects closer alignment with human judgment. Ground-truth validation using payment-network data shows four models exhibit statistically significant alignment (Spearman rho = 0.56 to 0.77), confirming that the framework captures genuine quality. Overall, the framework provides a replicable basis for evaluating LLM-as-a-judge systems in payment-risk workflows and highlights the need for bias-aware protocols in operational financial settings.


💡 Research Summary

The paper tackles the emerging “LLM‑as‑a‑judge” paradigm in a high‑stakes, domain‑specific setting: merchant risk assessment based on Merchant Category Codes (MCCs). The authors design a structured multi‑evaluator framework that simultaneously treats large language models (LLMs) as both generators of risk rationales and as evaluators of those rationales. Five frontier LLMs—GPT‑5.1, Gemini‑2.5 Pro, Grok 4, Claude 4.5 Sonnet, and Perplexity Sonar—are used.

Prompt and Rationale Generation
A carefully crafted prompt asks each model to produce a five‑level risk hierarchy (Very Low to Very High) and, for each level, to discuss five mandatory dimensions: Business Model Stability, Regulatory Exposure, Fraud Exposure, Return/Refund Patterns, and Chargeback Activity. The prompt also forces the selection of three representative MCCs per level and prohibits numerical metrics or industry jargon. The output must be a JSON array containing a risk‑level definition and a narrative rationale. All models receive only a public MCC‑to‑Name mapping, ensuring that generated rationales rely on pre‑training knowledge rather than proprietary transaction data.
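The structural constraints above can be checked mechanically. Below is a minimal Python sketch of a validator for the required JSON output; the field names (`risk_level`, `definition`, `mccs`, `rationale`) are illustrative assumptions, since the summary does not reproduce the paper's exact schema.

```python
import json

# Assumed schema: five risk levels, each with a definition, exactly three
# representative MCCs, and a narrative rationale covering all five
# mandatory dimensions. Field names are hypothetical.
RISK_LEVELS = ["Very Low", "Low", "Medium", "High", "Very High"]
DIMENSIONS = [
    "Business Model Stability", "Regulatory Exposure", "Fraud Exposure",
    "Return/Refund Patterns", "Chargeback Activity",
]

def validate_rationale_output(payload: str) -> bool:
    """Check a model's raw JSON output against the prompt's structural rules."""
    levels = json.loads(payload)
    # The array must cover all five levels, in order
    if [lv["risk_level"] for lv in levels] != RISK_LEVELS:
        return False
    for lv in levels:
        if len(lv["mccs"]) != 3:  # exactly three representative MCCs
            return False
        if not lv["definition"] or not lv["rationale"]:
            return False
        # Each narrative must touch all five mandatory dimensions
        if not all(dim in lv["rationale"] for dim in DIMENSIONS):
            return False
    return True
```

A check like this keeps the cross-evaluation stage comparable: every judge scores rationales with the same guaranteed structure.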

Evaluation Framework
Each LLM is then cast as a “Global Payments‑Risk Domain Expert” and asked to score every other model’s rationales—including its own—using a five‑criterion rubric (Accuracy, Rationale Quality, Consistency, Completeness, Practical Applicability) on a 0‑10 scale. To capture stochastic variability, the authors employ a Monte Carlo protocol: for each judge‑target pair, ten independent scoring runs are performed at temperature 0.7, yielding a mean (µ) and standard deviation (σ) per judge‑target pair. This quantifies both scoring tendency and stability.
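The protocol reduces to repeated sampling plus summary statistics. A minimal sketch, where `judge_fn` stands in for a hypothetical wrapper around one temperature‑0.7 LLM scoring call (not the authors' code):

```python
import statistics

RUBRIC = ["Accuracy", "Rationale Quality", "Consistency",
          "Completeness", "Practical Applicability"]

def monte_carlo_score(judge_fn, rationale, runs=10):
    """Repeat scoring `runs` times and summarize as (mean, std).

    `judge_fn(rationale)` is assumed to return one run's scores as a
    dict mapping each rubric criterion to a 0-10 value.
    """
    samples = [judge_fn(rationale) for _ in range(runs)]
    # Overall score for one run = mean across the five rubric criteria
    totals = [statistics.mean(s[c] for c in RUBRIC) for s in samples]
    return statistics.mean(totals), statistics.stdev(totals)
```

Here µ captures the judge's scoring tendency for that target, while σ captures run-to-run instability, which the paper treats as an operational property of each evaluator.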

Consensus‑Deviation Metric
A central methodological contribution is the consensus‑deviation metric. For any judge, its score on a given rationale is compared to the mean of all other judges’ scores on the same rationale. The difference isolates the judge’s self‑evaluation bias while avoiding circularity (the judge’s own score never influences its reference). Positive values indicate self‑promotion; negative values indicate self‑critique.
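Concretely, the metric is a leave-one-out comparison. A small illustrative implementation:

```python
def consensus_deviation(scores: dict[str, float], judge: str) -> float:
    """Judge's score minus the mean of all *other* judges' scores.

    Excluding the judge's own score from the reference is what removes
    circularity. Positive result = self-promotion; negative = self-critique.
    `scores` maps judge name -> that judge's score on one rationale.
    """
    others = [s for name, s in scores.items() if name != judge]
    return scores[judge] - sum(others) / len(others)
```

For example, if judge A scores its own rationale 8.0 while peers B and C give 7.0 and 6.0, A's deviation is 8.0 - 6.5 = +1.5, a self-promotion signal.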

Key Findings

  • Self‑Evaluation Bias: GPT‑5.1 and Claude 4.5 Sonnet exhibit negative self‑evaluation bias (‑0.33 and ‑0.31 points respectively), meaning they score their own outputs lower than the peer consensus. Gemini‑2.5 Pro and Grok 4 show strong positive bias (+0.77 and +0.71). Perplexity Sonar has modest positive bias (+0.21).
  • Effect of Anonymization: When the source model is concealed, bias magnitude drops by an average of 25.8% but the direction (positive or negative) remains unchanged, suggesting intrinsic model tendencies rather than mere author‑recognition effects.
  • Human Expert Benchmark: Twenty‑six payment‑industry experts independently scored the same rationales using the identical rubric. LLM judges on average gave scores +0.46 points higher than the human consensus. Notably, the models with negative self‑bias (GPT‑5.1, Claude 4.5 Sonnet) aligned most closely with human judgments.
  • Ground‑Truth Validation: Using four years of payment‑network transaction data, the authors compute Spearman correlations between model‑assigned risk scores and empirical merchant‑risk patterns. Four models (Claude 4.5 Sonnet, Gemini 2.5 Pro, Grok 4, GPT‑5.1) achieve statistically significant correlations ranging from 0.56 to 0.77, confirming that the evaluation framework captures genuine quality rather than shared artefacts.
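The Spearman check in the last bullet depends only on rank order, not on the scores' scale. In practice `scipy.stats.spearmanr` would be used; the dependency-free sketch below (with average ranks for ties) shows what is being computed:

```python
def _ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

A rho of 0.56 to 0.77 therefore means the models' risk orderings largely agree with the empirical ordering derived from four years of transaction data, even if absolute score values differ.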

Contributions

  1. First domain‑aligned, replicable evaluation of LLM reasoning in MCC‑based merchant risk assessment.
  2. Monte Carlo protocol that quantifies evaluator stability (µ ± σ).
  3. Consensus‑deviation metric that rigorously measures both positive and negative self‑evaluation bias, eliminating circularity.
  4. Empirical evidence of negative self‑evaluation bias in frontier LLMs, challenging prior assumptions of universal self‑preference.
  5. Demonstration that bias direction persists under anonymization, with magnitude reduction but not reversal.
  6. Triangulated validation through human experts and real‑world transaction data, showing that conservative (negative‑bias) scoring aligns better with human and empirical risk.

Implications and Limitations
The study highlights that deploying LLMs as judges in financial risk pipelines requires bias‑aware protocols: model‑specific self‑bias must be measured, and anonymization can mitigate but not eliminate it. Stability metrics (σ) reveal that some models produce more consistent scores than others, an operational consideration for automated risk pipelines. Limitations include reliance on publicly available MCC mappings (no proprietary transaction features), a fixed Monte Carlo sample size (10 runs), and evaluation confined to a single domain (merchant risk). Future work could extend the framework to other high‑risk domains (credit scoring, AML), explore larger sample sizes, and investigate mitigation strategies such as prompt engineering, ensemble judging, or post‑hoc calibration.

Conclusion
By integrating a structured rubric, Monte Carlo stability analysis, and a novel consensus‑deviation metric, the paper provides a robust, reproducible methodology for assessing LLM‑as‑a‑judge systems in payment‑risk workflows. The findings underscore substantial heterogeneity in evaluator behavior, the existence of both positive and negative self‑evaluation bias, and the importance of bias‑aware deployment strategies in operational finance.

