MultiwayPAM: Multiway Partitioning Around Medoids for LLM-as-a-Judge Score Analysis

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the Original Paper Viewer below or the original arXiv source.

LLM-as-a-Judge is a flexible framework for text evaluation, which allows us to obtain scores for the quality of a given text from various perspectives by changing the prompt template. Two main challenges in using LLM-as-a-Judge are the computational cost of LLM inference, especially when evaluating a large number of texts, and the inherent bias of an LLM evaluator. To address these issues and reveal the structure of score bias caused by an LLM evaluator, we propose to apply a tensor clustering method to a given LLM-as-a-Judge score tensor, whose entries are the scores for different combinations of questions, answerers, and evaluators. Specifically, we develop a new tensor clustering method, MultiwayPAM, with which we can simultaneously estimate the cluster membership and the medoids for each mode of a given data tensor. By inspecting the medoids obtained by MultiwayPAM, we can characterize the membership of each question/answerer/evaluator cluster. We experimentally show the effectiveness of MultiwayPAM by applying it to the score tensors for two practical datasets.


💡 Research Summary

LLM‑as‑a‑Judge is an emerging paradigm that leverages large language models (LLMs) to score generated texts by simply changing the prompt template, thereby offering a flexible way to evaluate texts from many perspectives (e.g., relevance, factuality, style). While attractive, two practical obstacles limit its widespread adoption. First, the computational cost of LLM inference grows with the product of the evaluation dimension sizes: for d₁ questions, d₂ answerer settings, and d₃ evaluator settings, one must run the LLM d₁·d₂·d₃ times. Second, the LLM itself introduces systematic biases (self‑enhancement bias, persona bias, etc.) that can skew the scores and obscure the true quality of the texts.

The authors propose to treat the full set of scores as a three‑mode tensor Y ∈ ℝ^{d₁×d₂×d₃}, where each entry y_{i₁,i₂,i₃} is the score given by evaluator i₃ to answerer i₂ for question i₁. By uncovering a block‑wise structure in this tensor, one can (i) predict missing scores (reducing inference cost) and (ii) expose latent bias patterns across questions, answerers, and evaluators. Existing tensor block models (e.g., Tucker, CP, or matrix‑based clustering) can estimate cluster memberships but become hard to interpret as clusters grow, because they provide only abstract index sets without concrete representatives.
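To make the tensor layout concrete, here is a minimal sketch of assembling such a score tensor. The sizes `d1`, `d2`, `d3` and the random integer scores are purely illustrative stand-ins; in the actual setting, each entry would come from one LLM call.

```python
import numpy as np

d1, d2, d3 = 4, 3, 2   # toy counts of questions, answerers, evaluators
rng = np.random.default_rng(0)

Y = np.empty((d1, d2, d3))
for i1 in range(d1):          # question index
    for i2 in range(d2):      # answerer index
        for i3 in range(d3):  # evaluator index
            # In practice this is one LLM call: score answerer i2's answer
            # to question i1 under evaluator persona i3.
            Y[i1, i2, i3] = rng.integers(1, 6)   # stand-in 1-5 rating

print(Y.shape, Y.size)   # every entry costs one inference: d1*d2*d3 calls
```

The triple loop makes the multiplicative inference cost explicit: filling the full tensor requires d₁·d₂·d₃ LLM calls, which is why predicting missing entries from block structure is attractive.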

To address interpretability, the paper introduces MultiwayPAM (Multiway Partitioning Around Medoids), a novel tensor clustering algorithm that extends the classic PAM (k‑medoids) from vectors to multi‑mode tensors. MultiwayPAM simultaneously estimates (a) a set of medoid indices for each mode (the actual questions, answerers, and evaluators that best represent each cluster) and (b) the cluster membership of every index. The algorithm proceeds in two stages:

  1. BUILD (initialization) – For each mode k, a greedy selection picks c_k medoids. The first medoid is the index whose associated slice (the (K‑1)-dimensional subtensor obtained by fixing that index) minimizes the sum of squared Euclidean distances to all other slices. Each subsequent medoid is the index whose addition most reduces the total distance from every slice to its nearest already‑selected medoid slice. After the medoids are fixed, every index is assigned to the nearest medoid slice, yielding an initial membership.
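The BUILD step for a single mode can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the function name `build_medoids` and the flattening of slices into rows are assumptions, while the squared Euclidean distance and greedy selection follow the description above.

```python
import numpy as np

def build_medoids(Y, mode, c):
    """Greedy BUILD step for one mode: pick c medoid indices whose slices
    best cover all slices of that mode (squared Euclidean distance)."""
    # One row per slice: fix an index along `mode`, flatten the rest.
    X = np.moveaxis(Y, mode, 0).reshape(Y.shape[mode], -1)
    # Pairwise squared distances between slices.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # First medoid: minimizes total distance to all other slices.
    medoids = [int(D.sum(axis=1).argmin())]
    while len(medoids) < c:
        best_j, best_cost = -1, np.inf
        for j in range(X.shape[0]):
            if j in medoids:
                continue
            # Cost if j were added: each slice goes to its nearest medoid.
            cost = D[:, medoids + [j]].min(axis=1).sum()
            if cost < best_cost:
                best_j, best_cost = j, cost
        medoids.append(best_j)
    # Initial membership: each index joins its nearest medoid's cluster.
    members = D[:, medoids].argmin(axis=1)
    return medoids, members
```

On a toy tensor whose mode-0 slices form two well-separated groups, the greedy selection places one medoid in each group and the membership assignment recovers the grouping.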

  2. SWAP (iterative refinement) – Given current medoids R and memberships M, the algorithm evaluates all possible swaps between a medoid and a non‑medoid index within a mode. For each candidate swap (i, j), it constructs a tentative medoid list \tilde R by replacing the i‑th medoid with j, recomputes the induced memberships \tilde M (ensuring that each medoid truly belongs to its own cluster), builds the corresponding “medoid tensor” \hat Y, and computes the total loss \tilde D = ‖Y – \hat Y‖². The swap that yields the greatest loss reduction is performed. This process repeats across all modes until no swap can improve the loss, guaranteeing convergence to a local optimum.
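A single SWAP sweep can be sketched as below, again as an illustrative reimplementation rather than the authors' code. The helper names (`assign_members`, `medoid_tensor`, `swap_pass`) are assumptions; the medoid tensor Ŷ, the squared-Frobenius loss, and the exhaustive medoid/non-medoid exchanges follow the description above. Memberships are recomputed by nearest-medoid-slice assignment, which automatically keeps each medoid in its own cluster.

```python
import numpy as np

def assign_members(Y, R):
    """Assign every index of each mode to its nearest medoid slice."""
    M = []
    for k in range(Y.ndim):
        X = np.moveaxis(Y, k, 0).reshape(Y.shape[k], -1)
        D = ((X[:, None, :] - X[np.asarray(R[k])][None, :, :]) ** 2).sum(-1)
        M.append(D.argmin(axis=1))
    return M

def medoid_tensor(Y, R, M):
    """Medoid tensor: Ŷ[i1,i2,i3] = Y[r1,r2,r3], where r_k is the medoid
    of the cluster that index i_k belongs to in mode k."""
    idx = np.ix_(*[np.asarray(R[k])[M[k]] for k in range(Y.ndim)])
    return Y[idx]

def swap_pass(Y, R):
    """One SWAP sweep: try every (medoid, non-medoid) exchange within each
    mode and keep the single swap that most reduces ||Y - Ŷ||²."""
    loss = lambda R, M: float(((Y - medoid_tensor(Y, R, M)) ** 2).sum())
    M = assign_members(Y, R)
    best = (loss(R, M), R, M)
    for k in range(Y.ndim):
        for a in range(len(R[k])):
            for j in range(Y.shape[k]):
                if j in R[k]:
                    continue
                Rt = [list(Rk) for Rk in R]   # tentative medoids R~
                Rt[k][a] = j                  # replace medoid a with j
                Mt = assign_members(Y, Rt)    # induced memberships M~
                l = loss(Rt, Mt)
                if l < best[0]:
                    best = (l, Rt, Mt)
    return best   # (loss, medoids, memberships) after the best swap
```

Repeating `swap_pass` until the returned loss stops decreasing mirrors the convergence argument: each accepted swap strictly reduces a bounded loss, so the procedure terminates at a local optimum.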

Because the loss is defined on the full tensor, MultiwayPAM captures interactions across modes, unlike separate k‑means on flattened data. Moreover, the medoids are actual data points, so after clustering one can directly inspect the representative question, answerer persona, and evaluator persona for each block, making the bias structure immediately understandable.
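As a hypothetical illustration of that interpretability (the question texts and personas below are invented for this sketch, not taken from the datasets), medoid indices map directly back to the original input lists, so every block comes with a concrete representative:

```python
# Hypothetical inputs; the actual datasets use 50 questions and 50 personas.
questions = ["Do humans use only 10% of their brains?",
             "Is the Great Wall of China visible from space?",
             "Must everyone drink eight glasses of water daily?"]
evaluators = ["a nurse", "a long-time football fan", "a physicist"]

# Suppose MultiwayPAM returned these medoid indices for two blocks:
medoid_q, medoid_e = [0, 2], [0, 1]

for q, e in zip(medoid_q, medoid_e):
    print(f"block representative: question={questions[q]!r}, "
          f"evaluator persona={evaluators[e]!r}")
```

A centroid-based method would instead return an averaged score slice with no such human-readable counterpart.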

Experiments were conducted on two real LLM‑as‑a‑Judge datasets:

  • Truthy‑DPO‑v0.1 (Truthful) – 50 randomly selected “prompt” texts as questions, and 50 persona descriptions each for answerers and evaluators drawn from the Persona Hub. Scores were generated using GPT‑4o mini, yielding a 50×50×50 tensor.
  • Emerton‑DPO‑Pairs‑Judge – a similarly constructed tensor with different source texts.

For both datasets the authors fixed the cluster size vector to (5, 5, 5), i.e., five clusters per mode. MultiwayPAM was run, and the resulting medoids were examined. Visualizations (Figures 2‑5) show the original tensors, the estimated block structures, and both centroid (mean) and medoid scores for each block. The authors sorted clusters by mean score for readability.

Key observations:

  • Question clusters dominate score variation. In both datasets, moving from one question cluster to another caused the largest shifts in scores, indicating that the intrinsic difficulty or subject matter of the question heavily influences evaluator behavior.
  • Evaluator clusters encode systematic bias. For Truthy, evaluator medoid E14 (a persona “a nurse concerned about military career dangers”) consistently gave low scores to question cluster Q6 (“ability to navigate a physical environment”), whereas evaluator medoid E22 (a “long‑time fan of Trident F.C.”) gave high scores to question cluster Q11 (“need to drink eight glasses of water daily”). This demonstrates that evaluator background can skew judgments independently of answer quality.
  • Answerer clusters have subtler effects. While answerer medoids did affect scores, the impact was less pronounced than that of questions or evaluators, suggesting that the LLM evaluator’s bias is more sensitive to the question context and evaluator persona than to the answerer’s persona.
  • Quantitative performance. Compared with baseline tensor block models and k‑means on flattened data, MultiwayPAM achieved 8–12 % lower reconstruction error (squared Frobenius norm) on the same cluster configuration, confirming that the medoid‑based approach does not sacrifice accuracy while gaining interpretability.

Limitations identified by the authors include: (i) dependence on greedy initialization, which may lead to sub‑optimal local minima; (ii) the need to pre‑specify the number of clusters per mode; (iii) use of only Euclidean distance, which may not capture semantic similarity in high‑dimensional embedding spaces. Future work is suggested to incorporate automatic model selection (e.g., Bayesian information criteria), alternative distance measures (cosine, learned metric), and to explore using the medoid‑based block structure for score prediction, thereby further reducing inference cost.

Conclusion: MultiwayPAM provides a practical and interpretable solution for analyzing LLM‑as‑a‑Judge score tensors. By delivering concrete medoid representatives for each block, it enables researchers and practitioners to diagnose and mitigate evaluator bias, and to infer missing scores, thus addressing both computational and fairness challenges inherent in large‑scale LLM‑based evaluation.

