Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis
Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However, this process can obscure the quality and distributional characteristics of the original responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This enables a more nuanced assessment of response quality beyond exact match metrics. To analyze the response distributions across different agent configurations, we further introduce a clustering method for ECDFs using their distances and the $k$-medoids algorithm. Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structures in the responses, offering insights into the impact of temperature, persona, and question topics.
💡 Research Summary
The paper addresses a fundamental shortcoming in the evaluation of large language model (LLM) agents that are increasingly deployed for complex tasks such as question answering, scientific debate, and software development. Traditional evaluation pipelines aggregate multiple model outputs into a single final answer—often by majority voting or by selecting the most frequent response—and then compare that answer against a reference using binary correctness or exact‑match metrics. While convenient, this approach discards the rich distributional information contained in the set of raw responses, making it impossible to see whether an agent consistently produces high‑quality answers, whether it occasionally generates very good answers that compensate for many mediocre ones, or how different hyper‑parameters (temperature, persona, prompt style) shape the response landscape.
To remedy this, the authors propose an evaluation framework built on the empirical cumulative distribution function (ECDF) of cosine similarities between each generated response and the reference answer. Cosine similarity is computed in a shared embedding space (e.g., sentence‑BERT) so that semantic closeness, not just token‑level overlap, is captured. For a given question, the collection of similarity scores is sorted and the ECDF is plotted, showing the proportion of responses whose similarity falls at or below each threshold. An ECDF that stays low and then rises steeply near 1.0 indicates that most responses are highly aligned with the reference, whereas a curve that saturates at low similarity values signals a preponderance of low‑quality outputs.
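The per-question procedure can be sketched as follows. This is a minimal illustration, not the paper's code: the random vectors stand in for sentence-BERT embeddings, and the helper names (`cosine_similarity`, `ecdf`) are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ecdf(values):
    """Return (x, F(x)) pairs for the empirical CDF of `values`."""
    x = np.sort(np.asarray(values, dtype=float))
    f = np.arange(1, len(x) + 1) / len(x)  # fraction of scores <= x[i]
    return x, f

rng = np.random.default_rng(0)
reference = rng.normal(size=384)        # stand-in for the reference-answer embedding
responses = rng.normal(size=(10, 384))  # stand-ins for 10 response embeddings

sims = [cosine_similarity(r, reference) for r in responses]
xs, fs = ecdf(sims)
# fs[-1] is always 1.0: every score is <= the largest score
```

Plotting `fs` against `xs` as a step function gives the per-question ECDF described above.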
The next methodological contribution is a way to compare ECDFs across different agent configurations. The authors define a distance metric between two ECDFs—primarily the L1 distance (the integral of absolute differences across the similarity axis)—and note that more sophisticated measures such as the Wasserstein (Earth Mover’s) distance could be substituted. With a pairwise distance matrix in hand, they apply the k‑medoids clustering algorithm. Unlike k‑means, k‑medoids forces each cluster center to be an actual ECDF from the dataset, which makes the resulting “medoid” directly interpretable as a representative response distribution for that cluster.
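The distance-plus-clustering pipeline can be sketched with numpy alone. The helper names, the grid-based L1 approximation, and the deterministic farthest-point initialisation are illustrative choices, not the paper's implementation. (For one-dimensional distributions, the L1 distance between CDFs in fact coincides with the 1-Wasserstein distance.)

```python
import numpy as np

def ecdf_on_grid(samples, grid):
    """Evaluate the ECDF of `samples` at each point of `grid`."""
    return np.searchsorted(np.sort(samples), grid, side="right") / len(samples)

def l1_distance(samples_a, samples_b, grid):
    """Rectangle-rule approximation of the integral of |F_a - F_b|."""
    dx = grid[1] - grid[0]
    return float(np.sum(np.abs(ecdf_on_grid(samples_a, grid)
                               - ecdf_on_grid(samples_b, grid))) * dx)

def k_medoids(D, k, n_iter=100):
    """k-medoids on a precomputed distance matrix D."""
    medoids = [int(np.argmin(D.sum(axis=1)))]   # start from the most central point
    while len(medoids) < k:                     # then add farthest points
        medoids.append(int(np.argmax(D[:, medoids].min(axis=1))))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # assign to nearest medoid
        new = medoids.copy()
        for c in range(k):                          # re-pick each cluster's medoid
            members = np.where(labels == c)[0]
            if members.size:
                new[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    return medoids, labels

# Toy demo: four low-similarity and four high-similarity score sets.
rng = np.random.default_rng(1)
sets = [np.clip(rng.normal(m, 0.05, 10), 0.0, 1.0)
        for m in (0.3, 0.3, 0.3, 0.3, 0.8, 0.8, 0.8, 0.8)]
grid = np.linspace(0.0, 1.0, 201)
D = np.array([[l1_distance(a, b, grid) for b in sets] for a in sets])
medoids, labels = k_medoids(D, k=2)   # recovers the low/high grouping
```

Because each returned medoid is an index into the original score sets, the representative ECDF for each cluster can be inspected directly, which is the interpretability advantage over k-means noted above.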
Empirical validation is performed on a publicly available QA benchmark. The experimental factors include three temperature settings (0.2, 0.7, 1.0), three persona prompts (expert, novice, neutral), and three question domains (science, history, programming). For each configuration, ten independent responses are generated per question, cosine similarities to the gold answer are computed, and ECDFs are derived. The authors report several key findings:
- Identical Accuracy, Divergent Distributions – Two configurations achieve the same overall accuracy (≈78 %) when evaluated by majority vote, yet their ECDFs differ dramatically. One similarity distribution concentrates around moderate values, while the other has a long right‑hand tail: a handful of very high‑similarity answers lift the overall accuracy. This demonstrates that final accuracy alone can mask underlying quality patterns.
- Clustering Reveals Meaningful Groups – k‑medoids consistently separates “expert‑low‑temperature” settings from “novice‑high‑temperature” settings, confirming that temperature and persona jointly influence the shape of the response distribution. The medoid ECDFs serve as archetypes that can be inspected to understand the typical behavior of each group.
- Domain‑Specific Effects – Programming questions produce ECDFs that are tightly clustered near high similarity values, reflecting the model’s relative strength in code‑related tasks. Historical questions, by contrast, generate broader ECDFs with more spread, suggesting higher variance in answer quality across topics.
- Practical Implications – By monitoring ECDFs during model tuning, developers can select hyper‑parameters that not only maximize accuracy but also minimize the proportion of low‑quality outputs. In safety‑critical deployments, ECDF‑based thresholds could trigger human review for any response falling below a chosen similarity percentile, thereby reducing the risk of erroneous or harmful outputs.
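The review-trigger idea can be sketched as a simple percentile gate. The function name, the 10th-percentile default, and the calibration scores below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def needs_review(similarity, calibration_scores, percentile=10.0):
    """Flag a response whose similarity to the reference falls below
    the chosen percentile of a calibration set of similarity scores."""
    threshold = np.percentile(calibration_scores, percentile)
    return bool(similarity < threshold)

# Hypothetical calibration scores gathered during tuning.
calibration = [0.55, 0.62, 0.70, 0.74, 0.78, 0.81, 0.85, 0.88, 0.90, 0.93]

needs_review(0.40, calibration)  # True: below the 10th percentile
needs_review(0.90, calibration)  # False: well inside the distribution
```

In a deployment, flagged responses would be routed to a human reviewer rather than returned directly.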
The paper concludes that ECDF‑based evaluation, coupled with distance‑based clustering, provides a nuanced “quality spectrum” that complements traditional binary metrics. It enables researchers and practitioners to diagnose the impact of temperature, persona, and domain on response quality, to identify configurations that yield consistently high‑quality answers, and to design more robust, transparent LLM‑driven systems. The authors suggest future work could explore alternative distance measures, hierarchical clustering of ECDFs, and integration of ECDF diagnostics into automated model selection pipelines.