Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However, this process can obscure the quality and distributional characteristics of the original responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This enables a more nuanced assessment of response quality beyond exact match metrics. To analyze the response distributions across different agent configurations, we further introduce a clustering method for ECDFs based on pairwise distances between them and the $k$-medoids algorithm. Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structures in the responses, offering insights into the impact of temperature, persona, and question topics.
Following the success of large language models (LLMs), various agent-based approaches [1] have been proposed to tackle tasks such as scientific debate [2,3] and software development [4-6]. In particular, this paper focuses on the task of question answering (QA), where each question generally has multiple correct answers. Such QA datasets range from multiple-choice questions that test a model's knowledge (e.g., the CommonsenseQA [7] and SWAG [8] datasets) to questions that require complex reasoning (e.g., the StrategyQA [9] and GSM8K [10] datasets), and they are commonly used to measure the task performance of LLM-based agent systems.
A typical evaluation pipeline for LLM-based agent systems involves generating multiple responses per question under a given configuration and then selecting a final answer through decision protocols such as majority voting [11-13]. In such cases, a basic method for comparing multiple settings is to evaluate the consistency between the final answers obtained under each setting and the correct ones. However, such an evaluation criterion alone cannot reveal the tendencies of the individual original responses generated by the LLM-based agents. Even if the final answers are the same in a pair of settings, the quality of the original responses may differ. For example, when applying majority voting to $2n+1$ responses, we cannot distinguish between the case where all $2n+1$ responses are correct and the case where only $n+1$ responses are correct. Furthermore, incorrect responses may differ in their “goodness.” For instance, for the question “What is the highest mountain in Japan?”, the answer “Yari-ga-take” can be considered closer to the correct answer (“Mt. Fuji”) than the answer “computer,” even though both are incorrect. However, simply measuring the percentage of exact matches with the correct answer does not allow us to distinguish the quality of these two answers.
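To make this limitation concrete, the following minimal sketch (the answers and helper names are illustrative, not taken from the paper) shows that majority voting returns the same final answer whether all $2n+1$ responses are correct or only a bare majority of $n+1$ is:

```python
# Minimal sketch: majority voting hides the underlying response quality.
from collections import Counter

def majority_vote(responses):
    """Return the most frequent answer among the agents' responses."""
    return Counter(responses).most_common(1)[0][0]

n = 3  # 2n + 1 = 7 responses in total
all_correct   = ["Mt. Fuji"] * (2 * n + 1)                     # every agent is right
bare_majority = ["Mt. Fuji"] * (n + 1) + ["Yari-ga-take"] * n  # only n + 1 are right

# Both settings produce the same final answer, so final-answer accuracy
# alone cannot distinguish their response distributions.
assert majority_vote(all_correct) == majority_vote(bare_majority) == "Mt. Fuji"
```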
To solve these problems and obtain more detailed information about the quality of responses given by LLM-based agents to each question, we propose to evaluate a given set of responses based on the empirical cumulative distribution function (ECDF) of their cosine similarities to the correct answers, as shown in Figure 1. Using ECDFs to evaluate the responses has two advantages. First, unlike histograms, for which a bin width must be chosen, constructing an ECDF requires no hyperparameters. Second, by representing sets of responses as ECDFs, we can directly compare response sets of different sizes.
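As a concrete illustration, the sketch below builds an ECDF from the cosine similarities between response embeddings and a reference-answer embedding. It assumes the texts have already been embedded as vectors; the toy random embeddings and helper names are our own, not the paper's:

```python
# Minimal sketch: ECDF of cosine similarities between responses and a reference.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ecdf(values: np.ndarray):
    """Return the ECDF of `values` as (sorted support points, cumulative probabilities)."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Toy stand-ins for embedded texts (in practice, use a sentence embedding model).
rng = np.random.default_rng(0)
reference = rng.normal(size=64)       # embedding of the correct answer
responses = rng.normal(size=(7, 64))  # embeddings of 7 agent responses

sims = np.array([cosine_similarity(r, reference) for r in responses])
x, y = ecdf(sims)  # no bin width to tune, unlike a histogram
```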
One concern with using ECDFs is that LLM-based agent systems admit many types of configurations, and it is difficult to grasp an overview of the large number of ECDFs corresponding to the different settings. We therefore propose a method that applies clustering to the ECDFs, which allows us to estimate the group structure of ECDFs that are similar to each other. Since ECDFs differ from typical vector-format samples (although the set of cosine similarities defining an ECDF can be represented as a vector), we propose a new clustering method for ECDFs, each of which corresponds to a set of responses generated by LLM-based agents under a given setting. Clustering ECDFs has also been proposed in an existing study [14] aimed at network traffic anomaly detection; however, that method differs from ours in that the ECDFs are first discretized and converted into vector-format samples and then clustered with the $k$-means algorithm.
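As a sketch of how such a pipeline could look, the code below clusters ECDFs with a plain PAM-style $k$-medoids on a precomputed distance matrix. The text above does not specify the distance between ECDFs, so we assume the sup-norm (Kolmogorov-Smirnov) distance evaluated on a shared grid as one plausible choice; all names and parameters are illustrative:

```python
# Minimal sketch: k-medoids clustering of ECDFs (assumed sup-norm distance).
import numpy as np

def ecdf_on_grid(samples: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Evaluate the ECDF of `samples` at each point of `grid`."""
    return np.searchsorted(np.sort(samples), grid, side="right") / len(samples)

def ks_distance(f: np.ndarray, g: np.ndarray) -> float:
    """Sup-norm distance between two ECDFs evaluated on the same grid."""
    return float(np.max(np.abs(f - g)))

def k_medoids(dist: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Plain PAM-style k-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)  # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):  # move each medoid to the member with minimal within-cluster cost
            members = np.where(labels == c)[0]
            if len(members) > 0:
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(dist[:, medoids], axis=1), medoids

# Toy data: each entry is one setting's set of cosine similarities
# (sets of different sizes are fine; they are equal here only for brevity).
rng = np.random.default_rng(1)
similarity_sets = [rng.beta(a, 2.0, size=30) for a in (1.0, 1.0, 5.0, 5.0)]
grid = np.linspace(0.0, 1.0, 200)
ecdfs = np.stack([ecdf_on_grid(s, grid) for s in similarity_sets])

dist = np.array([[ks_distance(f, g) for g in ecdfs] for f in ecdfs])
labels, medoids = k_medoids(dist, k=2)  # one label per setting's ECDF
```

Note that the sup-norm distance between two ECDFs can be computed exactly from the raw samples; the shared grid here is only for simplicity, and unlike the discretize-then-$k$-means approach of [14], $k$-medoids itself needs only the pairwise distances.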
The remainder of this paper is organized as follows. In Section 2, we review existing studies on LLM-based agent systems. In Section 3, we propose an ECDF clustering method for analyzing multiple sets of responses from LLM-based agents. In Section 4, we experimentally demonstrate the effectiveness of the proposed ECDF clustering by applying it to two practical QA datasets while varying two types of agent settings (i.e., persona and temperature). Finally, we conclude this paper in Section 5.
A typical approach to solving tasks, including QA, with LLM-based agents is to first have each agent generate an answer to a given question, and then combine all the answers to generate a final answer. In this section, we review the existing methods based on such an approach from the following three perspectives: how to generate multiple answers, how to convert multiple answers into a final answer, and how to evaluate the final answer.
First, the generation of multiple answers using LLM-based agents itself involves various settings, including the agents' base models, prompt templates, and inter-agent communication styles. Several studies have reported that discussion among LLM-based agents improves performance on reasoning tasks [2,15-18], while others have explored prompting methods that elicit the reasoning ability of models by specifying personas [19,20] or by describing the reasoning procedure [21]. The style of inter-agent communication also varies among these methods.