MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness
Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive “act-as-a-user” prompting often yields verbose, unrealistic utterances, motivating principled evaluation of user proxy agents. We present MirrorBench, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. MirrorBench combines three lexical-diversity metrics (MATTR, Yule’s K, and HD-D) with three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, MirrorBench yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a command-line interface for running and managing user-proxy benchmarking experiments.
💡 Research Summary
MirrorBench introduces a reproducible and extensible benchmark specifically designed to assess the human‑likeness of conversational user‑proxy agents, i.e., large language models (LLMs) prompted to act as users in dialogue systems. The authors argue that existing “act‑as‑a‑user” approaches often produce overly verbose or unrealistically cooperative utterances, which can bias downstream evaluation of assistants and cause distribution shift when such synthetic data are used for fine‑tuning. To address this, MirrorBench isolates the user side of the interaction and evaluates it on two complementary dimensions: lexical diversity and behavioral realism.
Lexical diversity is captured by three well‑established statistics: Moving‑Average Type‑Token Ratio (MATTR), Yule’s K, and Hypergeometric Distribution Diversity (HD‑D). Each metric is computed on the tokenized user turns generated by a proxy and then z‑scored against the empirical distribution of real human user turns from the same dataset, ensuring that scores are interpretable and comparable across domains.
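The pipeline above can be sketched in a few lines. The following is a minimal illustration, not MirrorBench's actual implementation: it assumes whitespace‑tokenized user turns, shows the standard formulas for MATTR and Yule's K (HD‑D is omitted for brevity), and applies the z‑scoring step against a set of human reference values.

```python
import statistics
from collections import Counter

def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio: mean type/token ratio over a
    sliding window, which reduces MATTR's sensitivity to text length."""
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

def yules_k(tokens):
    """Yule's K = 10^4 * (M2 - M1) / M1^2, where M1 is the token count
    and M2 is the sum of squared type frequencies (lower = more diverse)."""
    m1 = len(tokens)
    m2 = sum(f * f for f in Counter(tokens).values())
    return 10_000 * (m2 - m1) / (m1 * m1)

def z_score(proxy_value, human_values):
    """Standardize a proxy's metric against the distribution of the same
    metric computed on real human user turns from the same dataset."""
    mu = statistics.mean(human_values)
    sigma = statistics.stdev(human_values)
    return (proxy_value - mu) / sigma
```

For example, the four tokens `["a", "b", "a", "c"]` give a type/token ratio of 0.75 and a Yule's K of 1250; a proxy value that sits at the human mean z‑scores to 0.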
Behavioral realism is measured using a suite of LLM‑judge metrics. GTEval asks a strong LLM to rate utterances according to predefined criteria such as naturalness, tone, and appropriateness. Pairwise Indistinguishability (PI) presents a human‑proxy pair to the judge and asks which one is more likely to be human, yielding a probabilistic indistinguishability score. Rubric‑and‑Reason (RNR) supplies a rubric and requires the judge to provide a rationale for each rating, increasing transparency. Crucially, the authors introduce Human‑Human (HH) and Proxy‑Proxy (PP) calibration controls: HH scores are obtained by running the same judges on real human‑human dialogues, while PP scores are derived from dialogues where both sides are generated by the same proxy. These controls allow the absolute judge scores to be contextualized, revealing how far a proxy deviates from the human baseline and how much variance is intrinsic to the judge itself.
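One simple way the HH and PP controls can contextualize a raw judge score is to place the proxy's score on a scale anchored by the two baselines. This is an illustrative normalization under that assumption, not necessarily the paper's exact formula:

```python
def calibrated_realism(proxy_score, hh_score, pp_score):
    """Map a proxy's raw judge score onto a control-anchored scale:
    ~0.0 at the proxy-proxy (PP) baseline, ~1.0 at the human-human (HH)
    baseline. Values outside [0, 1] mean the proxy falls outside the
    range spanned by the two controls."""
    span = hh_score - pp_score
    if span == 0:
        # The judge cannot separate the controls, so the raw score
        # carries no calibrated information.
        return float("nan")
    return (proxy_score - pp_score) / span
```

For instance, with an HH baseline of 0.9 and a PP baseline of 0.5, a proxy scoring 0.7 lands exactly halfway between the two controls.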
The benchmark incorporates four publicly available conversational corpora—QULAC (clarification‑heavy), ClariQ (information‑seeking), OASST1 (open‑domain chat), and ChatbotArena (AI‑bot style chat). For each dataset the authors perform stratified sampling to obtain up to 200 balanced dialogues, generate a concise goal description for each using an auxiliary LLM, and then conduct goal‑conditioned rollouts where a user‑proxy LLM interacts with a fixed assistant LLM. Only the proxy’s user turns are evaluated; assistant responses are deliberately excluded to keep the focus on user realism.
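The stratified-sampling step can be sketched as follows. This is a generic illustration, with a hypothetical `key` function standing in for whatever stratification attribute a dataset uses (e.g., topic or dialogue length); the paper's exact strata are not specified here.

```python
import random
from collections import defaultdict

def stratified_sample(dialogues, key, cap=200, seed=0):
    """Draw up to `cap` dialogues, balanced across the strata induced
    by `key`. A fixed seed keeps the sample reproducible across runs."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for dialogue in dialogues:
        strata[key(dialogue)].append(dialogue)
    # Give each stratum an equal share of the budget.
    per_stratum = max(1, cap // len(strata))
    sample = []
    for items in strata.values():
        rng.shuffle(items)
        sample.extend(items[:per_stratum])
    return sample[:cap]
```

With two strata and `cap=4`, each stratum contributes two dialogues, yielding a balanced sample of four.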
Empirical results across five user‑proxy models (including GPT‑4, LLaMA‑2 variants, and other strong LLMs) reveal systematic gaps between synthetic users and real humans. While some proxies achieve high GTEval scores, they often lag behind on MATTR, Yule’s K, or HD‑D, especially in clarification‑centric datasets where human utterances exhibit richer lexical variation. Conversely, proxies that match human lexical statistics sometimes receive lower behavioral realism scores, highlighting a realism‑diversity tension. The authors also demonstrate judge sensitivity: swapping judges or omitting HH/PP calibration can shift both absolute scores and relative rankings, underscoring the importance of multi‑judge reporting and variance‑aware analysis. Robustness checks—changing the assistant model, varying random seeds, and repeating experiments—show that the observed gaps are stable and not artifacts of a particular rollout configuration.
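Variance-aware comparison of per-dialogue scores can be illustrated with a percentile bootstrap over the mean. This is a generic sketch of the idea, not necessarily the paper's exact statistical procedure:

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of
    per-dialogue metric scores. Two proxies whose intervals do not
    overlap differ by more than resampling noise alone."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Reporting such intervals alongside point estimates is one way to make ranking claims robust to the score variance intrinsic to LLM judges.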
MirrorBench is released as open‑source software with a command‑line interface that automates dataset preprocessing, goal synthesis, rollout generation, metric computation, and result aggregation. The framework allows researchers to plug in new datasets, alternative lexical metrics, or custom LLM judges, making it a flexible platform for ongoing evaluation of user‑proxy agents.
In summary, the paper makes four key contributions: (1) a clear benchmark protocol that isolates user‑proxy quality from downstream task success; (2) a dual‑metric suite combining lexical‑diversity statistics and calibrated LLM‑judge realism scores; (3) a unified preprocessing pipeline for four diverse conversational datasets; and (4) an empirical study that uncovers realism‑diversity trade‑offs, judge sensitivity, and consistent performance gaps between current LLM‑based user proxies and real human users. The work positions human‑likeness assessment as a necessary step for reliable synthetic data generation and automated dialogue system testing.