HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evaluations frequently conflate base-model capabilities with experimental instantiation, obscuring whether outcomes reflect the model itself or the agent setup. We instead frame participant simulation as an agent-design problem over full experimental protocols, where an agent is defined by a base model and a specification (e.g., participant attributes) that encodes behavioral assumptions. We introduce HUMANSTUDY-BENCH, a benchmark and execution engine that orchestrates LLM-based agents to reconstruct published human-subject experiments via a Filter–Extract–Execute–Evaluate pipeline, replaying trial sequences and running the original analysis pipeline in a shared runtime that preserves the original statistical procedures end to end. To evaluate fidelity at the level of scientific inference, we propose new metrics to quantify how much human and agent behaviors agree. We instantiate 12 foundational studies as an initial suite in this dynamic benchmark, spanning individual cognition, strategic interaction, and social psychology, and covering more than 6,000 trials with human samples ranging from tens to over 2,100 participants.
💡 Research Summary
The paper addresses the growing practice of using large language models (LLMs) as simulated participants in social‑science experiments, highlighting that the behavior of such models is often unstable and highly sensitive to design choices. Existing evaluations typically conflate the raw capabilities of the underlying model with the specifics of how the model is instantiated for a given experiment, making it difficult to discern whether observed discrepancies stem from model limitations or from the way the model is prompted, conditioned, or otherwise configured. To resolve this, the authors reframe participant simulation as an agent‑design problem: an agent consists of a base LLM together with a specification that encodes behavioral assumptions (e.g., role prompts, demographic attributes, memory, tools).
They introduce HumanStudy‑Bench, a benchmark and execution engine that automates the full lifecycle of reproducing published human‑subject experiments with LLM‑based agents. The system follows a four‑stage pipeline—Filter, Extract, Execute, Evaluate—that transforms a research article into a machine‑executable simulation environment.
- Filter: An LLM‑driven filter agent screens candidate studies against three criteria: (i) complete experimental details are publicly available, (ii) outcomes are quantifiable with clearly specified statistical tests, and (iii) the study is feasible to simulate without specialized equipment. Human reviewers verify the filter's checklist.
- Extract: An extraction agent parses the selected papers to produce a structured schema containing participant profiles (sample size, demographics, group assignments), experimental design (conditions, stimuli, trial order), and the original statistical hypothesis tests together with human ground‑truth results (test statistics, p‑values, effect sizes). Human verification ensures high fidelity.
- Execute: For each study, a configuration agent synthesizes three components—trial generator, prompt constructor, and response aggregator—based on the extracted schema. The execution engine then runs any specified agent (model + specification) through the reconstructed protocol, delivering trial‑level responses while preserving the exact sequence of instructions, stimuli, and condition assignments used in the original human experiment. This design allows researchers to swap in different agent specifications without rewriting experiment code, enabling controlled, side‑by‑side comparisons.
- Evaluate: An evaluator agent re‑runs the original statistical analyses on the simulated data. The authors propose two novel, inference‑level metrics:
  - Probability Alignment Score (PAS) – a Bayesian‑style probability that the agent's outcome (accept/reject H₀) matches the human population's latent truth, explicitly accounting for sampling variability in the human data.
  - Effect Consistency Score (ECS) – a normalized distance between the agent's observed effect size (e.g., Cohen's d, Pearson r) and the human effect size, measuring data‑level concordance.
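Since the summary gives only high‑level definitions of the two metrics, a toy sketch may help make them concrete. Both the exponential normalization for ECS and the use of a pre‑estimated human rejection probability for PAS are assumptions for illustration, not the benchmark's actual formulas:

```python
import math

def effect_consistency_score(d_agent: float, d_human: float, scale: float = 1.0) -> float:
    """Toy ECS: map the absolute effect-size gap into (0, 1], where 1.0
    means identical effects. exp(-gap/scale) is one simple bounded
    normalization; the paper's exact choice is not specified here."""
    return math.exp(-abs(d_agent - d_human) / scale)

def probability_alignment_score(agent_rejects: bool, human_reject_prob: float) -> float:
    """Toy PAS: given an estimate of the probability that the human
    population truly rejects H0 (e.g., obtained by bootstrap resampling
    of the human data, which captures its sampling variability), PAS is
    the probability that the agent's binary decision matches that truth."""
    return human_reject_prob if agent_rejects else 1.0 - human_reject_prob
```

For example, an agent reproducing Cohen's d = 0.5 exactly scores ECS = 1.0, while an agent that rejects H₀ when the human evidence for rejection is itself uncertain (say, 0.7) earns only PAS = 0.7 rather than full credit.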
The benchmark initially incorporates 12 foundational studies spanning individual cognition, strategic interaction, and social psychology, together comprising more than 6,000 trials with human sample sizes ranging from a few dozen to over 2,100 participants. The authors evaluate 10 contemporary LLMs (including GPT‑4, Claude‑2, Gemini) across four common agent specifications: a blank baseline, role‑play prompting, demographic conditioning, and a rich backstory.
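The four agent specifications can be pictured as prompt prefixes of increasing richness prepended to each trial's original instructions. The wording and field names below are hypothetical illustrations, not the benchmark's actual schema:

```python
# Hypothetical examples of the four specification types (blank baseline,
# role-play prompting, demographic conditioning, rich backstory).
SPECS = {
    "blank": "",
    "role_play": "You are a participant in a psychology experiment. "
                 "Respond as a typical adult participant would.",
    "demographic": "You are a 34-year-old female teacher from a mid-sized "
                   "U.S. city. Respond as this person would.",
    "backstory": "You are Maria, a 34-year-old teacher. [rich first-person "
                 "life history...] Answer every question in character.",
}

def build_prompt(spec_name: str, trial_instructions: str) -> str:
    """Prepend the agent specification to the trial's original instructions."""
    return f"{SPECS[spec_name]}\n\n{trial_instructions}".strip()
```

Because the execution engine holds the trial sequence fixed, swapping `spec_name` is the only change needed to compare specifications side by side.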
Key findings:
- Current LLM‑based agents achieve limited and inconsistent inferential alignment with humans. Their response distributions are often bimodal, unlike the unimodal patterns observed in human data.
- Agent design matters more than model size; larger models do not systematically improve PAS or ECS, and simple multi‑model ensembles provide no reliable gains.
- The effect of a specification is non‑monotonic and domain‑dependent—for example, demographic conditioning helps in some decision‑making tasks but harms performance in certain social‑psychology paradigms.
- PAS and ECS, by incorporating human sampling uncertainty, reveal shortcomings that traditional aggregate metrics (means, accuracy, distributional distances) would miss.
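A tiny fabricated example (not the paper's data) illustrates why aggregate metrics can miss the bimodality finding: a unimodal "human" sample and a bimodal "agent" sample can share the same mean, so a mean comparison reports perfect agreement even though the response distributions clearly differ.

```python
import statistics

human = [4, 5, 5, 6, 5, 4, 6, 5]   # responses clustered around 5 (unimodal)
agent = [1, 9, 1, 9, 1, 9, 1, 9]   # responses split between two extreme modes

# Identical means: a mean-based comparison sees no discrepancy at all.
assert statistics.mean(human) == statistics.mean(agent) == 5

# Any shape-sensitive check (here, simply the spread) exposes the mismatch.
assert statistics.pstdev(agent) > statistics.pstdev(human)
```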
The authors argue that successful participant simulation requires explicit, theory‑driven agent specifications and rigorous end‑to‑end replication of experimental protocols. HumanStudy‑Bench offers a reusable, extensible platform for the community to develop, test, and compare such agents. Future directions include automated specification optimization, support for longitudinal or multi‑session experiments, and extending the framework to multimodal models and richer behavioral measurements.