Accelerating Social Science Research via Agentic Hypothesization and Experimentation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Data-driven social science research is inherently slow, relying on iterative cycles of observation, hypothesis generation, and experimental validation. While recent data-driven methods promise to accelerate parts of this process, they largely fail to support end-to-end scientific discovery. To address this gap, we introduce EXPERIGEN, an agentic framework that operationalizes end-to-end discovery through a Bayesian-optimization-inspired two-phase search, in which a Generator proposes candidate hypotheses and an Experimenter evaluates them empirically. Across multiple domains, EXPERIGEN consistently discovers 2–4× more statistically significant hypotheses that are 7–17 percent more predictive than prior approaches, and naturally extends to complex data regimes including multimodal and relational datasets. Beyond statistical performance, hypotheses must be novel, empirically grounded, and actionable to drive real scientific progress. To evaluate these qualities, we conduct an expert review of machine-generated hypotheses, collecting feedback from senior faculty. Among 25 reviewed hypotheses, 88 percent were rated moderately or strongly novel, 70 percent were deemed impactful and worth pursuing, and most demonstrated rigor comparable to senior graduate-level research. Finally, recognizing that ultimate validation requires real-world evidence, we conduct the first A/B test of LLM-generated hypotheses, observing statistically significant results with p < 1e-6 and a large effect size of 344 percent.


💡 Research Summary

The paper tackles a fundamental bottleneck in data‑driven social science: the slow, iterative cycle of observation, hypothesis generation, and experimental validation. While recent large‑language‑model (LLM) approaches can suggest hypotheses or test pre‑specified ones, none integrate both steps into a closed‑loop system that works directly on unstructured data. To fill this gap, the authors introduce EXPERIGEN, an agentic framework that unifies hypothesis generation and empirical testing through a two‑phase, Bayesian‑optimization‑inspired search.

EXPERIGEN consists of two specialized LLM agents. The Generator receives a concise summary of the target dataset (schema, feature distributions, sample rows) and a description of the Experimenter’s capabilities, then samples plausible, testable natural‑language hypotheses. This sampling is treated as drawing from the LLM’s implicit prior over scientific statements, ensuring that generated hypotheses are both conceptually reasonable and operationalizable. The Experimenter is implemented as a ReAct‑style agent that, given a hypothesis, plans the necessary feature engineering, selects appropriate covariates, and chooses a statistical test (e.g., chi‑square, t‑test, regression). It executes these steps using two tools: a sandboxed code interpreter for deterministic feature extraction and statistical computation, and an LLM‑based extractor for higher‑level semantic features (e.g., “presence of citations”). The Experimenter returns a structured evidence packet containing p‑value, effect size, assumptions, and robustness checks, applying multiple‑testing corrections (Bonferroni or FDR) when evaluating families of related hypotheses.
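The multiple-testing corrections mentioned above are standard procedures, and a minimal sketch helps make the distinction concrete. The following is illustrative, not the paper's released code; the function names and the alpha threshold are assumptions for this example. Bonferroni controls the family-wise error rate and is more conservative; Benjamini–Hochberg controls the false discovery rate and typically admits more discoveries.

```python
# Illustrative sketch of the two corrections an Experimenter might apply
# when testing a family of m related hypotheses (not the paper's code).

def bonferroni(p_values, alpha=0.05):
    """Reject H0 where the Bonferroni-adjusted p-value (p * m) stays <= alpha."""
    m = len(p_values)
    return [p * m <= alpha for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR procedure: reject the k smallest p-values,
    where k is the largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            rejected[i] = True
    return rejected

p_vals = [0.001, 0.012, 0.03, 0.2]
print(bonferroni(p_vals))          # [True, True, False, False]
print(benjamini_hochberg(p_vals))  # [True, True, True, False] -- less conservative
```

On the same family of p-values, FDR control rescues the third hypothesis that Bonferroni discards, which is why the choice of correction matters when the Experimenter evaluates many related refinements.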

The two‑phase search proceeds as follows. In the outer loop, the Generator proposes a seed hypothesis that maximizes novelty relative to the current hypothesis bank. In the inner loop, the Experimenter evaluates this seed; the resulting evidence is stored in a short‑term memory that the Generator can query in subsequent refinement steps. The Generator may reformulate the hypothesis, add contextual qualifiers, or combine multiple variables based on the feedback. This iterative refinement continues until the hypothesis reaches statistical significance (after correction) or is rejected. The process mirrors Bayesian optimization: exploration is driven by novelty‑biased seed generation, exploitation by repeated, statistically‑controlled refinement of promising candidates.
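The outer/inner loop structure described above can be sketched as a small driver function. This is an interpretation of the search, not the released implementation: the Generator and Experimenter are stubbed as plain Python callables (in EXPERIGEN both are LLM agents), and the Bonferroni factor over inner-loop refinements is an assumption made for the example.

```python
# Toy skeleton of EXPERIGEN's two-phase search (an interpretation, not the
# paper's implementation). The real Generator/Experimenter are LLM agents.

def run_search(generate_seed, refine, experiment,
               n_seeds=3, max_refine=3, alpha=0.05):
    bank = []                                     # accepted hypotheses + evidence
    for _ in range(n_seeds):                      # outer loop: exploration
        hyp = generate_seed(bank)                 # novelty-biased seed vs. the bank
        memory = []                               # short-term evidence memory
        for _ in range(max_refine):               # inner loop: exploitation
            p_value = experiment(hyp)             # Experimenter returns evidence
            memory.append((hyp, p_value))
            if p_value * max_refine <= alpha:     # significant after correction
                bank.append((hyp, p_value))
                break
            hyp = refine(hyp, memory)             # reformulate using the feedback
    return bank

# Deterministic stubs: each seed only becomes significant after one refinement.
seeds = iter(["H1", "H2", "H3"])
generate_seed = lambda bank: next(seeds)
refine = lambda hyp, memory: hyp + "*"
experiment = lambda hyp: 0.001 if hyp.endswith("*") else 0.3

bank = run_search(generate_seed, refine, experiment)
print(bank)  # [('H1*', 0.001), ('H2*', 0.001), ('H3*', 0.001)]
```

The stubs make the control flow visible: the first test of each seed fails correction (0.3 × 3 > 0.05), the refinement passes (0.001 × 3 ≤ 0.05), and the accepted hypothesis enters the bank that biases the next seed toward novelty.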

Empirically, EXPERIGEN is evaluated on ten heterogeneous tasks spanning text (online persuasion), vision (image memorability), and relational data (Reddit thread dynamics). Compared to prior LLM‑based hypothesis generators and automated testing pipelines, EXPERIGEN achieves 7–17 percentage‑point gains in out‑of‑distribution predictive accuracy and discovers 2–4× more statistically significant hypotheses. The false discovery rate drops below 5% versus 20–25% for baselines. A manual review of 25 generated hypotheses by senior professors (with 5–20 years of experience) found 88% to be moderately or strongly novel, 70% to be impactful and worth pursuing, and the methodological rigor comparable to that of senior graduate students. Finally, the authors partner with a Fortune 500 consumer brand to run a real‑world A/B test on a hypothesis about factors influencing sign‑up conversion. The LLM‑generated intervention yields a 344% lift in conversion with p < 10⁻⁶, demonstrating that EXPERIGEN’s outputs can translate into actionable, high‑impact interventions beyond offline benchmarks.
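To see how a conversion lift of that magnitude can clear p < 10⁻⁶, a two-proportion z-test is a reasonable sanity check. The counts below are purely illustrative assumptions (the summary does not report raw conversion numbers): a hypothetical 1.0% baseline versus 4.44% treated conversion, which is exactly a 344% relative lift.

```python
# Hedged sketch: significance of a large conversion lift via a two-sided
# two-proportion z-test. All counts are hypothetical, not from the paper.
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions,
    using the pooled-variance standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical counts: 50/5000 control vs. 222/5000 treated conversions.
z, p = two_proportion_z_test(50, 5000, 222, 5000)
lift = (222 / 5000 - 50 / 5000) / (50 / 5000)
print(f"lift = {lift:.0%}, z = {z:.1f}, p < 1e-6: {p < 1e-6}")
```

With samples of this size, a 344% relative lift drives the z-statistic past 10, so a p-value below 10⁻⁶ is entirely plausible rather than an artifact.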

The paper’s contributions are fourfold: (1) the first end‑to‑end, agentic framework that unifies hypothesis generation and validation; (2) a scalable two‑phase search algorithm that balances novelty‑driven exploration with statistically‑controlled exploitation; (3) extensive empirical evidence of superior predictive performance, discovery rate, and low false‑positive rate across multimodal domains; (4) the first expert evaluation and real‑world A/B test of machine‑generated scientific hypotheses. Limitations include reliance on the underlying LLM’s domain knowledge, restriction to conventional statistical tests (limiting causal inference), potential brittleness of the code interpreter and feature extractor pipeline, and the need for broader real‑world validation across more industries and scientific fields. Future work is suggested on integrating causal inference modules, domain‑specific prompting, open‑source release for reproducibility, and scaling up real‑world experimental deployments.

In sum, EXPERIGEN demonstrates that a tightly coupled pair of LLM agents, guided by Bayesian‑style search, can automate the full scientific discovery loop for unstructured social‑science data, dramatically accelerating hypothesis discovery while maintaining statistical rigor and practical relevance.

