BRIDGE: Predicting Human Task Completion Time From Model Performance


Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR’s exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.


💡 Research Summary

The paper introduces BRIDGE, a psychometric framework that translates AI model performance on benchmark tasks into human‑interpretable estimates of task completion time. The authors observe that directly collecting human time annotations for every benchmark is costly, noisy, and does not scale with the rapid proliferation of new tasks. To overcome this, they adopt a two‑parameter logistic Item Response Theory (2PL‑IRT) model, which jointly estimates a latent difficulty parameter (b) and discrimination parameter (a) for each task, as well as a latent ability (θ) for each model, using binary success/failure outcomes derived from model runs.
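The 2PL-IRT item response curve itself is standard and can be sketched in a few lines (parameter values below are illustrative, not taken from the paper):

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL-IRT: probability that a model with latent ability theta
    solves a task with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals difficulty, the success probability is exactly 50%;
# higher ability pushes it above 50%, higher difficulty pulls it below.
print(p_correct(theta=1.0, a=1.5, b=1.0))  # 0.5
```

The discrimination parameter `a` controls how sharply success probability transitions from 0 to 1 as ability crosses the task's difficulty.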

First, they fit the 2PL‑IRT model on the METR dataset, which contains 170 tasks with human‑annotated completion times. By regressing log(human time) against the inferred difficulty b, they discover an approximately linear relationship: log h = slope × b + intercept. This calibration step anchors the otherwise scale‑invariant IRT latent space to a concrete human‑time axis.
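The calibration step amounts to an ordinary least-squares fit of log time against difficulty. A minimal sketch, using made-up (b, hours) pairs in place of the actual METR estimates:

```python
import numpy as np

# Hypothetical IRT difficulty estimates and human completion times (hours);
# the real values come from fitting 2PL-IRT on the 170 METR tasks.
b = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
human_hours = np.array([0.1, 0.4, 1.5, 6.0, 25.0])

# Regress log(human time) on difficulty: log h = slope * b + intercept.
slope, intercept = np.polyfit(b, np.log(human_hours), deg=1)

def predict_hours(b_new: float) -> float:
    """Map an IRT difficulty estimate to a predicted human time in hours."""
    return float(np.exp(slope * b_new + intercept))
```

Once `slope` and `intercept` are fixed from the anchored dataset, `predict_hours` can be applied to difficulty estimates from any benchmark whose IRT fit shares the same latent scale.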

Armed with this mapping, the authors predict human completion times for four out‑of‑distribution benchmarks that lack extensive time annotations: SWE‑bench Verified, MLE‑bench, GDPval, and Cybench. For each benchmark they collect binary success logs from multiple state‑of‑the‑art models (including tool‑using agents), fit a fresh 2PL‑IRT model, and translate the resulting b values into estimated human times using the METR‑derived linear map. The predicted times align well with the limited human labels that exist (e.g., coarse time buckets in SWE‑bench) and with expert intuition about task difficulty, demonstrating that the latent difficulty scale transfers across domains.

The framework also enables forecasting of future model capabilities. Following prior scaling studies, the authors model latent ability θ as a linear function of calendar time. For each 2‑month release window they identify the best‑performing model, set b = θ (the difficulty at which that model would achieve 50% success), and convert this b to a human time horizon h via the calibrated map. Because log h is linear in b, linear growth in θ implies exponential growth in h: the resulting “50% solvable task‑length horizon” doubles roughly every six months, reproducing the exponential trend reported by METR (which found a roughly 7‑month doubling) but suggesting a slightly faster pace of progress.
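The doubling time falls out of the two linear fits directly: if θ grows linearly in time and log h grows linearly in b = θ, then log h grows linearly in time. A back-of-the-envelope sketch with invented constants (neither number is taken from the paper):

```python
import math

# Hypothetical calibration constants, for illustration only:
irt_slope = 0.9        # slope of log(h) vs. difficulty b from the calibration
theta_per_year = 1.54  # assumed linear growth rate of best-model ability

# Setting b = theta gives the difficulty at which the best model succeeds
# 50% of the time, so log(h) grows at rate irt_slope * theta_per_year
# per year and the horizon doubles every ln(2) / rate years.
doubling_months = 12 * math.log(2) / (irt_slope * theta_per_year)
print(f"50% horizon doubles every {doubling_months:.1f} months")
```

With these assumed constants the formula yields a roughly 6-month doubling; the actual figure depends on the fitted slopes.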

Key contributions include: (1) a scalable method to infer human‑centric difficulty without new human studies, (2) empirical validation that IRT‑derived difficulty correlates log‑linearly with actual human time, and (3) independent confirmation of exponential growth in AI capability expressed in human‑time units.

Limitations are acknowledged. The binary success formulation discards nuanced performance signals (partial credit, quality scores). The calibration relies on the quality and diversity of the initial human‑time dataset; if that dataset is biased, the mapping may misestimate times for novel domains. Moreover, human completion time itself is variable across expertise levels, tool availability, and contextual factors, so the “average human” assumption may not hold in all settings.

Overall, BRIDGE offers a principled bridge between model‑centric benchmark scores and human‑interpretable task durations, paving the way for more intuitive AI progress reporting and for planning real‑world deployments based on estimated human effort. Future work could extend the framework to multi‑class or continuous performance metrics, incorporate partial credit, and explore automated collection of human‑time anchors.

