TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce TRIP-Bench, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets. We further propose GTPO, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5-32B-Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in our evaluation. We expect TRIP-Bench to advance practical long-horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long-horizon training.


💡 Research Summary

TRIP‑Bench is a newly introduced benchmark designed to evaluate large‑language‑model (LLM) agents in realistic, long‑horizon, multi‑turn scenarios. The authors focus on travel planning, a domain that naturally involves complex constraints (budget, timing, preferences), a rich set of tools (flights, hotels, restaurants, attractions), and evolving user behavior. Building on the TripTailor dataset, they clean and expand the data to cover 40 cities, more than 6,000 itineraries, and hundreds of thousands of points of interest. Eighteen tool APIs are provided, each supporting fine‑grained filtering, sorting, and pagination, enabling agents to compose sophisticated action sequences.
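To make the filter/sort/paginate pattern concrete, here is a minimal sketch of what such a tool API might look like. The function name `search_hotels`, the `Hotel` fields, and the parameters are illustrative assumptions, not the benchmark's actual interface:

```python
# Hypothetical sketch of a TRIP-Bench-style tool with filtering,
# sorting, and pagination (names and fields are assumptions).
from dataclasses import dataclass

@dataclass
class Hotel:
    name: str
    city: str
    price: float
    rating: float

def search_hotels(db, city, max_price=None, sort_by="rating",
                  page=1, page_size=5):
    """Filter hotels by city/price, sort descending, return one page."""
    results = [h for h in db if h.city == city
               and (max_price is None or h.price <= max_price)]
    results.sort(key=lambda h: getattr(h, sort_by), reverse=True)
    start = (page - 1) * page_size
    return results[start:start + page_size]

db = [Hotel("A", "Paris", 120, 4.5), Hotel("B", "Paris", 300, 4.8),
      Hotel("C", "Rome", 90, 4.1)]
print([h.name for h in search_hotels(db, "Paris", max_price=200)])
# -> ['A']
```

An agent composing such calls must track pagination state and combine results across tools, which is part of what makes long action sequences hard.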

The benchmark creates tasks by sampling two‑ or three‑city trips, generating a “rubric‑to‑constraint” mapping for over 40 requirement categories, and then synthesizing modification chains that mimic users adding, changing, or removing constraints over up to three steps. Tasks are stratified into Easy, Mid, and Hard difficulty levels based on trip length, number of constraints, and the complexity of simulated user behaviors. The Hard split is further divided into four subsets: LIT (Long Interaction Task), FIT (Feasible‑Infeasible Transition), AIS (Ambiguous Intent Shift), and PMR (Plan Merge/Redirect). Dialogues can reach 15 turns, involve more than 150 tool calls, and exceed 200k tokens of context, making the benchmark a stress test for both reasoning depth and memory handling.
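The modification-chain idea can be sketched as a small generator: starting from a base constraint set, each step randomly adds, changes, or removes a constraint. This is an illustrative reconstruction under assumed data structures, not the authors' generation code:

```python
# Illustrative sketch (not the paper's code) of synthesizing a
# modification chain: up to `steps` user edits to a constraint set.
import random

def synthesize_chain(base_constraints, pool, steps=3, seed=0):
    """pool maps each constraint key to its candidate values."""
    rng = random.Random(seed)
    chain = [dict(base_constraints)]
    for _ in range(steps):
        state = dict(chain[-1])
        op = rng.choice(["add", "change", "remove"])
        if op == "add":
            key, values = rng.choice(sorted(pool.items()))
            state[key] = rng.choice(values)          # user adds a constraint
        elif op == "change" and state:
            key = rng.choice(sorted(state))
            state[key] = rng.choice(pool[key])       # user revises it
        elif op == "remove" and state:
            state.pop(rng.choice(sorted(state)))     # user drops it
        chain.append(state)
    return chain

pool = {"budget": [1000, 2000, 3000], "cuisine": ["local", "vegan"]}
chain = synthesize_chain({"budget": 2000}, pool, steps=3)
```

Each element of `chain` is the constraint state the agent must satisfy after that user turn; the agent never sees the chain ahead of time.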

Evaluation is multi‑dimensional: (1) global constraint satisfaction, (2) correctness and efficiency of tool usage, (3) turn‑level consistency, and (4) alignment with user preferences. An automated rule‑based verifier and turn‑level scoring pipeline provide reliable metrics even for the longest contexts.
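A rule-based verifier of the kind described can be thought of as a set of predicates over the final itinerary, one per rubric entry. The following is a hedged sketch with assumed field names (`cost`, `type`) and constraint keys, not the paper's actual rubric:

```python
# Minimal sketch of a rule-based constraint verifier: each rubric
# entry maps to a predicate over the itinerary (field names assumed).
def verify(itinerary, constraints):
    """Return (pass_rate, failed_constraint_names)."""
    checks = {
        "max_budget": lambda it, v: sum(x["cost"] for x in it) <= v,
        "min_attractions": lambda it, v:
            sum(x["type"] == "attraction" for x in it) >= v,
    }
    failed = [name for name, v in constraints.items()
              if not checks[name](itinerary, v)]
    return 1 - len(failed) / len(constraints), failed

itinerary = [{"type": "hotel", "cost": 400},
             {"type": "attraction", "cost": 30}]
rate, failed = verify(itinerary, {"max_budget": 500, "min_attractions": 2})
# rate == 0.5, failed == ["min_attractions"]
```

Because the checks are deterministic functions of the final plan, they remain reliable even when the preceding dialogue context is hundreds of thousands of tokens long.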

Experiments with state‑of‑the‑art models (GPT‑5.2, Gemini‑3‑Pro, Qwen2.5‑Instruct) reveal a stark performance gap. On the Easy split, the best model reaches only about 45% success; on Hard subsets, success drops below 10%. This demonstrates that existing LLMs, even when fine‑tuned for tool use, struggle with sustained planning, constraint management, and dynamic user interaction.

To address these shortcomings, the authors propose GTPO (Group‑Relative Turn‑level Policy Optimization), an online multi‑turn reinforcement learning framework. GTPO introduces three key mechanisms: (i) turn‑level reward normalization to mitigate variance across steps, (ii) global‑instruction normalization that aligns per‑turn rewards with the overall task objective, and (iii) reward differencing that stabilizes policy updates by focusing on incremental improvements. When applied to Qwen2.5‑32B‑Instruct, GTPO yields a 10 percentage‑point gain in “loose” evaluation and a 5 percentage‑point gain in the stricter setting, surpassing Gemini‑3‑Pro.
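The three GTPO ingredients can be sketched schematically as an advantage computation over a group of rollouts of the same task. This NumPy version is only an illustration of the mechanisms as summarized above (differencing per-turn rewards, blending in the global task reward, then normalizing group-relatively per turn); the actual algorithm's formulation is in the paper:

```python
# Schematic sketch of GTPO-style advantages (not the paper's exact math).
import numpy as np

def gtpo_advantages(turn_rewards, task_rewards, eps=1e-8):
    """turn_rewards: (group, turns) cumulative per-turn rewards for a
    group of rollouts of one task; task_rewards: (group,) final scores."""
    r = np.asarray(turn_rewards, dtype=float)
    # (iii) reward differencing: credit each turn with its increment
    diff = np.diff(np.concatenate([np.zeros((r.shape[0], 1)), r], axis=1),
                   axis=1)
    # (ii) align per-turn signal with the overall task objective
    blended = diff + np.asarray(task_rewards, dtype=float)[:, None]
    # (i) group-relative normalization at each turn index
    return (blended - blended.mean(axis=0)) / (blended.std(axis=0) + eps)

adv = gtpo_advantages([[1.0, 2.0], [0.0, 1.0]], [1.0, 0.0])
```

Normalizing within a group of rollouts of the same task (as in group-relative methods generally) removes the need for a learned value baseline, while the differencing step keeps per-turn credit assignment from being swamped by reward accumulated in earlier turns.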

The paper’s contributions are threefold: (1) a large‑scale, tool‑augmented benchmark that captures long‑horizon planning, complex rule adherence, and diverse user behaviors; (2) a comprehensive empirical analysis exposing the limitations of current LLM agents in such settings; and (3) the GTPO algorithm, which demonstrates that online RL can substantially improve constraint satisfaction and interaction robustness for long‑horizon tasks.

Limitations include the travel‑specific focus, which may limit generalization to other domains, reliance on a rule‑based user simulator that may not fully capture human unpredictability, and the computational cost of online RL training. Future work is suggested to broaden the benchmark to other real‑world domains, incorporate human‑in‑the‑loop evaluations, and develop more memory‑efficient RL methods.

