WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints


Real-world autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints solvable through local greedy decisions and rely on idealized data, failing to capture the complexity of extracting parameters from dynamic web environments. We introduce WorldTravel, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop WorldTravel-Webscape, a multi-modal environment featuring over 2,000 rendered webpages where agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 32.67% feasibility in text-only settings, which plummets to 19.33% in multi-modal environments. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.


💡 Research Summary

The paper introduces WorldTravel, a novel benchmark designed to evaluate autonomous agents on realistic, tightly‑coupled travel‑planning tasks, and WorldTravel‑Webscape, a multimodal execution environment that forces agents to extract constraint parameters directly from rendered web pages. Existing travel‑planning benchmarks typically present loosely coupled constraints and provide clean, structured inputs (e.g., JSON), which allow greedy local decisions and do not reflect the difficulty of gathering information from dynamic web interfaces. In contrast, WorldTravel comprises 150 real‑world itineraries across five major European cities (Berlin, Vienna, Rome, Florence, Barcelona). Each scenario contains on average more than 15 interdependent temporal and logical constraints, ranging from hard constraints (fixed entry slots, operating windows, minimum dwell times, inter‑activity buffers) to soft constraints (cost consistency, preference matching).

Data collection involved harvesting factual data (official operating hours, tiered pricing, reservation slots) and user‑generated content (recommended dwell times, crowd‑avoidance tips) from official venue sites, travel platforms, and social media. Using GPT‑5.2, the authors generated over 2,000 static HTML pages that mimic real booking platforms, then manually verified UI diversity, visual cues (e.g., “Sold Out” badges, color coding), and layout complexity. The environment exposes eight APIs (attraction booking, restaurant reservation, hotel search, route planning, etc.) that return screenshot images rather than structured JSON, compelling agents to perform visual perception, information extraction, and cross‑page reasoning.

A formal constrained‑scheduling formulation underpins the benchmark: given a user query Q, an agent must output an ordered itinerary y = (I, {s_i, d_i}, v) where I is the set of activities, s_i and d_i are start times and durations, and v encodes discrete decisions such as ticket type or hotel choice. Feasibility is governed by a parameter set θ (extracted from the webpages) and must satisfy all hard constraints; soft constraints are evaluated only within feasible solutions. The authors provide mathematically precise definitions for each constraint type and deterministic verification functions, enabling automatic, reproducible scoring.
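The deterministic verification described above can be illustrated with a minimal sketch. It is not the authors' verifier: the names `Activity`, `VenueParams`, `violations`, and `is_feasible`, the minutes-after-midnight time encoding, and the single global buffer value are all assumptions made for illustration, and only four of the hard-constraint types named in the summary are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One scheduled item: start s_i and duration d_i, in minutes after midnight."""
    name: str
    start: int
    duration: int


@dataclass(frozen=True)
class VenueParams:
    """Hypothetical slice of the parameter set theta for one venue."""
    opens: int
    closes: int
    min_dwell: int
    entry_slots: tuple = ()  # empty tuple means no fixed entry slots


def violations(itinerary, theta, buffer_min=0):
    """Return every hard-constraint violation; an empty list means feasible."""
    problems = []
    for act in itinerary:
        p = theta[act.name]
        end = act.start + act.duration
        if act.start < p.opens or end > p.closes:
            problems.append(f"{act.name}: outside operating window")
        if act.duration < p.min_dwell:
            problems.append(f"{act.name}: below minimum dwell time")
        if p.entry_slots and act.start not in p.entry_slots:
            problems.append(f"{act.name}: start is not a fixed entry slot")
    # Inter-activity buffer: each activity must start after the previous one
    # ends plus the required buffer (assumed identical for every pair here).
    for prev, nxt in zip(itinerary, itinerary[1:]):
        if nxt.start < prev.start + prev.duration + buffer_min:
            problems.append(f"{prev.name} -> {nxt.name}: buffer too short")
    return problems


def is_feasible(itinerary, theta, buffer_min=0):
    return not violations(itinerary, theta, buffer_min)


# Toy parameters, invented for this example (not taken from the benchmark).
theta = {
    "museum": VenueParams(opens=600, closes=1080, min_dwell=90,
                          entry_slots=(600, 660, 720)),
    "lunch": VenueParams(opens=720, closes=900, min_dwell=45),
}
plan_ok = [Activity("museum", 660, 90), Activity("lunch", 780, 60)]
plan_late = [Activity("museum", 720, 90), Activity("lunch", 780, 60)]

print(is_feasible(plan_ok, theta, buffer_min=30))
print(violations(plan_late, theta, buffer_min=30))
```

The second plan shows the coupling the benchmark targets: moving the museum to a later, still valid entry slot leaves the museum itself feasible but breaks the buffer before lunch.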

The experimental evaluation covers ten state‑of‑the‑art language‑model‑based agents, including GPT‑5.2, GPT‑4.1, and Gemini‑2.5‑Pro. Two settings are compared: (1) a text‑only condition where all constraint parameters are pre‑extracted and supplied in structured form, and (2) the full multimodal condition where agents must perceive constraints from the rendered webpages. In the text‑only setting, the best model (GPT‑5.2) achieves only 32.67% feasibility, which shows that the reasoning component struggles with tightly coupled constraints even when perception is taken out of the loop. In the multimodal setting, the same model's feasibility drops to 19.33%. The authors call this difference the “Perception‑Action Gap”: a substantial share of otherwise feasible plans is lost when agents must read constraint parameters from complex UI layouts.
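To put the two headline figures on a common footing, the short calculation below converts them into implied scenario counts and a relative drop. It assumes that feasibility is the fraction of all 150 scenarios for which the plan is feasible; the summary does not state the denominator, so the counts are an inference rather than reported numbers. The variable and function names are illustrative.

```python
TOTAL_SCENARIOS = 150      # benchmark size stated in the abstract
TEXT_ONLY_RATE = 32.67     # GPT-5.2 feasibility, text-only (percent)
MULTIMODAL_RATE = 19.33    # GPT-5.2 feasibility, multimodal (percent)


def implied_count(rate_percent, total=TOTAL_SCENARIOS):
    """Number of feasible scenarios implied by a rate, assuming the rate
    is computed over all `total` scenarios (an assumption, see above)."""
    return round(rate_percent / 100 * total)


text_count = implied_count(TEXT_ONLY_RATE)
multimodal_count = implied_count(MULTIMODAL_RATE)

absolute_drop = TEXT_ONLY_RATE - MULTIMODAL_RATE   # percentage points
relative_drop = absolute_drop / TEXT_ONLY_RATE     # share of text-only successes lost

print(f"text-only:  about {text_count} of {TOTAL_SCENARIOS} scenarios")
print(f"multimodal: about {multimodal_count} of {TOTAL_SCENARIOS} scenarios")
print(f"drop: {absolute_drop:.2f} points, {relative_drop:.1%} relative")
```

Under that assumption the rates correspond to roughly 49 and 29 feasible scenarios, a drop of 13.34 percentage points, or about 41% of the text-only successes.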

A deeper analysis uncovers two independent bottlenecks. First, the perception‑action gap indicates that visually extracting time slots, availability indicators, and pricing is a major source of error. Second, a “Planning Horizon” effect emerges: as the number of constraints in a scenario rises, success rates decline, with a clear inflection point around ten constraints. This suggests that existing LLM reasoning mechanisms cannot reliably propagate decisions through long chains of interdependent constraints. Even when the perception problem is removed (pre‑extracted parameters), the top model reaches only 32.67% feasibility, confirming that reasoning alone remains a limiting factor.
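One way to look for such a planning-horizon effect is to bin scenarios by constraint count and compare feasibility rates across bins. The sketch below is not the authors' analysis code. The function name, the bin width, and above all the records are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def feasibility_by_constraint_count(records, bin_width=5):
    """Group (n_constraints, feasible) records into fixed-width bins and
    return the feasibility rate per bin, keyed by (low, high) inclusive."""
    totals = defaultdict(lambda: [0, 0])  # bin start -> [feasible, total]
    for n_constraints, feasible in records:
        low = (n_constraints // bin_width) * bin_width
        totals[low][0] += int(feasible)
        totals[low][1] += 1
    return {
        (low, low + bin_width - 1): ok / n
        for low, (ok, n) in sorted(totals.items())
    }


# Toy records, made up for this example: (number of constraints, feasible?)
toy_records = [
    (6, True), (8, True), (9, False),
    (11, False), (12, True), (14, False),
    (16, False), (18, False),
]

rates = feasibility_by_constraint_count(toy_records)
for (low, high), rate in rates.items():
    print(f"{low:>2}-{high:<2} constraints: {rate:.2f}")
```

With real per-scenario outcomes, a threshold like the one reported would show up as a sharp fall in the rate between the bins on either side of ten constraints.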

The authors acknowledge limitations: the webpages are static and lack dynamic behaviors such as infinite scroll, AJAX loading, or CAPTCHAs, which are common in real booking sites. Moreover, the evaluation focuses on large language models; specialized vision‑language models, multimodal planners, or reinforcement‑learning agents are not examined, leaving open the question of how alternative architectures would fare.

In conclusion, WorldTravel and WorldTravel‑Webscape form a benchmark that jointly tests high‑fidelity visual perception, long‑horizon constraint reasoning, and global plan feasibility in a realistic travel‑planning context. The identified perception‑action gap and planning‑horizon threshold give future research concrete diagnostic targets: agents that integrate multimodal perception with globally consistent reasoning well enough to handle brittle real‑world logistics.

