ToPT: Task-Oriented Prompt Tuning for Urban Region Representation Learning
Learning effective region embeddings from heterogeneous urban data underpins key urban computing tasks (e.g., crime prediction, resource allocation). However, prevailing two-stage methods yield task-agnostic representations, decoupling them from downstream objectives. Recent prompt-based approaches attempt to fix this but introduce two challenges: they often lack explicit spatial priors, causing spatially incoherent inter-region modeling, and they lack robust mechanisms for explicit task-semantic alignment. We propose ToPT, a two-stage framework that delivers spatially consistent fusion and explicit task alignment. ToPT consists of two modules: spatial-aware region embedding learning (SREL) and task-aware prompting for region embeddings (Prompt4RE). SREL employs a Graphormer-based fusion module that injects spatial priors (distance and regional centrality) as learnable attention biases to capture coherent, interpretable inter-region interactions. Prompt4RE performs task-oriented prompting: a frozen multimodal large language model (MLLM) processes task-specific templates to obtain semantic vectors, which are aligned with region embeddings via multi-head cross-attention for stable task conditioning. Experiments across multiple tasks and cities show state-of-the-art performance, with improvements of up to 64.2%, validating the necessity and complementarity of spatial priors and prompt-region alignment. The code is available at https://github.com/townSeven/Prompt4RE.git.
💡 Research Summary
The paper introduces ToPT, a novel two‑stage framework for learning urban region representations that directly addresses two critical shortcomings of existing methods: (1) the lack of explicit spatial priors leading to incoherent inter‑region modeling, and (2) insufficient mechanisms for aligning embeddings with downstream task semantics. The first stage, Spatial‑aware Region Embedding Learning (SREL), processes heterogeneous multi‑view data (POIs, mobility, land‑use, etc.) using a Graphormer‑based fusion architecture. It enriches the attention mechanism with learnable spatial bias terms derived from a distance‑based adjacency matrix and node centrality scores, thereby ensuring that attention weights reflect both feature similarity and geographic proximity. The second stage, Prompt4RE, leverages a frozen multimodal large language model (MLLM) to generate task‑specific prompt vectors from satellite imagery, street‑view photos, and geo‑text guided by carefully crafted templates (e.g., “What is the crime risk for this region?”). These prompt vectors are aligned with the region embeddings via multi‑head cross‑attention, followed by residual connections and layer normalization, producing a prompt‑aligned representation. The aligned prompts are projected into soft prompts and concatenated with the original embeddings, yielding a final representation that simultaneously encodes spatial consistency and task semantics.
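The SREL idea of injecting distance and centrality priors into attention can be sketched as follows. This is a minimal, illustrative single-head variant assuming a Graphormer-style design: pairwise distances are bucketed and each bucket contributes a learnable scalar bias to the attention logits, while a learnable centrality embedding (indexed here by node degree) is added to the region features. The class and parameter names are hypothetical, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SpatialBiasAttention(nn.Module):
    """Single-head attention with Graphormer-style spatial biases (sketch).

    Each distance bucket gets a learnable scalar added to the attention
    logits; a learnable centrality embedding is added to node features.
    Bucketing scheme and dimensions are illustrative assumptions.
    """
    def __init__(self, dim, num_dist_buckets=8, max_degree=16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.dist_bias = nn.Embedding(num_dist_buckets, 1)  # bias per distance bucket
        self.centrality = nn.Embedding(max_degree, dim)     # centrality encoding
        self.scale = dim ** -0.5

    def forward(self, x, dist_bucket, degree):
        # x: (N, dim) region features; dist_bucket: (N, N) long; degree: (N,) long
        h = x + self.centrality(degree)                     # inject centrality prior
        logits = (self.q(h) @ self.k(h).T) * self.scale     # (N, N) similarity logits
        logits = logits + self.dist_bias(dist_bucket).squeeze(-1)  # add spatial bias
        return torch.softmax(logits, dim=-1) @ self.v(h)

# Toy usage: 5 regions with 32-dim features.
x = torch.randn(5, 32)
dist_bucket = torch.randint(0, 8, (5, 5))
degree = torch.randint(0, 16, (5,))
out = SpatialBiasAttention(32)(x, dist_bucket, degree)
```

Because the bias enters the logits additively, nearby regions can receive higher attention even when their raw features differ, which is the spatial-coherence effect the paper attributes to SREL.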
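The Prompt4RE alignment step (cross-attention, residual connection, layer normalization, then projection to a soft prompt concatenated with the original embedding) can likewise be sketched. This is an assumed minimal layout using PyTorch's built-in multi-head attention; the module name, head count, and soft-prompt width are illustrative choices, and the MLLM-derived prompt vectors are stand-in random tensors here.

```python
import torch
import torch.nn as nn

class PromptAlign(nn.Module):
    """Sketch of Prompt4RE-style alignment (illustrative, not the paper's code).

    Region embeddings act as queries over frozen-MLLM prompt vectors via
    multi-head cross-attention; a residual connection and layer norm
    stabilize the result, which is projected into a soft prompt and
    concatenated with the original embedding.
    """
    def __init__(self, dim, n_heads=4, soft_dim=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, soft_dim)

    def forward(self, regions, prompts):
        # regions: (B, N, dim) queries; prompts: (B, P, dim) keys/values
        aligned, _ = self.attn(regions, prompts, prompts)
        aligned = self.norm(regions + aligned)      # residual + layer norm
        soft = self.proj(aligned)                   # task-conditioned soft prompt
        return torch.cat([regions, soft], dim=-1)   # (B, N, dim + soft_dim)

# Toy usage: batch of 2, 5 regions, 3 prompt vectors per sample.
regions = torch.randn(2, 5, 32)
prompts = torch.randn(2, 3, 32)
fused = PromptAlign(32)(regions, prompts)
```

Keeping the original embedding in the concatenation preserves the spatially consistent representation from SREL while the soft prompt carries the task semantics, matching the summary's description of the final representation.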
Experiments are conducted on real‑world data from Chicago, covering three downstream tasks: crime prediction, check‑in forecasting, and service‑call estimation. The authors evaluate using MAE, RMSE, and R², comparing ToPT against strong baselines such as MVURE, MGFN, HREP, ReCP, RegionDCL, HAFusion, and FlexiReg. ToPT consistently outperforms all baselines, achieving up to 64.2 % relative improvement (e.g., crime prediction MAE drops from 61.7 to 49.3). Statistical significance is confirmed via t‑tests (p < 0.05).
Ablation studies examine the impact of (i) removing the Prompt4RE stage, (ii) omitting task‑specific templates, and (iii) replacing the cross‑attention alignment with simple concatenation. Each ablation leads to noticeable performance degradation, underscoring the importance of both spatially informed fusion and explicit task‑oriented prompting. Moreover, the framework’s model‑agnostic nature is demonstrated by swapping the underlying MLLM among LLaMA‑Vision‑Instruct, Qwen2.5‑VL, and DeepSeek‑VL2; ToPT retains its advantage over FlexiReg across all variants.
Key contributions include: (1) integrating distance and centrality as learnable attention biases within a Graphormer to capture coherent inter‑region relations; (2) introducing a frozen MLLM‑based prompting pipeline that extracts semantically rich, task‑aligned vectors from multimodal urban data; (3) employing multi‑head cross‑attention to bridge the semantic gap between region embeddings and prompts, yielding stable task conditioning; and (4) delivering a two‑stage system that remains parameter‑efficient while achieving state‑of‑the‑art results across multiple cities and tasks.
The authors acknowledge limitations such as reliance on a pre‑defined spatial graph, fixed MLLM weights that preclude domain‑specific pre‑training, and the quadratic cost of attention for very large region sets. Future work is suggested to explore dynamic graph learning, lightweight cross‑attention mechanisms, and integration with region‑specific multimodal pre‑trained models to further improve scalability and adaptability.