
Original Info

  • ArXiv ID: 2512.22605

Abstract

The precise prediction of human mobility has produced significant socioeconomic impacts, such as location recommendations and evacuation suggestions. However, existing methods suffer from limited generalization capability: unimodal approaches are constrained by data sparsity and inherent biases, while multi-modal methods struggle to capture mobility dynamics because of the semantic gap between static multi-modal representations and spatial-temporal dynamics. Therefore, we leverage multi-modal spatial-temporal knowledge to characterize mobility dynamics for the location recommendation task, dubbed Multi-Modal Mobility (M³ob). First, we construct a unified spatial-temporal relational graph (STRG) for multi-modal representation by leveraging the functional semantics and spatial-temporal knowledge captured by a large language model (LLM)-enhanced spatial-temporal knowledge graph (STKG). Second, we design a gating mechanism to fuse spatial-temporal graph representations of different modalities, and propose an STKG-guided cross-modal alignment to inject spatial-temporal dynamic knowledge into the static image modality. Extensive experiments on six public datasets show that our proposed method not only achieves consistent improvements in normal scenarios but also exhibits significant generalization ability in abnormal scenarios. Our code will be made publicly available at https://anonymous.4open.science/r/M3ob-62EF.

CCS Concepts: • Information systems → Recommender systems.

Full Content

The widespread adoption of mobile devices has made location-based services highly accessible on the Web, enabling applications such as location search [14,42], mobile navigation [37], and location recommendation [21,45]. However, while the resulting check-in sequence data greatly facilitates the integration of web services into daily life, two issues lead to data sparsity: the sheer number of locations, and users' tendency to visit only a limited number of popular locations of interest. Furthermore, abnormal mobility behaviors, such as visits to long-tail locations or visits during extreme weather events, are difficult to capture due to their scarce occurrence.

The remarkable success of deep learning has transformed next-location recommendation into a representation learning paradigm, whose methods can be categorized into sequence-based and graph-based methods. Sequence-based methods design spatial-temporal modules to extract mobility representations from historical check-in sequences, including Markov chain-based models [8], RNN-based models [28,41], and attention-based models [6,7,23,30]. Graph-based methods [3,9,15,24-27,34,40,43] enhance location representations by sharing common patterns of locations based on spatial-temporal graphs. However, the sparsity of check-in sequences and the inherent mobility bias lead to suboptimal representation learning for these unimodal approaches.

Recently, language and vision models have demonstrated advanced representations learned through extensive self-supervised learning, which has inspired progress in multi-modal learning across various domains. In e-commerce recommendation, MKGAT [29] and MMGCN [36] leverage multi-modal attributes of items together with abundant user-item interactions to construct a multi-modal graph for learning. AlignRec [22] narrows the representation distance through contrastive learning between items and their multi-modal content. In location recommendation, to our knowledge, MMPOI [39] was the first to introduce multi-modality by constructing a multi-modal graph via cosine similarity. However, existing multi-modal approaches neglect the semantic gap between mobility dynamics and static multi-modal representations, which inevitably leads to performance degradation when generalizing to unseen scenarios.

Despite the effectiveness of multi-modal pre-trained representations, bridging the semantic heterogeneity between static multi-modal data and dynamic human mobility remains a key challenge.

(1) How can we model the dynamic relationships of locations and users in a multi-modal view? Current multi-modal methods construct static multi-modal relationships to obtain multi-modal knowledge without integrating mobility dynamics, leading to limited generalization capability in dynamic scenarios. (2) How can we align multi-modal representations of locations and users to capture mobility dynamics? The inherent semantic gap between different modalities necessitates spatial-temporal-aware alignment techniques to eliminate modality heterogeneity.

To this end, we design a Multi-Modal Mobility (M³ob) framework that overcomes the generalization gap by building a shared graph across multiple modalities.

To address the first challenge of modeling multi-modal dynamics, we build a spatial-temporal relational graph (STRG) that enables the sharing of spatial-temporal knowledge across modalities, leveraging the functional semantics and spatial-temporal knowledge from a large language model (LLM)-enhanced spatial-temporal knowledge graph (STKG). The hierarchical knowledge in the STKG further enables the modeling of multi-level user preferences. For the second challenge of aligning multi-modal representations, we employ a gating mechanism to dynamically fuse spatial-temporal graph representations from different modalities, thereby mitigating modal interference. Additionally, an STKG-guided cross-modal alignment is applied to reduce inter-modal discrepancies.

Our main contributions are summarized as follows:

• We leverage multi-modal spatial-temporal knowledge to characterize mobility dynamics for next-location recommendation, constructing a unified spatial-temporal relational graph (STRG) from an LLM-enhanced spatial-temporal knowledge graph (STKG).
• We design a gated fusion of spatial-temporal graph representations across modalities and an STKG-guided cross-modal alignment that injects spatial-temporal dynamic knowledge into the static image modality.
• Extensive experiments on six public datasets show consistent improvements in normal scenarios and significant generalization ability in abnormal scenarios.

Table 1 contrasts existing multi-modal methods with ours: MMPOI [39] constructs its graph via cosine similarity and TSPN-RA [12] via a quad tree, both without mobility dynamics and with only single-level preferences.

Deep learning approaches have dominated recent progress in next-location recommendation, which predicts users' future visits and can be categorized into sequence-based and graph-based models. Sequence-based models capture spatial-temporal patterns through Markov chains [8], RNN architectures [7,28,41,46], or attention mechanisms [23,30]. For instance, LSTPM [28] employs a non-local network and a geographically dilated RNN to model long- and short-term preferences. CLSPRec [6] leverages contrastive learning on raw sequential data to effectively distinguish between long- and short-term user preferences. MCLP [30] incorporates a multi-head attention mechanism to generate arrival time embeddings as contextual information for location recommendation. However, these methods insufficiently capture the spatial-temporal transition patterns of different locations in trajectories. Graph-based models primarily utilize spatial-temporal relationships between locations and apply GNNs to enhance location representations. One line of work enhances location embeddings by constructing spatial-temporal graphs based on geographical and transitional relationships between locations [9,15,25,43]; GETNEXT [43], for example, constructs a global trajectory flow graph to better capture transition patterns between locations. Another line of work builds spatial-temporal mobility graphs using pre-trained spatial-temporal knowledge graphs [26,38,44]. AGCL [27] employs graph-enhanced location representation and negative-sample contrast to enhance the discriminability and consistency of POI representations. LoTNext [38] introduces long-tailed adjustment strategies for both the graph and the loss to tackle the imbalance between head and tail locations. However, unimodal check-in sequences inevitably limit the representation quality of these methods, yielding suboptimal generalization capability in scarce mobility scenarios.

Recent studies attempt to alleviate this sparsity issue by enriching the semantic features of locations with multi-modal information. MMPOI [39] constructs static multi-modal graphs using intra-modal cosine similarity to leverage independent information from each modality. TSPN-RA [12] builds a spatial graph based on a static quadtree structure derived from remote sensing imagery to incorporate real-world environmental semantics. However, static relationships between locations extracted from multi-modal data overlook the spatial-temporal dynamics, leaving a significant semantic gap between multi-modal embeddings and mobility learning. As Table 1 illustrates, we go beyond these limitations by injecting mobility dynamics into the multi-modal representation.

Multi-modal content, which enriches item information and captures users' fine-grained preferences, has been widely introduced into recommendation systems in recent studies [19,47]. Existing research effectively integrates multi-modal information through attention mechanisms [4,16,18,20]. Another line of work constructs multi-modal graphs based on user-item interactions and item multi-modal attributes to enrich item semantic representations [31,32,35,36]. Further advancing this direction, several studies leverage multi-modal knowledge graphs to model complex interactions between users and multi-modal information [17,29,33]. Yet another approach [22,48] seeks alignment in the representation space, reducing the distance between representations through contrastive learning between items and their multi-modal content. Existing multi-modal recommendation primarily focuses on static modal fusion and fails to model the inherent spatial-temporal dynamics of mobility data, making it difficult to apply directly to mobility representation tasks.

Figure 1 shows the overall workflow of our proposed M³ob framework, which consists of three parts. (a) The Spatial-Temporal Knowledge Graph (STKG) enriches spatial-temporal semantics with LLM-enhanced textual semantics of spatial-temporal entities. (b) Leveraging the unified embedding space of the STKG, the Multi-modal Mobility Representation first constructs a novel Spatial-Temporal Relational Graph (STRG). This graph is built from the spatial-temporal dynamics embedded in the STKG and supports multi-modal location fusion. Additionally, the Multi-modal Mobility Representation builds multi-level user preferences by leveraging textual semantics ranging from coarse to fine granularity. (c) Multi-modal Trajectory Optimization is regularized by hierarchical semantic prediction and STKG-guided cross-modal alignment to incorporate multi-modal spatial-temporal knowledge.

High-order textual semantics alleviate long-tail location data sparsity and enrich the entity association dimensions of the Spatial-Temporal Knowledge Graph (STKG). Distinct from existing works [26,38], we are the first to inject textual semantics derived from Large Language Models (LLMs) into the STKG, thereby obtaining hierarchical spatial-temporal dynamic relationships. We first apply DeepSeek-R1 to cluster location category texts, obtaining 12 general human activity types. This process yields a three-level semantic hierarchy, ranging from coarse to fine: activity, category, and location. The prompt for activity generation is provided in Appendix A.1. This LLM-based approach not only enhances mapping accuracy but also addresses the challenges arising from divergent category definitions across different urban contexts.

Formally, a Hierarchical STKG is defined as a multi-relation graph G = (E, R, F), where E is the set of entities, R is the set of relations, and F = {(e_head, r, e_tail)} with e_head, e_tail ∈ E and r ∈ R. Each triplet (e_head, r, e_tail) denotes an edge in the KG from head to tail with relation r. We consider users, locations, and their associated categories and activities as spatial-temporal entities to enhance knowledge capacity, and characterize the relations between these entities accordingly. We pretrain the STKG using TransE [1], and the resulting embeddings are then utilized in subsequent modules to construct the spatial-temporal relational graph, which is further adopted to fuse the multi-modal representations of locations.
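For reference, TransE [1] scores a triplet by translation in the embedding space; a minimal sketch of the standard objective is shown below (the margin γ and the negative-sampling scheme are standard TransE choices, not details taken from this paper):

d(h, r, t) = \lVert \mathbf{e}_h + \mathbf{e}_r - \mathbf{e}_t \rVert_2
\mathcal{L}_{\mathrm{TransE}} = \sum_{(h,r,t) \in \mathcal{F}} \; \sum_{(h',r,t') \in \mathcal{F}'} \big[ \gamma + d(h,r,t) - d(h',r,t') \big]_{+}

where F' denotes the set of corrupted (negative) triplets and γ is the margin.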

4.2.1 Spatial-Temporal Relational Graph. We construct a data-driven Spatial-Temporal Relational Graph by leveraging entity similarity in the STKG, which establishes unified dynamic relationships for multi-modal data and effectively mitigates the issue of data sparsity. The similarity between multi-level entities is computed from their pretrained STKG embeddings (Equation 1),

where ent_i^k, ent_j^k, and r_t are obtained from the pretrained STKG. Using the hierarchical transition edge weights computed via Equation 1, we establish three mobility transition matrices M_p ∈ R^(N_p × N_p), M_c ∈ R^(N_c × N_c), and M_a ∈ R^(N_a × N_a), where N_c and N_a are the numbers of categories and activities, respectively. We further prune uninformative graph edges by keeping only the k nearest neighbors (k-NN):

where N_k(e_i) denotes the k nearest neighbors of entity e_i. The STRG is further normalized by dividing by a diagonal matrix D_e of row-wise maximum values. The representations of the different entity types are then obtained by a GCN:

where Z_e denotes the original one-hot embeddings of an entity type, G_e is the corresponding STRG, W is a weight matrix, and σ(·) is the activation function. Furthermore, we update the representations by integrating the spatial-temporal graph representations of both categories and activities with their original features through residual connections.
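A minimal sketch of this construction in PyTorch, assuming a TransE-style translation score as the entity similarity (the paper's exact Equation 1 is not reproduced in this excerpt) and a single GCN layer:

import torch

def build_strg(ent_emb, rel_emb, k=20):
    # ent_emb: (N, d) pretrained STKG embeddings of one entity level (locations, categories, or activities)
    # rel_emb: (d,) pretrained embedding of the transition relation r_t
    # NOTE: the similarity below is an assumed TransE-style score, not the paper's exact Equation 1.
    diff = (ent_emb + rel_emb).unsqueeze(1) - ent_emb.unsqueeze(0)   # (N, N, d) translation residuals
    sim = -diff.norm(dim=-1)                                         # (N, N), larger = more plausible transition
    topk_val, topk_idx = sim.topk(k, dim=-1)                         # keep only the k nearest neighbors per row
    graph = torch.zeros_like(sim).scatter_(-1, topk_idx, topk_val.exp())
    graph = graph / graph.max(dim=-1, keepdim=True).values.clamp_min(1e-8)  # row-wise max normalization (D_e)
    return graph

def gcn_layer(graph, z, w):
    # one GCN propagation step: sigma(G Z W), with ReLU as the activation
    return torch.relu(graph @ z @ w)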

4.2.2 Multi-scale Image Representation. In the real world, a direct correspondence exists between a remote sensing image and its geographic location, as such images capture the surrounding environmental semantics. Remote sensing images exhibit hierarchical semantics: lower zoom levels capture broader regional context, while higher zoom levels provide finer-grained details. To effectively leverage multi-scale representations, we first extract features from localized images at three zoom levels using a pre-trained CLIP vision encoder (ViT). These features are then projected into the ID modality's semantic space via modality-specific MLP layers (Proj_h):
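A sketch of this frozen multi-scale feature extraction, using the Hugging Face CLIP vision encoder as an assumed stand-in for the paper's pretrained ViT (the hidden size, projection depth, and variable names are illustrative):

import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor

# Assumed stand-in for the paper's pretrained CLIP ViT-B/32 vision encoder.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)          # frozen to prevent catastrophic forgetting and save GPU memory

d_model = 256                        # illustrative ID-space dimension
zoom_levels = ["zoom15", "zoom16", "zoom17"]   # coarse / medium / fine satellite tiles
proj = nn.ModuleDict({
    z: nn.Sequential(nn.Linear(768, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
    for z in zoom_levels
})

def encode_images(images_by_zoom):
    # images_by_zoom: dict zoom_level -> list of PIL satellite tiles, one per location
    feats = {}
    for zoom, imgs in images_by_zoom.items():
        pixels = processor(images=imgs, return_tensors="pt")["pixel_values"]
        with torch.no_grad():        # the CLIP encoder stays frozen
            clip_feat = encoder(pixel_values=pixels).pooler_output   # (N_p, 768)
        feats[zoom] = proj[zoom](clip_feat)   # project into the ID modality's semantic space
    return feats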

To prevent catastrophic forgetting and reduce GPU memory requirements, we freeze the parameters of all pretrained image encoders during the subsequent training tasks. Image Relational Graph. We introduce mobility dynamics into the remote sensing imagery to overcome the limitation of previous methods [39], whose static graphs rely on cosine similarity between images. To achieve this, we share the Spatial-Temporal Relational Graph (STRG) structure of a location with its corresponding image to build an image relational graph G_img. The corresponding representations of the image modality are obtained by a GCN:

where D_img likewise denotes the row-wise max diagonal matrix of G_img.

Cross-modal fusion of ID and image representations is essential for multi-modal interaction, but direct application of static weighting may induce modality conflicts.

To address this issue, we employ a gated fusion mechanism that dynamically adjusts the contribution of each modality. Specifically, we first concatenate Z_p^fusion ∈ R^(N_p × d) and Z_img^fusion ∈ R^(N_p × d), then apply a linear transformation followed by a Sigmoid activation to generate gate values g ∈ R^(N_p × d) constrained within [0, 1]; finally, these gate values weight the two modalities:

where W ∈ R^(d × 2d) is a weight matrix, b ∈ R^d is a bias vector, ∥ denotes concatenation, and ⊙ is element-wise multiplication. We then perform residual weighting between the multi-modal graph representations Z_m and the original image/ID modality representations, respectively:

where Z_p denotes the original one-hot embeddings of locations, and α is a hyperparameter that controls the proportion of fused multi-modal graph information. This residual connection helps avoid the degradation of cross-modal representations caused by intra-modal noise.
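A minimal sketch of the gate and the residual weighting, assuming the gate weights the two modalities complementarily and the residual is applied symmetrically to both (the exact residual form is not spelled out in this excerpt):

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Gated fusion of ID-modality and image-modality graph representations.
    def __init__(self, d, alpha=0.8):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)   # W in R^(d x 2d) plus bias b
        self.alpha = alpha                # residual weight for the fused graph representation

    def forward(self, z_id, z_img):
        g = torch.sigmoid(self.gate(torch.cat([z_id, z_img], dim=-1)))  # gate values in [0, 1]
        z_m = g * z_id + (1.0 - g) * z_img                              # gated multi-modal fusion
        # residual weighting back onto each modality (assumed symmetric form)
        z_id_out = self.alpha * z_m + (1.0 - self.alpha) * z_id
        z_img_out = self.alpha * z_m + (1.0 - self.alpha) * z_img
        return z_id_out, z_img_out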

To capture users’ preferences for different levels of locations, we construct a multi-level user preference representation following the hierarchical knowledge of textual semantics, written as:

where the preference P_k is computed as P_k = ent_u + r_t + ent_k, k ∈ {p, c, a}, i.e., by summing the embeddings of the visit triplet (ent_u, r_t, ent_k) from the STKG. During training, we freeze all entity and relation representations to avoid preference forgetting while fine-tuning the downstream prediction tasks.

4.3.1 Hierarchical Regularization. The mobility sequence representation starts from a visit record (u, p, t) and its associated category and activity, following the hierarchical knowledge of textual semantics:

where Z_u denotes the one-hot encoded representation of user IDs and f(·) = MLP(concat(·)). A day is discretized into 48 time slots of 30 minutes each, and temporal representations Z_t are derived using Time2Vector [13] to capture temporal periodicity. The representation S_{u_i} of the mobility sequence of user u_i is computed by
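As a reference, a minimal Time2Vec-style module for embedding the 48 half-hour slots, following the standard formulation [13] (one linear dimension plus periodic sine dimensions; the embedding size is illustrative):

import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    # Standard Time2Vec: the first output dimension is linear, the rest are periodic.
    def __init__(self, d_time=16):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d_time))
        self.b = nn.Parameter(torch.randn(d_time))

    def forward(self, slot):
        # slot: integer index of the 30-minute slot in [0, 47]
        t = slot.float().unsqueeze(-1)                 # (..., 1)
        v = self.w * t + self.b                        # (..., d_time)
        return torch.cat([v[..., :1], torch.sin(v[..., 1:])], dim=-1)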

The obtained representation matrix of the mobility sequence is then fed into the Transformer decoder to obtain the final representation, written as:

where the Transformer computes attention over the historical locations as softmax(QK^T / √d)·V. By incorporating the next category, activity, and time as auxiliary tasks, prediction performance for the next location can be enhanced. Therefore, we construct four MLP prediction heads that simultaneously perform the four prediction tasks:
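A sketch of these four heads on top of the final Transformer representation (a single linear layer per head is an assumption; the text only states that each head is an MLP):

import torch.nn as nn

class PredictionHeads(nn.Module):
    # Heads for next location, category, activity (classification) and time (regression).
    def __init__(self, d, n_loc, n_cat, n_act):
        super().__init__()
        self.loc_head = nn.Linear(d, n_loc)
        self.cat_head = nn.Linear(d, n_cat)
        self.act_head = nn.Linear(d, n_act)
        self.time_head = nn.Linear(d, 1)

    def forward(self, h):
        # h: (batch, d) final Transformer representation of the mobility sequence
        return (self.loc_head(h), self.cat_head(h),
                self.act_head(h), self.time_head(h).squeeze(-1))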

4.3.2 STKG-Guided Cross-Modal Alignment. To bridge the semantic gap between static multi-modal data and dynamic mobility behaviors, we enrich the static image representations with the spatial-temporal dynamics from the Hierarchical Spatial-Temporal Knowledge Graph (STKG). Specifically, we feed the location entity embedding from the pretrained STKG and the image representation into two independent projection layers and align them in a shared space.

where Proj_KG is a projection layer composed of a single-layer MLP.

The alignment loss between the image modality and the STKG entities is computed globally: the corresponding pair (img_i, p_i) is treated as the positive sample, and pairs (img_i, p_j) with j ≠ i as negative samples:

where sim(·) denotes the inner product. Bidirectional contrastive learning prevents modal collapse, and the cross-modal alignment enhances the spatial-temporal dynamics of the image modality.
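A minimal sketch of this bidirectional contrastive alignment, assuming an inner-product similarity and a temperature τ (the temperature is an assumption; it is not stated in this excerpt):

import torch
import torch.nn.functional as F

def bidirectional_infonce(z_img, z_kg, tau=0.1):
    # z_img, z_kg: (N_p, d) projected image and STKG location-entity embeddings; matching rows are positives.
    logits = (z_img @ z_kg.t()) / tau                 # inner-product similarity matrix
    labels = torch.arange(z_img.size(0), device=z_img.device)
    loss_i2k = F.cross_entropy(logits, labels)        # image -> entity direction
    loss_k2i = F.cross_entropy(logits.t(), labels)    # entity -> image direction
    return 0.5 * (loss_i2k + loss_k2i)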

Objective. Finally, we adopt a multi-task learning approach that jointly optimizes the four prediction losses and the cross-modal alignment loss for next-location recommendation; the training objective integrates all of the preceding loss functions:

where L_p, L_c, and L_a denote the cross-entropy losses for next-location, next-category, and next-activity prediction, respectively, and L_t is the mean squared error loss for next-time prediction. λ_t and λ_con are the weights of L_t and L_con, respectively.
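Under these definitions, the overall objective presumably takes the form below (unit weights on the three classification losses are an assumption, consistent with Appendix B's note that λ_t and λ_con are chosen to match their scale):

\mathcal{L} = \mathcal{L}_p + \mathcal{L}_c + \mathcal{L}_a + \lambda_t \, \mathcal{L}_t + \lambda_{con} \, \mathcal{L}_{con}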

We conduct extensive experiments on real trajectory datasets from six cities to verify the performance of M³ob on the next-location recommendation task. We aim to answer the following research questions:

• RQ1: How does M³ob perform compared with other state-of-the-art methods on the next-location recommendation task?
• RQ2: How well does M³ob generalize under abnormal scenarios such as adverse weather and long-tail locations?
• RQ3: How does M³ob compare with baselines in efficiency?
• RQ4: How do different modules affect the prediction performance of M³ob across the various mobility datasets?
• RQ5: How do hyper-parameters influence M³ob's performance?

The check-in data of the six cities are sourced from Foursquare. We perform a general preprocessing step [28]: filtering locations with fewer than 10 visits; defining all check-ins within a calendar day as a single trajectory; removing trajectories with fewer than 3 check-ins; and excluding inactive users with fewer than 5 trajectories. Each user's trajectories are split chronologically into 80% training, 10% validation, and 10% test. Table 2 summarizes the statistics of the experimental datasets. Image representations are derived from 256×256-pixel satellite imagery acquired via ArcGIS at three granularities: coarse (zoom level 15), medium (zoom level 16), and fine (zoom level 17). Weather data are obtained from ERA5-Land. Figure 2 compares the time intervals and geographical distances between consecutive locations in trajectories under rainy and sunny conditions across the six cities. Figure 3 compares the proportion of human activities on MOW during colder periods versus normal weather conditions.
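A sketch of this preprocessing pipeline on a generic check-in table (the column names are illustrative, not taken from the paper's code):

import pandas as pd

def preprocess(checkins: pd.DataFrame) -> pd.DataFrame:
    # checkins columns (illustrative): user_id, loc_id, timestamp (pandas Timestamp)
    # 1. filter locations with fewer than 10 visits
    loc_counts = checkins["loc_id"].value_counts()
    df = checkins[checkins["loc_id"].isin(loc_counts[loc_counts >= 10].index)].copy()
    # 2. all check-ins within a calendar day form a single trajectory
    df["traj_id"] = df["user_id"].astype(str) + "_" + df["timestamp"].dt.date.astype(str)
    # 3. remove trajectories with fewer than 3 check-ins
    df = df[df.groupby("traj_id")["loc_id"].transform("size") >= 3]
    # 4. exclude inactive users with fewer than 5 trajectories
    df = df[df.groupby("user_id")["traj_id"].transform("nunique") >= 5]
    # 5. chronological 80/10/10 split per user
    parts = []
    for _, g in df.sort_values("timestamp").groupby("user_id"):
        n = len(g)
        labels = ["train"] * int(0.8 * n) + ["valid"] * int(0.1 * n)
        labels += ["test"] * (n - len(labels))
        g = g.copy()
        g["split"] = labels
        parts.append(g)
    return pd.concat(parts)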

To validate M³ob's effectiveness, we benchmark it against 11 state-of-the-art baselines. These baselines are categorized into sequence-based methods [7,8,10,28,30,41] and graph-based methods [25,26,38,39,43]. More details are provided in Appendix B.1.

We evaluate predictions using Acc@k, the standard location recommendation metric, which measures whether the true next location appears among the top-k recommendations.
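For reference, a minimal Acc@k computation over model scores (shapes are illustrative):

import torch

def acc_at_k(logits, targets, k=10):
    # logits: (batch, n_loc) scores over candidate locations; targets: (batch,) true next-location ids
    topk = logits.topk(k, dim=-1).indices                 # (batch, k)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)    # true location appears in the top-k?
    return hits.float().mean().item()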

Table 3 shows the performance of different baselines on the six datasets. Our method consistently outperforms state-of-the-art baselines across all datasets, achieving average improvements of 3.71% at Acc@5, 3.70% at Acc@10, and 3.31% at Acc@20. Among sequence-based methods, LSTPM excels by explicitly modeling long- and short-term preferences, while MCLP struggles with sparse trajectory data, where topic modeling fails to capture diverse user preferences.

5.3 RQ2: Generalization Analysis of M³ob. 5.3.1 Performance Analysis on Adverse Weather.

Human travel preferences shift significantly in rainy and cold weather [2,5,11]. As shown in Figure 2, analysis of trajectory data from the six cities shows that during rain, average time intervals and spatial distances decreased by 28.2% and 9.7%, respectively, compared to sunny conditions. Movement speed increased, indicating a preference for visiting nearer locations and greater use of vehicles. NYC and MOW, situated at higher latitudes, are frequently affected by cold spells. As shown in Figure 3, in cold weather people favor indoor activities, reducing trips for dining and shopping. Such specific weather conditions challenge the generalization of mobility models in sparse scenarios. As shown in Figures 4 and 5, our method significantly outperforms Graph-Flashback and MMPOI in location recommendation under both rainy and cold weather. This consistent advantage across datasets confirms our approach's generalization in adverse weather: the remote sensing image module enriches the environmental semantics of locations, while the LLM-derived text semantics capture high-level differences in human activities. Additionally, the unified STRG and the STKG-guided cross-modal alignment enhance the spatial-temporal knowledge shared across modalities, enabling the model to capture dynamic shifts in human behavior under adverse weather conditions.

Figure 6 reports prediction accuracy on long-tail locations, measured by Acc@10. Our method consistently outperforms LoTNext (a long-tail adjustment framework) in predicting both head and tail locations. The performance of LoTNext is limited by sparse trajectory data and inherent data biases, whereas our approach enriches information for long-tail locations by leveraging real-world textual and visual modalities. Moreover, through a shared spatial-temporal relational graph, our method endows the different modalities with dynamic spatial-temporal representations, enabling effective semantic transfer from head locations to tail ones.

As shown in Table 4, our method demonstrates superior training and inference efficiency over competitive baselines on the NYC and SGP datasets. Among sequence-based methods, LSTPM incurs substantially higher training/validation time because it serially processes each historical trajectory; among graph-based methods, LoTNext requires additional time per iteration due to graph denoising operations. Furthermore, despite utilizing multi-modal data, our lightweight model construction introduces little additional memory overhead compared to unimodal methods.

To evaluate the effectiveness of different modules, we construct six variants of M³ob in our ablation study; the results are shown in Table 5.

• w/o STRG: When the STRG is excluded, the IRG is removed as well, because the IRG is derived from the STRG. The STRG models dynamic relationships across modalities and mitigates the sparsity issue; consequently, performance decreases (-1.31%) after this module is removed.
• w/o MUP: Omitting the multi-level user preferences (MUP), which capture time-aware user interests across location semantics, results in observable performance degradation.
• w/o CMA: Eliminating the cross-modal alignment (CMA), which infuses images with spatial-temporal dynamics from the knowledge graph, also leads to a performance drop.

The chosen setting consistently yields near-optimal results across all datasets. Hyperparameter sensitivity experiments for the k-NN size of the STRG and the multi-task loss weights can be found in Appendix B.4.

In this study, we introduce M³ob, a multi-modal fusion framework that leverages spatial-temporal knowledge to integrate human mobility dynamics with multi-modal representations, builds a unified STRG across modalities, and implements STKG-guided cross-modal contrastive alignment. Our method exhibits strong generalization ability, as demonstrated by extensive comparative experiments in both normal and abnormal scenarios, especially under adverse weather and for long-tail locations. Benefiting from its lightweight design, the method also offers high inference efficiency. Furthermore, ablation studies validate the contributions of the different modules within our framework. In the future, we will explore dynamic scenarios for multi-modal mobility prediction to enhance the emergency management capabilities of decision-makers. Another promising direction is to investigate interpretable human mobility paradigms by guiding the mobility of LLM agents with their perceptual multi-modal knowledge.

A Method Details. A.1 LLM Prompt for Location Hierarchy.

The prompt used for generating activity labels from category data is shown in Figure 8. Based on the categorical text descriptions of locations, we employ a large language model to infer 12 generic types of human activities.

B.1 Baselines.

• MCLP [30]: It employs topic modeling to extract users' historical location preferences and generates arrival time embeddings as context for location recommendation through a multi-head attention mechanism.

• GETNEXT [43]: It employs a trajectory flow graph to capture the general movement patterns of users.
• Graph-Flashback [26]: Based on a spatial-temporal knowledge graph, it enhances location representations by utilizing the temporal and spatial relationships between locations.
• Diff-POI [25]: It harnesses two graphs to extract spatial-temporal representations and introduces a diffusion-based sampling strategy to investigate users' spatial preferences.

• MMPOI [39]: It leverages intra-modal cosine similarity to construct a multi-modal graph, thereby enriching location representations.

• LoTNext [38]: It proposes a long-tailed graph adjustment and a long-tailed loss adjustment module to enhance long-tailed location prediction.

All experiments are performed on a single NVIDIA RTX 4090. We utilize Adam as the optimizer, with an initial learning rate set to 1e-4 and the L2 regularization penalty set to 1e-3. For the image encoder, we adopt CLIP’s pre-trained image encoder (ViT-B-32).

For the Transformer, we stack two encoder layers with a dropout rate of 0.3, set the number of attention heads to 4, and train for 75 epochs with a batch size of 128. We search for the ID dimension in {128, 256, 512}.

It is worth noting that we set the time loss and alignment loss weights to 10 and 1, respectively, to match the scale of the location and category losses. Additionally, we set the weight for the graph fusion representation to 0.8. For a fair comparison, we optimized each baseline's parameters as reported in its paper and recorded the mean and standard deviation across five random seeds.
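Collected into a single configuration sketch (values taken from the text above; the key names are illustrative):

CONFIG = {
    "optimizer": "Adam",
    "lr": 1e-4,
    "weight_decay": 1e-3,          # L2 regularization penalty
    "image_encoder": "CLIP ViT-B/32 (frozen)",
    "transformer_layers": 2,
    "attention_heads": 4,
    "dropout": 0.3,
    "epochs": 75,
    "batch_size": 128,
    "id_dim_search": [128, 256, 512],
    "lambda_t": 10,                # next-time loss weight
    "lambda_con": 1,               # cross-modal alignment loss weight
    "alpha_graph_fusion": 0.8,     # residual weight for the fused graph representation
    "num_seeds": 5,
}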

Given that MCLP and LoTNext are the most recent sequence-based and graph-based models among our baselines, we conducted comprehensive grid hyperparameter searches for both methods. The results of these searches for selected parameters on the NYC and SGP datasets are presented in Table 6. Despite extensive hyperparameter tuning, both methods continue to underperform prior work (e.g., LSTPM and Graph-Flashback). This performance gap arises because MCLP was evaluated on datasets with lower sparsity than ours; its approach performs best with denser trajectory data. Furthermore, we identified a data leakage issue in the publicly released LoTNext code, where the Transformer module lacked temporal masking during inference. We confirmed this flaw via email correspondence with the authors and subsequently added masking to LoTNext in our experiments to ensure a fair comparison.

B.4 More Parameter Analysis.

Coarse-grained remote sensing images capture macro functional zone features and achieve the best performance. For SGP, a city with mixed functions and compact spaces, medium/fine-grained remote sensing images are more effective, as they capture microscopic neighborhood features. Meanwhile, the multi-scale remote sensing image representation outperforms single-granularity representations, as it integrates information from remote sensing images of various granularities.


Footnotes:
1. DeepSeek: https://www.deepseek.com/
2. Foursquare dataset: https://sites.google.com/site/yangdingqi/home/foursquare-dataset
3. ArcGIS: https://server.arcgisonline.com/
4. ERA5-Land: https://cds.climate.copernicus.eu/datasets
