Multimodal Generative Retrieval Model with Staged Pretraining for Food Delivery on Meituan

Multimodal retrieval models are becoming increasingly important in scenarios such as food delivery, where rich multimodal features can meet diverse user needs and enable precise retrieval. Mainstream approaches typically employ a dual-tower architecture between queries and items, and perform joint optimization of intra-tower and inter-tower tasks. However, we observe that joint optimization often leads to certain modalities dominating the training process, while other modalities are neglected. In addition, inconsistent training speeds across modalities can easily result in the one-epoch problem. To address these challenges, we propose a staged pretraining strategy, which guides the model to focus on specialized tasks at each stage, enabling it to effectively attend to and utilize multimodal features, and allowing flexible control over the training process at each stage to avoid the one-epoch problem. Furthermore, to better utilize the semantic IDs (SIDs) that compress high-dimensional multimodal embeddings, we design both generative and discriminative tasks to help the model understand the associations between SIDs, queries, and item features, thereby improving overall performance. Extensive experiments on large-scale real-world Meituan data demonstrate that our method achieves improvements of 3.80%, 2.64%, and 2.17% on R@5, R@10, and R@20, and 5.10%, 4.22%, and 2.09% on N@5, N@10, and N@20 compared to mainstream baselines. Online A/B testing on the Meituan platform shows that our approach achieves a 1.12% increase in revenue and a 1.02% increase in click-through rate, validating the effectiveness and superiority of our method in practical applications.


💡 Research Summary

The paper tackles the practical challenges of multimodal retrieval in a large‑scale food‑delivery platform (Meituan), where queries are textual while items contain both text and images. Conventional dual‑tower architectures jointly optimize several contrastive objectives (intra‑tower and inter‑tower). The authors observe two critical problems in this joint training regime: (1) modality dominance, where the loss for the easier modality (text) rapidly converges while the loss for the harder modality (image) lags far behind, effectively causing the model to ignore image information; and (2) the “one‑epoch problem”, where the faster‑learning modality over‑fits early, preventing the slower modality from ever receiving sufficient training. Empirical evidence is provided by replacing image embeddings with random vectors, which yields almost unchanged retrieval performance, confirming that images are under‑utilized.
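The randomized-image ablation described above can be reproduced as a simple diagnostic probe. The sketch below is illustrative, not the paper's evaluation pipeline: it assumes additive fusion of text and image embeddings, and the function names are hypothetical. If replacing the image component with noise barely changes recall, the retriever is not using visual information.

```python
import numpy as np

rng = np.random.default_rng(0)

def recall_at_k(query_embs, item_embs, relevant_idx, k=5):
    """Fraction of queries whose relevant item appears in the top-k results."""
    scores = query_embs @ item_embs.T
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([relevant_idx[i] in topk[i]
                          for i in range(len(relevant_idx))]))

def probe_image_usage(query_embs, text_embs, image_embs, relevant_idx, k=5):
    """Re-score with the image component replaced by random noise.
    A near-zero gap between the two recalls suggests the retriever
    is effectively ignoring the image modality."""
    real = recall_at_k(query_embs, text_embs + image_embs, relevant_idx, k)
    noise = rng.standard_normal(image_embs.shape)
    randomized = recall_at_k(query_embs, text_embs + noise, relevant_idx, k)
    return real, randomized
```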

To remedy these issues, the authors propose a staged pre‑training strategy. Training is divided into distinct phases, each focused on its own objective(s):

  • Stage 1 learns image‑to‑text alignment inside the item tower (image2text contrastive loss), thereby establishing a strong multimodal representation.
  • Stage 2 introduces three query‑item contrastive tasks—query2item, query2text, and query2image—each with its own InfoNCE loss. By training these objectives sequentially rather than jointly, the model can allocate appropriate capacity to each modality and avoid the one‑epoch phenomenon. Hyper‑parameters (learning rate, batch composition, temperature) are tuned per stage, giving fine‑grained control over training dynamics.
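The staged schedule above can be sketched as follows. The InfoNCE implementation is the standard in-batch-negatives formulation; the stage names, task groupings, and hyper-parameter values in `STAGES` are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE with in-batch negatives: row i of `positives` is the
    positive for row i of `anchors`; all other rows act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # cross-entropy on the diagonal

# Staged schedule: each stage optimizes only its own objective(s),
# with its own hyper-parameters (illustrative values, not the paper's).
STAGES = [
    {"name": "stage1", "tasks": ["image2text"], "temperature": 0.07},
    {"name": "stage2", "tasks": ["query2item", "query2text", "query2image"],
     "temperature": 0.05},
]
```

Because each stage owns its hyper-parameters and stopping criterion, the faster-converging objective can simply be halted before it overfits, which is the lever the staged design uses against the one-epoch problem.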

Beyond representation learning, the paper addresses the storage and latency constraints of deploying high‑dimensional embeddings at scale. It adopts a Residual‑Quantized Variational Auto‑Encoder (RQ‑VAE) to compress 1024‑dimensional vectors into compact Semantic IDs (SIDs) via multi‑layer codebooks. While prior work treats SIDs as static discrete tokens fine‑tuned only with discriminative losses, the authors argue that such SIDs lack semantic grounding and therefore propose a dual‑task fine‑tuning:

  • A generative task that reconstructs the original dense embedding from the SID (reconstruction loss), encouraging SIDs to retain rich semantic information.
  • A discriminative task that aligns SIDs with query embeddings through contrastive learning, ensuring that SIDs are directly useful for retrieval.

This combination turns SIDs from mere indices into meaningful “semantic tokens” that can be matched against LLM‑encoded queries.
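The SID pipeline above can be sketched in two pieces: the residual quantization that turns a dense embedding into layered codebook indices, and the dual-task objective used during fine-tuning. This is a minimal sketch under assumptions the summary does not confirm (a linear decoder for reconstruction, in-batch negatives, a simple weighted sum of the two losses); the VAE encoder/decoder and codebook training are omitted.

```python
import numpy as np

def residual_quantize(vec, codebooks):
    """Greedy residual quantization (the core of an RQ-VAE quantizer):
    at each layer, pick the nearest codeword, add it to the running
    reconstruction, and pass the residual to the next codebook.
    The per-layer indices form the item's Semantic ID (SID)."""
    residual = np.asarray(vec, dtype=float).copy()
    sid, recon = [], np.zeros_like(residual)
    for cb in codebooks:                      # cb: (K, d) matrix of codewords
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        sid.append(idx)
        recon += cb[idx]
        residual -= cb[idx]
    return sid, recon

def dual_task_loss(sid_embs, dense_embs, query_embs, decoder_w,
                   alpha=1.0, temperature=0.07):
    """Dual-task fine-tuning of SID embeddings: a (hypothetical) linear
    decoder reconstructs the dense item embedding (generative term), and
    in-batch InfoNCE aligns SID embeddings with queries (discriminative)."""
    recon = sid_embs @ decoder_w
    gen_loss = float(np.mean((recon - dense_embs) ** 2))
    s = sid_embs / np.linalg.norm(sid_embs, axis=1, keepdims=True)
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    logits = q @ s.T / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    disc_loss = float(-np.mean(np.diag(log_probs)))
    return gen_loss + alpha * disc_loss
```

Each successive codebook refines the residual left by the previous one, so a handful of small codebooks covers a large effective vocabulary, which is what makes SIDs so much cheaper to store and match than 1024-dimensional floats.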

The experimental evaluation uses billions of real‑world search logs from Meituan. Baselines include standard dual‑tower models, recent multimodal contrastive models, and a generative retrieval baseline that uses SIDs without staged training. Results show consistent gains: Recall@5/10/20 improve by 3.80%, 2.64%, and 2.17% respectively; NDCG@5/10/20 improve by 5.10%, 4.22%, and 2.09%. An ablation where image features are randomized confirms that the staged approach truly leverages visual information.

Online A/B testing further validates the method: the staged‑pretrained, SID‑enhanced model yields a 1.12 % increase in revenue and a 1.02 % lift in click‑through rate compared to the production baseline. These gains demonstrate that the proposed techniques translate into tangible business impact at massive scale.

In summary, the paper makes three major contributions:

  1. Identification and systematic mitigation of modality dominance and one‑epoch problems via staged pre‑training.
  2. Introduction of generative + discriminative objectives for SID learning, giving discrete tokens semantic depth and retrieval utility.
  3. Comprehensive offline and online validation on a real‑world, billion‑scale platform, showing both metric improvements and revenue uplift.

Future directions suggested include extending the staged framework to additional modalities (e.g., video, audio), incorporating richer user behavior signals for multi‑task personalization, and building ultra‑low‑latency SID‑based retrieval pipelines (e.g., GPU/FPGA acceleration). The work sets a solid foundation for scalable, high‑performance multimodal search systems in e‑commerce and beyond.

