Benchmarking Multimodal Large Language Models for Missing Modality Completion in Product Catalogues


Missing-modality information on e-commerce platforms, such as absent product images or textual descriptions, often arises from annotation errors or incomplete metadata, impairing both product presentation and downstream applications such as recommendation systems. Motivated by the multimodal generative capabilities of recent Multimodal Large Language Models (MLLMs), this work investigates a fundamental yet underexplored question: can MLLMs generate missing modalities for products in e-commerce scenarios? We propose the Missing Modality Product Completion Benchmark (MMPCBench), which consists of two sub-benchmarks: a Content Quality Completion Benchmark and a Recommendation Benchmark. We further evaluate six state-of-the-art MLLMs from the Qwen2.5-VL and Gemma-3 model families across nine real-world e-commerce categories, focusing on image-to-text and text-to-image completion tasks. Experimental results show that while MLLMs can capture high-level semantics, they struggle with fine-grained word-level and pixel- or patch-level alignment. In addition, performance varies substantially across product categories and model scales, and we observe no trivial correlation between model size and performance, in contrast to trends commonly reported in mainstream benchmarks. We also explore Group Relative Policy Optimization (GRPO) to better align MLLMs with this task. GRPO improves image-to-text completion but does not yield gains for text-to-image completion. Overall, these findings expose the limitations of current MLLMs in real-world cross-modal generation and represent an early step toward more effective missing-modality product completion.


💡 Research Summary

This paper addresses the pervasive problem of missing product modalities—specifically images or textual descriptions—in large‑scale e‑commerce catalogs. The authors investigate whether recent Multimodal Large Language Models (MLLMs) can automatically generate the absent modality and how useful the generated content is for downstream tasks such as recommendation. To this end, they introduce the Missing Modality Product Completion Benchmark (MMPCBench), which consists of two complementary sub‑benchmarks:

  1. Content Quality Completion Benchmark (CQBench) – evaluates the intrinsic fidelity of the generated modality using a suite of automatic metrics. For text, cosine similarity on TF‑IDF vectors, token overlap, and BERTScore are employed. For images, low‑level measures (PSNR, SSIM, MSE) are combined with high‑level perceptual metrics (LPIPS, CLIP similarity) to capture both pixel‑wise reconstruction quality and semantic alignment.

  2. Recommendation Benchmark (RecBench) – assesses the extrinsic utility of the generated modality by substituting it for the missing input in three well‑known multimodal recommender models (VBPR, MMGCN, LightGCN‑Multimodal). Standard ranking metrics (NDCG@k, Recall@k) are reported to quantify the impact on personalized ranking.
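The CQBench text and image metrics above can be sketched in plain Python. This is an illustrative reimplementation, not the paper's exact code: the tokenizer, the smoothed IDF formula, and the flattened-image representation are simplifying assumptions.

```python
import math
from collections import Counter

def token_overlap(reference: str, candidate: str) -> float:
    """Fraction of reference tokens that also appear in the candidate (illustrative)."""
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    return len(ref & cand) / len(ref) if ref else 0.0

def tfidf_cosine(docs: list[str], i: int, j: int) -> float:
    """Cosine similarity between docs[i] and docs[j] on TF-IDF vectors
    built over the whole corpus `docs` (smoothed IDF to avoid zeros)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for toks in tokenized for t in set(toks))

    def vec(toks):
        tf = Counter(toks)
        return {t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
                for t, c in tf.items()}

    a, b = vec(tokenized[i]), vec(tokenized[j])
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def psnr(ref: list[float], gen: list[float], max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two flattened, same-size images."""
    mse = sum((r - g) ** 2 for r, g in zip(ref, gen)) / len(ref)
    return float("inf") if mse == 0 else 10 * math.log10(max_val ** 2 / mse)
```

BERTScore, LPIPS, and CLIP similarity additionally require pretrained encoders, which is why the benchmark pairs these cheap surface metrics with learned perceptual ones.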

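RecBench's ranking metrics can likewise be sketched with binary relevance. This is a minimal standalone version assuming the usual log2 discount; the paper's recommender models produce the `ranked` lists being scored.

```python
import math

def recall_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the user's relevant items retrieved in the top-k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```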
The benchmark is built on the latest (March 2024) Amazon Review Dataset, covering nine popular product categories (All Beauty, Arts‑Crafts‑Sewing, Electronics, Home‑Kitchen, Industrial‑Scientific, Musical Instruments, Office Products, Toys‑Games, Video Games). For each category, 1,000 fully annotated items are sampled; one modality is masked to create the Image‑to‑Text (I→T) and Text‑to‑Image (T→I) tasks. Text‑to‑Image generation proceeds by prompting a diffusion model (e.g., Stable Diffusion) with the image caption produced by the MLLM.
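The sampling-and-masking step above can be sketched as follows. The dict keys (`category`, `description`, `image`) are a hypothetical simplification of the Amazon Review Dataset schema, not the paper's actual field names.

```python
import random

def build_completion_tasks(items: list[dict], n_per_category: int = 1000, seed: int = 0):
    """Sample fully annotated items and mask one modality per task.

    Returns (image_to_text, text_to_image) pairs: in I->T the image is the
    input and the description is the masked target; in T->I the roles swap.
    """
    rng = random.Random(seed)
    # Keep only items where both modalities are present, so the masked
    # modality can serve as ground truth for evaluation.
    complete = [it for it in items if it.get("description") and it.get("image")]
    sample = rng.sample(complete, min(n_per_category, len(complete)))
    image_to_text = [{"input": it["image"], "target": it["description"]} for it in sample]
    text_to_image = [{"input": it["description"], "target": it["image"]} for it in sample]
    return image_to_text, text_to_image
```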

Six state‑of‑the‑art MLLMs are evaluated: three sizes each from the Qwen2.5‑VL and Gemma‑3 families (≈2 B, ≈7 B, ≈13 B parameters). The experimental findings are nuanced:

  • Semantic competence – All models capture high‑level product semantics, achieving text similarity scores in the 0.78–0.84 range and CLIP image similarity of 0.62–0.71. However, fine‑grained word‑level alignment and pixel‑level fidelity remain weak, as reflected by modest token overlap, BERTScore, and relatively high LPIPS values.
  • Scale vs. performance – Contrary to trends on generic vision‑language benchmarks, larger models do not consistently outperform smaller ones. In I→T, larger models yield slightly better text metrics, but in T→I the medium‑sized models often produce higher CLIP similarity, suggesting that the bottleneck lies in the prompt‑to‑diffusion pipeline rather than the language backbone.
  • Category dependence – Visually rich categories (Beauty, Home & Kitchen) suffer the most in image quality, while specification‑heavy categories (Electronics, Video Games) see better text reconstruction. This highlights the uneven difficulty across domains.
  • Policy optimization – The authors introduce Group Relative Policy Optimization (GRPO) to better align MLLM outputs with the downstream tasks. GRPO improves I→T text metrics by 3–5 % on average but yields no measurable gain for T→I, indicating that policy‑level tweaks help language generation but not the subsequent diffusion step.
  • Recommendation impact – When the generated modality replaces the original in recommender models, ranking performance drops by 2–4 % relative to the oracle, confirming that current generation quality is insufficient for a drop‑in replacement in production‑grade systems.
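The core of GRPO referenced above is its critic-free advantage estimate: each prompt is answered with a group of sampled completions, and each completion's reward is normalized against its own group. A minimal sketch of that normalization step (the reward function and policy update are outside its scope):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages for one prompt's G sampled completions.

    GRPO replaces a learned value baseline with the group statistics:
    advantage_i = (r_i - mean(group)) / (std(group) + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions scoring above their group mean get positive advantages and are reinforced; this is consistent with GRPO helping I→T (where the reward scores the MLLM's own text) but not T→I, where the diffusion model sits outside the optimized policy.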

The paper’s contributions are fourfold: (1) the creation of MMPCBench, the first benchmark that jointly evaluates content fidelity and recommendation utility for missing‑modality product completion; (2) a systematic, cross‑category comparison of six leading MLLMs, revealing non‑monotonic scaling behavior; (3) the exploration of GRPO as a task‑specific alignment technique; and (4) an empirical demonstration of the gap between semantic competence and practical applicability in e‑commerce settings.

In conclusion, while modern MLLMs are capable of generating plausible high‑level descriptions and images, they fall short on the fine‑grained alignment required for realistic product presentation and effective downstream recommendation. Future work should focus on domain‑specific fine‑tuning, improved prompt engineering for diffusion models, and integrated training that jointly optimizes language and visual generation components. Moreover, real‑world user studies and online A/B testing will be essential to validate whether the modest gains observed in offline metrics translate into measurable business value.

