UniRec: Unified Multimodal Encoding for LLM-Based Recommendations
Large language models have recently shown promise for multimodal recommendation, particularly with text and image inputs. Yet real-world recommendation signals extend far beyond these modalities. To reflect this, we formalize recommendation features into four modalities (text, images, categorical features, and numerical attributes) and highlight the unique challenges this heterogeneity poses for LLMs in understanding multimodal information. In particular, these challenges arise not only across modalities but also within them, as attributes such as price, rating, and time may all be numeric yet carry distinct semantic meanings. Beyond this intra-modality ambiguity, another major challenge is the nested structure of recommendation signals, where user histories are sequences of items, each associated with multiple attributes. To address these challenges, we propose UniRec, a unified multimodal encoder for LLM-based recommendation. UniRec first employs modality-specific encoders to produce consistent embeddings across heterogeneous signals. It then adopts a triplet representation, comprising attribute name, type, and value, to separate schema from raw inputs and preserve semantic distinctions. Finally, a hierarchical Q-Former models the nested structure of user interactions while maintaining their layered organization. Across multiple real-world benchmarks, UniRec outperforms state-of-the-art multimodal and LLM-based recommenders by up to 15%, and extensive ablation studies further validate the contributions of each component.
💡 Research Summary
The paper “UniRec: Unified Multimodal Encoding for LLM‑Based Recommendations” addresses a critical gap in current large‑language‑model (LLM) driven recommender systems: most existing approaches only handle text and, at best, image modalities, while real‑world recommendation data routinely includes categorical labels, numerical attributes, timestamps, and geographic coordinates. To bridge this gap, the authors formalize recommendation features into four modalities—text, images, categorical, and numeric—and propose a unified encoder, UniRec, that can ingest and align all of them for downstream LLM reasoning.
UniRec’s architecture consists of three main components. First, modality‑specific encoders map raw inputs into a common 1024‑dimensional space. Textual fields and categorical labels are embedded using Qwen‑3‑0.6B, images are processed by CLIP ViT‑L/14 followed by a projection layer, and numeric attributes are encoded with a Fourier‑based Math‑Aware Number Encoder that captures magnitude, sign, and periodicity (e.g., hour‑of‑day). Second, each attribute is represented as a triplet (name, type, value). Separate embeddings for the name, its modality type, and the raw value are summed, yielding a schema‑aware attribute vector that preserves semantic distinctions even when different attributes share the same data type (e.g., price vs. rating). Third, a hierarchical Q‑Former aggregates these attribute vectors in two stages. An Item‑Q‑Former consumes the variable‑length set of attribute embeddings for a single item and, via learnable query tokens, produces a fixed‑size item token zₜ. A User‑Q‑Former then takes the sequence of item tokens, concatenated with review‑text embeddings and timestamp embeddings, and again uses learnable queries to generate a unified user representation U. This two‑level design explicitly models the nested structure of recommendation data (users → sequences of items → sets of attributes).
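The paper does not spell out the exact formulation of the Math-Aware Number Encoder or the triplet summation, so the following is a minimal sketch under common assumptions: sinusoidal (Fourier) features over a geometric frequency ladder for scalar values, and a simple element-wise sum for the (name, type, value) triplet. The function names `fourier_number_encoding` and `encode_attribute` are illustrative, not from the paper.

```python
import numpy as np

def fourier_number_encoding(value: float, dim: int = 16, base: float = 10000.0) -> np.ndarray:
    """Fourier-feature sketch for a scalar attribute (price, rating, hour-of-day).

    A geometric ladder of frequencies lets low frequencies track magnitude and
    high frequencies capture fine-grained/periodic structure. This is an
    assumed stand-in for the paper's Math-Aware Number Encoder.
    """
    freqs = base ** (-np.arange(dim // 2) / (dim // 2))  # dim/2 frequencies
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])  # shape (dim,)

def encode_attribute(name_emb: np.ndarray, type_emb: np.ndarray,
                     value_emb: np.ndarray) -> np.ndarray:
    """Schema-aware triplet vector: sum of name, modality-type, and value
    embeddings, so 'price: 4.99' and 'rating: 4.99' stay distinguishable
    even though both values are numeric."""
    return name_emb + type_emb + value_emb
```

The resulting per-attribute vectors are what the Item-Q-Former would consume as its variable-length input set.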
Training proceeds in two phases. In the pre‑training phase, the modality encoders and both Q‑Formers are trained while the LLM remains frozen. The loss combines a reconstruction term (forcing the Q‑Former output to reconstruct original attribute embeddings) and an InfoNCE contrastive term (treating adjacent items in a user’s history as positive pairs). This encourages the encoder to learn semantically meaningful, modality‑aligned embeddings. In the fine‑tuning phase, the LLM is adapted using LoRA while the core modality encoders stay fixed. The user token U is projected into the LLM’s word‑embedding space as a soft prompt, and the system is trained with an InfoNCE loss to predict the next item. During inference, U is generated from a user’s interaction history, mean‑pooled from the LLM’s final hidden states, and then dot‑product similarity with pre‑computed item tokens yields a ranked candidate list.
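The two losses and the inference-time ranking step can be sketched as follows. This is a generic in-batch InfoNCE and dot-product scorer, assuming in-batch negatives and a cosine-similarity logit with temperature; the paper's exact negative-sampling scheme and temperature are not specified here, and the function names are illustrative.

```python
import numpy as np

def info_nce(user_vecs: np.ndarray, item_vecs: np.ndarray,
             temperature: float = 0.07) -> float:
    """In-batch InfoNCE: user i's positive is item i; the other items in the
    batch serve as negatives. Both inputs have shape (batch, dim)."""
    u = user_vecs / np.linalg.norm(user_vecs, axis=1, keepdims=True)
    v = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    logits = (u @ v.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))  # cross-entropy on the diagonal

def rank_candidates(user_vec: np.ndarray, item_matrix: np.ndarray) -> np.ndarray:
    """Inference: score pre-computed item tokens by dot product with the
    user representation U and return candidate indices, best first."""
    scores = item_matrix @ user_vec
    return np.argsort(-scores)
```

In this sketch the same `info_nce` shape covers both phases: adjacent history items as positive pairs during pre-training, and (user token, next item) pairs during LoRA fine-tuning.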
The authors evaluate UniRec on three heterogeneous benchmarks: Amazon Beauty, Amazon Baby, and Yelp. After 5‑core filtering, each training instance consists of 20 consecutive interactions with the 21st as the target. Metrics include MRR, Hit@10, and NDCG@10. UniRec consistently outperforms three families of baselines—sequential recommendation models (GRU4Rec, SASRec, etc.), multimodal recommendation models (VBPR, MMGCN, etc.), and recent LLM‑multimodal hybrids—achieving up to a 15% relative gain. Ablation studies demonstrate that removing the triplet schema representation or the hierarchical Q‑Former leads to substantial performance drops, confirming their central role.
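For concreteness, here is how the three reported metrics are typically computed in the single-target next-item setting used above (one relevant item per user, so NDCG@k reduces to 1/log2(rank+1) when the target ranks within the top k). The helper name `eval_ranking` is our own.

```python
import numpy as np

def eval_ranking(ranked_lists, targets, k: int = 10):
    """MRR, Hit@k, and NDCG@k for next-item prediction with a single target.

    ranked_lists: per-user candidate IDs, best first.
    targets: the held-out (21st) item ID for each user.
    """
    mrr = hit = ndcg = 0.0
    for ranked, target in zip(ranked_lists, targets):
        if target not in ranked:
            continue  # contributes 0 to all three metrics
        rank = ranked.index(target) + 1  # 1-indexed position of the target
        mrr += 1.0 / rank
        if rank <= k:
            hit += 1.0
            ndcg += 1.0 / np.log2(rank + 1)  # ideal DCG is 1 for one target
    n = len(targets)
    return mrr / n, hit / n, ndcg / n
```

For example, a target ranked 2nd contributes 0.5 to MRR, 1 to Hit@10, and 1/log2(3) ≈ 0.63 to NDCG@10.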
In summary, UniRec introduces a principled, schema‑aware, and hierarchy‑preserving framework that enables LLMs to reason over richly heterogeneous recommendation signals. By unifying modality‑specific encoders, explicit attribute triplets, and a two‑stage Q‑Former, the system delivers state‑of‑the‑art results across diverse real‑world datasets, setting a new benchmark for multimodal recommendation with large language models.