One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a "one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present *One Size, Many Fits* (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group's CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, comprising around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves state-of-the-art performance in both offline and online settings. Our code and datasets will be released at https://github.com/JD-GenX/OSMF.


💡 Research Summary

The paper tackles a fundamental limitation of current advertising image generation systems: they treat Click‑Through Rate (CTR) as a single, aggregate metric and ignore the fact that user preferences vary dramatically across demographic and behavioral groups. To address this, the authors propose One Size, Many Fits (OSMF), a unified framework that aligns diverse group‑wise click preferences at industrial scale.

1. Product‑Aware Adaptive Grouping (PAAG).
PAAG first encodes three modalities: (i) user attributes (gender, age, location, etc.) via embedding layers and an MLP, (ii) product titles via a text encoder, and (iii) advertising images via a vision encoder. Cross‑attention layers condition the user embedding on the product text and then on the product image, yielding a product‑aware user representation e_u|c that captures how a specific user would react to a particular product. For each product, all such representations are clustered with K‑Means; the optimal number of clusters K* is selected by maximizing the silhouette coefficient. Rather than representing each cluster solely by its centroid, PAAG samples additional points at various percentiles around the centroid, forming a richer group embedding G_s,k that preserves intra‑cluster diversity.

2. Preference‑Conditioned Image Generation (PCIG).
The group embedding e_G_s,k is prepended as a special token to the input of a Multimodal Large Language Model (MLLM), producing an augmented sequence in which the group token precedes the original multimodal tokens, so that generation is conditioned on the group's collective preferences.
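The prepending step can be sketched as follows. This is a simplified illustration under assumptions: `prepend_group_token` and `linear` are hypothetical names, and the projection from the group-feature space to the MLLM hidden size is an assumed detail (some projection is typically needed when the two dimensionalities differ).

```python
def linear(x, W, b):
    """y = W x + b, with W as a list of rows (hidden_dim x feat_dim)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def prepend_group_token(group_feats, token_embeds, W, b):
    """Project group features to the MLLM hidden size and prepend the
    result as one extra embedding, ahead of the ordinary token embeddings."""
    g = linear(group_feats, W, b)
    assert len(g) == len(token_embeds[0]), "projected token must match hidden size"
    return [g] + token_embeds
```

In practice the same idea is usually realized by feeding precomputed embeddings (rather than token ids) into the language model, with the group token occupying the first sequence position.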

