CADET: Context-Conditioned Ads CTR Prediction With a Decoder-Only Transformer

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Click-through rate (CTR) prediction is fundamental to online advertising systems. While Deep Learning Recommendation Models (DLRMs) with explicit feature interactions have long dominated this domain, recent advances in generative recommenders have shown promising results in content recommendation. However, adapting these transformer-based architectures to ads CTR prediction presents unique challenges, including handling post-scoring contextual signals, maintaining offline-online consistency, and scaling to industrial workloads. We present CADET (Context-Conditioned Ads Decoder-Only Transformer), an end-to-end decoder-only transformer for ads CTR prediction deployed at LinkedIn. Our approach introduces several key innovations: (1) a context-conditioned decoding architecture with multi-tower prediction heads that explicitly model post-scoring signals such as ad position, resolving the chicken-and-egg problem between predicted CTR and ranking; (2) a self-gated attention mechanism that stabilizes training by adaptively regulating information flow at both representation and interaction levels; (3) a timestamp-based variant of Rotary Position Embedding (RoPE) that captures temporal relationships across timescales from seconds to months; (4) session masking strategies that prevent the model from learning dependencies on unavailable in-session events, addressing train-serve skew; and (5) production engineering techniques including tensor packing, sequence chunking, and custom Flash Attention kernels that enable efficient training and serving at scale. In online A/B testing, CADET achieves an 11.04% CTR lift over the production LiRank baseline, a hybrid ensemble of DCNv2 and sequential encoders. The system has been successfully deployed on LinkedIn's advertising platform, serving the main traffic for homefeed sponsored updates.


💡 Research Summary

The paper introduces CADET (Context‑Conditioned Ads Decoder‑Only Transformer), a production‑grade model for click‑through‑rate (CTR) prediction on LinkedIn’s home‑feed sponsored updates. Traditional deep learning recommendation models (DLRMs) dominate ad CTR tasks but struggle with post‑scoring contextual signals such as ad position, which are unavailable at the moment of scoring and create a chicken‑and‑egg dependency between predicted CTR and final ranking. Recent generative recommenders based on decoder‑only transformers have shown promise in content recommendation, yet adapting them to the ad domain raises three major challenges: (1) handling contextual features that appear only after scoring, (2) ensuring strict online‑offline consistency despite tracking delays and session‑level scoring, and (3) scaling training and inference to billions of daily impressions.

CADET addresses these challenges through five key innovations. First, a context-conditioned decoding block partitions the space of post-scoring contexts (e.g., ad position) into K discrete buckets and attaches a dedicated MLP head to each. During training, only the head corresponding to the observed context contributes to the loss; at inference time, all K heads produce predictions in parallel, and the serving system selects the head matching the actual rendering context. This eliminates iterative rescoring and resolves the ranking dependency loop in a single forward pass.
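The head-selection logic above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, the two-layer ReLU heads, and all weight initializations are hypothetical, chosen only to show how one head is used per training example while all K heads score in parallel at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: D-dim ad representation, K position buckets.
D, K, HIDDEN = 8, 4, 16

# One small MLP head per context bucket (weights illustrative, not the paper's).
heads = [
    (rng.standard_normal((D, HIDDEN)) * 0.1, rng.standard_normal((HIDDEN, 1)) * 0.1)
    for _ in range(K)
]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def head_forward(head, x):
    w1, w2 = head
    return sigmoid(np.maximum(x @ w1, 0.0) @ w2)  # ReLU MLP -> CTR in (0, 1)

def train_step_loss(x, label, observed_bucket):
    """Training: only the head matching the observed context enters the loss."""
    p = head_forward(heads[observed_bucket], x).squeeze()
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))  # binary cross-entropy

def score_all_contexts(x):
    """Inference: all K heads predict in parallel; serving picks the one
    matching the actual rendering context (e.g., final ad slot)."""
    return np.concatenate([head_forward(h, x) for h in heads], axis=-1)

x = rng.standard_normal((1, D))
preds = score_all_contexts(x)  # shape (1, K): one CTR estimate per bucket
```

Because every bucket's prediction is available after one forward pass, the ranker never needs to re-score an ad after its position is decided.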

Second, the model incorporates self‑gated attention. A lightweight gating module applies a sigmoid‑based gate to the raw token representations before the Q/K projections (representation‑level gating) and again to the projected Q and K vectors (interaction‑level gating). The gates suppress noisy dimensions and limit dot‑product magnitudes, preventing any single token from monopolizing attention. Empirically, this stabilizes training, accelerates convergence, and reduces loss divergence in deep, large‑batch settings.
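A minimal sketch of the two gating levels, assuming single-head attention and sigmoid gates parameterized by plain linear maps; the paper's exact gate parameterization and placement may differ, and all weights here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # model dimension (illustrative)

# Projection and gate weights (hypothetical parameterization).
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
Wg_rep = rng.standard_normal((D, D)) * 0.1  # representation-level gate
Wg_q = rng.standard_normal((D, D)) * 0.1    # interaction-level gate on Q
Wg_k = rng.standard_normal((D, D)) * 0.1    # interaction-level gate on K

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_gated_attention(X):
    # Representation-level gate: damp noisy dimensions before the Q/K projections.
    Xg = X * sigmoid(X @ Wg_rep)
    Q, K, V = Xg @ Wq, Xg @ Wk, X @ Wv
    # Interaction-level gates: the sigmoid bounds each gated coordinate, which
    # limits dot-product magnitudes so no token monopolizes the softmax.
    Q = Q * sigmoid(Q @ Wg_q)
    K = K * sigmoid(K @ Wg_k)
    scores = Q @ K.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

X = rng.standard_normal((5, D))
out = self_gated_attention(X)  # shape (5, D)
```

The key design point is that gating happens twice: once on the raw token representations and once on the projected Q/K vectors, so both the inputs to attention and the attention logits themselves are kept in a bounded range.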

Third, CADET replaces the standard positional encoding with a timestamp‑based Rotary Positional Embedding (RoPE). Instead of using sequence indices, the model rotates each 2‑D sub‑vector by an angle proportional to the Unix timestamp of the event. Hyper‑parameters (Δt_max, φ_min, base) allow the rotation frequency to span from seconds to months, enabling the transformer to capture fine‑grained short‑term dynamics as well as long‑term temporal trends that are crucial for ad performance.
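The rotation can be sketched as below, assuming the common "rotate-half" pairing of dimensions; the frequency schedule and the `base`/`dt_max` values are illustrative stand-ins for the paper's (Δt_max, φ_min, base) hyper-parameters. The sketch also checks the property that makes RoPE attractive: the dot product between two rotated vectors depends only on their time difference.

```python
import numpy as np

def timestamp_rope(x, t, base=10000.0, dt_max=3600.0):
    """Rotate each 2-D sub-vector of x by an angle proportional to the event
    timestamp t (seconds). Frequencies decay geometrically with `base`, so
    different sub-vectors resolve different timescales, from seconds to months.
    `base` and `dt_max` are illustrative, not the paper's tuned values."""
    d = x.shape[-1]
    half = d // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))  # one frequency per pair
    theta = (t / dt_max) * freqs                      # rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Relative-time property: the inner product of two rotated embeddings depends
# only on the time gap, not on absolute timestamps (here both gaps are 60 s).
x = np.ones(8)
a = timestamp_rope(x, t=0.0) @ timestamp_rope(x, t=60.0)
b = timestamp_rope(x, t=1000.0) @ timestamp_rope(x, t=1060.0)
```

Substituting timestamps for sequence indices is what lets attention scores reflect "how long ago" an event happened rather than "how many events ago."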

Fourth, the authors devise session‑aware masking strategies to guarantee online‑offline consistency. During training, a mask M(i, j) = –∞ is applied whenever token j occurs within a configurable delay Δdelay after token i, preventing the model from attending to events that would not be observable at serving time. For inference, a special mask isolates candidate ads appended at the end of the sequence, allowing hundreds of candidates to be scored simultaneously while preserving causal independence. This dramatically reduces latency compared with per‑candidate scoring.
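A minimal sketch of the delay-aware training mask: token i may attend to token j only if j's timestamp is more than `delay` seconds in i's past, since more recent events would not yet be tracked at serving time. The O(n²) loop and the concrete delay value are illustrative; a production kernel would vectorize this and combine it with the candidate-isolation mask for inference.

```python
import numpy as np

def session_delay_mask(timestamps, delay):
    """Build an additive attention mask: entry (i, j) is 0 if event j is
    observable when scoring event i (i.e., it happened more than `delay`
    seconds earlier), else -inf. Future events are masked as a side effect,
    preserving causality. `delay` mirrors the paper's Δ_delay."""
    t = np.asarray(timestamps, dtype=float)
    n = len(t)
    mask = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Mask events inside the tracking-delay window (and the future).
            if t[j] > t[i] - delay:
                mask[i, j] = -np.inf
            if i == j:
                mask[i, j] = 0.0  # a token always attends to itself
    return mask

# Events at t = 0, 10, 15, 100 seconds with a 30-second tracking delay:
m = session_delay_mask([0, 10, 15, 100], delay=30.0)
# The event at t=100 can see t=0, 10, 15 (all logged more than 30 s earlier),
# but the event at t=15 cannot see t=10 (only 5 s earlier, still untracked).
```

The resulting matrix is added to the attention logits before the softmax, so masked positions receive zero attention weight, exactly matching what the model would (not) see online.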

Fifth, extensive production engineering optimizations make CADET feasible at LinkedIn scale. Tensor packing and sequence chunking reduce memory footprint, while custom Flash‑Attention kernels accelerate the attention computation, especially for the multi‑item scoring scenario. The system processes billions of impressions daily with acceptable GPU utilization and sub‑millisecond inference latency.
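The tensor-packing idea can be sketched as below: variable-length user sequences are concatenated into fixed-size rows, with per-token segment ids so attention can later be restricted to stay within each original sequence. This is a simplified greedy sketch under assumed inputs; production packing would also balance rows across devices and interact with the chunking and Flash Attention kernels mentioned above.

```python
import numpy as np

def pack_sequences(seqs, max_len, pad_id=0):
    """Greedily pack variable-length sequences into fixed-size rows.
    Returns (tokens, segments); segment id 0 marks padding, and ids restart
    at 1 in each row so an attention mask can keep sequences independent."""
    rows, seg_rows = [], []
    row, segs, seg_id = [], [], 1
    for s in seqs:
        if len(row) + len(s) > max_len:  # current row is full: flush and pad it
            rows.append(row + [pad_id] * (max_len - len(row)))
            seg_rows.append(segs + [0] * (max_len - len(segs)))
            row, segs, seg_id = [], [], 1
        row += list(s)
        segs += [seg_id] * len(s)
        seg_id += 1
    rows.append(row + [pad_id] * (max_len - len(row)))
    seg_rows.append(segs + [0] * (max_len - len(segs)))
    return np.array(rows), np.array(seg_rows)

tokens, segments = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=6)
# Two packed rows: [1,2,3,4,5,0] with segments [1,1,1,2,2,0],
# and [6,7,8,9,0,0] with segments [1,1,1,1,0,0].
```

Packing removes most padding waste from batches of short sequences, which is where the memory and throughput savings come from.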

Experimental results show that CADET outperforms the production baseline LiRank (a hybrid of DCNv2 and sequential encoders) on offline metrics (AUC improvement of 0.006, 2.3% reduction in log-loss) and delivers an 11.04% lift in CTR in online A/B testing. Ablation studies confirm that each component (self-gated attention, context-conditioned heads, timestamp RoPE, and session masking) contributes meaningfully to the overall gain.

The paper also discusses limitations: the need to pre‑define context buckets may require re‑training when UI changes, and the timestamp RoPE hyper‑parameters demand careful tuning per domain. Future work could extend the framework to multi‑objective optimization (e.g., conversion, revenue) and explore dynamic context bucket generation.

In summary, CADET demonstrates that a decoder‑only transformer, when equipped with context conditioning, adaptive gating, temporal embeddings, and service‑aware masking, can replace traditional DLRM pipelines for ad CTR prediction, achieving both higher predictive performance and production‑ready efficiency at massive scale.

