Disentangled Interest Network for Out-of-Distribution CTR Prediction
Click-through rate (CTR) prediction, which estimates the probability of a user clicking on a given item, is a critical task for online information services. Existing approaches often make the strong assumption that training and test data come from the same distribution. However, the data distribution varies since user interests are constantly evolving, resulting in the out-of-distribution (OOD) issue. In addition, users tend to have multiple interests, some of which evolve faster than others. Towards this end, we propose Disentangled Click-Through Rate prediction (DiseCTR), which introduces a causal perspective of recommendation and disentangles multiple aspects of user interests to alleviate the OOD issue. We conduct a causal factorization of CTR prediction involving user interest, exposure model, and click model, based on which we develop a deep learning implementation for these three causal mechanisms. Specifically, we first design an interest encoder with sparse attention that maps raw features to user interests, and then introduce a weakly supervised interest disentangler to learn independent interest embeddings, which are further integrated by an attentive interest aggregator for prediction. Experimental results on three real-world datasets show that DiseCTR achieves the best accuracy and robustness in OOD recommendation against state-of-the-art approaches, significantly improving AUC and GAUC by over 0.02 and reducing logloss by over 13.7%. Further analyses demonstrate that DiseCTR successfully disentangles user interests, which is the key to OOD generalization for CTR prediction. We have released the code and data at https://github.com/DavyMorgan/DiseCTR/.
💡 Research Summary
The paper tackles the pervasive out‑of‑distribution (OOD) problem in click‑through rate (CTR) prediction, where the assumption that training and test data share the same distribution is routinely violated in real‑world recommender systems. Users’ interests evolve over time, and often only a subset of these latent interests changes at any given moment—a phenomenon the authors term “partial‑distribution‑variation.” Existing CTR models (e.g., FM, DeepFM, AutoInt) directly learn the conditional probability P(Y|X) and therefore suffer dramatic performance drops when the underlying interest distribution shifts.
To address this, the authors adopt a causal perspective and factorize the joint distribution of features X, latent interests Z, and click label Y into three mechanisms: (1) an interest model P(Z), (2) an exposure model P(X|Z), and (3) a click model P(Y|X,Z). By learning P(Z|X) and P(Y|Z) separately, the model can isolate the effect of interest shifts, because only the affected interests Z_i change while the majority remain stable. This leads to a more robust predictor under OOD conditions.
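The factorization described above is just the chain rule applied in the order implied by the causal graph. A minimal NumPy sketch on a toy discrete world (binary X, Y, Z; not the paper's model) makes the decomposition concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete world: feature X, click Y, latent interest Z, all binary.
# Any joint P(X, Y, Z) factorizes as P(Z) * P(X|Z) * P(Y|X, Z),
# which is the decomposition the causal view of CTR relies on.
p_joint = rng.random((2, 2, 2))          # indexed [x, y, z]
p_joint /= p_joint.sum()                 # normalize to a valid distribution

p_z = p_joint.sum(axis=(0, 1))           # P(Z): interest model
p_xz = p_joint.sum(axis=1)               # P(X, Z), shape [x, z]
p_x_given_z = p_xz / p_z                 # P(X|Z): exposure model
p_y_given_xz = p_joint / p_xz[:, None, :]  # P(Y|X, Z): click model

# Recompose the joint from the three mechanisms and verify it matches.
reconstructed = p_z[None, None, :] * p_x_given_z[:, None, :] * p_y_given_xz
assert np.allclose(reconstructed, p_joint)
```

The practical point is that the three factors can be modeled and updated separately: if only some components of P(Z) drift under an interest shift, the exposure and click mechanisms need not be relearned.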
The proposed architecture, called DiseCTR, consists of three components:
- Interest Encoder – Raw high‑dimensional features are first embedded and then processed by a sparse‑attention mechanism. Sparse attention forces each interest embedding Z_i to attend to only a small subset of input features, ensuring that each latent interest is associated with a distinct, limited feature set and reducing redundancy among interests.
- Interest Disentangler – Since interests are unobserved, the disentangler employs weak supervision to encourage independence among the learned interest embeddings. It combines a KL‑divergence based mutual‑information penalty with a clustering‑based self‑supervised loss, preventing different Z_i from relying on the same features and promoting semantic separation.
- Interest Aggregator – An attentive aggregation layer combines the set of disentangled interest embeddings to compute P(Y|Z). The attention weights are dynamically conditioned on the current context X, allowing the model to give higher importance to those interests that are currently relevant or have shifted.
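The three components above can be sketched end to end in NumPy. This is an illustrative toy forward pass, not the authors' implementation: sparsemax stands in for the paper's sparse attention, a pairwise cosine-similarity penalty stands in for the MI and clustering losses, and the scoring head is a placeholder; all shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsemax(z):
    """Sparse alternative to softmax: Euclidean projection onto the
    simplex, which zeroes out low-scoring entries entirely."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted) - 1.0
    ks = np.arange(1, z.size + 1)
    k = ks[ks * z_sorted > cssv][-1]     # size of the support set
    tau = cssv[k - 1] / k
    return np.maximum(z - tau, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

F, M, D = 8, 3, 4                        # feature fields, interests, embed dim
feats = rng.normal(size=(F, D))          # embedded raw features
queries = rng.normal(size=(M, D))        # one learnable query per interest

# 1) Interest encoder: each interest attends sparsely to the feature
#    fields, so every Z_i depends on only a small subset of features.
attn = np.stack([sparsemax(q @ feats.T) for q in queries])   # (M, F)
Z = attn @ feats                                             # (M, D)

# 2) Disentanglement signal: penalize pairwise cosine similarity
#    between interest embeddings (simplified stand-in for the
#    KL/mutual-information and clustering losses).
Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
indep_penalty = np.sum(np.triu(Zn @ Zn.T, k=1) ** 2)

# 3) Attentive aggregator: context-conditioned weights over interests,
#    letting currently relevant interests dominate the prediction.
context = feats.mean(axis=0)             # placeholder context vector
w = softmax(Z @ context)                 # (M,) attention over interests
user_repr = w @ Z                        # aggregated user representation
click_logit = user_repr @ context        # toy scoring head for P(Y|Z)
```

Because sparsemax returns exact zeros, each row of `attn` reveals which feature fields a given interest uses, which is what makes the per-interest attribution (and per-interest updating) possible.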
The authors evaluate DiseCTR on three large‑scale, real‑world datasets covering video recommendation, e‑commerce, and social media. For each dataset they construct IID test splits and OOD test splits where user interests have been deliberately altered (e.g., by simulating coupon campaigns that affect economic interest). Baselines include state‑of‑the‑art CTR models and recent OOD‑focused recommendation methods. DiseCTR consistently outperforms all baselines, achieving average improvements of 0.022 in AUC/GAUC and a 13.7% reduction in log‑loss. The gains are especially pronounced for users with a large interest‑distribution shift ΔP (>0.1), where traditional models suffer up to 0.1 AUC loss while DiseCTR’s degradation remains under 0.02. Ablation studies confirm that both the sparse‑attention encoder and the weakly supervised disentangler are essential; removing either component leads to a steep performance decline.
Further analysis shows that the learned interest embeddings align with interpretable business factors such as price sensitivity, brand affinity, social influence, and aesthetic preference. When only a subset of interests changes, updating the corresponding embeddings suffices to adapt the model, dramatically reducing transfer cost compared with retraining the whole network.
The paper’s contributions are threefold: (1) formalizing the OOD challenge in CTR prediction and introducing the partial‑distribution‑variation property; (2) proposing a causal‑inspired, disentangled interest network that leverages sparse attention and weak supervision; (3) delivering extensive empirical evidence of superior accuracy, robustness, and transfer efficiency.
Limitations include the need to pre‑specify the number of interests M, increased computational overhead due to sparse attention, and reliance on weak supervisory signals that may be insufficient in extremely low‑label regimes. Future work aims to automatically discover M, integrate graph‑based feature‑interest relationships, and explore multi‑domain transfer learning to further enhance generalization.