Learning Continuous User Representations through Hybrid Filtering with doc2vec

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Players in the online ad ecosystem are struggling to acquire the user data required for precise targeting. Audience look-alike modeling has the potential to alleviate this issue, but models’ performance strongly depends on the quantity and quality of the available data. To maximize the predictive performance of our look-alike modeling algorithms, we propose two novel hybrid filtering techniques that utilize the recent neural probabilistic language model algorithm doc2vec. We apply these methods to data from a large mobile ad exchange and additional app metadata acquired from the Apple App Store and Google Play Store. First, we model mobile app users through their app usage histories and app descriptions (user2vec). Second, we introduce context awareness to that model by incorporating additional user- and app-related metadata in model training (context2vec). Our findings are threefold: (1) the quality of recommendations provided by user2vec is notably higher than that of current state-of-the-art techniques. (2) User representations generated through hybrid filtering with doc2vec prove to be highly valuable features in supervised machine learning models for look-alike modeling. This represents the first application of hybrid filtering user models using neural probabilistic language models, specifically doc2vec, in look-alike modeling. (3) Incorporating context metadata in the doc2vec training process to introduce context awareness improves performance and is superior to directly including the data as features in the downstream supervised models.


💡 Research Summary

The paper addresses a fundamental challenge in the mobile advertising ecosystem: the scarcity of high‑quality user data needed for precise targeting. To improve the performance of look‑alike models that predict demographic attributes (e.g., gender, age) of anonymous users, the authors propose two novel hybrid‑filtering approaches that leverage the neural probabilistic language model doc2vec.

Methodology

  1. user2vec – Each user is treated as a “document” composed of the textual descriptions of the mobile apps they have previously installed or used. By feeding these app‑description sequences into doc2vec (either Distributed Memory or Distributed Bag‑of‑Words), the method learns a dense, fixed‑length vector representation for every user. This captures both collaborative signals (which apps are co‑used) and semantic information from the app texts, overcoming the sparsity of traditional collaborative‑filtering matrices.

  2. context2vec – Extends user2vec by injecting additional contextual metadata (user age group, gender, app category, rating, download count, last‑update date, etc.) as extra tokens in the same doc2vec training process. Consequently, the metadata becomes part of the learned embedding space rather than a separate feature set, allowing the model to internalize context in a unified manner.

The authors train these models on a massive dataset collected from a large mobile ad exchange (hundreds of millions of ad impressions) and augment it with app metadata scraped from the Apple App Store and Google Play Store. Pre‑processing includes tokenization, stop‑word removal, stemming of app descriptions, and conversion of metadata into meaningful textual tokens (e.g., “category:game” → “game”).
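The pre-processing steps can be illustrated with a short sketch. The stop-word list and the `metadata_to_token` mapping below are simplified assumptions (the paper also applies stemming, which is omitted here):

```python
import re

# Illustrative stop-word list; a real pipeline would use a full one.
STOPWORDS = {"the", "a", "an", "and", "for", "with", "your"}

def tokenize_description(text: str) -> list:
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def metadata_to_token(field: str) -> str:
    """Map a 'key:value' metadata field to its plain semantic token,
    e.g. 'category:game' -> 'game'."""
    _, _, value = field.partition(":")
    return value or field

tokens = tokenize_description("Track your steps and workouts with the app")
meta = metadata_to_token("category:game")
```

Mapping categorical metadata to its literal value (rather than an opaque ID) keeps the tokens semantically meaningful inside the shared doc2vec vocabulary.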

Evaluation
Two evaluation tracks are conducted:

Recommendation‑system perspective – The quality of the learned user vectors is assessed using Top‑K recommendation metrics (Precision@K, MAP, NDCG). user2vec outperforms classic collaborative filtering (matrix factorization) and content‑based filtering (TF‑IDF) by 12–18% in MAP, demonstrating that app‑description semantics add substantial signal.
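The Top-K metrics used in this evaluation track are standard; a minimal pure-Python version of Precision@K and average precision (the per-user quantity averaged into MAP) looks like this, with hypothetical app IDs as the recommended items:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def average_precision(recommended, relevant):
    """Mean of precision@i over each rank i where a relevant item appears;
    MAP is this value averaged over all users."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

recs = ["app_a", "app_b", "app_c", "app_d"]  # ranked recommendations
rel = {"app_a", "app_c"}                     # ground-truth relevant apps
p2 = precision_at_k(recs, rel, 2)   # 1 of the top 2 is relevant -> 0.5
ap = average_precision(recs, rel)   # (1/1 + 2/3) / 2
```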

Look‑alike modeling perspective – The vectors are used as additional features in supervised models (XGBoost, LightGBM, Logistic Regression) tasked with predicting gender and age group. Compared with baseline feature sets (raw usage counts, demographic proxies), adding user2vec improves AUC by ~0.02–0.03 and reduces LogLoss by 5–7%. context2vec yields further gains, confirming that embedding metadata directly during training is more effective than appending it later as separate columns.
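The feature-augmentation step can be sketched as follows. This uses synthetic data and scikit-learn's logistic regression as a stand-in for the paper's supervised models; the embedding dimension, baseline features, and label construction are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 users with 16-dim "user2vec" embeddings plus two
# baseline features (e.g., raw usage counts); labels are a toy gender flag.
n, dim = 200, 16
embeddings = rng.normal(size=(n, dim))
baseline = rng.poisson(3.0, size=(n, 2)).astype(float)
# Tie the label to the embedding so it carries real signal in this toy setup.
labels = (embeddings[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

# Look-alike modeling: the embeddings are appended to the baseline features.
features = np.hstack([baseline, embeddings])
clf = LogisticRegression(max_iter=1000).fit(features, labels)
train_acc = clf.score(features, labels)
```

In the paper's setting the comparison is between the baseline feature set alone and the baseline plus embeddings, evaluated with AUC and LogLoss on held-out data rather than training accuracy.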

Technical Insights

  • Hyper‑parameter tuning (window size 5–10, vector dimension 100–300, learning rate 0.025) is crucial; the authors employ cross‑validation to balance context length against semantic richness.
  • Metadata tokenization is deliberately lightweight to avoid noisy high‑dimensional vocabularies; categorical values are mapped to their literal meanings, preserving semantic proximity.
  • Early stopping and dropout are used to prevent over‑fitting given the massive but noisy training corpus.
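A cross-validated sweep over the ranges mentioned above can be enumerated with a simple grid; the search space here mirrors the reported ranges but the exact grid and scoring procedure are assumptions:

```python
from itertools import product

# Hypothetical search space based on the ranges reported above.
grid = {
    "window": [5, 8, 10],
    "vector_size": [100, 200, 300],
    "alpha": [0.025],
}

# Every combination becomes one candidate doc2vec configuration; each would be
# trained and scored by cross-validation on a downstream task.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```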

Limitations & Future Work
The doc2vec architecture assumes a linear sequential context, which may not fully capture multitasking behavior where users interact with several apps simultaneously. Moreover, app‑store metadata can become stale; the model’s performance depends on timely updates. The authors suggest exploring Transformer‑based encoders (BERT, RoBERTa) for richer contextual modeling and incorporating temporal dynamics (e.g., recurrent or time‑aware embeddings) to handle evolving user interests.

Conclusion
By treating a user’s app usage history as a textual document and enriching it with contextual metadata, the proposed hybrid‑filtering pipelines generate high‑quality continuous user representations. These embeddings significantly improve both recommendation accuracy and downstream look‑alike prediction, offering a practical solution for ad‑tech platforms struggling with limited user data. The work constitutes the first documented application of doc2vec‑based hybrid filtering in look‑alike modeling and demonstrates that integrating metadata during embedding training outperforms the conventional practice of adding such data as separate features.

