High Fidelity Textual User Representation over Heterogeneous Sources via Reinforcement Learning

Notice: This research summary and analysis were automatically generated using AI. For complete accuracy, please refer to the original arXiv source.

Effective personalization on large-scale job platforms requires modeling members based on heterogeneous textual sources, including profiles, professional data, and search activity logs. As recommender systems increasingly adopt Large Language Models (LLMs), creating unified, interpretable, and concise representations from heterogeneous sources becomes critical, especially for latency-sensitive online environments. In this work, we propose a novel Reinforcement Learning (RL) framework to synthesize a unified textual representation for each member. Our approach leverages implicit user engagement signals (e.g., clicks, applies) as the primary reward to distill salient information. Additionally, the framework is complemented by rule-based rewards that enforce formatting and length constraints. Extensive offline experiments across multiple products at LinkedIn, one of the world’s largest job platforms, demonstrate significant improvements in key downstream business metrics. This work provides a practical, labeling-free, and scalable solution for constructing interpretable user representations that are directly compatible with LLM-based systems.


💡 Research Summary

The paper addresses the challenge of constructing a unified, high‑fidelity user representation for large‑scale job platforms such as LinkedIn, where each member’s profile, professional history, and search activity are stored in heterogeneous textual sources. Traditional approaches fall into two camps: dense learned embeddings, which are opaque and require costly re‑embedding and downstream model retraining whenever the encoder changes; and hand‑engineered sparse features, which are interpretable but brittle and maintenance‑heavy. With the rise of Large Language Model (LLM)‑centric retrieval and ranking pipelines, a textual representation that lives directly in the LLM token space becomes highly desirable because it is human‑readable, editable, and incurs no additional projection layers or alignment steps.

The authors propose a reinforcement‑learning (RL) framework that treats the generation of a compact textual synopsis as a sequence‑generation problem. Given a context q that aggregates all available textual signals for a member, a policy πθ (the “actor”) samples a synopsis o from a pre‑trained 1.7B‑parameter LLM fine‑tuned via policy optimization. The synopsis must satisfy two desiderata: (i) salience – it must retain all information predictive of future job‑related actions, and (ii) conciseness – it must respect a strict token budget (≈150 tokens) to meet latency constraints.
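The setup above can be summarized as a constrained objective (this compact formulation is an interpretation of the summary, not the paper’s exact notation):

```latex
\max_{\theta}\;
\mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\!\left[ R(q, o) \right]
\quad \text{s.t.} \quad |o| \le 150 \text{ tokens},
```

where R(q, o) combines the engagement‑derived reward with the rule‑based formatting and length rewards described below, and the token budget is enforced in practice through the rule‑based penalty rather than a hard constraint.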

Supervision is entirely label‑free. The primary reward is derived from implicit engagement signals (clicks, applications, skips) using both pointwise and listwise prediction tasks. For example, the model is asked to predict whether the user will apply to a particular job given only the generated synopsis; the binary cross‑entropy loss against the logged outcome is transformed into a scalar reward. A secondary, rule‑based reward penalizes violations of formatting requirements and token‑length overshoot, ensuring the output remains well‑structured and within the budget.
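A minimal sketch of this composite reward is shown below. The exponential mapping from cross‑entropy loss to reward, the linear overshoot penalty, and the weighting constant are illustrative assumptions; the paper only states that the BCE loss is transformed into a scalar reward and that rule violations are penalized.

```python
import math

MAX_TOKENS = 150  # token budget from the paper (~150 tokens)

def engagement_reward(p_action: float, label: int, eps: float = 1e-8) -> float:
    """Turn the binary cross-entropy of a pointwise prediction
    (e.g. 'will the user apply to this job?') into a scalar reward.
    exp(-loss) maps lower loss to higher reward in (0, 1] (assumed mapping)."""
    p = min(max(p_action, eps), 1.0 - eps)
    bce = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return math.exp(-bce)

def rule_reward(synopsis_tokens: int, well_formatted: bool) -> float:
    """Rule-based penalty for format violations and token-budget overshoot."""
    r = 0.0
    if not well_formatted:
        r -= 1.0  # flat penalty for malformed output (assumed magnitude)
    if synopsis_tokens > MAX_TOKENS:
        r -= 0.01 * (synopsis_tokens - MAX_TOKENS)  # linear overshoot penalty (assumed)
    return r

def total_reward(p_action: float, label: int, synopsis_tokens: int,
                 well_formatted: bool, w_rule: float = 0.5) -> float:
    """Combine engagement and rule-based rewards; w_rule is a hypothetical weight."""
    return engagement_reward(p_action, label) + w_rule * rule_reward(
        synopsis_tokens, well_formatted)
```

Note that for a positive label the engagement reward reduces to exp(log p) = p, so a confident correct prediction from the synopsis yields a reward near 1.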

Policy optimization leverages Group‑Relative Policy Optimization (GRPO), which eliminates the need for a learned critic by using the group‑average reward as a baseline. The objective is a PPO‑style clipped surrogate with an additional KL‑regularization term to prevent policy drift. To further stabilize training on large vocabularies, the authors incorporate Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) for asymmetric clipping, and Dr‑GRPO to correct systematic length bias. This combination yields stable, sample‑efficient updates suitable for production‑scale training.
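The core of the update can be sketched as follows. This is a simplified, sequence‑level illustration: the clip bounds, KL coefficient, and use of standard‑deviation normalization are assumptions, and Dr‑GRPO’s length‑bias correction (which drops per‑length normalization terms) is omitted for brevity.

```python
import math

def grpo_loss(logp_new, logp_old, rewards, logp_ref=None,
              clip_low=0.2, clip_high=0.28, kl_coef=0.01):
    """GRPO-style clipped surrogate over a group of G sampled synopses.
    Arguments are length-G lists of per-sequence log-probs / scalar rewards.
    The group-mean reward is the baseline (no learned critic); asymmetric
    clip bounds follow DAPO's 'clip-higher' idea (values here are assumed)."""
    G = len(rewards)
    mean_r = sum(rewards) / G
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / G) + 1e-8
    loss = 0.0
    for ln, lo, r in zip(logp_new, logp_old, rewards):
        adv = (r - mean_r) / std_r          # group-relative advantage
        ratio = math.exp(ln - lo)           # importance ratio
        clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
        loss += -min(ratio * adv, clipped * adv)  # PPO-style clipped surrogate
    loss /= G
    if logp_ref is not None:  # KL-style penalty keeps the policy near the reference
        loss += kl_coef * sum(n - f for n, f in zip(logp_new, logp_ref)) / G
    return loss
```

When the new and old policies coincide (all ratios equal 1), the group‑relative advantages cancel and the surrogate loss is zero; raising the probability of an above‑average synopsis decreases the loss.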

The reward model (the “oracle”) is implemented as a larger 30 B parameter LLM accessed via prompting. It receives a (state, synopsis) pair and returns a quality score reflecting how well the synopsis predicts the user’s future actions. By using a promptable LLM rather than a supervised critic, the system avoids the need for per‑task labeled data while still providing semantically rich feedback. The authors validate the oracle against GPT‑o1 with chain‑of‑thought reasoning and find high correlation, confirming its suitability.
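A prompting‑based oracle of this kind might be wired up as below. The prompt template, the 0–10 scoring scale, and the parsing logic are hypothetical illustrations; the paper does not publish its exact prompt.

```python
def build_oracle_prompt(context: str, synopsis: str, next_action: str) -> str:
    """Assemble a prompt for the promptable reward LLM ('oracle').
    The wording and 0-10 scale below are illustrative assumptions."""
    return (
        "You are evaluating a candidate synopsis of a job-platform member.\n"
        f"Member context:\n{context}\n\n"
        f"Synopsis:\n{synopsis}\n\n"
        f"Logged next action: {next_action}\n"
        "On a scale of 0 to 10, how well does the synopsis alone allow "
        "predicting this action? Respond with a single integer."
    )

def parse_oracle_score(raw: str) -> float:
    """Parse the oracle's reply into a normalized reward in [0, 1].
    Assumes the reply contains one integer, per the prompt's instruction."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    score = min(int(digits or 0), 10)  # clamp to the assumed 0-10 scale
    return score / 10.0
```

Because the oracle is only prompted (never fine‑tuned), swapping in a different scoring rubric or task description requires no retraining, which is what makes the approach label‑free.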

Extensive offline experiments across multiple LinkedIn products (Job Search, Recommendations, Alerts) demonstrate that the RL‑generated textual representations improve key downstream metrics: click‑through rate (CTR), application conversion, and session duration increase by 3–7 % relative to strong baselines (dense embeddings and hand‑engineered features). The generated synopses average 140–160 tokens, keeping inference latency within 5–10 ms, well within production constraints. Qualitative evaluation using an LLM‑as‑judge framework shows high information fidelity and readability, with human judges agreeing with the model’s output 85 % of the time.

In summary, the work presents a practical, scalable, and labeling‑free method to synthesize concise, interpretable textual user representations from heterogeneous sources. By aligning the generation process with real engagement signals through reinforcement learning, the approach produces representations that are directly consumable by LLM‑based recommendation and retrieval systems, reducing operational overhead and enabling rapid iteration. Future directions include extending the framework to multimodal signals, incorporating online RL feedback loops, and developing safety‑aware monitoring for policy drift.

