RELATE: A Reinforcement Learning-Enhanced LLM Framework for Advertising Text Generation
In online advertising, advertising text plays a critical role in attracting user engagement and driving advertiser value. Existing industrial systems typically follow a two-stage paradigm, where candidate texts are first generated and subsequently aligned with online performance metrics such as click-through rate (CTR). This separation often leads to misaligned optimization objectives and low funnel efficiency, limiting global optimality. To address these limitations, we propose RELATE, a reinforcement learning-based end-to-end framework that unifies generation and objective alignment within a single model. Instead of decoupling text generation from downstream metric alignment, RELATE integrates performance and compliance objectives directly into the generation process via policy learning. To better capture ultimate advertiser value beyond click-level signals, we incorporate conversion-oriented metrics into the objective and jointly model them with compliance constraints as multi-dimensional rewards, enabling the model to generate high-quality ad texts that improve conversion performance under policy constraints. Extensive experiments on large-scale industrial datasets demonstrate that RELATE consistently outperforms baselines. Furthermore, online deployment on a production advertising platform yields statistically significant improvements in click-through conversion rate (CTCVR) under strict policy constraints, validating the robustness and real-world effectiveness of the proposed framework.
💡 Research Summary
The paper addresses a fundamental mismatch in modern advertising systems between the generation of ad copy and the business metrics that ultimately matter, such as click‑through conversion rate (CTCVR). Traditional pipelines operate in two stages: a large language model (or template system) first produces a set of candidate ad texts, and a separate ranking or filtering component later aligns these candidates with online performance signals (CTR, conversion). This separation creates objective inconsistency, low funnel efficiency, and makes it difficult to enforce policy constraints while still improving conversion.
RELATE (Reinforcement Learning‑Enhanced LLM Framework for Advertising Text Generation) proposes an end‑to‑end solution that treats ad copy generation as a reinforcement‑learning (RL) problem. The core idea is to view the LLM as a stochastic policy πθ that, given an input x (user query, bid keywords, landing‑page information), generates a text y. The policy is trained to maximize a scalar reward r(x, y) that aggregates three dimensions:
- Conversion reward – an estimate of CTCVR derived from click‑conversion logs, optionally smoothed with Bayesian techniques.
- Quality reward – a composite of compliance checks (keyword presence, prohibited‑word detection, grammatical correctness) and semantic relevance, evaluated by a pretrained classifier.
- Diversity reward – measures to mitigate “ad fatigue”, including n‑gram duplication penalties and token‑level entropy incentives.
These components are combined via a weighted function f(·) with tunable coefficients (α, β, γ), allowing the system to balance business goals against creative constraints.
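The weighted aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the linear combination form, and the default coefficient values are assumptions.

```python
def aggregate_reward(r_conversion: float, r_quality: float, r_diversity: float,
                     alpha: float = 1.0, beta: float = 0.5, gamma: float = 0.2) -> float:
    """Combine the three reward dimensions into a single scalar r(x, y).

    alpha, beta, gamma correspond to the tunable coefficients in the paper's
    f(.); the specific values here are illustrative defaults, and f(.) itself
    need not be linear. Each input reward is assumed to be pre-normalized.
    """
    return alpha * r_conversion + beta * r_quality + gamma * r_diversity
```

In practice the three inputs would come from the CTCVR estimator, the compliance/relevance classifier, and the duplication/entropy measures respectively, and the coefficients would be tuned (or dynamically adjusted) against offline metrics.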
A major technical contribution is the group credit‑assignment mechanism that addresses the classic RL problem of sparse, delayed rewards. Instead of assigning the same reward to every token in a generated sequence (as in vanilla REINFORCE), RELATE samples K candidate texts for the same input, computes each candidate’s total reward, and then derives token‑level advantages by comparing each candidate’s reward to the batch mean baseline. This yields per‑token advantage estimates Aₜ that are back‑propagated through the policy network, enabling fine‑grained credit assignment and more stable learning.
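The group credit‑assignment step can be sketched as below: sample K candidates, score each, and convert rewards into advantages relative to the group mean. The normalization by the group standard deviation is a common variance‑reduction choice and an assumption here; the paper's exact formula may differ.

```python
import numpy as np

def group_advantages(candidate_rewards: list[float]) -> np.ndarray:
    """Compute one advantage per candidate, relative to the batch-mean baseline.

    candidate_rewards holds the K total rewards for candidates generated from
    the same input x. Each candidate's advantage is then broadcast to every
    token of that candidate's sequence (A_t is constant within a sequence).
    """
    r = np.asarray(candidate_rewards, dtype=float)
    baseline = r.mean()                      # batch-mean baseline over the group
    scale = r.std() + 1e-8                   # assumed std-normalization for stability
    return (r - baseline) / scale

def token_advantages(candidate_adv: float, seq_len: int) -> np.ndarray:
    """Broadcast a candidate-level advantage to its tokens."""
    return np.full(seq_len, candidate_adv)
```

For example, the best candidate in a group receives a positive advantage on all of its tokens, and below‑average candidates receive negative ones, which is what enables per‑token credit assignment without a learned per‑token reward.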
Training uses a PPO‑style clipped objective to keep policy updates within a safe region, while normalizing rewards and employing a value‑function baseline to reduce variance. The authors also discuss how to dynamically adjust the reward weights to explore the trade‑off between diversity (reducing fatigue) and conversion (maximizing revenue).
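A minimal sketch of the PPO‑style clipped objective mentioned above, written over per‑token log‑probabilities and advantages. The clipping threshold value and the function signature are illustrative assumptions; the value‑function baseline and reward normalization discussed in the text are omitted for brevity.

```python
import numpy as np

def ppo_clipped_loss(logp_new: np.ndarray, logp_old: np.ndarray,
                     advantages: np.ndarray, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate loss (to be minimized).

    logp_new / logp_old: per-token log-probs under the current and behavior
    policies; advantages: the per-token estimates A_t from group credit
    assignment. Clipping the ratio to [1 - eps, 1 + eps] keeps each policy
    update within a safe region around the old policy.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -float(np.mean(np.minimum(unclipped, clipped)))
```

When the new and old policies agree (ratio = 1), the loss reduces to the negative mean advantage; when the ratio drifts outside the clip band, the gradient through that token is cut off.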
Experiments are conducted on Baidu’s production advertising platform using over 100 million logged impressions. Offline metrics show that RELATE reduces policy‑violation rates to below 0.2 % while improving CTCVR by 8.5 %–9.2 % relative to strong baselines (a conventional two‑stage pipeline, a CTR‑focused RL model, and a DPO‑based diversity model). An online A/B test lasting two weeks confirms these gains: CTCVR rises by 9.19 %, overall revenue increases by 6.7 %, and policy violations drop by 0.15 %.
The paper also discusses limitations. Excessive emphasis on diversity can slightly hurt conversion, indicating the need for careful weight tuning. Scaling to very large LLMs requires balancing the number of sampled candidates against GPU memory constraints. Moreover, while the reward design is specific to advertising, the overall framework—multi‑dimensional reward aggregation plus token‑level credit assignment—should be transferable to other generation tasks such as news headlines or product descriptions.
In conclusion, RELATE demonstrates that integrating reinforcement learning directly into large language model generation can close the gap between creative output and real‑world business objectives. By jointly optimizing conversion, quality, and diversity within a single policy, the system achieves higher funnel efficiency, respects compliance constraints, and delivers measurable revenue uplift in a large‑scale production environment. Future work may explore real‑time reward feedback loops, multimodal inputs, and automated weight‑optimization to further enhance the framework’s applicability.