Constraint-Aware Generative Auto-bidding via Pareto-Prioritized Regret Optimization
Auto-bidding systems aim to maximize marketing value while satisfying strict efficiency constraints such as Target Cost-Per-Action (CPA). Although Decision Transformers provide powerful sequence modeling capabilities, applying them to this constrained setting encounters two challenges: 1) standard Return-to-Go conditioning causes state aliasing by neglecting the cost dimension, preventing precise resource pacing; and 2) standard regression forces the policy to mimic average historical behaviors, thereby limiting the capacity to optimize performance toward the constraint boundary. To address these challenges, we propose PRO-Bid, a constraint-aware generative auto-bidding framework based on two synergistic mechanisms: 1) Constraint-Decoupled Pareto Representation (CDPR) decomposes global constraints into recursive cost and value contexts to restore resource perception, while reweighting trajectories based on the Pareto frontier to focus on high-efficiency data; and 2) Counterfactual Regret Optimization (CRO) facilitates active improvement by utilizing a global outcome predictor to identify superior counterfactual actions. By treating these high-utility outcomes as weighted regression targets, the model transcends historical averages to approach the optimal constraint boundary. Extensive experiments on two public benchmarks and online A/B tests demonstrate that PRO-Bid achieves superior constraint satisfaction and value acquisition compared to state-of-the-art baselines.
💡 Research Summary
The paper tackles the problem of constrained auto‑bidding, where an advertiser must maximize a marketing objective (e.g., conversions, GMV) while strictly respecting ratio‑based efficiency constraints such as target CPA. Recent work has applied Decision Transformers (DT) to this domain because of their ability to capture long‑range dependencies, but two fundamental shortcomings remain. First, conditioning only on Return‑to‑Go (RTG) ignores the cumulative cost dimension, leading to state aliasing: the model cannot distinguish states that have the same remaining value but different remaining budgets, which prevents precise pacing. Second, the standard mean‑squared‑error (MSE) regression objective forces the policy to imitate the average of historical actions, thereby learning from both high‑performing and sub‑optimal (or even constraint‑violating) behavior and lacking a mechanism to push the policy toward the optimal constraint boundary.
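The aliasing issue can be made concrete with a toy sketch (not the paper's code; function names are illustrative). Two trajectories with identical per-step values but different per-step costs produce identical RTG sequences, so RTG-only conditioning cannot tell them apart; adding a recursive Remaining-Cost stream disambiguates them:

```python
def rtg_only(values):
    """Standard Return-to-Go: suffix sums of per-step value."""
    rtg, acc = [], 0.0
    for v in reversed(values):
        acc += v
        rtg.append(acc)
    return rtg[::-1]

def dual_stream(values, costs):
    """Dual-stream conditioning: (Remaining Value R_t, Remaining Cost C_t)."""
    return list(zip(rtg_only(values), rtg_only(costs)))

# Two toy trajectories with identical value streams but different costs:
traj_a = dual_stream([1.0, 2.0, 1.0], [3.0, 1.0, 2.0])
traj_b = dual_stream([1.0, 2.0, 1.0], [1.0, 1.0, 1.0])
# RTG-only conditioning assigns both the same token at every step;
# the cost stream separates them (step 0: (4.0, 6.0) vs. (4.0, 3.0)).
```

Here the cost stream plays the pacing role described above: two states with the same remaining value but different remaining budgets now receive distinct conditioning tokens.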
PRO‑Bid is introduced to overcome these limitations through two synergistic mechanisms: Constraint‑Decoupled Pareto Representation (CDPR) and Counterfactual Regret Optimization (CRO).
CDPR restores cost awareness by decoupling the global objective into two recursive streams: Remaining Value (Rₜ) and Remaining Cost (Cₜ). At each timestep the input token sequence is augmented to (Rₜ, Cₜ, sₜ, aₜ), and the model is conditioned on both streams. This dual‑stream formulation enables the policy to learn an explicit trade‑off function that maps the remaining budget to the remaining value, effectively acting as a dynamic pacing mechanism. In addition, CDPR applies Pareto‑prioritized experience filtering. Each offline trajectory τᵢ is represented as a point (R(τᵢ), C(τᵢ)) in the objective space, normalized, and the Pareto frontier F is constructed. Three quality scores are computed: (1) an efficiency score based on the Euclidean distance to F, (2) a compliance score that penalizes trajectories exceeding the target ratio, and (3) a richness score reflecting trajectory length. The product of these scores yields a sampling weight Qᵢ, and trajectories are sampled proportionally to Qᵢ. This focuses training on high‑efficiency, constraint‑compliant, and sufficiently long episodes, mitigating the influence of noisy or low‑quality data.
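The Pareto-prioritized filtering step can be sketched as follows. This is a hedged illustration: the exact functional forms of the three scores (here 1/(1+d) for efficiency, a ratio penalty for compliance, and length normalization for richness) are assumptions, not the paper's exact formulas.

```python
import numpy as np

def pareto_frontier(points):
    """Non-dominated points: maximize value (col 0), minimize cost (col 1)."""
    frontier = []
    for i, p in enumerate(points):
        dominated = any(
            (q[0] >= p[0] and q[1] <= p[1]) and (q[0] > p[0] or q[1] < p[1])
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append(p)
    return np.array(frontier)

def sampling_weights(values, costs, lengths, target_ratio, max_len):
    """Quality score Q_i = efficiency * compliance * richness, normalized."""
    values = np.asarray(values, dtype=float)
    costs = np.asarray(costs, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    pts = np.stack([values, costs], axis=1)
    # Normalize each objective to [0, 1] before distance computation.
    norm = (pts - pts.min(0)) / (np.ptp(pts, axis=0) + 1e-8)
    F = pareto_frontier(norm)
    d = np.array([np.min(np.linalg.norm(F - p, axis=1)) for p in norm])
    efficiency = 1.0 / (1.0 + d)                  # closer to frontier -> higher
    ratio = costs / np.maximum(values, 1e-8)      # realized cost-per-value
    compliance = np.where(ratio <= target_ratio, 1.0,
                          target_ratio / ratio)   # penalize constraint violations
    richness = lengths / max_len                  # longer episodes -> richer
    Q = efficiency * compliance * richness
    return Q / Q.sum()                            # sampling probabilities
```

Under this sketch, a trajectory on the frontier with a compliant ratio receives the largest sampling weight, while dominated or constraint-violating trajectories are down-weighted rather than discarded outright.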
CRO addresses the regression‑to‑the‑mean problem by introducing active improvement through counterfactual reasoning. The policy head is a Gaussian distribution N(μ_θ(hₜ), σ²_θ(hₜ)), allowing stochastic exploration while the negative log‑likelihood (NLL) loss anchors the mean to demonstrated actions. A Global Outcome Predictor φ_ω, sharing the transformer backbone but with a separate head, predicts future cumulative value and cost given the current state‑action prefix. Using this predictor, the system evaluates sampled actions a′ₜ that are not present in the offline data. If a′ₜ yields a higher utility (e.g., higher value‑to‑cost ratio) and respects the constraint, it is treated as a “regret‑free” counterfactual. The predicted outcome of this counterfactual becomes a weighted regression target, with weight proportional to the estimated regret reduction. By repeatedly pulling the policy toward such high‑utility counterfactuals, PRO‑Bid progressively moves the bidding strategy toward the Pareto frontier, effectively learning to operate near the optimal constraint boundary.
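A minimal sketch of the counterfactual target construction is given below. The outcome predictor is a stand-in callable, and the names (`predict_outcome`, the regret-reduction weighting, the blending rule) are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def counterfactual_target(mu, sigma, a_logged, predict_outcome,
                          target_ratio, n_samples=64, rng=None):
    """Return a regression target pulled toward a low-regret counterfactual.

    predict_outcome(a) -> (future_value, future_cost): a stand-in for the
    Global Outcome Predictor conditioned on the current state-action prefix.
    """
    rng = rng or np.random.default_rng(0)
    v_log, c_log = predict_outcome(a_logged)
    u_log = v_log / max(c_log, 1e-8)              # utility of the logged action
    candidates = rng.normal(mu, sigma, size=n_samples)
    best_a, best_w = a_logged, 0.0
    for a in candidates:
        v, c = predict_outcome(a)
        ratio_ok = c / max(v, 1e-8) <= target_ratio   # constraint check
        u = v / max(c, 1e-8)                          # value-to-cost utility
        if ratio_ok and (u - u_log) > best_w:
            best_a, best_w = a, u - u_log             # regret reduction as weight
    # Blend logged and counterfactual actions, weighted by regret reduction.
    w = best_w / (1.0 + best_w)
    return (1.0 - w) * a_logged + w * best_a
```

With a predictor whose value grows linearly but whose cost grows quadratically in the bid, the blended target is pulled below an over-aggressive logged bid toward the higher-utility region, which is the "pulling toward regret-free counterfactuals" behavior described above.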
Experiments were conducted on two public RTB benchmarks (Criteo and Avazu) and on an internal AliExpress advertising system via online A/B testing. Evaluation metrics included the target CPA satisfaction rate, total conversions/GMV, and ROI. PRO‑Bid achieved >95% compliance with the CPA target while improving conversions and ROI by roughly 12% and 9%, respectively, over the best DT‑based baselines. An ablation that removed CRO (i.e., CDPR only) retained comparable constraint satisfaction but lagged in value metrics by about 4%, highlighting the importance of counterfactual learning. The online tests confirmed that these gains translate to real‑world revenue and cost efficiency.
Contributions: (1) A dual‑stream cost‑value conditioning that eliminates state aliasing under ratio constraints; (2) Pareto‑based trajectory weighting that concentrates learning on high‑quality, compliant data; (3) A counterfactual regret framework that leverages a learned outcome simulator to generate superior pseudo‑labels, enabling the policy to surpass the average quality of the training logs. The approach is generic and can be applied to any sequential decision problem with budget or ratio constraints, beyond online advertising.
In summary, PRO‑Bid advances constrained auto‑bidding by integrating explicit cost perception, data‑driven prioritization, and counterfactual regret optimization, delivering both theoretical novelty and practical performance improvements.