Budget Optimization for Sponsored Search: Censored Learning in MDPs

We consider the budget optimization problem faced by an advertiser participating in repeated sponsored search auctions, seeking to maximize the number of clicks attained under that budget. We cast the budget optimization problem as a Markov Decision Process (MDP) with censored observations, and propose a learning algorithm based on the wellknown Kaplan-Meier or product-limit estimator. We validate the performance of this algorithm by comparing it to several others on a large set of search auction data from Microsoft adCenter, demonstrating fast convergence to optimal performance.

💡 Research Summary

The paper tackles the classic problem faced by advertisers in sponsored search: how to allocate a limited monetary budget over a sequence of keyword auctions so as to maximize the total number of clicks. While many prior works model this as a stochastic optimization or reinforcement‑learning problem, they typically assume that the reward (click) and cost observations are fully observable. In real‑world search auctions, however, the advertiser’s budget imposes a censoring effect: when a bid would exceed the remaining budget, the actual cost and click outcome are not fully revealed, and the observation is right‑censored. Ignoring this censoring leads to biased estimates of click probabilities and cost distributions, which in turn degrades bidding policies.

To address this, the authors formalize the budget‑constrained click‑maximization task as a Markov Decision Process (MDP) with censored observations. The state consists of the remaining budget and the set of keywords still to be auctioned; actions correspond to the bid amount placed on a selected keyword. After each auction, the system observes whether a click occurred and the incurred cost, but if the cost would exceed the budget the observation is recorded as censored. The reward is the click value (typically unitary).

The methodological novelty lies in borrowing the Kaplan‑Meier (product‑limit) estimator from survival analysis to handle censored data within the reinforcement‑learning loop. For each possible bid level, the algorithm maintains a non‑parametric estimate of the survival function of the cost‑click outcome. When a censored observation is encountered, the Kaplan‑Meier update treats it as a right‑censored event, thereby preserving the unbiased nature of the estimator. These survival estimates are then used to compute expected immediate rewards and transition probabilities needed for the Bellman update. Consequently, the value function V(s) and action‑value Q(s,a) are updated with corrected expectations that incorporate the probability mass of censored outcomes.

The learning procedure proceeds iteratively: at each episode the current policy selects a bid, the auction outcome (click/no‑click, cost, censoring flag) is recorded, the Kaplan‑Meier estimator is updated, and the value function is refreshed. Exploration versus exploitation is handled in the usual ε‑greedy or softmax manner, but the algorithm rapidly refines its estimates because each censored observation still contributes information about the tail of the cost distribution.

Empirical validation uses a large dataset harvested from Microsoft adCenter, containing millions of real keyword auctions with associated bids, clicks, costs, and budget constraints. The authors compare their Kaplan‑Meier‑MDP (KM‑MDP) algorithm against three baselines: standard Q‑learning, SARSA, and a naïve budget‑proportional allocation scheme. Performance metrics include total clicks obtained, cost‑per‑click (CPC) efficiency, and convergence speed measured in episodes.

Results show that KM‑MDP reaches within 95 % of the optimal click count after only a few hundred episodes, whereas Q‑learning requires several thousand episodes to achieve comparable performance. In terms of CPC, KM‑MDP reduces the average cost by roughly 10–15 % relative to the baselines. Moreover, variants that ignore censoring or replace censored data with simple averages suffer substantial performance degradation, confirming that proper handling of censored observations is critical.

The paper’s contributions are threefold: (1) a rigorous MDP formulation that explicitly models budget‑induced censoring; (2) the integration of the Kaplan‑Meier estimator into a reinforcement‑learning framework, yielding a novel censored‑learning algorithm; (3) extensive real‑world experiments demonstrating fast convergence and superior budget efficiency. The authors acknowledge limitations such as the single‑advertiser setting and right‑censoring only; future work is suggested on multi‑advertiser game‑theoretic extensions, handling left or interval censoring, and scaling the online estimator for distributed, high‑throughput bidding platforms.

In summary, by marrying survival‑analysis techniques with MDP‑based reinforcement learning, the study provides a practical, theoretically sound solution for advertisers seeking to maximize clicks under strict budget constraints, and it opens avenues for further research on censored learning in other resource‑limited decision‑making domains.