Adaptive Policies for Sequential Sampling under Incomplete Information and a Cost Constraint


We consider the problem of sequential sampling from a finite number of independent statistical populations to maximize the expected infinite horizon average outcome per period, under a constraint that the expected average sampling cost does not exceed an upper bound. The outcome distributions are not known. We construct a class of consistent adaptive policies, under which the average outcome converges with probability 1 to the true value under complete information for all distributions with finite means. We also compare the rate of convergence for various policies in this class using simulation.


💡 Research Summary

The paper addresses a sequential sampling problem in which a decision maker repeatedly selects from a finite set of independent statistical populations (or “arms”) in order to maximize the long‑run average outcome per period while respecting an upper bound on the average sampling cost. Unlike classic multi‑armed bandit (MAB) formulations that focus solely on reward maximization, this work explicitly incorporates a cost constraint, reflecting realistic settings such as budget‑limited experiments, resource‑constrained manufacturing, or cloud‑computing allocation. The authors assume that the reward and cost distributions of each arm are unknown a priori, but that they possess finite means. Their goal is to design adaptive policies that are consistent: with probability one the average reward converges to the optimal value that would be obtained under complete information, and the average cost never exceeds the prescribed bound in the limit.

Model and Objective

Formally, let there be K arms. Arm i generates i.i.d. rewards X_i with mean \(\mu_i\) and costs C_i with mean \(c_i\). At each discrete time t a policy chooses an arm \(i_t\). The performance metrics are the empirical averages

\[
\bar{X}_T = \frac{1}{T}\sum_{t=1}^{T} X_{i_t}(t), \qquad \bar{C}_T = \frac{1}{T}\sum_{t=1}^{T} C_{i_t}(t),
\]

and the objective is to maximize the long-run average reward \(\bar{X}_T\) subject to the long-run average cost \(\bar{C}_T\) not exceeding an upper bound \(c_0\).
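The summary does not reproduce the paper's actual policy construction, but the structure of a consistent adaptive policy can be sketched: sample every arm on a diminishing forced-exploration schedule, and otherwise play the reward-maximizing, cost-feasible mixture computed from the current empirical means (a certainty-equivalence step). The sketch below is illustrative only; the function names, the \(1/\sqrt{t}\) exploration rate, and the mixture computation are assumptions, not taken from the paper.

```python
import random

def best_mixture(mu, c, c0):
    """Maximize sum(p_i * mu_i) over the simplex subject to sum(p_i * c_i) <= c0.
    With a single linear constraint the optimum mixes at most two arms, so
    enumerating single arms and cheap/expensive pairs suffices."""
    K = len(mu)
    best_val, best_mix = float("-inf"), None
    for i in range(K):                        # pure (single-arm) candidates
        if c[i] <= c0 and mu[i] > best_val:
            best_val, best_mix = mu[i], [(i, 1.0)]
    for i in range(K):                        # two-arm mixtures spending exactly c0
        for j in range(K):
            if c[i] < c0 < c[j]:
                p = (c0 - c[i]) / (c[j] - c[i])   # weight on the expensive arm j
                val = (1 - p) * mu[i] + p * mu[j]
                if val > best_val:
                    best_val, best_mix = val, [(i, 1 - p), (j, p)]
    return best_val, best_mix

def adaptive_policy(arms, c0, horizon, seed=0):
    """Certainty-equivalence policy with diminishing forced exploration.
    `arms[i](rng)` returns one (reward, cost) sample from population i."""
    rng = random.Random(seed)
    K = len(arms)
    n, rsum, csum = [0] * K, [0.0] * K, [0.0] * K
    total_r = total_c = 0.0
    for t in range(1, horizon + 1):
        if t <= K or rng.random() < 1.0 / t ** 0.5:
            i = (t - 1) % K                   # forced exploration round
        else:
            mu_hat = [rsum[k] / n[k] for k in range(K)]
            c_hat = [csum[k] / n[k] for k in range(K)]
            _, mix = best_mixture(mu_hat, c_hat, c0)
            if mix is None:                   # no cost-feasible arm found yet
                i = rng.randrange(K)
            else:                             # draw an arm from the mixture
                i, r, acc = mix[-1][0], rng.random(), 0.0
                for k, p in mix[:-1]:
                    acc += p
                    if r < acc:
                        i = k
                        break
        x, cost = arms[i](rng)
        n[i] += 1; rsum[i] += x; csum[i] += cost
        total_r += x; total_c += cost
    return total_r / horizon, total_c / horizon
```

For instance, with a hypothetical expensive arm (reward 1, cost 2) and a cheap arm (reward 0.3, cost 0.5) under a cost bound of 1.0, neither pure arm is optimal: the feasible mixture placing weight 1/3 on the expensive arm attains average reward 8/15 while spending exactly the budget, and the policy's empirical averages approach these values as the exploration rate vanishes.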

