Information Acquisition and Exploitation in Multichannel Wireless Networks
Authors: Sudipto Guha, Kamesh Munagala, Saswati Sarkar
July 26, 2021

Abstract

A wireless system with multiple channels is considered, where each channel has several transmission states. A user learns about the instantaneous state of an available channel by transmitting a control packet in it. Since probing all channels consumes significant energy and time, a user needs to determine what and how much information it needs to acquire about the instantaneous states of the available channels so that it can maximize its transmission rate. This motivates the study of the trade-off between the cost of information acquisition and its value towards improving the transmission rate. A simple model is presented for studying this information acquisition and exploitation trade-off when the channels are multi-state, with different distributions and information acquisition costs. The objective is to maximize a utility function which depends on both the cost and value of information. Solution techniques are presented for computing near-optimal policies with succinct representation in polynomial time. These policies provably achieve at least a fixed constant factor of the optimal utility on any problem instance, and in addition, have natural characterizations. The techniques are based on exploiting the structure of the optimal policy, and on the use of Lagrangean relaxations which simplify the space of approximately optimal solutions.

1 Introduction

Future wireless networks will provide each terminal access to a large number of channels. A channel can for example be a frequency in a frequency division multiple access (FDMA) network, a code in a code division multiple access (CDMA) network, or an antenna or a polarization state (vertical or horizontal) of an antenna in a device with multiple antennas (MIMO).
Several existing wireless technologies, e.g., IEEE 802.11a [1], IEEE 802.11b [15], and IEEE 802.11h [2], propose to use multiple frequencies. For example, the IEEE 802.11a protocol has 8 channels for indoor use and 4 channels for outdoor use in the 5 GHz band, while the IEEE 802.11b protocol has 3 channels in the 2.4 GHz band. The potential deregulation of the wireless spectrum is likely to enable the use of a significantly larger number of frequencies. Due to significant advances in device technology, laptops with multiple antennas (antenna arrays) incorporated in the front lid, and devices with smart antennas, have already been developed, and the number of such antennas is likely to increase significantly in the near future.

∗ Department of Computer and Information Science, University of Pennsylvania, Philadelphia PA 19104. Email: sudipto@cis.upenn.edu. This research was supported in part by an Alfred P. Sloan Fellowship and NSF award CCF 04-30376.
† Department of Computer Science, Duke University, Durham NC 27708-0129. Email: kamesh@cs.duke.edu. This research was supported in part by NSF award CNS 05-40347.
‡ Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia PA 19104. Email: swati@seas.upenn.edu. This research was supported by NSF awards NCR 02-38340, CNS 04-35306, CNS 04-35141, ECCS 06-21782 and CNS 07-21308.

The increase in the number of channels is expected to significantly enhance network capacity and enable several new bandwidth-intensive applications, as multiple transmissions can now proceed simultaneously in a vicinity using different channels. Furthermore, the availability of multiple channels substantially enhances the probability (at any given time) of the existence of at least one channel with acceptable transmission quality, since the transmission qualities of the individual channels vary stochastically with the time and locations of the users.
These benefits can however be realized only if the users can select the channels efficiently. Most of the existing channel selection strategies assume complete knowledge of the instantaneous transmission qualities of all channels. We refer to this approach as "complete information based optimal control". Note that a user can only learn the instantaneous state of a channel by transmitting a control packet in it, after which the receiver informs the sender about the quality of the channel in a response packet (e.g., the RTS and CTS packet exchange in IEEE 802.11). The exchange of control packets in this probing process consumes additional energy, and prevents other neighboring users from simultaneously utilizing the channel. Probing a channel is therefore associated with a cost. When the number of available channels is large, the cost incurred in learning the instantaneous transmission qualities of all channels may become prohibitive. Owing to this cost, some recent papers have investigated selection strategies that assume no knowledge of the instantaneous transmission qualities of any channel [27]. This approach, which we refer to as "minimal information based optimal control", may however attain significantly lower transmission rates owing to sub-optimal selection of channels. We seek to design a framework that attains any desired trade-off between the above extremes using only simple control mechanisms. Specifically, we develop a framework for partial information based stochastic control which, in accordance with the costs and the benefits of probing different channels, determines both (a) the amount of information a user must obtain about the instantaneous transmission qualities of the channels at its disposal and (b) how to select the channels based on the acquired information. We consider a single sender with access to n channels.
Every time the sender probes a channel, it learns the signal to noise ratio and thereby the probability of success in the channel, but also incurs a certain cost, which may again differ across channels. Before each transmission, the sender needs to determine how many and which channels it will probe, and also the sequence in which these channels will be probed (probing policy). In this paper, we consider the scenario where a sender can transmit in only one channel in a time slot and transmits at most one packet in each slot. Based on the outcomes of the probes, a sender decides whether to transmit or to defer transmission until transmission qualities improve (transmission policy). If the sender decides to transmit, it must select one of the available channels (channel selection policy), which need not be one that it has probed. We seek to design a jointly optimal probing and channel selection policy that maximizes a system utility, defined as the difference between the probability of successful transmission and a suitably scaled expected probing cost before each transmission. Loosely, this utility function represents the gain or profit of the sender if the sender receives credit from the receiver for each packet it delivers successfully and must additionally compensate the wireless provider for each probe packet it transmits.

Technical Hurdles and Contributions: The optimal probing policy needs to probe adaptively, i.e., the result of a probe determines the channels to be probed subsequently. For example, consider channels with 3 possible states (0, 1, 2), each of which is associated with a different transmission quality. Clearly, the probing terminates if a probed channel is in the highest state. Now, let a probed channel be in the intermediate state (state 1). Then the subsequent probes should be limited to channels that have high probabilities of being in the highest state.
However, if all channels that have been probed in a slot are in the lowest state, then the channels that have high probabilities of being in the intermediate state may also be subsequently probed. Even the decision regarding which channel should be probed first depends on the higher order statistics of all the channels in a complex fashion. This is because the optimum policy may not probe a channel whose quality has a low variance, as probing it does not provide significant information but incurs additional cost. Example 5.1 in Section 5 illustrates these points. Also, the channel selection decision depends on the outcomes of the probes and on the expectation and uncertainty of the transmission qualities of the channels that have not been probed. The optimal policy is therefore a decision tree over n variables (Figure 1) and can be computed by solving a dynamic program with Ω(K·2^n) states; naive computations will require both exponential time and exponential storage space. The above observations rule out greedy strategies for computing the optimal solution. Our main contribution is in showing that despite these hurdles, there is a nice (albeit non-trivial) combinatorial structure in the optimal decision trees. We then use this structure to design simple, natural, polynomial time greedy algorithms which provably achieve at least 4/5 of the optimal utility on any problem instance. A nice feature of our algorithmic framework is that it easily extends to handle other constraints on the problem, as we elucidate next. For example, when the sender is not saturated, that is, it does not always have packets to transmit, it need not transmit packets in every slot, and therefore needs to jointly optimize the transmission, probing and channel selection policy so as to attain the maximum utility.
This presents additional technical challenges, since the transmission policy needs to take into account two conflicting criteria: transmissions should only happen when some channel is observed to be in a very high quality state, but on the other hand, they should happen frequently enough to maintain stability. This leads to a rate constraint in the corresponding optimization. We show a novel technique based on linear programming duality to handle the rate constraint, and present simple polynomial time computable greedy policies which provably approximate the optimal policies. In addition, these greedy policies are the most natural threshold policies, where the decision is to transmit only when the reward from transmission exceeds a certain threshold. This threshold depends on the arrival rate, and is computed by a simple parametric search.

Summary of Contributions: In summary, our main contribution is to obtain succinct, polynomial time computable joint probing, selection and transmission policies that provably attain utility values within constant factors of the optimal utilities. More specifically:

1. We first consider the case that a sender is saturated and seek to determine the policy that has the maximum utility. We prove that when each channel has two states, an optimal policy can be computed in O(n log n) time (Section 4). When each channel has K states, we obtain a policy that provably attains 4/5 of the maximum utility and can be computed in O(n^2 K) time (Section 5). These performance guarantees hold even when different channels have different distributions for the state processes and different probing costs.
In the special case in which all channels have equal probing costs, but potentially different distributions for the state processes, we present a parametrized probing and channel selection policy whose parameters can be appropriately selected so as to attain any desired trade-off between performance guarantee and computation time (Section 6).

2. We next consider the unsaturated sender scenario where packets arrive at a given rate λ, so that the sender does not have packets to transmit in every slot (Section 7). The goal now is to determine the policy that attains the maximum utility among all stable policies. We prove that when each channel has two states, such a policy can be computed in O(n^3) time, and when each channel has K states, we show that a stable policy that provably attains 2/3 of the maximum utility among all stable policies can be computed in O(n^2 K(n + K)) time.

All policies can be readily implemented in resource constrained devices: once computed, they can be executed in O(n) time and stored in O(n + K) space. Our results are somewhat surprising given that optimal solutions for most partial information based control problems are possibly computationally intractable, and standard approximation techniques either do not provide guaranteed approximation ratios or require exponential computation times [5]. Our proofs therefore rely on the exploitation of specific system characteristics and employ techniques that are not standard in the context of stochastic control. The techniques we develop are very natural and general; they are expected to have wider applicability in designing simple and intuitive heuristics for a larger class of problems in the broad area of partial information based control, and in particular the joint optimization of the reward obtained from informed selections and the cost incurred in acquiring the required information. We will explore this in future work.
2 Related Literature

We first discuss the relation of our problem to some classical problems like the stopping time and multi-armed bandit problems. The most well-researched version of the stopping time problem is a stochastic control problem that optimally selects between two possible actions at any given time: to continue or to stop [10]. Recently, the results for this problem have been used to solve partial information based control problems for statistically identical channels with equal probing costs [19, 24]. Empirical investigations indicate that the different channels available to a sender oftentimes have different statistics [16]. When channels have different statistics and/or different probing costs, which is the case we consider in this paper, the optimal action needs to be selected from multiple options at any given time, the options being (a) whether to continue probing, (b) which channel to probe next if the decision is to probe, and (c) which channel to transmit in if the decision is to stop probing. Thus, the results from the above version of the stopping time problem do not apply in our context. The optimal stopping time problem has also been considered in a more general setting where the number of available actions may be more than two; our problem is in fact a special case of this general version (Chapter IV, [5]). In this general case, the process terminates in certain states, which constitute the termination set, and selects the optimal action in other states. But, so far, only certain broad characterizations of the termination set are known in this general case, and the optimal actions when the decision is not to stop are not known in closed form [5]. Thus, these general results do not lead to the optimal policies we seek to characterize. The stochastic multi-armed bandit problem considers a bandit with n arms [14].
The system can try one arm in each slot, and when it tries an arm, it receives a random reward which depends on the state of the arm. The state of an arm changes only when the system tries it. The reward of a system in T slots is the sum of the rewards in each slot. The goal is to maximize the expected reward in T slots. Our problem differs from the above in that (a) the state of a channel can change even when it is not probed or used for transmission, and (b) a node can learn the states of multiple channels in an epoch, incurring an additional probing cost for learning the state of each additional channel, and it can choose each such channel adaptively after observing the states of the channels probed earlier in the same slot. The adversarial multi-armed bandit problem [3] and the restless bandit problem [6, 28] remove one of the above differences in that they allow the state of an arm to change even when the system does not try it. But the adversarial multi-armed bandit problem [3] seeks to optimize the selection under the assumption that the sender uses the same arm in all slots. Note that we allow a sender to probe, and also transmit, in different channels in different slots. In another version of the adversarial multi-armed bandit problem, the goal is to select the arms so as to minimize the "regret", or the difference in expected reward relative to the best policy in a given collection of policies [3]. Our problem differs from this version of the adversarial multi-armed bandit problem, and also from the restless bandit problem [6, 28], in that we allow a node to adaptively probe different channels in the same slot by paying additional costs (difference (b) above). Thus the results available in this context do not apply to our problem, and we use a different solution approach and obtain different performance guarantees.
Optimizing the order of evaluation of random variables so as to minimize the cost of evaluation ("pipelined filters") has been investigated in several different contexts, like diagnostic tests in fault detection and medical diagnosis, optimizing conjunctive queries and join ordering in data-stream systems, and web services [4, 25, 11, 13, 17, 20, 21, 22, 23]. However, our work differs from all of the above in that we (a) consider multi-state channel models whereas pipelined filters consider two-state models, and (b) allow a node to transmit in a channel even if the channel has not been probed. Note that two-state models usually cannot capture the statistical variations of wireless channels [16]. As we demonstrate later, both of the above generalizations significantly alter the decision issues and the optimal solutions (Section 5). Finally, opportunistic selection of channels with complete knowledge of channel states has been comprehensively investigated over the last decade (e.g., [26]). But, in general, the area of partial information based control problems, and in particular the joint optimization of the reward obtained from informed selections and the cost incurred in acquiring the required information, remains largely unexplored in wireless networks. Policies with provable performance guarantees are known only in special cases like statistically identical channels [8, 19, 24], and even under these simplifying assumptions only the saturated sender case had been investigated. We consider both the saturated and unsaturated sender cases, and in both cases obtain provable performance guarantees even when channels have different statistics and/or probing costs. The results in this paper therefore enhance the state of knowledge in an emerging area which has hitherto received only limited attention.

3 System Model and Problem Definition

A sender U has access to n channels, denoted channels 1, 2, . . .
, n, each of which has K possible states, 0, . . . , K − 1. We assume that time is slotted. In any slot, channel j is in state i with probability p_{ij}, independent of its state in other time slots and of the states of other channels in any slot. Without loss of generality, we assume that p_{K−1,j} < 1 for each j, as otherwise the optimum policy is simply to transmit in j without probing any channel. In every slot, U transmits at most one data packet in a selected channel. If the channel selected for transmission is in state i, the transmission is successful with probability r_i. Without loss of generality we assume 0 ≤ r_0 < r_1 < · · · < r_{K−1}. For simplicity, we also assume that r_0 = 0; all analytical results can however be generalized to the scenario where r_0 > 0. Whenever U probes a channel j, it pays a cost c_j ≥ 0. Probing different channels may incur different costs, as the probing processes for different channels may interfere with the channel access of different numbers of users (based on the geometry and allocation of channels). We now formally define the policies and the performance metrics.

Definition 3.1. A probing policy is a rule that, given the set of channels the sender has already probed in a slot (which is empty at the beginning of the slot) and the states of the channels probed in the slot, determines (a) whether the sender should probe additional channels and (b) if so, which channel it should probe next. The sender knows the state of a channel in a slot if and only if it probes the channel in the slot.

Definition 3.2. A selection policy is a rule that selects a channel for the transmission of a data packet in a slot on the basis of the states of the probed channels, after the completion of the probing process in the slot. The selection policy can select a channel even if it has not been probed in the slot, and in that case, the channel is referred to as a backup channel.
Definition 3.3. The probing cost is the sum of the costs of all channels probed in the slot. The probing cost is clearly a random variable that depends on the probing policy and on the outcomes of the probes (as the sender may probe subsequent channels depending on the outcomes of the previous probes). The expected probing cost is the expectation of this random variable and depends on both the probing policy and the channel statistics.

Definition 3.4. In any slot, the transmission reward is 1 if there is a successful transmission and 0 otherwise. Therefore, the expected transmission reward is r_i in a slot t if U transmits in a channel in state i during t. The expected transmission reward of a policy is therefore Σ_i q_i r_i, where q_i is the probability that the selection policy decides to use a channel which is in state i; q_i depends on the channel statistics as well as the policy.

Definition 3.5. The expected utility of the sender, denoted simply as gain, is the difference between the expected transmission reward and the expected probing cost scaled by a factor κ. We denote the gain of a policy π as G_π. The gain depends on the probing and selection policies, the channel statistics and the scaling parameter κ; choosing κ to be 0 makes the policy acquire complete information, while setting it to ∞ makes the policy acquire no information. Since κ can be absorbed into the probing costs themselves, we drop this parameter in the remaining discussion without loss of generality.

In Sections 4, 5 and 6, we assume that U's queue is never empty (saturated sender assumption); we relax this assumption in Section 7. The two versions of the problem are defined as follows.

Saturated Sender Problem: Under the saturated sender assumption, at least one policy that maximizes the utility transmits a packet in every slot. We therefore assume that U transmits exactly one data packet in every slot.
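To make Definitions 3.3–3.5 concrete, the gain of any fixed policy can be estimated by simulation. The sketch below is our own illustration (the policy, the parameter values and the function names are invented for the example, not taken from the paper): it estimates the gain of a simple non-adaptive policy that probes channels in a fixed order until it sees the top state, then transmits on the best probed channel, falling back to an unprobed backup channel if every probe returned state 0.

```python
import random

def simulate_gain(p, r, c, order, backup, trials=200_000, seed=0):
    """Monte Carlo estimate of the gain (Definition 3.5, with the scaling
    factor kappa absorbed into the costs).  p[j][i] = P(channel j is in
    state i), r[i] = success probability in state i, c[j] = probing cost.
    The (non-adaptive, illustrative) policy probes channels in `order`
    until one is in the top state K-1; it then transmits on the best
    probed channel, or on the unprobed `backup` channel if every probed
    channel was in state 0."""
    rng = random.Random(seed)
    K = len(r)
    total = 0.0
    for _ in range(trials):
        states = [rng.choices(range(K), weights=pj)[0] for pj in p]
        cost, best = 0.0, 0
        for j in order:
            cost += c[j]
            best = max(best, states[j])
            if best == K - 1:
                break
        # reward: best probed channel, else the (unknown) backup channel
        reward = r[best] if best > 0 else r[states[backup]]
        total += reward - cost
    return total / trials

# Two three-state channels probed in order, plus a backup that is never probed.
p = [[0.5, 0.3, 0.2], [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]]
r = [0.0, 0.5, 1.0]
c = [0.01, 0.01, 0.05]
print(round(simulate_gain(p, r, c, order=[0, 1], backup=2), 3))
```

Note that the reward of the unprobed backup is simulated from its state distribution rather than taken in expectation; both give the same expected gain.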
The problem formulation for the saturated sender case follows.

Problem 1. Given {c_j}, {r_i} and {p_{ij}}, find a probing and selection policy so as to maximize the expected gain. Let OPT denote the optimal policy and G_{Opt} its gain.

Since channels are temporally independent, the optimal policy in a slot need not depend on the decisions and the observations in other slots. Also, the optimal policy remains the same in all slots, though the specific choices it makes may differ across slots depending on the outcomes of the probes. Note that the optimal probing policy does not probe any further in a slot once a probed channel is in state K − 1. Using these observations, the optimal policy can be computed using a bottom-up dynamic program whose states correspond to the tuple consisting of (a) the best state encountered so far and (b) the set of channels that have not yet been probed. Thus, the dynamic program has K·2^n states, and hence, naive computations will require Ω(K·2^n) time and space.

Policies and Decision Trees: Every joint policy can be represented by a unique decision tree (Figure 1); we therefore use policies and decision trees interchangeably.

Unsaturated Sender Problem: We now relax the assumption that a sender always has packets to transmit. A sender generates packets as per an arrival process which constitutes a positive recurrent, aperiodic, irreducible Markov chain. Under the steady state distribution of the Markov chain, the expected number of arrivals in any slot is λ, where λ ∈ [0, 1). Packets are stored in an infinite buffer. If in a slot the sender has a packet in its queue, the slot is referred to as a busy slot. The sender transmits only in busy slots, but may not transmit in every busy slot; it may improve its gain by deferring transmission until at least one channel has good quality. The transmission policy is a rule that determines in which slots a sender transmits.
The decisions may depend on the outcomes of the probes, the queue lengths, and the channel and arrival statistics. The sender must however ensure that it transmits at least at the rate at which it generates packets; otherwise its delay becomes unbounded. In addition to gain, system stability is therefore of interest.

Definition 3.6. The system is stable if the sender's expected queue length is finite. A policy that attains a finite expected queue length is a stable policy.

Problem 2. Given {c_j}, {r_i} and {p_{ij}}, find a probing, selection and transmission policy that stabilizes the system and maximizes the gain among all stable policies. Let OPT-UNSAT denote the optimal policy and G_U its gain.

4 The Two-State, Saturated Sender Problem

We now consider the case that the sender always has packets to transmit, and seek to solve Problem 1, formulated in Section 3, when K = 2. We first consider a specific class of policies, EXHAUST, and prove that OPT belongs to this class. Subsequently we show how to find the optimal policy TWOSTATEOPT in this class in O(n log n) time. Thus, when K = 2, TWOSTATEOPT is OPT and can be computed in O(n log n) time. Also, TWOSTATEOPT can be executed and stored in O(n) time and space.

Definition 4.1. Given S ⊂ {1, . . . , n} and i ∉ S, let EXHAUST(S, i) denote the class of policies which probe all channels in S in a deterministic order until a probed channel is in state 1 or all channels in S have been probed. Such a policy selects the last probed channel if it is in state 1, and selects i otherwise. Channel i is called the backup channel.

In what follows, we prove that there is an optimal policy of the form EXHAUST(S, i). We note that results proved later in the paper (for the case K > 2) subsume the proof of this fact, but a straightforward application of those later results would yield an algorithm requiring O(n^2) time.

Lemma 4.1.
There exists an EXHAUST(S, i) policy which is optimal.

Proof. We prove the lemma by induction on the number of channels n. For the base case, n = 1, the expected gain is r_1 p_{11} − c_1 if the optimal policy probes the channel, and r_1 p_{11} otherwise. Since c_1 ≥ 0, the policy that selects the channel without probing is optimal over all possible convex combinations, i.e., randomizations, of the above two policies. Thus, EXHAUST(∅, 1) is an optimal policy in this case. Assuming the lemma holds for n = s, consider a set J of s + 1 channels. OPT can either (a) select a channel without probing or (b) probe a channel. Conditioned on case (a), G_{Opt} = G_{EXHAUST(∅, j)}, where j = arg max_i p_{1i}. In case (b), OPT chooses to probe a channel i with some probability. Subsequently, if i is in state 1, OPT selects i. Now, if i is in state 0, then OPT takes the same decisions as in a system with the s remaining channels, and by the induction hypothesis, uses an EXHAUST(Q, j) policy for some Q ⊂ J \ {i}, j ∈ (J \ Q) \ {i}. Thus, in this case, OPT is an EXHAUST({i} ◦ Q, j) policy, where ◦ denotes concatenation of the probing order. Therefore, conditioned on case (b), G_{Opt} is a convex combination of the gains of EXHAUST policies. Hence, overall, G_{Opt} is a convex combination of the gains of EXHAUST policies. Thus, there exists an optimum policy which is EXHAUST(S, i).

We next prove that OPT satisfies additional properties, which allow a polynomial time computation of OPT.

Lemma 4.2. Let S_i = {j : (1 − p_{1i}) p_{1j} r_1 > c_j}. If EXHAUST(S, i) is an optimum policy, then the following conditions hold:

1. the channels j in S are probed in decreasing order of p_{1j}/c_j;

2. the EXHAUST(S_i, i) policy is an optimum policy.

Proof. Let the EXHAUST(S, i) policy be an optimum policy. W.l.o.g. S = {k_1, . . . , k_{|S|}}, where channel k_l is probed before k_{l+1}.
Then the gain of the EXHAUST(S, i) policy is

A = Σ_{l=1}^{|S|} (p_{1k_l} r_1 − c_{k_l}) Π_{m=1}^{l−1} (1 − p_{1k_m}) + p_{1i} r_1 Π_{m=1}^{|S|} (1 − p_{1k_m}).

We first prove (1). Recall that p_{1j} < 1 for all channels j. Suppose p_{1k_s}/c_{k_s} < p_{1k_{s+1}}/c_{k_{s+1}}. Consider a new policy which probes k_{s+1} before k_s but is otherwise identical to the EXHAUST(S, i) policy. Let the gain of this new policy be B. Then A − B = Π_{m=1}^{s−1} (1 − p_{1k_m}) (p_{1k_s} c_{k_{s+1}} − p_{1k_{s+1}} c_{k_s}). Thus, clearly, B > A, so EXHAUST(S, i) is not an optimum policy, a contradiction. The result follows.

We now prove (2). If S = S_i the result follows. Let S ≠ S_i. Then either S_i \ S ≠ ∅ or S \ S_i ≠ ∅.

First let S \ S_i ≠ ∅, and consider some j ∈ S \ S_i. From (1), p_{1k_s}/c_{k_s} ≥ p_{1k_{s+1}}/c_{k_{s+1}} for every s. Thus,

(1 − p_{1i}) p_{1k_{|S|}} r_1 / c_{k_{|S|}} = min_{1 ≤ l ≤ |S|} (1 − p_{1i}) p_{1k_l} r_1 / c_{k_l} ≤ (1 − p_{1i}) p_{1j} r_1 / c_j ≤ 1.

Thus, k_{|S|} ∈ S \ S_i. Let Q = S \ {k_{|S|}}. The gain of the EXHAUST(Q, i) policy with probing sequence k_1, . . . , k_{|S|−1} is

D = Σ_{l=1}^{|S|−1} (p_{1k_l} r_1 − c_{k_l}) Π_{m=1}^{l−1} (1 − p_{1k_m}) + p_{1i} r_1 Π_{m=1}^{|S|−1} (1 − p_{1k_m}).

Now, D − A = (c_{k_{|S|}} − (1 − p_{1i}) p_{1k_{|S|}} r_1) Π_{m=1}^{|S|−1} (1 − p_{1k_m}). Since (1 − p_{1i}) p_{1k_{|S|}} r_1 ≤ c_{k_{|S|}}, we have D ≥ A. Thus, EXHAUST(Q, i) is an optimum policy, where Q ⊆ S and |Q \ S_i| < |S \ S_i|. Continuing this argument, there clearly exists a T such that T ⊆ S, T \ S_i = ∅, and the EXHAUST(T, i) policy is optimal.

Now let S_i \ S ≠ ∅. If S \ S_i ≠ ∅, let T be as constructed in the above paragraph; otherwise let T = S. In both cases, the EXHAUST(T, i) policy is optimal. We now show that S_i \ T = ∅. If not, consider a j ∈ S_i \ T. Let Q = T ∪ {j}. The gain of the EXHAUST(Q, i) policy with probing sequence k_1, . . .
, k_{|T|}, j is

C = Σ_{l=1}^{|T|} (p_{1k_l} r_1 − c_{k_l}) Π_{m=1}^{l−1} (1 − p_{1k_m}) + (p_{1j} r_1 − c_j) Π_{l=1}^{|T|} (1 − p_{1k_l}) + p_{1i} r_1 (1 − p_{1j}) Π_{l=1}^{|T|} (1 − p_{1k_l}).

Now, C − A′ = ((1 − p_{1i}) p_{1j} r_1 − c_j) Π_{l=1}^{|T|} (1 − p_{1k_l}), where A′ denotes the gain of EXHAUST(T, i). Since p_{1s} < 1 for all s and (1 − p_{1i}) p_{1j} r_1 > c_j, we have C > A′. This contradicts the optimality of EXHAUST(T, i). Thus, S_i \ T = ∅, and hence S_i = T. Therefore, the EXHAUST(S_i, i) policy is optimal.

Lemmas 4.1 and 4.2 prove that there exists an EXHAUST(S_i, i) policy that is optimal, and that this policy probes the channels in S_i in decreasing order of p_{1j}/c_j. The routine DETERMINEBESTBACKUP described below determines the BESTBKUP channel i* such that EXHAUST(S_{i*}, i*) attains the maximum gain among all such EXHAUST(S_i, i) policies. Note that i* can be computed in O(n^2) time using a naive implementation, but the following computation requires only O(n log n) time.

DETERMINEBESTBACKUP

1. Sort the channels in decreasing order of p_{1i}/c_i and re-number them in accordance with the sorted order, i.e., if i < j then p_{1i}/c_i > p_{1j}/c_j. Let S_i = {j : (1 − p_{1i}) p_{1j} r_1 > c_j, j ≠ i}. Let Gain(i) denote the gain of EXHAUST(S_i, i).

2. Let D_1 = 1. For j ≥ 1, compute D_{j+1} = D_j (1 − p_{1j}). /* D_{j+1} = probability that the first j channels are in state 0 */

3. Let F_1 = 0. For j ≥ 1, compute F_{j+1} = F_j + (p_{1j} r_1 − c_j) D_j. /* F_{j+1} = gain of probing the first j channels */

4. For each channel i: if i > |S_i|, then
Gain(i) = F_{|S_i|+1} + p_{1i} r_1 D_{|S_i|+1};
else
Gain(i) = F_i + (F_{|S_i|+2} − F_i − (p_{1i} r_1 − c_i) D_i) / (1 − p_{1i}) + (p_{1i} r_1 / (1 − p_{1i})) D_{|S_i|+2}.

5. Let BESTBKUP = arg max_{i=1,...,n} Gain(i).
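The search over backup channels can also be implemented directly. The sketch below is our own simplified O(n^2) version with invented names: instead of the prefix quantities D_j and F_j of the routine, it recomputes each Gain(i) from scratch via the gain formula of Lemma 4.2, which is easier to read and still polynomial.

```python
def exhaust_gain(S, i, p1, r1, c):
    """Gain of EXHAUST(S, i): probe the channels of S in the given order
    until one is in state 1 (and transmit on it); if all of them are in
    state 0, transmit on the unprobed backup channel i.
    p1[j] = P(channel j is in state 1), r1 = success prob. in state 1."""
    gain, q_all_zero = 0.0, 1.0  # q_all_zero = P(every probe so far saw state 0)
    for j in S:
        gain += q_all_zero * (p1[j] * r1 - c[j])
        q_all_zero *= 1.0 - p1[j]
    return gain + q_all_zero * p1[i] * r1

def determine_best_backup(p1, r1, c):
    """Return (i*, Gain(i*)): for every candidate backup i, form
    S_i = {j != i : (1 - p1[i]) * p1[j] * r1 > c[j]}, probe it in
    decreasing order of p1[j]/c[j] (Lemma 4.2), and keep the best."""
    n = len(p1)
    order = sorted(range(n), key=lambda j: p1[j] / c[j], reverse=True)
    best_i, best_gain = None, float("-inf")
    for i in range(n):
        S_i = [j for j in order if j != i and (1 - p1[i]) * p1[j] * r1 > c[j]]
        g = exhaust_gain(S_i, i, p1, r1, c)
        if g > best_gain:
            best_i, best_gain = i, g
    return best_i, best_gain

i_star, g_star = determine_best_backup([0.9, 0.6, 0.3], 1.0, [0.2, 0.05, 0.01])
print(i_star, round(g_star, 3))
```

By Lemmas 4.1 and 4.2, the returned gain equals the optimum over all policies; for small n this can be cross-checked by enumerating every ordered subset S and backup i.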
We now explain the computation of the gains of the EXHAUST policies in the DETERMINEBESTBACKUP routine. The channels are numbered in decreasing order of $p_{1j}/c_j$. Now, $F_{|S_i|+1}$ is the gain of sequentially probing the first $|S_i|$ channels, and $D_{|S_i|+1}$ is the probability that the first $|S_i|$ channels are in state $0$. If $i > |S_i|$, then the first $|S_i|$ channels constitute $S_i$. Thus the gain of EXHAUST(S_i, i) is $F_{|S_i|+1} + p_{1i} r_1 D_{|S_i|+1}$. If $i \le |S_i|$, the first $|S_i| + 1$ channels constitute $S_i \cup \{i\}$. Thus
\[
F_i + \frac{F_{|S_i|+2} - F_i - (p_{1i} r_1 - c_i) D_i}{1 - p_{1i}}
\]
is the gain obtained by probing the channels in $S_i$ in decreasing order of $p_{1j}/c_j$, and $D_{|S_i|+2} / (1 - p_{1i})$ is the probability that all channels in $S_i$ are in state $0$. Thus the gain of EXHAUST(S_i, i) is
\[
F_i + \frac{F_{|S_i|+2} - F_i - (p_{1i} r_1 - c_i) D_i}{1 - p_{1i}} + \frac{p_{1i} r_1}{1 - p_{1i}} D_{|S_i|+2}.
\]
The computation time of the DETERMINEBESTBACKUP routine is dominated by the time required to sort the channels, which is $O(n \log n)$.

The following TWOSTATEOPT policy is EXHAUST$(S_{i^*}, i^*)$, where $i^*$ is the BESTBKUP channel returned by the routine DETERMINEBESTBACKUP.

TWOSTATEOPT
1. Probe the channels $j \in S_{\text{BESTBKUP}}$ until a probed channel is in state $1$ or all channels in $S_{\text{BESTBKUP}}$ have been probed.
2. If the last probed channel $j$ is in state $1$, transmit the packet in $j$; else transmit the packet in the BESTBKUP channel.

Theorem 4.3. TWOSTATEOPT attains the maximum gain when $K = 2$, and can be computed in $O(n \log n)$ time.

Since TWOSTATEOPT is the EXHAUST$(S_{i^*}, i^*)$ policy that attains the maximum gain among all EXHAUST(S_i, i) policies that probe the channels in $S_i$ in decreasing order of $p_{1j}/c_j$, its optimality follows from Lemmas 4.1 and 4.2.
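The backup-selection computation above can be sketched in Python as follows. This is a hedged illustration rather than the paper's implementation: it uses the direct $O(n^2)$ evaluation of each EXHAUST$(S_i, i)$ gain instead of the $O(n \log n)$ prefix-sum computation, and the names `p1`, `c`, `r1`, and `best_backup_two_state` are assumed representations of $p_{1i}$, $c_i$, and $r_1$.

```python
def best_backup_two_state(p1, c, r1):
    """Return (i*, gain) maximizing the gain of EXHAUST(S_i, i) over backups i.

    p1[i]: probability that channel i is in state 1 (assumed p1[i] < 1)
    c[i]:  probing cost of channel i
    r1:    reward of transmitting in a channel known to be in state 1
    """
    n = len(p1)
    # Probe candidates in decreasing order of p1/c (Lemmas 4.1 and 4.2).
    order = sorted(range(n), key=lambda j: p1[j] / c[j], reverse=True)
    best, best_gain = None, float("-inf")
    for i in range(n):
        # S_i: the channels worth probing when i is reserved as the backup.
        S = [j for j in order if j != i and (1 - p1[i]) * p1[j] * r1 > c[j]]
        gain, all_zero = 0.0, 1.0  # all_zero = Pr[every probe so far saw state 0]
        for j in S:
            gain += all_zero * (p1[j] * r1 - c[j])
            all_zero *= 1 - p1[j]
        gain += all_zero * p1[i] * r1  # fall back to the backup channel i
        if gain > best_gain:
            best, best_gain = i, gain
    return best, best_gain
```

For instance, with `p1 = [0.5, 0.9]`, `c = [0.01, 0.5]`, `r1 = 1.0`, channel 1 is too expensive to probe but is the better backup: probing channel 0 first and falling back to channel 1 yields gain $0.49 + 0.5 \cdot 0.9 = 0.94$.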
The computation time for TWOSTATEOPT is the same as that for the routine DETERMINEBESTBACKUP, which is $O(n \log n)$. Thus Theorem 4.3 follows. Clearly, TWOSTATEOPT can be executed and stored in $O(n)$ time and space.

5 The Multi-State Saturated Sender Problem

We still assume that the sender always has packets to transmit, but now focus on the case where each channel has $K$ states with $K > 2$. We first demonstrate that some natural generalizations of the TWOSTATEOPT policy are suboptimal when $K > 2$.

Example 5.1. Recall that TWOSTATEOPT probes channels in increasing order of $c_j / p_{1j}$. Thus, for $K > 2$, the natural generalizations of TWOSTATEOPT are to probe channels in decreasing order of the ratio between (a) their probabilities of being in the highest state and their costs (i.e., $p_{K-1,j}/c_j$), or (b) their expected rewards and their costs (i.e., $\sum_{k=0}^{K-1} p_{kj} r_k / c_j$). Figure 1 presents one scenario where both these probing sequences are suboptimal. Note that $i$ has the least, $j$ an intermediate, and $k$ the maximum expected reward, and $p_{2k}/c_k > p_{2j}/c_j > p_{2i}/c_i$. But OPT probes $i$ before probing $j$.

The main challenge for $K > 2$ is that the optimal probing sequence needs to be determined adaptively, depending on the outcomes of the previous probes in a slot (Figure 1). For example, when $K = 3$ and a probed channel is in the intermediate state (state $1$), the subsequent probes should be limited to channels that have high probabilities of being in state $2$. However, if all channels probed so far in a slot are in state $0$, then channels that have high probabilities of being in state $1$ may also be probed subsequently. We show that in $O(n^2 K)$ time we can compute a policy which attains $4/5$ of the maximum gain, for arbitrary distributions of the state processes and arbitrary costs.
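The constant $4/5$ arises from a bound of the form $G_{\text{BestReserveBkup}} \ge G_{\text{Opt}} / (1 + \alpha(1-\alpha))$, where $\alpha$ is the probability that the optimum policy uses its backup (this is established in the proof of Theorem 5.3). A quick numeric scan over $\alpha$, offered here only as a hedged sanity check, confirms that the worst case of this bound is $\alpha = 1/2$:

```python
# Sanity check of the 4/5 constant: the ratio 1/(1 + a*(1-a)) over a in [0, 1]
# is minimized at a = 1/2, where it equals 1/1.25 = 4/5.
ratios = [1.0 / (1.0 + a * (1.0 - a)) for a in (k / 1000.0 for k in range(1001))]
assert abs(min(ratios) - 0.8) < 1e-9
assert all(r >= 0.8 - 1e-12 for r in ratios)
```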
However, more importantly, we develop techniques and ideas, such as the Structure Theorem below, which are useful beyond the context of this specific problem. In fact, we will use the Structure Theorem for all the problems considered in the rest of this paper.

Figure 1: Consider a node $U$ that has access to $3$ channels $i, j, k$, each of which has $3$ states. Let $r_2 = 1$ and $r_1 = 0.1$. The probabilities associated with the different states of $i, j, k$ are $(0.49, 0.02, 0.49)$, $(0.5, 0.01, 0.49)$, and $(0.5, 0.5 - \delta, \delta)$ respectively. Also, $c_i = 0.05885\,\delta$, $c_j = 0.06\,\delta$, $c_k = 0.05\,\delta$, with $\delta < 0.15$. The figure shows the decision tree of OPT in this case. A channel is probed at each probe node, and the letter inside the node indicates which channel is probed there. The numbers next to the branches indicate the outcome of the probe; a label $r/s$ next to a branch indicates that both states $r$ and $s$ of the previously probed channel lead to the same action. For example, the sender first probes channel $i$. If $i$ is in state $2$, it transmits in $i$. If $i$ is in state $1$, it probes $k$; if $i$ is in state $0$, it probes $j$.

5.1 The Roadmap and the Main Results

The main component of the algorithm is the following theorem, which is proved in Section 5.2.

Theorem 5.1 (Structure Theorem). There exists an optimum policy which uses a unique backup channel.

The theorem states that there exists an optimum policy OPT and a channel $\ell$ such that whenever OPT uses a backup, it uses $\ell$ as the backup. Note that the above holds trivially for the case $K = 2$, because if at any point the sender observes a channel to be in state $1$, there is no further benefit of probing. Thus the strategy corresponds to a path along which every probed channel is observed to be in state $0$.
Note that another interesting property of the case $K = 2$ is that the backup channel is never probed. This motivates the following definitions.

Definition 5.1. Let $P(\ell)$ denote the class of policies, each of which (a) never probes $\ell$ and (b) never uses any channel other than $\ell$ as a backup. Let $P(0)$ correspond to the class of policies that never use backup channels (note that the channels are numbered $1, 2, \ldots, n$). Let the policy that attains the maximum gain among all policies in $P(\ell)$ be denoted RESERVEBKUP$(\ell)$, and the policy that attains the maximum gain among all policies in $\cup_{\ell=0}^{n} P(\ell)$ be denoted BESTRESERVEBKUP. Let $G_{\text{BestReserveBkup}}$ be the gain of BESTRESERVEBKUP.

The following theorems indicate why BESTRESERVEBKUP is of interest.

Theorem 5.2. For $\ell = 0, 1, \ldots, n$, we can compute a policy RESERVEBKUP$(\ell)$ that attains the maximum gain among all policies in $P(\ell)$ in time $O(nK \log n)$.

RESERVEBKUP$(\ell)$ is presented in Section 5.3, and the above theorem is proved in Section 5.4. Therefore, clearly, we can compute BESTRESERVEBKUP in time $O(n^2 K \log n)$. In Section 5.3 we argue that we can in fact compute BESTRESERVEBKUP in time $O(n^2 K)$.

Theorem 5.3. $G_{\text{BestReserveBkup}} \ge (4/5)\, G_{\text{Opt}}$.

Proof. By the Structure Theorem (Theorem 5.1), there exists an optimum policy OPT that uses a unique backup, say $B$. Let $\alpha$ denote the probability with which OPT uses $B$ as a backup. Construct a new policy $A$ that is similar to OPT except that it probes $B$ (and then transmits in it) whenever OPT uses $B$ as a backup. Clearly, $A$ attains a gain of at least $G_{\text{Opt}} - \alpha c_B$. Also, since $A \in P(0)$, its gain is at most $G_{\text{BestReserveBkup}}$. Thus,
\[
G_{\text{BestReserveBkup}} \ge G_{\text{Opt}} - \alpha c_B. \tag{1}
\]
We now show that there exists a policy which never probes $B$, but does not perform significantly worse than OPT.
In this discussion, the gain $G(T)$ of a sub-tree $T$ rooted at $t$ is defined as the expected reward owing to transmissions at the leaves of $T$, conditioned on reaching $t$, minus the expected probing cost of the nodes in $T$. Suppose that OPT probes $B$ at nodes $m_1, \ldots, m_J$ in its decision tree. Let $\beta_1, \ldots, \beta_J$ be the respective probabilities that OPT traverses these nodes, and $G_1, \ldots, G_J$ the respective gains of OPT given that it traverses these nodes. Now consider the gains $G'_1, \ldots, G'_J$ of the subtrees produced by modifying the decision tree so that $B$ is not probed. This produces a decision tree $\tau$ which is the same as that of OPT except for the trees rooted at $m_1, \ldots, m_J$; $\tau$ reaches these nodes with the same probabilities as OPT. We now state a claim which allows us to complete the proof, and prove the claim subsequently. This claim will also be used for other results.

Claim 5.4. Let $T$ be any decision (sub-)tree rooted at a node $t$ where we probe channel $u$, and let its gain be $G(T)$. Suppose that at the point of arriving at decision node $t$, the best probed channel has state at least $j \ge 0$ (equivalently, reward at least $r_j$). Then there exists a corresponding (sub-)tree $T'$, in which $u$ is probed neither at $t$ nor anywhere else, whose gain satisfies $G(T') \ge G(T) + c_u - \sum_{i=j+1}^{K-1} p_{iu} r_i$.

We first complete the proof of the theorem using Claim 5.4. Setting $j = 0$ (and using $r_0 = 0$), it follows that $G_k - G'_k \le \sum_{i=0}^{K-1} p_{iB} r_i - c_B$. The difference between the overall gains of OPT and $\tau$ is $\sum_{k=1}^{J} \beta_k (G_k - G'_k)$. Since $\sum_{k=1}^{J} \beta_k \le 1 - \alpha$, the policy $\tau$, which never probes $B$, attains a gain of at least $G_{\text{Opt}} - (1 - \alpha)(\sum_{i=0}^{K-1} p_{iB} r_i - c_B)$. Since $\tau$ never probes $B$, we have $\tau \in P(B)$. Thus,
\[
G_{\text{BestReserveBkup}} \ge G_{\text{Opt}} - (1 - \alpha) \Big( \sum_{i=0}^{K-1} p_{iB} r_i - c_B \Big). \tag{2}
\]
Multiplying Equations (1) and (2) by $(1 - \alpha)$ and $\alpha$ respectively and adding the results, we have $G_{\text{BestReserveBkup}} \ge G_{\text{Opt}} - \alpha(1 - \alpha) \sum_{i=0}^{K-1} p_{iB} r_i$. Now, the policy that uses $B$ as a backup without probing any channel attains a gain of $\sum_{i=0}^{K-1} p_{iB} r_i$. Since this policy is in $P(B)$, $\sum_{i=0}^{K-1} p_{iB} r_i \le G_{\text{BestReserveBkup}}$. Thus $G_{\text{BestReserveBkup}} \ge G_{\text{Opt}} / (1 + \alpha(1 - \alpha))$. The result follows since the maximum value of the denominator, attained at $\alpha = 1/2$, is $1.25$.

This proves the theorem; we now focus on Claim 5.4.

(Proof of Claim 5.4.) Let $F_{iu}$ denote the set of leaf nodes in $T$ where the decision is to transmit on the probed channel $u$ in state $i$ (note that this happens only if $i > j$). Let $W$ be the event that $t$ is reached, and $\hat{F}_{iu}$ the event that $F_{iu}$ is reached. First note that $\Pr[\hat{F}_{iu} \mid W] \le p_{iu}$ for all states $i$. Now construct a corresponding decision tree $T''$ in which the decision at $F_{iu}$ is to transmit on (a) the best probed channel other than $u$ if channels other than $u$ have been probed, and (b) a backup channel otherwise. Clearly,
\[
G(T'') \ge G(T) - \sum_{i=j+1}^{K-1} \Pr[\hat{F}_{iu} \mid W]\, r_i \ge G(T) - \sum_{i=j+1}^{K-1} p_{iu} r_i.
\]
In the decision tree $T''$, the transmission is never on the probed channel $u$. Therefore, at the root node $t$ of $T''$ where $u$ is probed, suppose we do not actually probe $u$ (saving cost $c_u$) but simply choose the branch corresponding to $u$ being in state $i$ with probability $p_{iu}$; this new decision tree $T^*$ has gain
\[
G(T^*) \ge G(T'') + c_u \ge G(T) - \sum_{i=j+1}^{K-1} p_{iu} r_i + c_u.
\]
This new decision tree neither probes nor uses $u$. Denote by $T_i$ the sub-tree of $T''$ below $t$ corresponding to channel $u$ being in state $i$. If $i^* = \arg\max_{0 \le i \le K-1} G(T_i)$, we have
\[
G(T^*) = \sum_{i=0}^{K-1} p_{iu}\, G(T_i) \le \max_{0 \le i \le K-1} G(T_i) = G(T_{i^*}).
\]
Modify $T^*$ so that on reaching node $t$, branch $T_{i^*}$ is chosen; denote the result $T'$.
Then,
\[
G(T') = G(T_{i^*}) \ge G(T^*) \ge G(T) - \sum_{i=j+1}^{K-1} p_{iu} r_i + c_u.
\]
The result follows.

Note that there are cases where BESTRESERVEBKUP is strictly suboptimal (e.g., Figure 1, where OPT probes the backup channel $k$ on some paths). But in practice, the gain of BESTRESERVEBKUP substantially exceeds the lower bound of Theorem 5.3. For example, even in Figure 1 (one of the few cases in which we observed the suboptimality of BESTRESERVEBKUP), the gain of RESERVEBKUP$(k)$, and hence that of BESTRESERVEBKUP, is only $0.1\%$ less than that of OPT.

We now point out an important property of RESERVEBKUP$(0)$. Recall that $P(0)$ consists of all policies that transmit only in probed channels and never use backup channels. Thus RESERVEBKUP$(0)$, which will henceforth be denoted OPTNOBKUP, attains the maximum gain among all such policies. From Theorem 5.2, OPTNOBKUP can be determined in $O(nK + n \log n)$ time. Thus the optimum policy is polynomial-time computable when backups are not allowed. Finally, when $K = 2$, every policy is in the class $\cup_{u=0}^{n} P(u)$, and hence Theorem 5.2 also proves that the optimal policy can be computed in $O(n^2)$ time for $K = 2$. However, for $K = 2$ we have already obtained an optimal policy with a lower running time.

5.2 Proof of the Structure Theorem

Definition 5.2. A $\le i$ tree is a decision tree which takes the same decisions irrespective of the states of the probed channels, provided those states are less than or equal to $i$. The decisions corresponding to the states that are less than or equal to $i$ therefore constitute a path in such a tree, which we refer to as a $\le i$ path.

We prove the Structure Theorem 5.1 (in fact, a slightly stronger version which we will require later) using the following lemma.

Lemma 5.5.
Suppose OPT probes a channel $j$ at a node $m$ in its decision tree, and if $j$ is in state $i$ it takes a backup downstream. Then there exists another optimum policy OPT1 which has the same decision tree as OPT except possibly for the tree rooted at $m$. In OPT1, the tree rooted at $m$:

1. is a $\le i$ tree,
2. takes a backup, say $\ell$, at the end of its $\le i$ path, and
3. takes $\ell$ as a backup wherever it takes a backup.

Proof. We prove the lemma by induction on the states. The lemma holds vacuously for all channels $j$ and nodes $m$ if $i = K - 1$, because OPT never takes a backup after observing a channel in state $K - 1$. Suppose the lemma holds for all channels $j$ and nodes $m$ in the decision tree of OPT for states $i + 1, \ldots, K - 1$. We prove the lemma for all channels $j$, all nodes $m$, and state $i$.

Now, let OPT probe a channel $j$ at a node $m$ in its decision tree and take a backup somewhere downstream if $j$ is in state $i$. Let $m_1$ be the first node after $j$ is probed at $m$ and observed to be in state $i$. Clearly, the decision tree rooted at $m_1$ is a $\le i$ tree and takes at least one backup somewhere downstream. We will first show that there is an optimum policy OPT2 which is identical to OPT except possibly for the tree rooted at $m_1$. The tree rooted at $m_1$ in OPT2 is still a $\le i$ tree, but (p1) takes a backup, say $\ell$, at the end of its $\le i$ path, and (p2) takes $\ell$ as a backup wherever it takes a backup.

Suppose the tree $T$ rooted at $m_1$ in OPT does not satisfy the above conditions. Then there is a path originating from its $\le i$ path which ends in a backup. Let $m_2$ be the highest node (i.e., the node closest to $m_1$) on the $\le i$ path of $T$ from which such a path originates. Clearly, this path corresponds to a channel being observed in a state higher than $i$, say $q$, at $m_2$. From the induction hypothesis, there exists an optimum OPT2 which is identical to OPT except possibly for the decision tree rooted at $m_2$.
In OPT2, this decision tree
- is a $\le q$ tree,
- takes a backup, say $\ell$, at the end of its $\le q$ path, and
- takes $\ell$ as a backup wherever it takes a backup.

Figure 2: The transformation of OPT2 to OPT1 in Lemma 5.5.

Note that OPT2 satisfies conditions (p1) and (p2) for the tree rooted at $m_1$; see Figure 2. Let $\alpha$ be the probability that OPT2 never visits $m$, $G_0$ the expected gain of OPT2 if it never visits $m$, and $G_h$ the expected gain of OPT2 given that $j$ is observed in state $h$ at node $m$. Thus the expected gain of OPT2 is $\alpha G_0 + (1 - \alpha) \sum_h p_{hj} G_h$. Let $T_h$ be the decision tree in OPT2, and hence in OPT, after $j$ is observed to be in a state $h < i$ after being probed at $m$. Consider a new policy which is identical to OPT2 except that it replaces the decision tree rooted at $m_1$ with $T_f$ for some $f < i$. Note that the gain of this new policy, given that $j$ is observed in state $i$ at node $m$, is at least $G_f$, since the overall gain after observing state $i$ followed by a given sequence of actions cannot be less than that after observing state $f$ followed by the same sequence of actions. Thus the expected gain of this new policy is at least $\alpha G_0 + (1 - \alpha) \sum_{h \neq i} p_{hj} G_h + (1 - \alpha) p_{ij} G_f$. Since the expected gain of this policy cannot exceed that of the optimum, $G_f \le G_i$.

Now consider another policy OPT1 which is identical to OPT2 except that it replaces the decision trees $T_0, \ldots, T_{i-1}$ (i.e., those rooted at the nodes immediately downstream of $m$ and corresponding to $j$ being in states lower than $i$) with the decision tree rooted at $m_1$ (i.e., the one corresponding to $j$ being in state $i$); refer to Figure 2. Since the decision tree rooted at $m_1$ is a $\le i$ tree and its $\le i$ path ends in a backup, the gain of OPT1 is
\[
\alpha G_0 + (1 - \alpha) \Big( G_i \sum_{h \le i} p_{hj} + \sum_{h > i} p_{hj} G_h \Big).
\]
Thus, since $G_f \le G_i$ for $f < i$, the expected gain of OPT1 is at least as high as that of OPT2, so OPT1 is also optimum. Note that OPT1 is identical to OPT except possibly for the decision tree rooted at $m$, which is a $\le i$ tree and satisfies conditions (p1) and (p2). The result follows.

We now state and prove a theorem which implies the Structure Theorem.

Theorem 5.6 (Implies the Structure Theorem). There exists an optimum policy that uses a unique backup channel. If such an optimum policy probes at least one channel, it uses the backup channel at the end of a $\le i$ path originating from the root of its decision tree.

Proof. Consider an optimum policy that does not use a backup channel at all. Consider the path in its decision tree which corresponds to all probed channels being in state $0$. Modify the policy so as to use the last channel probed on this path as a backup. Note that the gain does not decrease, so the modified policy is also optimum. Thus there always exists an optimum policy that uses a backup channel at the end of some path in its decision tree. If one such optimum policy OPT3 does not probe any channel, the theorem follows. So let OPT3 probe a channel. Clearly, then, OPT3 probes a channel $j$ at the root node of its decision tree, say $m$. Let $i$ be the highest state of $j$ for which OPT3 transmits in a backup channel somewhere downstream of $m$. Then, by Lemma 5.5, there exists another optimum policy OPT4 for which the decision tree rooted at $m$, and hence the overall decision tree, (a) is a $\le i$ tree, (b) uses a channel, say $\ell$, as a backup at the end of the $\le i$ path of the tree, and (c) uses $\ell$ as a backup whenever it uses a backup. The theorem follows.

5.3 The Policy RESERVEBKUP$(\ell)$: Algorithm and Intuition

We require the following definitions to specify the policy RESERVEBKUP$(\ell)$.

Definition 5.3. For $i = 1, \ldots, n$, let
\[
\tilde{r}_i[u] = \frac{\sum_{v=u}^{K-1} p_{vi}\, r_v}{\sum_{v=u}^{K-1} p_{vi}} \quad \text{and} \quad \tilde{p}_i[u] = \sum_{v=u}^{K-1} p_{vi}.
\]
Let $\tilde{r}_0[0] = -1$ and $r_{-1} = -1$. Let $w_\ell = \min_u \{u : r_u > \tilde{r}_\ell[0]\}$.¹ Note that $\tilde{r}_\ell[0]$ is the expected reward if the sender transmits in $\ell$ as a backup.

Definition 5.4. Let $H_{u,\ell} = \emptyset$ for all $u \ge K$. For each $\ell$, starting from $u = K - 1$ down to $u = w_\ell$, recursively define
\[
H_{u,\ell} = \Big\{ i \; : \; i \notin \bigcup_{v > u} H_{v,\ell}, \ \ \tilde{r}_i[u] - \frac{c_i}{\tilde{p}_i[u]} > \max(\tilde{r}_\ell[0],\, r_{u-1}) \Big\} \setminus \{\ell\}.
\]
Assume that $c_i / \tilde{p}_i[u] = \infty$ when $\tilde{p}_i[u] = 0$.

¹ Note that $w_\ell$ is well-defined for each $\ell$, since (a) $\tilde{r}_\ell[0] \ge r_0 = 0$ and (b) $\tilde{r}_\ell[0] < r_{K-1}$, which follows since $p_{K-1,\ell} < 1$ for each $\ell$.

RESERVEBKUP$(\ell)$

(Probing process:) Set $u = K - 1$. While $u \ge w_\ell$ and the highest state of a probed channel is lower than $u$: probe the channels in $H_{u,\ell}$ in non-increasing order of $\tilde{r}_j[u] - c_j / \tilde{p}_j[u]$; then set $u \leftarrow u - 1$.

(Selection process:) Consider the channel $j$ in the highest state $y$ among all probed channels (if no channel has been probed, let $j = -1$). If $r_y \ge \tilde{r}_\ell[0]$, transmit in $j$; else transmit in $\ell$.

Figure 3: Illustration of RESERVEBKUP$(\ell)$ on the instance of Figure 1.

Refer to Figure 3 for examples elucidating RESERVEBKUP$(\ell)$. We now explain why RESERVEBKUP$(\ell)$ attains the same gain as OPT$(\ell)$ for $\ell = 0, \ldots, n$. First, observe that RESERVEBKUP$(\ell) \in P(\ell)$. By definition, $\ell \notin H_{u,\ell}$ for any $u$; thus RESERVEBKUP$(\ell)$ never probes $\ell$ (refer to the probing process). Also, note that RESERVEBKUP$(\ell)$ does not use any channel other than $\ell$ as a backup (refer to the selection process).
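The construction of $w_\ell$ and the probing sets $H_{u,\ell}$, together with the non-increasing probing order within each set, can be sketched as follows. This is a hedged Python sketch: the function name and the `p[i][v]`, `r[v]`, `c[i]` representation are assumptions, and `ell=None` stands in for the no-backup case $\ell = 0$ (with $\tilde{r}_0[0] = -1$).

```python
def reserve_bkup_sets(p, r, c, ell=None):
    """Compute w_ell and the probing sets H[u] of RESERVEBKUP(ell).

    p[i][v]: probability that channel i is in state v (v = 0..K-1)
    r[v]:    reward of state v, increasing in v, with r[0] = 0
    c[i]:    probing cost of channel i
    ell:     backup channel index, or None for the no-backup case (ell = 0)
    """
    n, K = len(p), len(r)
    def p_til(i, u):                       # Pr[state of i is >= u]
        return sum(p[i][u:])
    def r_til(i, u):                       # E[reward of i | state >= u]
        m = p_til(i, u)
        return sum(p[i][v] * r[v] for v in range(u, K)) / m if m > 0 else 0.0
    base = -1.0 if ell is None else r_til(ell, 0)   # reward of the bare backup
    w = min(u for u in range(K) if r[u] > base)     # w_ell
    H, used = {}, set()
    for u in range(K - 1, w - 1, -1):               # u = K-1 down to w_ell
        cand = []
        for i in range(n):
            if i == ell or i in used:
                continue
            # index r~_i[u] - c_i / p~_i[u]; treat c_i / 0 as +infinity
            score = r_til(i, u) - (c[i] / p_til(i, u) if p_til(i, u) > 0 else float("inf"))
            if score > max(base, r[u - 1] if u >= 1 else -1.0):
                cand.append((score, i))
        cand.sort(reverse=True)                     # non-increasing order of the index
        H[u] = [i for _, i in cand]
        used.update(H[u])
    return w, H
```

For example, with $K = 3$, $r = (0, 0.1, 1)$, a channel that is either useless or excellent is probed at threshold $u = 2$, while a channel that can only reach the intermediate state enters only at $u = 1$, matching the adaptive behavior described above.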
Next, since $\tilde{r}_\ell[0]$ is the expected reward when the sender transmits in the backup $\ell$, the channel selection of RESERVEBKUP$(\ell)$ is clearly optimal among policies that have the same probing sequence.

We now explain the intuition behind the design of the probing process of RESERVEBKUP$(\ell)$. First let $\ell = 0$. Once the sender observes that a probed channel is in state $u$, it cannot increase its gain any further by discovering another probed channel in state $u$ or in a lower state. Thus it subsequently probes only the channels $j$ for which the additional reward $\tilde{r}_j[u+1]\,\tilde{p}_j[u+1] - r_u\, \tilde{p}_j[u+1]$ exceeds the cost $c_j$, i.e., it probes the channels in $H_{v,\ell}$, $v > u$. The incremental gain for probing a channel $j$ is then $\tilde{r}_j[u+1]\,\tilde{p}_j[u+1] - r_u\, \tilde{p}_j[u+1] - c_j$. The probing sequence in each $H_{u,\ell}$ follows a decreasing order of the ratio between this incremental gain and the probability $\tilde{p}_j[u+1]$ that the channel is in a state higher than the highest observed state $u$.

We now comment on the major differences between the probing processes of RESERVEBKUP$(\ell)$ for $\ell > 0$ and for $\ell = 0$. Note that $\tilde{r}_\ell[0]$ is the gain if the sender transmits in $\ell$ without probing any channel. If the sender observes a channel to be in a state $u$ for which $r_u \le \tilde{r}_\ell[0]$, the observation does not increase the gain as compared to $\tilde{r}_\ell[0]$. Hence the sender considers the incremental gain to be $\tilde{r}_j[u+1]\,\tilde{p}_j[u+1] - \max(r_u, \tilde{r}_\ell[0])\, \tilde{p}_j[u+1]$ instead of $\tilde{r}_j[u+1]\,\tilde{p}_j[u+1] - r_u\, \tilde{p}_j[u+1]$, and, as before, probes only the channels for which the incremental gain exceeds the probing cost.

Finally, we determine the computation times of RESERVEBKUP$(\ell)$ and BESTRESERVEBKUP. All the sets $H_{u,\ell}$ can be evaluated in $O(nK)$ time, and the sorting required to determine the probing sequence within each $H_{u,\ell}$ needs $O(n \log n)$ time.
Thus the entire probing sequence, and hence RESERVEBKUP$(\ell)$ for any given $\ell$, can be computed in $O(nK + nK \log n)$ time, i.e., in $O(nK \log n)$ time. Now, note that the computations of the $H_{u,0}$ and the sorting order within each $H_{u,0}$ can be reused to determine the $H_{u,\ell}$ for each $\ell$ in $O(n)$ additional time. Thus all the RESERVEBKUP$(\ell)$ can be computed in $O(nK \log n)$ time. The gain of each RESERVEBKUP$(\ell)$ can be computed in $O(nK)$ time. Thus BESTRESERVEBKUP can be computed in $O(n^2 K)$ time.

5.4 Proof of Theorem 5.2

We now show that the policy RESERVEBKUP$(\ell)$ described in Section 5.3 is optimal in the class of policies $P(\ell)$. This completes the proof of Theorem 5.2, since we have already shown in Section 5.3 that RESERVEBKUP$(\ell)$ can be computed in $O(nK \log n)$ time for any given $\ell$.

First, observe that the optimal policy in the class $P(\ell)$ need not be unique. We take OPT$(\ell)$ to be an optimal policy in $P(\ell)$ that satisfies the following property: if channel $j$ is the last channel probed on a path in the decision tree of OPT$(\ell)$, and $m$ is the node at which $j$ is probed, then OPT$(\ell)$ would attain a lesser gain if it were not to probe $j$ at $m$. Clearly, such optimal policies exist, and can be obtained by progressively removing the lowest node at which a channel is probed and which can be removed without reducing the gain.

We first observe the following about OPT$(\ell)$. Let $u$ be the highest state of a probed channel when OPT$(\ell)$ terminates its probing process. Then OPT$(\ell)$ transmits in the probed channel if $r_u \ge \tilde{r}_\ell[0]$, and transmits in $\ell$ otherwise. Thus the channel selection of RESERVEBKUP$(\ell)$ is optimal among all policies that have the same probing sequence.
We now state and prove three lemmas which establish that the probing process of RESERVEBKUP$(\ell)$ is optimal in $P(\ell)$, and thereby prove Theorem 5.2.

Lemma 5.7.
1. If all channels in $\bigcup_{v > \max(u, w_\ell - 1)} H_{v,\ell}$ have already been probed, and the best state seen so far is $u$, then OPT$(\ell)$ does not probe any further.
2. OPT$(\ell)$ cannot terminate the probing process when there is an un-probed channel in $\bigcup_{v > \max(u, w_\ell - 1)} H_{v,\ell}$ and the best state seen so far is $u$.

Proof. The first part of the lemma clearly holds for $u = K - 1$, since the optimal policy does not probe any further after observing a channel in state $K - 1$. Suppose the first part does not hold for some $u < K - 1$. Then, although all channels in $\bigcup_{v > \max(u, w_\ell - 1)} H_{v,\ell}$ have already been probed, OPT$(\ell)$ probes further channels. Let $j$ be the last channel probed by OPT$(\ell)$ on one such path, probed at node $m$ of the decision tree, and let $u$ be the highest state of a channel probed before $j$ is probed at $m$. Note that after probing $j$, OPT$(\ell)$ transmits in (a) the backup $\ell$ if the maximum of $u$ and $j$'s state is $w_\ell - 1$ or lower, and (b) the channel that has the highest state among all probed channels otherwise. Now consider another policy $C \in P(\ell)$ which is identical to OPT$(\ell)$ except that it does not probe $j$ at node $m$, and instead transmits in (a) the backup $\ell$ if $u < w_\ell$, and (b) the probed channel that is in state $u$ otherwise. Let OPT$(\ell)$, and hence $C$, reach node $m$ with probability $\alpha$. Clearly $\alpha > 0$, else node $m$ could be removed from the decision tree of OPT$(\ell)$ without reducing its gain. Let $\Delta$ be the difference between the gains of OPT$(\ell)$ and $C$. We will arrive at a contradiction by showing that $\Delta \le 0$; hence the first part of the lemma holds.
\[
\begin{aligned}
\Delta/\alpha &= \sum_{k=0}^{K-1} p_{kj} \max(r_k, r_u, \tilde{r}_\ell[0]) - c_j - \max(r_u, \tilde{r}_\ell[0]) \\
&= \sum_{k=\max(u, w_\ell-1)+1}^{K-1} p_{kj} \big( \max(r_k, r_u, \tilde{r}_\ell[0]) - \max(\tilde{r}_\ell[0], r_u) \big) - c_j \\
&= \sum_{k=\max(u, w_\ell-1)+1}^{K-1} p_{kj} \big( r_k - \max(\tilde{r}_\ell[0], r_u) \big) - c_j.
\end{aligned}
\]
First, let $u \ge w_\ell - 1$. Then $j \notin \bigcup_{v > u} H_{v,\ell}$, and thus $\Delta < 0$. Now let $u < w_\ell - 1$. Then $r_u \le r_{w_\ell - 1} \le \tilde{r}_\ell[0]$, and thus $\Delta/\alpha = \sum_{k=w_\ell}^{K-1} p_{kj} (r_k - \max(\tilde{r}_\ell[0], r_{w_\ell - 1})) - c_j$. Also, $j \notin \bigcup_{v \ge w_\ell} H_{v,\ell}$, and thus $\Delta < 0$.

The second part of the lemma holds vacuously when $u = K - 1$, since $H_{K,\ell} = \emptyset$. Let $u < K - 1$. Let $\alpha_1 > 0$ be the probability that OPT$(\ell)$ terminates the probing process even though there is an un-probed channel in $H_{v,\ell}$ for some $v > \max(u, w_\ell - 1)$ and the best state seen so far is $u$. Whenever the above event happens, OPT$(\ell)$ transmits in (a) $\ell$ if $u < w_\ell$, and (b) a probed channel that is in state $u$ otherwise. Consider another policy $D \in P(\ell)$ which is identical to OPT$(\ell)$ except that whenever the above event happens, it probes an additional channel $j \in H_{v,\ell}$ and transmits in (a) $\ell$ if the maximum of $u$ and $j$'s state is $w_\ell - 1$ or lower, and (b) the probed channel that has the highest state otherwise. Let $\Delta_1$ be the difference between the gains of OPT$(\ell)$ and $D$. We will show that $\Delta_1 < 0$, which is a contradiction; hence the second part of the lemma holds.
\[
\begin{aligned}
\Delta_1/\alpha_1 &= \max(r_u, \tilde{r}_\ell[0]) - \sum_{k=0}^{K-1} p_{kj} \max(r_k, r_u, \tilde{r}_\ell[0]) + c_j \\
&= c_j - \sum_{k=\max(u, w_\ell-1)+1}^{K-1} p_{kj} \big( r_k - \max(\tilde{r}_\ell[0], r_u) \big) \\
&= c_j - \sum_{k=\max(u, w_\ell-1)+1}^{v-1} p_{kj} \big( r_k - \max(\tilde{r}_\ell[0], r_u) \big) - \sum_{k=v}^{K-1} p_{kj} \big( r_k - \max(\tilde{r}_\ell[0], r_u) \big) \\
&\le c_j - \sum_{k=v}^{K-1} p_{kj} \big( r_k - \max(\tilde{r}_\ell[0], r_{v-1}) \big).
\end{aligned}
\]
The last inequality follows since $r_k \ge \max(\tilde{r}_\ell[0], r_u)$ for $k > \max(u, w_\ell - 1)$, and $r_{v-1} \ge r_u$ since $v \ge \max(u, w_\ell - 1) + 1$. The result follows since $j \in H_{v,\ell}$.

Lemma 5.8. Let $w_\ell \le u < K$.
OPT$(\ell)$ probes all channels in $H_{u,\ell}$ before probing any channel that is not in $\bigcup_{v \ge u} H_{v,\ell}$. Also, OPT$(\ell)$ probes all channels of $H_{u,\ell}$ unless one of the probed channels is in state $u$ or a higher state, and probes these channels in non-increasing order of $\tilde{r}_j[u] - \frac{c_j}{\tilde{p}_j[u]}$.

Proof. Suppose the lemma does not hold. Then there exists a node in the decision tree of OPT$(\ell)$, which OPT$(\ell)$ visits with positive probability,² at which the decisions violate the lemma. Let $m$ be such a node which is also farthest from the root among those that violate the lemma. Then there exists a state $q \ge w_\ell$ such that some channel in $H_{q,\ell}$ has not been probed by OPT$(\ell)$ before it visits $m$, while the best state seen so far is $q - 1$ or worse. Let $u$ be the highest state that satisfies both of the above criteria, and let $j$ be the channel with the largest value of $\tilde{r}_j[u] - \frac{c_j}{\tilde{p}_j[u]}$ among the un-probed channels of $H_{u,\ell}$. At node $m$, OPT$(\ell)$ does not probe $j$, in violation of the lemma. From the second part of Lemma 5.7, the probing process of OPT$(\ell)$ cannot terminate at $m$. Thus OPT$(\ell)$ probes some channel $i \neq j$ at node $m$. Note that $i \notin \bigcup_{v > u} H_{v,\ell}$. Since OPT$(\ell)$ has already probed all channels in $\bigcup_{v > u} H_{v,\ell}$, if channel $i$ is in state $u$ or a higher state, OPT$(\ell)$ does not probe any further (first part of Lemma 5.7) and transmits in $i$ (since $i$ has the highest state, say $s$, among the probed channels, and $r_s \ge r_u \ge r_{w_\ell} > \tilde{r}_\ell[0]$). If $i$ is in state $u - 1$ or a lower state, then, since $H_{u,\ell}$ has un-probed channels, the probing process cannot terminate (second part of Lemma 5.7). Now OPT$(\ell)$ probes $j$ next (otherwise $m$ would not be the node farthest from the root that violates the lemma). By similar arguments, it follows that after probing $j$, OPT$(\ell)$ transmits in $j$ if $j$ is in state $u$ or a higher state.
² If the lemma is violated at a node which OPT$(\ell)$ visits with probability $0$, we can alter the decisions at that node so as to satisfy the lemma without reducing the gain of OPT$(\ell)$. Hence, without loss of generality, we assume that the decisions of OPT$(\ell)$ satisfy the lemma at all such nodes.

The situation resembles the decision tree in Figure 4(a). The trees $T_1, \ldots, T_{u^2}$ correspond to observing the ordered pairs $(i = u', j = u'')$ where $0 \le u', u'' \le u - 1$. The square boxes denote that OPT$(\ell)$ does not probe anything else.

Figure 4: The decision trees of OPT for $u = 3$.

Let OPT$(\ell)$ not traverse node $m$ with probability $\alpha_1$, traverse node $m$ and stop after probing $i$ or $j$ with probability $\alpha_2$, and traverse node $m$ and continue after probing $j$ with probability $\alpha_3$. By assumption, $\alpha_2 > 0$. Let the conditional expected gains given these scenarios be $G_1$, $G_2$, $G_3$ respectively. Then the expected gain of OPT$(\ell)$ is $G_{\text{Opt}(\ell)} = \sum_{i=1}^{3} \alpha_i G_i$. Let the total probing cost en route to node $m$ be $C_1$. Then
\[
G_2 = \frac{1 - \alpha_1}{\alpha_2} \Big( \tilde{p}_i[u]\,\tilde{r}_i[u] - c_i + (1 - \tilde{p}_i[u]) \big[ \tilde{p}_j[u]\,\tilde{r}_j[u] - c_j \big] - C_1 \Big).
\]
Now consider an alternate policy $A$ that is identical to OPT$(\ell)$ except for the tree rooted at node $m$. Figure 4(b) shows the tree rooted at node $m$ in policy $A$ for $u = 3$. Here, $A$ probes $j$ at node $m$ and subsequently probes $i$ unless $j$ is in state $u$ or a higher state. The tree $T'$ corresponding to the ordered pair $(i = u', j = u'')$ in OPT is assigned to the branch corresponding to the ordered pair $(j = u'', i = u')$ in $A$. The contributions to the gain from the trees $T_1, \ldots, T_{u^2}$ remain the same, because in both scenarios the probabilities of reaching these trees are the same.
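The exchange step here compares probing $i$ before $j$ with probing $j$ before $i$. A hedged numeric sketch of the underlying algebra (the values of $\tilde{p}$, $\tilde{r}$, $c$ below are arbitrary): the difference between the two orderings' gains factors as $\tilde{p}_i[u]\,\tilde{p}_j[u]$ times the difference of the indices $\tilde{r}[u] - c/\tilde{p}[u]$, so the channel with the larger index should be probed first.

```python
# Numeric check of the exchange identity used in Lemma 5.8: swapping the order
# of two probes changes the gain by ptil_i * ptil_j * (index_j - index_i),
# where index_x = rtil_x - c_x / ptil_x.  All values below are arbitrary.
pi_, ri_, ci = 0.3, 0.8, 0.02   # ptil_i[u], rtil_i[u], c_i
pj_, rj_, cj = 0.5, 0.9, 0.01   # ptil_j[u], rtil_j[u], c_j
gain_j_first = pj_ * rj_ - cj + (1 - pj_) * (pi_ * ri_ - ci)
gain_i_first = pi_ * ri_ - ci + (1 - pi_) * (pj_ * rj_ - cj)
factored = pi_ * pj_ * ((rj_ - cj / pj_) - (ri_ - ci / pi_))
assert abs((gain_j_first - gain_i_first) - factored) < 1e-12
assert gain_j_first > gain_i_first   # the larger index probed first wins
```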
Thus, the expected gain of this new policy is $G_C = \alpha_1 G_1 + \alpha_2 G'_2 + \alpha_3 G_3$, where
$$G'_2 = \frac{1-\alpha_1}{\alpha_2}\Big(\tilde p_j[u]\,\tilde r_j[u] - c_j + (1-\tilde p_j[u])\big[\tilde p_i[u]\,\tilde r_i[u] - c_i\big] - C_1\Big).$$
Now, if $i \in H_{u,\ell}$, then $\tilde r_j[u] - c_j/\tilde p_j[u] > \tilde r_i[u] - c_i/\tilde p_i[u]$, which is the condition that arises from violating the non-increasing order. If $i \notin H_{u,\ell}$, then since $i \notin \cup_{v>u} H_{v,\ell}$ and $u \ge w_\ell$, we have $\tilde r_i[u] - c_i/\tilde p_i[u] \le \max(r_{u-1}, \tilde r_\ell[0])$. But $\tilde r_j[u] - c_j/\tilde p_j[u] > \max(r_{u-1}, \tilde r_\ell[0])$, since $j \in H_{u,\ell}$ and $u \ge w_\ell$. Therefore, in both cases, $\tilde r_j[u] - c_j/\tilde p_j[u] > \tilde r_i[u] - c_i/\tilde p_i[u]$. But this implies that
$$\frac{G_C - G_{OPT(\ell)}}{1-\alpha_1} = \tilde p_j[u]\,\tilde r_j[u] - c_j + (1-\tilde p_j[u])\{\tilde p_i[u]\,\tilde r_i[u] - c_i\} - \tilde p_i[u]\,\tilde r_i[u] + c_i - (1-\tilde p_i[u])\{\tilde p_j[u]\,\tilde r_j[u] - c_j\}$$
$$= \tilde p_i[u]\,\tilde p_j[u]\Big[\big(\tilde r_j[u] - c_j/\tilde p_j[u]\big) - \big(\tilde r_i[u] - c_i/\tilde p_i[u]\big)\Big] > 0.$$
Thus, since $\alpha_1 < 1$, $G_C > G_{OPT(\ell)}$, and we arrive at a contradiction. The result follows.

Lemma 5.9. OPT(ℓ) probes only channels in $\cup_{v=w_\ell}^{K-1} H_{v,\ell}$.

Proof. From the first part of Lemma 5.7, OPT(ℓ) terminates its probing process after probing all channels in $\cup_{v=w_\ell}^{K-1} H_{v,\ell}$. From Lemma 5.8, OPT(ℓ) must probe all channels in $\cup_{v=w_\ell}^{K-1} H_{v,\ell}$ before probing any channel that is not in this set. The result follows.

The optimality of the probing process of RESERVEBKUP(ℓ) in P(ℓ) follows from Lemmas 5.8 and 5.9. Thus, Theorem 5.2 follows.

6 Additive Approximation Schemes for Equal Probing Costs

We now consider the case in which all channels have equal probing costs, i.e., $c_i = c > 0$, but still allow different distributions for the state processes of different channels.
This assumption is motivated by the fact that the probing cost is often determined by the energy consumed in transmitting the probe packets, which is usually similar across channels. We still assume that the sender is saturated. We present a policy that, given any $\epsilon > 0$, attains a gain of at least $G_{OPT} - \epsilon r_{K-1}$. The time to compute this policy depends exponentially on $1/\epsilon$, but is polynomial in $n$ for any fixed $\epsilon > 0$.

Motivated by the Structure Theorem (Theorem 5.6), we consider the following class of policies.

Definition 6.1. Let $P(\ell, i)$ be the class of policies which (a) take the same actions if a probed channel is observed in state $i'$ or $i''$ at any node, when $i', i'' \le i$, and (b) take backup $\ell$ at the end of the $\le i$ path originating from the roots of their decision trees, and do not take backups anywhere else. Let $Opt(\ell, i)$ denote the optimum policy in this class, with gain $G(\ell, i)$.

We know from Theorem 5.6 that OPT is in $P(\ell^*, i^*)$ for some $\ell^* \in \{1, \ldots, n\}$ and some $i^* \in \{0, \ldots, K-1\}$. It therefore suffices to provide approximations of the $Opt(\ell, i)$ policies for the different $\ell, i$. Note that the policy that approximates $Opt(\ell, i)$ need not itself be in $P(\ell, i)$. We now prove the central lemma of this section, which also presents the approximation algorithm.

Lemma 6.1. We can compute in $O(n^{h+2} h K)$ time a policy whose gain is at least $G(\ell, i) - \epsilon r_{K-1}$, where $h = 1 + \lceil \log_{1/(1-\epsilon)}(1/\epsilon) \rceil$.

Proof. First, let $c \le \epsilon r_{K-1}$. Observe that OPTNOBKUP attains a gain of at least $G(\ell, i) - \epsilon r_{K-1}$, since it attains a gain of at least $G(\ell, i) - c$. To see the latter, note that if $Opt(\ell, i)$ is modified to first probe $\ell$ and subsequently transmit in $\ell$ wherever it was using $\ell$ as a backup, then its gain decreases by at most $c$. Thus, the modified policy has a gain of at least $G(\ell, i) - c$.
Now, since the modified policy does not use any backup channel, its gain is at most that of OPTNOBKUP. The result follows.

Next, let $c > \epsilon r_{K-1}$. Note that $Opt(\ell, i)$ does not transmit in a probed channel whose state is $i$ or less. Using a proof that is the same as that of Claim 5.4, it follows that if $Opt(\ell, i)$ probes a channel $u \ne \ell$, and $u$ is not the first channel probed, then $\sum_{j=i+1}^{K-1} p_{ju} r_j \ge c > \epsilon r_{K-1}$, and hence $\sum_{j=i+1}^{K-1} p_{ju} > \epsilon$. Now consider the $\le i$ path of $Opt(\ell, i)$ starting from the root, which ends in the backup. The probability of continuing along this path decreases by a factor of at least $1-\epsilon$ for each additional node after the first node (the escape probability is at least $\epsilon$ after the first node). Therefore, the probability $q$ of continuing beyond $h$ nodes on this path is less than $\epsilon$, since $h = 1 + \lceil \log_{1/(1-\epsilon)}(1/\epsilon) \rceil$. Thus, if $Opt(\ell, i)$ is modified to take the backup after $h$ nodes on this path, the gain of the modified policy is at least $G(\ell, i) - \epsilon r_{K-1}$.

Let $P(\ell, i, h)$ be the class of policies that are in $P(\ell, i)$ and have $h$ or fewer nodes on the $\le i$ path originating from their roots. Thus, the policy with the maximum gain among all policies in $P(\ell, i, h)$ has gain at least $G(\ell, i) - \epsilon r_{K-1}$, and the result follows if we can compute the maximum-gain policy in $P(\ell, i, h)$ in $O(n^{h+2} h K)$ time. Note that a channel can be probed at most once on the $\le i$ path originating from the root for this policy, so the number of possible probing sequences for this path is bounded by $O(n^h)$. For each probing sequence on this path, we compute the remaining actions so as to maximize the gain, as follows. Suppose we are at a node $t$ on this path where we have just probed channel $x$, and the set of probed channels (including $x$) is $Q$. Since the probing sequence on the $\le i$ path is given, we only need to determine the actions downstream when $x$ is in a state $j > i$.
In this case, we know from Definition 6.1 that a backup is not used downstream. Also, only the channels not in $Q$ can be probed downstream. Further, channels found in states $\le j$ will not be used for transmission, and probing them does not increase the gain. Therefore we can pretend that we have a new system over the channels not in $Q$, with rewards $r''_s = r_s - r_j$ if $s > j$, and $r''_s = 0$ otherwise. We can use OPTNOBKUP on these channels with rewards $\{r''\}$ to find an optimal subtree. Thus, given the probing sequence on the $\le i$ path, the rest of the tree can be computed using $h$ applications of OPTNOBKUP, which requires $O(h n K \log n)$ time. The gain of each tree can be computed in $O(nKh)$ time. Thus, $O(h n K \log n)$ time is needed for this part. Since there are $O(n^h)$ probing sequences for the $\le i$ path, the policy that attains the maximum gain in $P(\ell, i, h)$ can be computed in $O(n^{h+2} h K)$ time. The result follows.

From Lemma 6.1, and since $G_{OPT} = G(\ell, i)$ for some $\ell \in \{1, \ldots, n\}$ and $i \in \{0, \ldots, K-1\}$, the policy with the maximum gain among those attaining $G(\ell, i) - \epsilon r_{K-1}$ for the different $\ell \in \{1, \ldots, n\}$ and $i \in \{0, \ldots, K-1\}$ attains a gain of at least $G_{OPT} - \epsilon r_{K-1}$. We can compute such a policy in $O(n^{h+3} h K^2)$ time. We therefore obtain the following theorem.

Theorem 6.2. In $O(n^{h+3} h K^2)$ time we can compute a policy whose gain is at least $G_{OPT} - \epsilon r_{K-1}$, where $h = 1 + \lceil \log_{1/(1-\epsilon)}(1/\epsilon) \rceil$.

Finally, note that since we seek only an additive approximation, the computation time can be made independent of $K$. First, divide $[0, r_{K-1}]$ into disjoint intervals of size $\epsilon/2$. Then consider a new system in which the probability of success in each state $i$ equals $(\epsilon/2)\lfloor 2 r_i/\epsilon \rfloor$. This new system effectively consists of at most $2 r_{K-1}/\epsilon \le 2/\epsilon$ states. In this system, the approximate policy we just developed approximates the optimum gain within an additive factor of $\epsilon/2$.
The gain of the optimum policy in this system is at least $G_{OPT} - \epsilon/2$. Thus, the approximate policy computed in this system attains a gain of at least $G_{OPT} - \epsilon$ in the actual system. Note that, irrespective of the number of states in the original system, the time required for computing the approximate policy in the new system is $O(n^{h+3} h/\epsilon^2)$, where $h = 1 + \lceil \log_{1/(1-\epsilon/2)}(2/\epsilon) \rceil$.

7 The Unsaturated Sender Problem

We now consider the case in which the sender may not always have packets to transmit. We assume that the sender generates packets according to a positive recurrent Markovian arrival process such that the average number of packets arriving in its queue under the steady-state distribution of the arrival process is $\lambda$. The sender may choose not to transmit in a given slot even when her queue is non-empty, e.g., when the transmission conditions of the probed channels are not acceptable to her. However, the sender needs to transmit at rate $\lambda$, else her queue becomes unstable. Thus, we need to jointly optimize the probing, channel selection, and transmission decisions so as to maximize the gain subject to stabilizing the sender's queue. Specifically, we seek to solve Problem 2 formulated in Section 3. As stated in Section 3, we consider $n$ channels with $K$ states, potentially unequal probing costs, and different distributions for the state processes. We note that no previous results, not even exponential-time policies, were known for this problem.

We will assume that the optimal policy OPTUNSAT is ergodic, and denote its gain by $G_U$. We present a stable policy that (a) attains a gain arbitrarily close to $G_U$ for $K = 2$ and to $(2/3) G_U$ for $K > 2$, and (b) can be computed in $O(n^2 K (n+K))$ time.

7.1 Roadmap and Main Results

We first introduce the following definitions.

Definition 7.1. Let $\Pi$ denote the set of decision trees.
Let $C_\sigma$ denote the expected probing cost in decision tree $\sigma \in \Pi$, and let $M_\sigma$ be the set of leaf nodes where the decision is to transmit. Let $\hat p_{m\sigma}$ denote the probability that leaf node $m$ is reached in $\sigma$, and, for $m \in M_\sigma$, let $\hat r_{m\sigma}$ be the probability of success of the transmission at $m$. For each $\sigma \in \Pi$, let $S_\sigma = \sum_{m \in M_\sigma} \hat p_{m\sigma}$ and $G_\sigma = \sum_{m \in M_\sigma} \hat r_{m\sigma}\,\hat p_{m\sigma} - C_\sigma$.

Since the number of channels and the number of transmission states of the channels are both finite, $\Pi$ constitutes a finite set. Note that now a decision tree need not transmit at all of its leaf nodes. For example, the decision trees in Figures 1(a) and (b) are decision trees in $\Pi$; in addition, if the decision trees in Figure 1(b) are modified so as not to transmit at the end of their left-most paths, the modifications also constitute decision trees in $\Pi$. Note that $\Pi$ also includes the decision tree that neither probes nor transmits in any channel. Thus, if the sender takes actions as per decision tree $\sigma$ in a slot in which she has packets to transmit, she transmits with probability $S_\sigma$ in that slot and attains a gain of $G_\sigma$ in that slot.

Step 1: Expressing the optimal policy as the solution of a linear program. Throughout this discussion, we assume that $\epsilon \in (0, \frac{1}{\lambda} - 1)$ is a suitably chosen small constant. Consider the following linear program LPUNSAT(ε).

LPUNSAT(ε): Maximize $\sum_{\sigma \in \Pi} \beta_\sigma G_\sigma$ subject to
$$\sum_{\sigma \in \Pi} \beta_\sigma S_\sigma = \lambda(1+\epsilon) \quad \text{(stability constraint)}$$
$$\sum_{\sigma \in \Pi} \beta_\sigma = 1, \qquad \beta_\sigma \ge 0 \ \ \forall\, \sigma \in \Pi.$$

Definition 7.2. Let $\{\beta^*(\epsilon)\}$ be the optimum solution of LPUNSAT(ε), and let $Q^*(\epsilon)$ denote the optimal value of the objective function.

We will prove that an arbitrarily close approximation to the optimal policy can be obtained using $\{\beta^*(\epsilon)\}$. However, $\{\beta^*(\epsilon)\}$ does not provide an exact solution, because the stability constraint in LPUNSAT(ε) involves a positive $\epsilon$, which is required to ensure stability. Thus, the approximation improves as $\epsilon$ decreases.
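The quantities $S_\sigma$ and $G_\sigma$ in LPUNSAT(ε) can be evaluated by a single pass over the leaves of a decision tree. The sketch below is illustrative only (a toy one-channel tree with hypothetical numbers, not the authors' code): each leaf carries its reach probability, success probability, and whether the policy transmits there.

```python
# Illustrative evaluation (not from the paper) of S_sigma and G_sigma for a toy
# decision tree: probe one channel at cost c1; if it is ON, transmit in it;
# otherwise do not transmit.  Each leaf is (reach_prob, success_prob, transmit?).

def tree_stats(leaves, probe_cost):
    S = sum(p for p, r, tx in leaves if tx)            # transmission probability
    G = sum(p * r for p, r, tx in leaves if tx) - probe_cost
    return S, G

p_on, r_on, c1 = 0.6, 0.8, 0.05                        # hypothetical parameters
leaves = [(p_on, r_on, True), (1 - p_on, 0.0, False)]
S, G = tree_stats(leaves, c1)
print(S, G)   # S = 0.6, G = 0.6*0.8 - 0.05 = 0.43
```

In LPUNSAT(ε), each such tree contributes one variable $\beta_\sigma$, with its pair $(S_\sigma, G_\sigma)$ as coefficients; the LP is over the (finite but exponentially large) set $\Pi$ of all trees.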
We first prove the following results.

Proposition 7.1. For any $0 \le \epsilon_1 < \epsilon_2 < \frac{1}{\lambda} - 1$, $Q^*(\epsilon_1) \le Q^*(\epsilon_2)$.

Proof. Let $\{\beta\}$ denote the optimal solution of LPUNSAT(ε₁). Thus, $\sum_\sigma \beta_\sigma S_\sigma = \lambda(1+\epsilon_1) < \lambda(1+\epsilon_2)$. There exists a policy $x$ such that $\beta_x > 0$ and $S_x < 1$; otherwise $\sum_\sigma \beta_\sigma S_\sigma = 1 > \lambda(1+\epsilon_1)$, a contradiction. Let $x^C$ denote the policy that transmits at all leaf nodes but is otherwise identical to $x$. Thus $S_{x^C} = 1$ and $G_{x^C} \ge G_x$. Now increase $\beta_{x^C}$ and decrease $\beta_x$ so that their sum remains the same. This change increases $\sum_\sigma \beta_\sigma S_\sigma$, does not decrease the objective value, and leaves $\sum_\sigma \beta_\sigma$ unchanged. Continue this process until $\sum_\sigma \beta_\sigma S_\sigma = \lambda(1+\epsilon_2)$. This yields a feasible solution to LPUNSAT(ε₂) whose value is at least $Q^*(\epsilon_1)$.

The next lemma provides an upper bound on the gain of the optimal policy.

Lemma 7.1. $G_U \le Q^*(\epsilon)$ for all $\epsilon \in [0, \frac{1}{\lambda} - 1)$.

Proof. Let $\beta^U_\sigma$ denote the probability with which the optimal policy OPTUNSAT chooses decision tree $\sigma$. Then $\{\beta^U_\sigma\}$ forms a feasible solution to LPUNSAT(0), and the expected gain of this policy is simply $\sum_{\sigma \in \Pi} \beta^U_\sigma G_\sigma$. Thus $G_U \le Q^*(0)$. By Proposition 7.1, $Q^*(0) \le Q^*(\epsilon)$, which completes the proof.

We next show that any feasible solution $\{\beta(\epsilon)\}$ to LPUNSAT(ε) yields a stable policy, UNSAT(β(ε)), whose gain is close to the corresponding objective value of LPUNSAT(ε). We describe policy UNSAT(β(ε)) after introducing the following definition. A slot in which the sender's queue is non-empty is referred to as a busy slot.

Policy UNSAT(β(ε)): In each busy slot, select a $\sigma \in \Pi$ according to the probability distribution $\{\beta(\epsilon)\}$, and probe channels, decide whether to transmit, and select a channel as per $\sigma$.

Lemma 7.2. Assume that $\epsilon \in (0, \frac{1}{\lambda} - 1)$.
If $\{\beta(\epsilon)\}$ is a feasible solution of LPUNSAT(ε) with objective value $Q(\epsilon)$, then UNSAT(β(ε)) is stable and attains a gain of $\frac{Q(\epsilon)}{1+\epsilon}$.

Proof. In any busy slot, UNSAT(β(ε)) selects a decision tree $\sigma$ according to the probability distribution $\beta(\epsilon)$. Thus, the stability constraint in LPUNSAT(ε) ensures that in any busy slot the sender transmits a packet with probability $\lambda(1+\epsilon)$, which exceeds $\lambda$. Since the state of the arrival process and the queue length under UNSAT(β(ε)) constitute a Markov chain, stability follows from standard results and analytical techniques (Theorem 2.2.3 in [12], [9]). Since the system is stable and UNSAT(β(ε)) transmits a packet with probability $\lambda(1+\epsilon)$ in each busy slot, by Little's law at least a fraction $\frac{1}{1+\epsilon}$ of the slots are busy. The gain of UNSAT(β(ε)) in each busy slot is $Q(\epsilon)$. Thus, UNSAT(β(ε)) attains a gain of at least $Q(\epsilon)/(1+\epsilon)$.

In view of Lemmas 7.1 and 7.2, UNSAT(β*(ε)) is stable and attains a gain of $\frac{Q^*(\epsilon)}{1+\epsilon} \ge G_U/(1+\epsilon)$. However, we do not know how to solve LPUNSAT(ε) in polynomial time. We therefore seek to obtain constant-factor approximations of the optimal solution of LPUNSAT(ε) in polynomial time. This motivates the following definitions.

Definition 7.3. A $c$-approximation to LPUNSAT(ε) is a feasible solution $\{\beta\}$ of LPUNSAT(ε) for which the objective function is at least $c Q^*(\epsilon)$. A $c$-approximation to the unsaturated sender problem is a (potentially randomized) stable policy that attains a gain of at least $c G_U$.

It follows from Lemmas 7.1 and 7.2 that, for any $\epsilon \in (0, \frac{1}{\lambda} - 1)$, a $c$-approximation to LPUNSAT(ε) easily yields a $c/(1+\epsilon)$-approximation to the unsaturated sender problem. We therefore focus on obtaining a $c$-approximation to LPUNSAT(ε) in polynomial time.

Step 2: Considering the Lagrangean relaxation.
We consider a Lagrangean relaxation LPLAGRANGE(ε, L) for $L \ge 0$.

LPLAGRANGE(ε, L): Maximize
$$\sum_{\sigma \in \Pi} \beta_\sigma G_\sigma + L\Big(\lambda(1+\epsilon) - \sum_{\sigma \in \Pi} \beta_\sigma S_\sigma\Big)$$
subject to $\sum_{\sigma \in \Pi} \beta_\sigma = 1$ and $\beta_\sigma \ge 0$ for all $\sigma \in \Pi$.

Note that the optimal solution of LPLAGRANGE(ε, L) uses only (a) the $\sigma$'s that always transmit when $L \le 0$, and (b) the $\sigma$'s that never transmit when $L > r_{K-1}$. The hope is that an ideal Lagrange multiplier $L^*$ would ensure that $\sum_{\sigma \in \Pi} \beta_\sigma S_\sigma = \lambda(1+\epsilon)$ for the optimum solution of LPLAGRANGE(ε, L*), which would yield a solution of LPUNSAT(ε). However, the computation time for finding such an $L^*$ is the same as that for the original problem! We circumvent this difficulty as follows: we obtain a $c$-approximate solution for LPUNSAT(ε) in polynomial time using the following observation and the subsequent lemma.

Proposition 7.2. For any $L \ge 0$, there exists an optimum solution of LPLAGRANGE(ε, L) in which $\beta_\sigma = 1$ for some $\sigma = \sigma_L$, and $\beta_\sigma = 0$ for $\sigma \in \Pi \setminus \{\sigma_L\}$.

The above proposition motivates the following definition.

Definition 7.4. For any $L \ge 0$, a policy $\sigma \in \Pi$ is said to $c$-approximate LPLAGRANGE(ε, L) if $G_\sigma - L S_\sigma \ge c\,(G_{\sigma'} - L S_{\sigma'})$ for every $\sigma' \in \Pi$.

Lemma 7.3. Assume that $\epsilon \in (0, \frac{1}{\lambda} - 1)$ and $0 \le c \le 1$. Suppose we have two decision trees $\sigma^+, \sigma^-$ that $c$-approximate LPLAGRANGE(ε, L⁺) and LPLAGRANGE(ε, L⁻) respectively. Suppose, further, that $S_{\sigma^+} \le \lambda(1+\epsilon) < S_{\sigma^-}$ and that $0 \le L^+ - L^- \le \epsilon c Q^*(\epsilon)$. Consider $\{\beta\}$ such that $\beta_{\sigma^+} = \alpha$, $\beta_{\sigma^-} = 1-\alpha$, and $\beta_\sigma = 0$ for $\sigma \in \Pi \setminus \{\sigma^+, \sigma^-\}$, where $\alpha = \frac{S_{\sigma^-} - \lambda(1+\epsilon)}{S_{\sigma^-} - S_{\sigma^+}}$. Then $\{\beta\}$ constitutes a feasible solution for LPUNSAT(ε), and $\beta_{\sigma^+} G_{\sigma^+} + \beta_{\sigma^-} G_{\sigma^-} \ge c(1-\epsilon) Q^*(\epsilon)$.

We prove the above lemma in Section 7.2. Lemmas 7.1, 7.2, and 7.3 imply the following fact.

Proposition 7.3.
If $\{\beta\}$ satisfies the conditions of Lemma 7.3, then UNSAT(β) is a $c(1-\epsilon)/(1+\epsilon)$-approximation to the unsaturated sender problem.

Step 3: Finding the two solutions. We now address the following important issues: (1) how to obtain $c$-approximations for LPLAGRANGE(ε, L) for arbitrary $L$, and (2) how to obtain $L^+, L^-$ such that the respective $c$-approximations $\sigma^+, \sigma^-$ of LPLAGRANGE(ε, L⁺) and LPLAGRANGE(ε, L⁻) satisfy $S_{\sigma^+} \le \lambda(1+\epsilon) < S_{\sigma^-}$.

We first observe that the objective function of LPLAGRANGE(ε, L) can be expressed as
$$\sum_{\sigma \in \Pi} \beta_\sigma G_\sigma + L\Big(\lambda(1+\epsilon) - \sum_{\sigma \in \Pi} \beta_\sigma S_\sigma\Big) = L\lambda(1+\epsilon) + \sum_{\sigma \in \Pi} \beta_\sigma \Big(\sum_{m \in M_\sigma} (\hat r_{m\sigma} - L)\,\hat p_{m\sigma} - C_\sigma\Big).$$
Thus, optimizing or approximating the above quantity is akin to optimizing or approximating the saturated sender problem in a system where (a) the reward of transmitting in a channel in state $m$ is $r'_m = r_m - L$, and (b) the sender may choose not to transmit in a slot. The shift in the rewards and the option of not transmitting lead to some important differences from the saturated sender problem considered earlier (Problem 1). Specifically, the optimal policy in this system will not transmit in a probed channel that is in a state $m$ with $r_m < L$, but may transmit in a backup channel in such a state $m$. Thus, the reward of transmitting in a probed channel is non-negative, whereas the reward of transmitting in a backup channel may be negative. Owing to these differences, the proof of the $4/5$-approximation no longer holds in this system. Nevertheless, we obtain a $2/3$-approximation for this system. We first introduce the following definitions.

Definition 7.5. Consider a system where the sender (a) is saturated (i.e., always has packets to transmit), (b) attains a reward of $r_m - x$ if it transmits in a channel in state $m$, (c) incurs a cost of $c_i$ when it probes channel $i$, and (d) may choose not to transmit in a slot.
We refer to this system as the SATURATEDALTEREDREWARD(x) system, and let $T(x)$ be the problem of maximizing the gain in this system. A policy is said to $c$-approximate the $T(x)$ problem if its gain in this system is at least $c$ times the maximum gain in this system. Note that a policy that solves ($c$-approximates, respectively) the $T(L)$ problem solves ($c$-approximates, respectively) the LPLAGRANGE(ε, L) problem.

Lemma 7.4. We can solve $T(x)$ optimally ($c = 1$) for $K = 2$, and achieve a $c = 2/3$ approximation for $K > 2$, in $O(n^2 K)$ time.

The optimum policy in a class of "threshold-type" policies, RESERVEBKUP(ℓ, x), provides the above optimal and approximate solutions for the different values of $K$; as the name suggests, these threshold-type policies are extensions of RESERVEBKUP(ℓ). We present these threshold-type policies in Section 7.3, and prove Lemma 7.4 using them in Section 7.4. We finally prove in Section 7.5 the last piece, namely:

Lemma 7.5. Assume that $\epsilon \in (0, \frac{1}{\lambda} - 1)$. In $O(n^2 K(n+K))$ time, we can compute $\sigma^+, \sigma^-, L^+, L^-$ which satisfy the properties of Lemma 7.3 for (a) $c = 1$ for $K = 2$ and (b) $c = 2/3$ for $K > 2$.

We present a constructive procedure for computing the above $\sigma^+, \sigma^-, L^+, L^-$ as part of our proof. The following theorem follows from Proposition 7.3 and Lemma 7.5.

Theorem 7.6. Assume that $\epsilon \in (0, \frac{1}{\lambda} - 1)$. We can compute a $c(1-\epsilon)/(1+\epsilon)$-approximation for the unsaturated sender problem, where $c = 1$ for $K = 2$ and $c = 2/3$ for $K > 2$, in $O(n^2 K(n+K))$ time.

Since the computation time does not depend on $\epsilon$, by selecting a small $\epsilon$ we can attain in polynomial time an approximation factor close to $1$ for $K = 2$ and close to $2/3$ for $K > 2$. We summarize the policy that attains the above performance guarantee in Section 7.6.
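The link between $T(L)$ and LPLAGRANGE(ε, L) above can be made concrete with a toy computation. The sketch below is illustrative only (hypothetical $(G_\sigma, S_\sigma)$ pairs, not from the paper): for each multiplier $L$, a single tree maximizing $G_\sigma - L S_\sigma$ is optimal for the relaxation, and the maximizer moves monotonically from an always-transmitting tree to a never-transmitting one as $L$ grows, which is the structure the parametric search of Section 7.5 exploits.

```python
# Hedged sketch of the Lagrangean view: pick, for each multiplier L, the single
# decision tree maximizing G_sigma - L * S_sigma (Proposition 7.2).  The
# (G, S) pairs below are hypothetical, chosen only to exhibit the switch.

trees = {                       # name -> (G_sigma, S_sigma)
    "always_transmit": (0.40, 1.0),
    "selective":       (0.30, 0.5),
    "never_transmit":  (0.00, 0.0),
}

def best_tree(L):
    return max(trees, key=lambda s: trees[s][0] - L * trees[s][1])

print([best_tree(L) for L in (0.0, 0.3, 0.7)])
# ['always_transmit', 'selective', 'never_transmit']
```

Between two consecutive multipliers where the maximizer switches, mixing the two trees as in Lemma 7.3 recovers a feasible solution that meets the stability constraint exactly.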
Finally, the idea of using an approximation algorithm for the Lagrangean relaxation of an optimization problem, and performing a parametric search to satisfy the constraint while preserving the approximation ratio, has a rich history in approximation algorithms. It is the method of choice for network design problems with a hard bound on the resource allocation constraint, for instance $k$-medians [18] and $k$-MST [7]. We extend this technique to deal with the hard constraint on the rate of transmissions, and our results constitute the first application of this technique to policy design. The technique yields threshold-based reward policies, which suggests interesting connections to the retirement-based index policies [14] for multi-armed bandit problems; these connections will be explored in future work.

7.2 Proof of Lemma 7.3

We first prove that $\{\beta\}$ constitutes a feasible solution to LPUNSAT(ε). Since $S_{\sigma^+} \le \lambda(1+\epsilon) < S_{\sigma^-}$, we have $\beta_{\sigma^+} \in (0, 1]$ and $\beta_{\sigma^-} \in [0, 1)$. Finally, note that $\sum_{\sigma \in \Pi} \beta_\sigma = 1$ and $\sum_{\sigma \in \Pi} \beta_\sigma S_\sigma = \lambda(1+\epsilon)$. The result follows.

We now prove that $\beta_{\sigma^+} G_{\sigma^+} + \beta_{\sigma^-} G_{\sigma^-} \ge c(1-\epsilon) Q^*(\epsilon)$. Since $\sigma^+$ $c$-approximates LPLAGRANGE(ε, L⁺), $0 \le c \le 1$, and $\sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma = 1$,
$$G_{\sigma^+} + L^+\big(\lambda(1+\epsilon) - S_{\sigma^+}\big) \ge c\Big[\sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma G_\sigma + L^+\Big(\lambda(1+\epsilon) - \sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma S_\sigma\Big)\Big] \qquad (3)$$
and likewise for $L^-$,
$$G_{\sigma^-} + L^-\big(\lambda(1+\epsilon) - S_{\sigma^-}\big) \ge c\Big[\sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma G_\sigma + L^-\Big(\lambda(1+\epsilon) - \sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma S_\sigma\Big)\Big]. \qquad (4)$$
Since $\{\beta^*(\epsilon)\}$ is a feasible solution of LPUNSAT(ε), $\sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma S_\sigma = \lambda(1+\epsilon)$. Thus the terms $\lambda(1+\epsilon) - \sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma S_\sigma$ can be removed from the respective right-hand sides of (3) and (4). We now multiply (3) by $\alpha$ and (4) by $1-\alpha$ and add the resulting inequalities. The right-hand side of the sum evaluates to $c \sum_{\sigma \in \Pi} \beta^*(\epsilon)_\sigma G_\sigma = c Q^*(\epsilon)$.
Since $\alpha S_{\sigma^+} + (1-\alpha) S_{\sigma^-} = \lambda(1+\epsilon)$, the left-hand side becomes $\alpha G_{\sigma^+} + (1-\alpha) G_{\sigma^-} - (L^+ - L^-)(1-\alpha)\big(\lambda(1+\epsilon) - S_{\sigma^-}\big)$. Thus, we have
$$\alpha G_{\sigma^+} + (1-\alpha) G_{\sigma^-} - (L^+ - L^-)(1-\alpha)\big(\lambda(1+\epsilon) - S_{\sigma^-}\big) \ge c Q^*(\epsilon).$$
Now, since $0 \le \lambda(1+\epsilon) \le S_{\sigma^-} \le 1$, we have $-1 \le (1-\alpha)\big(\lambda(1+\epsilon) - S_{\sigma^-}\big) \le 0$. Thus, since $0 \le L^+ - L^- \le \epsilon c Q^*(\epsilon)$, we have
$$\alpha G_{\sigma^+} + (1-\alpha) G_{\sigma^-} \ge c Q^*(\epsilon) - (L^+ - L^-) \ge c(1-\epsilon) Q^*(\epsilon).$$
The result follows since $\beta_{\sigma^+} = \alpha$ and $\beta_{\sigma^-} = 1-\alpha$.

7.3 Threshold Policies for c-approximating LPLAGRANGE(ε, L)

We first generalize the definition of $H_{u,\ell}$ as follows.

Definition 7.6. Let $H_{u,\ell,x} = \emptyset$ for all $u \ge K$. For each $\ell$, starting from $u = K-1$ down to $u = w_{\ell,x}$, recursively define
$$H_{u,\ell,x} = \Big\{\, i \;\Big|\; i \notin \bigcup_{v>u} H_{v,\ell,x}, \ \text{and} \ \tilde r_i[u] - c_i/\tilde p_i[u] > \max(\tilde r_\ell[0],\, r_{u-1},\, x) \,\Big\} \setminus \{\ell\}.$$
Assume that $c_i/\tilde p_i[u] = \infty$ when $\tilde p_i[u] = 0$. Let $w_{\ell,x} = \min\{u : u \ge 0,\ r_u > \max(\tilde r_\ell[0], x)\}$; if $r_u \le \max(\tilde r_\ell[0], x)$ for all $u$, then $w_{\ell,x} = K$.

RESERVEBKUP(ℓ, x)
(Probing process:) Set $u = K-1$. While $u \ge w_{\ell,x}$ and the highest state of a probed channel is lower than $u$, probe the channels in $H_{u,\ell,x}$ in non-increasing order of $\tilde r_j[u] - c_j/\tilde p_j[u]$; then set $u \to u-1$.
(Selection process:) Consider the channel $j$ in the highest state $y$ among all probed channels. (If no channel is probed, $j = -1$.) If $\max(r_y, \tilde r_\ell[0]) < x$, do not transmit. If $\max(r_y, \tilde r_\ell[0]) \ge x$, transmit in $j$ if $r_y \ge \tilde r_\ell[0]$, and transmit in $\ell$ otherwise.

Note that RESERVEBKUP(ℓ, x) is similar to RESERVEBKUP(ℓ), except that RESERVEBKUP(ℓ, x) selects a transmission threshold $x$ a priori, and transmits only if a probed channel is in a state $j$ such that $r_j \ge x$, or if the probability of success of the backup channel $\ell$ is not lower than $x$.
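The probing and selection processes of RESERVEBKUP(ℓ, x) above can be sketched as executable pseudocode. This is an illustrative rendering under simplified assumptions, not the authors' implementation: the sets $H_{u,\ell,x}$, the ordering index $\tilde r_j[u] - c_j/\tilde p_j[u]$, and the backup success probability $\tilde r_\ell[0]$ are supplied as precomputed inputs, and `observe(ch)` stands in for transmitting a control packet in channel `ch`.

```python
# Minimal sketch of RESERVEBKUP(l, x): probe the sets H[u] for u = K-1 down to
# w while the best observed state stays below u, then apply the threshold rule.
# All inputs (H, index, r, r0_backup) are assumed precomputed; names are
# illustrative only.

def reservebkup(H, K, w, index, r, r0_backup, x, observe):
    best_state, best_ch = -1, None
    for u in range(K - 1, w - 1, -1):
        if best_state >= u:                    # a probed channel already reached u
            break
        for ch in sorted(H.get(u, []), key=index.get, reverse=True):
            s = observe(ch)                    # probe: learn instantaneous state
            if s > best_state:
                best_state, best_ch = s, ch
            if best_state >= u:
                break
    r_y = r[best_state] if best_ch is not None else 0.0
    if max(r_y, r0_backup) < x:
        return None                            # threshold not met: do not transmit
    return best_ch if r_y >= r0_backup else "backup"

states = {"a": 1, "b": 2}                      # hypothetical observed states
out = reservebkup({2: ["b"], 1: ["a"]}, K=3, w=1,
                  index={"a": 0.2, "b": 0.6}, r=[0.0, 0.4, 0.9],
                  r0_backup=0.3, x=0.5, observe=states.get)
print(out)   # "b": found in state 2 with r = 0.9 >= x, so transmit in "b"
```

Raising the threshold to $x > 0.9$ in this toy instance makes the policy decline to transmit, which is exactly the behavior the Lagrange multiplier $L$ induces in the altered-reward system.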
Thus, RESERVEBKUP(ℓ, x) probes only those channels whose expected rewards, conditioned on being in state $k$ or above, where $r_k > \max(\tilde r_\ell[0], x)$, exceed the probing cost.

Definition 7.7. Let BESTRESERVEBKUP(x) be RESERVEBKUP(ℓ, x) for the $\ell \in \{0, \ldots, n\}$ for which it attains the maximum gain, and let $\sigma_x$ denote the decision tree of BESTRESERVEBKUP(x).

In the next section, we prove that BESTRESERVEBKUP(x) optimally solves $T(x)$ for $K = 2$ and $2/3$-approximates $T(x)$ for $K > 2$. Lemma 7.4 follows, since BESTRESERVEBKUP(x) can be computed in $O(n^2 K)$ time.

7.4 Proof of Lemma 7.4

The proof relies on the following Generalized Structure Theorem, which is similar to the Structure Theorem 5.1.

Theorem 7.7 (Generalized Structure Theorem). Consider the SATURATEDALTEREDREWARD(x) system described in Definition 7.5. There exists an optimum policy in this system that uses a unique backup channel whenever it transmits in a backup channel. The backup channel is used at the end of one $\le i$ path.

Proof. If no optimum policy transmits in a backup channel, the theorem clearly holds. Now suppose there exists an optimum policy OPT₁ that transmits in a backup channel at the end of some path in its decision tree. Observe that Lemma 5.5 holds in this system; the proof is the same as that in the original system. If OPT₁ does not probe any channel, the theorem follows as well. So let OPT₁ probe a channel. Clearly, then, OPT₁ probes a channel $j$ at the root node, say $m$, of its decision tree. Let $i$ be the highest state of $j$ for which OPT₁ transmits in a backup channel somewhere downstream of $m$.
Then, by Lemma 5.5, there exists another optimum policy OPT₂ for which the decision tree rooted at $m$, and hence the overall decision tree, (a) is a $\le i$ tree, (b) uses a channel, say $\ell$, as a backup at the end of the $\le i$ path in the tree, and (c) uses $\ell$ as a backup whenever it uses a backup. The theorem follows.

Proof (of Lemma 7.4). Consider the SATURATEDALTEREDREWARD(x) system described in Definition 7.5. The set of decision trees in this system is $\Pi$, irrespective of $x$. Note that the gain of any $\sigma \in \Pi$ in this system is $G_\sigma - x S_\sigma$ and depends on $x$. Let $\sigma^*_x$ be a policy that attains the maximum gain in this system (and hence solves problem $T(x)$), and let $F = G_{\sigma_x} - x S_{\sigma_x}$ and $BEST = G_{\sigma^*_x} - x S_{\sigma^*_x}$. We need to prove that $F = BEST$ for $K = 2$ and $F \ge (2/3) BEST$ for $K > 2$; Lemma 7.4 then follows, since BESTRESERVEBKUP(x) can be computed in $O(n^2 K)$ time.

In this system, for each $x$, multiple $\sigma$ may maximize the gain. The Generalized Structure Theorem (Theorem 7.7) shows that for each $x$ at least one $\sigma^*_x$ uses a unique backup channel whenever a backup channel is used for transmission. We therefore consider a $\sigma^*_x$ that uses a unique backup, say $\ell$, and let $\sigma^*_x$ use $\ell$ as backup with probability $\alpha$. Let $R = \sum_{i: r_i \ge x} p_{i\ell}(r_i - x)$ and $T = \sum_{i: r_i < x} p_{i\ell}(x - r_i)$. For each $j \in \{0, \ldots, n\}$, let $P'(j)$ denote the class of policies that use $j$ as their unique backup channel whenever they use a backup (with $j = 0$ denoting that no backup is used). For $K = 2$ (and for $K > 2$), it can be shown that in this system RESERVEBKUP(j, x) attains the maximum gain among all policies in $P'(j)$. Thus, $F$ is the maximum gain attained in this system by any policy in $\cup_{j=0}^{n} P'(j)$.

Let $K = 2$. We now prove that $\sigma^*_x \in \cup_{j=0}^{n} P'(j)$, from which $F \ge BEST$ follows. Let $x \ge r_1$. Then $\sigma^*_x$ does not transmit in any channel, so $\sigma^*_x \in P'(0)$ and the result follows. Now let $x < r_1$, and suppose $\sigma^*_x \notin \cup_{j=0}^{n} P'(j)$. Given Theorem 7.7, this can happen only if $\sigma^*_x$ probes a channel $j$ on one path and uses it as a backup on another path. We now rule this out.
Now , if a probed channel is in the highest state, state 1 , σ ∗ x transmits in that channel. Thus, σ ∗ x consists of only one path, say P , and some other links that originate from P. Each of these links correspond to the case that a probed channel is in state 1 and leads to a leaf node at which σ ∗ x transmits in the probed channel. Thus, σ ∗ x can transmit in a backup channel only at the end of P , b ut then it can not hav e probed the channel in P , and hence does not probe the channel in any other path as well. The result follo ws. Next, let K > 2 . No w , construct a policy σ 1 that is similar to σ ∗ x except that whenev er σ ∗ x uses ` as a backup, σ 1 does not transmit. Thus σ 1 attains a gain of a t least B E S T - α ( R − T ) . Also, σ 1 ∈ P 0 (0) . Thus, its gain is upper bounded by F . Thus F ≥ B E S T − α ( R − T ) . Now , consider the polic y that transmits in ` e very slot without probing any channel. This policy is in P 0 ( ` ) and attains a gain of R − T . Thus, F ≥ R − T . It follows that (1 + α ) F ≥ B E S T . (5) Next, construct another policy σ 2 that is similar to σ ∗ x except that σ 2 ne ver probes ` ; instead wherev er σ ∗ x probes ` , σ 2 follo ws the same course of actions as σ ∗ x does after discovering ` in state 0 . The gain of σ 2 is at least B E S T − (1 − α )( R − c ` ) , because if ` was in state i such that r i < x , σ ∗ x will ne ver transmit in `. Also, σ 2 ∈ P 0 ( ` ) . Therefore, F ≥ B E S T − (1 − α )( R − c ` ) . Now consider another polic y σ 3 that probes ` and subsequently transmits only if ` is in a state i such that r i ≥ x ; σ 3 neither probes nor transmits in any other channel. Clearly , σ 3 attains a gain of R − c ` . Also, σ 3 ∈ P 0 (0) . Thus, F ≥ R − c ` . Combining the 26 last two equations, (2 − α ) F ≥ B E S T . (6) Adding (5) and (6), we get F ≥ (2 / 3) B E S T . The result follows. 7.5 Proof of Lemma 7.5 W e now describe ho w the parameters L + , L − and σ + , σ − are selected. Definition 7.8. 
Let THRESHOLD be an array consisting of the $n+K+2$ elements $-1$, $2$, $r_0, \ldots, r_{K-1}$, and $\tilde r_1[0], \ldots, \tilde r_n[0]$, sorted in increasing order.

Note that the decision tree $\sigma_x$ of BESTRESERVEBKUP(x) is the same for all $x \in (THRESHOLD[i], THRESHOLD[i+1])$. Thus, THRESHOLD is the collection of thresholds $x$ corresponding to the different values of $\sigma_x$.

Definition 7.9. For $1 \le i \le n+K+1$, let $\hat\sigma_i$ be $\sigma_x$ (i.e., BESTRESERVEBKUP(x)) for $x = THRESHOLD[i]$.

Lemma 7.8. For $\epsilon \in (0, 1/\lambda - 1)$, there exists an $i$ such that $S_{\hat\sigma_i} > \lambda(1+\epsilon) \ge S_{\hat\sigma_{i+1}}$.

Proof. Note that $S_{\sigma_x} = 1$ for $x \le 0$ and $S_{\sigma_x} = 0$ for $x > r_{K-1}$. Thus, since $THRESHOLD[1] = -1$ and $THRESHOLD[n+K+2] > r_{K-1}$, we have $S_{\hat\sigma_1} = 1 > \lambda(1+\epsilon)$ and $S_{\hat\sigma_{n+K+1}} = 0 \le \lambda(1+\epsilon)$. The result follows since $S_{\hat\sigma_i} \ge S_{\hat\sigma_{i+1}}$ for each $i$.

Proof (of Lemma 7.5). Let $i^*$ be the $i$ found by Lemma 7.8, and let $\Delta = \min\big(\frac{2\epsilon Q^*(\epsilon)}{3},\, (THRESHOLD[i^*+1] - THRESHOLD[i^*])/2\big)$. If $S_{\sigma_x} \le \lambda(1+\epsilon)$ for $x \in (THRESHOLD[i^*], THRESHOLD[i^*+1])$, set $L^+ = THRESHOLD[i^*] + \Delta$ and $L^- = THRESHOLD[i^*]$. If $S_{\sigma_x} > \lambda(1+\epsilon)$ for $x \in (THRESHOLD[i^*], THRESHOLD[i^*+1])$, set $L^+ = THRESHOLD[i^*+1]$ and $L^- = THRESHOLD[i^*+1] - \Delta$. Now, $\sigma^+ = \sigma_{L^+}$ and $\sigma^- = \sigma_{L^-}$. Since $S_{\hat\sigma_{i^*}} > \lambda(1+\epsilon) \ge S_{\hat\sigma_{i^*+1}}$, in both cases $L^+, L^-$ and $\sigma^+, \sigma^-$ satisfy the properties of Lemma 7.3. Note that we need to compute $\sigma_x$ for $O(n+K)$ values of $x$; the guarantee on the computation time follows since each $\sigma_x$ can be computed in $O(n^2 K)$ time (Lemma 7.4).

7.6 Algorithm Summary

We now summarize the design of the stable policy UNSATAPPROX(ε), which attains a gain of $\frac{1-\epsilon}{1+\epsilon} G_U$ for $K = 2$ and $(2/3)\frac{1-\epsilon}{1+\epsilon} G_U$ for $K > 2$ (Theorem 7.6).
Recall that σ_x is BESTRESERVEBKUP(x), and σ̂_i is σ_x for x = THRESHOLD[i] (Definition 7.8).

UNSATAPPROX(ε)

1. Compute L+ and L− as in the proof of Lemma 7.5, and let σ+ and σ− denote BESTRESERVEBKUP(L+) and BESTRESERVEBKUP(L−), respectively.

2. In each busy slot, use σ+ with probability α and σ− with probability 1 − α, where α = (S_{σ−} − λ(1 + ε)) / (S_{σ−} − S_{σ+}).

8 Conclusions

We have presented a simple model for studying the information acquisition and exploitation trade-off at a single wireless node, when the multiple available channels are multi-state and the channel distributions and information acquisition costs may differ across channels. We presented a general solution framework based on exploiting the structure of the optimal policy and on using Lagrangean relaxations to simplify the space of approximately optimal solutions. We believe these techniques will have wider applicability, in particular when we consider the multiple-node scenario.