Efficient Learning with Partially Observed Attributes

We describe and analyze efficient algorithms for learning a linear predictor from examples when the learner can only view a few attributes of each training example. This is the case, for instance, in medical research, where each patient participating…

Authors: Nicol`o Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

Suppose we would like to predict if a person has some disease based on medical tests. Theoretically, we may choose a sample of the population, perform a large number of medical tests on each person in the sample and learn from this information. In many situations this is unrealistic, since patients participating in the experiment are not willing to go through a large number of medical tests. The above example motivates the problem studied in this paper, that is learning when there is a hard constraint on the number of attributes the learner may view for each training example. example. Roughly speaking, we actively pick which attributes to observe in a randomized way so as to construct a "noisy" version of all attributes. Intuitively, we can still learn despite the error of this estimate because instead of receiving the exact value of each individual example in a small set it suffices to get noisy estimations of many examples. Many methods have been proposed for dealing with missing or partial information. Most of the approaches do not come with formal guarantees on the risk of the resulting algorithm, and are not guaranteed to converge in polynomial time. The difficulty stems from the exponential number of ways to complete the missing information. In the framework of generative models, a popular approach is the Expectation-Maximization (EM) procedure (Dempster et al., 1977). The main drawback of the EM approach is that it might find sub-optimal solutions. In contrast, the methods we propose in this paper are provably efficient and come with finite sample guarantees on the risk. Our technique for dealing with missing information borrows ideas from algorithms for the adversarial multi-armed bandit problem (Auer et al., 2003;Cesa-Bianchi and Lugosi, 2006). Our learning algorithms actively choose which attributes to observe for each example. This and similar protocols were studied in the context of active learning (Cohn et al., 1994;Balcan et al., 2006;Hanneke, 2007;2009;Beygelzimer et al., 2009), where the learner can ask for the target associated with specific examples. The specific learning task we consider in the paper was first proposed in (Ben-David and Dichterman, 1998), where it is called "learning with restricted focus of attention". Ben-David and Dichterman (1998) considered the classification setting and showed learnability arXiv: 1004.4421v2 [cs.LG] 28 Apr 2010 of several hypothesis classes in this model, like k-DNF and axis-aligned rectangles. However, to the best of our knowledge, no efficient algorithm for the class of linear predictors has been proposed. 1A related setting, called budgeted learning, was recently studied -see for example (Deng et al., 2007;Kapoor and Greiner, 2005) and the references therein. In budgeted learning, the learner purchases attributes at some fixed cost subject to an overall budget. Besides lacking formal guarantees, this setting is different from the one we consider in this paper, because we impose a budget constraint on the number of attributes that can be obtained for every individual example, as opposed to a global budget. In some applications, such as the medical application discussed previously, our constraint leads to a more realistic data acquisition process -the global budget allows to ask for many attributes of some individual patients while our protocol guarantees a constant number of medical tests to all the patients. Our technique is reminiscent of methods used in the compressed learning framework (Calderbank et al., 2009;Zhou et al., 2009), where data is accessed via a small set of random linear measurements. Unlike compressed learning, where learners are both trained and evaluated in the compressed domain, our techniques are mainly designed for a scenario in which only the access to training data is restricted. The "opposite" setting, in which full information is given at training time and the goal is to train a predictor that depends only on a small number of attributes at test time, was studied in the context of learning sparse predictors -see for example (Tibshirani, 1996) and the wide literature on sparsity properties of 1 regularization. Since our algorithms also enforce low 1 norm, many of those results can be combined with our techniques to yield an algorithm that views only O(1) attributes at training time, and a number of attributes comparable to the achievable sparsity at test time. Since our focus in this work is on constrained information at training time, we do not elaborate on this subject. Furthermore, in some real-world situations, it is reasonable to assume that attributes are very expensive at training time but are more easy to obtain at test time. Returning to the example of medical applications, it is unrealistic to convince patients to participate in a medical experiment in which they need to go through a lot of medical tests, but once the system is trained, at testing time, patients who need the prediction of the system will agree to perform as many medical tests as needed. A variant of the above setting is the one studied by Greiner et al. (2002), where the learner has all the information at training time and at test time he tries to actively choose a small amount of attributes to form a prediction. Note that active learning at training time, as we do here, may give more learning power than active learning at testing time. For example, we formally prove that while it is possible to learn a consistent predictor accessing at most 2 attributes of each example at training time, it is not possible (even with an infinite amount of training examples) to build an active classifier that uses at most 2 attributes of each example at test time, and whose error will be smaller than a constant. In this section we outline the main results. We start with a formal description of the learning problem. In linear regression each example is an instance-target pair, (x, y) ∈ R d × R. We refer to x as a vector of attributes and the goal of the learner is to find a linear predictor x → w, x , where we refer to w ∈ R d as the predictor. The performance of a predictor w on an instance-target pair, (x, y) ∈ R d × R, is measured by a loss function ( w, x , y). For simplicity, we focus on the squared loss function, (a, b) = (a-b) 2 , and briefly discuss other loss functions in Section 5. Following the standard framework of statistical learning (Haussler, 1992;Devroye et al., 1996;Vapnik, 1998), we model the environment as a joint distribution D over the set of instance-target pairs, R d × R. The goal of the learner is to find a predictor with low risk, defined as the expected loss: L D (w) def = E (x,y)∼D [ ( w, x , y)]. Since the distribution D is unknown to the learner he learns by relying on a training set of m examples S = (x 1 , y 1 ), . . . , (x m , y m ), which are assumed to be sampled i.i.d. from D. We denote the training loss by . We now distinguish between two scenarios: • Full information: The learner receives the entire training set. This is the traditional linear regression setting. • Partial information: For each individual example, (x i , y i ), the learner receives the target y i but is only allowed to see k attributes of x i , where k is a parameter of the problem. The learner has the freedom to actively choose which of the attributes will be revealed, as long as at most k of them will be given. While the full information case was extensively studied, the partial information case is more challenging. Our approach for dealing with the problem of partial information is to rely on algorithms for the full information case and to fill in the missing information in a randomized, data and algorithmic dependent, way. As a simple baseline, we begin by describing a straightforward adaptation of Lasso (Tibshirani, 1996), based on a direct nonadaptive estimate of the loss function. We then turn to describe a more effective approach, which combines a stochastic gradient descent algorithm called Pegasos (Shalev-Shwartz et al., 2007) with the active sampling of attributes in order to estimate the gradient of the loss at each step. A popular approach for learning a linear regressor is to minimize the empirical loss on the training set plus a regularization term taking the form of a norm of the predictor. For example, in ridge regression the regularization term is w 2 2 and in Lasso the regularization term is w 1 . Instead of regularization, we can include a constraint of the form w 1 ≤ B or w 2 ≤ B. With an adequate tuning of parameters, the regularization form is equivalent to the constraint form. In the constraint form, the predictor is a solution to the following optimization problem: where S = {(x 1 , y 1 ), . . . , (x m , y m )} is a training set of examples, B is a regularization parameter, and p is 1 for Lasso and 2 for ridge regression. Standard risk bounds for Lasso imply that if ŵ is a minimizer of (1) (with p = 1), then with probability greater than 1 -δ over the choice of a training set of size m we have To adapt Lasso to the partial information case, we first rewrite the squared loss as follows: where w, x are column vectors and w T , x T are their corresponding transpose (i.e., row vectors). Next, we estimate the matrix xx T and the vector x using the partial information we have, and then we solve the optimization problem given in (1) with the estimated values of xx T and x. To estimate the vector x we can pick an index i uniformly at random from [d] = {1, . . . , d} and define the estimation to be a vector v It is easy to verify that v is an unbiased estimate of x, namely, E[v] = x where expectation is with respect to the choice of the index i. When we are allowed to see k > 1 attributes, we simply repeat the above process (without replacement) and set v to be the averaged vector. To estimate the matrix xx T we could pick two indices i, j independently and uniformly at random from [d], and define the estimation to be a matrix with all zeros except d 2 x i x j in the (i, j) entry. However, this yields a non-symmetric matrix which will make our optimization problem with the estimated matrix non-convex. To overcome this obstacle, we symmetrize the matrix by adding its transpose and dividing by 2. The resulting baseline procedure 2 is given in Algorithm 1. 2 We note that an even simpler approach is to arbitrarily assume that the correlation matrix is the identity matrix and then the solution to the loss minimization problem is simply the averaged vector, w = (x,y)∈S y x. In that case, we can simply replace x by its estimated vector as defined in (3). While this naive approach can work on very simple classification tasks, it will perform poorly on realistic data sets, in which the correlation matrix is not likely to be identity. Indeed, in our experiments with the MNIST data set, we found out that this approach performed poorly relatively to the algorithms proposed in this paper. The following theorem shows that similar to Lasso, the Baseline algorithm is competitive with the optimal linear predictor with a bounded L 1 norm. ,k), where |S| = m. Then, with probability of at least 1 -δ over the choice of the training set and the algorithm's own randomization we have The above theorem tells us that for a sufficiently large training set we can find a very good predictor. Put another way, a large number of examples can compensate for the lack of full information on each individual example. In particular, to overcome the extra factor d 2 /k in the bound, which does not appear in the full information bound given in (2), we need to increase m by a factor of d 4 /k 2 . Note that when k = d we do not recover the full information bound. This is because we try to estimate a matrix with d 2 entries using only k = d < d 2 samples. In the next subsection, we describe a better, adaptive procedure for the partial information case. In this section, by avoiding the estimation of the matrix xx T , we significantly decrease the number of additional examples sufficient for learning with k attributes per training example. To do so, we do not try to estimate the loss function but rather estimate the gradient ∇ (w) = 2 ( w, x -y) x, with respect to w, of the squared loss function ( w, x -y) 2 . Each vector w can define a probability distribution over [d] by letting We can estimate the gradient using 2 attributes as follows. First, we randomly pick j from [d] according to the distribution defined by w. Using j we estimate the term w, x by sgn(w j ) w 1 x j . It is easy to verify that the expectation of the estimate equals w, x . Second, we randomly pick i from [d] according to the uniform distribution over [d]. Based on i, we estimate the vector x as in (3). Overall, we obtain the following unbiased estimation of the gradient: where v is as defined in (3). The advantage of the above approach over the loss based approach we took before is that the magnitude of each element of the gradient estimate is order of d w 1 . This is in contrast to what we had for the loss based approach, where the magnitude of each element of the matrix A was order of d 2 . In many situations, the L 1 norm of a good predictor is significantly smaller than d and in these cases the gradient based estimate is better than the loss based estimate. However, while in the previous approach our estimation did not depend on a specific w, now the estimation depends on w. We therefore need an iterative learning method in which at each iteration we use the gradient of the loss function on an individual example. Luckily, the stochastic gradient descent approach conveniently fits our needs. Concretely, below we describe a variant of the Pegasos algorithm (Shalev-Shwartz et al., 2007) for learning linear regressors. Pegasos tries to minimize the regularized risk Of course, the distribution D is unknown, and therefore we cannot hope to solve the above problem exactly. Instead, Pegasos finds a sequence of weight vectors that (on average) converge to the solution of ( 5). We start with the all zeros vector w = 0 ∈ R d . Then, at each iteration Pegasos picks the next example in the training set (which is equivalent to sampling a fresh example according to D) and calculates the gradient of the loss function on this example with respect to the current weight vector w. In our case, the gradient is simply 2( w, x -y)x. We denote this gradient vector by ∇. Finally, Pegasos updates the predictor according to the rule: w = (1 -1 t ) w -1 λ t ∇, where t is the current iteration number. To apply Pegasos in the partial information case we could simply replace the gradient vector ∇ with its estimation given in (4). However, our analysis shows that it is desirable to maintain an estimation vector ∇ with small magnitude. Since the magnitude of ∇ is order of d w 1 , where w is the current weight vector maintained by the algorithm, we would like to ensure that w 1 is always smaller than some threshold B. We achieve this goal by adding an additional projection step at the end of each Pegasos's iteration. Formally, after performing the update we set This projection step can be performed efficiently in time O(d) using the technique described in (Duchi et al., 2008). A pseudo-code of the resulting Attribute Efficient Regression algorithm is given in Algorithm 2. w The following theorem provides convergence guarantees for AER. Theorem 2 Let D be a distribution such that Let w be any vector such that w 1 ≤ B and w 2 ≤ B 2 Then, where |S| = m, w is the output of AER(S, k) run with λ = ((B+1)d/B 2 ) log(m)/(mk), and the expectation is over the choice of S and over the algorithm's own randomization. For simplicity and readability, in the above theorem we only bounded the expected risk. It is possible to obtain similar guarantees with high probability by relying on Azuma's inequality -see for example (Cesa-Bianchi et al., 2004). Note that w 2 ≤ w 1 ≤ B, so Theorem 2 implies that Therefore, the bound for AER is much better3 than the bound for Baseline: instead of d 2 /k we have d/ √ k. It is interesting to compare the bound for AER to the Lasso bound in the full information case given in (2). As it can be seen, to achieve the same level of risk, AER needs a factor of d 2 /k more examples than the full information Lasso.4 Since each AER example uses only k attributes while each Lasso example uses all d attributes, the ratio between the total number of attributes AER needs and the number of attributes Lasso needs to achieve the same error is O(d). Intuitively, when having d times total number of attributes, we can fully compensate for the partial information protocol. However, in some situations even this extra d factor is not needed. Suppose we know that the vector w , which minimizes the risk, is dense. That is, it satisfies w 1 ≈ √ d w 2 . In this case, choosing B 2 = B/ √ d, the bound in Theorem 2 becomes order of B 2 d/k 1/m. Therefore, the number of examples AER needs in order to achieve the same error as Lasso is only a factor d/k more than the number of examples Lasso uses. But, this implies that both AER and Lasso needs the same number of attributes in order to achieve the same level of error! Crucially, the above holds only if w is dense. When w is sparse we have w 1 ≈ w 2 and then AER needs more attributes than Lasso. One might wonder whether a more clever active sampling strategy could attain in the sparse case the performance of Lasso while using the same number of attributes. The next subsection shows that this is not possible in general. We now show (proof in the appendix) that any attribute efficient algorithm needs in general order of d/ examples for learning an -accurate sparse linear predictor. Recall that the upper bound of AER implies that order of d 2 (B + 1) 2 B 2 2 / 2 examples are sufficient for learning a predictor with L D (w) -L D (w ) < . Specializing this sample complexity bound of AER to the w described in Theorem 3 below, yields that O(d 2 / ) examples are sufficient for AER for learning a good predictor in this case. That is, we have a gap of factor d between the lower bound and the upper bound, and it remains open to bridge this gap. Theorem 3 For any ∈ (0, 1/16), k, and d ≥ 4k, there exists a distribution over examples and a weight vector w , with w 0 = 1 and w 2 = w 1 = 2 √ , such that any attribute efficient regression algorithm accessing at most k attributes per training example must see (in expectation) at least Ω d k examples in order to learn a linear predictor w with L D (w) -L D (w ) < . Recall that in our setting, while at training time the learner can only view k attributes of each example, at test time all attributes can be observed. The setting of Greiner et al. (2002), instead, assumes that at test time the learner cannot observe all the attributes. The following theorem shows that if a learner can view at most 2 attributes at test time then it is impossible to give accurate predictions at test time even when the optimal linear predictor is known. Theorem 4 There exists a weight vector w and a distribution D such that L D (w ) = 0 while any algorithm A that gives predictions A(x) while viewing only 2 attributes of each x must have L D (A) ≥ 1/9. The proof is given in the appendix. This negative result highlights an interesting phenomenon. We can learn an arbitrarily accurate predictor w from partially observed examples. However, even if we know the optimal w , we might not be able to accurately predict a new partially observed example. Here we only sketch the proof of Theorem 2. A complete proof of all our theorems is given in the appendix. We start with a general logarithmic regret bound for strongly convex functions (Hazan et al., 2006;Kakade and Shalev-Shwartz, 2008). The regret bound implies the following. Let z 1 , . . . , z m be a sequence of vectors, each of which has norm bounded by G. Let λ > 0 and consider the sequence of functions g 1 , . . . , g m such that g t (w) = λ 2 w 2 + z t , w . Each g t is λ-strongly convex (meaning, it is not too flat), and therefore regret bounds for strongly convex functions tell us that there is a way to construct a sequence of vectors w 1 , . . . , w m such that for any w that satisfies w 1 ≤ B we have . With an appropriate choice of λ, and with the assumption w 2 ≤ B 2 , the above inequality implies that . This holds for any sequence of z 1 , . . . , z m , and in particular, we can set z t = 2(ŷ t -y t )v t . Note that z t is a random vector that depends both on the value of w t and on the random bits chosen on round t. Taking conditional expectation of z t w.r.t. the random bits chosen on round t we obtain that E[z t |w t ] is exactly the gradient of ( w, x t -y t ) 2 at w t , which we denote by ∇ t . From the convexity of the squared loss, we can lower bound ∇ t , w t -w by ( w t , x t -y t ) 2 -( w , x t -y t ) 2 . That is, in expectation we have that Taking expectation w.r.t. the random choice of the examples from D, denoting w = 1 m m t=1 , and using Jensen's inequality we get that E[L D ( w)] ≤ L D (w ) + α. Finally, we need to make sure that α is not too large. The only potential danger is that G, the bound on the norms of z 1 , . . . , z m , will be large. We make sure this cannot happen by restricting each w t to the 1 ball of radius B, which ensures that z t ≤ O((B + 1)d) for all t. We performed some preliminary experiments to test the behavior of our algorithm on the well-known MNIST digit recognition dataset (Cun et al., 1998), which contains 70,000 images (28 × 28 pixels each) of the digits 0 -9. The advantages of this dataset for our purposes is that it is not a small-scale dataset, has a reasonable dimensionality-to-data-size ratio, and the setting is clearly interpretable graphically. While this dataset is designed for classification (e.g. recognizing the digit in the image), we can still apply our algorithms on it by regressing to the label. First, to demonstrate the hardness of our settings, we provide in Figure 1 below some examples of images from the dataset, in the full information setting and the partial information setting. The upper row contains six images from the dataset, as available to a full-information algorithm. A partial-information algorithm, however, will have a much more limited access to these images. In particular, if the algorithm may only choose k = 4 pixels from each image, the same six images as available to it might look like the bottom row of Figure 1. We began by looking at a dataset composed of "3 vs. 5", where all the 3 digits were labeled as -1 and all the 5 digits were labeled as +1. We ran four different algorithms on this dataset: the simple Baseline algorithm, AER, as well as ridge regression and Lasso for comparison (for Lasso, we solved (1) with p = 1). Both ridge regression and Lasso were run in the full information setting: Namely, they enjoyed full access to We randomly split the dataset into a training set and a test set (with the test set being 10% of the original dataset). For each algorithm, parameter tuning was performed using 10-fold cross validation. Then, we ran the algorithm on increasingly long prefixes of the training set, and measured the average regression error ( w, x -y) 2 on the test set. The results (averaged over runs on 10 random train-test splits) are presented in Figure 2. In the upper plot, we see how the test regression error improves with the number of examples. The Baseline algorithm is highly unstable at the beginning, probably due to the ill-conditioning of the estimated covariance matrix, although it eventually stabilizes (to prevent a graphical mess at the left hand side of the figure, we removed the error bars from the corresponding plot). Its performance is worse than AER, completely in line with our earlier theoretical analysis. The bottom plot of Figure 2 is similar, only that now the X-axis represents the accumulative number of attributes seen by each algorithm rather than the number of examples. For the partial-information algorithm, the graph ends at approximately 49,000 attributes, which is the total number of attributes accessed by the algorithm after running over all training examples, seeing k = 4 pixels from each example. However, for the full-information algorithm 49,000 attributes are already seen after just 62 examples. When we compare the algorithms in this way, we see that our AER algorithm achieves excellent performance for a given attribute budget, significantly better than the other L 1 -based algorithms, and even comparable to full-information ridge regression. Finally, we tested the algorithms over 45 datasets generated from MNIST, one for each possible pair of dig- its. For each dataset and each of 10 random train-test splits, we performed parameter tuning for each algorithm separately, and checked the average squared error on the test set. As can be seen, the AER algorithm manages to achieve good performance, not much worse than the fullinformation Lasso algorithm. The Baseline algorithm, however, achieves a substantially worse performance, in line with our theoretical analysis above. We also calculated the test classification error of AER, i.e. sign( w, x ) = y, and found out that AER, which can see only 4 pixels per image, usually perform only a little worse than the full-information algorithms (ridge regression and Lasso), which enjoy full access to all 784 pixels in each image. In particular, the median test classification errors of AER, Lasso, and Ridge are 3.5%, 1.1%, and 1.3% respectively. In this paper, we provided an efficient algorithm for learning when only a few attributes from each training example can be seen. The algorithm comes with formal guarantees, is provably competitive with algorithms which enjoy full access to the data, and seems to perform well in practice. We also presented sample complexity lower bounds, which are only a factor d smaller than the upper bound achieved by our algorithm, and it remains open to bridge this gap. Our approach easily extends to other gradient-based algorithms besides Pegasos. For example, generalized additive algorithms such as p-norm Perceptrons and Winnow -see, e.g., (Cesa-Bianchi and Lugosi, 2006). An obvious direction for future research is how to deal with loss functions other than the squared loss. In upcoming work on a related problem, we develop a technique which allows us to deal with arbitrary analytic loss functions, but in the setting of this paper will lead to sample complexity bounds which are exponential in d. Another interesting extension we are considering is connecting our results to the field of privacy-preserving learning (Dwork, 2008), where the goal is to exploit the attribute efficiency property in order to prevent acquisition of information about individual data instances. To ease our calculations, we first show that sampling k elements without replacements and then averaging the result has the same expectation as sampling just once. In the lemma below, for a set C we denote the uniform distribution over C by U (C). Lemma 1 Let C be a finite set and let f : C → R be an arbitrary function. Let Then, Proof Denote |C| = n. We have: To prove Theorem 1 we first show that the estimation matrix constructed by the Baseline algorithm is likely to be close to the true correlation matrix over the training set. Lemma 2 Let A t be the matrix constructed at iteration t of the Baseline algorithm and note that Then, with probability of at least 1 -δ over the algorithm's own randomness we have that Proof Based on Lemma 1, it is easy to verify that E[A t ] = x T t x t . Additionally, since we sample without replacements, each element of A t is in [-d 2 /k, d 2 /k] (because we assume x t ∞ ≤ 1). Therefore, we can apply Hoeffding's inequality on each element of Ā and obtain that Combining the above with the union bound we obtain that Calling the right-hand-side of the above δ and rearranging terms we conclude our proof. Next, we show that the estimate of the linear part of the objective function is also likely to be accurate. Combining the above with the union bound we obtain that Calling the right-hand-side of the above δ and rearranging terms we conclude our proof. We next show that the estimated training loss found by the Baseline algorithm, LS (w), is close to the true training loss. Lemma 4 With probability greater than 1 -δ over the Baseline Algorithm's own randomization, for all w such that w 1 ≤ B we have that Proof Combining Lemma 2 with the boundedness of w 1 and using Holder's inequality twice we easily get that Similarly, using Lemma 3 and Holder's inequality, Combining the above inequalities with the union bound and the triangle inequality we conclude our proof. We are now ready to prove Theorem 1. First, using standard risk bounds (based on Rademacher complexities5 ) we know that with probability greater than 1 -δ over the choice of a training set of m examples, for all w s.t. w 1 ≤ B, we have that Combining the above with Lemma 4 we obtain that for any w s.t. The proof of Theorem 1 follows since the Baseline algorithm minimizes LS (w). We start with the following lemma. Lemma 5 Let y t , ŷt , v t , w t be the values of y, ŷ, v, w, respectively, at iteration t of the AER algorithm. Then, for any vector w s.t. w 1 ≤ B we have The proof follows directly from logarithmic regret bounds for strongly convex functions (Hazan et al., 2006;Kakade and Shalev-Shwartz, 2008) by noting that according to our construction, max Let B 2 be such that w 2 ≤ B 2 and choose λ = ((B + 1)d/B 2 ) log(m)/(mk). Since λ w t 2 ≥ 0 we obtain from Lemma 5 that For each t, let ∇ t = 2( w t , x t -y t )x t and ∇t = 2(ŷ t -y t )v t . Taking expectation of ( 7) with respect to the algorithm's own randomization, and noting that the conditional expectation of ∇t equals ∇ t , we obtain From the convexity of the squared loss we know that Combining with (8) yields The outline of the proof is as follows. We define a specific distribution such that only one "good" feature is slightly correlated with the label. We then show that if some algorithm learns a linear predictor with an extra risk of at most , then it must know the value of the 'good' feature. Next, we construct a variant of a multi-armed bandit problem out of our distribution and show that a good learner can yield a good prediction strategy. Finally, we adapt a lower bound for the multi-armed bandit problem given in (Auer et al., 2003), to conclude that in our case no learner can be too good. Proof For any two distributions P, Q we let P -Q 1 = r |P [r] -Q[r]| be the total variation distance and let KL(P, Q) = r P [r] log(P [r]/Q[r]) be the KL divergence. Using Holder inequality we know that E j [f (r)] -E u [f (r)] ≤ M P j -P u 1 . Additionally, using Pinsker's inequality we have 1 2 P j -P u 2 1 ≤ KL(P u , P j ). Finally, the chain rule and simple calculations yield, KL(P u , P j ) = (a t,i = j) E u -log 1 + 2 p sgn(x t,j y t ) (since x t,j y t is independent of a t,1 , . . . , a t,K ) = 1 2 (-log 1 + 2 p) + 1 2 -log(1 -2 p) . Combining all the above we conclude our proof. We have shown previously that the expected reward of any algorithm is bounded above by 1 2 T + pE j [N j ]. Applying Lemma 6 above on f (r) = N j ∈ {0, 1, . . . , T } we get that Therefore, the expected reward of any algorithm is at most Since the adversary will choose j to minimize the above and since the minimum over j is smaller then the expectation over choosing j uniformly at random we have that the reward against an adversarial choice of j is at most Note that 1 Combining this with (12) and using Jensen's inequality we obtain the following upper bound on the reward Assuming that ≤ 1/16 we have that 4p 2 = 4 ≤ 1/4 and thus using the inequality -log(1 -q) ≤ 3 2 q, which holds for q ∈ [0, 1/4], we get the upper bound t Ben-David and Dichterman (1998) do describe learnability results for similar classes but only under the restricted family of product distributions. When comparing bounds, we ignore logarithmic terms. Also, in this discussion we assume that B1 and B2 are at least 1. We note that when d = k we still do not recover the full information bound. However, it is possible to improve the analysis and replace the factor d/ √ k with a factor d maxt xt 2/k. To bound the Rademacher complexity, we use the boundedness of w 1, x ∞, |y| to get that the squared loss is O(B) Lipschitz on the domain. Combining this with the contraction principle yields the desired Rademacher bound.

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment