Advancing subgroup fairness via sleeping experts


Authors: Avrim Blum, Thodoris Lykouris

December 2019

Abstract

We study methods for improving fairness to subgroups in settings with overlapping populations and sequential predictions. Classical notions of fairness focus on the balance of some property across different populations. However, in many applications the goal of the different groups is not to be predicted equally but rather to be predicted well. We demonstrate that the task of satisfying this guarantee for multiple overlapping groups is not straightforward and show that, for the simple objective of the unweighted average of false negative and false positive rates, satisfying it for overlapping populations can be statistically impossible even when we are provided predictors that perform well separately on each subgroup. On the positive side, we show that when individuals are equally important to the different groups they belong to, this goal is achievable; to do so, we draw a connection to the sleeping experts literature in online learning. Motivated by the one-sided feedback in natural settings of interest, we extend our results to such a feedback model. We also provide a game-theoretic interpretation of our results, examining the incentives of participants to join the system and to provide the system full information about predictors they may possess. We end with several interesting open problems concerning the strength of guarantees that can be achieved in a computationally efficient manner.

1 Introduction

Concerns about ethical use of data in algorithmic decision-making have spawned an important conversation regarding machine learning techniques that are fair towards the affected populations. We focus here on binary decision-making: e.g., deciding whether to approve a loan, admit a student to an honors class, display a particular job advertisement, or prescribe a particular drug.
While multiple fairness notions have been suggested to inform these decisions, most assume access to labeled data and that this data is drawn from i.i.d. distributions. In practice, data patterns are often dynamically evolving and the feedback received is biased by the decisions of the algorithms (such as only learning whether a student should have been admitted to an honors class if the student is actually admitted), which induces additional misrepresentation of the actual data patterns. Despite the rich literature, approaching fairness considerations without these strong assumptions is rather underexplored.

Most fairness notions impose a requirement of balance across groups. For example, demographic parity [CKP09] aims to ensure that the percentage of positive predictions is the same across different populations, and equality of opportunity [HPS16] aims to ensure that the percentage of false negative predictions is the same across them. These notions are useful in identifying inequities between subpopulations, especially in settings such as criminal recidivism [KLL+17] where there is a conflict between the incentives of the participants and the goal of accurate prediction. However, these notions may not be appropriate when the goal of each group is to be predicted as accurately as possible, and explicitly performing worse on one group in order to produce balance would be morally objectionable or absurd.

∗ Toyota Technological Institute at Chicago, avrim@ttic.edu. The author was supported in part by NSF grants CCF-1815011 and CCF-1733556.
† Microsoft Research, thlykour@microsoft.com. Research initiated during the author's visit to TTI-Chicago while he was a Ph.D. student at Cornell University. The author was supported in part under NSF grant CCF-1563714 and a Google Ph.D. fellowship.
For example, in a health application, a balance notion may lead to penalizing the majority population by willfully providing it worse treatment to make amends for the fact that a minority population is not classified correctly due to insufficient data. This would be clearly inappropriate. More generally, there are many natural scenarios where the goal of each group is just to be predicted as accurately as possible. A student wishes to be admitted to an honors class only if they are qualified to succeed in it; otherwise their academic record may be jeopardized. A person requesting a microloan can often be significantly harmed by receiving it unless they return it (see [LDR+18] for an interesting discussion). A drug prescription is only beneficial if it helps enhance the health of the patient; otherwise it may cause adverse effects. In these settings, what groups care about is not being treated equally but rather being treated well.

In this work, we consider a fairness notion for sequential non-i.i.d. settings where the subpopulations and the designer both strive for accurate predictions. We say that a prediction rule f is unfair to a subgroup g with respect to a family of rules F if there is some other rule f_g ∈ F that performs significantly better than f on g, in which case we say that f_g witnesses this unfairness. More generally, given a collection of rules, some of which may come from the designer, some from third-party entities, and some from the groups themselves, our goal is to achieve performance on each group comparable to that of the best of these rules for that group, even when groups overlap. Moreover, we aim to achieve this goal with as strong bounds as possible in a challenging "apple-tasting" feedback model where we only receive feedback on positive predictions (e.g., when a loan is given or a student is admitted).
Interestingly, while our main results are positive, we show such guarantees are not possible if we replace the performance measure of accuracy (or error) with any fixed weighted average of false-positive and false-negative rates. Our positive results can also be thought of as a form of individual rationality with respect to the groups: no group has any incentive (up to low-order terms) to pull out and, say, form its own lending agency just for members of that group. From this perspective, we also consider notions of incentive compatibility (could groups have any incentive to hide prediction rules from the system?) and present a computationally-inefficient algorithm along with an open problem related to achieving this guarantee in a computationally-efficient manner.

1.1 Our contribution

We consider a decision-maker with some global decision function f_o that she would like to use (say, to decide who gets a loan), and a collection of groups G, where each group g ∈ G proposes some function f_g that it would like to be used instead on members of g.1 The guarantee we aim to give is that our overall performance will be nearly as good as (or better than) f_o on the entire population with respect to the objective function of the decision-maker, and for each group g our performance is nearly as good as (or better than) the performance of f_g with respect to group g's objective. We would like to do this even when groups are overlapping and even when feedback on whether or not a decision was correct is only received when the decision made was "yes" (e.g., when the loan was given or the student was admitted).

1 Both decision-maker and groups may have more functions; the guarantees are with respect to the best of them.
Surprisingly, we show that when groups are overlapping and the group objectives are to minimize the unweighted average of false-positive rate (FPR) and false-negative rate (FNR), there exist settings where every global prediction rule f must be unfair to one of the groups. In particular, we present a simple example with two overlapping groups having predictors f_1 and f_2 respectively, where performing nearly as well as f_1 on group 1 and nearly as well as f_2 on group 2 is fundamentally impossible when performance is measured as (FPR + FNR)/2 (or as max(FPR, FNR), or as any fixed non-degenerate weighting), even when the input does not arrive in an online manner. This shows that just having the fairness notions be the same across groups and aligned with the goals of the designer (who in this case does not have any additional goal other than to eliminate unfairness) is not by itself sufficient to be able to achieve our fairness criteria under objectives based on fixed combinations of FPR and FNR.

Informal theorem 1.1 (Theorem 5.1). If subgroups are overlapping and their objectives are to minimize the unweighted average of false positive and false negative rates, there exist instances where no global function can simultaneously perform nearly as well on each group as the best function for that group. This holds even in the batch setting.

Instead we aim for low absolute error on each subgroup2 and show a connection of this notion to an adversarial online learning setting, that of sleeping experts [Blu97, FSSW97]. In sleeping experts, each predictor (also referred to as an expert) can decide at each round to either make a prediction (fire) or abstain (sleep). Sleeping experts algorithms guarantee that, for any expert, the performance of the algorithm when the expert fires is nearly as good as the performance of this expert.
Providing the functions f_o for the decision-maker and f_g for each group into existing sleeping-experts algorithms (viewing f_g as abstaining on any individual outside of group g) yields the following.

Informal theorem 1.2 (Theorem 3.1). For the objective of minimizing absolute error, we can perform nearly as well as f_g for all groups g while performing nearly as well as f_o overall.

One particular complication that arises in many fairness-related settings, however, is that the feedback received is one-sided: we only learn about the outcome if the loan is given, the student is admitted to the class, the advertisement is displayed, or the drug is prescribed; we do not learn what would have happened when the action is not taken. In online learning, this is known as the apple tasting model [HLL92]. We therefore initiate the study of sleeping experts in this apple tasting feedback model, aiming to achieve as strong regret guarantees as possible on a per-group basis. Combining apple tasting with sleeping experts poses interesting challenges as the exploration needs to be carefully coordinated among different subpopulations. In Section 3.2, we provide three different black-box reductions with different advantages in their performance guarantees.

Informal theorem 1.3 (Theorems 3.2, 3.3, and 3.4). Even if we only receive one-sided feedback, we can still perform nearly as well as f_g for all groups g while performing nearly as well as f_o overall.

Each of our guarantees is somewhat suboptimal. Theorem 3.2 is based on a construction that does not use sleeping experts, but has an exponential dependence on the number of groups, which makes it computationally inefficient when there are many groups.
Theorem 3.3 is a natural adaptation of sleeping experts to this setting but has an error bound with a (sublinear) dependence on the size of the total population instead of only depending on the size of the subgroup, which makes the result less meaningful for small groups as this term may dominate their regret bound. Last, Theorem 3.4 makes a more involved use of sleeping experts and does not suffer from the two previous issues, but has a suboptimal dependence on the size of the subgroup. Combining the advantages of these approaches without the resulting shortcomings is an intriguing open question.

The final contribution of our work (Section 4) is to provide a game-theoretic investigation of our setting in terms of the incentives of the participating groups. In mechanism design, two important properties that a mechanism should satisfy are Individual Rationality (IR) and Incentive Compatibility (IC). The former asks that no player should prefer to opt out and seek service outside of the system instead. The latter refers to the mechanism creating no incentives for players to misreport their private information. Inspired by the kidney exchange literature [RSU07, AR14, AFAKDP13], we consider each group as a player in our system. The IR property is satisfied when group g can get no benefit (asymptotically) from being predicted by their individual predictor f_g, say via their own loan agency. This is exactly what our above guarantees provide and therefore they can be interpreted as asymptotically IR. This observation brings up the question of whether incentive compatibility is also satisfied by sleeping experts algorithms.

2 This is equivalent to FPR on a group weighted by the fraction of negative examples in that group plus FNR on that group weighted by the fraction of positive examples in it.
In this context, IC means that if a group g has a set of predictors {f_g} then they get no benefit (asymptotically) from hiding some of those predictors from the decision-maker. Unfortunately, we show that current sleeping experts algorithms are not (even approximately) incentive compatible. On the other hand, we provide an algorithm that achieves both IR and IC guarantees, as well as operating in the apple tasting setting, at the expense of being computationally inefficient (enumerating over all exponentially-many group intersections). This leads to an interesting open question of finding a computationally efficient algorithm that satisfies both IR and IC properties.

Informal theorem 1.4 (Theorem 4.1). Classical sleeping experts algorithms such as AdaNormalHedge do not satisfy the IC property.

Informal theorem 1.5 (Theorem 4.2). Separate multiplicative weights algorithms for each intersection of groups satisfy the IC property at the expense of being computationally inefficient.

Open question 1 (Section 4.3). Does there exist a computationally efficient algorithm satisfying both IR and IC properties?

1.2 Related work

There is a growing literature aiming to identify natural fairness notions and understand the limitations they impose; see [DHP+12, HPS16, KMR17, Cho17, KCP+17, ABD+18] for a non-exhaustive list. With respect to fairness among different demographic groups, notions such as disparate impact [CKP09, FFM+15] or equalized odds [HPS16] aim to achieve some balance in performance across different populations. These make sense when there is an intrinsic conflict between the desires of the different groups and those of the designer, and can help identify undesirable inequities in the system that require remedies (which are often non-algorithmic and need policy changes).
Unfortunately, aiming to satisfy these notions can also have undesired implications, such as intentionally misclassifying some groups to make amends for the inability to classify other groups well enough. This has given rise to an important debate around alternative notions that do not suffer from these issues (see for example [CDG18]). Our work aims to advance this direction.

When populations are overlapping, there are several recent works in the batch setting that tackle considerations similar to the ones we address. Kearns et al. [KNRW18] provide a simple example illustrating the issue that one can be non-discriminatory with respect to, say, gender and race in isolation but heavily discriminate against, say, black female participants; they also discuss how to audit whether a classifier exhibits such unfairness with respect to groups. With a similar motivation, Hebert-Johnson et al. [HJKRR18] suggest a fairness notion they term multi-calibration that relates to accurate prediction of all populations that are computationally identifiable. Both these works show that their settings are equivalent to agnostic learning, which has strong computational hardness results but tends to have good heuristic guarantees.3 Subgroup fairness is also discussed by Kim et al. [KGZ19, KRR18] for the related notion of multi-accuracy, with the aim of post-processing data to achieve this notion while maintaining overall accuracy [KGZ19], as well as metric subgroup fairness notions [KRR18]. Our approach similarly considers overlapping demographic groups, but focuses on the more complex setting of online decision-making under limited feedback and addresses inherent conflicts between the incentives of different subgroups.
In fact, even for the batch setting, our impossibility result of Section 5 with respect to the criterion of unweighted average of false positive and false negative rates is of a purely statistical flavor (it has nothing to do with computational issues or arrivals being online). It therefore provides insights on complications that exist even when all the incentives seem well aligned. On the other hand, our connection to sleeping experts does not appear to directly have implications for multi-calibration and multi-accuracy, for similar reasons as the issues encountered regarding incentive compatibility; these notions tend to require a similar no negative regret property in order to be satisfied in an online context (similar to our results in Sections 4.1-4.2).

Fairness issues in online decision-making have also recently been considered. Joseph et al. [JKMR16] focus on a bandit learning approach and impose what they call meritocratic fairness against individuals with some stochastic but unknown quality. Celis et al. [CKSV18] discuss how to alleviate bias in settings like online advertising where bandit learning can lead to overexposure or underexposure of particular actions. Blum et al. [BGLS18] focus on adversarial online learning and examine, for different fairness notions, whether non-discriminatory predictors can be combined efficiently in an online fashion while preserving non-discrimination. A recent line of work also focuses on online settings where decisions made today affect what is allowed tomorrow, as they need to be connected via some metric-based notion of individual fairness [LRD+17, GJKR18, GK19]. Related to our work, Raghavan et al. [RSWW18] study externalities that arise in the exploration of classical stochastic bandit algorithms when applied across different subpopulations. Finally, Bechavod et al.
[BLR+19] study a similar notion of one-sided feedback in online learning with fairness in mind. Unlike us, the latter work does not apply to overlapping populations; instead they take a contextual bandit approach, focus on stochastic rather than adversarial losses, and aim for balance notions of fairness (where, as we discussed, there is a conflict between the incentives of the designer and the groups).

Our work applies adversarial online learning, which was initiated by the seminal works of Littlestone and Warmuth [LW94] and Freund and Schapire [FS97]. The classical experts guarantee we use in our first reduction can come from any of these or later developed algorithms. The apple tasting setting we consider to model the one-sided feedback was introduced by Helmbold et al. [HLL92]; a related concept is that of label-efficient prediction, which instead has an upper bound on the number of times the learner can explore [CBLS05]. Sleeping experts were introduced by Blum [Blu97] and Freund et al. [FSSW97]. Subsequently they were extended to the more general case of confidence-rated experts, and the results were further optimized [BM07, GSVE14, LS15]. The sleeping experts full-feedback guarantee we use in our reductions is the one of AdaNormalHedge [LS15]. To the best of our knowledge, our work is the first to consider sleeping experts in the context of either apple tasting or label-efficient prediction.

3 Their connection implies that for exact auditing, a linear dependence on the number of subpopulations is required but can be overcome if additional structure exists. Our work also has a linear dependence on the number of subgroups as it explicitly lists the subgroups. Understanding if this can be improved in our setting when the subgroups and predictors have additional structure is an interesting open direction.
We do so to enable our algorithms to deal with overlapping populations while only using realistic feedback (incorporating the one-sided nature of the feedback).

2 Basic model

Online learning with group context. We consider an online learning setting with multiple overlapping demographic groups. The set of groups G can correspond to divisions based on gender, age, ethnicity, or other attributes. People (also referred to as examples) arrive sequentially, and the example at round t = 1, 2, ..., T can simultaneously be a member of multiple groups (e.g., female and hispanic); the subset G_t ⊆ G denotes all groups that person t belongs to.

At each round t, the system designer (or learning algorithm or learner), denoted by A, aims to classify incoming examples by predicting a label ŷ_t ∈ Ŷ. For example, in binary classification with deterministic predictors, the prediction space consists of positive and negative labels, i.e. Ŷ = {+, −} (e.g., whether the corresponding person should be admitted to an honors class). To assist her goal, the designer has access to a set F_t of rules that suggest particular labels according to the features of the example; these are typically referred to as experts although they are not necessarily associated with any particular external knowledge. For every example, each expert f ∈ F_t makes a prediction ŷ_{t,f} ∈ Ŷ and the learner selects which expert's advice to follow in a (possibly randomized) manner; we use p_{t,f} to denote the probability with which the learner follows the advice of expert f. Subsequently, the true label y_t ∈ Y is realized and each expert f is associated with a loss ℓ_{t,f} ∈ [0, 1]. For example, if both the prediction and true label spaces are deterministic, i.e. Ŷ = Y = {+, −}, then a reasonable loss is whether the prediction was incorrect: ℓ_{t,f} = 1{ŷ_{t,f} ≠ y_t}. In order to not impose any i.i.d.
assumption, we allow the losses to be adversarially selected. The learner then suffers expected loss ℓ̂_t(A) = Σ_{f∈F_t} p_{t,f} ℓ_{t,f} and observes some feedback (discussed below).

Feedback observed. In the first portion of this paper, we assume the learner receives full feedback, i.e. at the end of the round she observes the losses of all experts (this typically can be achieved by observing the label). In Section 3.2, we turn our attention to the apple tasting setting in which we only receive feedback about the losses when we select the positive outcome. To ease notation and streamline presentation, we present here the performance and fairness notions for full feedback and defer the apple tasting definitions to Section 3.2.

Overall regret guarantee. One natural goal for the designer in this setting is to perform as well as the best among a class of experts F that are available every round, i.e. ∀t: F ⊆ F_t. Intuitively, when there exists a rule that consistently classifies examples more accurately, the learner should eventually realize this fact and trust it more often (or at least perform as well via combining other rules). This is formalized by the following notion of regret:

    \mathrm{Reg} = \sum_{t=1}^{T} \sum_{f \in \mathcal{F}_t} p_{t,f}\, \ell_{t,f} \;-\; \min_{f^\star \in \mathcal{F}} \sum_{t=1}^{T} \ell_{t,f^\star}.    (1)

In the classical experts setting, F_t = F for all rounds t. Algorithms such as the celebrated multiplicative weights [FS97] incur regret that, on average, vanishes over time at a rate of \sqrt{\log(|\mathcal{F}|)/T}. Hence, despite the input not being i.i.d., the penalty we pay for not knowing in advance which expert is the best is very small and goes to 0 at a fast rate. We allow changes in the sets F_t to incorporate adaptive addition of rules as well as group-specific rules.

Subgroup regret guarantees. In order to not treat any group worse than what is inevitable, we are especially interested in group-based performance guarantees.
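To make the classical experts guarantee concrete, here is a minimal sketch of the multiplicative-weights (Hedge) update behind Eq. (1). The function and variable names are ours, not the paper's; the learning rate eta = sqrt(log N / T) matches the average regret rate quoted above.

```python
import math

def hedge(n_experts, loss_rounds, eta):
    """Multiplicative weights (Hedge): follow each expert with
    probability proportional to its weight, then exponentially
    down-weight experts in proportion to the loss they incurred."""
    weights = [1.0] * n_experts
    alg_loss = 0.0                 # expected loss of the algorithm
    cum = [0.0] * n_experts        # cumulative loss of each expert
    for losses in loss_rounds:     # losses[f] in [0, 1], possibly adversarial
        total = sum(weights)
        probs = [w / total for w in weights]
        alg_loss += sum(p * l for p, l in zip(probs, losses))
        for f, l in enumerate(losses):
            cum[f] += l
            weights[f] *= math.exp(-eta * l)
    # regret as in Eq. (1): algorithm's loss minus the best expert's
    return alg_loss - min(cum)
```

Run on T rounds with N experts and eta = sqrt(log N / T), the returned regret is O(sqrt(T log N)), i.e. the average per-round penalty vanishes at the sqrt(log N / T) rate stated above.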
In particular, for each group g ∈ G, we want the performance on its members to be nearly as good as the best expert in a class F(g). This class can consist of all rules in F, in which case this means we care not only about competing with the best rule in the class overall but also about having the same guarantee for each group. It also allows each group to have rules specialized to it; for example, a third-party entity may observe a disparity in the performance of the group compared to what is achievable and recommend the use of a particular rule. To ease presentation, we assume that the set F(g) is fixed in advance, but most of our results extend to the case where new rules are added adaptively as potential unfairness is observed (see Remark 1). This notion of group-based regret is formalized below:

    \mathrm{Reg}(g) = \sum_{t:\, g \in \mathcal{G}_t} \sum_{f \in \mathcal{F}_t} p_{t,f}\, \ell_{t,f} \;-\; \min_{f^\star \in \mathcal{F}(g)} \sum_{t:\, g \in \mathcal{G}_t} \ell_{t,f^\star}.    (2)

In Section 3, we show that, via a connection to the literature on sleeping experts, we can have the average group-based regret vanish across time for all groups while still retaining the overall regret guarantee described above. In Section 5, we show that this is not in general possible for objectives based on fixed averages of false-positive and false-negative rates.

3 Sleeping experts and one-sided feedback

In this section, we focus on subgroup regret guarantees and conceptually connect the quest for these guarantees to the literature on sleeping experts and the incentives of the groups.

3.1 Subgroup regret guarantees via sleeping experts

Sleeping experts. Sleeping experts algorithms were originally developed to seamlessly combine task-specific rules so that their coexistence does not create negative externalities for other tasks.
More formally, there is a set of experts H and, at each round t, any expert h ∈ H may decide to either become active (fire) or abstain (sleep); the set of experts that fire at round t is denoted by H_t. At any round, the algorithm can only select among active experts, i.e. it puts non-zero probability p_{t,h} only on experts h ∈ H_t. The goal is that, for every expert h* ∈ H, the performance of the algorithm in the rounds where the expert fires is at least as good as that of h*. More formally, the sleeping regret is:

    \mathrm{SleepReg}(h^\star) = \sum_{t:\, h^\star \in \mathcal{H}_t} \sum_{h \in \mathcal{H}} p_{t,h}\, \ell_{t,h} \;-\; \sum_{t:\, h^\star \in \mathcal{H}_t} \ell_{t,h^\star}.    (3)

The goal is to ensure that the average sleeping regret for any sleeping expert h ∈ H vanishes with the number of times the expert fires, T(h) = |{t : h ∈ H_t}|. Multiple algorithms [Blu97, FSSW97, BM07, GSVE14, LS15] achieve this goal. Probably the most effective among them is AdaNormalHedge by Luo and Schapire [LS15], with an average sleeping regret of O(\sqrt{\log(|\mathcal{H}|)/T(h)}). We elaborate on this algorithm in Section 4.1 in order to discuss its game-theoretic properties.

Subgroup regret formulated via sleeping experts. Looking closely at the definition of the desired subgroup regret in Eq. (2) and that of sleeping regret in Eq. (3), there are clear similarities, which motivates formulating our problem as a sleeping experts problem and applying algorithms such as AdaNormalHedge. This leads to the following theorem.

Theorem 3.1. Let A be an algorithm with sleeping regret bounded by O(\sqrt{T(h) \cdot \log(|\mathcal{H}|)}) for any expert h ∈ H, where H is any class. Let G be a set of overlapping groups and N = |F| + Σ_{g∈G} |F(g)|. Then A can provide a subgroup regret guarantee of O(\sqrt{T(g) \cdot \log(N)}) for any g ∈ G while ensuring an overall regret guarantee of O(\sqrt{T \log(N)}).

Proof.
The idea to bound subgroup regret as a sleeping experts problem is to have each expert f ∈ F(g) fire only for members of group g, guaranteeing that those members experience performance at least as good as its own. One small issue is that we may want to use the same rule f ∈ F_t for multiple subgroups; to deal with that, we create different copies of this expert, each associated with one group. More formally, we create a set of global sleeping experts H with one sleeping expert h ∈ H for every expert f ∈ F. These sleeping experts fire every round and ensure the overall regret guarantee. Subsequently, we create disjoint sets H(g) for each group g ∈ G where again we create a sleeping expert h ∈ H(g) for any expert f ∈ F(g). These sleeping experts fire only when the example is a member of group g and hence ensure the subgroup regret guarantee. The eventual sleeping expert set is ∪_{g∈G} H(g) ∪ H. The cardinality of this set is N = |F| + Σ_{g∈G} |F(g)|. This formulation enables us to automatically apply sleeping experts algorithms, achieving subgroup regret of \sqrt{T(g) \cdot \log(N)} for any group g ∈ G while also guaranteeing an overall regret of \sqrt{T \cdot \log(N)}.

Remark 1. In the above formulation, we assumed that all the experts exist at the beginning of the time horizon. However, sleeping experts algorithms allow for experts to be added dynamically over time (treating them as not firing in the initial rounds). Hence we can adaptively add new sleeping experts if a group or some third-party entity suggests it, guaranteeing that we do at least as well in the remainder of the game on the members of the group without affecting the subgroup regret guarantees of other overlapping groups.
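The copy-per-group construction in the proof above can be sketched in code. For the sleeping-experts subroutine we use a simple specialists-style multiplicative update (awake experts are reweighted relative to the algorithm's own expected loss, so sleeping rounds leave them untouched) as a stand-in for AdaNormalHedge; all class and function names here are our own illustration, not the paper's.

```python
import math

class SleepingExperts:
    """Specialists-style multiplicative weights: only awake experts
    receive probability mass, and their weights move relative to the
    algorithm's own expected loss, so rounds in which an expert
    sleeps leave its weight untouched."""
    def __init__(self, n_experts, eta):
        self.w = [1.0] * n_experts
        self.eta = eta

    def probs(self, awake):
        total = sum(self.w[i] for i in awake)
        return {i: self.w[i] / total for i in awake}

    def update(self, probs, losses):
        alg_loss = sum(p * losses[i] for i, p in probs.items())
        for i in probs:
            self.w[i] *= math.exp(-self.eta * (losses[i] - alg_loss))
        return alg_loss

def make_expert_copies(global_rules, group_rules):
    """Theorem 3.1 construction: one always-awake copy per rule in F,
    plus one copy per (group, rule) pair in F(g) that fires only on
    members of g.  Returns a list of (rule, group-or-None) copies."""
    copies = [(f, None) for f in global_rules]
    for g, rules in group_rules.items():
        copies += [(f, g) for f in rules]
    return copies

def awake_set(copies, groups_of_example):
    return [i for i, (_, g) in enumerate(copies)
            if g is None or g in groups_of_example]
```

On each round the learner computes `awake_set` from the example's group memberships G_t, draws from `probs`, and calls `update` with the observed losses; the group-specific copies then enjoy the sleeping regret guarantee exactly on their group's rounds.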
3.2 Subgroup regret with one-sided feedback

We now move our attention to the more realistic setting where we receive the label of the example (and therefore learn the losses of the experts) only if we select the positive outcome (e.g., admit the student to the honors class and observe her performance). This is captured by the so-called apple tasting setting, which dates back to the work of Helmbold et al. [HLL92].

Pay-for-feedback. Our algorithms operate in a more challenging model where, at the beginning of each round, the learner needs to select whether to ask for the label. If she asks for it, she receives it at a cost of 1 instead of the loss of this outcome. Any guarantee for this setting is automatically an upper bound for the apple tasting setting. We can transform any such algorithm into an apple tasting one by selecting the positive outcome at the rounds where we ask for the label and ignoring any extra feedback. The loss of the positive outcome is upper bounded by 1; since we assume the losses to be bounded in [0, 1], the loss in the pay-for-feedback model is therefore only larger.

There are a few reasons why we want to work in this more stringent setting instead of the classical apple tasting setting. First, this feedback model makes it easy to create an estimator that is unbiased (since it does not condition on the prediction for the example and therefore our estimates do not suffer from selection bias). Second, in some applications, this model actually is more appropriate; for example, one may need to poll the participant to learn about their experience (which may be independent of whether they were classified as positive). Finally, this serves as an upper bound on the apple tasting setting; as we will see, arguing about lower bounds in this setting is significantly easier. Using the feedback in a more fine-grained way is an interesting open direction.
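The transformation from pay-for-feedback to apple tasting described above is mechanical; the following sketch makes it explicit under an assumed learner interface (`wants_label`, `predict`, `observe` are our names, not from the paper).

```python
def run_apple_tasting(learner, true_labels):
    """Run a pay-for-feedback learner in the apple-tasting model.
    Whenever the learner would pay for the label we instead predict
    '+', which reveals the label (at loss at most 1, matching the
    cost-1 assumption).  On all other rounds we follow the learner's
    prediction: a '-' prediction yields no feedback, and any extra
    feedback from non-exploration '+' predictions is deliberately
    ignored so that the learner's estimates stay unbiased."""
    predictions = []
    for t, y in enumerate(true_labels):
        if learner.wants_label(t):
            yhat = "+"              # exploration round: label is revealed
            learner.observe(t, y)   # the only feedback we pass along
        else:
            yhat = learner.predict(t)
        predictions.append(yhat)
    return predictions
```

Because the label is requested before seeing the round's prediction, the samples the learner observes do not condition on its own decisions, which is exactly the selection-bias point made above.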
We now offer three guarantees for sleeping experts in the pay-for-feedback setting, all achieved via black-box reductions to full-feedback algorithms (either sleeping experts or classical experts). Although the reductions become more involved as we proceed, no guarantee strictly dominates the others. Our algorithms select random points of exploration (rounds at which they ask for the label to receive feedback). We denote by E the set of rounds in which the algorithm ended up exploring (this is a random variable). For ease of presentation, we assume that we know the size of each demographic group and of all the intersections among groups; if these are not known, we can apply the so-called doubling trick (similarly to the way described in Section 4.2).

First reduction: Independent classical experts algorithms per intersection. The first reduction comes from treating all disjoint intersections between subgroups separately and running separate apple-tasting versions of classical (non-sleeping) experts algorithms on each intersection. Although this provides the optimal dependence on the size of each subpopulation, the guarantee has an exponential dependence on the number of groups. For each disjoint intersection I between groups, let T(I) be the size of this intersection and write g ∈ I when intersection I includes group g. Our algorithm splits the examples that lie in this intersection into (T(I))^{2/3} phases, each consisting of (T(I))^{1/3} examples. In every phase, we select one random point of exploration. Whenever an example arrives that belongs to I, our algorithm follows the advice of a classical experts algorithm (e.g., multiplicative weights) associated with I. This experts algorithm is updated at the end of the phase with the sample from the exploration round. The construction is in the spirit of Awerbuch and Mansour [AM03].

Theorem 3.2.
Let A be an algorithm with regret bounded by O(√(T · log |H|)) when compared to an expert class H and run on T examples, and split the examples into disjoint intersections, where each intersection corresponds to a distinct profile of subgroup memberships. For each intersection I, randomly selecting an exploration point every (T(I))^{1/3} examples and running a separate version of A for each I provides subgroup regret on group g ∈ G of

O( (2^{|G|})^{1/3} · (T(g))^{2/3} · √(log N) ),

where T(g) = |{t : g ∈ G(t)}| is the size of the g-population and N = |F| + Σ_{g∈G} |F(g)|.

Proof sketch. The guarantee follows from three observations, formalized in Appendix A.

1. Among the exploration points, we run a classical experts algorithm, so on these examples we have a regret guarantee that is square-root in the number of such examples.

2. In each phase, the exploration point is randomly selected, and therefore the regret we incur at the exploration point is an unbiased estimator of the average regret we incur in the whole phase (since the distribution of the algorithm is the same throughout the phase). As a result, the total regret in a phase is in expectation (T(I))^{1/3} times the regret at the exploration point.

3. A particular group can have examples in at most 2^{|G|} intersections (as this is the number of possible membership profiles with respect to the demographic groups).

Second reduction: Sleeping experts with fixed exploration phases. Aiming to avoid the exponential dependence on the number of groups, we now apply a sleeping experts algorithm such as AdaNormalHedge as our base algorithm. The algorithm described in this part removes the exponential dependence but introduces a dependence on the time horizon, so the regret guarantee can be suboptimal for minority populations whose size is significantly smaller than the total population.
On the other hand, when all populations are well represented (of the same order as the time horizon), the guarantee has the optimal dependence on the size of the population without suffering in the number of groups. The algorithm splits the examples into T^{2/3} phases and selects one random point in each phase. Each phase consists of T^{1/3} examples in total, but these examples can be distributed arbitrarily across groups. At the end of the phase, we update a sleeping experts algorithm (e.g., AdaNormalHedge) based on the observation at the exploration point.

Theorem 3.3. Let A be an algorithm with sleeping regret bounded by O(√(T(h) · log |H|)) for any expert h in class H. Randomly selecting an exploration point every T^{1/3} examples (irrespective of which groups they come from) and running A on these points provides subgroup regret on group g ∈ G of

O( T^{1/6} · (T(g))^{1/2} · √(log N) ),

where T(g) = |{t : g ∈ G(t)}| is the size of the group-g population and N = |F| + Σ_{g∈G} |F(g)|.

Proof sketch. The guarantee follows from two observations, formalized in Appendix B.

1. Given that we run a sleeping experts algorithm across the exploration points, restricted to those examples we simultaneously satisfy a regret bound that is square-root in their number.

2. Within any phase, the exploration point is selected uniformly at random. As a result, it is an unbiased estimator of the average regret we incur in the whole phase. Note that this average is now over all rounds, and not only over rounds with members of the particular group, which results in the dependence on the time horizon T.

Third reduction: Sleeping experts with adaptive exploration phases. Our final reduction aims to remove the dependence on the time horizon while also avoiding an exponential dependence on the number of groups.
Towards this end, we make the size of the phases adaptive based on the sizes of the populations. The resulting guarantee has both of the aforementioned desired properties; on the negative side, the exponent on the group size is suboptimal (see the discussion at the end of the section).

We again use a sleeping experts algorithm A across phases, but phases are now designed adaptively. At the beginning of each phase r, we initialize a counter per group to capture the number of examples we have seen from it in phase r. When an example arrives, we increase the corresponding counters for all groups related to the example (the example may belong to multiple groups; we then increase all the corresponding counters simultaneously). The phase ends when some group g ∈ G has received (T(g))^{1/4} examples in this phase. At the beginning of a phase r, which starts at time τ_r, we draw for each group g ∈ G a uniformly random number X(g, r) ∈ {1, 2, ..., (T(g))^{1/4}}. This determines the exploration round for group g in phase r: let t(r, g) be the random variable denoting the time at which the X(g, r)-th example from group g (after time τ_r) arrives. If the phase ends before this example arrives, i.e., t(r, g) > τ_{r+1}, then we associate this phase with zero estimated losses for group g: ℓ̃_f(r, g) = 0 for all f ∈ F(g). Otherwise, the estimated loss corresponding to the phase is the loss at the exploration point, i.e., ℓ̃_f(r, g) = ℓ_{t(r,g),f}. Since the phase ends once any group counter reaches its upper bound on examples for the phase, we may not reach the exploration point for some other groups (this is the reason why we may have t(r, g) > τ_{r+1}). Before proceeding to the next phase, we feed the estimated losses ℓ̃(r, g) to A.

Theorem 3.4.
Applying the above algorithm provides subgroup regret on group g ∈ G of

O( |G| · (T(g))^{3/4} · √(log N) ),

where T(g) = |{t : g ∈ G(t)}| is the size of the g-population and N = |F| + Σ_{g∈G} |F(g)|.

Proof sketch. There are three components to the proof of this guarantee, formalized in Appendix C.

1. The number of relevant phases for each group (phases in which it has at least one example) is at most T(g). This provides an upper bound on the number of phases that we need to consider with respect to group g.

2. Using a similar analysis as before, we can obtain a guarantee on the regret we incur at the exploration points and multiply it by (T(g))^{1/4}, the maximum number of group-g examples in a phase. This would yield a completely unbiased estimator if there were no overlap with other groups.

3. A final complication is that exploration may fail to occur due to other groups, so we need to understand how much we lose there. For this, we observe that smaller groups are explored with higher probability (the interaction with larger groups therefore does not significantly affect their probability of inspection). On the other hand, larger groups do not often collide with significantly smaller groups, due to the latter's size.

On the optimality of the bounds. In the pay-for-feedback model, even outside the sleeping experts setting, the best one can hope for is a guarantee of T^{2/3}. If we explore T^a examples, we obtain a regret of T^{a/2} on them. If the estimator is unbiased, we then need to multiply this by T^{1−a}, which gives a regret of T^{1−a/2}. Since we pay for feedback, we also lose T^a. The maximum of the two terms is minimized when a = 2/3. As a result, even if the groups were disjoint, we cannot hope for a better subgroup regret than (T(g))^{2/3}.
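The balancing step above can be written out explicitly:

```latex
% Exploring T^a rounds costs T^a in feedback, while the exploitation
% regret scales as T^{a/2} \cdot T^{1-a} = T^{1-a/2}. Balancing the two:
\min_{a \in [0,1]} \max\left\{ T^{\,1-a/2},\; T^{\,a} \right\}:
\qquad 1 - \tfrac{a}{2} = a \;\Longleftrightarrow\; a = \tfrac{2}{3},
\qquad\text{giving regret } \Theta\!\left(T^{2/3}\right).
```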
Note that our results do not quite achieve this bound: they have either a multiplicative term exponential in the number of groups (Theorem 3.2) or a portion of the regret bound in terms of the total time T rather than T(g) (Theorem 3.3). Achieving a bound of (T(g))^{2/3} without enumerating over all possible disjoint intersections across groups is therefore an interesting open direction.

4 A game-theoretic interpretation

In this section, we provide a game-theoretic interpretation of the above subgroup regret guarantee, connecting it to the notions of Individual Rationality (IR) and Incentive Compatibility (IC).

Individual Rationality. In game theory, a mechanism is considered individually rational when the participants prefer to stay and be served in the system rather than to leave and use their best outside option. Consider each subgroup as a player in a game and the cost experienced by this player as the total loss over all its members; e.g., imagine each group has a representative who looks out for its best interests. This representative has access to the rules in F(g) (private type) and can opt to defect from the global learning system and create its own predictor using only rules in F(g). We say that a learning algorithm induces an IR mechanism if no group has significant incentive to opt out. (We cannot require zero incentive, since the system needs some time to learn.) The subgroup regret guarantee can be thought of as an asymptotic version of the individual rationality property: it ensures that the average benefit from being served outside of the system vanishes as the group size grows, as formalized below.

Definition 4.1. A learning algorithm induces an asymptotically individually rational (IR) mechanism if no group gains (asymptotically) by getting served outside of the system:

∀g ∈ G: Σ_{t : g ∈ G(t)} Σ_{f ∈ F(t)} p_{t,f} ℓ_{t,f} − min_{f* ∈ F(g)} Σ_{t : g ∈ G(t)} ℓ_{t,f*} = o(T(g)).

Incentive Compatibility. A second desired game-theoretic notion is that of incentive compatibility, which states that a player has no incentive to misreport her true type. In our context, we define the type of a group to be the set F(g) of experts that it knows about, and IC means that group g cannot achieve enhanced performance for the group by hiding a subset of F(g) and removing it from the global learning process. Recall that ℓ̂_t(A) = Σ_{f ∈ F(t)} p_{t,f} ℓ_{t,f} denotes the expected loss of algorithm A at round t. We say that a learning algorithm induces an IC mechanism if removing a subset of group-based experts does not improve the performance of the group. More formally, let A(∅) denote the algorithm running with all the experts and nothing removed, and A({g, H}) the algorithm running with the subset H ⊆ F(g) removed from the group-based experts and also from the overall set F. Then:

Definition 4.2. A learning algorithm A induces an asymptotically incentive compatible (IC) mechanism if no group gains (asymptotically) by hiding a subset of its experts:

∀g ∈ G, ∀H ⊆ F(g): Σ_{t : g ∈ G(t)} ℓ̂_t(A(∅)) − Σ_{t : g ∈ G(t)} ℓ̂_t(A({g, H})) = o(T(g)).

As before, we cannot hope to achieve the IC property exactly, as extra experts will inevitably delay the learning process, but we would like to satisfy an approximate version of this property where the average benefit from removing some experts vanishes as the group size grows. This is a desirable property: when satisfied, it suggests that groups should not worry unduly about potential adverse effects of suggesting particular predictors, but should instead provide all their proposed rules.
4.1 Roadblocks in applying sleeping experts towards Incentive Compatibility

In this section, we show a strong negative result about classical sleeping experts algorithms with respect to the IC property. To make this formal, we present the result for the AdaNormalHedge algorithm of Luo and Schapire [LS15], but similar intuition carries over to other sleeping experts algorithms (see Section 4.3).

AdaNormalHedge. AdaNormalHedge [LS15] is an algorithm with strong adaptive regret guarantees. Its sleeping experts version starts with a set of experts of cardinality N and a prior distribution q that is typically initialized uniformly: q_i = 1/N for all i ∈ [N]. Every expert keeps two quantities: R_{t,i}, capturing the total regret it has experienced so far in the rounds in which it fired, and C_{t,i}, capturing the cumulative magnitude of that instantaneous regret. These parameters determine the weight of the expert, expressed via a potential function Φ(R, C) = exp(max(0, R)² / (3C)), which gives rise to the weight function w(R, C) = ½ (Φ(R + 1, C + 1) − Φ(R − 1, C + 1)). More formally, both expert quantities are initialized to 0, i.e., R_{0,i} = C_{0,i} = 0. At each round t = 1, ..., T, a set A(t) of experts is activated, and the learner predicts with probability proportional to the weight of each firing expert: p_{t,i} ∝ q_i · w(R_{t−1,i}, C_{t−1,i}) · 1{i ∈ A(t)}.[4] The adversary then reveals the loss vector ℓ_t and the learner suffers loss ℓ̂_t = Σ_{i ∈ [N]} p_{t,i} ℓ_{t,i}. This gives rise to an instantaneous regret for each firing expert, r_{t,i} = (ℓ̂_t − ℓ_{t,i}) · 1{i ∈ A(t)}, which is used to update the expert parameters R_{t,i} = R_{t−1,i} + r_{t,i} and C_{t,i} = C_{t−1,i} + |r_{t,i}| before proceeding to the next round.
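A minimal sketch of the update above; the function names are illustrative, and the convention Φ(R, 0) = 1 for freshly initialized experts is our assumption for the edge case.

```python
import math

# Sketch of the AdaNormalHedge potential and weight functions described
# above: Phi(R, C) = exp(max(0, R)^2 / (3C)) and
# w(R, C) = (Phi(R+1, C+1) - Phi(R-1, C+1)) / 2.

def phi(R, C):
    # Assumed convention: Phi(R, 0) = 1 for a freshly initialized expert.
    return 1.0 if C == 0 else math.exp(max(0.0, R) ** 2 / (3.0 * C))

def weight(R, C):
    return 0.5 * (phi(R + 1.0, C + 1.0) - phi(R - 1.0, C + 1.0))

def probabilities(q, R, C, awake):
    """Prediction distribution over the awake experts, proportional to
    prior q_i times weight; uniform over the awake set if all weights
    vanish (the convention adopted in the text)."""
    raw = [q[i] * weight(R[i], C[i]) if i in awake else 0.0
           for i in range(len(q))]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(awake) if i in awake else 0.0
                for i in range(len(q))]
    return [x / total for x in raw]
```

After each round, every awake expert i is then updated with r = ℓ̂ − ℓ_i via R[i] += r and C[i] += |r|.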
The regret of this algorithm with respect to an expert i ∈ [N] is roughly of order √(C_{T,i} · log N), which in the sleeping experts version gives a sleeping regret of √(|{t : i ∈ A(t)}| · log N). This is why using such an algorithm for subgroup regret provides a guarantee of √(T(g) · log N), ignoring constants.

The IC lower bound. Although the subgroup regret guarantee that sleeping experts provide makes them satisfy an asymptotic version of the IR property, we now show that this is not the case for the IC property; we illustrate this for AdaNormalHedge and discuss it further in Section 4.3.

Theorem 4.1. AdaNormalHedge does not induce an asymptotically incentive compatible mechanism, i.e., there exists an instance where a group can asymptotically benefit from hiding one of its experts.

Proof. Consider a setting with two groups where the bigger group B consists of the whole population and the smaller group S corresponds to half of the population. Every odd round an example from S arrives and every even round an example from B \ S arrives. The algorithm has access to one global expert, F = {f}, as well as one group-specific expert per group: F(g) = {f(g)} for g ∈ {B, S}.

• Both group-specific experts always have loss ℓ_{t,f(g)} = 0.2 on the rounds where they fire, i.e., when g ∈ G(t).

• The global expert is very bad at predicting group S: ℓ_{t,f} = 1 on rounds with examples from S; but it makes no mistakes on the remaining population: ℓ_{t,f} = 0 on rounds with examples from B \ S.

[4] Luo and Schapire [LS15] predict arbitrarily if all weights are 0; we commit to selecting uniformly at random in that case.

The high-level idea of why IC does not hold is the following. Group B prefers to use expert f(S) on members of group S and expert f on B \ S; this is achieved if it hides expert f(B). If it does not, AdaNormalHedge often ends up using expert f(B) on B \ S, which leads to a much higher loss.
Intuitively, since f(B) and f make predictions on exactly the same set of examples and f(B) has lower total loss, f(B) gets much higher weight than f; the algorithm therefore uses f(B) instead of f on B \ S.

More formally, suppose first that group B hides expert f(B). The sleeping regret guarantee for expert f(S) ensures that f(S) is selected in all but a vanishingly small number of the rounds in which it fires. As a result, asymptotically, the loss accumulated from group S is 0.2 · (T/2). Given that f(B) is not an option (as it is hidden), on members of B \ S the algorithm can only select expert f, incurring 0 loss.

We now show that, if f(B) is not hidden, the algorithm incurs more loss on members of B \ S without gaining on S. In this case, we show below that expert f is selected with probability p_{t,f} ≤ 1/2 after a few initial rounds. This leads to an additional loss of at least 0.2 · (T/4) on examples of B \ S, since in expectation, in at least half of the even rounds, we select expert f(B), which has higher loss than f on these examples. As a result, by not hiding expert f(B), the algorithm selects it often, which leads to an additional loss that is linear in T; this directly implies that group B is better off hiding this expert, enhancing its overall performance.

What is left is to show why the probability of selecting expert f is indeed p_{t,f} ≤ 1/2 after a few initial steps. The sleeping regret guarantee for expert f(S) implies that the cumulative (across rounds) probability of selecting expert f on examples in S until round t is at most √t, up to constants and logarithmic factors. As a result, the expected loss of the algorithm until round t is at most 0.2 · t/2 + √t in total, and the cumulative instantaneous regrets of experts f and f(B) on odd rounds (i.e., members of group S) satisfy

Σ_{τ ≤ t, τ odd} r_{τ,f} ≤ −0.8 · t/2 + √t and Σ_{τ ≤ t, τ odd} r_{τ,f(B)} ≤ √t

respectively.
On even rounds (i.e., members of B \ S), the algorithm selects either f or f(B); therefore its loss is at most 0.2, and the magnitude of each instantaneous regret on these rounds is at most 0.2. This means that the cumulative instantaneous regrets on these rounds satisfy

|Σ_{τ ≤ t, τ even} r_{τ,f}| ≤ 0.2 · t/2 and |Σ_{τ ≤ t, τ even} r_{τ,f(B)}| ≤ 0.2 · t/2

respectively. As a result, after a few initial rounds, the cumulative instantaneous regret of f is consistently negative and also smaller than that of f(B). This means that, by the construction of the potential function, the weight of f is 0, as Φ(R_{t,f} + 1, C + 1) = Φ(R_{t,f} − 1, C + 1) = 1. Since Φ is increasing in its first argument and the selection probability is proportional to the weight, f has the smallest probability among all firing experts and therefore a probability of being selected of at most 1/2, which is what we wanted to show. Note that, when f(B) is hidden, the weight of f still becomes 0, but f is then the sole alternative and is therefore always selected on members of B \ S.

The reason this negative result arises is that, by hiding a subset of the experts, the group can potentially lead the algorithm to use different experts on different disjoint parts of the population. Sleeping experts algorithms penalize each expert for its overall performance at the times when it fires and do not distinguish disjoint subpopulations where it performs much better.

4.2 Incentive Compatibility in a computationally inefficient way

Multiplicative weights for each disjoint subgroup. We now turn our attention to the use of separate algorithms for each disjoint intersection of groups. This suffers from an exponential dependence on the number of groups but satisfies both IR and IC.
In some sense, it therefore serves as an existential proof that satisfying these properties in an online manner is feasible, and it raises an intriguing open question of whether this can be achieved in a computationally efficient manner.

We run a separate multiplicative weights algorithm for each disjoint group intersection. More formally, for every S ∈ 2^G, we run sub-algorithm A(S) on the examples t such that g ∈ S if and only if g ∈ G(t). We denote by A_t the sub-algorithm associated with the example at round t. We assume that experts are not added adaptively in this part (if a new expert appears, we reinitialize the algorithm). The sub-algorithm A(S) has as its experts all experts that fire on examples associated with S, i.e., all members of F and of F(g) for g ∈ S. We let N(A) denote the number of these experts for sub-algorithm A. For sub-algorithm A, each expert i is initialized with weight w_{1,i}(A) = 1. The algorithm selects experts proportionally to the weights in the corresponding sub-algorithm, i.e., p_{t,i} = w_{t,i}(A_t) / Σ_j w_{t,j}(A_t). We then multiplicatively update the weights of A_t only, with learning rate η: w_{t+1,i}(A_t) = w_{t,i}(A_t) · (1 − η)^{ℓ_{t,i}}, leaving the weights of all other sub-algorithms unchanged. Denoting by T(A) the size of the corresponding disjoint subgroup, we should set η = √(log(N(A)) / T(A)) for a guarantee sublinear in this size. To deal with the fact that we do not know the size of each disjoint subgroup in advance, we apply the so-called doubling trick: we assume the size is 2^r (initially r = 2) and reinitialize the algorithm with increased r whenever the guess is exceeded; this can happen at most log(T) times.

This algorithm satisfies the vanishing subgroup regret property (and thus also the asymptotic IR property), as any group consists of multiple disjoint subgroups and the group has vanishing regret within each of them; other than the regret term, it cannot hope to do better when served outside of the system with its own functions.
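A sketch of the per-intersection construction above, with the doubling trick included; the class and parameter names (and the initial horizon guess) are illustrative.

```python
import math

# Sketch of running one multiplicative-weights sub-algorithm per disjoint
# intersection (membership profile). A profile S is a frozenset of group
# names; experts_for_profile(S) lists the experts firing on S.

class PerIntersectionMW:
    def __init__(self, experts_for_profile, initial_horizon=4):
        self.experts_for_profile = experts_for_profile
        self.weights, self.rounds, self.horizon = {}, {}, {}
        self.initial_horizon = initial_horizon

    def _init_weights(self, S):
        self.weights[S] = {f: 1.0 for f in self.experts_for_profile(S)}

    def distribution(self, S):
        if S not in self.weights:
            self._init_weights(S)
            self.rounds[S], self.horizon[S] = 0, self.initial_horizon
        total = sum(self.weights[S].values())
        return {f: w / total for f, w in self.weights[S].items()}

    def update(self, S, losses):
        """losses: dict expert id -> loss in [0, 1] for a round with profile S."""
        self.distribution(S)                   # ensure sub-algorithm exists
        self.rounds[S] += 1
        if self.rounds[S] > self.horizon[S]:   # doubling trick: restart
            self.horizon[S] *= 2
            self._init_weights(S)
        n = len(self.weights[S])
        eta = math.sqrt(math.log(max(n, 2)) / self.horizon[S])
        for f in self.weights[S]:
            self.weights[S][f] *= (1.0 - eta) ** losses[f]
```

Each sub-algorithm only ever sees the rounds of its own intersection, which is what makes the per-intersection regret (and hence IR and IC) arguments go through.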
We now show that these separate multiplicative weights algorithms also achieve the asymptotic IC property. For this, we use a nice property of multiplicative weights: it not only does no worse than the best expert, but it also does not do better. This was first formalized by Gofer and Mansour [GM16] and was used in a fairness context by Blum et al. [BGLS18] to establish that equality of average losses across different groups is preserved when combining experts that have equally good performance across these groups.

Theorem 4.2. Running separate multiplicative weights algorithms for each disjoint intersection among groups induces an asymptotically IC mechanism, i.e., no group can asymptotically benefit by hiding any of its experts.

Proof. One essential step of the proof is the property that multiplicative weights with a fixed learning rate has performance almost equal to that of the best expert. This is shown in the proof of Theorem 3 in [BGLS18], which establishes that, for any sub-algorithm and any fixed learning rate η, the performance of the sub-algorithm equals the performance L* of the best expert in these rounds plus, up to constants, η · L* + log(N(A))/η. Since we run each sub-algorithm with a fixed learning rate for at most 2^r rounds, this is, up to constants, less than √(2^r · log(N(A))). Summing across all r = 2, ..., log T(A), this is at most of order √(T(A) · log(N(A))).

To now establish the asymptotic IC property, note that the intervals with fixed learning rates are not affected by any group's decision to hide a subset of its predictors. As a result, if some predictor is removed, multiplicative weights will still have performance (asymptotically) close to that of the best remaining expert; this performance can only be worse (up to the regret term).
This holds for any disjoint subgroup; it therefore still holds when summing across all the (possibly exponentially many) disjoint subgroups.

4.3 Open problem: Computationally efficient Incentive Compatibility

In Section 4.1, we showed that AdaNormalHedge does not induce an IC mechanism: it maintains a single weight per expert, which can deteriorate significantly on some disjoint subgroups and render an expert unusable in places where it is essential; hiding some experts can then help. Maintaining a single weight per sleeping expert and updating it based on its instantaneous regret is not unique to AdaNormalHedge but is present in other algorithms as well, such as that of Blum and Mansour [BM07]. Since these algorithms do not distinguish between instantaneous regret obtained on different disjoint subgroups, the same construction extends to them. On the other hand, in Section 4.2 we showed that separate multiplicative weights algorithms for each disjoint subgroup provide the asymptotic IC property, as they have the nice property that the performance of the algorithm is essentially as good as that of the best expert in the subgroup. On the negative side, having one sub-algorithm per disjoint subgroup is computationally costly, as it requires an exponential number of sub-algorithms. These two results lead to the following open question: can we design a computationally efficient algorithm (one not enumerating over all disjoint subgroups) that satisfies the subgroup regret guarantee together with asymptotic IC? Besides its application to subgroup fairness, this question has independent technical interest, as it seems to require a fundamentally new approach to the sleeping experts problem.
5 Impossibility results for fixed averages of FNR and FPR

In this section we demonstrate that if, rather than minimizing the fraction of errors, each group wishes to minimize the unweighted average of its False Negative Rate (FNR) and False Positive Rate (FPR), or any fixed nontrivial weighted average of these two quantities, then guarantees of the form given in earlier sections are intrinsically impossible when there is no zero-error predictor.[5]

False negative and false positive rates. Many fairness notions explicitly consider false negative and false positive rates. For example, the notion of Equalized Odds [HPS16] imposes that both these rates be the same for all groups of interest, and the notion of Equality of Opportunity [HPS16] imposes equality of just the FNR. Given the attention paid to false negative and false positive rates, it is natural to consider the objective of minimizing their average (or minimizing their maximum, to which our negative results also apply). More formally, under this objective, the loss of the algorithm corresponds to

(1/|{t : y_t = +}|) Σ_{t : y_t = +} ℓ̂_t + (1/|{t : y_t = −}|) Σ_{t : y_t = −} ℓ̂_t.

If we have just one group and aim to achieve performance comparable to that of the best expert f* ∈ F(g) under this objective, i.e., (1/|{t : y_t = +}|) Σ_{t : y_t = +} ℓ_{t,f*} + (1/|{t : y_t = −}|) Σ_{t : y_t = −} ℓ_{t,f*}, this is feasible even in an online setting by running a no-regret algorithm and weighting each example inversely to the number of examples with the same label. Also note that, unlike balance notions, if groups are disjoint then simultaneously optimizing this objective for all groups does not create an inherent conflict between groups, e.g., forcing performance on some group to be worse than necessary.
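The single-group reweighting trick mentioned above can be sketched as follows (the helper names are ours): each example receives weight inversely proportional to the number of examples sharing its label, so the importance-weighted total loss equals the FNR/FPR objective.

```python
# Sketch of the label-balancing reduction: the objective
#   (1/n_pos) * (sum of losses on positives) + (1/n_neg) * (sum on negatives)
# equals the sum of per-example losses reweighted by 1 / (count of the
# example's label class). Names are illustrative.

def balanced_objective(labels, losses):
    """labels: list of +1/-1; losses: per-example losses in [0, 1]."""
    pos = [l for y, l in zip(labels, losses) if y == +1]
    neg = [l for y, l in zip(labels, losses) if y == -1]
    return sum(pos) / len(pos) + sum(neg) / len(neg)

def label_weights(labels):
    """Per-example importance weights realizing the same objective."""
    n_pos = sum(1 for y in labels if y == +1)
    n_neg = len(labels) - n_pos
    return [1.0 / (n_pos if y == +1 else n_neg) for y in labels]
```

Running a standard no-regret algorithm on these importance-weighted losses then competes with the best expert under this objective (in the online setting, the label counts can be estimated on the fly).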
Unfortunately, we show that while minimizing the unweighted average of FNR and FPR seems like a benign objective, it can be impossible to produce a global prediction strategy that, on each group, performs nearly as well as the best predictor given for that group under this objective. That is, even though all groups appear to have objectives that are "aligned", they may not be simultaneously satisfiable. This impossibility holds for purely statistical reasons, even in the batch setting, and has nothing to do with the online or non-i.i.d. nature of the examples. Note that if the populations are not overlapping, the batch setting for this goal is straightforward: among the available classifiers, select for each group the one with the best performance according to the given objective. The following theorem shows that, with overlapping populations, this is no longer possible.

[5] Minimizing just the FNR (resp. just the FPR) is trivial: predict positive (resp. negative) on every example.

Theorem 5.1. Consider overlapping groups wishing to achieve performance on their members comparable to that of their group-specific predictors with respect to the objective of the unweighted average of false positive and false negative rates. Even in the batch setting, there exist instances where it is impossible to simultaneously achieve this goal for both groups.

Proof. Assume two groups G = {A, B}, each having 80% of its examples disjoint from the other group and the remaining 20% in common, as illustrated in Figure 1. All the non-overlapping examples of group A have positive label and all the non-overlapping examples of group B have negative label. In the shared portion, the labels are uniformly distributed: conditioned on being in the intersection, the probability of the label being positive is 0.5. Regarding the predictors, we have two: F = {f_A, f_B}.
Predictor f_A correctly predicts positive on the non-overlapping examples (A \ B) and predicts negative on the overlapping part A ∩ B. The false positive rate of this predictor on A is 0 and its false negative rate is 1/9. Analogously, predictor f_B predicts negative on the examples in B \ A and predicts positive on A ∩ B, resulting in a false positive rate on B of 1/9 and a false negative rate of 0. Therefore, for both g ∈ {A, B}, f_g has a generalization error of 1/18 with respect to the unweighted average of false positive and false negative rates.

We now show that this performance cannot (even approximately) be simultaneously achieved for both groups by combining these two predictors, even in the batch setting. In particular, since the examples in A ∩ B all have uniformly random labels, we can only select some probability p of predicting a positive label for points in the intersection. Even if we perfectly classify all the examples not in the intersection, if p > 1/2 then the false positive rate on A exceeds 1/2 (as more than half of A's negative examples are misclassified); otherwise, the false negative rate on B is at least 1/2. This means that one of the two populations necessarily has an unweighted average of false positive and false negative rates of at least 1/4 and therefore does not achieve performance even approximately as good as that of its best predictor.

Discussion. The above result illustrates the additional complications caused by overlapping groups and highlights a subtlety in the positive results of the previous sections. The key distinction is that, for the objective of average loss (or error rate), the average contribution of any given example is the same across all groups: a misclassification is equally damaging for all groups an example belongs to.
This is not the case in the setting considered here of minimizing the average (or maximum, or any fixed nontrivial combination) of FPR and FNR, where the harm of each misclassified example needs to be weighted differently across groups depending on their proportions of positive and negative examples.

Figure 1: Construction illustrating that the unweighted average of false negative and false positive rates cannot be simultaneously optimized with respect to overlapping populations (Theorem 5.1).

6 Conclusions

In this paper, we consider settings where different overlapping populations coexist and we wish to design algorithms that do not treat any population unfairly. We consider a notion of fairness that corresponds to predicting as well as humanly (or algorithmically) possible on each given group, rather than based on requirements for equality. This framework can directly incorporate a designer's goal of good overall prediction by creating one extra group (for the designer) that includes all the examples. Our results extend to the more realistic one-sided feedback (apple tasting) setting and have a nice game-theoretic interpretation.

Our work makes a step towards identifying fairness notions that work well in online settings and are compatible with the existence of multiple parties, each with their own interests in mind. Our impossibility results with respect to the average (or maximum) of false negative and false positive rates demonstrate that satisfying the interests of different overlapping populations is quite subtle, further highlighting the positive results. Regarding incentives, we show how to efficiently achieve Individual Rationality and how to inefficiently achieve both Individual Rationality and Incentive Compatibility.
Achieving IR and IC together in a computationally efficient way (without enumerating across all disjoint intersections of subgroups) is a very interesting question that seems to require novel learning-theoretic ideas, as we discuss in Section 4.3. On the apple tasting front, we provide three distinct guarantees; the resulting algorithms are near-optimal but suffer from orthogonal shortcomings, as we discuss in Section 3.2. Avoiding these and achieving an optimal guarantee for sleeping experts with apple tasting is an interesting open question.

Acknowledgements

The authors would like to thank Suriya Gunasekar for various useful discussions during the initial stages of this work and Manish Raghavan for offering comments on a preliminary version of the work.
A Proof of Theorem 3.2

Theorem 3.2 restated. Let A be an algorithm with regret bounded by $O\big(\sqrt{T \cdot \log(|\mathcal{H}|)}\big)$ when compared to an expert class $\mathcal{H}$ run on $T$ examples, and split the examples into disjoint intersections, where each intersection corresponds to a distinct profile of subgroup memberships. For each intersection $I$, randomly selecting an exploration point every $(T(I))^{1/3}$ examples and running separate versions of A for each $I$ provides subgroup regret on group $g \in \mathcal{G}$ of
$$O\Big(\big(2^{|\mathcal{G}|}\big)^{1/3} \cdot (T(g))^{2/3} \cdot \sqrt{\log(N)}\Big)$$
where $T(g) = |\{t : g \in \mathcal{G}(t)\}|$ is the size of the $g$-population and $N = |\mathcal{F}| + \sum_{g \in \mathcal{G}} |\mathcal{F}(g)|$.

Proof. The guarantee follows from three observations:

1. Among the exploration points, we run a classical experts algorithm, so on these examples we have a regret guarantee that is square-root of the number of these examples.

2. For each phase, the exploration point is randomly selected, and therefore the regret we incur at the exploration point is an unbiased estimator of the average regret we incur in the whole phase (since the distribution of the algorithm is the same throughout the phase). As a result, the total regret in a phase is in expectation $(T(I))^{1/3}$ times the regret at the exploration point.

3. A particular group can have examples in at most $2^{|\mathcal{G}|}$ intersections (as this is the number of possible membership profiles with respect to the demographic groups).

We now formalize these three ideas to obtain the guarantee.
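The scheduling behind observations 1 and 2 can be sketched in code. This is an illustrative sketch, not the authors' implementation: the class `PhasedExplorer`, the toy stream, and the fixed phase length 4 (standing in for $(T(I))^{1/3}$) are invented for illustration, and the actual experts update is left as a comment.

```python
import random
from collections import defaultdict

class PhasedExplorer:
    """One learner per intersection: fixed-length phases, one uniformly
    random exploration (feedback) point per phase; the underlying experts
    algorithm would be updated only with the explored example's loss."""

    def __init__(self, phase_len, rng):
        self.phase_len = phase_len
        self.rng = rng
        self.buffer = []    # examples of the current, unfinished phase
        self.explored = []  # one explored example per completed phase

    def observe(self, example):
        self.buffer.append(example)
        if len(self.buffer) == self.phase_len:
            # pick the single exploration point of this phase uniformly
            self.explored.append(self.rng.choice(self.buffer))
            self.buffer = []  # the experts update would happen here

rng = random.Random(0)
learners = defaultdict(lambda: PhasedExplorer(phase_len=4, rng=rng))

# toy stream: each round t carries its membership profile (the
# "intersection"); rounds divisible by 3 belong to both groups
stream = [(t, frozenset({"A", "B"}) if t % 3 == 0 else frozenset({"A"}))
          for t in range(24)]
for t, profile in stream:
    learners[profile].observe(t)

# one exploration point per completed phase of each intersection:
# 16 A-only examples -> 4 phases; 8 examples in A∩B -> 2 phases
assert len(learners[frozenset({"A"})].explored) == 4
assert len(learners[frozenset({"A", "B"})].explored) == 2
```

Because the exploration point is drawn uniformly within the phase and the learner is frozen until the phase ends, its loss is an unbiased estimate of the phase's average loss, which is exactly what the argument below exploits.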
Let $\mathcal{F}(I)$ be the set of experts that are either global or belong to some $g \in I$, and update $\mathcal{F}(g)$ to include both the global experts and the group-specific experts. Initially we split the performance of our algorithm across all the intersections $I$ such that $g \in I$. Let $f^\star$ be the comparator with the minimum cumulative loss on $g$:
$$\mathrm{Reg}(g) = \sum_{t : g \in \mathcal{G}(t)} \sum_{f \in \mathcal{F}(g)} p_{t,f}\,\ell_{t,f} - \sum_{t : g \in \mathcal{G}(t)} \ell_{t,f^\star} = \sum_{I : g \in I}\; \sum_{t : g \in \mathcal{G}(t) \cap I} \Big( \sum_{f \in \mathcal{F}(g)} p_{t,f}\,\ell_{t,f} - \ell_{t,f^\star} \Big).$$
Focusing on a particular intersection $I$, we connect its regret to the performance at the exploration points. Denote by $t(r, I)$ the exploration point for phase $r$ in intersection $I$, and by $\tau_r$ the beginning of the $r$-th phase. We use the fact that the exploration point is an unbiased estimate of the regret in a phase (as it is selected uniformly and the algorithm is only updated at the end of the phase). Applying the guarantee for A at the exploration times of all phases, we obtain:
$$\sum_{t : g \in \mathcal{G}(t) \cap I} \Big( \sum_{f \in \mathcal{F}(g)} p_{t,f}\,\ell_{t,f} - \ell_{t,f^\star} \Big) = \sum_{r=1}^{(T(I))^{2/3}} \sum_{t=\tau_r}^{\tau_{r+1}-1} \Big( \sum_{f \in \mathcal{F}(g)} p_{t,f}\,\ell_{t,f} - \ell_{t,f^\star} \Big) \cdot \mathbf{1}\{g \in \mathcal{G}(t) \cap I\}$$
$$= (T(I))^{1/3} \cdot \sum_{r=1}^{(T(I))^{2/3}} \mathbb{E}\Big[ \sum_{f \in \mathcal{F}(g)} p_{t(r,I),f}\,\ell_{t(r,I),f} - \ell_{t(r,I),f^\star} \Big] \le (T(I))^{1/3} \sqrt{(T(I))^{2/3} \log(|\mathcal{F}|)} = (T(I))^{2/3} \cdot \sqrt{\log(|\mathcal{F}|)},$$
where the expectations are taken over the random selections of $t(r, I)$. Combining the above and applying Hölder's inequality, we obtain:
$$\mathrm{Reg}(g) = \sum_{I : g \in I}\; \sum_{t : g \in \mathcal{G}(t) \cap I} \Big( \sum_{f \in \mathcal{F}(g)} p_{t,f}\,\ell_{t,f} - \ell_{t,f^\star} \Big) \le \sum_{I : g \in I} (T(I))^{2/3} \cdot \sqrt{\log(|\mathcal{F}|)} \le \Big( \sum_{I : g \in I} 1 \Big)^{1/3} \cdot \Big( \sum_{I : g \in I} T(I) \Big)^{2/3} \sqrt{\log(|\mathcal{F}|)} \le \big(2^{|\mathcal{G}|}\big)^{1/3} \cdot (T(g))^{2/3} \cdot \sqrt{\log(|\mathcal{F}|)}.$$
Finally, we need to control how much we are losing via the exploration rounds.
Since we apply pay-per-feedback, we lose 1 every time we inspect (an upper bound on the apple tasting loss). For any intersection, we have exactly $(T(I))^{2/3}$ such exploration points, so we lose nothing beyond what we accounted for above, which completes the proof.

B Proof of Theorem 3.3

Theorem 3.3 restated. Let A be an algorithm with sleeping regret bounded by $O\big(\sqrt{T(h) \cdot \log(|\mathcal{H}|)}\big)$ for any expert $h \in \mathcal{H}$, where $\mathcal{H}$ is any class. Randomly selecting an exploration point every $T^{1/3}$ examples (irrespective of what groups they come from) and running A on these points provides subgroup regret on group $g \in \mathcal{G}$ of $O\big(T^{1/6} \cdot (T(g))^{1/2} \cdot \sqrt{\log(N)}\big)$ where $T(g) = |\{t : g \in \mathcal{G}(t)\}|$ is the size of the $g$-population and $N = |\mathcal{F}| + \sum_{g \in \mathcal{G}} |\mathcal{F}(g)|$.

Proof. The guarantee follows from two observations:

1. Given that we run a sleeping experts algorithm across the exploration points, if we just focus on those examples, we simultaneously satisfy regret on them that is square-root of their number.

2. Within any phase, the exploration point is selected uniformly at random. As a result, it is an unbiased estimator of the average regret we incur in the whole phase. Note that this average is now across all rounds and not only rounds where we have members of the particular group, which results in the dependence on the time horizon $T$.

We now formalize these ideas. As before, we connect the regret on a group to the regret at the exploration points. Denote by $t(r)$ the random variable that corresponds to the exploration point of phase $r$. Now the size of the phases is fixed in advance, so the $r$-th phase starts at $\tau_r = r \cdot T^{1/3}$. As before, we denote by $f^\star$ the comparator with the minimum cumulative loss on $g$. Applying linearity of expectation and Jensen's inequality, we obtain:
$$\mathrm{Reg}(g) = \sum_{t : g \in \mathcal{G}(t)} \sum_{f \in \mathcal{F}(g)} p_{t,f}\,\ell_{t,f} - \sum_{t : g \in \mathcal{G}(t)} \ell_{t,f^\star}$$
$$= \sum_{r=1}^{T^{2/3}} \sum_{t=\tau_r}^{\tau_{r+1}-1} \Big( \sum_{f \in \mathcal{F}(g)} p_{t,f}\,\ell_{t,f} - \ell_{t,f^\star} \Big) \cdot \mathbf{1}\{g \in \mathcal{G}(t)\} = T^{1/3} \cdot \sum_{r=1}^{T^{2/3}} \mathbb{E}\Big[ \Big( \sum_{f \in \mathcal{F}(g)} p_{t(r),f}\,\ell_{t(r),f} - \ell_{t(r),f^\star} \Big) \cdot \mathbf{1}\{g \in \mathcal{G}(t(r))\} \Big]$$
$$\le T^{1/3} \cdot \mathbb{E}\bigg[ \sqrt{\sum_{r=1}^{T^{2/3}} \mathbf{1}\{g \in \mathcal{G}(t(r))\} \cdot \log(N)} \bigg] \le T^{1/3} \cdot \sqrt{\mathbb{E}\bigg[ \sum_{r=1}^{T^{2/3}} \mathbf{1}\{g \in \mathcal{G}(t(r))\} \bigg] \cdot \log(N)} = T^{1/3} \cdot \sqrt{T(g) \cdot T^{-1/3} \cdot \log(N)} = T^{1/6} \cdot \sqrt{T(g) \cdot \log(N)}.$$
The expectation is taken over the random selections of the exploration times $t(r)$. The second-to-last equality holds as each member of group $g$ has probability $T^{-1/3}$ of being an exploration point, and therefore the expected number of exploration points in the group is $T(g) \cdot T^{-1/3}$.

Finally, for the pay-for-feedback model, we also need to consider the effect of inspected rounds on the loss (we lose 1 every time we inspect). The number of such rounds on examples in $g$ is at most $T(g) \cdot T^{-1/3} \le (T(g))^{2/3}$, so it is lower order than the term we already have.

C Proof of Theorem 3.4

Theorem 3.4 restated. Applying the algorithm described in the third reduction (Section 3.2) provides subgroup regret on group $g \in \mathcal{G}$ of $O\big(|\mathcal{G}| \cdot (T(g))^{3/4} \cdot \sqrt{\log(N)}\big)$ where $T(g) = |\{t : g \in \mathcal{G}(t)\}|$ is the size of the $g$-population and $N = |\mathcal{F}| + \sum_{g \in \mathcal{G}} |\mathcal{F}(g)|$.

Proof. There are three important components to prove this guarantee.

1. The number of relevant phases of each group (phases where the group has at least one example) is at most $T(g)$. This provides an upper bound on the number of phases that we need to consider with respect to group $g$.

2. Using a similar analysis as before, we can create a guarantee about the regret we incur at the exploration points and multiply it by $(T(g))^{1/4}$, which is the size of the phase.
This would have been a completely unbiased estimator if there were no overlap with other groups.

3. A final complication is that exploration may occur due to other groups, so we need to understand how much we lose there. For that, we observe that smaller groups are explored with higher probability (the interaction with larger groups therefore does not significantly increase their probability of inspection). On the other hand, larger groups do not often collide with significantly smaller groups, due to the latters' size.

More formally, to analyze the subgroup regret for $g \in \mathcal{G}$, let's consider a fictitious setting where all phases have equal size for group $g$. This can be done by padding zeros at the end of each phase. This fictitious setting has the same loss as the original (it only differs in having some additional examples with zero loss). For this fictitious setting, the exploration point is an unbiased estimator of the average regret. We first analyze the subgroup regret assuming that the inspection points of other groups do not overlap with $g$. Applying similar ideas as in the previous theorem:
$$\mathrm{Reg}(g) = \sum_{t : g \in \mathcal{G}(t)} \sum_{f \in \mathcal{F}(g)} p_{t,f}\,\ell_{t,f} - \sum_{t : g \in \mathcal{G}(t)} \ell_{t,f^\star} = \sum_{r=1}^{T(g)} \sum_{t=\tau_r}^{\tau_{r+1}-1} \Big( \sum_{f \in \mathcal{F}(g)} p_{t,f}\,\ell_{t,f} - \ell_{t,f^\star} \Big) \cdot \mathbf{1}\{g \in \mathcal{G}(t)\}$$
$$= (T(g))^{1/4} \cdot \sum_{r=1}^{T(g)} \mathbb{E}\Big[ \Big( \sum_{f \in \mathcal{F}(g)} p_{t(r,g),f}\,\ell_{t(r,g),f} - \ell_{t(r,g),f^\star} \Big) \cdot \mathbf{1}\{g \in \mathcal{G}(t(r,g)),\; t(r,g) \le \tau_{r+1}\} \Big]$$
$$= (T(g))^{1/4} \cdot \sum_{r=1}^{T(g)} \mathbb{E}\Big[ \sum_{f \in \mathcal{F}(g)} p_{t(r,g),f}\,\tilde{\ell}_f(r,g) - \tilde{\ell}_{f^\star}(r,g) \Big] \le (T(g))^{1/4} \cdot \sqrt{T(g) \log(N)} = (T(g))^{3/4} \sqrt{\log(N)}.$$
The expectation is taken over the random selections of the exploration times $t(r, g)$. The last inequality holds as the number of non-zero entries in the previous sum is at most the number of phases, and we run A on the estimated losses.
Let's now understand how much group $g$ is harmed by the fact that some of its examples may be inspected by overlapping groups; we need to count the regret contribution of these examples as well. Note that each group has at most one exploration point within a phase; this is the reason why we get the dependence on $|\mathcal{G}|$ in our bound. Group $g$ is only affected by others' exploration if it happens on examples that are also members of group $g$.

We first consider the effect of groups smaller than $g$. For any group $g'$ with $T(g') \le T(g)$, the exploration points of group $g'$ number at most the size of the group times the probability that each example is an exploration point for $g'$, i.e. $T(g') \cdot (T(g'))^{-1/4} = (T(g'))^{3/4}$. Since the size of $g'$ is smaller than the size of $g$, this contributes at most an extra term of $(T(g))^{3/4}$ to the regret.

Let's now consider groups $g'$ with $T(g') > T(g)$. Then, even if all examples of $g$ are also examples of $g'$, the probability that each of them is an exploration point due to $g'$ is $(T(g'))^{-1/4}$. Hence the expected number of exploration points of $g'$ on examples in $g$ is at most $T(g) \cdot (T(g'))^{-1/4} \le (T(g))^{3/4}$.
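The two-sided case analysis above reduces to a simple arithmetic fact, which a quick numeric check (with invented group sizes) illustrates: whether the interfering group $g'$ is smaller or larger than $g$, the expected number of its exploration points landing in $g$ stays below $(T(g))^{3/4}$.

```python
# Illustrative sanity check of the overlap-accounting step in the proof of
# Theorem 3.4; the concrete sizes below are invented for the example.

def expected_collisions(T_g, T_gp):
    """Upper bound on the expected number of exploration points that a
    group g' (size T_gp) places on examples of group g (size T_g)."""
    if T_gp <= T_g:
        # at most all of g' explored: T(g') * (T(g'))^{-1/4} = (T(g'))^{3/4}
        return T_gp * T_gp ** (-0.25)
    # at most every example of g, each explored w.p. (T(g'))^{-1/4}
    return T_g * T_gp ** (-0.25)

T_g = 10_000  # size of the group under analysis
for T_gp in [10, 100, 10_000, 1_000_000]:
    # both branches stay below (T(g))^{3/4} = 1000
    assert expected_collisions(T_g, T_gp) <= T_g ** 0.75 + 1e-9
```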
