Performative Prediction

Juan C. Perdomo*   Tijana Zrnic*   Celestine Mendler-Dünner   Moritz Hardt
{jcperdomo, tijana.zrnic, mendler, hardt}@berkeley.edu
University of California, Berkeley

March 2, 2021

Abstract

When predictions support decisions they may influence the outcome they aim to predict. We call such predictions performative; the prediction influences the target. Performativity is a well-studied phenomenon in policy-making that has so far been neglected in supervised learning. When ignored, performativity surfaces as undesirable distribution shift, routinely addressed with retraining. We develop a risk minimization framework for performative prediction bringing together concepts from statistics, game theory, and causality. A conceptual novelty is an equilibrium notion we call performative stability. Performative stability implies that the predictions are calibrated not against past outcomes, but against the future outcomes that manifest from acting on the prediction. Our main results are necessary and sufficient conditions for the convergence of retraining to a performatively stable point of nearly minimal loss. In full generality, performative prediction strictly subsumes the setting known as strategic classification. We thus also give the first sufficient conditions for retraining to overcome strategic feedback effects.

1 Introduction

Supervised learning excels at pattern recognition. When used to support consequential decisions, however, predictive models can trigger actions that influence the outcome they aim to predict. We call such predictions performative; the prediction causes a change in the distribution of the target variable.

Consider a simplified example of predicting credit default risk. A bank might estimate that a loan applicant has an elevated risk of default, and will act on it by assigning a high interest rate.
In a self-fulfilling prophecy, the high interest rate further increases the customer's default risk. Put differently, the bank's predictive model is not calibrated to the outcomes that manifest from acting on the model.

Once recognized, performativity turns out to be ubiquitous. Traffic predictions influence traffic patterns, crime location prediction influences police allocations that may deter crime, recommendations shape preferences and thus consumption, stock price prediction determines trading activity and hence prices.

* Equal contribution.

When ignored, performativity can surface as a form of distribution shift. As the decision-maker acts according to a predictive model, the distribution over data points appears to change over time. In practice, the response to such distribution shifts is to frequently retrain the predictive model as more data becomes available. Retraining is often considered an undesired, yet necessary, cat-and-mouse game of chasing a moving target.

What would be desirable from the perspective of the decision maker is a certain equilibrium where the model is optimal for the distribution it induces. Such equilibria coincide with the stable points of retraining, that is, models invariant under retraining. Performativity therefore suggests a different perspective on retraining, exposing it as a natural equilibrating dynamic rather than a nuisance.

This raises fundamental questions. When do such stable points exist? How can we efficiently find them? Under what conditions does retraining converge? When do stable points also have good predictive performance? In this work, we formalize performative prediction, tying together conceptual elements from statistical decision theory, causal reasoning, and game theory. We then resolve some of the fundamental questions that performativity raises.
1.1 Our contributions

We put performativity at the center of a decision-theoretic framework that extends the classical statistical theory underlying risk minimization. The goal of risk minimization is to find a decision rule, specified by model parameters θ, that performs well on a fixed joint distribution D over covariates X and an outcome variable Y. Whenever predictions are performative, the choice of predictive model affects the observed distribution over instances Z = (X, Y). We formalize this intuitive notion by introducing a map D(·) from the set of model parameters to the space of distributions. For a given choice of parameters θ, we think of D(θ) as the distribution over features and outcomes that results from making decisions according to the model specified by θ. This mapping from predictive model to distribution is the key conceptual device of our framework.

A natural objective in performative prediction is to evaluate model parameters θ on the resulting distribution D(θ), as measured via a loss function ℓ. This results in the notion we call performative risk, defined as

    PR(θ) := E_{Z∼D(θ)} ℓ(Z; θ).

The difficulty in minimizing PR(θ) is that the distribution itself depends on the argument θ, a dependence that defeats traditional theory for risk minimization. Moreover, we generally envision that the map D(·) is unknown to the decision maker. Perhaps the most natural algorithmic heuristic in this situation is a kind of fixed-point iteration: repeatedly find a model that minimizes risk on the distribution resulting from the previous model, corresponding to the update rule

    θ_{t+1} = argmin_θ E_{Z∼D(θ_t)} ℓ(Z; θ).

We call this procedure repeated risk minimization. We also analyze its empirical counterpart that works in finite samples.
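To make the update rule concrete, here is a minimal simulation on an assumed one-dimensional distribution map of our own choosing (illustrative only, not an example from the paper): the squared loss ℓ(z; θ) = (z − θ)² with D(θ) = N(µ₀ + εθ, 1). The exact risk minimizer on samples from D(θ_t) is their mean, so each retraining step reduces to averaging freshly drawn data, and the iteration settles near µ₀/(1 − ε).

```python
import numpy as np

def rrm_toy(mu0=1.0, eps=0.3, theta0=0.0, steps=50, n=200_000, seed=0):
    """Repeated risk minimization for the squared loss l(z; theta) = (z - theta)^2
    under the assumed, illustrative distribution map D(theta) = N(mu0 + eps*theta, 1).
    The empirical risk minimizer on samples from D(theta_t) is the sample mean."""
    rng = np.random.default_rng(seed)
    theta = theta0
    for _ in range(steps):
        z = rng.normal(mu0 + eps * theta, 1.0, size=n)  # "deploy" theta, observe data
        theta = z.mean()                                # retrain: argmin of empirical risk
    return theta

theta_ps = rrm_toy()
print(theta_ps)  # close to mu0 / (1 - eps) = 1/0.7, roughly 1.43
```

Each retraining step applies the affine map θ ↦ µ₀ + εθ up to sampling noise, so the iterates converge geometrically whenever ε < 1, a preview of the Lipschitz condition formalized in Section 3.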
These procedures exemplify a family of retraining heuristics that are ubiquitous in practice for dealing with all kinds of distribution shifts, irrespective of cause.

When repeated risk minimization converges in objective value, the model has minimal loss on the distribution it entails:

    PR(θ) = min_{θ'} E_{Z∼D(θ)} ℓ(Z; θ').

We refer to this condition as performative stability, noting that it is neither implied by, nor does it imply, minimal performative risk. Our central result can be summarized informally as follows.

Theorem 1.1 (Informal). If the loss is smooth, strongly convex, and the mapping D(·) is sufficiently Lipschitz, then repeated risk minimization converges to performative stability at a linear rate. Moreover, if any one of these assumptions does not hold, repeated risk minimization can fail to converge at all.

The notion of Lipschitz continuity here refers to the Euclidean distance on model parameters and the Wasserstein distance on distributions. Informally, it requires that a small change in model parameters θ does not have an outsized effect on the induced distribution D(θ). In contrast to standard supervised learning, convexity alone is not sufficient for convergence in objective value, even if the other assumptions hold. Performative prediction therefore gives a new and interesting perspective on the importance of strong convexity.

Strong convexity has a second benefit. Not only does retraining converge to a stable point at a linear rate, this stable point also approximately minimizes the performative risk.

Theorem 1.2 (Informal). If the loss is Lipschitz and strongly convex, and the map D(·) is Lipschitz, all stable points and performative optima lie in a small neighborhood around each other.

Recall that performative stability on its own does not imply minimal performative risk.
What the previous theorem shows, however, is that strong convexity guarantees that we can approximately satisfy both.

We complement our main results with a case study in strategic classification. Strategic classification aims to anticipate a strategic response to a classifier from an individual, who can change their features prior to being classified. We observe that strategic classification is a special case of performative prediction. On the one hand, this allows us to transfer our technical results to this established setting. In particular, our results are the first to give a guarantee on repeated risk minimization in the strategic setting. On the other hand, strategic classification provides us with one concrete setting for what the mapping D(·) can be. We use this as a basis of an empirical evaluation in a semi-synthetic setting, where the initial distribution is based on a real data set, but the distribution map is modeled.

1.2 Related work

Performativity is a broad concept in the social sciences, philosophy, and economics [18, 30]. Below we focus on the relationship of our work to the most relevant technical scholarship.

Learning on non-stationary distributions. A closely related line of work considers the problem of concept drift, broadly defined as the problem of learning when the target distribution over instances drifts with time. This setting has attracted attention both in the learning theory community [2, 3, 27] and by machine learning practitioners [14]. Concept drift is a more general phenomenon than performativity in that it considers arbitrary sources of shift. However, studying the problem at this level of generality has led to a number of difficulties in creating a unified language and objective [14, 40], an issue we circumvent by assuming that the population distribution is determined by the deployed predictive model.
Importantly, this line of work also discusses the importance of retraining [14, 41]. However, it stops short of discussing the need for stability or analyzing the long-term behavior of retraining.

Strategic classification. Strategic classification recognizes that individuals often adapt to the specifics of a decision rule so as to gain an advantage (see, e.g., [8, 11, 15, 24]). Recent work in this area considers issues of incentive design [4, 25, 32, 37], control over an algorithm [10], and fairness concerns [20, 33]. Importantly, the concurrent work of Bechavod et al. [4] analyzes the implications of retraining in the context of causal discovery in linear models.

Our model of performative prediction includes all notions of strategic adaptation that we are aware of as a special case. Unlike many works in this area, our results do not depend on a specific cost function for changing individual features. Rather, we rely on an assumption about the sensitivity of the data-generating distribution to changes in the model parameters.

Recently, there has been increased interest within the algorithmic fairness community in classification dynamics. See, for example, Liu et al. [28], Hu and Chen [19], and Hashimoto et al. [17]. The latter work considers repeated risk minimization, but from the perspective of what it does to a measure of disparity between groups.

Causal inference. The reader familiar with causality can think of D(θ) as the interventional distribution over instances Z resulting from a do-intervention that sets the model parameters to θ in some underlying causal graph. Importantly, this mapping D(·) remains fixed and does not change over time or by intervention: deploying the same model at two different points in time must induce the same distribution over observations Z.
While causal inference focuses on estimating properties of interventional distributions such as treatment effects [21, 35], our focus is on a new stability notion and iterative retraining procedures for finding stable points.

Convex optimization. The two solution concepts we introduce generalize the usual notion of optimality in (empirical) risk minimization to our new framework of performativity. Similarly, we extend the classical property of gradient descent acting as a contraction under smooth and strongly convex losses to account for distribution shifts due to performativity. Finally, we discuss how different regularity assumptions on the loss function affect convergence of retraining schemes, much like optimization works discuss these assumptions in the context of convergence of iterative optimization algorithms.

Reinforcement learning. In general, any instance of performative prediction can be reframed as a reinforcement learning or contextual bandit problem. Yet, by studying performative prediction problems within such a broad framework, we lose many of the intricacies of performativity which make the problem interesting and tractable to analyze. We return to discuss some of the connections between both frameworks later on.

2 Framework and main definitions

In this section, we formally introduce the principal solution concepts of our framework: performative optimality and performative stability.

Throughout our presentation, we focus on predictive models that are parametrized by a vector θ ∈ Θ, where the parameter space Θ ⊆ R^d is a closed, convex set. We use capital letters to denote random variables and their lowercase counterparts to denote realizations of these variables. We consider instances z = (x, y) defined as feature, outcome pairs, where x ∈ R^{m−1} and y ∈ R.
Whenever we define a variable θ* = argmin_θ g(θ) as the minimizer of a function g, we resolve the issue of the minimizer not being unique by setting θ* to an arbitrary point in the set argmin_θ g(θ).

2.1 Performative optimality

In supervised learning, the goal is to learn a predictive model f_θ which minimizes the expected loss with respect to feature, outcome pairs (x, y) drawn i.i.d. from a fixed distribution D. The optimal model f_{θ_SL} solves the following optimization problem:

    θ_SL = argmin_{θ∈Θ} E_{Z∼D} ℓ(Z; θ),

where ℓ(z; θ) denotes the loss of f_θ at a point z.

We contrast this with the performative optimum. As introduced previously, in settings where predictions support decisions, the manifested distribution over features and outcomes is in part determined by the deployed model. Instead of considering a fixed distribution D, each model f_θ induces a potentially different distribution D(θ) over instances z. A predictive model must therefore be evaluated with regard to the expected loss over the distribution D(θ) it induces: its performative risk.

Definition 2.1 (performative optimality and risk). A model f_{θ_PO} is performatively optimal if the following relationship holds:

    θ_PO = argmin_θ E_{Z∼D(θ)} ℓ(Z; θ).

We define PR(θ) := E_{Z∼D(θ)} ℓ(Z; θ) as the performative risk; then, θ_PO = argmin_θ PR(θ).

The following example illustrates the differences between the traditional notion of optimality in supervised learning and performative optima.

Example 2.2 (biased coin flip). Consider the task of predicting the outcome of a biased coin flip where the bias of the coin depends on a feature X and the assigned score f_θ(X). In particular, define D(θ) in the following way. X is a 1-dimensional feature supported on {±1} and

    Y | X ∼ Bernoulli(1/2 + µX + εθX),

with µ ∈ (0, 1/2) and ε < 1/2 − µ.
Assume that the class of predictors consists of linear models of the form f_θ(x) = θx + 1/2 and that the objective is to minimize the squared loss ℓ(z; θ) = (y − f_θ(x))².

The parameter ε represents the performative aspect of the model. If ε = 0, outcomes are independent of the assigned scores and the problem reduces to a standard supervised learning task where the optimal predictive model is the conditional expectation f_{θ_SL}(x) = E[Y | X = x] = 1/2 + µx, with θ_SL = µ.

In the performative setting with ε ≠ 0, the optimal model θ_PO balances its predictive accuracy against the bias induced by the prediction itself. In particular, a direct calculation demonstrates that

    θ_PO = argmin_{θ∈[0,1]} E_{Z∼D(θ)} (Y − θX − 1/2)²   ⟺   θ_PO = µ / (1 − 2ε).

Hence, the performative optimum and the supervised learning solution are equal if ε = 0 and diverge as the performativity strength ε increases.

2.2 Performative stability

A natural, desirable property of a model f_θ is that, given that we use the predictions of f_θ as a basis for decisions, those predictions are also simultaneously optimal for the distribution that the model induces. We introduce the notion of performative stability to refer to predictive models that satisfy this property.

Definition 2.3 (performative stability and decoupled risk). A model f_{θ_PS} is performatively stable if the following relationship holds:

    θ_PS = argmin_θ E_{Z∼D(θ_PS)} ℓ(Z; θ).

We define DPR(θ, θ') := E_{Z∼D(θ)} ℓ(Z; θ') as the decoupled performative risk; then, θ_PS = argmin_θ DPR(θ_PS, θ).

A performatively stable model f_{θ_PS} minimizes the expected loss on the distribution D(θ_PS) resulting from deploying f_{θ_PS} in the first place. Therefore, a model that is performatively stable eliminates the need for retraining after deployment, since any retraining procedure would simply return the same model parameters.
Performatively stable models are fixed points of risk minimization. We further develop this idea in the next section.

Observe that performative optimality and performative stability are in general two distinct solution concepts. Performatively optimal models need not be performatively stable, and performatively stable models need not be performatively optimal. We illustrate this point in the context of our previous biased coin toss example.

Example 2.2 (continued). Consider again our model of a biased coin toss. In order for a predictive model f_θ to be performatively stable, it must satisfy the following relationship:

    θ_PS = argmin_{θ∈[0,1]} E_{Z∼D(θ_PS)} (Y − θX − 1/2)²   ⟺   θ_PS = µ / (1 − ε).

Solving for θ_PS directly, we see that there is a unique performatively stable point. Therefore, performative stability and performative optimality need not coincide. In fact, in this example they coincide if and only if ε = 0.

Note that, in general, if the map D(θ) is constant across θ, performative optima must coincide with performatively stable solutions. Furthermore, both coincide with "static" supervised learning solutions as well.

For ease of presentation, we refer to a choice of parameters θ as performatively stable (optimal) if the model parametrized by θ, namely f_θ, is performatively stable (optimal). We will occasionally also refer to performative stability as simply stability.

Remark 2.4. Notice that both performative stability and optimality can be expressed via the decoupled performative risk as follows:

    θ_PS is performatively stable ⟺ θ_PS = argmin_θ DPR(θ_PS, θ),
    θ_PO is performatively optimal ⟺ θ_PO = argmin_θ DPR(θ, θ).
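Both closed forms from Example 2.2 can be checked numerically. The sketch below uses our own derivation of the example's performative risk, PR(θ) = 1/4 − (µ + εθ)² + (µ + (ε − 1)θ)² (Bernoulli variance plus squared bias, using X² = 1), and the fact that the risk minimizer on D(θ) works out to µ + εθ; neither expression is displayed in the paper, so treat them as an illustrative reconstruction.

```python
import numpy as np

mu, eps = 0.3, 0.1  # example parameters satisfying mu in (0, 1/2) and eps < 1/2 - mu

def PR(theta):
    """Performative risk of Example 2.2 in closed form (our derivation):
    E_{D(theta)}[(Y - theta*X - 1/2)^2] = 1/4 - (mu + eps*theta)^2
                                          + (mu + (eps - 1)*theta)^2."""
    return 0.25 - (mu + eps * theta) ** 2 + (mu + (eps - 1) * theta) ** 2

grid = np.linspace(0.0, 1.0, 100_001)
theta_po = grid[np.argmin(PR(grid))]
print(theta_po)  # grid minimizer, close to mu / (1 - 2*eps) = 0.375

# Retraining map: the risk minimizer on D(theta) is G(theta) = mu + eps*theta,
# so repeated risk minimization converges to theta_PS = mu / (1 - eps).
theta = 0.0
for _ in range(200):
    theta = mu + eps * theta
print(theta)     # close to mu / (1 - eps) = 1/3
```

The two limits differ, matching the discussion above: the grid minimizer of PR sits at µ/(1 − 2ε) while the retraining fixed point sits at µ/(1 − ε).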
3 When retraining converges to stable points

Having introduced our framework for performative prediction, we now address some of the basic questions that arise in this setting and examine the behavior of common machine learning practices, such as retraining, through the lens of performativity.

As discussed previously, performatively stable models have the favorable property that they achieve minimal risk for the distribution they induce and hence eliminate the need for retraining. However, it is a priori not clear that such stable points exist; and even if they do exist, whether we can find them efficiently. Furthermore, seeing as performative optimality and stability are in general distinct solution concepts, under what conditions can we find models that approximately satisfy both?

In this work, we begin to answer these questions by analyzing two different optimization strategies. The first is retraining, formally referred to as repeated risk minimization (RRM), where the exact minimizer is repeatedly computed on the distribution induced by the previous model parameters. The second is repeated gradient descent (RGD), in which the model parameters are incrementally updated using a single gradient descent step on the objective defined by the previous iterate. We introduce RGD as a computationally efficient approximation of RRM which, as we show, adopts many favorable properties of RRM.

Our algorithmic analysis of these methods reveals the existence of stable points under the assumption that the distribution map D(·) is sufficiently Lipschitz. We identify necessary and sufficient conditions for convergence to a performatively stable point and establish properties of the objective under which stable points and performative optima are close. We begin by analyzing the behavior of these procedures when they operate at a population level, and then extend our analysis to finite samples.
3.1 Assumptions

It is easy to see that one cannot make any guarantees on the convergence of retraining or the existence of stable points without making some regularity assumptions on D(·). One reasonable way to quantify the regularity of D(·) is to assume Lipschitz continuity; the Lipschitz constant determines how sensitive the induced distribution is to a change in model parameters. Intuitively, such an assumption captures the idea that, if decisions are made according to similar predictive models, then the resulting distributions over instances should also be similar. We now introduce this key assumption of our work, which we call ε-sensitivity.

Definition 3.1 (ε-sensitivity). We say that a distribution map D(·) is ε-sensitive if for all θ, θ' ∈ Θ:

    W₁(D(θ), D(θ')) ≤ ε ‖θ − θ'‖₂,

where W₁ denotes the Wasserstein-1 distance, or earth mover's distance.

The earth mover's distance is a natural notion of distance between probability distributions that provides access to a rich technical repertoire [38, 39]. Furthermore, we can verify that the assumption is satisfied in various settings.

Remark 3.2. A simple example where this assumption is satisfied is a Gaussian family. Given θ = (µ, σ₁, …, σ_p) ∈ R^{2p}, define D(θ) = N(ε₁µ, ε₂² diag(σ₁², …, σ_p²)), where ε₁, ε₂ ∈ R. Then D(·) is ε-sensitive for ε = max{|ε₁|, |ε₂|}.

In addition to this assumption on the distribution map, we will often make standard assumptions on the loss function ℓ(z; θ) which hold for broad classes of losses. To simplify our presentation, let Z := ∪_{θ∈Θ} supp(D(θ)).
• (joint smoothness) We say that a loss function ℓ(z; θ) is β-jointly smooth if the gradient ∇_θ ℓ(z; θ) is β-Lipschitz in θ and z, that is,

    ‖∇_θ ℓ(z; θ) − ∇_θ ℓ(z; θ')‖₂ ≤ β ‖θ − θ'‖₂,
    ‖∇_θ ℓ(z; θ) − ∇_θ ℓ(z'; θ)‖₂ ≤ β ‖z − z'‖₂,          (A1)

for all θ, θ' ∈ Θ and z, z' ∈ Z.

• (strong convexity) We say that a loss function ℓ(z; θ) is γ-strongly convex if

    ℓ(z; θ) ≥ ℓ(z; θ') + ∇_θ ℓ(z; θ')ᵀ (θ − θ') + (γ/2) ‖θ − θ'‖₂²,          (A2)

for all θ, θ' ∈ Θ and z ∈ Z. If γ = 0, this assumption is equivalent to convexity.

We will sometimes refer to β/γ, where β is as in (A1) and γ as in (A2), as the condition number.

3.2 Repeated risk minimization

We now formally define repeated risk minimization and prove one of our main results: sufficient and necessary conditions for retraining to converge to a performatively stable point.

Definition 3.3 (RRM). Repeated risk minimization (RRM) refers to the procedure where, starting from an initial model f_{θ₀}, we perform the following sequence of updates for every t ≥ 0:

    θ_{t+1} = G(θ_t) := argmin_{θ∈Θ} E_{Z∼D(θ_t)} ℓ(Z; θ).

Using a toy example, we again argue that restrictions on the map D(·) are necessary to enable interesting analyses of RRM; otherwise it might be computationally infeasible to find performative optima, and performatively stable points might not even exist.

Example 3.4. Consider optimizing the squared loss ℓ(z; θ) = (y − θ)², where θ ∈ [0, 1] and the distribution of the outcome Y, according to D(θ), is a point mass at 0 if θ ≥ 1/2 and a point mass at 1 if θ < 1/2. Clearly there is no performatively stable point, and RRM will simply result in the alternating sequence 1, 0, 1, 0, …. The performative optimum in this case is θ_PO = 1/2.
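Example 3.4 can be replayed in a few lines. The squared-loss minimizer of E(Y − θ)² over θ ∈ [0, 1] is the mean of Y, which is 0 or 1 depending on which side of 1/2 the previous iterate fell, so the retraining map has no fixed point:

```python
def G(theta):
    """RRM update for Example 3.4: the squared-loss minimizer on D(theta) is E[Y],
    which is a point mass at 0 if theta >= 1/2 and at 1 otherwise."""
    return 0.0 if theta >= 0.5 else 1.0

theta = 1.0
iterates = []
for _ in range(6):
    theta = G(theta)
    iterates.append(theta)
print(iterates)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]; RRM never converges
```

Note that this distribution map is not ε-sensitive for any finite ε, since an arbitrarily small parameter change across θ = 1/2 moves the point mass a full unit.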
To show convergence of retraining schemes, it is hence necessary to make a regularity assumption on D(·), such as ε-sensitivity. We are now ready to state our main result regarding the convergence of repeated risk minimization.

Theorem 3.5. Suppose that the loss ℓ(z; θ) is β-jointly smooth (A1) and γ-strongly convex (A2). If the distribution map D(·) is ε-sensitive, then the following statements are true:

(a) ‖G(θ) − G(θ')‖₂ ≤ ε (β/γ) ‖θ − θ'‖₂, for all θ, θ' ∈ Θ.

(b) If ε < γ/β, the iterates θ_t of RRM converge to a unique performatively stable point θ_PS at a linear rate: ‖θ_t − θ_PS‖₂ ≤ δ for

    t ≥ (1 − εβ/γ)^{−1} log( ‖θ₀ − θ_PS‖₂ / δ ).

The main message of this theorem is that in performative prediction, if the loss function is sufficiently "nice" and the distribution map is sufficiently (in)sensitive, then one need only retrain a model a small number of times before it converges to a unique stable point. The complete proof of Theorem 3.5 can be found in Appendix D.1. Here, we provide the main intuition through a proof sketch.

Proof sketch. Fix θ, θ' ∈ Θ. Let f(φ) = E_{Z∼D(θ)} ℓ(Z; φ) and f'(φ) = E_{Z∼D(θ')} ℓ(Z; φ). By applying standard properties of strong convexity and the fact that G(θ) is the unique minimizer of f(φ), we can derive that

    −γ ‖G(θ) − G(θ')‖₂² ≥ (G(θ) − G(θ'))ᵀ ∇f(G(θ')).

Next, we observe that (G(θ) − G(θ'))ᵀ ∇_θ ℓ(z; G(θ')) is (β ‖G(θ) − G(θ')‖₂)-Lipschitz in z. This follows from applying the Cauchy-Schwarz inequality and the fact that the loss is β-jointly smooth. Using the dual formulation of the earth mover's distance (Lemma C.3) and ε-sensitivity of D(·), as well as the first-order conditions of optimality for convex functions, a short calculation reveals that

    (G(θ) − G(θ'))ᵀ ∇f(G(θ')) ≥ −εβ ‖G(θ) − G(θ')‖₂ ‖θ − θ'‖₂.
Claim (a) then follows by combining the previous two inequalities and rearranging. Intuitively, strong convexity forces the iterates to contract after retraining, yet this contraction is offset by the distribution shift induced by changing the underlying model parameters. Joint smoothness and ε-sensitivity ensure that this shift is not too large. Part (b) is essentially a consequence of applying the Banach fixed-point theorem to the result of part (a). ∎

One intriguing insight from our analysis is that this convergence result is in fact tight; removing any single assumption required for convergence by Theorem 3.5 is enough to construct a counterexample for which RRM diverges.

Proposition 3.6. Suppose that the distribution map D(·) is ε-sensitive with ε > 0. RRM can fail to converge at all in any of the following cases, for any choice of parameters β, γ > 0:

(a) The loss is β-jointly smooth and convex, but not strongly convex.

(b) The loss is γ-strongly convex, but not jointly smooth.

(c) The loss is β-jointly smooth and γ-strongly convex, but ε > γ/β.

We include a counterexample for statement (a), and defer the proofs of (b) and (c) to Appendix D.2.

Proof of Proposition 3.6 (a). Consider the linear loss defined as ℓ((x, y); θ) = βyθ, for θ ∈ [−1, 1]. Note that this objective is β-jointly smooth and convex, but not strongly convex. Let the distribution of Y according to D(θ) be a point mass at εθ, and let the distribution of X be invariant with respect to θ. Clearly, this distribution map is ε-sensitive. Here, the decoupled performative risk has the form DPR(θ, φ) = εβθφ. The unique performatively stable point is 0. However, if we initialize RRM at any point other than 0, the procedure generates the sequence of iterates …, 1, −1, 1, −1, …, thus failing to converge. Furthermore, this behavior holds for all ε, β > 0. ∎
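The counterexample in the proof of (a) is easy to simulate: on D(θ), the linear objective E[βYφ] = εβθφ over φ ∈ [−1, 1] is minimized at an endpoint with the opposite sign of θ, so the iterates flip sign forever. The snippet below is an illustrative sketch of this dynamic.

```python
import numpy as np

eps, beta = 0.5, 1.0  # any eps, beta > 0 exhibit the same oscillation

def G(theta):
    """RRM update for the linear loss l((x, y); phi) = beta*y*phi with Y a point
    mass at eps*theta: minimize eps*beta*theta*phi over phi in [-1, 1]. A linear
    function on an interval is minimized at an endpoint."""
    endpoints = np.array([-1.0, 1.0])
    return endpoints[np.argmin(eps * beta * theta * endpoints)]

theta, iterates = 0.3, []
for _ in range(6):
    theta = G(theta)
    iterates.append(theta)
print(iterates)  # alternates between -1.0 and 1.0 from any nonzero start
```

The loss is as smooth as one could wish and the sensitivity ε can be made arbitrarily small, yet without strong convexity the retraining map is not a contraction.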
Proposition 3.6 suggests a fundamental difference between strong and weak convexity in our framing of performative prediction (weak meaning γ = 0). In supervised learning, using strongly convex losses generally guarantees a faster rate of optimization, yet asymptotically, the solution achieved with either strongly or weakly convex losses is globally optimal. However, in our framework, strong convexity is in fact necessary to guarantee convergence of repeated risk minimization, even for arbitrarily smooth losses and an arbitrarily small sensitivity parameter.

3.3 Repeated gradient descent

Theorem 3.5 demonstrates that repeated risk minimization converges to a unique performatively stable point if the sensitivity parameter ε is small enough. However, implementing RRM requires access to an exact optimization oracle. We now relax this requirement and demonstrate how a simple gradient descent algorithm also converges to a unique stable point.

Definition 3.7 (RGD). Repeated gradient descent (RGD) is the procedure where, starting from an initial model f_{θ₀}, we perform the following sequence of updates for every t ≥ 0:

    θ_{t+1} = G_gd(θ_t) := Π_Θ( θ_t − η E_{Z∼D(θ_t)} ∇_θ ℓ(Z; θ_t) ),

where η > 0 is a fixed step size and Π_Θ denotes the Euclidean projection operator onto Θ.

Note that repeated gradient descent only requires the loss ℓ to be differentiable with respect to θ. It does not require taking gradients of the performative risk. As with RRM, we can show that RGD is a contractive mapping for a small enough sensitivity parameter ε.

Theorem 3.8. Suppose that the loss ℓ(z; θ) is β-jointly smooth (A1) and γ-strongly convex (A2). If the distribution map D(·) is ε-sensitive with ε < γ / ((β + γ)(1 + 1.5ηβ)), then RGD with step size η ≤ 2/(β + γ) satisfies the following:

(a) ‖G_gd(θ) − G_gd(θ')‖₂ ≤ ( 1 − η( βγ/(β + γ) − ε(1.5ηβ² + β) ) ) ‖θ − θ'‖₂ < ‖θ − θ'‖₂.
(b) The iterates θ_t of RGD converge to a unique performatively stable point θ_PS at a linear rate: ‖θ_t − θ_PS‖₂ ≤ δ for

    t ≥ (1/η) ( βγ/(β + γ) − ε(1.5ηβ² + β) )^{−1} log( ‖θ₀ − θ_PS‖₂ / δ ).

The conclusion of Theorem 3.8 is a strict generalization of a classical optimization result which considers a static objective, in which case the rate of contraction is (1 − ηβγ/(β + γ)) (see, for example, Theorem 2.1.15 in [34] or Lemma 3.7 in [16]). Our rate exactly matches this classical result in the case that ε = 0. The proof of Theorem 3.8 can be found in Appendix D.3.

3.4 Finite-sample analysis

We now extend our main results regarding the convergence of RRM and RGD to the finite-sample regime. To do so, we leverage the fact that, under mild regularity conditions, the empirical distribution D_n given by n samples drawn i.i.d. from a true distribution D is with high probability close to D in earth mover's distance [13]. We begin by defining the finite-sample versions of these procedures.

Definition 3.9 (RERM and REGD). Define repeated empirical risk minimization (RERM) to be the procedure where, starting from a model f_{θ₀}, at every iteration t ≥ 0 we collect n_t samples from D(θ_t) and perform the update:

    θ_{t+1} = G^{n_t}(θ_t) := argmin_θ E_{Z∼D_{n_t}(θ_t)} ℓ(Z; θ).

Similarly, define repeated empirical gradient descent (REGD) to be the optimization procedure with update rule:

    θ_{t+1} = G^{n_t}_gd(θ_t) := Π_Θ( θ_t − η E_{Z∼D_{n_t}(θ_t)} ∇_θ ℓ(Z; θ_t) ).

Here, η > 0 is a step size and Π_Θ denotes the Euclidean projection operator onto Θ.

The following theorem illustrates that, with enough samples collected at every iteration, with high probability both algorithms converge to a small neighborhood around a stable point. Recall that m is the dimension of data samples z.

Theorem 3.10.
Suppose that the loss ℓ(z; θ) is β-jointly smooth (A1) and γ-strongly convex (A2), and that there exist α > 1, μ > 0 such that ξ_{α,μ} := ∫_{R^m} e^{μ|x|^α} dD(θ) is finite for all θ ∈ Θ. Let δ ∈ (0, 1) be a radius of convergence. Consider running RERM or REGD with n_t = O( (1/(εδ))^m log(t/p) ) samples at time t.

(a) If D(·) is ε-sensitive with ε < γ/(2β), then with probability 1 − p, RERM satisfies ‖θ_t − θ_PS‖_2 ≤ δ for all

    t ≥ log( ‖θ_0 − θ_PS‖_2 / δ ) / ( 1 − 2εβ/γ ).

(b) If D(·) is ε-sensitive with ε < γ / ((β + γ)(1 + 1.5ηβ)), then with probability 1 − p, REGD satisfies ‖θ_t − θ_PS‖_2 ≤ δ for all

    t ≥ log( ‖θ_0 − θ_PS‖_2 / δ ) / ( η( βγ/(β + γ) − ε(3ηβ² + 2β) ) ),

for a constant choice of step size η ≤ 2/(β + γ).

Proof sketch. The basic idea behind these results is the following. While ‖θ_t − θ_PS‖_2 > δ, the sample size n_t is sufficiently large to ensure behavior similar to that at the population level: as in Theorems 3.5 and 3.8, the iterates θ_t contract toward θ_PS. This implies that θ_t eventually enters a δ-ball around θ_PS, for some large enough t. Once this happens, contrary to the population-level results, a contraction is no longer guaranteed due to the noise inherent in observing only finite-sample approximations of D(θ_t). Nevertheless, the sample size n_t is sufficiently large to ensure that θ_t cannot escape a δ-ball around θ_PS either. □

4 Relating performative optimality and stability

As we discussed previously, while performative optima are always guaranteed to exist,¹ it is not clear whether performatively stable points exist in all settings. Our algorithmic analysis of repeated risk minimization and repeated gradient descent revealed the existence of unique stable points under the assumption that the objective is strongly convex and smooth.

¹ In particular, they are guaranteed to exist over the extended real line, i.e., we allow θ ∈ (R ∪ {±∞})^d.
The first result of this section illustrates the existence of stable points under weaker assumptions on the loss, in the case where the solution space Θ is constrained. All proofs can be found in Appendix D.

Proposition 4.1. Let the distribution map D(·) be ε-sensitive and Θ ⊂ R^d be compact. If the loss ℓ(z; θ) is convex and jointly continuous in (z, θ), then there exists a performatively stable point.

A natural question to consider at this point is whether there are procedures analogous to RRM and RGD for efficiently computing performative optima. Our analysis suggests that directly minimizing the performative risk is in general a more challenging problem than finding performatively stable points. In particular, we can construct simple examples where the performative risk PR(θ) is non-convex, despite strong regularity assumptions on the loss and the distribution map.

Proposition 4.2. The performative risk PR(θ) can be concave in θ, even if the loss ℓ(z; θ) is β-jointly smooth (A1), γ-strongly convex (A2), and the distribution map D(·) is ε-sensitive with ε < γ/β.

However, we can show that there are cases where finding performatively stable points is sufficient to guarantee that the resulting model has low performative risk. In particular, our next result demonstrates that if the loss function ℓ(z; θ) is Lipschitz in z and γ-strongly convex, then all performatively stable points and performative optima lie in a small neighborhood around each other. Moreover, the theorem holds even when performative optima and performatively stable points are not unique.

Theorem 4.3. Suppose that the loss ℓ(z; θ) is L_z-Lipschitz in z, γ-strongly convex (A2), and that the distribution map D(·) is ε-sensitive. Then, for every performatively stable point θ_PS and every performative optimum θ_PO:

    ‖θ_PO − θ_PS‖_2 ≤ 2 L_z ε / γ.
This result shows that in cases where repeated risk minimization converges to a stable point, the resulting model approximately minimizes the performative risk. Moreover, Theorem 4.3 suggests a way of converging close to performative optima in objective value even if the loss function is smooth and convex, but not strongly convex. In particular, by adding quadratic regularization to the objective, we can ensure that RRM or RGD converge to a stable point that approximately minimizes the performative risk; see Appendix E.

5 A case study in strategic classification

Having presented our model for performative prediction, we now proceed to illustrate how these ideas can be applied within the context of strategic classification and discuss some of the implications of our theorems for this setting. We begin by formally establishing how strategic classification can be cast as a performative prediction problem and illustrate how our framework can be used to derive results regarding the convergence of popular retraining heuristics in strategic classification settings. Afterwards, we further develop the connections between both fields by empirically evaluating the behavior of repeated risk minimization on a dynamic credit scoring task.

Input: base distribution D, classifier f_θ, cost function c, and utility function u
Sampling procedure for D(θ):
1. Sample (x, y) ∼ D
2. Compute best response x_BR ← argmax_{x'} u(x', θ) − c(x', x)
3. Output sample (x_BR, y)

Figure 1: Distribution map for strategic classification.

5.1 Stackelberg equilibria are performative optima

Strategic classification is a two-player game between an institution which deploys a classifier and agents who selectively adapt their features in order to improve their outcomes. A classic example of this setting is that of a bank which uses a machine learning classifier to predict whether or not a loan applicant is creditworthy.
Individual applicants react to the bank's classifier by manipulating their features in the hope of inducing a favorable classification. This game is said to have a Stackelberg structure since agents adapt their features only after the bank has deployed its classifier.

The optimal strategy for the institution in a strategic classification setting is to deploy the solution corresponding to the Stackelberg equilibrium, defined as the classifier f_θ which achieves minimal loss over the induced distribution D(θ) in which agents have strategically adapted their features in response to f_θ. In fact, we see that this equilibrium notion exactly matches our definition of performative optimality:

    f_{θ_SE} is a Stackelberg equilibrium  ⟺  θ_SE ∈ argmin_θ PR(θ).

We think of D as a "baseline" distribution over feature-outcome pairs before any classifier deployment, and D(θ) denotes the distribution over features and outcomes obtained by strategically manipulating D. As described in existing work [8, 15, 33], the distribution function D(θ) in strategic classification corresponds to the data-generating process outlined in Figure 1. Here, u and c are problem-specific functions which determine the best response for agents in the game. Together with the base distribution D, these define the relevant distribution map D(·) for the problem of strategic classification.

A strategy that is commonly adopted in practice as a means of coping with the distribution shift that arises in strategic classification is to repeatedly retrain classifiers on the induced distributions. This procedure corresponds to the repeated risk minimization procedure introduced in Definition 3.3. Our results describe the first set of sufficient conditions under which repeated retraining overcomes strategic effects.

Corollary 5.1.
Let the institution's loss ℓ(z; θ) be L_z- and L_θ-Lipschitz in z and θ respectively, β-jointly smooth (A1), and γ-strongly convex (A2). If the induced distribution map is ε-sensitive with ε < γ/β, then RRM converges at a linear rate to a performatively stable classifier θ_PS that is 2 L_z ε (L_θ + L_z ε) γ^{-1} close in objective value to the Stackelberg equilibrium.

5.2 Simulations

We next examine the convergence of repeated risk minimization and repeated gradient descent in a simulated strategic classification setting. We run experiments on a dynamic credit scoring simulator in which an institution classifies the creditworthiness of loan applicants.²

Figure 2: Convergence in domain of RRM (left) and RGD (right) for varying ε-sensitivity parameters, ε ∈ {0.01, 1, 100, 1000}; both panels plot the normalized distance c · ‖θ_{t+1} − θ_t‖_2 against the iteration t on a logarithmic scale. We add a marker if at the next iteration the distance between iterates is numerically zero. We normalize the distance by c = ‖θ_{0,S}‖_2^{-1}.

As motivated previously, agents react to the institution's classifier by manipulating their features to increase the likelihood that they receive a favorable classification. To run our simulations, we construct a distribution map D(θ), as described in Figure 1. For the base distribution D, we use a class-balanced subset of a Kaggle credit scoring dataset [22]. Features x ∈ R^{m−1} correspond to historical information about an individual, such as their monthly income and number of credit lines. Outcomes y ∈ {0, 1} are binary variables equal to 1 if the individual defaulted on a loan and 0 otherwise. The institution makes predictions using a logistic regression classifier.
We assume that individuals have linear utilities u(θ, x) = −⟨θ, x⟩ and quadratic costs c(x', x) = (1/(2ε)) ‖x' − x‖_2², where ε is a positive constant that regulates the cost incurred by changing features. Linear utilities indicate that agents wish to minimize their assigned probability of default. We divide the set of features into strategic features S ⊆ [m − 1], such as the number of open credit lines, and non-strategic features (e.g., age). Solving the optimization problem described in Figure 1, the best response for an individual corresponds to the following update:

    x'_S = x_S − ε θ_S,  where x_S, x'_S, θ_S ∈ R^{|S|}.

As per convention in the literature [8, 15, 33], individual outcomes y are unaffected by strategic manipulation.

Intuitively, this data-generating process is ε-sensitive since, for a given choice of classifiers f_θ and f_{θ'}, an individual feature vector is shifted to x_S − εθ_S and to x_S − εθ'_S, respectively. The distance between these two shifted points is equal to ε‖θ_S − θ'_S‖_2. Since the optimal transport distance is bounded by ε‖θ − θ'‖_2 for every individual point, it is also bounded by this quantity over the entire distribution. A full proof of this claim is presented in Appendix B.2.

For our experiments, instead of sampling from D(θ), we treat the points in the original dataset as the true distribution. Hence, we can think of all the following procedures as operating at the population level. Furthermore, we add a regularization term to the logistic loss to ensure that the objective is strongly convex.

² Code is available at https://github.com/zykls/performative-prediction, and the simulator has been integrated into the WhyNot software package [31].
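The retraining dynamic above can be sketched in a few lines of code. The following is a minimal illustration, not the paper's released implementation: it substitutes a small synthetic Gaussian base distribution for the Kaggle dataset, and the function names (`best_response`, `fit_logistic`, `rrm`) and all hyperparameter values are ours. It applies the closed-form best response x'_S = x_S − εθ_S and then refits a regularized logistic regression, exactly the RRM loop of Definition 3.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_response(X, theta, eps, strategic):
    # Agents maximize u(theta, x') - c(x', x) = -<theta, x'> - ||x' - x||^2 / (2 eps);
    # the closed-form solution shifts each strategic feature by -eps * theta_S.
    X_br = X.copy()
    X_br[:, strategic] -= eps * theta[strategic]
    return X_br

def fit_logistic(X, y, gamma, steps=3000, lr=0.1):
    # Approximately minimize the l2-regularized logistic loss by gradient descent.
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta -= lr * (X.T @ (p - y) / len(y) + gamma * theta)
    return theta

def rrm(X, y, eps, strategic, gamma=0.1, iters=20):
    # Repeated risk minimization: retrain on the distribution induced
    # by the previously deployed classifier.
    theta = fit_logistic(X, y, gamma)
    gap = np.inf
    for _ in range(iters):
        theta_next = fit_logistic(best_response(X, theta, eps, strategic), y, gamma)
        gap = np.linalg.norm(theta_next - theta)
        theta = theta_next
    return theta, gap

# Synthetic base distribution (two Gaussian classes); first two features are strategic.
n = 500
X = np.vstack([rng.normal(-1.0, 1.0, (n, 3)), rng.normal(1.0, 1.0, (n, 3))])
y = np.concatenate([np.zeros(n), np.ones(n)])

theta, gap = rrm(X, y, eps=0.01, strategic=[0, 1])
print(gap)  # for small eps, successive iterates nearly coincide
```

For this choice of γ the ratio εβ/γ is well below 1 when ε = 0.01, so the iterates contract as in Theorem 3.5; raising ε by a few orders of magnitude reproduces the divergent regime of Figure 2.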
Figure 3: Performative risk (left) and accuracy (right) of the classifier θ_t at different stages of RRM for ε = 80. Blue lines indicate the optimization phase and green lines indicate the effect of the distribution shift after the classifier deployment.

Repeated risk minimization. The first experiment we consider is the convergence of RRM. From our theoretical analysis, we know that RRM is guaranteed to converge at a linear rate to a performatively stable point if the sensitivity parameter ε is smaller than γ/β. In Figure 2 (left), we see that RRM does indeed converge in only a few iterations for small values of ε, while it diverges if ε is too large. The evolution of the performative risk during the RRM optimization is illustrated in Figure 3. We evaluate PR(θ) at the beginning and at the end of each optimization round and indicate the effect due to distribution shift with a dashed green line. We also verify that the surrogate loss is a good proxy for classification accuracy in the performative setting.

Repeated gradient descent. In the case of RGD, we find behavior similar to that of RRM. While the iterates again converge linearly, they naturally do so at a slower rate than in the exact minimization setting, given that each iteration consists of only a single gradient step. Again, we see in Figure 2 that the iterates converge for small values of ε and diverge for large values.

6 Discussion and Future Work

Our work draws attention to the fundamental problem of performativity in statistical learning and decision-making. Performative prediction enjoys a clean formal setup that we introduced, drawing on elements from causality and game theory. Retraining is often considered a nuisance intended to cope with distribution shift.
In contrast, our work interprets retraining as the natural equilibrating dynamic for performative prediction. The fixed points of retraining are performatively stable points. Moreover, retraining converges to such stable points under natural assumptions, including strong convexity of the loss function. It is interesting to note that (weak) convexity alone is not enough. Performativity thus gives another intriguing perspective on why strong convexity is desirable in supervised learning.

Several interesting questions remain. For example, by letting the step size of repeated gradient descent tend to 0, we see that this procedure converges for ε < γ/(β + γ). Exact repeated risk minimization, on the other hand, provably converges for every ε < γ/β, and we showed that this inequality is tight. It would be interesting to understand whether this gap is a fundamental difference between the two procedures or an artifact of our analysis.

Lastly, we believe that the tools and ideas from performative prediction can be used to make progress in other subareas of machine learning. For example, in this paper we have illustrated how reframing strategic classification as a performative prediction problem leads to a new understanding of when retraining overcomes strategic effects. However, we view this example as only scratching the surface of work connecting performative prediction with other fields. In particular, reinforcement learning can be thought of as a case of performative prediction. In this setting, the choice of policy f_θ affects the distribution D(θ) over z = {(s_h, a_h)}_{h=1}^∞, the set of visited states s and actions a in a Markov decision process. Building off this connection, we can reinterpret repeated risk minimization as a form of off-policy learning in which an agent first collects a batch of data under a particular policy f_θ, and then finds the optimal policy for that trajectory offline.
We believe that some of the ideas developed in the context of performative prediction can shed new light on when these off-policy methods can converge.

Acknowledgements

We wish to acknowledge support from the U.S. National Science Foundation Graduate Research Fellowship Program and the Swiss National Science Foundation Early Postdoc Mobility Fellowship Program.

References

[1] C.D. Aliprantis and K.C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer Berlin Heidelberg, 2006.
[2] Peter L. Bartlett. Learning with a Slowly Changing Distribution. In Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory (COLT), pages 243–252, 1992.
[3] Peter L. Bartlett, Shai Ben-David, and Sanjeev R. Kulkarni. Learning Changing Concepts by Exploiting the Structure of Change. Machine Learning, 41(2):153–174, 2000.
[4] Yahav Bechavod, Katrina Ligett, Zhiwei Steven Wu, and Juba Ziani. Causal Feature Discovery through Strategic Modification. arXiv preprint, 2020.
[5] Claude Berge. Topological Spaces. Courier Corporation, 1997.
[6] Richard J. Bolton and David J. Hand. Statistical Fraud Detection: A Review. Statistical Science, pages 235–249, 2002.
[7] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.
[8] Michael Brückner, Christian Kanzow, and Tobias Scheffer. Static Prediction Games for Adversarial Learning Problems. Journal of Machine Learning Research, 13(Sep):2617–2654, 2012.
[9] Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
[10] Jenna Burrell, Zoe Kahn, Anne Jonas, and Daniel Griffin.
When Users Control the Algorithms: Values Expressed in Practices on Twitter. Proceedings of the ACM on Human-Computer Interaction, 3:19, 2019.
[11] Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, and Deepak Verma. Adversarial Classification. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99–108, 2004.
[12] Danielle Ensign, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. Runaway Feedback Loops in Predictive Policing. In Proceedings of the 1st ACM Conference on Fairness, Accountability and Transparency, pages 160–171, 2018.
[13] Nicolas Fournier and Arnaud Guillin. On the Rate of Convergence in Wasserstein Distance of the Empirical Measure. Probability Theory and Related Fields, 162(3):707–738, 2015.
[14] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A Survey on Concept Drift Adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.
[15] Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic Classification. In Proceedings of the ACM Conference on Innovations in Theoretical Computer Science, pages 111–122, 2016.
[16] Moritz Hardt, Ben Recht, and Yoram Singer. Train Faster, Generalize Better: Stability of Stochastic Gradient Descent. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1225–1234, 2016.
[17] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness Without Demographics in Repeated Loss Minimization. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 1929–1938, 2018.
[18] Kieran Healy. The Performativity of Networks. European Journal of Sociology/Archives Européennes de Sociologie, 56(2):175–205, 2015.
[19] Lily Hu and Yiling Chen. A Short-term Intervention for Long-term Fairness in the Labor Market.
In Proceedings of the World Wide Web Conference, pages 1389–1398, 2018.
[20] Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The Disparate Effects of Strategic Manipulation. In Proceedings of the 2nd ACM Conference on Fairness, Accountability, and Transparency, pages 259–268, 2019.
[21] Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
[22] Kaggle. Give Me Some Credit. https://www.kaggle.com/c/GiveMeSomeCredit/data, 2012.
[23] Shizuo Kakutani. A Generalization of Brouwer's Fixed Point Theorem. Duke Mathematical Journal, 8(3):457–459, 1941.
[24] Moein Khajehnejad, Behzad Tabibian, Bernhard Schölkopf, Adish Singla, and Manuel Gomez-Rodriguez. Optimal Decision Making Under Strategic Behavior. arXiv preprint arXiv:1905.09239, 2019.
[25] Jon Kleinberg and Manish Raghavan. How Do Classifiers Induce Agents to Invest Effort Strategically? In Proceedings of the ACM Conference on Economics and Computation (EC), pages 825–844, 2019.
[26] Amanda Kube, Sanmay Das, and Patrick J. Fowler. Allocating Interventions Based on Predicted Outcomes: A Case Study on Homelessness Services. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 622–629, 2019.
[27] Anthony Kuh, Thomas Petsche, and Ronald L. Rivest. Learning Time-Varying Concepts. In Advances in Neural Information Processing Systems (NIPS), pages 183–189, 1991.
[28] Lydia T. Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed Impact of Fair Machine Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 3150–3158, 2018.
[29] Kristian Lum and William Isaac. To Predict and Serve? Significance, 13(5):14–19, 2016.
[30] Donald A. MacKenzie, Fabian Muniesa, and Lucia Siu. Do Economists Make Markets?: On the Performativity of Economics. Princeton University Press, 2007.
[31] John Miller, Chloe Hsu, Jordan Troutman, Juan Perdomo, Tijana Zrnic, Lydia Liu, Yu Sun, Ludwig Schmidt, and Moritz Hardt. WhyNot, 2020.
[32] John Miller, Smitha Milli, and Moritz Hardt. Strategic Classification is Causal Modeling in Disguise. arXiv preprint, 2019.
[33] Smitha Milli, John Miller, Anca D. Dragan, and Moritz Hardt. The Social Cost of Strategic Classification. In Proceedings of the 2nd ACM Conference on Fairness, Accountability, and Transparency, pages 230–239, 2019.
[34] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[35] Judea Pearl. Causality. Cambridge University Press, 2009.
[36] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[37] Yonadav Shavit, Benjamin Edelman, and Brian Axelrod. Learning From Strategic Agents: Accuracy, Improvement, and Causality. arXiv preprint, 2020.
[38] Cédric Villani. Topics in Optimal Transportation. Number 58. American Mathematical Society, 2003.
[39] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[40] Geoffrey I. Webb, Roy Hyde, Hong Cao, Hai Long Nguyen, and Francois Petitjean. Characterizing Concept Drift. Data Mining and Knowledge Discovery, 30(4):964–994, 2016.
[41] Indrė Žliobaitė. Learning under Concept Drift: An Overview. arXiv preprint arXiv:1010.4784, 2010.
[42] Indrė Žliobaitė, Mykola Pechenizkiy, and Joao Gama. An Overview of Concept Drift Applications. In Big Data Analysis: New Algorithms for a New Society, pages 91–114. Springer, 2016.

A Applications of performativity

To illustrate the fact that performativity is a common cause of concept drift, we review a table of concept drift applications from [42]. In Table 1, we highlight those settings that naturally occur due to performativity.
Below we briefly discuss the role of performativity in such applications.

Table 1: Table of concept drift applications from Žliobaitė et al. [42].

Industry application | Monitoring & control | Information management | Analytics & diagnostics
Security, Police | fraud detection, insider trading detection, adversary actions detection | next crime place prediction | crime volume prediction
Finance, Banking, Telecom, Insurance, Marketing, Retail, Advertising | monitoring & management of customer segments, bankruptcy prediction | product or service recommendation (including complementary), user intent or information need prediction | demand prediction, response rate prediction, budget planning
Production industry | controlling output quality | — | predict bottlenecks
Education (e-Learning, e-Health), Media, Entertainment | gaming the system, drop-out prediction | music, VOD, movie, news, learning object personalized search & recommendations | player-centered game design, learner-centered education

The role of fraud detection systems is to predict whether an instance such as a transaction or email is legitimate or not. It is well known that designers of such fraudulent instances adapt to the fraud detection system in place in order to breach security [6]. Therefore, the deployment of fraud detection systems shapes the features of fraudulent instances.

Crime place prediction, sometimes referred to as predictive policing [12, 26, 29], uses historical data to estimate the likelihood of crime at a given location. Those locations where criminal behavior is deemed likely by the system typically get more police patrols and better surveillance, which in turn act to deter crime. These actions resulting from prediction significantly decrease the probability of crime taking place, thus changing the data used for future predictions.

In personalized recommendations, instances are recommended to a user based on their historical context, such as their ratings or purchases.
The set of recommendations thus depends on the trained machine learning model, which in turn changes the user's future ratings or purchases [7]. In other words, user features serving as input to a recommender inevitably depend on the previously used recommendation mechanisms.

In online two-player games, it is common to request an AI opponent. The level of sophistication of the AI opponent might be chosen depending on the user's success history in the given game, with the goal of making the game appropriately challenging. This choice of AI opponent changes players' future success profiles, again causing a distribution shift in the features serving as input to the prediction system.

Gaming the system falls under the umbrella of strategic classification, which we discuss in detail in Section 5, so we avoid further discussion in this section.

B Experiments

B.1 Visualizing the performative risk and trajectory of RRM

Figure 4: Performative risk surface and trajectory of repeated risk minimization for two different values of the sensitivity parameter ε: (a) ε = 25 and (b) ε = 100. The initial iterate is the risk minimizer on the base dataset (•). We mark the performative optimum (★) and performatively stable point (×).

We provide additional experimental results in which we visualize the trajectory of repeated risk minimization on the surface of the performative risk. We adopt the general setting of Section 5. However, to properly visualize the loss, we rerun the experiments on a reduced version of the dataset with only two features (i.e., x ∈ R²), both of which are adapted strategically according to the update described in Section 5.
Figure 4 plots the performative risk surface together with the trajectory of RRM, given by straight black lines. The top plot shows the trajectory for a suitably small sensitivity parameter ε. We see that RRM converges to a stable point which is close to the performative optimum. We contrast this behavior with that of RRM when ε is large in the bottom plot. Here, we observe that the iterates oscillate and that the algorithm fails to converge. Both plots mark the risk minimizer on the initial dataset (•), before any strategic adaptation takes place. This point also corresponds to the initial iterate θ_0 of RRM. We additionally mark the performative optimum (★) on the risk surface. The top plot additionally marks the last iterate of RRM, which serves as a proxy for the performatively stable point (×). As predicted by our theory, this stable point is in a small neighborhood around the performative optimum.

B.2 Experimental details

Base distribution. The base distribution consists of the Kaggle dataset [22]. We subsample n = 18,357 points from the original training set such that both classes are approximately balanced (45% of points have y equal to 1). There are a total of 10 features, 3 of which we treat as strategic features: utilization of credit lines, number of open credit lines, and number of real estate loans. We scale features in the base distribution so that they have zero mean and unit variance.

Verifying ε-sensitivity. We verify that the map D(·), as described in Section 5, is ε-sensitive. To do so, we analyze W₁(D(θ), D(θ')) for arbitrary θ, θ' ∈ Θ. Fix a sample point x ∈ R^{m−1} from the base dataset. Because the base distribution D is supported on n points, we can upper bound the optimal transport distance between any pair of distributions D(θ) and D(θ') by the Euclidean distance between the shifted versions of x in D(θ) and D(θ').
In our construction, the point x is shifted to x − εθ and to x − εθ' in D(θ) and D(θ'), respectively. The distance between these two shifted points is

    ‖x − εθ − x + εθ'‖_2 = ε‖θ − θ'‖_2.

Since the same relationship holds for all other samples x in the base dataset, the optimal transport distance from D(θ) to D(θ') is at most ε‖θ − θ'‖_2.

Verifying joint smoothness of the objective. For the experiments described in Figure 2, we run repeated risk minimization and repeated gradient descent on the logistic loss with ℓ₂ regularization:

    (1/n) Σ_{i=1}^n [ −y_i θ^⊤ x_i + log( 1 + exp(θ^⊤ x_i) ) ] + (γ/2)‖θ‖_2²    (1)

For both repeated risk minimization and repeated gradient descent we set γ = 1000/n, where n is the size of the base dataset. For a particular feature-outcome pair (x_i, y_i), the logistic loss is (1/4)‖x_i‖_2²-smooth [36]. Therefore, the entire objective is ( (1/(4n)) Σ_{i=1}^n ‖x_i‖_2² + γ )-smooth. Due to the strategic updates x_BR = x − εθ, the norms of individual features change depending on the choice of model parameters. Theoretically, we could upper bound the smoothness of the objective by finding the implicit constraints on Θ, which can be revealed by looking at the dual of the objective function for every fixed value of ε. However, for simplicity, we simply calculate the worst-case smoothness of the objective, given the trajectory of iterates {θ_t}, for every fixed ε.

Furthermore, we can verify that the logistic loss is jointly smooth. For a fixed example z = (x, y), the gradient of the regularized logistic loss with respect to θ is

    ∇_θ ℓ(z; θ) = −y x + ( exp(θ^⊤ x) / (1 + exp(θ^⊤ x)) ) x + γθ,

which is 2-Lipschitz in z due to y ∈ {0, 1}. Hence, the overall objective is β-jointly smooth with parameter

    β = max{ 2, (1/(4n)) Σ_{i=1}^n ‖x_i‖_2² + γ }.

For RRM, ε is less than γ/β only in the case ε = 0.01.
For RGD, ε is never smaller than the theoretical cutoff of γ / ((β + γ)(1 + 1.5ηβ)).

Optimization details. The definition of RRM requires exact minimization of the objective at every iteration. We approximate this requirement by minimizing the objective in expression (1) to a small tolerance, 10⁻⁸, using gradient descent. We choose the step size at every iteration using backtracking line search. In the case of repeated gradient descent, we run the procedure as described in Definition 3.7 with a fixed step size of η = 2/(β + γ).

C Auxiliary lemmas

Lemma C.1 (First-order optimality condition). Let f be convex and let Ω be a closed convex set on which f is differentiable. Then x* ∈ argmin_{x∈Ω} f(x) if and only if ∇f(x*)^⊤(y − x*) ≥ 0 for all y ∈ Ω.

Lemma C.2 (Bubeck, 2015 [9], Lemma 3.11). Let f: R^d → R be β-smooth and γ-strongly convex. Then for all x, y in R^d,

    (∇f(x) − ∇f(y))^⊤ (x − y) ≥ (γβ/(γ + β)) ‖x − y‖_2² + (1/(γ + β)) ‖∇f(x) − ∇f(y)‖_2².

Lemma C.3 (Kantorovich-Rubinstein). A distribution map D(·) is ε-sensitive if and only if for all θ, θ' ∈ Θ:

    sup { | E_{Z∼D(θ)} g(Z) − E_{Z∼D(θ')} g(Z) | : g: R^p → R, g 1-Lipschitz } ≤ ε‖θ − θ'‖_2.

Lemma C.4. Let f: R^n → R^d be an L-Lipschitz function, and let X, X' ∈ R^n be random variables such that W₁(X, X') ≤ C. Then ‖E[f(X)] − E[f(X')]‖_2 ≤ LC.

Proof. Write

    ‖E[f(X)] − E[f(X')]‖_2² = (E[f(X)] − E[f(X')])^⊤ (E[f(X)] − E[f(X')]).

Now define the unit vector v := (E[f(X)] − E[f(X')]) / ‖E[f(X)] − E[f(X')]‖_2. By linearity of expectation, we can further write

    ‖E[f(X)] − E[f(X')]‖_2² = ‖E[f(X)] − E[f(X')]‖_2 ( E[v^⊤ f(X)] − E[v^⊤ f(X')] ).
For any unit vector $v$ and $L$-Lipschitz function $f$, $v^\top f$ is a one-dimensional $L$-Lipschitz function, so we can apply Lemma C.3 to obtain
$$\|\mathbb{E}[f(X)] - \mathbb{E}[f(X')]\|_2^2 \leq \|\mathbb{E}[f(X)] - \mathbb{E}[f(X')]\|_2\, LC.$$
Canceling $\|\mathbb{E}[f(X)] - \mathbb{E}[f(X')]\|_2$ from both sides concludes the proof. ∎

D Proofs of main results

D.1 Proof of Theorem 3.5

Fix $\theta, \theta' \in \Theta$. Let $f(\varphi) = \mathbb{E}_{Z\sim\mathcal{D}(\theta)}\,\ell(Z;\varphi)$ and $f'(\varphi) = \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\,\ell(Z;\varphi)$. Since $f$ is $\gamma$-strongly convex and $G(\theta)$ is its unique minimizer, we know that
$$f(G(\theta)) - f(G(\theta')) \geq (G(\theta) - G(\theta'))^\top \nabla f(G(\theta')) + \frac{\gamma}{2}\|G(\theta) - G(\theta')\|_2^2, \qquad (2)$$
$$f(G(\theta')) - f(G(\theta)) \geq \frac{\gamma}{2}\|G(\theta) - G(\theta')\|_2^2. \qquad (3)$$
Together, these two inequalities imply that
$$-\gamma\|G(\theta) - G(\theta')\|_2^2 \geq (G(\theta) - G(\theta'))^\top \nabla f(G(\theta')).$$
Next, we observe that $(G(\theta) - G(\theta'))^\top \nabla_\theta \ell(z; G(\theta'))$ is $\|G(\theta) - G(\theta')\|_2\,\beta$-Lipschitz in $z$. This follows from applying Cauchy–Schwarz and the fact that the loss is $\beta$-jointly smooth. Using the dual formulation of the optimal transport distance (Lemma C.3) and $\varepsilon$-sensitivity of $\mathcal{D}(\cdot)$,
$$(G(\theta) - G(\theta'))^\top \nabla f(G(\theta')) - (G(\theta) - G(\theta'))^\top \nabla f'(G(\theta')) \geq -\varepsilon\beta\|G(\theta) - G(\theta')\|_2\|\theta - \theta'\|_2.$$
Furthermore, using the first-order optimality condition for convex functions, we have $(G(\theta) - G(\theta'))^\top \nabla f'(G(\theta')) \geq 0$, and hence
$$(G(\theta) - G(\theta'))^\top \nabla f(G(\theta')) \geq -\varepsilon\beta\|G(\theta) - G(\theta')\|_2\|\theta - \theta'\|_2.$$
Therefore, we conclude that
$$-\gamma\|G(\theta) - G(\theta')\|_2^2 \geq -\varepsilon\beta\|G(\theta) - G(\theta')\|_2\|\theta - \theta'\|_2.$$
Claim (a) then follows by rearranging.

To prove claim (b), note that $\theta_t = G(\theta_{t-1})$ by the definition of RRM, and $G(\theta_{PS}) = \theta_{PS}$ by the definition of stability. Applying the result of part (a) yields
$$\|\theta_t - \theta_{PS}\|_2 \leq \frac{\varepsilon\beta}{\gamma}\|\theta_{t-1} - \theta_{PS}\|_2 \leq \left(\frac{\varepsilon\beta}{\gamma}\right)^t \|\theta_0 - \theta_{PS}\|_2. \qquad (4)$$
Setting this expression to be at most $\delta$ and solving for $t$ completes the proof of claim (b).

D.2 Proof of Proposition 3.6

As for statement (a), we provide one counterexample for each of the statements (b) and (c).

Proof of (b): Consider a type of regularized hinge loss
$$\ell(z;\theta) = C\max(-1,\, y\theta) + \frac{\gamma}{2}(\theta - 1)^2,$$
and suppose $\Theta \supseteq \left[-\frac{1}{2\varepsilon},\, 2\right]$. Let the distribution of $Y$ according to $\mathcal{D}(\theta)$ be a point mass at $\varepsilon\theta$, and let the distribution of $X$ be invariant with respect to $\theta$. Clearly, this distribution map is $\varepsilon$-sensitive. Let $\theta_0 = 2$. Then, by picking $C$ big enough, RRM prioritizes minimizing the first term exactly, and hence we get $\theta_1 = -\frac{1}{2\varepsilon}$. In the next step, again due to large $C$, we get $\theta_2 = 2$. Thus, RRM keeps oscillating between $2$ and $-\frac{1}{2\varepsilon}$, failing to converge. This argument holds for all $\gamma, \varepsilon > 0$.

Proof of (c): Suppose that the loss function is the squared loss, $\ell(z;\theta) = (y - \theta)^2$, where $y, \theta \in \mathbb{R}$. Note that this implies $\beta = \gamma$. Let the distribution of $Y$ according to $\mathcal{D}(\theta)$ be a point mass at $1 + \varepsilon\theta$, and let the distribution of $X$ be invariant with respect to $\theta$. This distribution family satisfies $\varepsilon$-sensitivity, because $W_1(\mathcal{D}(\theta), \mathcal{D}(\theta')) = \varepsilon|\theta - \theta'|$. By properties of the squared loss, we know
$$\arg\min_{\theta'} \mathrm{DPR}(\theta, \theta') = \mathbb{E}_{Z\sim\mathcal{D}(\theta)}[Y] = 1 + \varepsilon\theta.$$
It is thus not hard to see that RRM does not contract if $\varepsilon \geq \frac{\gamma}{\beta} = 1$:
$$|G(\theta) - G(\theta')| = |1 + \varepsilon\theta - 1 - \varepsilon\theta'| = \varepsilon|\theta - \theta'|,$$
which exactly matches the bound of Theorem 3.5 and proves the first statement of the proposition. The unique performatively stable point of this problem is the $\theta$ satisfying $\theta = 1 + \varepsilon\theta$, which is $\theta_{PS} = \frac{1}{1-\varepsilon}$ for $\varepsilon > 1$. For $\varepsilon = 1$, no performatively stable point exists, thereby proving the second claim of the proposition.
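The recursion in this example can be iterated directly to see the threshold at $\varepsilon = 1$ (a small sketch; the function is our own, not from the paper):

```python
def rrm_iterates(eps, theta0=0.0, steps=60):
    """Iterate the RRM map G(theta) = 1 + eps * theta arising from the
    squared-loss example, where Y is a point mass at 1 + eps * theta."""
    thetas = [theta0]
    for _ in range(steps):
        thetas.append(1.0 + eps * thetas[-1])
    return thetas

# eps < 1: distances to the fixed point shrink by a factor eps per step,
# so the iterates approach the stable point 1/(1 - eps)
converging = rrm_iterates(eps=0.5)       # approaches 2.0
# eps >= 1: the map expands distances and RRM fails to converge
expanding = rrm_iterates(eps=1.5, steps=20)
```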
If $\varepsilon > 1$, on the other hand, and $\theta_0 \neq \theta_{PS}$, we either have $\theta_t \to \infty$ or $\theta_t \to -\infty$, because
$$\theta_t = 1 + \varepsilon\theta_{t-1} = \sum_{k=0}^{t-1}\varepsilon^k + \theta_0\varepsilon^t = \frac{\varepsilon^t - 1}{\varepsilon - 1} + \theta_0\varepsilon^t,$$
thus concluding the proof.

D.3 Proof of Theorem 3.8

Since projecting onto a convex set can only bring two iterates closer together, in this proof we ignore the projection operator $\Pi_\Theta$ and treat $G^{\mathrm{gd}}$ as performing merely the gradient step. We begin by expanding $\|G^{\mathrm{gd}}(\theta) - G^{\mathrm{gd}}(\theta')\|_2^2$:
$$\begin{aligned}
\|G^{\mathrm{gd}}(\theta) - G^{\mathrm{gd}}(\theta')\|_2^2 &= \left\|\theta - \eta\,\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \theta' + \eta\,\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2^2 \\
&= \|\theta - \theta'\|_2^2 - 2\eta(\theta - \theta')^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right) \\
&\quad + \eta^2\left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2^2 \\
&\stackrel{\text{def}}{=} T_1 - 2\eta T_2 + \eta^2 T_3.
\end{aligned}$$
Next, we analyze each term individually:
$$T_1 \stackrel{\text{def}}{=} \|\theta - \theta'\|_2^2, \qquad
T_2 \stackrel{\text{def}}{=} (\theta - \theta')^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right),$$
$$T_3 \stackrel{\text{def}}{=} \left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2^2.$$
We start by lower bounding $T_2$:
$$\begin{aligned}
T_2 &= (\theta - \theta')^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) + \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right) \\
&= (\theta - \theta')^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta)\right) + (\theta - \theta')^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right) \\
&\geq -\|\theta - \theta'\|_2\left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta)\right\|_2 + (\theta - \theta')^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right),
\end{aligned}$$
where in the last step we apply the Cauchy–Schwarz inequality. By smoothness, $\nabla_\theta\ell(Z;\theta)$ is $\beta$-Lipschitz in $Z$. Together with the fact that $\mathcal{D}(\cdot)$ is $\varepsilon$-sensitive, we can lower bound the first term in the above expression by applying Lemma C.4, which results in $-\beta\varepsilon\|\theta - \theta'\|_2^2$.
We apply Lemma C.2 to lower bound the second term:
$$\begin{aligned}
(\theta - \theta')^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right) &\geq \frac{\beta\gamma}{\beta+\gamma}\|\theta - \theta'\|_2^2 + \frac{1}{\beta+\gamma}\,\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\left[\|\nabla_\theta\ell(Z;\theta) - \nabla_\theta\ell(Z;\theta')\|_2^2\right] \\
&\geq \frac{\beta\gamma}{\beta+\gamma}\|\theta - \theta'\|_2^2 + \frac{1}{\beta+\gamma}\left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\left[\nabla_\theta\ell(Z;\theta) - \nabla_\theta\ell(Z;\theta')\right]\right\|_2^2,
\end{aligned}$$
where we have applied Jensen's inequality in the last line. Putting everything together, we get
$$T_2 \geq \left(\frac{\beta\gamma}{\beta+\gamma} - \beta\varepsilon\right)\|\theta - \theta'\|_2^2 + \frac{1}{\beta+\gamma}\left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\left[\nabla_\theta\ell(Z;\theta) - \nabla_\theta\ell(Z;\theta')\right]\right\|_2^2.$$
Now we upper bound $T_3$. We begin by expanding the square just as before:
$$\begin{aligned}
T_3 &= \left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) + \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2^2 \\
&= \left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta)\right\|_2^2 + \left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2^2 \\
&\quad + 2\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta)\right)^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right). \qquad (5)
\end{aligned}$$
We again bound each term individually. By the smoothness of the loss and Lemma C.4,
$$\left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta)\right\|_2^2 \leq \beta^2\varepsilon^2\|\theta - \theta'\|_2^2.$$
Moving on to the last term in (5):
$$\begin{aligned}
&2\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta)\right)^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right) \\
&= 2\left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta)\right)^\top v \\
&= 2\left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta)} v^\top\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')} v^\top\nabla_\theta\ell(Z;\theta)\right),
\end{aligned}$$
where we define the unit vector
$$v \stackrel{\text{def}}{=} \frac{\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')}{\left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2}.$$
By smoothness of the loss, we can conclude that $v^\top\nabla_\theta\ell(Z;\theta)$ is $\beta$-Lipschitz in $Z$, so by $\varepsilon$-sensitivity we get
$$2\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta)}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta)\right)^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right) \leq 2\left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2\beta\varepsilon\|\theta - \theta'\|_2 \leq 2\beta^2\varepsilon\|\theta - \theta'\|_2^2,$$
where in the last step we again apply smoothness. Hence,
$$T_3 \leq (\varepsilon^2\beta^2 + 2\beta^2\varepsilon)\|\theta - \theta'\|_2^2 + \left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2^2.$$
Having bounded all the terms, we now conclude that
$$\|G^{\mathrm{gd}}(\theta) - G^{\mathrm{gd}}(\theta')\|_2^2 \leq \left(1 + \eta^2\varepsilon^2\beta^2 + 2\eta^2\beta^2\varepsilon - \frac{2\eta\beta\gamma}{\beta+\gamma} + 2\eta\beta\varepsilon\right)\|\theta - \theta'\|_2^2 - \left(\frac{2\eta}{\beta+\gamma} - \eta^2\right)\left\|\mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta) - \mathbb{E}_{Z\sim\mathcal{D}(\theta')}\nabla_\theta\ell(Z;\theta')\right\|_2^2.$$
If we take the step size $\eta$ to be small enough, namely $\eta \leq \frac{2}{\beta+\gamma}$, we get
$$\|G^{\mathrm{gd}}(\theta) - G^{\mathrm{gd}}(\theta')\|_2^2 \leq \left(1 + \eta^2\varepsilon^2\beta^2 + 2\eta^2\beta^2\varepsilon - \frac{2\eta\beta\gamma}{\beta+\gamma} + 2\eta\beta\varepsilon\right)\|\theta - \theta'\|_2^2.$$
To ensure a contraction, we need $\frac{2\eta\beta\gamma}{\beta+\gamma} - \eta^2\varepsilon^2\beta^2 - 2\eta^2\beta^2\varepsilon - 2\eta\beta\varepsilon > 0$. Canceling $\eta\beta$, and assuming $\varepsilon \leq 1$, it suffices to have $\frac{2\gamma}{\beta+\gamma} - 3\eta\varepsilon\beta - 2\varepsilon > 0$. Therefore, if $\varepsilon < \frac{\gamma}{(\beta+\gamma)(1+1.5\eta\beta)} \leq 1$, the map $G^{\mathrm{gd}}$ is contractive. In particular, we have
$$\|G^{\mathrm{gd}}(\theta) - G^{\mathrm{gd}}(\theta')\|_2 \leq \sqrt{1 - \eta\left(\frac{2\beta\gamma}{\beta+\gamma} - \varepsilon(3\eta\beta^2 + 2\beta)\right)}\,\|\theta - \theta'\|_2 \leq \left(1 - \eta\left(\frac{\beta\gamma}{\beta+\gamma} - \varepsilon(1.5\eta\beta^2 + \beta)\right)\right)\|\theta - \theta'\|_2,$$
where we use the fact that $\sqrt{1-x} \leq 1 - \frac{x}{2}$ for $x \in [0,1]$. This completes the proof of part (a). Since we have shown $G^{\mathrm{gd}}$ is contractive, by the Banach fixed-point theorem we know that there exists a unique fixed point of $G^{\mathrm{gd}}$.
That is, there exists $\theta_{PS}$ such that $\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\nabla_\theta\ell(Z;\theta_{PS}) = 0$. By convexity of the loss function, this means that $\theta_{PS}$ is the minimizer of $\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\ell(Z;\theta)$ over $\theta$, which in turn implies that $\theta_{PS}$ is performatively stable. Recursively applying the result of part (a), we get the rate of convergence of RGD to $\theta_{PS}$:
$$\|\theta_t - \theta_{PS}\|_2 \leq \left(1 - \eta\left(\frac{\beta\gamma}{\beta+\gamma} - \varepsilon(1.5\eta\beta^2 + \beta)\right)\right)^t\|\theta_0 - \theta_{PS}\|_2 \leq \exp\left(-t\eta\left(\frac{\beta\gamma}{\beta+\gamma} - \varepsilon(1.5\eta\beta^2 + \beta)\right)\right)\|\theta_0 - \theta_{PS}\|_2,$$
where in the last step we use the fact that $1 - x \leq e^{-x}$. Setting this expression to be at most $\delta$ and solving for $t$ completes the proof.

D.4 Proof of Theorem 3.10

Proof of (a): We introduce the main proof idea and then present the full argument. The proof proceeds by case analysis. First, we show that if $\|\theta_t - \theta_{PS}\|_2 > \delta$, performing ERM ensures that with high probability
$$\|\theta_{t+1} - \theta_{PS}\|_2 \leq \frac{2\varepsilon\beta}{\gamma}\|\theta_t - \theta_{PS}\|_2.$$
Using our assumption that $\varepsilon < \frac{\gamma}{2\beta}$, this implies that the iterate $\theta_{t+1}$ contracts toward $\theta_{PS}$. On the other hand, if $\|\theta_t - \theta_{PS}\|_2 \leq \delta$, we show that while ERM might not contract, it cannot push $\theta_{t+1}$ too far from $\theta_{PS}$ either. In particular, $\theta_{t+1}$ must lie in a $\frac{2\varepsilon\beta}{\gamma}\delta$-ball around $\theta_{PS}$. The proof then concludes by arguing that $\theta_t$ for $t \geq \frac{\log(\|\theta_0 - \theta_{PS}\|_2/\delta)}{\log(\gamma/(2\varepsilon\beta))}$ must enter a ball of radius $\delta$ around $\theta_{PS}$. Once this event occurs, no future iterate can exit the $\frac{2\varepsilon\beta}{\gamma}\delta$-ball around $\theta_{PS}$.

Case 1: $\|\theta_t - \theta_{PS}\|_2 > \delta$. If the current iterate is outside the ball, we show that with high probability the next iterate contracts toward the performatively stable point; in particular, $\|\theta_{t+1} - \theta_{PS}\|_2 \leq \frac{2\varepsilon\beta}{\gamma}\|\theta_t - \theta_{PS}\|_2$. To prove this claim, we begin by showing that
$$W_1(\mathcal{D}_{n_t}(\theta_t), \mathcal{D}(\theta_{PS})) \leq 2\varepsilon\|\theta_t - \theta_{PS}\|_2, \quad \text{with probability } 1 - \frac{6p}{\pi^2 t^2}. \qquad (6)$$
Since the $W_1$-distance is a metric on the space of distributions, we can apply the triangle inequality to get
$$W_1(\mathcal{D}_{n_t}(\theta_t), \mathcal{D}(\theta_{PS})) \leq W_1(\mathcal{D}_{n_t}(\theta_t), \mathcal{D}(\theta_t)) + W_1(\mathcal{D}(\theta_t), \mathcal{D}(\theta_{PS})).$$
The second term is bounded deterministically by $\varepsilon\|\theta_t - \theta_{PS}\|_2$ due to $\varepsilon$-sensitivity. By Theorem 2 of Fournier & Guillin, 2015 [13], for $n_t \geq \frac{1}{c_2(\varepsilon\delta)^m}\log\left(\frac{t^2\pi^2 c_1}{6p}\right)$, the probability that the first term exceeds $\varepsilon\delta$ is less than $\frac{6p}{t^2\pi^2}$. Here, the positive constants $c_1, c_2$ depend on $\alpha, \mu, \xi_{\alpha,\mu}$, and $m$. Therefore,
$$W_1(\mathcal{D}_{n_t}(\theta_t), \mathcal{D}(\theta_{PS})) \leq \varepsilon\delta + \varepsilon\|\theta_t - \theta_{PS}\|_2 \leq 2\varepsilon\|\theta_t - \theta_{PS}\|_2,$$
with probability $1 - \frac{6p}{\pi^2 t^2}$.

Using this, we can now prove that the iterates contract. Following the first steps of the proof of Theorem 3.5, we have
$$\begin{aligned}
&(G_{n_t}(\theta_t) - G(\theta_{PS}))^\top\left(\mathbb{E}_{Z\sim\mathcal{D}_{n_t}(\theta_t)}\nabla_\theta\ell(Z;G_{n_t}(\theta_t)) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\nabla_\theta\ell(Z;G_{n_t}(\theta_t))\right) \\
&\quad + (G_{n_t}(\theta_t) - G(\theta_{PS}))^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\nabla_\theta\ell(Z;G_{n_t}(\theta_t)) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\nabla_\theta\ell(Z;G(\theta_{PS}))\right) \leq 0. \qquad (7)
\end{aligned}$$
As in the proof of Theorem 3.5, the function $(G_{n_t}(\theta_t) - G(\theta_{PS}))^\top\nabla_\theta\ell(z;G_{n_t}(\theta_t))$ is $\|G_{n_t}(\theta_t) - G(\theta_{PS})\|_2\,\beta$-Lipschitz in $z$. Using equation (6), with probability $1 - \frac{6p}{\pi^2 t^2}$ we can bound the first term by
$$(G_{n_t}(\theta_t) - G(\theta_{PS}))^\top\left(\mathbb{E}_{Z\sim\mathcal{D}_{n_t}(\theta_t)}\nabla_\theta\ell(Z;G_{n_t}(\theta_t)) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\nabla_\theta\ell(Z;G_{n_t}(\theta_t))\right) \geq -2\varepsilon\beta\|G_{n_t}(\theta_t) - G(\theta_{PS})\|_2\|\theta_t - \theta_{PS}\|_2.$$
And by strong convexity,
$$(G_{n_t}(\theta_t) - G(\theta_{PS}))^\top\left(\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\nabla_\theta\ell(Z;G_{n_t}(\theta_t)) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\nabla_\theta\ell(Z;G(\theta_{PS}))\right) \geq \gamma\|G_{n_t}(\theta_t) - G(\theta_{PS})\|_2^2.$$
Plugging back into equation (7), we conclude that with high probability
$$\|\theta_{t+1} - \theta_{PS}\|_2 \leq \frac{2\varepsilon\beta}{\gamma}\|\theta_t - \theta_{PS}\|_2.$$
Applying a union bound, we conclude that the iterates contract at every iteration where $\|\theta_t - \theta_{PS}\|_2 > \delta$ with probability at least $1 - \sum_{t=1}^\infty \frac{6p}{\pi^2 t^2} = 1 - p$. Therefore, after $t \geq \left(1 - \frac{2\varepsilon\beta}{\gamma}\right)^{-1}\log\left(\frac{\|\theta_0 - \theta_{PS}\|_2}{\delta}\right)$ steps, we have
$$\|\theta_t - \theta_{PS}\|_2 \leq \left(\frac{2\varepsilon\beta}{\gamma}\right)^t\|\theta_0 - \theta_{PS}\|_2 \leq \exp\left(-t\left(1 - \frac{2\varepsilon\beta}{\gamma}\right)\right)\|\theta_0 - \theta_{PS}\|_2 \leq \delta,$$
where we use $1 - x \leq e^{-x}$. This implies that $\theta_t$ eventually contracts to a ball of radius $\delta$ around $\theta_{PS}$.

Case 2: $\|\theta_t - \theta_{PS}\|_2 \leq \delta$. We show that the RERM iterates can leave a ball of radius $\delta$ around $\theta_{PS}$ only with negligible probability. We begin by applying the triangle inequality just as in the previous case:
$$W_1(\mathcal{D}_{n_t}(\theta_t), \mathcal{D}(\theta_{PS})) \leq W_1(\mathcal{D}_{n_t}(\theta_t), \mathcal{D}(\theta_t)) + W_1(\mathcal{D}(\theta_t), \mathcal{D}(\theta_{PS})) \leq W_1(\mathcal{D}_{n_t}(\theta_t), \mathcal{D}(\theta_t)) + \varepsilon\delta.$$
For our choice of $n_t$, with probability at least $1 - \frac{6p}{\pi^2 t^2}$ this quantity is upper bounded by
$$W_1(\mathcal{D}_{n_t}(\theta_t), \mathcal{D}(\theta_{PS})) \leq 2\varepsilon\delta.$$
With this information, we can now apply the exact same steps as in the previous case, using the bound $W_1(\mathcal{D}_{n_t}(\theta_t), \mathcal{D}(\theta_{PS})) \leq 2\varepsilon\delta$ in place of $W_1(\mathcal{D}_{n_t}(\theta_t), \mathcal{D}(\theta_{PS})) \leq 2\varepsilon\|\theta_t - \theta_{PS}\|_2$, to conclude that with probability at least $1 - \frac{6p}{\pi^2 t^2}$,
$$\|\theta_{t+1} - \theta_{PS}\|_2 \leq \frac{2\varepsilon\beta}{\gamma}\delta \leq \delta.$$
As before, a union bound argument proves that the entire analysis holds with probability $1 - p$.

Proof of (b): The only difference between part (b) and part (a) is that one needs to invoke the steps of Theorem 3.8 rather than Theorem 3.5.

D.5 Proof of Proposition 4.1

We begin by defining the set-valued function $g(\theta) = \arg\min_{\theta'\in\Theta}\mathrm{DPR}(\theta, \theta')$. Observe that fixed points of this function correspond to models which are performatively stable. The proof thereby follows from showing that the function $g(\cdot)$ has a fixed point.
Since the loss is jointly continuous and the set $\Theta$ is compact, we can apply Berge's Maximum Theorem [1, 5] to conclude that the function $g(\cdot)$ is upper hemicontinuous with compact and non-empty values. Furthermore, by convexity of the loss, it follows that, in addition to being compact and non-empty, $g(\theta)$ is a convex set for every $\theta \in \Theta$. Therefore, the conditions of Kakutani's Theorem [23] (see also Ch. 17 in [1]) hold, and we can conclude that $g(\cdot)$ has a fixed point. Hence, a performatively stable model exists.

D.6 Proof of Proposition 4.2

We make a slight modification to Example 2.2 to prove the proposition. As in the example, $\mathcal{D}(\theta)$ is given as follows: $X$ is a single feature supported on $\{\pm 1\}$, and $Y \mid X \sim \mathrm{Bernoulli}\left(\frac{1}{2} + \mu X + \varepsilon\theta X\right)$, where $\Theta = [0, 1]$. We let $\varepsilon \geq \frac{1}{2}$, and constrain $\mu$ to satisfy $|\mu + \varepsilon| \leq \frac{1}{2}$. We assume that outcomes are predicted according to the model $f_\theta(x) = \theta x + \frac{1}{2}$ and that performance is measured via the squared loss, $\ell(z;\theta) = (y - f_\theta(x))^2$. This loss has condition number $\frac{\beta}{\gamma} = 1$. A direct calculation demonstrates that the performative risk is quadratic in $\theta$:
$$\mathrm{PR}(\theta) = \frac{1}{4} - 2\theta\mu + (1 - 2\varepsilon)\theta^2.$$
Therefore, if $\varepsilon \in \left[\frac{1}{2}, 1\right)$, the performative risk is a concave function of $\theta$, even though $\varepsilon < \frac{\gamma}{\beta}$.

D.7 Proof of Theorem 4.3

By definition of performative optimality and performative stability, we have
$$\mathrm{DPR}(\theta_{PO}, \theta_{PO}) \leq \mathrm{DPR}(\theta_{PS}, \theta_{PS}) \leq \mathrm{DPR}(\theta_{PS}, \theta_{PO}).$$
We claim that $\mathrm{DPR}(\theta_{PS}, \theta_{PO}) - \mathrm{DPR}(\theta_{PS}, \theta_{PS}) \geq \frac{\gamma}{2}\|\theta_{PO} - \theta_{PS}\|_2^2$. By definition of DPR, we can write
$$\mathrm{DPR}(\theta_{PS}, \theta_{PO}) - \mathrm{DPR}(\theta_{PS}, \theta_{PS}) = \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\left[\ell(Z;\theta_{PO}) - \ell(Z;\theta_{PS})\right].$$
Since $\ell(z;\theta_{PO}) \geq \ell(z;\theta_{PS}) + \nabla_\theta\ell(z;\theta_{PS})^\top(\theta_{PO} - \theta_{PS}) + \frac{\gamma}{2}\|\theta_{PO} - \theta_{PS}\|_2^2$ for all $z$, we have that
$$\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\left[\ell(Z;\theta_{PO}) - \ell(Z;\theta_{PS})\right] \geq \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\left[\nabla_\theta\ell(Z;\theta_{PS})^\top(\theta_{PO} - \theta_{PS})\right] + \frac{\gamma}{2}\|\theta_{PO} - \theta_{PS}\|_2^2. \qquad (8)$$
Now, by Lemma C.1, $\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\left[\nabla_\theta\ell(Z;\theta_{PS})^\top(\theta_{PO} - \theta_{PS})\right] \geq 0$, so equation (8) implies
$$\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS})}\left[\ell(Z;\theta_{PO}) - \ell(Z;\theta_{PS})\right] \geq \frac{\gamma}{2}\|\theta_{PO} - \theta_{PS}\|_2^2.$$
Since the population distributions are $\varepsilon$-sensitive and the loss is $L_z$-Lipschitz in $z$, we have that
$$\mathrm{DPR}(\theta_{PS}, \theta_{PO}) - \mathrm{DPR}(\theta_{PO}, \theta_{PO}) \leq L_z\varepsilon\|\theta_{PO} - \theta_{PS}\|_2.$$
If $\varepsilon < \frac{\gamma\|\theta_{PO} - \theta_{PS}\|_2}{2L_z}$, then $L_z\varepsilon\|\theta_{PO} - \theta_{PS}\|_2 < \frac{\gamma}{2}\|\theta_{PO} - \theta_{PS}\|_2^2$, which is a contradiction, since it must hold that
$$\mathrm{DPR}(\theta_{PS}, \theta_{PO}) - \mathrm{DPR}(\theta_{PO}, \theta_{PO}) \geq \mathrm{DPR}(\theta_{PS}, \theta_{PO}) - \mathrm{DPR}(\theta_{PS}, \theta_{PS}).$$

D.8 Proof of Corollary 5.1

By Theorem 3.5 we know that repeated risk minimization converges at a linear rate to a performatively stable point $\theta_{PS}$. Furthermore, by Theorem 4.3, this performatively stable point is close in domain to the institution's Stackelberg equilibrium classifier $\theta_{SE}$:
$$\|\theta_{SE} - \theta_{PS}\|_2 \leq \frac{2L_z\varepsilon}{\gamma}.$$
We can then use the fact that the loss is Lipschitz to show that this performatively stable classifier is close in objective value to the Stackelberg equilibrium:
$$\mathrm{PR}(\theta_{PS}) - \mathrm{PR}(\theta_{SE}) \leq \left|\mathrm{PR}(\theta_{PS}) - \mathrm{DPR}(\theta_{PS}, \theta_{SE})\right| + \left|\mathrm{DPR}(\theta_{PS}, \theta_{SE}) - \mathrm{PR}(\theta_{SE})\right| \leq L_\theta\|\theta_{SE} - \theta_{PS}\|_2 + L_z\varepsilon\|\theta_{SE} - \theta_{PS}\|_2 \leq \frac{2L_z\varepsilon(L_\theta + L_z\varepsilon)}{\gamma}.$$
Here, we have used the Kantorovich–Rubinstein lemma (C.3) to bound the second term.

E Approximately minimizing performative risk via regularization

Recall that in Proposition 3.6 we have shown that RRM might not converge at all if the objective is smooth and convex, but not strongly convex. In this section, we show how adding a small amount of quadratic regularization to the objective guarantees that RRM converges to a stable point which approximately minimizes the performative risk of the original loss. To do so, we additionally require that the space of model parameters $\Theta$ be bounded with diameter $D = \sup_{\theta,\theta'\in\Theta}\|\theta - \theta'\|_2$.
We can assume without loss of generality that $D = 1$.

Proposition E.1. Suppose that the loss $\ell(z;\theta)$ is $L_z$-Lipschitz in $z$ and $L_\theta$-Lipschitz in $\theta$, $\beta$-jointly smooth (A1), and convex (but not necessarily strongly convex). Furthermore, suppose that the distribution map $\mathcal{D}(\cdot)$ is $\varepsilon$-sensitive with $\varepsilon < 1$, and that the set $\Theta$ is bounded with diameter 1. Then there exists a choice of $\alpha$ such that running RRM with the loss
$$\ell^{\mathrm{reg}}(z;\theta) \stackrel{\text{def}}{=} \ell(z;\theta) + \frac{\alpha}{2}\|\theta - \theta_0\|_2^2$$
converges to a performatively stable point $\theta_{PS}^{\mathrm{reg}}$ which satisfies
$$\mathrm{PR}(\theta_{PS}^{\mathrm{reg}}) \leq \min_\theta \mathrm{PR}(\theta) + O\left(\frac{\sqrt{\varepsilon}}{1-\varepsilon}\right).$$
We note that in the case where $\varepsilon = 0$, the limit point $\theta_{PS}^{\mathrm{reg}}$ of regularized repeated risk minimization is also performatively optimal.

Proof. First, we observe that the regularized loss $\ell^{\mathrm{reg}}(z;\theta)$ is $\alpha$-strongly convex and $(\alpha+\beta)$-jointly smooth. Since $\varepsilon < 1$, we can choose $\alpha$ such that $\varepsilon < \frac{\alpha}{\alpha+\beta}$; in particular, we choose $\alpha = \sqrt{\varepsilon}\beta/(1-\varepsilon)$. From our choice of $\alpha$, $\varepsilon$ is smaller than the inverse condition number. Hence, by Theorem 3.5, repeated risk minimization converges at a linear rate to a performatively stable solution $\theta_{PS}^{\mathrm{reg}}$ of the regularized objective. To finish the proof, we show that the objective value at $\theta_{PS}^{\mathrm{reg}}$ is close to the objective value at the performative optimum $\theta_{PO}$ of the original objective.
We do so by bounding their difference using the triangle inequality:
$$\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PS}^{\mathrm{reg}}) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PO})}\ell(Z;\theta_{PO}) = \left[\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PS}^{\mathrm{reg}}) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PO}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PO}^{\mathrm{reg}})\right] + \left[\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PO}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PO}^{\mathrm{reg}}) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PO})}\ell(Z;\theta_{PO})\right].$$
We can bound the first difference via Lipschitzness:
$$\begin{aligned}
\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PS}^{\mathrm{reg}}) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PO}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PO}^{\mathrm{reg}}) &= \left[\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PS}^{\mathrm{reg}}) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PO}^{\mathrm{reg}})\right] + \left[\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PO}^{\mathrm{reg}}) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PO}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PO}^{\mathrm{reg}})\right] \\
&\leq \left(L_\theta + \alpha\sup_{\theta,\theta'\in\Theta}\|\theta - \theta'\|_2\right)\|\theta_{PS}^{\mathrm{reg}} - \theta_{PO}^{\mathrm{reg}}\|_2 + \varepsilon L_z\|\theta_{PS}^{\mathrm{reg}} - \theta_{PO}^{\mathrm{reg}}\|_2 \\
&= (L_\theta + \alpha + \varepsilon L_z)\|\theta_{PS}^{\mathrm{reg}} - \theta_{PO}^{\mathrm{reg}}\|_2 \leq \frac{2(L_\theta + \alpha + \varepsilon L_z)L_z\varepsilon}{\alpha}.
\end{aligned}$$
In the last two lines, we have applied the fact that $D = \sup_{\theta,\theta'\in\Theta}\|\theta - \theta'\|_2 = 1$, as well as Theorem 4.3. For the second difference, by the definition of performative optimality we have
$$\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PO}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PO}^{\mathrm{reg}}) \leq \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PO})}\ell^{\mathrm{reg}}(Z;\theta_{PO}) \leq \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PO})}\ell(Z;\theta_{PO}) + \frac{\alpha}{2},$$
where we have again used the fact that $D = 1$ for the last inequality. Combining these two bounds, we can bound the total difference:
$$\mathbb{E}_{Z\sim\mathcal{D}(\theta_{PS}^{\mathrm{reg}})}\ell^{\mathrm{reg}}(Z;\theta_{PS}^{\mathrm{reg}}) - \mathbb{E}_{Z\sim\mathcal{D}(\theta_{PO})}\ell(Z;\theta_{PO}) \leq \frac{2(L_\theta + \alpha + \varepsilon L_z)L_z\varepsilon}{\alpha} + \frac{\alpha}{2}.$$
Plugging in $\alpha = \frac{\sqrt{\varepsilon}\beta}{1-\varepsilon}$ completes the proof. ∎
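The choice of $\alpha$ in the proof can be checked numerically. The sketch below (our own code, not from the paper) verifies that $\alpha = \sqrt{\varepsilon}\beta/(1-\varepsilon)$ indeed places $\varepsilon$ below the inverse condition number $\alpha/(\alpha+\beta)$ of the regularized objective:

```python
import math

def reg_strength(eps, beta):
    # alpha = sqrt(eps) * beta / (1 - eps), as chosen in the proof
    assert 0.0 < eps < 1.0
    return math.sqrt(eps) * beta / (1.0 - eps)

def inverse_condition_number(alpha, beta):
    # the regularized loss is alpha-strongly convex and (alpha + beta)-smooth
    return alpha / (alpha + beta)

# eps < alpha/(alpha+beta) holds for every eps in (0, 1); note the ratio
# alpha/(alpha+beta) depends on eps only, not on beta
for eps in (0.01, 0.1, 0.5, 0.9):
    for beta in (0.5, 2.0, 100.0):
        alpha = reg_strength(eps, beta)
        assert eps < inverse_condition_number(alpha, beta)
```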
