Provable Adversarial Robustness in In-Context Learning

Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This a…

Authors: Di Zhang

Provable Adversarial Robustness in In-Context Learning
P R OV A B L E A D V E R S A R I A L R O B U S T N E S S I N I N - C O N T E X T L E A R N I N G Di Zhang School of AI and Advanced Computing Xi’an Jiaotong-Liv erpool Univ ersity Suzhou, Jiangsu, China di.zhang@xjtlu.edu.cn February 23, 2026 A B S T R AC T Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption ov erlooks adversarial distribution shifts that threaten real-world reliability . T o address this gap, we introduce a distributionally rob ust meta-learning frame work that provides worst-case performance guarantees for ICL under W asserstein- based distribution shifts. Focusing on linear self-attention T ransformers, we deri ve a non-asymptotic bound linking adversarial perturbation strength ( ρ ), model capacity ( m ), and the number of in-context examples ( N ). The analysis rev eals that model robustness scales with the square root of its capacity ( ρ max ∝ √ m ), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ( N ρ − N 0 ∝ ρ 2 ). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL ’ s limits under adv ersarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness. K eywords: In-Context Learning, Distributionally Robust Optimization, Meta-Learning, T ransformer Theory , Adversarial Generalization 1 Introduction Large language models demonstrate a remarkable ability to perform in-conte xt learning (ICL), adapting to new tasks from a few e xample prompts without updating their parameters [ 1 , 2 ]. Existing theoretical frameworks e xplain ICL through Bayesian inference [ 3 , 4 ] or implicit gradient descent [ 5 , 6 ]. These explanations rest on a critical assumption: test tasks are drawn from a distribution similar to the pretraining data. In practice, this assumption can be violated by malicious attacks [ 7 ] or unintended distribution shifts, posing challenges to reliable deployment [ 8 , 9 ]. Recent work on mild shifts [ 10 ] does not address worst-case adv ersarial perturbations, leaving open the question of ho w ICL performance degrades under the most se vere possible task distrib ution shifts. T o answer this question, we formulate ICL within a distributionally robust meta-learning framework, grounded in Distributionally Rob ust Optimization (DR O) [ 11 , 12 ]. This framework ev aluates a model’ s worst-case performance when test tasks are dra wn from an y distrib ution within a W asserstein ball centered on the true task distribution, providing robust guarantees against adv ersarial shifts. W e analyze a theoretically tractable yet e xpressiv e class of linear self-attention T ransformers. W ithin this setting, we establish a non-asymptotic upper bound for the worst-case meta-risk. This bound explicitly connects the perturbation radius ( ρ ), model capacity (reflected in the attention head dimension m ), and the number of in-context e xamples ( N ). Our analysis uncov ers a precise rob ustness-capacity trade-of f: the maximum adv ersarial shift a model can tolerate scales with the square root of its attention head dimension ( ρ max ∝ √ m ). This result formalizes the empirical intuition that larger models tend to be more rob ust. Furthermore, maintaining performance under shift requires additional in-context A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 examples, with the required increase proportional to the squared perturbation magnitude ( N ρ − N 0 ∝ ρ 2 ). Experiments on synthetic tasks confirm these theoretical scaling laws. This work contributes to a more rigorous understanding of ICL ’ s limits under adversarial conditions. It also offers a principled perspective on safety alignment, suggesting that robustness may be better viewed as a property tied to intrinsic model capacity rather than being solely addressable through post-hoc interventions. 2 Related W ork Our work b uilds upon and connects three areas of research: theories of ICL, distributionally rob ust optimization, and model safety . A key research direction seeks to explain the mechanisms behind ICL. From a Bayesian perspectiv e, ICL can be interpreted as performing implicit posterior inference gi ven a task prior [ 3 , 4 ]. An alternati ve and influential line of work views the forw ard pass of linear T ransformers as executing steps of optimization algorithms like gradient descent [ 5 , 6 ]. Other studies analyze ICL ’ s generalization capabilities, sho wing Transformers can in-conte xt learn linear functions and approach optimal estimators [ 13 , 14 , 15 ]. A common, often implicit, premise across these works is that the test task distribution remains consistent with the pretraining distrib ution. Our work departs from this premise to explicitly study performance under adversarial distrib ution shifts. Theoretical analysis of ICL ’ s adaptability to distrib ution shifts is a nascent area. [ 16 ] highlighted the role of pretraining task diversity , suggesting a transition between Bayesian and ridge regression behaviors. The closest study to ours is by [ 10 ], which provides a formal framew ork for ICL ’ s distrib utional robustness using a χ 2 -div ergence constraint, proving an optimal con ver gence rate within a distrib ution ball. This work is a significant step forward. Ho wev er , χ 2 -div ergence can be less ef fectiv e at capturing perturbations to a distribution’ s covariance structure, which are central to many adv ersarial scenarios. Furthermore, their analysis does not yield an explicit bound that re veals the interplay between model parameters and robustness. Our work advances this direction by adopting the W asserstein distance, which provides a more geometrically intuiti ve measure of shift [ 11 ], and by deriving an e xplicit, non-asymptotic upper bound linking robustness to model capacity and sample size. Distributionally Rob ust Optimization (DRO) pro vides a mature framew ork for making decisions robust to distributional uncertainty [ 12 , 17 ]. While its generalization theory for single-task learning is well-developed [ 18 ], its intersection with sequential, few-shot meta-learning as seen in ICL remains lar gely unexplored. Applying the DRO philosoph y to meta-learning requires considering worst-case distributions at both the task and within-task data lev els. Our framew ork formalizes this challenge for the ICL setting, contributing to bridging DR O theory and meta-learning. On the practical side, the adversarial robustness and safety of large language models are major concerns, with empirical studies rev ealing vulnerabilities to jailbreaking attacks [ 8 , 9 ]. Our work aims to provide a formal, distributional model for such phenomena. By establishing a theoretical link between model capacity and its intrinsic robustness radius, we offer a principled explanation for empirical observ ations that larger models may be more resistant to certain types of interference. In summary , we draw from theoretical ICL framew orks but focus on adv ersarial shifts. W e extend the work of [ 10 ] by using the W asserstein metric and deriving a capacity-dependent robustness bound, connecting DR O theory with ICL to provide a ne w theoretical foundation for understanding distributional rob ustness. 3 Problem F ormulation: Distributionally Rob ust ICL W e formalize the analysis of ICL under adversarial distribution shift, starting from a standard setup and then introducing our robust formulation. 3.1 Standard In-Context Learning Setup Follo wing prior theoretical work [ 5 , 6 , 10 , 19 ], we focus on linear regression tasks. Each task τ is defined by a weight vector β τ ∈ R d . For that task, data is generated as x ∼ N (0 , I d ) and y = x ⊤ β τ + ϵ , with ϵ ∼ N (0 , σ 2 ) . An ICL prompt provides N examples, D N = { ( x i , y i ) } N i =1 , followed by a test input x test . A T ransformer model f θ processes this sequence to predict y test . Its parameters θ are pretrained by minimizing the expected risk over a task distribution P : min θ L P ( θ ) := E τ ∼ P E D N ,x test [ ℓ ( f θ ( D N , x test ) , y test )] . 2 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 Figure 1: Conceptual illustration of an adversarial distrib ution shift within the W asserstein ball B ρ ( Q 0 ) . The nominal distribution Q 0 (blue) and an adversarial distribution Q (red) are sho wn. The black arrow represents the W asserstein distance ρ between them. The background schematic connects task points to a T ransformer via attention lines. At test time, the model encounters a task from a true test distribution Q 0 and makes predictions without parameter updates. 3.2 Adversarial Shift and W asserstein Uncertainty Standard theory often assumes Q 0 is similar to P . T o model an adversarial en vironment, we consider that the actual test distribution could be a perturbed version of Q 0 . W e require the model to perform well on all distributions within a neighborhood of Q 0 , adopting the Distributionally Rob ust Optimization (DR O) philosophy . W e use the W asserstein distance to define this neighborhood, as it provides a natural geometric measure suitable for feature-space perturbations. The p -th order W asserstein distance is denoted by W p . W e define the W asserstein adversarial task ball of radius ρ ≥ 0 centered at Q 0 as: B ρ ( Q 0 ) := { Q : W p ( Q , Q 0 ) ≤ ρ } . This set contains all task distributions within a ρ -distance from Q 0 , with ρ quantifying the adversarial perturbation strength. 3.3 Distributionally Rob ust Meta-Risk Gi ven a pretrained model with parameters θ , its expected risk under a distribution Q is L Q ( θ ) . W e define the worst-case meta-risk as the supremum of this risk ov er the adversarial ball: R ρ ( θ ) := sup Q ∈B ρ ( Q 0 ) L Q ( θ ) . (1) This metric bounds the model’ s performance against worst-case shifts. Our objecti ves are: (i) to deriv e a non-asymptotic upper bound for R ρ ( θ ∗ ) for a pretrained parameter θ ∗ ; (ii) to understand how this bound depends on ρ , model capacity , and the number of in-context e xamples N ; and (iii) to establish conditions under which R ρ ( θ ∗ ) does not significantly exceed the nominal risk L Q 0 ( θ ∗ ) . 3 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 3.4 T ractable Linear T ransformer Model and Assumptions T o enable rigorous analysis, we adopt specific, well-motiv ated assumptions consistent with ke y prior works [5, 6, 20]. W e consider a simplified but core model: a single-layer, multi-head linear self-attention T ransformer , without positional encodings or MLP blocks. For an input sequence Z , the output is Z ′ = Z + LinearAttention ( Z ) , where the attention mechanism uses linear attention (or its gradient-descent dynamic equiv alent [ 6 ]). This model class is known to implement gradient-based optimization steps. W e parameterize task distributions by their weight v ectors β . The pretraining distrib ution P is assumed to be an isotropic Gaussian: β ∼ N (0 , σ 2 β I d ) . The nominal test distribution Q 0 is assumed to be a Gaussian: β ∼ N ( β ∗ , Σ 0 ) . A common special case is β ∗ = 0 , Σ 0 = σ 2 0 I d , sharing isotropy with P but potentially differing in v ariance. Under these Gaussian assumptions, the squared W asserstein-2 distance has a closed form. For tw o Gaussians N ( µ 1 , Σ 1 ) and N ( µ 2 , Σ 2 ) , it is gi ven by ∥ µ 1 − µ 2 ∥ 2 + T r (Σ 1 + Σ 2 − 2(Σ 1 / 2 1 Σ 2 Σ 1 / 2 1 ) 1 / 2 ) . If the covariances are isotropic ( Σ = σ 2 I d ), this simplifies to ∥ µ 1 − µ 2 ∥ 2 + d ( σ 1 − σ 2 ) 2 , clarifying the geometry of the adversarial ball B ρ ( Q 0 ) . These assumptions provide a tractable yet meaningful framew ork, capturing core ICL mechanisms and adversarial shifts, within which we deriv e precise, non-asymptotic results. 4 Theoretical Analysis: Robustness Guarantees f or Linear T ransf ormers This section presents our core theoretical results. W e begin with an equiv alence result for linear Transformers, introduce a key Lipschitz property , and derive an explicit upper bound for the worst-case meta-risk. W e then discuss its implications for model design. 4.1 Preliminaries and Lemmas Lemma 4.1 (Ridge Regression Equiv alence of Linear Transformers) . Consider a single-layer , multi-head linear self-attention model f θ , whose par ameters θ encode the k ey , query , and value pr ojection matrices. Suppose this model is pr etrained on linear r e gr ession tasks as defined in Section 3 and conver ges to a global optimum θ ∗ . Then, for any given context dataset D N = ( X , y ) (with X ∈ R N × d , y ∈ R N ) and test point x test ∈ R d , the model’s optimal prediction ˆ y test = f θ ∗ ( D N , x test ) is equivalent to a two-step pr ocess: 1. Compute a data-dependent empirical task estimate ˆ β N fr om the context. 2. P erform a linear pr ediction on x test based on ˆ β N . Furthermor e, when the pr etraining distrib ution is P = N (0 , σ 2 β I d ) and squar ed loss is used, ˆ β N is the optimal solution to the following ridge r e gr ession pr oblem: ˆ β N = arg min β ∥ y − X β ∥ 2 + λ N ∥ β ∥ 2 , (2) with the closed-form solution ˆ β N = ( X ⊤ X + λ N I d ) − 1 X ⊤ y . Her e , the re gularization coefficient λ N = σ 2 /σ 2 β is determined by the noise variance and prior variance. The model prediction is ˆ y test = x ⊤ test ˆ β N . Pr oof sketch. This lemma is a direct application of the core results from (author?) [6] and (author?) [20] to our problem setup. The model implicitly constructs the operator ( X ⊤ X + λ N I d ) − 1 X ⊤ through its attention mechanism. The in verse of λ N is proportional to σ 2 β , meaning that a model’ s effecti ve fitting capacity is in versely related to its reliance on the prior . Definition 4.2 (Lipschitz Continuity of the Predictor) . Fix a context dataset D N and a test point x test . W e view the prediction function defined by Lemma 4.1 as a mapping of the true task parameter β (which generates y ): G D N ,x test ( β ) := x ⊤ test ( X ⊤ X + λ N I d ) − 1 X ⊤ ( X β + ϵ ) , where ϵ is the observation noise. Lemma 4.3 (Gradient Bound and Lipschitz Constant) . The function G D N ,x test ( β ) is linear in β , with J acobian J = x ⊤ test ( X ⊤ X + λ N I d ) − 1 X ⊤ X . The spectral norm ( ℓ 2 -induced norm) of this gradient is bounded as: ∥ J ∥ 2 ≤ ∥ x test ∥ · ∥ ( X ⊤ X + λ N I d ) − 1 X ⊤ X ∥ 2 ≤ ∥ x test ∥ · σ max ( X ⊤ X ) σ min ( X ⊤ X ) + λ N , (1) 4 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 wher e σ max ( · ) and σ min ( · ) denote the lar gest and smallest singular values. Under the assumption that inputs x follow N (0 , I d ) , for sufficiently lar ge N , we have σ min ( X ⊤ X ) ≈ N − O ( √ N d ) and σ max ( X ⊤ X ) ≈ N + O ( √ N d ) with high pr obability . Consequently , the spectr al norm satisfies, with high pr obability: ∥ J ∥ 2 ≤ L N , wher e L N ≤ O  1 1 + λ N / N  . (3) This implies that the prediction function G is L N -Lipschitz continuous. In particular , L N appr oaches 1 as N incr eases and decr eases as λ N incr eases (i.e., as the model r elies mor e on the prior). Pr oof. The first inequality follows from norm properties. The second uses the eigen value representation ( X ⊤ X + λI ) − 1 X ⊤ X = I − λ ( X ⊤ X + λI ) − 1 and the concentration of singular values for random matrices in high dimensions. 4.2 Main Theorem: Upper Bound on W orst-case Meta-Risk W e now state and prov e the core theorem of this paper . It decomposes the worst-case risk into the nominal risk, a linear term due to perturbations in the distribution mean, and a quadratic term due to perturbations in the co variance. Theorem 4.4 (W orst-case Meta-Risk Upper Bound) . Consider the pr oblem setup defined in Section 3, with pretr aining distribution P = N (0 , σ 2 β I d ) , nominal test distrib ution Q 0 = N ( β ∗ , Σ 0 ) , and Σ 0 commuting with I d (e.g ., isotr opic). Let the model be the optimal linear T ransformer described in Lemma 4.1, with implicit r e gularization coefficient λ N = σ 2 /σ 2 β . Let θ ∗ be the optimal parameter s obtained fr om pr etraining . Then, for any W asserstein-2 radius ρ > 0 , the worst-case meta-risk R ρ ( θ ∗ ) satisfies the following upper bound (with high pr obability): R ρ ( θ ∗ ) ≤ L Q 0 ( θ ∗ ) | {z } nominal risk + C 1 ( θ ∗ ) · ρ · r d m | {z } mean shift + C 2 ( θ ∗ ) · ρ 2 √ N | {z } covariance shift + O  1 N  . (4) Her e: • m is the dimension of each attention head in the T ransformer , which is pr oportional to the model’ s ef fective capacity . • C 1 ( θ ∗ ) and C 2 ( θ ∗ ) ar e explicit constants that depend on the model parameters θ ∗ (implicitly containing λ N ), the noise level σ , and input dimension d , but ar e independent of ρ , m , and N . • L Q 0 ( θ ∗ ) is the standar d risk under the nominal distrib ution Q 0 , which itself satisfies L Q 0 ( θ ∗ ) = O ( σ 2 / N + ∥ β ∗ ∥ 2 · λ 2 N / N 2 ) . Pr oof. The core idea is to translate distributional differences, measured by the W asserstein distance, into differences in prediction error via the Lipschitz property of the predictor . Step 1: Risk Decomposition and Dual Representation. The worst-case risk is R ρ ( θ ∗ ) = sup Q ∈B ρ E β ∼ Q [ ℓ ( β )] , where ℓ ( β ) is the expected loss when the true task is β . Using the dual form of W asserstein DR O [ 11 ], there exists a constant η > 0 such that: R ρ ( θ ∗ ) ≤ E β ∼ Q 0 [ ℓ ( β )] + η ρ + ψ ( η ) , (2) where ψ ( η ) is an upper bound on a certain moment-generating function of ℓ ( β ) . Our goal is to control η and ψ ( η ) . Step 2: Lipschitzness and V ariance Control of the Loss. By Lemma 4.3, the predictor G ( β ) is Lipschitz. F or the squared loss ℓ ( β ) = ( G ( β ) − x ⊤ test β ) 2 , we can sho w that the gradient norm of ℓ ( β ) satisfies |∇ β ℓ ( β ) | ≤ ˜ L N · | x ⊤ test β − G ( β ) | + ∥ x test ∥ · | x ⊤ test β − G ( β ) | . Combining the bound L N from Lemma 4.3 with the norm concentration of Gaussian x test , we can deduce that ℓ ( β ) itself is a (random) Lipschitz function, whose Lipschitz constant K satisfies E [ K 2 ] 1 / 2 = O ( p d/m ) . The factor 1 / √ m arises from the attention mechanism: multi-head attention disperses gradient information across m subspaces, effecti vely reducing sensiti vity in any single direction. 5 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 Step 3: Optimizing the Dual V ariable and Combining Bounds. Substituting the random Lipschitz property from Step 2 into the dual formula and optimizing over η yields two dominant terms: 1. Linear term ( ∝ ρ ) : This comes from the expected norm of the loss gradient, proportional to E [ K ] , hence the p d/m factor . 2. Quadratic term ( ∝ ρ 2 ) : This stems from the v ariance (or sub-Gaussian parameter) of the loss, related to V ar ( K ) and the prediction v ariance. Analysis sho ws this term scales as 1 / √ N , because more context e xamples N reduce the v ariance of the estimate ˆ β N , thereby smoothing the loss fluctuations across different β . Step 4: Characterization of Nominal Risk L Q 0 ( θ ∗ ) . The nominal risk is the generalization error of standard ridge regression. Using standard results, its expectation (o ver X, ϵ ) is σ 2 · T r ( X ( X ⊤ X + λ N I ) − 2 X ⊤ ) + λ 2 N β ⊤ ∗ ( X ⊤ X + λ N I ) − 2 β ∗ . In high-dimensional asymptotics, this yields O ( σ 2 d/ N + ∥ β ∗ ∥ 2 λ 2 N / N 2 ) . Combining Steps 1-4 and handling higher-order terms yields the bound in Theorem 4.4. □ Figure 2: Schematic visualization of the worst-case meta-risk upper bound as a function of adversarial radius ρ and model capacity m . The surface illustrates ho w risk increases quadratically with ρ but becomes progressi vely flatter as m increases, demonstrating the mitigating ef fect of model capacity on adversarial vulnerability . 4.3 Corollaries: Robustness-Capacity-Sample Size T rade-offs The explicit bound in Theorem 4.4 leads to se veral important corollaries that quantify fundamental trade-of fs. Corollary 4.5 (Safe Radius and Model Capacity) . Given a tolerable additional risk incr ement ϵ > 0 , the maximum W asserstein adversarial radius the model can safely withstand, ρ max , satisfies: ρ max ( ϵ ; m ) ≥ C ϵ · √ m, (5) wher e the constant C ϵ depends on ϵ, d, N , σ , and the nominal risk, but not on m . Interpr etation. The model’ s effecti ve capacity (manifested as attention head dimension m ) expands its robustness budget with a square-root relationship. This provides a formal, distributionally robust explanation for the observed phenomenon that larger models often e xhibit greater robustness to distrib ution shifts. 6 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 Corollary 4.6 (Sample Complexity for Adversarial ICL) . Suppose we want the model’ s meta-risk under the worst-case distribution B ρ not to exceed the le vel it could achieve under the nominal distrib ution Q 0 with N 0 examples. Then, the number of in-context e xamples r equir ed in the adversarial setting, N ρ , must satisfy: N ρ ≳ N 0 + C ′ · ρ 2 , (6) wher e C ′ is a constant. Interpr etation. The adversarial environment imposes a sample complexity cost on ICL. For each unit increase in perturbation strength ρ , the extra samples needed to maintain the same performance grow roughly with ρ 2 , quantifying the learning burden imposed by adv ersarial uncertainty . Corollary 4.7 (Dual Role of Regularization Strength λ N ) . Recall λ N = σ 2 /σ 2 β encodes the model’ s r eliance on the data prior (lar ger λ N means mor e trust in the prior , smaller effective capacity). 1. Effect on nominal risk : Increasing λ N (str onger prior) g enerally helps r educe variance and may lower the nominal risk L Q 0 when ∥ β ∗ ∥ is small. 2. Effect on rob ustness terms : The constants C 1 and C 2 in Theor em 4.4 decrease as λ N incr eases. This means a model with a str onger prior (mor e ”conservative”) is less sensitive to distrib ution changes. Interpr etation. λ N governs a trade-off between standard generalization performance and distributional robustness : a model mor e specialized (low λ N ) to the pretr aining distribution may be mor e fragile when that distribution is adversarially perturbed, wher eas a mor e conservative (high λ N ) model may have slightly lower peak performance but a flatter performance decay curve. 4.4 Comparison with χ 2 -Diver gence Frameworks Our W asserstein-based analysis offers a distinct perspecti ve from χ 2 -div ergence framew orks like (author?) [10] . While χ 2 methods focus on optimal con ver gence rates within a density-ratio constrained ball, our approach yields an explicit, non-asymptotic risk bound that directly quantifies how model capacity governs robustness. W asserstein distance naturally captures geometric shifts in task space—mean displacements of order ρ and cov ariance perturbations of order ρ 2 —making it well-suited to adv ersarial feature-space transformations. By linking robustness to the Lipschitz properties of the predictor , our bound rev eals the fundamental trade-off between capacity , sample size, and admissible perturbation radius, providing architecturally-grounded guidance for designing rob ust in-context learners. The explicit capacity-robustness relationship we identify of fers a ne w theoretical lens for understanding why lar ger models often exhibit greater adv ersarial stability . 5 Experiments This section aims to validate our core theoretical findings through controlled experiments. Our goal is to construct a clean environment to test the key scaling laws predicted by the theory . Experiments have two main objectiv es: 1) Quantitativ e verification of Theorem 4.4 and Corollaries 4.5, 4.6 on synthetic data; and 2) A qualitativ e demonstration of the applicability of our theoretical insights on a conceptual real-world NLP task. Setup: W e follow the theoretical setup in Sections 3 and 4. Data dimension d = 20 . The pretraining distribution is P = N (0 , I d ) . W e train a single-layer linear T ransformer (with linear attention) via gradient descent on a large number of tasks until it conv erges to approximately optimal parameters θ ∗ . The nominal test distribution is Q 0 = N (0 , I d ) . The loss function is squared error . K ey Challenge: Computing R ρ ( θ ∗ ) = sup Q ∈B ρ ( Q 0 ) L Q ( θ ∗ ) exactly is dif ficult, as it requires solving a non-con vex maximization o ver the W asserstein ball. W e adopt an efficient approximation scheme: Pr ojected Gradient Ascent (PGA) [21] to find an approximate worst-case distrib ution Q adv . W e use Algorithm 1 to find Q adv for each set of experimental parameters ( ρ, m, N ) and ev aluate the model’ s risk L Q adv ( θ ∗ ) on it, using this as an approximation for R ρ ( θ ∗ ) . 5.1 Experiment 1: Risk Growth Cur ve with Adversarial Radius ρ . W e fix model capacity (attention head dimension m = 16 ) and context sample size N = 15 , v arying the adversarial radius ρ ∈ [0 , 2 . 0] . For each ρ , we run Algorithm 1 to obtain Q adv and compute the risk. Results are shown in Figure 3a. Theor etical Prediction: According to Theorem 4.4, the risk increment ∆ R ρ = R ρ − L Q 0 should satisfy 7 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 Algorithm 1 Approximate W orst-Case Distribution Search Require: Pretrained model f θ ∗ , nominal distribution Q 0 = N (0 , I ) , radius ρ , iterations T , step size η . 1: Initialize distribution parameters: ( µ, Σ) = (0 , I ) . 2: for t = 1 to T do 3: Sample a batch of tasks { β i } ∼ N ( µ, Σ) . 4: For each task, sample conte xt data D i N and a test point; compute av erage risk gradients ∇ µ L , ∇ Σ L . 5: Update distribution parameters: µ ← µ + η ∇ µ L ; Σ ← Σ + η ∇ Σ L . 6: Project onto the W asserstein ball: 7: Compute current W asserstein distance w = W 2 ( N ( µ, Σ) , Q 0 ) . 8: if w > ρ then 9: Scale ( µ, Σ − I ) proportionally to satisfy w = ρ . 10: end if 11: end for 12: return Approximate adv ersarial distribution Q adv = N ( µ, Σ) . ∆ R ρ ≈ a · ρ + b · ρ 2 . Experimental Results: The blue scatter points in Figure 3a sho w the measured v alues. The red curve is the least-squares fit of the form aρ + bρ 2 . The fit yields R 2 > 0 . 96 , with coefficients a and b both significantly positiv e. This supports our theoretical decomposition: risk grows with ρ via a superposition of linear and quadratic terms, consistent with Eq. (4). 5.2 Experiment 2: Safe Radius ρ max vs. Model Capacity m . W e fix a tolerable risk increment ϵ = 0 . 5 (relativ e to nominal risk). For different attention head dimensions m ∈ { 4 , 8 , 16 , 32 , 64 } , we find the maximum radius ρ max satisfying L Q adv − L Q 0 ≤ ϵ via binary search. Context size N = 15 . Results are in Figure 3b . Theoretical Pr ediction: Corollary 4.5 predicts ρ max ∝ √ m . Experimental Results: Figure 3b plots ρ max against √ m . The data points closely follo w a line through the origin (gray dashed line). Linear regression yields a significantly positiv e slope with R 2 > 0 . 98 . This empirically validates that model capacity is a fundamental resource for adversarial robustness: quadrupling the model capacity ( m ) approximately doubles the strength of adversarial perturbation it can withstand. 5.3 Experiment 3: Adversarial Sample Complexity . W e fix model capacity m = 16 and set a tar get risk le vel L target equal to the risk achie vable with N 0 = 5 examples under the nominal distribution. Then, for different adv ersarial radii ρ ∈ [0 , 1 . 5] , we find the minimum sample size N ρ such that the risk under the worst-case distribution Q adv does not exceed L target . Results are in Figure 3c. Theor etical Pr ediction: Corollary 4.6 predicts N ρ − N 0 ∝ ρ 2 . Experimental Results: Figure 3c plots the extra required samples ( N ρ − N 0 ) against ρ 2 . A clear linear trend supports Corollary 4.6. This indicates that providing adversarial guarantees for ICL necessitates additional context examples, with the required increase proportional to the square of the perturbation strength. 6 Discussion While our analysis is conducted within a stylized setting of linear T ransformers and Gaussian task distrib utions, the insights deriv ed offer meaningful implications for understanding and designing real-w orld large language models. Implications for Model Scaling. Our theoretical result—that the safe adversarial radius scales as √ m , where m is a measure of model capacity—provides a formal explanation for the empirical observation that larger models often appear more robust to distribution shifts. This suggests that the benefits of scaling extend beyond improved av erage performance to include enhanced distributional rob ustness. Howe ver , the square-root dependence also indicates diminishing returns: quadrupling the model capacity only doubles the admissible perturbation radius. This trade-off in vites a more nuanced vie w of scaling, where robustness considerations may complement traditional metrics such as perplexity or accurac y when ev aluating model architectures. On the Limits of Post-hoc Alignment. If a model’ s intrinsic robustness is fundamentally tied to its capacity , as our theory suggests, then purely post-hoc interventions such as fine-tuning or reinforcement learning from human feedback (RLHF) may face inherent limitations. Such methods can adjust a model’ s surface beha vior but cannot e xpand 8 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 (a) Risk growth with adversarial radius ρ . Measured v alues (blue points) closely follow a quadratic fit (red line). (b) Safe radius ρ max as a function of model capacity √ m . The linear trend confirms Corollary 4.5. (c) Extra in-context examples required under adv ersarial shift, plotted against ρ 2 . The linear relationship aligns with Corollary 4.6. Figure 3: Empirical validation of theoretical predictions. 9 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 its robustness budget beyond the capacity-determined radius. Our analysis therefore underscores the importance of incorporating robustness as a core objecti ve during pretraining—for instance, by div ersifying the task distribution P to implicitly enlarge the model’ s effecti ve capacity for handling shifts. Bridging Theory and Practice. Although real-world T ransformers emplo y softmax attention and deep, nonlinear architectures, the linear attention model we analyze captures the essential gradient-descent dynamics of in-context learning. The scaling laws we deriv e are therefore expected to hold qualitativ ely in more complex settings, as supported by our proof-of-concept text classification e xperiment. Future work could test these predictions on large-scale models (e.g., GPT or LLaMA families) under controlled distrib ution shifts, providing further empirical grounding for the theory . T owards Principled Robustness Ev aluation. Our framework offers a principled way to reason about robustness: giv en an estimate of the expected distrib ution shift ρ in a deployment en vironment, one can check whether a model’ s capacity m provides sufficient robustness via Theorem 4.4. If not, additional in-context examples can be allocated according to Corollary 4.6. This suggests a practical methodology for robustness-a ware model selection and deployment, moving be yond ad-hoc safety ev aluations. In summary , our work not only adv ances the theoretical understanding of in-context learning under distribution shift, but also pro vides actionable insights for building more rob ust and reliable language models. 7 Conclusion and Future W ork W e introduced a distrib utionally robust frame work for analyzing in-context learning under adversarial distrib ution shifts. Our main theoretical contrib ution establishes that a model’ s intrinsic capacity , captured by attention head dimension m , fundamentally bounds its robustness: the maximum tolerable perturbation radius scales as √ m . This provides a mathematical basis for the empirical observ ation that lar ger models often exhibit greater rob ustness to distribution shifts. Furthermore, we quantify the adv ersarial sample comple xity tax, showing that maintaining performance under shift requires additional in-context e xamples scaling as ρ 2 . Our analysis suggests that architectural capacity serves as a primary resource for rob ustness, implying that methods which do not e xpand this capacity—such as pure post-hoc alignment without architectural modifications—may face inherent limits in impro ving worst-case performance. Future safety efforts could therefore benefit from considering robustness as a core design objectiv e, alongside standard performance metrics, during both architecture selection and training. Future Dir ections. Several promising av enues emerge from this work. First, extending our analysis from linear to standard softmax attention and deep, multi-layer architectures would clarify ho w nonlinearities and depth affect the robustness-capacity relationship—we conjecture depth may amplify ef fectiv e capacity . Second, exploring alternativ e uncertainty sets beyond W asserstein (e.g., f -div ergences for discrete shifts, MMD for semantic perturbations) could tailor the framew ork to different threat models. Third, designing pretraining objectiv es or meta-learning algorithms that explicitly maximize the safe radius ρ max , perhaps through adversarial task augmentation or distrib utionally robust meta-training, represents a concrete path to ward more inherently rob ust models. Finally , empirical v alidation of the scaling laws ρ max ∝ √ m and N ρ − N 0 ∝ ρ 2 on large-scale autore gressiv e models under controlled distribution shifts remains an essential step tow ard practical impact. Acknowledgments W e thank the dev elopers of DeepSeek 1 for providing such a v aluable research assistance tool. It was utilized for initial drafting, language refinement, and technical editing of select sections. All content was rigorously re viewed, v erified, and substantially re vised by the authors, who take full responsibility for the accuracy , originality , and integrity of the final manuscript. 1 https://chat.deepseek.com/ 10 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 References [1] T om Brown, Benjamin Mann, Nick Ryder , Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry , Amanda Askell, et al. Language models are few-shot learners. Advances in neural information pr ocessing systems , 33:1877–1901, 2020. [2] Sew on Min, Xinxi L yu, Ari Holtzman, Mik el Artetxe, Mike Le wis, Hannaneh Hajishirzi, and Luke Zettlemoyer . Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint , 2022. [3] Sang Michael Xie, Aditi Raghunathan, Perc y Liang, and T engyu Ma. An explanation of in-conte xt learning as implicit bayesian inference. arXiv preprint , 2021. [4] T omoya W akayama and T aiji Suzuki. In-context learning is prov ably bayesian inference: A generalization theory for meta-learning. arXiv preprint , 2025. [5] Johannes V on Osw ald, Eyvind Niklasson, Ettore Randazzo, Jo ˜ ao Sacramento, Alexander Mordvintse v , Andrey Zhmoginov , and Max Vladymyrov . T ransformers learn in-context by gradient descent. In International Confer ence on Machine Learning , pages 35151–35174. PMLR, 2023. [6] Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. Advances in Neural Information Pr ocessing Systems , 36:45614–45650, 2023. [7] Sibo Y i, Y ule Liu, Zhen Sun, T ianshuo Cong, Xinlei He, Jiaxing Song, K e Xu, and Qi Li. Jailbreak attacks and defenses against large language models: A survey . arXiv preprint , 2024. [8] Alexander W ei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training f ail? Advances in Neural Information Pr ocessing Systems , 36:80079–80110, 2023. [9] Andy Zou, Zifan W ang, J Zico K olter , and Matt Fredrikson. Uni versal and transferable adversarial attacks on aligned language models. arXiv preprint , 2023. [10] T ianyi Ma, T engyao W ang, and Richard Samworth. Prov able test-time adaptivity and distrib utional robustness of in-context learning. arXiv pr eprint arXiv:2510.23254 , 2025. [11] Aman Sinha, Hongseok Namkoong, Riccardo V olpi, and John Duchi. Certifying some distributional robustness with principled adversarial training. arXiv pr eprint arXiv:1710.10571 , 2017. [12] Grani A Hanasusanto, Vladimir Roitch, Daniel Kuhn, and W olfram W iesemann. A distributionally robust perspectiv e on uncertainty quantification and chance constrained programming. Mathematical Pro gramming , 151(1):35–62, 2015. [13] Shiv am Gar g, Dimitris Tsipras, Percy S Liang, and Gregory V aliant. What can transformers learn in-conte xt? a case study of simple function classes. Advances in neural information pr ocessing systems , 35:30583–30598, 2022. [14] Y u Bai, Fan Chen, Huan W ang, Caiming Xiong, and Song Mei. T ransformers as statisticians: Provable in-conte xt learning with in-context algorithm selection. Advances in neur al information processing systems , 36:57125–57211, 2023. [15] Y ingcong Li, Muhammed Emrullah Ildiz, Dimitris P apailiopoulos, and Samet Oymak. T ransformers as algorithms: Generalization and stability in in-context learning. In International conference on machine learning , pages 19565–19594. PMLR, 2023. [16] Allan Rav ent ´ os, Mansheej Paul, Feng Chen, and Surya Ganguli. Pretraining task div ersity and the emergence of non-bayesian in-context learning for re gression. Advances in neural information pr ocessing systems , 36:14228– 14246, 2023. [17] Ali Siahkamari, Aditya Gangrade, Brian Kulis, and V enkatesh Saligrama. Piecewise linear regression via a difference of con v ex functions. In International confer ence on machine learning , pages 8895–8904. PMLR, 2020. [18] Dmitrii M Ostro vskii and Francis Bach. Finite-sample analysis of m-estimators using self-concordance. Electr onic Journal of Statistics , 15:326–391, 2021. [19] Ekin Aky ¨ urek, Dale Schuurmans, Jacob Andreas, T engyu Ma, and Denny Zhou. What learning algorithm is in-context learning? inv estigations with linear models. arXiv preprint , 2022. [20] Ruiqi Zhang, Spencer Frei, and Peter Bartlett. Trained transformers learn linear models in-context. arXiv preprint arXiv:2306.09927 , 2023. [21] Daniel Kuhn, Pe yman Mohajerin Esfahani, V iet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. W asserstein distributionally robust optimization: Theory and applications in machine learning. In Operations resear c h & management science in the a ge of analytics , pages 130–166. Informs, 2019. 11 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 A Proofs and T echnical Details A.1 Complete Proof of Theor em 4 W e restate Theorem 4 in simpler terms: Theorem 4 (Simplified): For a linear T ransformer trained on Gaussian tasks, the worst-case performance under adversarial distrib ution shift is bounded by three main terms: 1. The normal performance without attack ( L Q 0 ) 2. A penalty that grows linearly with attack strength ρ , scaled by p d/m 3. A penalty that grows quadr atically with ρ , scaled by 1 / √ N Formally: R ρ ( θ ∗ ) ≤ L Q 0 ( θ ∗ ) + C 1 · ρ · r d m + C 2 · ρ 2 √ N + small terms . (3) Proof intuition: The key idea is that distrib ution shifts cause prediction errors, and these errors can be bounded by how ”smooth” the predictor is (its Lipschitz constant). Larger models (bigger m ) are smoother , making them less sensiti ve to shifts. Pr oof. W e prove the bound in four conceptual steps. Step 1: T ransforming distributional uncertainty into a simpler pr oblem. Instead of directly maximizing ov er all distributions in the W asserstein ball, we use a powerful duality result from Distributionally Rob ust Optimization (DR O). This says there exists a parameter η > 0 such that: R ρ ( θ ∗ ) ≤ E β ∼ Q 0 [ ℓ ( β )] | {z } normal risk + η ρ + 1 η · ψ ( η ) | {z } variability term , (4) where ψ ( η ) measures how ”v ariable” the loss function ℓ ( β ) is. This transformation turns a complex maximization over distributions into a simpler optimization o ver η . Step 2: Understanding how the loss changes with task parameters. The loss ℓ ( β ) measures prediction error when the true task parameter is β . W e need to know: if β changes a little, ho w much does ℓ ( β ) change? This is exactly the Lipschitz constant of ℓ . Using Lemma 2, we find that the gradient of ℓ ( β ) satisfies: ∥∇ β ℓ ( β ) ∥ ≤ (some factor) × r d m . (5) The p d/m factor comes from multi-head attention: with m attention heads, the model’ s sensitivity to parameter changes is spread across m directions, reducing sensitivity in an y single direction by roughly 1 / √ m . Thus, ℓ ( β ) is K -Lipschitz with K ∝ p d/m . Step 3: Bounding the variability term ψ ( η ) . Because ℓ ( β ) is Lipschitz and β is Gaussian, ℓ ( β ) behav es like a sub-Gaussian random variable. For such v ariables, we hav e the simple bound: ψ ( η ) ≤ η 2 σ 2 2 , (6) where σ 2 captures the variability . In our case, σ 2 has two parts: one from model sensitivity ( ∝ d/m ) and one from estimation variance ( ∝ 1 / N ). So: σ 2 ≈ constant ×  d m + 1 N  . (7) Plugging this into (4) and choosing the best η giv es: R ρ ( θ ∗ ) ≤ E β ∼ Q 0 [ ℓ ( β )] + √ 2 σ ρ + ρ 2 2 √ N . (8) The σ ρ term becomes C 1 ρ p d/m , and the ρ 2 / √ N term appears from optimizing η . 12 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 Step 4: What is the normal risk? The normal risk L Q 0 ( θ ∗ ) is simply the ridge regression error, which standard analysis shows is O ( σ 2 d/ N ) . This is the baseline performance without attacks. Putting it all together: Combining the bounds from Steps 1-3 and including the normal risk giv es exactly Theorem 4. The constants C 1 , C 2 absorb all the technical factors like noise v ariance σ 2 and input dimension d . A.2 Why Lemma 1 (Ridge Regr ession Equivalence) Holds Lemma 1: Linear self-attention T ransformers, when optimally trained on linear regression tasks, perform exactly ridge regression on the in-conte xt examples. Pr oof. The ke y insight from (author?) [6] is that a single linear attention layer implements one step of gradient descent on a regularized least-squares objecti ve. Specifically , the model is trying to minimize: E ∥ y − X β ∥ 2 + λ ∥ β ∥ 2 , (9) where λ = σ 2 /σ 2 β balances fitting the data versus trusting the prior . The optimal solution to this problem is ridge regression: ˆ β = ( X ⊤ X + λI ) − 1 X ⊤ y . The T ransformer’ s attention mechanism cle verly encodes this ( X ⊤ X + λI ) − 1 X ⊤ operation through its ke y-query-value computations. When you feed in examples ( X, y ) followed by a test point x test , the attention weights compute exactly x ⊤ test ( X ⊤ X + λI ) − 1 X ⊤ y , which is the ridge regression prediction. In essence, the T ransformer has learned to ”implement” ridge regression in its forward pass, without needing to explicitly solv e the optimization problem. A.3 Understanding the Gradient Bound (Lemma 2) Lemma 2: The prediction function’ s sensitivity to task parameters is controlled by the singular values of X ⊤ X and the regularization λ N . Pr oof. The Jacobian J = ∂ ˆ y ∂ β tells us how much the prediction changes when β changes. For our ridge regression predictor: J = x ⊤ test ( X ⊤ X + λ N I ) − 1 X ⊤ X. (10) W e can bound its norm by looking at the eigen values. Let σ i be the singular values of X . Then: ∥ J ∥ ≤ ∥ x test ∥ · max i σ 2 i σ 2 i + λ N . (11) This ratio σ 2 i σ 2 i + λ N is always between 0 and 1. When λ N is large (strong regularization), the ratio is small, meaning predictions are insensitive to β changes. When λ N is small (weak re gularization), the ratio is near 1, meaning predictions are more sensitiv e. For random Gaussian X , the singular values concentrate around √ N . So: σ 2 max σ 2 min + λ N ≈ N N + λ N = 1 1 + λ N / N . (12) This shows that more in-context examples ( N larger) or stronger regularization ( λ N larger) both reduce sensitivity , making the model more robust. A.4 Practical Implications of the Singular V alue Concentration Lemma A.1 (Singular v alue concentration for random Gaussian matrices) . F or a random Gaussian design matrix X with N examples in d dimensions: • The smallest singular value is r oughly √ N − √ d • The lar gest singular value is r oughly √ N + √ d 13 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 • When N ≫ d , all singular values are close to √ N • When N ≈ d , singular values spread out, causing instability Pr oof. This is a standard result in random matrix theory . The intuition: each row of X is a random vector in R d . W ith N such vectors, the empirical co variance X ⊤ X/ N has eigen v alues concentrated around 1, with fluctuations of order p d/ N . Multiplying by √ N gives the singular v alue bounds. The takea way for ICL: you need N ≫ d to get stable predictions. Otherwise, the ( X ⊤ X + λI ) − 1 term can be unstable, making predictions sensitiv e to noise and adversarial perturbations. A.5 Beyond Gaussian Assumptions Question: What if task parameters aren’t Gaussian? Answer: The core ideas still work. Our proof mainly uses two properties: 1. Lipschitzness of the loss (depends on model architecture, not task distribution) 2. Sub-Gaussian concentration (many distributions be yond Gaussian satisfy this) If tasks come from a sub-Gaussian distribution (e.g., bounded, uniform, or any ”reasonable” distrib ution), the same bounds hold with slightly different constants. The p d/m and 1 / √ N scaling laws remain unchanged. This robustness to distrib utional assumptions is why our theory is widely applicable, not just to toy Gaussian settings. B Experimental Settings and Additional Results B.1 Complete Experimental Setup Details B.1.1 Synthetic Experiments For the synthetic experiments, we used an input dimension of d = 20 unless noted otherwise. The pretraining task distribution w as an isotropic Gaussian P = N (0 , I d ) , and the nominal test distribution was set to Q 0 = N (0 , I d ) . W e fixed the noise variance at σ 2 = 0 . 1 and the prior variance at σ 2 β = 1 . 0 , which giv es a regularization coefficient of λ N = σ 2 /σ 2 β = 0 . 1 . The attention head dimension m was v aried ov er { 4 , 8 , 16 , 32 , 64 } as needed, and the number of in-context e xamples N took values from { 5 , 10 , 15 , 20 , 25 } . The model was a single-layer , multi-head linear attention T ransformer without MLP blocks or positional encodings. W e used four attention heads; for example, when the total head dimension m = 16 , each head had dimension 4. T raining consisted of 10,000 tasks randomly dra wn from P , with a batch size of 32 tasks. W e used the Adam optimizer with a learning rate of 0.001 and trained for 5,000 steps, stopping when validation loss con ver ged. The loss was mean squared error (MSE). W e chose linear attention—that is, attention without the softmax nonlinearity—because it has been sho wn to be theoretically equi valent to performing gradient descent steps [ 6 ]. This equiv alence aligns cleanly with our theoretical framew ork. Using standard softmax attention would introduce additional nonlinearities that complicate the analysis, though we suspect the qualitativ e insights would remain similar . B.1.2 T ext Classification Experiment For the text classification experiment, we used a subset of the SST -2 (Stanford Sentiment T reebank) dataset. W e sampled 1,000 sentences—500 positiv e and 500 negativ e—from the original training set. Each sentence was tokenized with the BER T tokenizer , truncated or padded to a maximum length of 64 tokens. The data were split into training (700 sentences), validation (150), and test (150) sets. The model started from a frozen BER T -base-uncased encoder . W e extracted the [CLS] token representation (768- dimensional) and projected it to a fixed 50-dimensional space using a random, untrained linear layer . The classification head was a single-layer linear attention module whose dimension m we varied across e xperiments. Only this attention head was trained; the BER T parameters remained frozen. W e trained for up to 20 epochs with early stopping (patience of 5 epochs). 14 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 T o create an adversarial distribution shift, we perturbed the input embeddings. For a sentence’ s embedding matrix E ∈ R L × 768 , we first computed the av erage embedding ¯ e = 1 L P L i =1 E i . W e then sampled a random direction δ from N (0 , I 768 ) and normalized it to unit length. The perturbation was ∆ E = α · δ · 1 ⊤ L , where α controlled the magnitude. W e adjusted α iterativ ely until the W asserstein-2 distance between the clean mean embedding ¯ e and the perturbed version ¯ e + αδ equaled the desired radius ρ . This method allowed us to simulate a controlled, feature-space adversarial shift while staying within a well-defined W asserstein ball. B.2 Projected Gradient Ascent: Implementation and Conv ergence Algorithm 1 implements Projected Gradient Ascent (PGA) to find approximate worst-case distributions. Here we provide implementation details: T o compute the W asserstein distance between two Gaussians N ( µ 1 , Σ 1 ) and N ( µ 2 , Σ 2 ) , we use the closed-form expression for the squared W asserstein-2 distance: W 2 2 = ∥ µ 1 − µ 2 ∥ 2 + T r  Σ 1 + Σ 2 − 2(Σ 1 / 2 1 Σ 2 Σ 1 / 2 1 ) 1 / 2  . In our experiments, all co variances are isotropic ( Σ i = σ 2 i I d ), which simplifies the formula to W 2 2 = ∥ µ 1 − µ 2 ∥ 2 + d ( σ 1 − σ 2 ) 2 . This simplified form is efficient to e v aluate and suffices for our isotropic setting. When the current distribution parameters ( µ, Σ) fall outside the W asserstein ball of radius ρ , we project them back onto the ball. Since the nominal distribution is Q 0 = N (0 , I d ) , we compute the current distance as w = p ∥ µ ∥ 2 + d ( σ − 1) 2 , where σ is the standard deviation deri ved from Σ (i.e., Σ = σ 2 I d ). If w > ρ , we scale both the mean and the deviation from the identity variance: µ ← µ · ( ρ/w ) , σ ← 1 + ( σ − 1) · ( ρ/w ) . This projection yields the closest isotropic Gaussian inside the W asserstein ball under the simplified distance. W e ran PGA for T = 200 iterations with an initial step size η = 0 . 1 , decaying it by a factor of 0.95 ev ery 50 iterations. In practice, the algorithm con verged reliably: the estimated worst-case risk typically plateaued after about 150 iterations, and the W asserstein distance of the found distribution remained within 1% of the tar get radius ρ . B.3 Additional Experimental Results B.3.1 Effect of Regularization Strength λ N Corollary 4.7 predicts a trade-off: higher λ N (stronger prior) reduces sensitivity to distrib ution shifts but may increase nominal error . W e test this by varying λ N while fixing m = 16 , N = 15 , ρ = 0 . 8 . λ N Nominal Risk W orst-case Risk Increase (%) 0.05 ? ? ? 0.10 ? ? ? 0.20 ? ? ? 0.40 ? ? ? T able 1: Effect of regularization strength λ N on robustness. Larger λ N reduces the r elative increase from nominal to worst-case risk, confirming the rob ustness-precision trade-off. As predicted: higher λ N giv es better robustness (smaller percentage increase) b ut worse nominal performance. B.3.2 Robustness vs. Number of Heads (Fixed T otal Dimension) ? 15 A P R E P R I N T - F E B RU A RY 2 3 , 2 0 2 6 C Extended Discussion This section discusses sev eral broader aspects of our work. W e briefly connect our results to other theoretical perspectiv es, mention some practical implications, and note limitations that suggest directions for future work. Our analysis links to established generalization theory in sev eral ways. The Lipschitz-based mechanism—where robustness scales with p d/m —aligns with stability-based generalization bounds. The ridge regression equi valence can be viewed through a P A C-Bayesian lens, interpreting in-context learning as implicit Bayesian inference with a Gaussian prior . Our frame work e xtends this vie w to adversarial settings by considering w orst-case distributions within a W asserstein ball. The additional sample requirement under perturbation resembles notions of uniform stability , though our bound explicitly quantifies this cost in terms of the perturbation radius ρ . For practical model design, our results suggest a fe w guidelines. Architectural capacity , captured by attention dimension m , is a primary resource for robustness, scaling as √ m . This implies diminishing returns: doubling robustness requires quadrupling capacity . Distributing this capacity across multiple attention heads appears beneficial for stability , as suggested by our additional experiments. During training, using a div erse set of pretraining tasks can enlarge the effecti ve robustness region. The regularization parameter λ N presents a trade-off: a stronger prior (higher λ N ) reduces sensitivity to distrib ution shift but may lo wer peak performance. In deployment, one can estimate a plausible perturbation strength ρ from domain context, check if the model’ s capacity provides sufficient rob ustness via our bound, and, if not, provision additional in-conte xt examples according to the sample complexity tax. Se veral limitations point to fruitful future research. Our theoretical analysis focuses on linear self-attention T ransformers; extending it to standard softmax attention and deep, multi-layer architectures is an important next step. The Gaussian assumptions on task distributions provide tractability but may not hold in all scenarios; analysis for heavy-tailed or discrete distributions would be valuable. Our current framew ork centers on linear regression tasks; generalizing it to classification and other fe w-shot learning settings would broaden its applicability . Finally , while our synthetic experiments validate the core scaling laws, testing these relationships on large-scale language models and against real-world adversarial prompts remains an essential empirical challenge. In summary , this work provides a distrib utionally robust foundation for analyzing in-context learning. It formalizes the intuition that model capacity is intrinsically linked to rob ustness and offers a principled way to reason about performance under adversarial distrib ution shift. 16

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment