Stability and robustness are critical for deploying Transformers in safety-sensitive settings. A principled way to enforce such behavior is to constrain the model's Lipschitz constant. However, approximation-theoretic guarantees for architectures that explicitly preserve Lipschitz continuity have yet to be established. In this work, we bridge this gap by introducing a class of gradient-descent-type in-context Transformers that are Lipschitz-continuous by construction. We realize both MLP and attention blocks as explicit Euler steps of negative gradient flows, ensuring inherent stability without sacrificing expressivity. We prove a universal approximation theorem for this class within a Lipschitz-constrained function space. Crucially, our analysis adopts a measure-theoretic formalism, interpreting Transformers as operators on probability measures, to yield approximation guarantees independent of token count. These results provide a rigorous theoretical foundation for the design of robust, Lipschitz-continuous Transformer architectures.
Transformers are at the forefront of contemporary machine learning, but they are susceptible to adversarial examples (Gupta & Verma, 2023; Xu et al., 2023) and can be unstable to train (Liu et al., 2020; Davis et al., 2021; Qi et al., 2023). A mathematically sound remedy for these issues is to control the Lipschitz constant of the input-to-output map they realize.
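For concreteness, the quantity to be controlled is the global Lipschitz constant of the network map f, which (in a fixed pair of norms) reads
\[
\operatorname{Lip}(f) \;=\; \sup_{x \neq y} \frac{\lVert f(x) - f(y) \rVert}{\lVert x - y \rVert},
\]
and f is called L-Lipschitz whenever Lip(f) is at most L; a small Lipschitz constant bounds how strongly any input perturbation can be amplified at the output.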
Obtaining reliable Lipschitz control is structurally challenging in Transformers. Standard self-attention involves unbounded dot products of linear projections, rendering it non-globally Lipschitz continuous (Kim et al., 2021). These issues are further exacerbated by residual connections, which do not admit tight bounds on their Lipschitz constants (Xu et al., 2023; Li et al., 2023). Since computing a map's Lipschitz constant is NP-hard (Virmaux & Scaman, 2018), controlling the Lipschitz constant of Transformers is typically achieved by designing modified, explicitly constrained architectures. Existing methods rely on spectral regularization, modifications to the self-attention layer, and the weighting of residual layers.
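To make the first point concrete, recall the standard (single-head) dot-product attention map, written here in a generic parameterization,
\[
\operatorname{Att}(X) \;=\; \operatorname{softmax}\!\left(\frac{X W_Q W_K^\top X^\top}{\sqrt{d_k}}\right) X W_V ,
\]
whose pre-softmax logits are quadratic in the input X. Because these logits grow without bound over an unbounded domain, the map does not admit a uniform bound on its sensitivity, which is the mechanism behind the failure of global Lipschitz continuity identified by Kim et al. (2021).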
Among the various Lipschitz constraints, the 1-Lipschitz case is particularly natural (Sherry et al., 2024; Murari et al., 2025; Prach et al., 2023; Béthune et al., 2022; Hasannasab et al., 2019; Xu et al., 2023; Anil et al., 2019; Anonymous, 2026). First, it is sometimes a necessary modeling requirement: 1-Lipschitz function classes arise as the admissible critics for approximating the Wasserstein-1 distance (Arjovsky et al., 2017), and 1-Lipschitzness is the baseline condition needed to obtain contraction maps in fixed-point iterations and stability analyses. Second, even when one ultimately cares about an L-Lipschitz model, restricting the focus to 1-Lipschitz maps is largely without loss of generality, since any L-Lipschitz map can be written as a simple rescaling of a 1-Lipschitz map (Béthune et al., 2022). This makes 1-Lipschitzness a convenient normalization: it fixes a canonical scale for sensitivity while retaining the essential approximation questions within a standardized function class.
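The rescaling argument is elementary: if f is L-Lipschitz with L > 0, then
\[
f \;=\; L \cdot g, \qquad g \;:=\; \tfrac{1}{L} f, \qquad \operatorname{Lip}(g) \;=\; \tfrac{1}{L}\operatorname{Lip}(f) \;\le\; 1,
\]
so approximating L-Lipschitz targets with L-rescaled 1-Lipschitz models is equivalent to approximating 1-Lipschitz targets with the unscaled class.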
Transformers can be viewed as in-context maps in the sense that they take as input both a context and a distinguished query (or query token) and output a prediction that depends on both. Requiring the resulting map to be 1-Lipschitz with respect to the query for each fixed context ensures nonexpansiveness with respect to the query, which is useful, for instance, in learned fixed-point schemes that require convergence guarantees. This parallels the context-free setting; see (Hasannasab et al., 2019; Sherry et al., 2024). In the in-context case, however, varying the context changes the induced query-to-output transformation, effectively yielding a family of maps parameterized by the context. We therefore also require Lipschitz continuity with respect to the context for each fixed query, to ensure stability and robustness under small perturbations of the context. In this work, we model the context as a probability measure and control its effect via Lipschitz continuity with respect to the Wasserstein-1 distance.
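In symbols (the notation T, mu, and L_ctx is introduced here only for illustration and is not taken from the formal development), an in-context map T(mu, x) acting on a context measure mu and a query x is required to satisfy
\[
\lVert T(\mu, x) - T(\mu, y) \rVert \;\le\; \lVert x - y \rVert
\qquad\text{and}\qquad
\lVert T(\mu, x) - T(\nu, x) \rVert \;\le\; L_{\mathrm{ctx}}\, W_1(\mu, \nu)
\]
for all queries x, y and context measures mu, nu, where W_1 denotes the Wasserstein-1 distance and L_ctx is a finite constant.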
We introduce a Transformer architecture that is provably 1-Lipschitz with respect to the query for any fixed context, and admits a finite Lipschitz constant as a function of the context. The 1-Lipschitz bound is independent of the context length, and our layer is defined in the more general setting where the context is a measure. Moreover, we theoretically analyze the approximation capabilities of these models, proving universal approximation for the class of 1-Lipschitz in-context maps on compact domains. To the best of our knowledge, this is the first theoretical investigation of the approximation properties of 1-Lipschitz in-context maps.
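As a concrete illustration of how an explicit Euler step of a negative gradient flow yields a query-nonexpansive block, the following is a minimal NumPy sketch in the spirit of the gradient-step residual layers of Hasannasab et al. (2019) and Sherry et al. (2024); it omits the context and is not the attention construction analyzed in this paper, and it assumes a 1-Lipschitz, nondecreasing activation together with a spectrally normalized weight matrix.

```python
import numpy as np

def gradient_step_layer(x, W, b, tau=1.0):
    """One explicit Euler step x <- x - tau * grad Phi(x) of the negative
    gradient flow of the convex potential Phi(x) = sum_i rho((W x + b)_i),
    where rho is an antiderivative of ReLU. With ||W||_2 <= 1 (enforced
    below), grad Phi is 1-Lipschitz, so the step is nonexpansive for any
    tau in [0, 2]."""
    # Enforce the spectral-norm constraint ||W||_2 <= 1 by rescaling.
    W = W / max(np.linalg.norm(W, ord=2), 1.0)
    # grad Phi(x) = W^T ReLU(W x + b).
    grad = W.T @ np.maximum(W @ x + b, 0.0)
    return x - tau * grad

# Sanity check: nonexpansiveness in the query for a random layer.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(16, 8)), rng.normal(size=16)
x, y = rng.normal(size=8), rng.normal(size=8)
fx, fy = gradient_step_layer(x, W, b), gradient_step_layer(y, W, b)
assert np.linalg.norm(fx - fy) <= np.linalg.norm(x - y) + 1e-12
```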
Lipschitz-constrained neural networks. Lipschitz control has been pursued as a principled mechanism for stability, robustness, and regularization of neural networks. Such control is usually achieved by constraining the operator norms of linear layers during training (Miyato et al., 2018; Gouk et al., 2021; Bungert et al., 2021; Trockman & Kolter, 2021), or by encoding non-expansiveness structurally through architectural design (Cisse et al., 2017; Sherry et al., 2024; Meunier et al., 2021). Especially when new architectural choices are made, some study of the network's expressivity is necessary. A growing body of work explores the theoretical approximation properties of Lipschitz-constrained networks: (Anil et al., 2019; Neumayer et al., 2023) study 1-Lipschitz feedforward networks, and (Murari et al., 2025) studies 1-Lipschitz ResNets. Our work builds on results from Section 3 of (Murari et al., 2025), but significantly extends their theory to accommodate the context measure, which plays a crucial role in Transformers.
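For reference, the operator-norm constraints mentioned above are commonly enforced through a power-iteration estimate of the largest singular value, as in spectral normalization (Miyato et al., 2018). The sketch below is a minimal NumPy version; the function name and iteration count are illustrative rather than taken from any particular library.

```python
import numpy as np

def spectral_normalize(W, n_iters=20):
    """Rescale W so that its spectral norm is (approximately) at most 1,
    using power iteration to estimate the largest singular value."""
    v = np.random.default_rng(1).normal(size=W.shape[1])
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
    sigma = float(u @ (W @ v))  # Rayleigh-quotient estimate of the top singular value
    return W / max(sigma, 1.0)
```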
Lipschitz-constrained Transformers. Enforcing Lipschitz guarantees in Transformers is challenging because key components are not globally Lipschitz under standard parameterizations.