Oracle-Robust Online Alignment for Large Language Models
Authors: Zimeng Li, Mudit Gaur, Vaneet Aggarwal
Purdue University

Abstract

We study online alignment of large language models under misspecified preference feedback, where the observed preference oracle deviates from an ideal but unknown ground-truth oracle. The online LLM alignment problem is a bi-level reinforcement learning problem due to the coupling between data collection and policy updates. Recently, the problem has been reduced to a tractable single-level objective in the SAIL (Self-Improving Efficient Online Alignment) framework. In this paper, we introduce a pointwise oracle uncertainty set in this problem and formulate an oracle-robust online alignment objective as a worst-case optimization problem. For log-linear policies, we show that this robust objective admits an exact closed-form decomposition into the original loss function plus an explicit sensitivity penalty. We develop projected stochastic composite updates for the resulting weakly convex objective and prove $\tilde{O}(\varepsilon^{-2})$ oracle complexity for reaching approximate stationarity.

Keywords: LLM alignment · preference learning · oracle robustness · weakly convex optimization

1 Introduction

Large language models (LLMs) are increasingly deployed as interactive systems, where failures in instruction following or safety can have immediate impact. A common alignment pipeline is reinforcement learning from human feedback (RLHF), which updates a policy $\pi_\theta$ using pairwise preference feedback on sampled responses [1, 2, 3, 4, 5]. In practice, the feedback is produced by a preference oracle $P$, which may be a pool of human annotators or a learned reward/preference model [2].
Such oracles can deviate from an idealized true oracle $P^\star$ in structured ways, for example due to population heterogeneity across users or temporal drift in labeling standards, often mediated by style/value confounders (e.g., preferring verbosity over correctness), thereby inducing preference shift [6, 7]. In offline alignment with a fixed preference dataset, this manifests as a train/test preference mismatch between the dataset and the deployed user population [8]. In online or on-policy alignment, the policy $\pi_\theta$ controls the distribution of queried comparisons, so systematic oracle deviations can be amplified by the feedback loop and lead to over-optimization of oracle quirks rather than the intended preferences [9, 10, 11, 12]. Therefore, robustness should be modeled explicitly through an uncertainty set and a worst-case objective, rather than via simplistic i.i.d. label-noise assumptions.

Existing work on LLM alignment robustness largely falls into two categories. On the one hand, online RLHF methods explicitly account for the coupling between data collection and policy updates, using bilevel formulations that reduce to tractable single-level objectives [13, 14, 15]. However, these approaches typically assume the oracle matches the modeling assumptions and do not provide distributionally robust guarantees under oracle misspecification. On the other hand, recent offline methods robustify direct preference optimization (DPO) on a fixed dataset by solving a minimax problem over a distributional uncertainty set, improving performance under preference shift [16, 8, 17]. These robust DPO formulations are tailored to static datasets and do not address the on-policy setting where the comparison distribution changes with $\pi_\theta$, and where oracle perturbations can interact with the evolving sampling distribution.
This gap motivates a robustness model that is compatible with online preference collection and still yields a tractable objective and optimization theory. The authors of [13] proposed an approach, Self-Improving Efficient Online Alignment (SAIL), that reduces the bilevel reinforcement learning problem of LLM alignment [18, 19] to an efficient single-level first-order method using the reward-policy equivalence approach. In this paper, we consider this problem in the presence of oracle misspecification. We introduce a pointwise uncertainty set $U^W(P^\star, \rho)$ that bounds the deviation of the preference probability $P(1 \mid z)$ from $P^\star(1 \mid z)$ for every comparison $z = (x, y_1, y_2)$ that may be generated under the policy-induced sampling distribution $d_\theta$. We then define an oracle-robust objective $L^W_\rho(\theta)$ as the worst-case negative log-likelihood over $P \in U^W(P^\star, \rho)$, which guards against worst-case exploitation of structured oracle deviations in the on-policy feedback loop. Although robust bilevel formulations are generally difficult, under the log-linear preference model we show that $L^W_\rho(\theta)$ admits an exact closed-form decomposition into the nominal loss $L_{\mathrm{SAIL}}(\theta)$ plus an explicit robustness penalty $\lambda R(\theta)$, where $R(\theta) = \mathbb{E}_{z \sim d_\theta} |R(\theta; x, y_1, y_2)|$ is the expected absolute pairwise score and $\lambda = \rho\beta$. Since the constrained objective can be non-smooth, we measure first-order stationarity via the Moreau envelope $F_{\lambda_{\mathrm{env}}}$. By standard envelope/proximal properties, an $\varepsilon$-stationary point of $F_{\lambda_{\mathrm{env}}}$ implies that the associated proximal point is $\varepsilon$-nearly stationary for the original constrained problem (and the iterate lies within $O(\varepsilon \lambda_{\mathrm{env}})$ of the proximal point). Using this stationarity surrogate, we obtain an $\tilde{O}(\varepsilon^{-2})$ oracle complexity bound for reaching an $\varepsilon$-stationary point of the envelope.
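The pointwise bound $|P(1 \mid z) - P^\star(1 \mid z)| \le \rho$ has an optimal-transport reading formalized in Section 3: at a fixed comparison, the 1-Wasserstein distance between the two Bernoulli label distributions equals the gap between their success probabilities. A minimal numeric check of that identity, brute-forcing the one-parameter family of couplings (the probability values are illustrative, not from the paper):

```python
import numpy as np

def w1_bernoulli(p, q, grid=10_001):
    """Brute-force 1-Wasserstein distance between Ber(p) and Ber(q)
    on {0, 1} with ground cost c(y, y') = |y - y'|."""
    lo, hi = max(0.0, p + q - 1.0), min(p, q)   # feasible mass on outcome (1, 1)
    pi11 = np.linspace(lo, hi, grid)            # enumerate all couplings
    cost = (p - pi11) + (q - pi11)              # mass moved: cells (1,0) and (0,1)
    return cost.min()

# The transport distance collapses to |p - q|, matching the interval constraint.
for p, q in [(0.7, 0.55), (0.2, 0.9), (0.5, 0.5)]:
    assert abs(w1_bernoulli(p, q) - abs(p - q)) < 1e-3
```

The minimizing coupling puts as much mass as possible on the diagonal, leaving exactly $|p - q|$ to transport across the unit-cost cells, which is why the Wasserstein ball and the interval constraint coincide on binary support.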
Finally, because $\rho = 0$ recovers the nominal SAIL objective, our analysis also yields a convergence-to-stationarity guarantee for optimizing $L_{\mathrm{SAIL}}(\theta)$ as a special case. We summarize our main contributions as follows.

• We formulate oracle-robust online alignment by combining SAIL with a pointwise oracle uncertainty set $U^W(P^\star, \rho)$, leading to the robust objective $L^W_\rho(\theta)$ that optimizes against worst-case preference perturbations under policy-induced sampling.

• We give an exact closed-form characterization of $L^W_\rho(\theta)$ as $L_{\mathrm{SAIL}}(\theta) + \lambda R(\theta)$ and interpret $R(\theta)$ as an explicit sensitivity penalty given by an expected absolute pairwise score term.

• We show the resulting constrained robust objective is weakly convex and analyze projected stochastic composite updates using the Moreau envelope as a smooth stationarity surrogate. We prove an $\tilde{O}(\varepsilon^{-2})$ oracle complexity bound for reaching an $\varepsilon$-stationary point of the envelope. As a corollary, setting $\rho = 0$ yields a convergence-to-stationarity guarantee for the original SAIL objective.

2 Related Work

Bilevel reinforcement learning and RLHF. Bilevel optimization provides a principled abstraction for hierarchical learning problems such as hyperparameter optimization and meta-learning, where an upper-level objective depends on the solution of a lower-level training problem [20, 21, 22]. This perspective is increasingly relevant to alignment: RLHF couples (i) learning preferences/rewards and (ii) policy optimization (often implemented with PPO/TRPO-style updates), and the alignment objective is evaluated on data whose distribution is induced by the policy produced by the lower-level optimization [1, 2, 23, 24].
PARL (Policy Alignment in Reinforcement Learning) formalizes policy alignment in RL as a stochastic bilevel program that explicitly accounts for decision-dependent data collection at the upper level, and develops an algorithm with finite-sample guarantees [18]. Focusing on online LLM alignment, SAIL similarly argues that the alignment process is underpinned by bilevel optimization, and derives an efficient single-level first-order surrogate via reward-policy equivalence, enabling iterative on-policy data generation and self-improving alignment [13]. Complementing these algorithmic frameworks, recent theory studies the statistical and computational limits of bilevel RL in nonconvex settings; for example, [19] establish sample complexity bounds for bilevel reinforcement learning in parameterized settings. Related developments on general nonconvex bilevel optimization further analyze and mitigate the cost of hypergradient computation through penalty-based approaches [25, 26].

Robust alignment under distribution shift. A central challenge in alignment is robustness: preference data are typically collected from a narrow, static source distribution, while deployment-time preferences can vary across populations and drift over time, causing brittleness for offline objectives such as DPO [16, 27]. A recent line of work imports distributionally robust optimization (DRO) [28] to explicitly hedge against such preference shifts. In particular, [8] propose distributionally robust DPO with Wasserstein and KL uncertainty sets (WDPO/KLDPO), provide sample-complexity characterizations, and develop scalable gradient-based algorithms suitable for large-scale LLM fine-tuning.
Concurrently, robust variants of direct preference learning consider alternative uncertainty sets and regularizers; e.g., [17] distributionally robustify DPO and empirically study robustness to preference/data perturbations, and [7] optimize for worst-case group mixtures to handle heterogeneous preferences. These methods connect robust alignment to foundational DRO results and tractable reformulations [29, 30, 31], as well as classical robust RL/MDP formulations that optimize against worst-case transition models [32].

3 Problem Setup

Let $X$ be the prompt (context) space and $Y$ the response space. For each parameter vector $\theta \in \Theta \subseteq \mathbb{R}^d$, the language model induces a conditional distribution (policy) $\pi_\theta(\cdot \mid x) \in \Delta(Y)$ for each $x \in X$, where $\Delta(Y)$ denotes the probability simplex over $Y$. We view each $y \in Y$ as a finite token sequence and assume $\pi_\theta$ is generated autoregressively, i.e., $\pi_\theta(y \mid x) = \prod_{t=1}^{|y|} \pi_\theta(y_t \mid x, y_{<t})$. Prompts are drawn from a distribution $\mu$ on $X$, and comparison triples $z = (x, y_1, y_2) \in Z := X \times Y \times Y$ are generated by the policy-induced sampling distribution

$$d_\theta(z) := \mu(x)\, \pi_\theta(y_1 \mid x)\, \pi_\theta(y_2 \mid x). \tag{1}$$

For each $z = (x, y_1, y_2) \in Z$, define the pointwise uncertainty set

$$U^W_z(P^\star, \rho) := \big\{ P \in O : \big| P(1 \mid z) - P^\star(1 \mid z) \big| \le \rho \big\}, \tag{2}$$

where $O$ denotes the set of preference oracles, i.e., conditional distributions of the binary preference outcome given $z$. Although $U^W_z(P^\star, \rho)$ is specified as a pointwise (uniform) neighborhood in the scalar preference probability, it admits a Wasserstein interpretation at fixed $z$ [33]: viewing $P(\cdot \mid z)$ and $P^\star(\cdot \mid z)$ as Bernoulli distributions on $\{0, 1\}$ with ground cost $c(y, y') = |y - y'|$, the 1-Wasserstein distance satisfies

$$W_1\big(\mathrm{Ber}(P(1 \mid z)),\, \mathrm{Ber}(P^\star(1 \mid z))\big) = \big| P(1 \mid z) - P^\star(1 \mid z) \big|.$$

Hence, under binary support, the Wasserstein ball constraint $W_1(\cdot, \cdot) \le \rho$ is equivalent to the interval constraint in (2). The corresponding global uncertainty set requires the pointwise constraint to hold uniformly over all comparison triples:

$$U^W(P^\star, \rho) := \bigcap_{z \in Z} U^W_z(P^\star, \rho) = \Big\{ P \in O : \sup_{z \in Z} \big| P(1 \mid z) - P^\star(1 \mid z) \big| \le \rho \Big\}. \tag{3}$$

$U^W(P^\star, \rho)$ models an adversarial but uniformly bounded misspecification of the (conditional) preference probability across all prompts and response pairs that may be encountered under the policy-induced sampling in (1).

Fix a prompt (context) $x \in X$. The policy $\pi_\theta(\cdot \mid x)$ independently generates a pair of responses $y_1, y_2 \sim \pi_\theta(\cdot \mid x)$, which are then compared by a (possibly noisy) preference oracle. Although the latent reward function $r^\star(x, y)$ underlying preferences is not directly observed, we assume the oracle satisfies the Bradley-Terry (BT) model [34]: for all $x \in X$ and $y_1, y_2 \in Y$,

$$P^\star(y_1 \succ y_2 \mid x) = \sigma\big( r^\star(x, y_1) - r^\star(x, y_2) \big), \tag{4}$$

where $\sigma(u) := (1 + e^{-u})^{-1}$ denotes the logistic sigmoid. We impose a mild margin condition to ensure that all admissible oracles remain valid. We state this non-degeneracy requirement in the following assumption.

Assumption 1 (Nondegenerate true oracle and admissible radius). There exists a constant $\delta \in (0, 1/2]$ such that for all $(x, y_1, y_2) \in X \times Y \times Y$,

$$\delta \le P^\star(y_1 \succ y_2 \mid x) \le 1 - \delta. \tag{5}$$

We restrict the oracle uncertainty radius to $\rho \in (0, \delta)$, so that every preference oracle $P$ is nondegenerate, i.e., for all $(x, y_1, y_2)$ and all $P \in U^W(P^\star, \rho)$,

$$0 \le \delta - \rho \le P(y_1 \succ y_2 \mid x) \le 1 - (\delta - \rho) \le 1.$$

Remark 3.1. Assumption 1 rules out nearly deterministic preferences: for every comparison $(x, y_1, y_2)$, the oracle assigns nontrivial probability to either outcome. When the oracle obeys the Bradley-Terry form (4), this condition is equivalent to a uniform bound on the reward gap. Indeed, writing $\Delta r^\star := r^\star(x, y_1) - r^\star(x, y_2)$, the monotonicity of $\sigma$ gives

$$\log \frac{\delta}{1 - \delta} \le \Delta r^\star \le \log \frac{1 - \delta}{\delta}, \quad \text{and therefore} \quad \big| r^\star(x, y_1) - r^\star(x, y_2) \big| \le \log \frac{1 - \delta}{\delta}. \tag{6}$$

Since $y_1, y_2$ are drawn i.i.d.
from $\pi_\theta(\cdot \mid x)$, (6) can be read as a structural condition on the comparisons induced by the policy: the sampling procedure does not generate response pairs whose latent rewards differ so drastically that the preference label becomes essentially deterministic.

Background (SAIL). SAIL is a preference-based RLHF framework for online alignment that explicitly models the coupling between (i) learning from pairwise preference feedback and (ii) updating the policy that generates the responses being compared. In the online regime, the preference data distribution is policy-dependent, and SAIL represents this dependence via a bilevel program: a reward model $r$ is fit from Bradley-Terry comparisons on responses sampled from the KL-regularized optimal policy induced by $r$:

$$\text{(upper)} \quad \min_r \; -\mathbb{E}_{x \sim \mu,\; y_i \sim \pi(\cdot \mid x),\; (y_w \succ y_\ell) \sim P^\star} \Big[ \log \sigma\big( r(x, y_w) - r(x, y_\ell) \big) \Big]$$
$$\text{(lower)} \quad \text{s.t.} \quad \pi^\star_r \in \arg\max_\pi \; \mathbb{E}_{x \sim \mu,\; y \sim \pi(\cdot \mid x)} \Big[ r(x, y) - \beta\, D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \big) \Big]. \tag{7}$$

Direct differentiation through the inner solution mapping $r \mapsto \pi^\star_r$ requires hypergradient computations. SAIL circumvents this by exploiting the reward-policy equivalence for KL-regularized RL: any optimizer $\pi^\star_r$ satisfies

$$r(x, y) = \beta \log \frac{\pi^\star_r(y \mid x)}{\pi_{\mathrm{SFT}}(y \mid x)} + \beta \log Z_r(x), \tag{8}$$

for a normalization $Z_r(x)$ independent of $y$. Substituting (8) into the upper-level BT likelihood reduces the bilevel program to a tractable single-level policy objective; parametrizing $\pi_\theta$ yields

$$L_{\mathrm{SAIL}}(\theta) := -\mathbb{E}_{x \sim \mu,\; y_1, y_2 \overset{\mathrm{iid}}{\sim} \pi_\theta(\cdot \mid x),\; (y_w, y_\ell) \sim P^\star} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{SFT}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_\ell \mid x)}{\pi_{\mathrm{SFT}}(y_\ell \mid x)} \right) \right]. \tag{9}$$

The single-level SAIL objective $L_{\mathrm{SAIL}}(\theta)$ in Eq. (9) evaluates a policy $\pi_\theta$ under preference feedback generated by the true oracle $P^\star$.
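For intuition, Eq. (9) can be evaluated exactly on a toy instance, since the expectation is a finite sum when $X$ and $Y$ are small. The sketch below (features, rewards, and $\beta$ are made-up illustrative values, not from the paper) computes $L_{\mathrm{SAIL}}$ for a two-response log-linear policy with a Bradley-Terry oracle; at $\theta = \theta_{\mathrm{ref}}$ every log-ratio vanishes, so the loss is exactly $\log 2$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Toy instance (all values illustrative): one prompt, two responses a, b.
psi = np.array([[1.0, 0.0],        # feature vector psi(x, a)
                [0.0, 1.0]])       # feature vector psi(x, b)
theta     = np.array([0.8, -0.3])
theta_ref = np.array([0.0,  0.0])  # reference (SFT) parameters
beta   = 1.0
r_star = np.array([0.5, 0.0])      # latent rewards defining the BT oracle

def softmax_policy(th):
    logits = psi @ th
    e = np.exp(logits - logits.max())
    return e / e.sum()

def L_sail(th):
    """Exact finite-sum evaluation of the single-level objective (9)."""
    pi, pi_ref = softmax_policy(th), softmax_policy(theta_ref)
    h = lambda i, j: np.log(pi[i] / pi_ref[i]) - np.log(pi[j] / pi_ref[j])
    loss = 0.0
    for i in range(2):
        for j in range(2):
            w = pi[i] * pi[j]                        # prob of drawing (y_i, y_j)
            p_star = sigmoid(r_star[i] - r_star[j])  # BT prob that y_i wins
            loss += w * (-p_star       * np.log(sigmoid(beta * h(i, j)))
                         - (1 - p_star) * np.log(sigmoid(beta * h(j, i))))
    return loss

print(f"L_SAIL(theta) = {L_sail(theta):.4f}")
```

For log-linear policies the log-ratio score reduces to $h_\theta = (\theta - \theta_{\mathrm{ref}})^\top(\psi(x, y_1) - \psi(x, y_2))$, which is the identity the decomposition in Section 4 exploits.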
In practice, however, we do not have access to $P^\star$, and the observed preference labels may be produced by a perturbed oracle whose conditional preference probabilities deviate from those of $P^\star$. We model this misspecification by allowing the data-generating oracle $P$ to range over the global uncertainty set $U^W(P^\star, \rho)$ in Eq. (3), which enforces a uniform pointwise deviation bound across all prompts and response pairs that may be encountered under policy-induced sampling. Concretely, given $\theta$, we draw $x \sim \mu$ and $y_1, y_2 \overset{\mathrm{iid}}{\sim} \pi_\theta(\cdot \mid x)$, and the oracle $P$ induces an ordered pair $(y_w, y_\ell)$ corresponding to the preferred and less preferred response. We then define the oracle-robust alignment objective as the worst-case SAIL value over $U^W(P^\star, \rho)$:

$$L^W_\rho(\theta) := \sup_{P \in U^W(P^\star, \rho)} \; -\mathbb{E}_{x \sim \mu,\; y_1, y_2 \overset{\mathrm{iid}}{\sim} \pi_\theta(\cdot \mid x),\; (y_w, y_\ell) \sim P} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{SFT}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_\ell \mid x)}{\pi_{\mathrm{SFT}}(y_\ell \mid x)} \right) \right]. \tag{10}$$

4 Proposed Approach

In this section, we first relate our oracle-robust objective $L^W_\rho$ to the nominal SAIL objective $L_{\mathrm{SAIL}}$ by showing that $L^W_\rho$ admits an exact decomposition into the SAIL loss and an explicit regularization term. We then study the regularity of the two components separately: we impose a smoothness condition on $L_{\mathrm{SAIL}}$ and analyze the policy-dependent penalty $R(\theta)$. Under mild assumptions, these results together imply that $L^W_\rho$ is weakly convex.

4.1 Decomposition of $L^W_\rho(\theta)$

Our robust objective $L^W_\rho(\theta)$ is defined via a worst-case expectation over the global oracle uncertainty set $U^W(P^\star, \rho)$. The following assumption specifies the log-linear SAIL comparison model and enables an exact reduction of the inner worst-case problem to an explicit regularizer.

Assumption 2 (Log-linear policy class).
Let $\psi : X \times Y \to \mathbb{R}^d$ be a known $d$-dimensional feature map. We consider the class of log-linear (softmax) policies

$$\Pi := \left\{ \pi_\theta : \pi_\theta(y \mid x) = \frac{\exp\big( \theta^\top \psi(x, y) \big)}{\sum_{y' \in Y} \exp\big( \theta^\top \psi(x, y') \big)} \right\}.$$

For notational convenience, let $\theta_{\mathrm{ref}} \in \Theta$ denote the fixed parameter of the reference (SFT) policy, i.e.,

$$\pi_{\theta_{\mathrm{ref}}} = \pi_{\mathrm{SFT}}. \tag{11}$$

Theorem 4.1 (Decomposition of $L^W_\rho(\theta)$). Recall that $L_{\mathrm{SAIL}}(\theta)$ denotes the non-robust SAIL objective and $L^W_\rho(\theta)$ denotes the robust objective defined by the worst-case oracle in $U^W(P^\star, \rho)$. Define the pairwise score

$$R(\theta; x, y_1, y_2) := (\theta - \theta_{\mathrm{ref}})^\top \big( \psi(x, y_1) - \psi(x, y_2) \big),$$

and the robust penalty

$$R(\theta) := \mathbb{E}_{x \sim \mu}\, \mathbb{E}_{y_1, y_2 \sim \pi_\theta(\cdot \mid x)} \big| R(\theta; x, y_1, y_2) \big|. \tag{12}$$

Then under Assumptions 1 and 2, the robust objective admits the exact decomposition

$$L^W_\rho(\theta) = L_{\mathrm{SAIL}}(\theta) + \lambda R(\theta), \qquad \lambda := \rho\beta. \tag{13}$$

Proof sketch of Theorem 4.1. Fix $\theta$ and write $z = (x, y_1, y_2) \in Z$. Define the pairwise log-ratio score

$$h_\theta(x, y_1, y_2) := \log \frac{\pi_\theta(y_1 \mid x)}{\pi_{\mathrm{SFT}}(y_1 \mid x)} - \log \frac{\pi_\theta(y_2 \mid x)}{\pi_{\mathrm{SFT}}(y_2 \mid x)}$$

and the two per-sample losses

$$\ell^1_\theta(z) := -\log \sigma\big( \beta\, h_\theta(x, y_1, y_2) \big), \qquad \ell^0_\theta(z) := -\log \sigma\big( \beta\, h_\theta(x, y_2, y_1) \big).$$

Let $p^\star(z) := P^\star(1 \mid z)$ and $p(z) := P(1 \mid z)$. Then the SAIL objective admits the representation

$$L_{\mathrm{SAIL}}(\theta) = \mathbb{E}_{z \sim d_\theta} \Big[ p^\star(z)\, \ell^1_\theta(z) + \big(1 - p^\star(z)\big)\, \ell^0_\theta(z) \Big],$$

while the robust objective is

$$L^W_\rho(\theta) = \sup_{P \in U^W(P^\star, \rho)} \mathbb{E}_{z \sim d_\theta} \Big[ p(z)\, \ell^1_\theta(z) + \big(1 - p(z)\big)\, \ell^0_\theta(z) \Big].$$

Since $U^W(P^\star, \rho)$ enforces $|p(z) - p^\star(z)| \le \rho$ pointwise, the inner supremum is separable across $z$ and (by linearity in $p(z)$) is attained at an endpoint, yielding

$$\sup_{P \in U^W(P^\star, \rho)} \mathbb{E}_{y \sim P(\cdot \mid z)} \big[ y\, \ell^1_\theta(z) + (1 - y)\, \ell^0_\theta(z) \big] = \mathbb{E}_{y \sim P^\star(\cdot \mid z)} \big[ y\, \ell^1_\theta(z) + (1 - y)\, \ell^0_\theta(z) \big] + \rho\, \big| \ell^1_\theta(z) - \ell^0_\theta(z) \big|. \tag{14}$$

Taking $\mathbb{E}_{z \sim d_\theta}$ gives $L^W_\rho(\theta) = L_{\mathrm{SAIL}}(\theta) + \rho\, \mathbb{E}_{z \sim d_\theta} \big| \ell^1_\theta(z) - \ell^0_\theta(z) \big|$. Finally,

$$\ell^1_\theta(z) - \ell^0_\theta(z) = \log \frac{\sigma\big( -\beta\, h_\theta(x, y_1, y_2) \big)}{\sigma\big( \beta\, h_\theta(x, y_1, y_2) \big)} = -\beta\, h_\theta(x, y_1, y_2),$$

so the extra term equals $\rho\beta\, \mathbb{E}_{z \sim d_\theta} \big[ |h_\theta(x, y_1, y_2)| \big]$. Under the log-linear policy assumption (Assumption 2), $h_\theta(x, y_1, y_2) = R(\theta; x, y_1, y_2)$, hence $L^W_\rho(\theta) = L_{\mathrm{SAIL}}(\theta) + \lambda R(\theta)$ with $\lambda = \rho\beta$.

Remark 4.2 (Interpretation of the decomposition). Theorem 4.1 separates the robust objective into a nominal fitting term $L_{\mathrm{SAIL}}(\theta)$ and an explicit robustness penalty $\lambda R(\theta)$ that depends on the policy-induced sampling distribution. The penalty $R(\theta)$ measures the expected magnitude of the pairwise score $R(\theta; x, y_1, y_2)$ over i.i.d. response pairs $(y_1, y_2) \sim \pi_\theta(\cdot \mid x)$, and thus quantifies the sensitivity of the likelihood to adversarial perturbations of the pointwise preference probability. The uncertainty radius enters only through the linear prefactor $\lambda = \rho\beta$: increasing $\rho$ monotonically strengthens the penalty, while $\rho \downarrow 0$ recovers the nominal objective $L_{\mathrm{SAIL}}(\theta)$.

4.2 Regularity of $L_{\mathrm{SAIL}}$

Prior analyses of preference-based objectives often invoke stronger global conditions (e.g., PL-type geometries [19]). Here we treat $L_{\mathrm{SAIL}}(\theta)$ under the following smoothness assumption.

Assumption 3 (Smooth SAIL objective). The SAIL objective $L_{\mathrm{SAIL}}$ has $L_{\mathrm{SAIL}}$-Lipschitz gradient:

$$\| \nabla L_{\mathrm{SAIL}}(\theta) - \nabla L_{\mathrm{SAIL}}(\theta') \| \le L_{\mathrm{SAIL}} \| \theta - \theta' \| \quad \forall\, \theta, \theta' \in \mathbb{R}^d.$$

Remark 4.3. Assumption 3 is a standard smoothness condition in first-order optimization, and it is routinely imposed in analyses of gradient-based methods for both convex and nonconvex objectives, including stochastic settings [35, 36, 37]. We adopt it here as a mild regularity requirement on $L_{\mathrm{SAIL}}$.
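The decomposition in Theorem 4.1 is easy to sanity-check numerically: on a small discrete instance the inner supremum can be taken by brute force over the endpoint perturbations $p^\star(z) \pm \rho$, and the result should match $L_{\mathrm{SAIL}}(\theta) + \rho\beta\,\mathbb{E}|h_\theta|$ exactly. A self-contained check with made-up features and rewards (not from the paper):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Toy log-linear instance (illustrative values): one prompt, three responses.
psi = np.array([[1.0, 0.2], [-0.4, 1.0], [0.0, 0.0]])
theta, theta_ref = np.array([0.7, -0.5]), np.zeros(2)
beta, rho = 2.0, 0.05
r_star = np.array([0.3, 0.0, -0.2])        # latent rewards -> BT oracle

logits = psi @ theta
pi = np.exp(logits - logits.max())
pi /= pi.sum()

L_nominal, L_worst, penalty = 0.0, 0.0, 0.0
for i in range(3):
    for j in range(3):
        w = pi[i] * pi[j]                               # d_theta weight of (y_i, y_j)
        h = (theta - theta_ref) @ (psi[i] - psi[j])     # pairwise score R(theta; z)
        l1 = -np.log(sigmoid(beta * h))                 # loss if y_i is preferred
        l0 = -np.log(sigmoid(-beta * h))                # loss if y_j is preferred
        p_star = sigmoid(r_star[i] - r_star[j])
        L_nominal += w * (p_star * l1 + (1 - p_star) * l0)
        # the inner sup over |p - p*| <= rho is linear in p -> attained at an endpoint
        L_worst += w * max(p * l1 + (1 - p) * l0
                           for p in (p_star - rho, p_star + rho))
        penalty += w * abs(h)

# Exact decomposition: worst case = nominal SAIL loss + rho * beta * E|h|.
assert abs(L_worst - (L_nominal + rho * beta * penalty)) < 1e-10
```

Note that $\rho < \delta$ (Assumption 1) keeps both endpoints $p^\star \pm \rho$ inside $[0, 1]$, which the chosen reward gaps respect here.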
4.3 Weak convexity of $R(\theta)$

It remains to understand the regularity of the robustness penalty $R(\theta)$. Although $|R(\theta; x, y_1, y_2)|$ is a convex function of $\theta$ for fixed $(x, y_1, y_2)$, the expectation in $R(\theta)$ is taken under the policy-dependent distribution $(x, y_1, y_2) \sim d_\theta$ (Eq. (1)). Since $d_\theta$ itself depends on $\theta$, convexity of the pointwise quantity $|R(\theta; x, y_1, y_2)|$ does not carry over to $R(\theta)$. This coupling can destroy convexity, as evidenced by the following example.

Example 4.4. Consider the simplest setting with a single prompt $X = \{x\}$, two responses $Y = \{a, b\}$, and $d = 1$. Fix $\theta_{\mathrm{ref}} = 0$ and define features by $\psi(x, a) = 1$ and $\psi(x, b) = 0$, so that for $z = (x, y_1, y_2)$,

$$R(\theta; x, y_1, y_2) = \theta\, \big( \psi(x, y_1) - \psi(x, y_2) \big) \in \{0, \pm\theta\}.$$

Let the policy be $\pi_\theta(a \mid x) = \sigma(\theta)$ and $\pi_\theta(b \mid x) = \sigma(-\theta)$, and sample $y_1, y_2 \overset{\mathrm{iid}}{\sim} \pi_\theta(\cdot \mid x)$. Then $|R(\theta; x, y_1, y_2)| = |\theta|\, \mathbf{1}\{y_1 \ne y_2\}$, and hence

$$R(\theta) = |\theta|\, \Pr(y_1 \ne y_2) = 2|\theta|\, \pi_\theta(a \mid x)\, \pi_\theta(b \mid x) = 2|\theta|\, \sigma(\theta)\, \sigma(-\theta) = \frac{2|\theta|\, e^\theta}{(1 + e^\theta)^2}. \tag{15}$$

Convexity would imply $R(1) \le \frac{1}{2}\big( R(0) + R(2) \big) = \frac{1}{2} R(2)$, but using (15) we obtain

$$R(1) - \frac{1}{2} R(2) = \frac{2e\, (e - 1)(e^3 - 1)}{(1 + e)^2 (1 + e^2)^2} > 0,$$

so $R$ is not convex in general.

Given the nonconvexity exhibited above, we control the curvature of $R(\theta)$ through weak convexity: $f$ is $\kappa$-weakly convex if $f(\cdot) + \frac{\kappa}{2} \| \cdot \|^2$ is convex. First we introduce several assumptions required for the analysis.

Assumption 4 (Finite response set and bounded features). The response space $Y$ is finite, and the feature map satisfies $\| \psi(x, y) \| \le B_\psi < \infty$ for all $(x, y) \in X \times Y$.

Remark 4.5. The boundedness in Assumption 4 is standard in analyses of log-linear and softmax models [38, 39].
Without loss of generality, $B_\psi$ can be normalized to 1 by rescaling the feature map. However, we keep $B_\psi$ explicit to highlight its impact on the weak convexity constant.

Assumption 5 (Bounded feasible set). The feasible set $\Theta$ is nonempty, closed, convex, and bounded. In particular, with $\theta_{\mathrm{ref}}$ as in Eq. (11), define $D := \sup_{\theta \in \Theta} \| \theta - \theta_{\mathrm{ref}} \| < \infty$.

Remark 4.6. Assumption 5 also ensures that the Euclidean projection operator onto $\Theta$, $\Pi_\Theta(u) := \arg\min_{\theta \in \Theta} \| \theta - u \|^2$, is well-defined and nonexpansive [40].

Theorem 4.7 (Weak convexity of the robust penalty). Under Assumptions 2, 4, and 5, the robust penalty $R : \Theta \to \mathbb{R}$ defined in Eq. (12) is $\kappa_R$-weakly convex on $\Theta$, with

$$\kappa_R \le 16 B_\psi^2 + 4 D B_\psi^3.$$

Proof sketch of Theorem 4.7. Write $z = (x, y_1, y_2)$ and $\Pi_\theta(z) := \mu(x)\, \pi_\theta(y_1 \mid x)\, \pi_\theta(y_2 \mid x)$, so that $R(\theta) = \mathbb{E}_{z \sim \Pi_\theta} |s_\theta(z)|$ with $s_\theta(z) := (\theta - \theta_{\mathrm{ref}})^\top \big( \psi(x, y_1) - \psi(x, y_2) \big)$. First, we smooth the absolute value via $\varphi_\varepsilon(u) := \sqrt{u^2 + \varepsilon^2}$ and consider $R_\varepsilon(\theta) := \mathbb{E}_{z \sim \Pi_\theta} [\varphi_\varepsilon(s_\theta(z))]$. Since $Y$ is finite, $R_\varepsilon$ is a finite sum and derivatives can be exchanged with the expectation. A log-derivative (score-function) calculation expresses $\nabla^2 R_\varepsilon(\theta)$ as an expectation of terms involving (i) derivatives of $\varphi_\varepsilon \circ s_\theta$ and (ii) the score $S_\theta(z) := \nabla_\theta \log \Pi_\theta(z)$ and its Jacobian. Bounded features bound $\| S_\theta(z) \|$ and $\| \nabla_\theta S_\theta(z) \|_{\mathrm{op}}$, while the bounded parameter set bounds $\| \psi(x, y_1) - \psi(x, y_2) \|$ and $|s_\theta(z)|$ uniformly on $\Theta$. Combining these bounds yields a uniform lower bound $\nabla^2 R_\varepsilon(\theta) \succeq -\kappa_\varepsilon I$ with $\kappa_\varepsilon := 8 G B_\psi + 4 M D B_\psi + 2 M \varepsilon$, where $G$ and $M$ denote the resulting uniform bounds on the score and Jacobian terms. Hence $R_\varepsilon$ is $\kappa_\varepsilon$-weakly convex, and letting $\varepsilon \downarrow 0$ preserves weak convexity with $\kappa_R \le 16 B_\psi^2 + 4 D B_\psi^3$.

Theorem 4.8 (Weak convexity of the composite objective). Under Assumptions 1, 2, 3, 4, and 5, the robust objective $L^W_\rho(\theta) = L_{\mathrm{SAIL}}(\theta) + \lambda R(\theta)$ is $\kappa$-weakly convex on $\Theta$, with

$$\kappa := L_{\mathrm{SAIL}} + \lambda \kappa_R, \qquad \lambda = \rho\beta.$$
Proof sketch. By Assumption 3, $L_{\mathrm{SAIL}}$ has $L_{\mathrm{SAIL}}$-Lipschitz gradient and is therefore $L_{\mathrm{SAIL}}$-weakly convex. By Theorem 4.7, $R$ is $\kappa_R$-weakly convex on $\Theta$, so $\lambda R$ is $(\lambda \kappa_R)$-weakly convex. The sum of weakly convex functions is weakly convex with parameter given by the sum of the parameters, yielding $\kappa = L_{\mathrm{SAIL}} + \lambda \kappa_R$ for $L^W_\rho = L_{\mathrm{SAIL}} + \lambda R$.

4.4 Algorithm development

We optimize the oracle-robust alignment objective $L^W_\rho(\theta)$ over the bounded convex parameter set $\Theta$. To enforce feasibility explicitly, we consider the constrained objective

$$F(\theta) := L^W_\rho(\theta) + I_\Theta(\theta), \qquad I_\Theta(\theta) := \begin{cases} 0, & \theta \in \Theta, \\ +\infty, & \theta \notin \Theta. \end{cases} \tag{16}$$

Remark 4.9. By Theorem 4.8, $L^W_\rho$ is $\kappa$-weakly convex on $\Theta$. Since $I_\Theta$ is convex, the constrained objective $F$ in (16) is also $\kappa$-weakly convex with the same constant $\kappa$. Because $F$ may be nonsmooth and only weakly convex, we measure first-order stationarity using the Moreau envelope [41].

Definition 4.10 (Moreau envelope and proximal point). Fix $\lambda_{\mathrm{env}} \in (0, 1/\kappa)$. The Moreau envelope of $F$ with parameter $\lambda_{\mathrm{env}}$ is

$$F_{\lambda_{\mathrm{env}}}(\theta) := \min_{u \in \mathbb{R}^d} \Big\{ F(u) + \frac{1}{2\lambda_{\mathrm{env}}} \| u - \theta \|^2 \Big\}, \tag{17}$$

and the associated proximal point mapping is

$$\mathrm{prox}_{\lambda_{\mathrm{env}} F}(\theta) := \arg\min_{u \in \mathbb{R}^d} \Big\{ F(u) + \frac{1}{2\lambda_{\mathrm{env}}} \| u - \theta \|^2 \Big\}. \tag{18}$$

When convenient, we write $\hat{\theta} := \mathrm{prox}_{\lambda_{\mathrm{env}} F}(\theta)$.

Lemma 4.11 (Properties of the Moreau envelope [42]). Assume $F$ is $\kappa$-weakly convex and bounded below, and let $\lambda_{\mathrm{env}} \in (0, 1/\kappa)$. Then:

1. $F_{\lambda_{\mathrm{env}}}$ is finite and continuously differentiable on $\mathbb{R}^d$.
2. Its gradient is $\nabla F_{\lambda_{\mathrm{env}}}(\theta) = \frac{1}{\lambda_{\mathrm{env}}} \big( \theta - \hat{\theta} \big)$, and in particular,
$$\| \theta - \hat{\theta} \| = \lambda_{\mathrm{env}} \| \nabla F_{\lambda_{\mathrm{env}}}(\theta) \|. \tag{19}$$
3. The gradient $\nabla F_{\lambda_{\mathrm{env}}}$ is Lipschitz with constant
$$L_{\mathrm{env}} := \frac{1}{\lambda_{\mathrm{env}} (1 - \kappa \lambda_{\mathrm{env}})}. \tag{20}$$
4. (Approximate stationarity) For $\hat{\theta} = \mathrm{prox}_{\lambda_{\mathrm{env}} F}(\theta)$,
$$\mathrm{dist}\big( 0, \partial F(\hat{\theta}) \big) \le \| \nabla F_{\lambda_{\mathrm{env}}}(\theta) \|. \tag{21}$$

Lemma 4.11 motivates $\| \nabla F_{\lambda_{\mathrm{env}}}(\theta) \|$ as a smooth stationarity measure: controlling $\| \nabla F_{\lambda_{\mathrm{env}}}(\theta) \|$ ensures (i) $\theta$ is close to its proximal point $\hat{\theta}$ via (19) and (ii) $\hat{\theta}$ is nearly stationary for the original constrained problem via (21). The envelope parameter $\lambda_{\mathrm{env}}$ governs the smoothing-bias trade-off: smaller $\lambda_{\mathrm{env}}$ reduces smoothing bias but increases $L_{\mathrm{env}}$ in (20).

Assumption 6 (Stochastic gradient/subgradient oracles). Let $d_\theta$ denote the policy-induced sampling distribution on $Z = X \times Y \times Y$ from Eq. (1). At iteration $t$, Algorithm 1 samples a mini-batch $Z_t = \{ z_i \}_{i=1}^B$ with i.i.d. draws $z_i \sim d_{\theta_t}$. There exist measurable mappings $G_{\mathrm{SAIL}} : \Theta \times Z^B \to \mathbb{R}^d$ and $G_R : \Theta \times Z^B \to \mathbb{R}^d$ such that for all $\theta \in \Theta$,

$$\mathbb{E}\big[ G_{\mathrm{SAIL}}(\theta; Z_t) \mid \theta \big] = \nabla L_{\mathrm{SAIL}}(\theta), \qquad \mathbb{E}\big[ G_R(\theta; Z_t) \mid \theta \big] \in \partial R(\theta).$$

Moreover, there exist constants $\sigma^2_{\mathrm{SAIL}}, \sigma^2_R < \infty$ such that

$$\mathbb{E}\Big[ \big\| G_{\mathrm{SAIL}}(\theta; Z_t) - \nabla L_{\mathrm{SAIL}}(\theta) \big\|^2 \mid \theta \Big] \le \frac{\sigma^2_{\mathrm{SAIL}}}{B}, \qquad \mathbb{E}\Big[ \mathrm{dist}^2\big( G_R(\theta; Z_t), \partial R(\theta) \big) \mid \theta \Big] \le \frac{\sigma^2_R}{B}.$$

Finally, define the composite direction $G(\theta; Z_t)$ as in (22).

Algorithm 1 Robust Stochastic Composite Gradient Descent (R-SCGD)
1: Input: $\theta_0 \in \Theta$, stepsize $\eta > 0$, horizon $T$, batch size $B$, weight $\lambda = \rho\beta$.
2: for $t = 0, 1, \ldots, T-1$ do
3:   Sample a mini-batch $Z_t = \{ (x_i, y_{1,i}, y_{2,i}) \}_{i=1}^B$ with $x_i \sim \mu$ and $y_{1,i}, y_{2,i} \overset{\mathrm{iid}}{\sim} \pi_{\theta_t}(\cdot \mid x_i)$.
4:   Query preference labels for the sampled pairs to construct $(y_{w,i}, y_{\ell,i})$ (equivalently $y_i \in \{0, 1\}$).
5:   Compute stochastic oracles $G_{\mathrm{SAIL}}(\theta_t; Z_t)$ and $G_R(\theta_t; Z_t)$, and form $G(\theta_t; Z_t) = G_{\mathrm{SAIL}}(\theta_t; Z_t) + \lambda G_R(\theta_t; Z_t)$.
6:   Update $\theta_{t+1} \leftarrow \mathrm{proj}_\Theta\big( \theta_t - \eta\, G(\theta_t; Z_t) \big)$.
7: end for
8: Sample $R \sim \mathrm{Unif}\{0, 1, \ldots, T-1\}$.
9: Output: $\theta_R$.

Remark 4.12. Assumption 6 instantiates the standard stochastic first-order oracle model.
Such assumptions are ubiquitous in the analysis of stochastic approximation and stochastic (sub)gradient methods; see, e.g., [36, 38, 43].

We use a projected stochastic composite gradient method to minimize $F(\theta)$ in (16). At iterate $\theta_t$, we sample prompts and response pairs according to the policy-induced sampling in Eq. (1): draw $x \sim \mu$ and then $y_1, y_2 \overset{\mathrm{iid}}{\sim} \pi_{\theta_t}(\cdot \mid x)$, forming a triple $z = (x, y_1, y_2)$. For the SAIL term $L_{\mathrm{SAIL}}(\theta)$, we additionally query a preference label from the oracle, which determines the ordered pair $(y_w, y_\ell)$ (equivalently, a Bernoulli label $y \in \{0, 1\}$ indicating whether $y_1 \succ y_2$). Using a mini-batch $Z_t$ of $B$ i.i.d. triples, we form a stochastic gradient estimator for $\nabla L_{\mathrm{SAIL}}(\theta_t)$ and a stochastic (sub)gradient estimator for the robust penalty $R(\theta_t)$, and combine them with weight $\lambda = \rho\beta$:

$$G(\theta_t; Z_t) := G_{\mathrm{SAIL}}(\theta_t; Z_t) + \lambda\, G_R(\theta_t; Z_t). \tag{22}$$

The constrained formulation then yields the projected update

$$\theta_{t+1} := \mathrm{proj}_\Theta\big( \theta_t - \eta\, G(\theta_t; Z_t) \big), \tag{23}$$

where $\mathrm{proj}_\Theta(v) := \arg\min_{u \in \Theta} \| u - v \|$ and $\eta > 0$ is a stepsize. The projection ensures $\theta_t \in \Theta$ for all $t$, so the composite objective $F = L^W_\rho + I_\Theta$ is well-defined along the iterates.

5 Analysis

5.1 Convergence analysis

We analyze Algorithm 1 for the constrained objective $F$ in (16) using the Moreau-envelope stationarity measure from Definition 4.10. Throughout, we use the weak convexity constant $\kappa$ from Theorem 4.8 and assume the smoothness condition on $L_{\mathrm{SAIL}}$ stated in Assumption 3.

Assumption 7 (Lower bounded objective). The constrained objective $F(\theta) = L^W_\rho(\theta) + I_\Theta(\theta)$ is proper, lower semicontinuous, and bounded from below on $\mathbb{R}^d$; denote $F_{\inf} := \inf_{\theta \in \mathbb{R}^d} F(\theta) > -\infty$.

Remark 5.1. Lower boundedness is needed to telescope descent inequalities for the Moreau envelope and is natural for constrained likelihood-based objectives on a bounded parameter set.
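To make the update concrete, here is a deliberately simplified, deterministic sketch of the projected composite iteration (23) in one dimension. The stochastic oracles $G_{\mathrm{SAIL}}$ and $G_R$ are replaced by exact (sub)gradients, the smooth term is a quadratic stand-in for $L_{\mathrm{SAIL}}$, and the penalty uses the closed form of $R$ from Example 4.4, so the policy dependence of $d_\theta$ is already folded into the formula. All constants are illustrative:

```python
import numpy as np

sigma = lambda u: 1.0 / (1.0 + np.exp(-u))

# Illustrative stand-ins (not the paper's estimators):
# smooth part L(theta) = 0.5 * (theta - 1)^2, and the Example 4.4 penalty
# R(theta) = 2 |theta| sigma(theta) sigma(-theta).
L      = lambda th: 0.5 * (th - 1.0) ** 2
grad_L = lambda th: th - 1.0

def R(th):
    s = sigma(th)
    return 2.0 * abs(th) * s * (1.0 - s)

def subgrad_R(th):
    s = sigma(th)
    g = np.sign(th)  # subgradient of |.| (picks 0 at the kink)
    return 2.0 * g * s * (1 - s) + 2.0 * abs(th) * s * (1 - s) * (1 - 2 * s)

lam, eta, T = 0.5, 0.1, 400
lo, hi = -2.0, 2.0                     # feasible interval Theta = [-2, 2]
theta = 2.0
for _ in range(T):
    G = grad_L(theta) + lam * subgrad_R(theta)   # composite direction, cf. (22)
    theta = np.clip(theta - eta * G, lo, hi)     # projected update, cf. (23)

# The iterate should land near the global minimum of F = L + lam * R on Theta.
F = lambda th: L(th) + lam * R(th)
grid = np.linspace(lo, hi, 200_001)
assert F(theta) <= F(grid).min() + 1e-4
```

Swapping the exact directions for mini-batch estimators satisfying Assumption 6 recovers Algorithm 1; on an interval, the Euclidean projection is just the clip onto $[-2, 2]$.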
Main result. With the above assumptions in place, we are ready to present the main theoretical results of this work. We first establish the convergence guarantee for Algorithm 1, and then derive the corresponding oracle (sample) complexity bound.

Theorem 5.2 (Convergence rate for the Moreau envelope). Let Assumptions 1-7 hold and let $\kappa$ be the weak convexity constant from Theorem 4.8. Fix any $\lambda_{\mathrm{env}} \in (0, 1/\kappa)$ and run Algorithm 1 for $T$ iterations with stepsize $\eta > 0$. Let $R \sim \mathrm{Unif}\{0, 1, \ldots, T-1\}$ denote the output index and $\theta_R$ the corresponding iterate. Lemma B.4 implies that $\mathbb{E}\big[ \| G(\theta; Z) \|^2 \mid \theta \big] \le G^2_{\mathrm{tot}}$ for all $\theta \in \Theta$, where $Z$ denotes a generic mini-batch of size $B$ drawn i.i.d. from $d_\theta$ and $G^2_{\mathrm{tot}}$ is the explicit constant given in Lemma B.4. Then

$$\mathbb{E}\Big[ \big\| \nabla F_{\lambda_{\mathrm{env}}}(\theta_R) \big\|^2 \Big] \le \frac{F_{\lambda_{\mathrm{env}}}(\theta_0) - F_{\inf}}{\eta (1 - \kappa \lambda_{\mathrm{env}})\, T} + \frac{L_{\mathrm{env}}\, \eta}{2 (1 - \kappa \lambda_{\mathrm{env}})}\, G^2_{\mathrm{tot}}, \tag{24}$$

where $L_{\mathrm{env}}$ is defined in (20).

Corollary 5.3 (Sample complexity for envelope stationarity). Under the conditions of Theorem 5.2, fix $\lambda_{\mathrm{env}} \in (0, 1/\kappa)$ and set

$$\eta := \sqrt{ \frac{2 \lambda_{\mathrm{env}} (1 - \kappa \lambda_{\mathrm{env}}) \big( F_{\lambda_{\mathrm{env}}}(\theta_0) - F_{\inf} \big)}{G^2_{\mathrm{tot}}\, T} }.$$

Then Algorithm 1 guarantees

$$\mathbb{E}\Big[ \big\| \nabla F_{\lambda_{\mathrm{env}}}(\theta_R) \big\|^2 \Big] \le \sqrt{ \frac{2\, G^2_{\mathrm{tot}} \big( F_{\lambda_{\mathrm{env}}}(\theta_0) - F_{\inf} \big)}{\lambda_{\mathrm{env}} (1 - \kappa \lambda_{\mathrm{env}})^3\, T} }.$$

Consequently, to achieve $\mathbb{E}\big[ \| \nabla F_{\lambda_{\mathrm{env}}}(\theta_R) \|^2 \big] \le \varepsilon$, it suffices to take

$$T \ge \frac{2\, G^2_{\mathrm{tot}} \big( F_{\lambda_{\mathrm{env}}}(\theta_0) - F_{\inf} \big)}{\lambda_{\mathrm{env}} (1 - \kappa \lambda_{\mathrm{env}})^3\, \varepsilon^2} = \frac{8 \Big( C + \frac{\sigma^2_{\mathrm{SAIL}} + \lambda^2 \sigma^2_R}{B} \Big) \big( F_{\lambda_{\mathrm{env}}}(\theta_0) - F_{\inf} \big)}{\lambda_{\mathrm{env}} (1 - \kappa \lambda_{\mathrm{env}})^3\, \varepsilon^2},$$

where $C = G^2_{\nabla \mathrm{SAIL}} + \lambda^2 G^2_{\partial R}$ is the constant defined in Lemma B.4. If we set $B = \tilde{O}(1)$, then we obtain a sample complexity of $B T = \tilde{O}(\varepsilon^{-2})$.

Proof sketch of Theorem 5.2. Fix $\lambda_{\mathrm{env}} \in (0, 1/\kappa)$ and write $\xi_t := \nabla F_{\lambda_{\mathrm{env}}}(\theta_t)$. By Lemma 4.11, $F_{\lambda_{\mathrm{env}}}$ is $L_{\mathrm{env}}$-smooth with $L_{\mathrm{env}}$ given in (20).
One establishes a one-step descent inequality for the projected update (23) by combining: (i) smoothness of $F_{\lambda_{\mathrm{env}}}$ to upper bound $F_{\lambda_{\mathrm{env}}}(\theta_{t+1})$ in terms of $F_{\lambda_{\mathrm{env}}}(\theta_t)$, $\langle \xi_t, G(\theta_t; Z_t)\rangle$, and $\|G(\theta_t; Z_t)\|^2$; (ii) the fact that projection cannot increase the envelope value for $F = f + I_\Theta$; and (iii) a weak-convexity monotonicity inequality relating $\langle \xi_t, v_t\rangle$ to $\|\xi_t\|^2$, where $v_t := \mathbb{E}[G(\theta_t; Z_t) \mid \theta_t] \in \partial F(\theta_t)$ by Assumption 6. Taking conditional expectations and using $\mathbb{E}[\|G(\theta_t; Z_t)\|^2 \mid \theta_t] \le G^2_{\mathrm{tot}}$ yields
\[
\mathbb{E}\big[F_{\lambda_{\mathrm{env}}}(\theta_{t+1}) \mid \theta_t\big] \le F_{\lambda_{\mathrm{env}}}(\theta_t) - \eta(1 - \kappa\lambda_{\mathrm{env}})\|\xi_t\|^2 + \frac{L_{\mathrm{env}}\eta^2}{2}\, \mathbb{E}\big[\|G(\theta_t; Z_t)\|^2 \mid \theta_t\big]. \tag{25}
\]
Summing over $t = 0, \ldots, T-1$ gives a telescoping bound on $\sum_{t=0}^{T-1} \mathbb{E}[\|\xi_t\|^2]$ in terms of $F_{\lambda_{\mathrm{env}}}(\theta_0) - F_{\inf}$ (Assumption 7) and $T\eta^2 G^2_{\mathrm{tot}}$. Finally, selecting $R$ uniformly from $\{0, \ldots, T-1\}$ converts the average to $\mathbb{E}[\|\nabla F_{\lambda_{\mathrm{env}}}(\theta_R)\|^2]$, yielding (24).

6 On the practical role of the proximal point

Our analysis measures stationarity of the constrained robust objective $F(\theta) := L^W_\rho(\theta) + I_\Theta(\theta)$ via the Moreau envelope $F_{\lambda_{\mathrm{env}}}$ with parameter $\lambda_{\mathrm{env}} \in (0, 1/\kappa)$ (Definition 4.10), where $\kappa$ is the weak convexity constant from Theorem 4.8. Algorithm 1 outputs a uniformly random iterate $\theta_R$. The theory (Theorem 5.2) certifies envelope stationarity at $\theta_R$ by controlling $\|\nabla F_{\lambda_{\mathrm{env}}}(\theta_R)\|_2$, rather than directly bounding $\mathrm{dist}(0, \partial F(\theta_R))$. The interpretation is instead through the associated proximal point
\[
\hat{\theta}_R := \mathrm{prox}_{\lambda_{\mathrm{env}} F}(\theta_R) \in \arg\min_{u \in \mathbb{R}^d} \Big\{ F(u) + \frac{1}{2\lambda_{\mathrm{env}}}\|u - \theta_R\|_2^2 \Big\},
\]
which is unique since $F$ is $\kappa$-weakly convex and $\lambda_{\mathrm{env}} < 1/\kappa$. Lemma 4.11 provides the key link between envelope stationarity and proximity:
\[
\|\theta_R - \hat{\theta}_R\|_2 = \lambda_{\mathrm{env}} \|\nabla F_{\lambda_{\mathrm{env}}}(\theta_R)\|_2, \quad \text{and moreover} \quad \mathrm{dist}\big(0, \partial F(\hat{\theta}_R)\big) \le \|\nabla F_{\lambda_{\mathrm{env}}}(\theta_R)\|_2.
\]
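These identities can be checked concretely on a one-dimensional toy problem. The sketch below uses $F(\theta) = |\theta|$ (convex, so $\kappa = 0$ and any $\lambda_{\mathrm{env}} > 0$ is admissible), where the proximal point is soft-thresholding in closed form; this is a minimal illustration of Lemma 4.11, not the paper's alignment objective.

```python
def prox_abs(theta, lam):
    """Proximal point of F = |.|: the unique minimizer of
    |u| + (u - theta)^2 / (2 * lam), i.e. soft-thresholding."""
    if theta > lam:
        return theta - lam
    if theta < -lam:
        return theta + lam
    return 0.0

def envelope_grad(theta, lam):
    """Gradient of the Moreau envelope via Lemma 4.11:
    grad F_lam(theta) = (theta - prox_{lam F}(theta)) / lam."""
    return (theta - prox_abs(theta, lam)) / lam
```

For $\theta_R = 2$ and $\lambda_{\mathrm{env}} = 0.5$: $\hat\theta_R = 1.5$, so $\|\theta_R - \hat\theta_R\| = 0.5 = \lambda_{\mathrm{env}} \cdot \|\nabla F_{\lambda_{\mathrm{env}}}(\theta_R)\|$, and the subgradient of $|\cdot|$ at $\hat\theta_R = 1.5$ is $\{1\}$, so $\mathrm{dist}(0, \partial F(\hat\theta_R)) = 1 = \|\nabla F_{\lambda_{\mathrm{env}}}(\theta_R)\|$, with both parts of the key link tight.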
Thus, small $\|\nabla F_{\lambda_{\mathrm{env}}}(\theta_R)\|_2$ simultaneously implies (i) $\theta_R$ lies in a small neighborhood of $\hat{\theta}_R$, and (ii) $\hat{\theta}_R$ is nearly first-order stationary for the original constrained objective. A practical issue is that computing $\hat{\theta}_R$ exactly still requires solving the (generally nontrivial) strongly convex proximal subproblem above. Even when $\|\nabla F_{\lambda_{\mathrm{env}}}(\theta_R)\|_2$ is small (hence $\hat{\theta}_R$ is nearby), proximal operators frequently do not admit closed forms and are evaluated via iterative inner solvers, which can be expensive in realistic models [44]. Lemma D.1 shows that while $\mathrm{dist}(0, \partial F(\theta)) \ge (1 - \kappa\lambda_{\mathrm{env}})\|\nabla F_{\lambda_{\mathrm{env}}}(\theta)\|$ always holds, under $\kappa$-weak convexity alone there is no universal constant $C$ such that $\mathrm{dist}(0, \partial F(\theta)) \le C\|\nabla F_{\lambda_{\mathrm{env}}}(\theta)\|$ for all $\theta$ (even when $\kappa = 0$). This motivates phrasing guarantees in terms of the proximal point $\hat{\theta}_R$ rather than the raw iterate $\theta_R$. In practice, one can compute an inexact proximal refinement $\bar{\theta}_R \approx \hat{\theta}_R$ via a warm-started inner loop on
\[
\Psi_R(\theta) := F(\theta) + \frac{1}{2\lambda_{\mathrm{env}}}\|\theta - \theta_R\|_2^2,
\]
where proximal subproblems are solved only up to a prescribed accuracy (e.g., [45]). Appendix D.1 formalizes this viewpoint for our setting and provides a residual-based stopping rule. Lemma D.3 yields the practical takeaway: if $\bar{\theta}_R$ satisfies the proximal residual condition $\mathrm{dist}(0, \partial \Psi_R(\bar{\theta}_R)) \le \varepsilon_{\mathrm{prox}}$, then
\[
\mathrm{dist}\big(0, \partial F(\bar{\theta}_R)\big) \le \|\nabla F_{\lambda_{\mathrm{env}}}(\theta_R)\|_2 + \varepsilon_{\mathrm{prox}} + \frac{1}{\lambda_{\mathrm{env}}}\|\bar{\theta}_R - \hat{\theta}_R\|_2.
\]
Consequently, controlling $\|\nabla F_{\lambda_{\mathrm{env}}}(\theta_R)\|_2$ together with the inexactness terms $\varepsilon_{\mathrm{prox}}$ and $\|\bar{\theta}_R - \hat{\theta}_R\|_2$ yields an explicit near-stationarity certificate for the original objective $F$.

7 Conclusion

We studied oracle-robust online alignment of language models, where preference feedback is collected on-policy but the true preference oracle can deviate from the assumed model in a structured, worst-case manner.
We introduced a pointwise oracle uncertainty set $U_W(P^\star, \rho)$ and defined the robust objective $L^W_\rho(\theta)$ as the worst-case negative log-likelihood over $P \in U_W(P^\star, \rho)$. Our main novelty is an exact closed-form decomposition of this otherwise difficult minimax objective: $L^W_\rho(\theta) = L_{\mathrm{SAIL}}(\theta) + \lambda R(\theta)$ with $\lambda = \rho\beta$. We cast the constrained problem as minimizing $F(\theta) = L^W_\rho(\theta) + I_\Theta(\theta)$, analyze it as a weakly convex composite objective, and measure stationarity using the Moreau envelope $F_{\lambda_{\mathrm{env}}}$. We establish a $\tilde{O}(\varepsilon^{-2})$ stochastic-oracle (sample) complexity for reaching an $\varepsilon$-stationary point of $F_{\lambda_{\mathrm{env}}}$, which in turn implies proximity to a nearly stationary point of the original constrained robust objective; as a special case, setting $\rho = 0$ recovers a convergence-to-stationarity guarantee for optimizing $L_{\mathrm{SAIL}}(\theta)$.

References

[1] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.

[2] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

[3] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

[4] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
[5] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

[6] Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning. Advances in Neural Information Processing Systems, 37:52516–52544, 2024.

[7] Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, and Ilija Bogunovic. Group robust preference optimization in reward-free RLHF. Advances in Neural Information Processing Systems, 37:37100–37137, 2024.

[8] Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, and Deepak Ramachandran. Robust LLM alignment via distributionally robust direct preference optimization. arXiv preprint, 2025.

[9] Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022.

[10] Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, and Joar Skalse. Goodhart's law in reinforcement learning. arXiv preprint, 2023.

[11] Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, et al. Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv preprint, 2023.

[12] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
[13] Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Bedi, and Furong Huang. SAIL: Self-improving efficient online alignment of large language models. arXiv preprint arXiv:2406.15567, 2024.

[14] Chenjia Bai, Yang Zhang, Shuang Qiu, Qiaosheng Zhang, Kang Xu, and Xuelong Li. Online preference alignment for language models via count-based exploration. arXiv preprint, 2025.

[15] Shenao Zhang, Donghan Yu, Hiteshi Sharma, Han Zhong, Zhihan Liu, Ziyi Yang, Shuohang Wang, Hany Hassan, and Zhaoran Wang. Self-exploring language models: Active preference elicitation for online alignment. arXiv preprint arXiv:2405.19332, 2024.

[16] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

[17] Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. Towards robust alignment of language models: Distributionally robustifying direct preference optimization. arXiv preprint arXiv:2407.07880, 2024.

[18] Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Dinesh Manocha, Huazheng Wang, Mengdi Wang, and Furong Huang. PARL: A unified framework for policy alignment in reinforcement learning from human feedback. arXiv preprint arXiv:2308.02585, 2023.

[19] Mudit Gaur, Utsav Singh, Amrit Singh Bedi, Raghu Pasupathy, and Vaneet Aggarwal. On the sample complexity bounds in bilevel reinforcement learning. arXiv preprint arXiv:2503.17644, 2025.

[20] Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pages 318–326. PMLR, 2012.

[21] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning.
In International Conference on Machine Learning, pages 2113–2122. PMLR, 2015.

[22] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1568–1577. PMLR, 2018.

[23] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[24] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.

[25] Han Shen and Tianyi Chen. On penalty-based bilevel gradient descent method. In International Conference on Machine Learning, pages 30992–31015. PMLR, 2023.

[26] Han Shen, Zhuoran Yang, and Tianyi Chen. Principled penalty-based methods for bilevel reinforcement learning and RLHF. Journal of Machine Learning Research, 26(114):1–49, 2025.

[27] Seongho Son, William Bankes, Sayak Ray Chowdhury, Brooks Paige, and Ilija Bogunovic. Right now, wrong then: Non-stationary direct preference optimization under preference drift. arXiv preprint arXiv:2407.18676, 2024.

[28] Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659, 2019.

[29] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, 2018.

[30] Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS, 2019.

[31] John C Duchi and Hongseok Namkoong.
Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.

[32] Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

[33] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.

[34] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

[35] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

[36] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[37] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[38] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.

[39] Matthew S Zhang, Murat A Erdogdu, and Animesh Garg. Convergence and optimality of policy gradient methods in weakly smooth settings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9066–9073, 2022.

[40] Heinz H Bauschke and Patrick L Combettes. Correction to: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. In Convex Analysis and Monotone Operator Theory in Hilbert Spaces, pages C1–C4. Springer, 2020.

[41] Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.

[42] Damek Davis and Dmitriy Drusvyatskiy.
Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions. arXiv preprint, 2018.

[43] Lesi Chen, Jing Xu, and Jingzhao Zhang. On finding small hyper-gradients in bilevel optimization: Hardness results and improved analysis. In The Thirty Seventh Annual Conference on Learning Theory, pages 947–980. PMLR, 2024.

[44] Saverio Salzo and Silvia Villa. Inexact and accelerated proximal point algorithms. Journal of Convex Analysis, 19:1167–1192, 2012.

[45] Mark Schmidt, Nicolas Roux, and Francis Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems, 24, 2011.

[46] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

[47] R Tyrrell Rockafellar and Roger JB Wets. Variational Analysis. Springer, 1998.

A Proof of Theorem 4.1: Decomposition of $L^W_\rho(\theta)$

A.1 A pointwise maximization identity

Lemma A.1 (Pointwise worst-case Bernoulli perturbation). Fix any $p^\star \in [0, 1]$, any $\rho \ge 0$ such that $[p^\star - \rho, p^\star + \rho] \subseteq [0, 1]$, and any real numbers $a, b \in \mathbb{R}$. Then
\[
\sup_{p \in [p^\star - \rho,\, p^\star + \rho]} \big\{ p\, a + (1 - p)\, b \big\} = p^\star a + (1 - p^\star) b + \rho\, |a - b|. \tag{26}
\]
Moreover, an optimizer is $p^\star + \rho$ if $a \ge b$ and $p^\star - \rho$ if $a < b$.

Proof. The map $p \mapsto p\, a + (1 - p)\, b = b + p(a - b)$ is affine in $p$, hence it attains its maximum over the interval $[p^\star - \rho, p^\star + \rho]$ at an endpoint. If $a - b \ge 0$ the maximizer is $p^\star + \rho$, giving value $b + (p^\star + \rho)(a - b) = p^\star a + (1 - p^\star) b + \rho(a - b)$. If $a - b < 0$ the maximizer is $p^\star - \rho$, giving $b + (p^\star - \rho)(a - b) = p^\star a + (1 - p^\star) b + \rho(b - a)$. Combining the two cases yields (26).

A.2 Proof of Theorem 4.1

Proof of Theorem 4.1. Recall: (i) the policy-dependent sampling law $d_\theta$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$ (Eq.
(1)), (ii) the pointwise oracle uncertainty set $U_W(P^\star, \rho)$ (Eq. (3)), (iii) the robust objective $L^W_\rho(\theta)$ (Eq. (10)), and (iv) the SAIL objective $L_{\mathrm{SAIL}}(\theta)$.

Step 1: Rewrite the robust objective pointwise in $p(z)$. For $z = (x, y_1, y_2)$, let $p(z) := P(1 \mid z) = P(y_1 \succ y_2 \mid x)$ and $p^\star(z) := P^\star(1 \mid z)$. By Eq. (3), $P \in U_W(P^\star, \rho)$ iff for all $z$,
\[
|p(z) - p^\star(z)| \le \rho. \tag{27}
\]
Assumption 1 ensures $p^\star(z) \in [\delta, 1 - \delta]$ and $\rho \in (0, \delta)$, so $[p^\star(z) - \rho, p^\star(z) + \rho] \subseteq [0, 1]$ for all $z$. Define the pairwise logit as
\[
h_\theta(z) \equiv h_\theta(x, y_1, y_2) := \log \frac{\pi_\theta(y_1 \mid x)}{\pi_{\mathrm{SFT}}(y_1 \mid x)} - \log \frac{\pi_\theta(y_2 \mid x)}{\pi_{\mathrm{SFT}}(y_2 \mid x)}.
\]
For notational convenience, set the two label-conditional losses
\[
\ell^1_\theta(z) := -\log \sigma\big(\beta h_\theta(z)\big), \qquad \ell^0_\theta(z) := -\log \sigma\big(\beta h_\theta(x, y_2, y_1)\big) = -\log \sigma\big(-\beta h_\theta(z)\big), \tag{28}
\]
where we used $h_\theta(x, y_2, y_1) = -h_\theta(x, y_1, y_2)$. Then, for any oracle $P$ with Bernoulli parameter $p(z)$, the conditional risk equals
\[
\mathbb{E}_{y \sim P(\cdot \mid z)}\big[\ell_\theta(z, y)\big] = p(z)\, \ell^1_\theta(z) + (1 - p(z))\, \ell^0_\theta(z) =: \ell\big(p(z); h_\theta(z)\big).
\]
Therefore $L^W_\rho(\theta)$ can be written as
\[
L^W_\rho(\theta) = \sup_{P \in U_W(P^\star, \rho)} \mathbb{E}_{z \sim d_\theta}\Big[ p(z)\, \ell^1_\theta(z) + (1 - p(z))\, \ell^0_\theta(z) \Big]. \tag{29}
\]
Because the constraint (27) is pointwise in $z$ and the objective is an integral (expectation) of a pointwise affine function of $p(z)$, the worst-case oracle can be chosen pointwise in $z$. Equivalently, the supremum in (29) equals the expectation of the pointwise supremum:
\[
L^W_\rho(\theta) = \mathbb{E}_{z \sim d_\theta}\Bigg[ \sup_{p \in [p^\star(z) - \rho,\, p^\star(z) + \rho]} \big\{ p\, \ell^1_\theta(z) + (1 - p)\, \ell^0_\theta(z) \big\} \Bigg]. \tag{30}
\]
By Lemma A.1, an optimizer is always an endpoint, hence a measurable selector can be obtained by taking $p(z) = p^\star(z) + \rho$ when $\ell^1_\theta(z) \ge \ell^0_\theta(z)$ and $p(z) = p^\star(z) - \rho$ otherwise. Apply Lemma A.1 to (30) with $a = \ell^1_\theta(z)$, $b = \ell^0_\theta(z)$, and $p^\star = p^\star(z)$.
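The endpoint identity (26) applied here is easy to sanity-check numerically. A minimal sketch comparing the closed form against a brute-force grid search over the interval (the grid resolution is an arbitrary choice of ours):

```python
import numpy as np

def worst_case_closed_form(p_star, rho, a, b):
    """Right-hand side of Eq. (26): p* a + (1 - p*) b + rho |a - b|."""
    return p_star * a + (1.0 - p_star) * b + rho * abs(a - b)

def worst_case_grid(p_star, rho, a, b, n=100001):
    """sup over p in [p* - rho, p* + rho] of p a + (1 - p) b, by brute force."""
    ps = np.linspace(p_star - rho, p_star + rho, n)
    return float(np.max(ps * a + (1.0 - ps) * b))
```

With $p^\star = 0.6$, $\rho = 0.1$, $a = 2$, $b = -1$ (so $a \ge b$ and the maximizer is the upper endpoint $p^\star + \rho = 0.7$), both evaluate to $1.1$.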
This gives
\[
L^W_\rho(\theta) = \mathbb{E}_{z \sim d_\theta}\Big[ p^\star(z)\, \ell^1_\theta(z) + (1 - p^\star(z))\, \ell^0_\theta(z) \Big] + \rho\, \mathbb{E}_{z \sim d_\theta}\Big[ \big|\ell^1_\theta(z) - \ell^0_\theta(z)\big| \Big]. \tag{31}
\]
The first expectation in (31) is exactly $L_{\mathrm{SAIL}}(\theta)$. For the second term, use the identity $\log \sigma(t) - \log \sigma(-t) = t$ (since $\sigma(t)/\sigma(-t) = e^t$). With $t = \beta h_\theta(z)$ and definitions (28), we obtain
\[
\ell^1_\theta(z) - \ell^0_\theta(z) = -\log \sigma(\beta h_\theta(z)) + \log \sigma(-\beta h_\theta(z)) = -\beta h_\theta(z).
\]
Hence $|\ell^1_\theta(z) - \ell^0_\theta(z)| = \beta |h_\theta(z)|$, and (31) becomes
\[
L^W_\rho(\theta) = L_{\mathrm{SAIL}}(\theta) + \rho\beta\, \mathbb{E}_{z \sim d_\theta}\big[ |h_\theta(z)| \big]. \tag{32}
\]
Define $\lambda := \rho\beta$. Under Assumption 2, for fixed $x$ we have the log-linear / softmax form
\[
\pi_\theta(y \mid x) = \frac{\exp(\theta^\top \psi(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(\theta^\top \psi(x, y'))}.
\]
By Eq. (11), $\pi_{\mathrm{SFT}} = \pi_{\theta_{\mathrm{ref}}}$ for some $\theta_{\mathrm{ref}}$. Then
\[
\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{SFT}}(y \mid x)} = (\theta - \theta_{\mathrm{ref}})^\top \psi(x, y) - \log \sum_{y'} e^{\theta^\top \psi(x, y')} + \log \sum_{y'} e^{\theta_{\mathrm{ref}}^\top \psi(x, y')}.
\]
Taking the difference between $y_1$ and $y_2$ cancels the log-partition terms, yielding
\[
h_\theta(x, y_1, y_2) = (\theta - \theta_{\mathrm{ref}})^\top \big( \psi(x, y_1) - \psi(x, y_2) \big). \tag{33}
\]
Then (33) shows $h_\theta(z) = R(\theta; x, y_1, y_2)$, where $R(\theta; x, y_1, y_2)$ is defined in Theorem 4.1, and thus the robust correction term in (32) is precisely $\lambda R(\theta)$ with $R(\theta) := \mathbb{E}_{z \sim d_\theta}\big[ |R(\theta; x, y_1, y_2)| \big]$. Substituting into (32) gives the claimed decomposition
\[
L^W_\rho(\theta) = L_{\mathrm{SAIL}}(\theta) + \lambda R(\theta), \qquad \lambda = \rho\beta.
\]
This completes the proof.

B Weak convexity: Proofs of Theorems 4.7 and 4.8

B.1 Auxiliary lemmas

Lemma B.1 (Policy-regularity constants for the log-linear policy class). Assume Assumption 2 (log-linear / softmax policy class) and Assumption 4 (finite response set and bounded features). Define the policy score $g_\theta(x, y) := \nabla_\theta \log \pi_\theta(y \mid x)$.
Then for all $\theta, \theta' \in \mathbb{R}^d$ and all $(x, y) \in \mathcal{X} \times \mathcal{Y}$,
\[
\|g_\theta(x, y)\|_2 \le G, \qquad \|g_\theta(x, y) - g_{\theta'}(x, y)\|_2 \le M \|\theta - \theta'\|_2,
\]
with explicit constants $G := 2B_\psi$ and $M := B_\psi^2$. Moreover, for $z = (x, y_1, y_2)$ and $d_\theta(z) = \mu(x)\, \pi_\theta(y_1 \mid x)\, \pi_\theta(y_2 \mid x)$, define the score [46]
\[
S_\theta(z) := \nabla_\theta \log d_\theta(z) = g_\theta(x, y_1) + g_\theta(x, y_2).
\]
Then
\[
\|S_\theta(z)\|_2 \le 2G = 4B_\psi, \qquad \|\nabla_\theta S_\theta(z)\|_{\mathrm{op}} \le 2M = 2B_\psi^2.
\]
Proof. Under Assumption 2, for fixed $x$ the policy is
\[
\pi_\theta(y \mid x) = \frac{\exp(\theta^\top \psi(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(\theta^\top \psi(x, y'))}.
\]
Hence
\[
\log \pi_\theta(y \mid x) = \theta^\top \psi(x, y) - \log \sum_{y' \in \mathcal{Y}} e^{\theta^\top \psi(x, y')},
\]
so differentiating gives
\[
g_\theta(x, y) = \psi(x, y) - \sum_{y' \in \mathcal{Y}} \pi_\theta(y' \mid x)\, \psi(x, y') = \psi(x, y) - \mu_\theta(x), \qquad \mu_\theta(x) := \mathbb{E}_{y' \sim \pi_\theta(\cdot \mid x)}[\psi(x, y')].
\]
By Assumption 4, $\|\psi(x, y)\|_2 \le B_\psi$, and Jensen's inequality implies $\|\mu_\theta(x)\|_2 \le B_\psi$, hence $\|g_\theta(x, y)\|_2 \le 2B_\psi$. This gives $G := 2B_\psi$. Next, $\nabla_\theta g_\theta(x, y) = -\nabla_\theta \mu_\theta(x)$. A standard calculation yields that the Jacobian of $\mu_\theta(x)$ is the covariance matrix
\[
\nabla_\theta \mu_\theta(x) = \mathrm{Cov}_{y' \sim \pi_\theta(\cdot \mid x)}\big( \psi(x, y') \big).
\]
For any unit vector $v \in \mathbb{R}^d$,
\[
v^\top \nabla_\theta \mu_\theta(x)\, v = \mathrm{Var}\big( v^\top \psi(x, y') \big) \le \mathbb{E}\big[ (v^\top \psi(x, y'))^2 \big] \le \mathbb{E}\big[ \|\psi(x, y')\|_2^2 \big] \le B_\psi^2.
\]
Taking the supremum over $\|v\|_2 = 1$ yields $\|\nabla_\theta \mu_\theta(x)\|_{\mathrm{op}} \le B_\psi^2$, so $\|\nabla_\theta g_\theta(x, y)\|_{\mathrm{op}} \le B_\psi^2$ as well. By the mean value theorem, $\|g_\theta(x, y) - g_{\theta'}(x, y)\|_2 \le B_\psi^2 \|\theta - \theta'\|_2$, so we may take $M := B_\psi^2$. Finally, since $S_\theta(z) = g_\theta(x, y_1) + g_\theta(x, y_2)$, the bounds $\|S_\theta(z)\|_2 \le 2G$ and $\|\nabla_\theta S_\theta(z)\|_{\mathrm{op}} \le 2M$ follow by the triangle inequality.

Lemma B.2 (Bounded Fisher information). Define the Fisher information matrix
\[
\mathcal{F}(\theta) := \mathbb{E}_{x \sim \mu,\, y \sim \pi_\theta(\cdot \mid x)}\big[ g_\theta(x, y)\, g_\theta(x, y)^\top \big].
\]
Then $\|\mathcal{F}(\theta)\|_{\mathrm{op}} \le G^2$ for all $\theta \in \Theta$.

Proof.
For any unit vector $u \in \mathbb{R}^d$,
\[
u^\top \mathcal{F}(\theta)\, u = \mathbb{E}\big[ \langle u, g_\theta(x, y) \rangle^2 \big] \le \mathbb{E}\big[ \|g_\theta(x, y)\|_2^2 \big] \le G^2.
\]
Taking the supremum over $\|u\|_2 = 1$ yields $\|\mathcal{F}(\theta)\|_{\mathrm{op}} \le G^2$.

Remark B.3. In the weak convexity analysis of $R(\theta)$, we differentiate an expectation taken under the $\theta$-dependent law $d_\theta$. The resulting Hessian formula contains curvature terms involving the score and its Jacobian, notably $r_{\varepsilon,\theta}(z)\, \nabla_\theta S_\theta(z)$, where $S_\theta(z) = \nabla_\theta \log d_\theta(z)$ and $r_{\varepsilon,\theta}(z)$ is a smoothed absolute value. To obtain a uniform lower bound $\nabla^2 R_\varepsilon(\theta) \succeq -\kappa_\varepsilon I$ (hence weak convexity), we must control the worst-case quadratic form $v^\top (\nabla_\theta S_\theta(z)) v$ uniformly over $\|v\|_2 = 1$. This is exactly an operator-norm control. Lemma B.2 exemplifies this type of bound for second-moment (information) matrices, and the proof of Lemma B.1 uses the same operator-norm reasoning to bound $\|\nabla_\theta S_\theta(z)\|_{\mathrm{op}}$, which is the quantity that directly enters the proof of Theorem 4.7.

B.2 Proof of Theorem 4.7 (weak convexity of the robust penalty)

Proof of Theorem 4.7. Recall the robust penalty
\[
R(\theta) := \mathbb{E}_{z \sim d_\theta}\big[ |R(\theta; x, y_1, y_2)| \big], \qquad R(\theta; x, y_1, y_2) := (\theta - \theta_{\mathrm{ref}})^\top \big( \psi(x, y_1) - \psi(x, y_2) \big), \quad z = (x, y_1, y_2).
\]
Step 1: Smooth approximation. Fix $\varepsilon > 0$ and define $\phi_\varepsilon(u) := \sqrt{u^2 + \varepsilon^2}$. Set $R_\varepsilon(\theta) := \mathbb{E}_{z \sim d_\theta}\big[ \phi_\varepsilon(s_\theta(z)) \big]$. Since $\mathcal{Y}$ is finite, the expectation over $d_\theta$ is a finite sum, so $R_\varepsilon$ is twice continuously differentiable and we may differentiate under the sum. We will show that $\nabla^2 R_\varepsilon(\theta) \succeq -\kappa_\varepsilon I$ for all $\theta \in \Theta$ with
\[
\kappa_\varepsilon := 8GB_\psi + 4MDB_\psi + 2M\varepsilon, \tag{34}
\]
where $G, M$ are as in Lemma B.1. Since $R_\varepsilon$ is $C^2$ and has a global Hessian lower bound, the function $R_\varepsilon(\cdot) + \frac{\kappa_\varepsilon}{2}\|\cdot\|_2^2$ has positive semidefinite Hessian and is convex; equivalently, $R_\varepsilon$ is $\kappa_\varepsilon$-weakly convex.

Step 2: Uniform bounds. Define $\Delta\psi(z) := \psi(x, y_1) - \psi(x, y_2)$.
By Assumption 4, $\|\Delta\psi(z)\|_2 \le \|\psi(x, y_1)\|_2 + \|\psi(x, y_2)\|_2 \le 2B_\psi$. By Assumption 5, $\|\theta - \theta_{\mathrm{ref}}\|_2 \le D$ for all $\theta \in \Theta$, hence
\[
|s_\theta(z)| \le \|\theta - \theta_{\mathrm{ref}}\|_2\, \|\Delta\psi(z)\|_2 \le 2DB_\psi.
\]
Moreover, $\phi_\varepsilon(u) \le |u| + \varepsilon$, so
\[
0 \le \phi_\varepsilon(s_\theta(z)) \le 2DB_\psi + \varepsilon. \tag{35}
\]
Step 3: Score-function calculus for $d_\theta$. Write $d_\theta(z) = \mu(x)\, \pi_\theta(y_1 \mid x)\, \pi_\theta(y_2 \mid x)$ and define the score [46] $S_\theta(z) := \nabla_\theta \log d_\theta(z)$. Then $S_\theta(z) = g_\theta(x, y_1) + g_\theta(x, y_2)$, hence Lemma B.1 gives
\[
\|S_\theta(z)\|_2 \le 2G, \qquad \|\nabla_\theta S_\theta(z)\|_{\mathrm{op}} \le 2M. \tag{36}
\]
For the integrand $r_{\varepsilon,\theta}(z) := \phi_\varepsilon(s_\theta(z))$, note that
\[
\nabla_\theta s_\theta(z) = \Delta\psi(z), \qquad \nabla_\theta r_{\varepsilon,\theta}(z) = \phi'_\varepsilon(s_\theta(z))\, \Delta\psi(z),
\]
and since $|\phi'_\varepsilon(u)| = |u|/\sqrt{u^2 + \varepsilon^2} \le 1$,
\[
\|\nabla_\theta r_{\varepsilon,\theta}(z)\|_2 \le \|\Delta\psi(z)\|_2 \le 2B_\psi. \tag{37}
\]
Also,
\[
\nabla^2_\theta r_{\varepsilon,\theta}(z) = \phi''_\varepsilon(s_\theta(z))\, \Delta\psi(z)\Delta\psi(z)^\top, \qquad \phi''_\varepsilon(u) = \frac{\varepsilon^2}{(u^2 + \varepsilon^2)^{3/2}} \ge 0,
\]
so $\nabla^2_\theta r_{\varepsilon,\theta}(z) \succeq 0$ for all $(\theta, z)$. Because $\mathcal{Y}$ is finite, we can differentiate $R_\varepsilon(\theta) = \sum_z d_\theta(z)\, r_{\varepsilon,\theta}(z)$ term-by-term and use $\nabla_\theta d_\theta(z) = d_\theta(z)\, S_\theta(z)$ to obtain:
\[
\nabla_\theta R_\varepsilon(\theta) = \mathbb{E}_{z \sim d_\theta}\big[ \nabla_\theta r_{\varepsilon,\theta}(z) + r_{\varepsilon,\theta}(z)\, S_\theta(z) \big], \tag{38}
\]
\[
\nabla^2_\theta R_\varepsilon(\theta) = \mathbb{E}_{z \sim d_\theta}\Big[ \nabla^2_\theta r_{\varepsilon,\theta}(z) + S_\theta(z)\, \nabla_\theta r_{\varepsilon,\theta}(z)^\top + \nabla_\theta r_{\varepsilon,\theta}(z)\, S_\theta(z)^\top + r_{\varepsilon,\theta}(z)\, S_\theta(z) S_\theta(z)^\top + r_{\varepsilon,\theta}(z)\, \nabla_\theta S_\theta(z) \Big], \tag{39}
\]
where the factor $r_{\varepsilon,\theta}(z)$ multiplying $\nabla_\theta S_\theta(z)$ arises from differentiating $d_\theta(z)\, r_{\varepsilon,\theta}(z)\, S_\theta(z)$, consistent with the term $r_{\varepsilon,\theta}(z)\, u^\top (\nabla_\theta S_\theta(z)) u$ bounded in Step 4.

Step 4: Hessian lower bound. Fix any unit vector $u \in \mathbb{R}^d$. Using (39) and the facts that $\nabla^2_\theta r_{\varepsilon,\theta}(z) \succeq 0$ and $S_\theta(z) S_\theta(z)^\top \succeq 0$, we have
\[
u^\top \nabla^2_\theta R_\varepsilon(\theta)\, u \ge \mathbb{E}_{z \sim d_\theta}\Big[ u^\top \big( S_\theta \nabla_\theta r_{\varepsilon,\theta}^\top + \nabla_\theta r_{\varepsilon,\theta} S_\theta^\top \big) u + r_{\varepsilon,\theta}(z)\, u^\top (\nabla_\theta S_\theta(z)) u \Big].
\]
For the first term,
\[
u^\top \big( S_\theta \nabla_\theta r_{\varepsilon,\theta}^\top + \nabla_\theta r_{\varepsilon,\theta} S_\theta^\top \big) u = 2 \langle u, S_\theta(z) \rangle \langle u, \nabla_\theta r_{\varepsilon,\theta}(z) \rangle \ge -2 \|S_\theta(z)\|_2\, \|\nabla_\theta r_{\varepsilon,\theta}(z)\|_2,
\]
and using (36) and (37) gives the pointwise bound
\[
u^\top \big( S_\theta \nabla_\theta r_{\varepsilon,\theta}^\top + \nabla_\theta r_{\varepsilon,\theta} S_\theta^\top \big) u \ge -2(2G)(2B_\psi) = -8GB_\psi.
\]
For the second term, by (36), $u^\top (\nabla_\theta S_\theta(z)) u \ge -\|\nabla_\theta S_\theta(z)\|_{\mathrm{op}} \ge -2M$, so with (35) we have
\[
r_{\varepsilon,\theta}(z)\, u^\top (\nabla_\theta S_\theta(z)) u \ge -(2DB_\psi + \varepsilon)\, 2M = -4MDB_\psi - 2M\varepsilon.
\]
Combining the two bounds and taking the expectation yields
\[
u^\top \nabla^2_\theta R_\varepsilon(\theta)\, u \ge -(8GB_\psi + 4MDB_\psi + 2M\varepsilon) = -\kappa_\varepsilon
\]
for all unit $u$, hence $\nabla^2_\theta R_\varepsilon(\theta) \succeq -\kappa_\varepsilon I$.

Step 5: Pass to the nonsmooth limit $\varepsilon \downarrow 0$. For any $u \in \mathbb{R}$, $0 \le \phi_\varepsilon(u) - |u| \le \varepsilon$, hence $0 \le R_\varepsilon(\theta) - R(\theta) \le \varepsilon$ for all $\theta$. Thus $R_\varepsilon \to R$ uniformly as $\varepsilon \downarrow 0$. Since each $R_\varepsilon$ is $\kappa_\varepsilon$-weakly convex and $\kappa_\varepsilon \downarrow \kappa_R := 8GB_\psi + 4MDB_\psi$, taking $\varepsilon \downarrow 0$ in the defining weak convexity inequality yields that $R$ is $\kappa_R$-weakly convex. Finally, by Lemma B.1, we may take $G = 2B_\psi$ and $M = B_\psi^2$, giving
\[
\kappa_R \le 8(2B_\psi)B_\psi + 4(B_\psi^2) D B_\psi = 16 B_\psi^2 + 4 D B_\psi^3,
\]
which is the claimed bound in Theorem 4.7.

B.3 Proof of Theorem 4.8

Proof of Theorem 4.8. By Theorem 4.1 we have the exact decomposition $L^W_\rho(\theta) = L_{\mathrm{SAIL}}(\theta) + \lambda R(\theta)$ with $\lambda = \rho\beta$. Under Assumption 3, $L_{\mathrm{SAIL}}$ has $L_{\mathrm{SAIL}}$-Lipschitz gradient on $\mathbb{R}^d$. A basic fact is that any $L$-smooth function is $L$-weakly convex. This can be verified directly from the second-order characterization: if $f$ is $L$-smooth, then $\nabla^2 f(\theta) \succeq -LI$; equivalently, the function $\theta \mapsto f(\theta) + \frac{L}{2}\|\theta\|_2^2$ is convex. By Theorem 4.7, $R$ is $\kappa_R$-weakly convex on $\Theta$. Scaling preserves weak convexity, so $\lambda R$ is $(\lambda\kappa_R)$-weakly convex. Sums of weakly convex functions add their curvature parameters, hence $L^W_\rho = L_{\mathrm{SAIL}} + \lambda R$ is $\kappa$-weakly convex on $\Theta$ with $\kappa := L_{\mathrm{SAIL}} + \lambda\kappa_R$. This concludes the proof of Theorem 4.8.

B.4 Auxiliary lemma for stochastic gradient oracle

Lemma B.4 (Second-moment bound for gradient oracle).
Fix any mini-batch size $B \in \mathbb{N}$, and recall the composite direction in Eq. (22),
\[
G(\theta; Z) := G_{\mathrm{SAIL}}(\theta; Z) + \lambda\, G_R(\theta; Z),
\]
where $Z$ denotes a generic mini-batch of size $B$ drawn i.i.d. from $d_\theta$ (Assumption 6). Under Assumptions 2, 4, 5, and 6, the composite direction has bounded conditional second moment: for all $\theta \in \Theta$,
\[
\mathbb{E}\big[ \|G(\theta; Z)\|_2^2 \mid \theta \big] \le G^2_{\mathrm{tot}},
\]
with the explicit choice
\[
G^2_{\mathrm{tot}} := 4\Big( G^2_{\nabla\mathrm{SAIL}} + \lambda^2 G^2_{\partial R} + \frac{\sigma^2_{\mathrm{SAIL}} + \lambda^2 \sigma^2_R}{B} \Big),
\]
where $G_{\nabla\mathrm{SAIL}} := 2\beta B_\psi + 4B_\psi\big(\log 2 + 2\beta D B_\psi\big)$, $G_{\partial R} := 2B_\psi + 8DB_\psi^2$, and $\sigma^2_{\mathrm{SAIL}}, \sigma^2_R$ are as in Assumption 6.

Proof. We bound the deterministic (mean) components $\|\nabla L_{\mathrm{SAIL}}(\theta)\|_2$ and $\mathrm{dist}(0, \partial R(\theta))$, then combine with the variance bounds in Assumption 6 and the definition (22).

Step 1: Uniform bounds on the pairwise logit and score. For $z = (x, y_1, y_2) \in \mathcal{Z}$, define $\Delta\psi(z) := \psi(x, y_1) - \psi(x, y_2)$. By Assumption 4, $\|\Delta\psi(z)\|_2 \le 2B_\psi$. Under Assumption 2 (log-linear policy class), the pairwise logit admits the cancellation
\[
h_\theta(z) = h_\theta(x, y_1, y_2) = (\theta - \theta_{\mathrm{ref}})^\top \Delta\psi(z).
\]
Hence, by Assumption 5,
\[
|h_\theta(z)| \le \|\theta - \theta_{\mathrm{ref}}\|_2\, \|\Delta\psi(z)\|_2 \le D \cdot 2B_\psi = 2DB_\psi.
\]
Moreover, recalling $d_\theta(z) = \mu(x)\, \pi_\theta(y_1 \mid x)\, \pi_\theta(y_2 \mid x)$ (Eq. (1)) and $S_\theta(z) := \nabla_\theta \log d_\theta(z)$, Lemma B.1 gives $\|S_\theta(z)\|_2 \le 4B_\psi$.

Step 2: Uniform bound on $\|\nabla L_{\mathrm{SAIL}}(\theta)\|_2$. Recall that $L_{\mathrm{SAIL}}(\theta) = \mathbb{E}_{z \sim d_\theta}\big[ p^\star \ell^1(h) + (1 - p^\star)\ell^0(h) \big]$, where $\ell^1(h) = -\log \sigma(\beta h)$, $\ell^0(h) = -\log \sigma(-\beta h)$, and $p^\star(z) \in [0, 1]$. Define the pointwise SAIL loss
\[
\ell_\theta(z) := p^\star(z)\, \ell^1\big(h_\theta(z)\big) + (1 - p^\star(z))\, \ell^0\big(h_\theta(z)\big).
\]
Since $\ell^1(h) = \log(1 + e^{-\beta h})$ and $\ell^0(h) = \log(1 + e^{\beta h})$, we have for all $h \in \mathbb{R}$,
\[
\max\{\ell^1(h), \ell^0(h)\} = \log(1 + e^{|\beta h|}) \le \log 2 + |\beta h|.
\]
Using $|h_\theta(z)| \le 2DB_\psi$,
\[
0 \le \ell_\theta(z) \le \log 2 + 2\beta D B_\psi.
\]
Next, by direct differentiation, the derivatives of $\ell^1, \ell^0$ with respect to $h$ satisfy $|\frac{d}{dh}\ell^1(h)| \le \beta$ and $|\frac{d}{dh}\ell^0(h)| \le \beta$ for all $h$. Together with $\nabla_\theta h_\theta(z) = \Delta\psi(z)$ and $\|\Delta\psi(z)\|_2 \le 2B_\psi$,
\[
\|\nabla_\theta \ell_\theta(z)\|_2 \le \beta \|\Delta\psi(z)\|_2 \le 2\beta B_\psi.
\]
Because $\mathcal{Y}$ is finite, we may apply the same score-function calculus as in Eq. (38) to write
\[
\nabla L_{\mathrm{SAIL}}(\theta) = \mathbb{E}_{z \sim d_\theta}\big[ \nabla_\theta \ell_\theta(z) + \ell_\theta(z)\, S_\theta(z) \big].
\]
Therefore,
\[
\|\nabla L_{\mathrm{SAIL}}(\theta)\|_2 \le \mathbb{E}_{z \sim d_\theta}\big[ \|\nabla_\theta \ell_\theta(z)\|_2 + \ell_\theta(z)\, \|S_\theta(z)\|_2 \big] \le 2\beta B_\psi + \big( \log 2 + 2\beta D B_\psi \big)\, 4B_\psi =: G_{\nabla\mathrm{SAIL}}.
\]
Step 3: Uniform bound on $\mathrm{dist}(0, \partial R(\theta))$. Recall $R(\theta) = \mathbb{E}_{z \sim d_\theta}[|h_\theta(z)|]$. For $\varepsilon > 0$, define the smooth approximation $\varphi_\varepsilon(u) := \sqrt{u^2 + \varepsilon^2}$ and $R_\varepsilon(\theta) := \mathbb{E}_{z \sim d_\theta}[\varphi_\varepsilon(h_\theta(z))]$. As in Eq. (38), we have
\[
\nabla R_\varepsilon(\theta) = \mathbb{E}_{z \sim d_\theta}\big[ \nabla_\theta r_{\varepsilon,\theta}(z) + r_{\varepsilon,\theta}(z)\, S_\theta(z) \big], \qquad r_{\varepsilon,\theta}(z) := \varphi_\varepsilon(h_\theta(z)).
\]
Since $|\varphi'_\varepsilon(u)| \le 1$ and $\nabla_\theta h_\theta(z) = \Delta\psi(z)$, we obtain $\|\nabla_\theta r_{\varepsilon,\theta}(z)\|_2 \le \|\Delta\psi(z)\|_2 \le 2B_\psi$. Also, $0 \le \varphi_\varepsilon(u) \le |u| + \varepsilon$, so $r_{\varepsilon,\theta}(z) \le 2DB_\psi + \varepsilon$. Using $\|S_\theta(z)\|_2 \le 4B_\psi$ (Lemma B.1), we conclude
\[
\|\nabla R_\varepsilon(\theta)\|_2 \le 2B_\psi + (2DB_\psi + \varepsilon)\, 4B_\psi = 2B_\psi + 8DB_\psi^2 + 4\varepsilon B_\psi.
\]
Hence $R_\varepsilon$ is $(2B_\psi + 8DB_\psi^2 + 4\varepsilon B_\psi)$-Lipschitz on $\Theta$. Moreover, $|\varphi_\varepsilon(u) - |u|| \le \varepsilon$ implies $\sup_{\theta \in \Theta} |R_\varepsilon(\theta) - R(\theta)| \le \varepsilon$. Thus for any $\theta, \theta' \in \Theta$,
\[
|R(\theta) - R(\theta')| \le |R(\theta) - R_\varepsilon(\theta)| + |R_\varepsilon(\theta) - R_\varepsilon(\theta')| + |R_\varepsilon(\theta') - R(\theta')| \le 2\varepsilon + \big( 2B_\psi + 8DB_\psi^2 + 4\varepsilon B_\psi \big)\|\theta - \theta'\|_2.
\]
Letting $\varepsilon \downarrow 0$ yields that $R$ is $G_{\partial R}$-Lipschitz on $\Theta$, i.e.,
\[
|R(\theta) - R(\theta')| \le G_{\partial R}\, \|\theta - \theta'\|_2 \quad \forall\, \theta, \theta' \in \Theta, \qquad G_{\partial R} := 2B_\psi + 8DB_\psi^2.
\]
Hence the local Lipschitz modulus satisfies
\[
\limsup_{\theta' \to \theta,\, \theta' \ne \theta} \frac{|R(\theta') - R(\theta)|}{\|\theta' - \theta\|_2} \le G_{\partial R} \quad \forall\, \theta \in \Theta.
\]
A standard variational-analytic fact for locally Lipschitz functions implies that any (limiting) subgradient is norm-bounded by the local Lipschitz modulus [47]; namely,
\[
v \in \partial R(\theta) \implies \|v\|_2 \le \limsup_{\theta' \to \theta,\, \theta' \ne \theta} \frac{|R(\theta') - R(\theta)|}{\|\theta' - \theta\|_2} \le G_{\partial R}.
\]
Since $\partial R(\theta) \ne \emptyset$ for all $\theta \in \Theta$ (Assumption 6), we conclude
\[
\mathrm{dist}\big( 0, \partial R(\theta) \big) \le G_{\partial R} \quad \forall\, \theta \in \Theta.
\]
Fix $\theta \in \Theta$ and abbreviate $G_{\mathrm{SAIL}} := G_{\mathrm{SAIL}}(\theta; Z)$ and $G_R := G_R(\theta; Z)$. By Eq. (22) and $\|a + b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$,
\[
\mathbb{E}\big[ \|G(\theta; Z)\|_2^2 \mid \theta \big] \le 2\, \mathbb{E}\big[ \|G_{\mathrm{SAIL}}\|_2^2 \mid \theta \big] + 2\lambda^2\, \mathbb{E}\big[ \|G_R\|_2^2 \mid \theta \big].
\]
For the SAIL term, write $G_{\mathrm{SAIL}} = \nabla L_{\mathrm{SAIL}}(\theta) + (G_{\mathrm{SAIL}} - \nabla L_{\mathrm{SAIL}}(\theta))$ and apply $\|a + b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$ together with Assumption 6:
\[
\mathbb{E}\big[ \|G_{\mathrm{SAIL}}\|_2^2 \mid \theta \big] \le 2\|\nabla L_{\mathrm{SAIL}}(\theta)\|_2^2 + 2\, \mathbb{E}\big[ \|G_{\mathrm{SAIL}} - \nabla L_{\mathrm{SAIL}}(\theta)\|_2^2 \mid \theta \big] \le 2 G^2_{\nabla\mathrm{SAIL}} + \frac{2\sigma^2_{\mathrm{SAIL}}}{B}.
\]
For the robust-penalty term, note that for any closed nonempty set $A \subset \mathbb{R}^d$ and any $u \in \mathbb{R}^d$, $\|u\|_2 \le \mathrm{dist}(u, A) + \mathrm{dist}(0, A)$, hence $\|u\|_2^2 \le 2\, \mathrm{dist}^2(u, A) + 2\, \mathrm{dist}^2(0, A)$. Applying this with $A = \partial R(\theta)$ and using Assumption 6 and Step 3 gives
\[
\mathbb{E}\big[ \|G_R\|_2^2 \mid \theta \big] \le 2\, \mathbb{E}\big[ \mathrm{dist}^2\big( G_R, \partial R(\theta) \big) \mid \theta \big] + 2\, \mathrm{dist}^2\big( 0, \partial R(\theta) \big) \le \frac{2\sigma^2_R}{B} + 2 G^2_{\partial R}.
\]
Combining the last three displays yields, for all $\theta \in \Theta$,
\[
\mathbb{E}\big[ \|G(\theta; Z)\|_2^2 \mid \theta \big] \le 2\Big( 2 G^2_{\nabla\mathrm{SAIL}} + \frac{2\sigma^2_{\mathrm{SAIL}}}{B} \Big) + 2\lambda^2 \Big( 2 G^2_{\partial R} + \frac{2\sigma^2_R}{B} \Big) = 4\Big( G^2_{\nabla\mathrm{SAIL}} + \lambda^2 G^2_{\partial R} + \frac{\sigma^2_{\mathrm{SAIL}} + \lambda^2 \sigma^2_R}{B} \Big).
\]
This is exactly the claimed bound with $G^2_{\mathrm{tot}}$ as stated.

C Moreau envelope and convergence analysis

C.1 Proof of Lemma 4.11 (Properties of the Moreau envelope)

Proof of Lemma 4.11. Fix $\lambda_{\mathrm{env}} \in (0, 1/\kappa)$ and recall the definition of the Moreau envelope and proximal point (Definition 4.10):
\[
F_{\lambda_{\mathrm{env}}}(\theta) := \min_{u \in \mathbb{R}^d} \Big\{ F(u) + \frac{1}{2\lambda_{\mathrm{env}}}\|u - \theta\|_2^2 \Big\}, \qquad \hat{\theta} = \mathrm{prox}_{\lambda_{\mathrm{env}} F}(\theta) := \arg\min_{u \in \mathbb{R}^d} \Big\{ F(u) + \frac{1}{2\lambda_{\mathrm{env}}}\|u - \theta\|_2^2 \Big\}.
\]
1–2. Finiteness, uniqueness, and the gradient formula.
Since F is κ-weakly convex, the function u ↦ F(u) + (κ/2)∥u∥_2² is convex. Therefore, for any fixed θ, the function u ↦ F(u) + (1/(2λ_env))∥u − θ∥_2² is (1/λ_env − κ)-strongly convex in u (because λ_env < 1/κ), hence it has a unique minimizer θ̂ and the minimum value F_{λ_env}(θ) is finite. Moreover, by first-order optimality of θ̂ for the strongly convex problem,

0 ∈ ∂F(θ̂) + (1/λ_env)(θ̂ − θ), i.e., (1/λ_env)(θ − θ̂) ∈ ∂F(θ̂). (40)

Standard properties of the Moreau envelope for weakly convex functions imply that F_{λ_env} is continuously differentiable and that ∇F_{λ_env}(θ) = (1/λ_env)(θ − θ̂). Eq. (19) follows immediately: ∥θ − θ̂∥_2 = λ_env ∥∇F_{λ_env}(θ)∥_2.

3. Lipschitz continuity of ∇F_{λ_env}. A standard result for κ-weakly convex F is that F_{λ_env} has Lipschitz gradient with constant L_env = 1/(λ_env(1 − κλ_env)), which is Eq. (20).

4. Approximate stationarity of the proximal point. By (40), the vector (1/λ_env)(θ − θ̂) belongs to ∂F(θ̂). Therefore

dist(0, ∂F(θ̂)) ≤ ∥(1/λ_env)(θ − θ̂)∥_2 = ∥∇F_{λ_env}(θ)∥_2.

This completes the proof.

C.2 Auxiliary lemmas for Theorem 5.2

Lemma C.1 (Monotonicity inequality). Assume F is κ-weakly convex and let λ_env ∈ (0, 1/κ). Fix any θ ∈ ℝ^d and let

θ̂ = prox_{λ_env F}(θ), ξ := ∇F_{λ_env}(θ) = (1/λ_env)(θ − θ̂).

Then for every v ∈ ∂F(θ),

⟨ξ, v⟩ ≥ (1 − κλ_env) ∥ξ∥_2².

Proof. By κ-weak convexity, for all x, y and all g ∈ ∂F(x),

F(y) ≥ F(x) + ⟨g, y − x⟩ − (κ/2)∥y − x∥_2². (41)

Apply (41) twice:

(i) with x = θ, y = θ̂, and g = v ∈ ∂F(θ):

F(θ̂) ≥ F(θ) + ⟨v, θ̂ − θ⟩ − (κ/2)∥θ̂ − θ∥_2². (42)

(ii) with x = θ̂, y = θ, and any v̂ ∈ ∂F(θ̂):

F(θ) ≥ F(θ̂) + ⟨v̂, θ − θ̂⟩ − (κ/2)∥θ − θ̂∥_2². (43)

By optimality of the proximal point (as in (40)), we may choose v̂ = (1/λ_env)(θ − θ̂) = ξ ∈ ∂F(θ̂). Substituting this choice into (43) gives

F(θ) ≥ F(θ̂) + ⟨ξ, θ − θ̂⟩ − (κ/2)∥θ − θ̂∥_2².

Add this inequality to (42) to eliminate F(θ) and F(θ̂), yielding

0 ≥ ⟨v, θ̂ − θ⟩ + ⟨ξ, θ − θ̂⟩ − κ∥θ − θ̂∥_2².

Rearrange and use θ − θ̂ = λ_env ξ:

⟨ξ, θ − θ̂⟩ ≤ ⟨v, θ − θ̂⟩ + κ∥θ − θ̂∥_2² ⟹ λ_env ∥ξ∥_2² ≤ λ_env ⟨v, ξ⟩ + κλ_env² ∥ξ∥_2².

Divide by λ_env > 0 and rearrange: ⟨v, ξ⟩ ≥ (1 − κλ_env)∥ξ∥_2², as claimed.

Lemma C.2 (One-step inequality). Assume F is κ-weakly convex and bounded below by F_inf > −∞. Fix λ_env ∈ (0, 1/κ) and define L_env := 1/(λ_env(1 − κλ_env)) as in Lemma 4.11. Consider the projected update (Algorithm 1)

θ_{t+1} := proj_Θ(θ_t − η G(θ_t; Z_t)),

where G(θ_t; Z_t) satisfies E[G(θ_t; Z_t) | θ_t] ∈ ∂F(θ_t). Then for any stepsize η > 0,

E[F_{λ_env}(θ_{t+1}) | θ_t] ≤ F_{λ_env}(θ_t) − η(1 − κλ_env)∥∇F_{λ_env}(θ_t)∥_2² + (L_env η²/2) E[∥G(θ_t; Z_t)∥_2² | θ_t].

Proof. Write F = L^W_ρ + I_Θ, where I_Θ is the indicator of Θ (Eq. (16)). Then the envelope can be written as

F_{λ_env}(z) = min_{u∈Θ} { L^W_ρ(u) + (1/(2λ_env))∥u − z∥_2² }.

Let z̃ := proj_Θ(z). For any u ∈ Θ, Euclidean projection optimality implies ⟨z − z̃, u − z̃⟩ ≤ 0, hence

∥u − z̃∥_2² = ∥u − z∥_2² − ∥z − z̃∥_2² + 2⟨u − z̃, z − z̃⟩ ≤ ∥u − z∥_2².

Taking the minimum over u ∈ Θ gives the monotonicity under projection:

F_{λ_env}(proj_Θ(z)) ≤ F_{λ_env}(z) for all z ∈ ℝ^d. (44)

Now define the unprojected point z_t := θ_t − η G(θ_t; Z_t), so θ_{t+1} = proj_Θ(z_t). By (44), F_{λ_env}(θ_{t+1}) ≤ F_{λ_env}(z_t). By Lemma 4.11, F_{λ_env} is L_env-smooth, hence

F_{λ_env}(θ_t − η G(θ_t; Z_t)) ≤ F_{λ_env}(θ_t) − η⟨∇F_{λ_env}(θ_t), G(θ_t; Z_t)⟩ + (L_env η²/2)∥G(θ_t; Z_t)∥_2².
Combining with the previous display and taking conditional expectation yields

E[F_{λ_env}(θ_{t+1}) | θ_t] ≤ F_{λ_env}(θ_t) − η⟨∇F_{λ_env}(θ_t), E[G(θ_t; Z_t) | θ_t]⟩ + (L_env η²/2) E[∥G(θ_t; Z_t)∥_2² | θ_t].

Let v_t := E[G(θ_t; Z_t) | θ_t] ∈ ∂F(θ_t) by assumption. Applying Lemma C.1 with θ = θ_t, ξ = ∇F_{λ_env}(θ_t), and v = v_t gives

⟨∇F_{λ_env}(θ_t), v_t⟩ ≥ (1 − κλ_env)∥∇F_{λ_env}(θ_t)∥_2².

Substituting completes the proof.

C.3 Proof of Theorem 5.2 (convergence rate of the Moreau envelope)

Proof of Theorem 5.2. Fix λ_env ∈ (0, 1/κ) and stepsize η > 0. By Lemma C.2, for each t,

E[F_{λ_env}(θ_{t+1}) | θ_t] ≤ F_{λ_env}(θ_t) − η(1 − κλ_env)∥∇F_{λ_env}(θ_t)∥_2² + (L_env η²/2) E[∥G(θ_t; Z_t)∥_2² | θ_t].

Take total expectation and sum from t = 0 to T − 1 to obtain the telescoping inequality

η(1 − κλ_env) Σ_{t=0}^{T−1} E∥∇F_{λ_env}(θ_t)∥_2² ≤ F_{λ_env}(θ_0) − E[F_{λ_env}(θ_T)] + (L_env η²/2) Σ_{t=0}^{T−1} E∥G(θ_t; Z_t)∥_2². (45)

Since F_{λ_env}(θ) ≥ inf_u F(u) = F_inf for all θ, we have E[F_{λ_env}(θ_T)] ≥ F_inf. By Lemma B.4, E∥G(θ_t; Z_t)∥_2² ≤ G_tot² for all t. Thus (45) implies

η(1 − κλ_env) Σ_{t=0}^{T−1} E∥∇F_{λ_env}(θ_t)∥_2² ≤ F_{λ_env}(θ_0) − F_inf + (L_env η²/2) T G_tot².

Divide by η(1 − κλ_env)T:

(1/T) Σ_{t=0}^{T−1} E∥∇F_{λ_env}(θ_t)∥_2² ≤ (F_{λ_env}(θ_0) − F_inf) / (η(1 − κλ_env)T) + (L_env η / (2(1 − κλ_env))) G_tot².

Let R be uniformly distributed on {0, …, T − 1}, independent of the algorithmic randomness. Then

E∥∇F_{λ_env}(θ_R)∥_2² = (1/T) Σ_{t=0}^{T−1} E∥∇F_{λ_env}(θ_t)∥_2²,

and we obtain exactly Eq. (24) of Theorem 5.2.

Proof of Corollary 5.3.
Plug the stated stepsize choice

η := (1/G_tot) √( 2(F_{λ_env}(θ_0) − F_inf) / (L_env T) )

into the right-hand side of Eq. (24); the two terms then coincide, and simplifying gives

E∥∇F_{λ_env}(θ_R)∥_2² ≤ G_tot √(2 L_env (F_{λ_env}(θ_0) − F_inf)) / ((1 − κλ_env) √T).

By Lemma B.4, we have G_tot² := 4(G_{∇SAIL}² + λ²G_{∂R}² + (σ_SAIL² + λ²σ_R²)/B). To ensure that E∥∇F_{λ_env}(θ_R)∥_2² ≤ ε, we substitute G_tot² and L_env = 1/(λ_env(1 − κλ_env)) into the display above and obtain that it suffices to take

T ≥ 2G_tot²(F_{λ_env}(θ_0) − F_inf) / (λ_env(1 − κλ_env)³ ε²) = 8(C + (σ_SAIL² + λ²σ_R²)/B)(F_{λ_env}(θ_0) − F_inf) / (λ_env(1 − κλ_env)³ ε²),

where C := G_{∇SAIL}² + λ²G_{∂R}².

D On the practical role of the proximal point

This section provides a practical interpretation of the proximal-point and Moreau-envelope constructions used in our analysis. Throughout, recall the constrained objective F(θ) := L^W_ρ(θ) + I_Θ(θ), and the envelope parameter λ_env ∈ (0, 1/κ) from Definition 4.10, where κ is the weak convexity constant from Theorem 4.8. The Moreau envelope

F_{λ_env}(θ) = min_{u∈ℝ^d} { F(u) + (1/(2λ_env))∥u − θ∥_2² }

can be viewed as a smoothed surrogate of the potentially nonsmooth, nonconvex objective F. The quadratic term discourages large moves away from the current iterate θ, and the minimizer of the inner problem, the proximal point, acts as a locally stabilized refinement of θ. Concretely, for any iterate θ we write

θ̂ := prox_{λ_env F}(θ) ∈ argmin_{u∈ℝ^d} { F(u) + (1/(2λ_env))∥u − θ∥_2² }.

Because F is κ-weakly convex and λ_env < 1/κ, the proximal subproblem is strongly convex in u and hence has a unique minimizer. This is the fundamental reason the envelope is differentiable and why ∥∇F_{λ_env}(θ)∥_2 is a meaningful stationarity proxy (Lemma 4.11). Algorithm 1 does not require solving the proximal subproblem at each iteration.
The algorithm updates θ_t using stochastic subgradient information for L^W_ρ = L_SAIL + λR, followed by projection onto Θ, and then outputs a random iterate θ_R. The proximal point θ̂_R := prox_{λ_env F}(θ_R) is introduced only as a theoretical device to translate envelope stationarity at θ_R into near-stationarity of the original constrained objective at θ̂_R (see Lemma 4.11 and Theorem 5.2). In particular, Lemma 4.11 implies that θ_R and θ̂_R are close whenever the envelope gradient is small:

∥θ_R − θ̂_R∥_2 = λ_env ∥∇F_{λ_env}(θ_R)∥_2,

and also that θ̂_R is nearly stationary for F in the sense that

dist(0, ∂F(θ̂_R)) ≤ ∥∇F_{λ_env}(θ_R)∥_2.

Thus, driving ∥∇F_{λ_env}(θ_R)∥_2 small (as guaranteed in expectation by Theorem 5.2, with sample complexity summarized in Corollary 5.3) simultaneously guarantees proximity to, and near-stationarity of, a point for the original constrained robust objective.

Lemma D.1 (What if we do not compute the proximal point?). Assume F : ℝ^d → (−∞, +∞] is κ-weakly convex and bounded below, and fix λ_env ∈ (0, 1/κ) as in Definition 4.10. For any θ ∈ ℝ^d, let

θ̂ := prox_{λ_env F}(θ), ξ := ∇F_{λ_env}(θ) = (1/λ_env)(θ − θ̂)

(Definition 4.10 and Lemma 4.11). Then:

∥θ − θ̂∥ = λ_env ∥ξ∥, (46)
dist(0, ∂F(θ̂)) ≤ ∥ξ∥, (47)
dist(0, ∂F(θ)) ≥ (1 − κλ_env)∥ξ∥. (48)

On the other hand, under κ-weak convexity alone, there is no universal constant C such that dist(0, ∂F(θ)) ≤ C∥∇F_{λ_env}(θ)∥ for all θ and all κ-weakly convex F (even in the convex case κ = 0).

Proof. We first prove (46)–(47). By Lemma 4.11 (see also the first-order optimality condition for the proximal subproblem),

0 ∈ ∂F(θ̂) + (1/λ_env)(θ̂ − θ), equivalently (1/λ_env)(θ − θ̂) ∈ ∂F(θ̂).

With ξ := (1/λ_env)(θ − θ̂) = ∇F_{λ_env}(θ), taking norms yields ∥θ − θ̂∥ = λ_env ∥ξ∥, which is (46).
Moreover, since ξ ∈ ∂F(θ̂), we have dist(0, ∂F(θ̂)) ≤ ∥ξ∥, which is (47).

Next we prove (48). By Lemma C.1 (Monotonicity inequality), for every v ∈ ∂F(θ),

⟨ξ, v⟩ ≥ (1 − κλ_env)∥ξ∥².

By Cauchy–Schwarz, ⟨ξ, v⟩ ≤ ∥ξ∥ ∥v∥, hence for every v ∈ ∂F(θ), ∥v∥ ≥ (1 − κλ_env)∥ξ∥. Taking the infimum over v ∈ ∂F(θ) gives (48).

For the stated consequences with the algorithmic output, apply (46)–(47) with θ = θ_R and θ̂ = θ̂_R. Then

∥θ_R − θ̂_R∥² = λ_env² ∥∇F_{λ_env}(θ_R)∥², dist²(0, ∂F(θ̂_R)) ≤ ∥∇F_{λ_env}(θ_R)∥².

Taking expectations, using Jensen's inequality for E∥θ_R − θ̂_R∥, and plugging in E∥∇F_{λ_env}(θ_R)∥² ≤ ε yields the two bounds.

Finally, we show that no universal constant C can upper bound dist(0, ∂F(θ)) by ∥∇F_{λ_env}(θ)∥ under weak convexity alone. Consider d = 1 and Θ = ℝ, and define F(θ) = |θ|, which is convex (thus κ = 0). Fix λ_env = 1 and any θ ∈ (0, 1). The proximal point is

θ̂ = argmin_{u∈ℝ} { |u| + (1/2)(u − θ)² } = 0,

so ∇F_{λ_env}(θ) = θ − θ̂ = θ. However, for θ > 0, ∂F(θ) = {1} and therefore dist(0, ∂F(θ)) = 1. Hence

dist(0, ∂F(θ)) / ∥∇F_{λ_env}(θ)∥ = 1/θ → ∞ as θ ↓ 0,

which rules out any finite universal C.

Remark D.2 (Interpretation for practice). Even if Algorithm 1 outputs only θ_R (without solving the proximal subproblem), Theorem 5.2 still certifies that θ_R lies within O(λ_env √ε) of a point θ̂_R that is O(√ε)-stationary for the original constrained objective F. In contrast, without additional regularity beyond κ-weak convexity, one cannot generally convert envelope stationarity into a bound on dist(0, ∂F(θ_R)) itself; this is why the analysis (and the stationarity notion) is phrased in terms of the proximal point.
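The one-dimensional counterexample above is easy to verify numerically: for F(θ) = |θ| with λ_env = 1, the proximal point is the soft-thresholding map, the envelope gradient equals θ on (0, 1), and the ratio dist(0, ∂F(θ)) / ∥∇F_{λ_env}(θ)∥ = 1/θ blows up as θ ↓ 0. A minimal illustrative script (not from the paper's codebase):

```python
def prox_abs(theta: float, lam_env: float = 1.0) -> float:
    """Proximal point of F(u) = |u|: soft-thresholding at lam_env."""
    if theta > lam_env:
        return theta - lam_env
    if theta < -lam_env:
        return theta + lam_env
    return 0.0

def env_grad(theta: float, lam_env: float = 1.0) -> float:
    """Envelope gradient: (theta - prox(theta)) / lam_env."""
    return (theta - prox_abs(theta, lam_env)) / lam_env

for theta in [0.5, 0.1, 0.01]:
    xi = env_grad(theta)            # equals theta for theta in (0, 1)
    assert abs(xi - theta) < 1e-12
    # dist(0, dF(theta)) = 1 since dF(theta) = {1} for theta > 0,
    # so the ratio below is exactly 1/theta and is unbounded as theta -> 0
    ratio = 1.0 / xi
    assert abs(ratio - 1.0 / theta) < 1e-9
```

The same script also confirms (47) at these points: the proximal point is 0, where ∂F(0) = [−1, 1] contains 0, so dist(0, ∂F(θ̂)) = 0 ≤ ∥ξ∥.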
D.1 Practical computation of the proximal point

In the analysis we introduce the proximal point

θ̂_R := prox_{λ_env F}(θ_R) = argmin_{θ∈Θ} { L^W_ρ(θ) + (1/(2λ_env))∥θ − θ_R∥_2² }, (49)

where F(θ) := L^W_ρ(θ) + I_Θ(θ) and λ_env ∈ (0, 1/κ). By weak convexity of F and λ_env < 1/κ, the objective in (49) is strongly convex and hence admits a unique minimizer θ̂_R. Although Algorithm 1 outputs θ_R, one can compute an approximation θ̄_R ≈ θ̂_R as an optional post-processing step, warm-started at θ_R. A simple inner-loop solver is provided below.

A simple inner-loop solver. Let θ^(0) := θ_R. For k = 0, 1, …, K − 1, sample Z^(k) ∼ d_{θ^(k)} and compute the stochastic direction

G_prox(θ^(k); Z^(k)) := G_SAIL(θ^(k); Z^(k)) + λ G_R(θ^(k); Z^(k)) + (1/λ_env)(θ^(k) − θ_R),

which is the usual oracle from Assumption 6 plus the deterministic gradient of the proximal quadratic. Then take a projected step

θ^(k+1) := Π_Θ(θ^(k) − α_k G_prox(θ^(k); Z^(k))), θ̄_R := θ^(K). (50)

Since the proximal quadratic improves conditioning, this inner loop is typically stable in practice. Moreover, the theory only requires θ̄_R to be a sufficiently accurate approximation; exact solves of (49) are not necessary. This is consistent with standard practice in inexact proximal methods, where the proximal subproblem is solved only approximately while maintaining meaningful convergence guarantees (e.g., [45]).

Lemma D.3 (Stationarity of an inexact proximal point). Fix λ_env ∈ (0, 1/κ) and an index R. Let θ̂_R = prox_{λ_env F}(θ_R) and let θ̄_R ∈ Θ be any point. Define the proximal objective

Ψ_R(θ) := F(θ) + (1/(2λ_env))∥θ − θ_R∥_2².

If θ̄_R satisfies the (first-order) proximal residual bound

dist(0, ∂Ψ_R(θ̄_R)) ≤ ε_prox, (51)

then

dist(0, ∂F(θ̄_R)) ≤ ∥∇F_{λ_env}(θ_R)∥_2 + ε_prox + (1/λ_env)∥θ̄_R − θ̂_R∥_2. (52)

In particular, if one ensures ∥θ̄_R − θ̂_R∥_2 ≤ λ_env ε_geom, then

dist(0, ∂F(θ̄_R)) ≤ ∥∇F_{λ_env}(θ_R)∥_2 + ε_prox + ε_geom.

Proof. By subdifferential calculus,

∂Ψ_R(θ) = ∂F(θ) + (1/λ_env)(θ − θ_R).

The residual condition (51) implies there exists g ∈ ∂F(θ̄_R) such that

∥g + (1/λ_env)(θ̄_R − θ_R)∥_2 ≤ ε_prox.

Therefore,

dist(0, ∂F(θ̄_R)) ≤ ∥g∥_2 ≤ ε_prox + (1/λ_env)∥θ̄_R − θ_R∥_2.

Next, by the triangle inequality,

∥θ̄_R − θ_R∥_2 ≤ ∥θ̄_R − θ̂_R∥_2 + ∥θ̂_R − θ_R∥_2.

By Lemma 4.11 (property ∇F_{λ_env}(θ) = (1/λ_env)(θ − prox_{λ_env F}(θ))), we have ∥θ̂_R − θ_R∥_2 = λ_env ∥∇F_{λ_env}(θ_R)∥_2. Substituting yields (52).

Remark D.4 (Practical stopping criteria). Lemma D.3 suggests two inexpensive targets for the inner loop: (i) keep the proximal residual dist(0, ∂Ψ_R(θ^(k))) small (as estimated by a minibatch), and (ii) exploit warm-starting so that ∥θ̄_R − θ̂_R∥_2 remains modest.
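The warm-started inner loop of Eq. (50) can be illustrated on a toy instance where the proximal point has a closed form. The sketch below substitutes a deterministic gradient oracle for F(u) = u²/2 (convex, so κ = 0) in place of the stochastic oracle G_SAIL + λG_R, and takes Θ = ℝ so the projection is the identity; the function names are illustrative, not from the paper's code:

```python
def inner_prox_loop(grad_F, theta_R: float, lam_env: float,
                    alpha: float, K: int) -> float:
    """Gradient descent on u -> F(u) + (1/(2*lam_env)) * (u - theta_R)^2."""
    u = theta_R  # warm start at the algorithm's output, as in Appendix D.1
    for _ in range(K):
        # oracle gradient plus the deterministic gradient of the proximal quadratic
        g = grad_F(u) + (u - theta_R) / lam_env
        u -= alpha * g
    return u

# Toy instance: F(u) = 0.5*u^2, so prox_{lam_env F}(theta) = theta / (1 + lam_env).
lam_env, theta_R = 0.5, 2.0
u_bar = inner_prox_loop(lambda u: u, theta_R, lam_env, alpha=0.1, K=500)
assert abs(u_bar - theta_R / (1.0 + lam_env)) < 1e-8
```

Because the proximal quadratic makes the subproblem strongly convex, the iteration is a contraction here and converges linearly to the closed-form proximal point, matching the remark that the inner loop is typically stable and that only an approximate solve is needed.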