LLMs as High-Dimensional Nonlinear Autoregressive Models with Attention: Training, Alignment and Inference

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large language models (LLMs) based on transformer architectures are typically described through collections of architectural components and training procedures, obscuring their underlying computational structure. This review article provides a concise mathematical reference for researchers seeking an explicit, equation-level description of LLM training, alignment, and generation. We formulate LLMs as high-dimensional nonlinear autoregressive models with attention-based dependencies. The framework encompasses pretraining via next-token prediction, alignment methods such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), rejection sampling fine-tuning (RSFT), and reinforcement learning from verifiable rewards (RLVR), as well as autoregressive generation during inference. Self-attention emerges naturally as a repeated bilinear–softmax–linear composition, yielding highly expressive sequence models. This formulation enables principled analysis of alignment-induced behaviors (including sycophancy), inference-time phenomena (such as hallucination, in-context learning, chain-of-thought prompting, and retrieval-augmented generation), and extensions like continual learning, while serving as a concise reference for interpretation and further theoretical development.


💡 Research Summary

This review paper reframes transformer‑based large language models (LLMs) as high‑dimensional, high‑order nonlinear autoregressive (AR) systems with attention‑mediated dependencies. The authors present a unified, equation‑level description that spans the entire lifecycle of an LLM: pre‑training, alignment (or fine‑tuning), and inference. By treating the model as a Δ‑order AR process—where Δ denotes the context window (typically on the order of 10⁵ tokens) and d denotes the hidden‑state dimension (often 1 000–10 000)—the paper makes explicit the mapping from a token sequence to a probability distribution over the next token.
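The autoregressive view above can be made concrete with a toy decoding loop. This is a minimal sketch, not the paper's architecture: `toy_next_token_logits` is a hypothetical stand-in for the learned map (f_{\theta,t}), and the window of the last `delta` tokens plays the role of the Δ-order AR dependence.

```python
import numpy as np

def toy_next_token_logits(context, vocab_size=5):
    """Stand-in for f_theta: any map from a token window to logits.
    This toy simply favours (last token + 1) mod vocab_size."""
    logits = np.zeros(vocab_size)
    logits[(context[-1] + 1) % vocab_size] = 2.0
    return logits

def generate(prompt, steps, delta=4):
    """Greedy autoregressive decoding: at each step, condition on the
    last `delta` tokens (the AR order / context window) and append the argmax."""
    seq = list(prompt)
    for _ in range(steps):
        window = seq[-delta:]                 # Δ-order AR dependence
        probs = np.exp(toy_next_token_logits(window))
        probs /= probs.sum()                  # softmax over the vocabulary
        seq.append(int(np.argmax(probs)))     # greedy next-token choice
    return seq

print(generate([0], 4))  # → [0, 1, 2, 3, 4]
```

Real LLM inference replaces the toy logit function with the full transformer stack and greedy argmax with a sampling scheme, but the outer loop has exactly this shape.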

Core mathematical formulation
Tokens (x_t) are embedded via a learned matrix (W_{\text{emb}}) into vectors (e_t\in\mathbb{R}^d). A causal transformer (F_\theta) processes the entire embedded sequence in parallel, yielding hidden states (s_t = F_\theta(E)_t). Causal masking guarantees that each (s_t) depends only on the most recent Δ embeddings. The hidden state at the final layer, denoted (h_{L,t}=f_{\theta,t}(x_{t-\Delta+1:t})), is the result of L repeated self‑attention blocks. Each block can be written as a bilinear‑softmax‑linear composition: queries and keys are formed by linear maps (M^{(\ell)}) and (N^{(\ell)}); attention scores are inner products (a^{(\ell)}_t(r)=\langle M^{(\ell)}h_t, N^{(\ell)}h_r\rangle); a softmax normalises these scores; the values are linearly transformed by (L^{(\ell)}) and summed; finally a non‑linear function (\varphi) (capturing residual connections, layer‑norm, and feed‑forward networks) is applied. Repeating this structure across layers yields a highly expressive mapping from past tokens to a contextual representation.
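One such block can be sketched directly in NumPy. This is illustrative, not the paper's exact parameterisation: the matrices `M`, `N`, `L_mat` play the roles of (M^{(\ell)}), (N^{(\ell)}), (L^{(\ell)}), and (\varphi) is stood in by a simple residual ReLU update rather than the full layer-norm/feed-forward machinery.

```python
import numpy as np

def attention_block(H, M, N, L_mat):
    """One layer of the bilinear-softmax-linear composition.
    H: (T, d) hidden states; M, N, L_mat: (d, d) query/key/value maps."""
    T, _ = H.shape
    scores = (H @ M.T) @ (H @ N.T).T             # a_t(r) = <M h_t, N h_r>
    mask = np.tril(np.ones((T, T), dtype=bool))  # causal: t attends only to r <= t
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    mixed = weights @ (H @ L_mat.T)              # value transform by L, then sum
    return H + np.maximum(mixed, 0.0)            # phi: residual + ReLU stand-in
```

Because of the causal mask, perturbing a later hidden state leaves all earlier outputs unchanged, which is exactly the property that makes the stacked composition an autoregressive map.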

The next‑token logits are computed as (z_t = W_{\text{out}}h_{L,t}+b_{\text{out}}) and turned into a probability vector (\pi_t = \operatorname{softmax}(z_t)). Pre‑training minimises the cross‑entropy loss (\mathcal{L}_{\text{pre}}(\theta) = -\sum_{n,t}\log \pi_{\theta}(x_{t+1}^{(n)}\mid x_{1:t}^{(n)})) over massive corpora (up to 20 trillion tokens).
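The logit/softmax/cross-entropy step can be rendered in a few lines. A minimal sketch, where `W_out` and `b_out` are placeholder parameters and the log-softmax is written in its numerically stable form:

```python
import numpy as np

def next_token_loss(h_L, targets, W_out, b_out):
    """Mean cross-entropy of the next-token distribution, as in L_pre.
    h_L: (T, d) final-layer states; targets: (T,) next-token ids;
    W_out: (V, d) output matrix; b_out: (V,) output bias."""
    z = h_L @ W_out.T + b_out                            # logits z_t
    z = z - z.max(axis=1, keepdims=True)                 # numerical stability
    log_pi = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log softmax
    return -log_pi[np.arange(len(targets)), targets].mean()
```

As a sanity check, with all-zero parameters the distribution is uniform over a vocabulary of size V, so the loss equals log V.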

Alignment methods
After pre‑training, the base model (\theta^\star) is adapted to follow human intent and safety constraints. The paper formalises four major families:

  1. RLHF (Reinforcement Learning from Human Feedback) – a PPO‑style optimisation that maximises expected human‑derived reward while penalising divergence from the base policy:
    (\max_\theta\; \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}[r_\phi(x,y)] - \beta\, D_{\text{KL}}(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\theta^\star}(\cdot\mid x)))
  2. DPO (Direct Preference Optimization) – recasts preference alignment as a direct classification loss on preferred/dispreferred completion pairs, eliminating the explicit reward model.
  3. RSFT (Rejection Sampling Fine‑Tuning) – draws multiple completions per prompt, retains only the highest‑scoring ones, and fine‑tunes on them with the standard cross‑entropy loss.
  4. RLVR (Reinforcement Learning from Verifiable Rewards) – replaces the learned reward model with a programmatic verifier (e.g. test cases or exact‑answer checks), yielding an exact rather than preference‑estimated reward signal.
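A toy Monte‑Carlo estimate of such a KL‑penalised objective can make the trade-off concrete. Everything here is illustrative: the per-sample arrays stand in for rewards and log-probabilities of sampled completions, and the KL term is approximated by the mean log-ratio over samples.

```python
import numpy as np

def rlhf_objective(rewards, logp_theta, logp_base, beta=0.1):
    """Sample-based estimate of: E[reward] - beta * KL(pi_theta || pi_base).
    rewards, logp_theta, logp_base: per-sampled-completion arrays."""
    kl_est = np.mean(logp_theta - logp_base)   # Monte-Carlo KL estimate
    return np.mean(rewards) - beta * kl_est
```

Raising beta pulls the policy back toward the base model; beta = 0 recovers pure reward maximisation, which is the regime where reward hacking and sycophancy-like behaviours are most likely to emerge.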
