JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation


Current autoregressive language models couple high-level reasoning and low-level token generation into a single sequential process, making the reasoning trajectory vulnerable to compounding expression errors. We propose JEPA-Reasoner, a novel architectural paradigm that decouples these tasks using a Joint-Embedding Predictive Architecture (JEPA) for pure latent-space reasoning and a separate Talker module for linguistic reconstruction. By isolating the reasoning engine from the discrete token-sampling process, our architecture enables: (1) Error Containment, where token-level failures cannot propagate into the latent reasoning chain; (2) Continuous Guidance, providing the generator with access to the entire lossless reasoning trajectory; and (3) Representation of Uncertainty, allowing the model to maintain multiple hypotheses via mixed latent vectors. Controlled experiments on synthetic and natural language tasks demonstrate that this decoupling enables a 0.9B model to achieve a 149.5% improvement in 8-shot GSM8K accuracy over a coupled Transformer baseline trained on identical data. This work shifts the focus from scaling coupled models to investigating decoupled architectures as a more robust foundation for complex reasoning.


💡 Research Summary

JEPA‑Reasoner introduces a fundamentally new architecture for language models by completely separating high‑level reasoning from low‑level token generation. The system consists of two independent modules: (1) a JEPA‑Reasoner that operates entirely in a continuous latent space, and (2) a Talker that reconstructs human‑readable text from the latent trajectory.

The JEPA‑Reasoner builds on the Joint‑Embedding Predictive Architecture (JEPA). Input tokens are first embedded, then processed by modified Transformer blocks that include a hybrid RMS/L2 normalization and a non‑learnable QK‑Norm for numerical stability. Instead of projecting the output onto a vocabulary distribution via a language‑model head, the model predicts the next latent vector, normalizes it onto the unit hypersphere, and feeds it back as the input for the next step. This creates a fully autoregressive chain of latent states that is mathematically independent of any token‑sampling process. Consequently, token‑level mistakes cannot affect the logical flow of the reasoning chain—a property the authors formalize with a probabilistic factorization.
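The latent rollout described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `reasoner_step` stands in for the modified Transformer blocks, and the dimensions are arbitrary.

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    # Project a latent vector onto the unit hypersphere.
    return v / (np.linalg.norm(v) + eps)

def latent_rollout(h0, reasoner_step, n_steps):
    # Autoregressive chain of latent states: each predicted latent is
    # renormalized onto the hypersphere and fed back as the next input.
    # No token is sampled inside the loop, so decoding errors have no
    # pathway back into the chain.
    h = l2_normalize(h0)
    trajectory = [h]
    for _ in range(n_steps):
        h = l2_normalize(reasoner_step(h))
        trajectory.append(h)
    return trajectory

# Toy stand-in for the Reasoner's Transformer blocks: a fixed linear map.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) / 4.0
traj = latent_rollout(rng.standard_normal(16), lambda h: W @ h, n_steps=5)
```

Because the chain is closed over latent states alone, the Talker can later read the whole trajectory without ever having influenced it.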

The Talker module receives the completed latent sequence and translates it into tokens. Two variants are explored: Mono‑Talker, a decoder‑only model for context‑free reconstruction, and Dual‑Talker, an encoder‑decoder architecture that can incorporate additional context. Because the Reasoner is kept frozen during Talker training in all experiments, reconstruction quality depends entirely on the quality of the latent reasoning output.
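One simple way a decoder-only Talker can condition on a frozen latent trajectory is to prepend it to the token embeddings as a soft prefix. This interface is an assumption for illustration, not necessarily the paper's exact design:

```python
import numpy as np

def mono_talker_inputs(latent_traj, token_embeds):
    # Prepend the frozen latent trajectory to the token embeddings as a
    # soft prefix, so every decoding step can attend to the entire
    # lossless reasoning trajectory.
    return np.concatenate([np.asarray(latent_traj),
                           np.asarray(token_embeds)], axis=0)

# Three latent steps of width 8, followed by five token embeddings.
inputs = mono_talker_inputs(np.ones((3, 8)), np.zeros((5, 8)))
```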

Training proceeds in two stages. First, a conventional decoder‑only pre‑training phase teaches the model basic linguistic competence using next‑token prediction with tied embeddings and a temporary LM head. This phase also aligns token embeddings and latent vectors angularly, easing the later transition. Second, a self‑supervised training (SST) phase replaces the LM head, restores L2 normalization, and optimizes a scaled cosine‑distance loss: L = k·(1 − cos(h_pred, h_target)), with k = 4. The target latent vectors are generated by an exponential moving average (EMA) of the online embedding weights (momentum = 0.98), preventing rank collapse while allowing gradual angular adjustment.
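The SST loss and the EMA target update can be written down directly from the description above (a sketch; variable names are ours):

```python
import numpy as np

def cosine_latent_loss(h_pred, h_target, k=4.0):
    # Scaled cosine-distance loss from the SST phase:
    # L = k * (1 - cos(h_pred, h_target)), with k = 4.
    cos = np.dot(h_pred, h_target) / (
        np.linalg.norm(h_pred) * np.linalg.norm(h_target) + 1e-8)
    return k * (1.0 - cos)

def ema_update(target, online, momentum=0.98):
    # Target embeddings drift slowly toward the online embeddings via an
    # exponential moving average; the slow update keeps targets stable
    # enough to prevent rank collapse while still permitting gradual
    # angular adjustment.
    return momentum * target + (1.0 - momentum) * online
```

Note that the loss depends only on the angle between prediction and target, which is consistent with latents being constrained to the unit hypersphere.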

The authors evaluate the architecture on synthetic tasks designed to probe specific capabilities and on a real‑world math‑reasoning benchmark (GSM8K). In a binary‑tree search task, a 42 M parameter Reasoner paired with Mono‑Talker achieves 99.87 % exact‑match accuracy. Principal component analysis of the predicted latent vectors shows they occupy a continuous cloud spanned by the embeddings of sibling nodes, confirming the model’s ability to encode mixed hypotheses rather than committing to a single token. A context‑free grammar generation task further demonstrates robustness to error propagation, with near‑perfect success rates.
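The "mixed hypothesis" behavior seen in the PCA analysis can be illustrated as a renormalized convex combination of candidate embeddings (an illustrative construction, not the model's learned mechanism):

```python
import numpy as np

def mixed_hypothesis(embeddings, weights):
    # Blend candidate-node embeddings into one latent state and project
    # it back onto the unit hypersphere. A single latent vector can thus
    # keep several branches "alive" instead of committing to one token.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # convex weights
    mix = w @ np.asarray(embeddings)       # weighted sum of embeddings
    return mix / np.linalg.norm(mix)

rng = np.random.default_rng(1)
left, right = rng.standard_normal(8), rng.standard_normal(8)
h = mixed_hypothesis([left, right], [0.5, 0.5])  # equal-weight blend
```

A vector built this way lies in the span of the sibling embeddings, matching the continuous cloud the PCA reveals.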

On GSM8K, a 0.9 B parameter JEPA‑Reasoner outperforms a coupled Transformer baseline trained on the same data by 149.5 % in 8‑shot accuracy. This dramatic gain is achieved without scaling model size or employing sophisticated reinforcement‑learning tricks, highlighting the efficiency of the decoupled design.

Key contributions include:

  1. Mathematical error containment – proof that token‑level sampling errors have no pathway to influence latent reasoning.
  2. Mixed latent representations – empirical evidence that the Reasoner can maintain multiple possible reasoning paths simultaneously, enabling a form of uncertainty modeling.
  3. Training efficiency – a single forward pass in latent space generates the entire reasoning chain, avoiding the multiple passes or recurrent unrolling required by coupled approaches.
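The error-containment claim (item 1) amounts to a factorization of the joint distribution. The notation below is our reconstruction from the summary, not necessarily the paper's exact formulation:

$$
p(y_{1:T},\, h_{1:N} \mid x) \;=\; \underbrace{\prod_{n=1}^{N} p(h_n \mid h_{<n},\, x)}_{\text{Reasoner (latent chain)}} \;\cdot\; \underbrace{\prod_{t=1}^{T} p(y_t \mid y_{<t},\, h_{1:N},\, x)}_{\text{Talker (token decoding)}}
$$

Because the first factor contains no token variables $y_t$, a sampled token can never perturb the latent chain.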

Limitations are acknowledged. The Talker currently functions only as a reconstruction module; it does not perform conditional generation, which restricts end‑to‑end generation flexibility. EMA‑based target embeddings update at a fixed rate, potentially slowing adaptation to rapid domain shifts. Moreover, the representation of uncertainty remains a linear combination of embeddings rather than a fully probabilistic distribution.

Future work suggested by the authors includes extending the Talker to a generative decoder, integrating multimodal inputs (e.g., vision, audio) into the latent reasoning pipeline, and incorporating Bayesian techniques to quantify uncertainty more rigorously. Dynamic EMA schedules or contrastive learning could further improve domain adaptability.

In summary, JEPA‑Reasoner demonstrates that decoupling reasoning from token generation can dramatically improve robustness, expressiveness, and efficiency of language models, offering a promising new direction for the design of next‑generation foundation models.

