ReflexGrad: A Dual-Process Architecture for Gradient-Free Inference-Time Learning

Scaling inference-time compute has emerged as a powerful paradigm–yet deliberating longer is not the same as learning. Current approaches to extended reasoning in large language models allocate more computation to thinking but remain fundamentally static: they cannot adapt from mistakes encountered during execution. Online reinforcement learning offers adaptation but requires gradient updates at runtime–expensive, prone to catastrophic forgetting, and unstable in deployment. We introduce ReflexGrad, a gradient-free framework for genuine inference-time learning: adaptation without retraining, without weight updates, without demonstrations. Our key insight is that effective runtime learning requires two complementary mechanisms–rapid policy refinement during forward progress, and deliberate causal diagnosis when stuck–with intelligent routing between them. ReflexGrad implements this by optimizing a natural language “policy” through textual feedback while keeping model weights frozen. When failures occur, the system analyzes recent action-outcome sequences to identify root causes and immediately applies corrections within the same execution–eliminating the need for multiple trials. Evaluated zero-shot across diverse interactive tasks without task-specific engineering, ReflexGrad achieves strong single-execution performance, demonstrating that gradient-free inference-time learning is not just theoretically appealing but practically viable.

📜 Original Paper Content