Quantifying Memory Use in Reinforcement Learning with Temporal Range

Reading time: 5 minutes
...

📝 Original Info

  • Title: Quantifying Memory Use in Reinforcement Learning with Temporal Range
  • ArXiv ID: 2512.06204
  • Date: 2025-12-05
  • Authors: Rodney Lafuente-Mercado, Daniela Rus, T. Konstantin Rusch

📝 Abstract

How much does a trained RL policy actually use its past observations? We propose \emph{Temporal Range}, a model-agnostic metric that treats first-order sensitivities of multiple vector outputs across a temporal window to the input sequence as a temporal influence profile and summarizes it by the magnitude-weighted average lag. Temporal Range is computed via reverse-mode automatic differentiation from the Jacobian blocks $\partial y_s/\partial x_t\in\mathbb{R}^{c\times d}$ averaged over final timesteps $s\in\{t+1,\dots,T\}$ and is well-characterized in the linear setting by a small set of natural axioms. Across diagnostic and control tasks (POPGym; flicker/occlusion; Copy-$k$) and architectures (MLPs, RNNs, SSMs), Temporal Range (i) remains small in fully observed control, (ii) scales with the task's ground-truth lag in Copy-$k$, and (iii) aligns with the minimum history window required for near-optimal return as confirmed by window ablations. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory. Our axiomatic treatment draws on recent work on range measures, specialized here to temporal lag and extended to vector-valued outputs in the RL setting. Temporal Range thus offers a practical per-sequence readout of memory dependence for comparing agents and environments and for selecting the shortest sufficient context.
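To make the magnitude-weighted average lag concrete, here is a small worked example using the definition given in the full text below (the numbers are illustrative, not taken from the paper). Suppose $T = 4$ and the per-step weights obtained from the Jacobian norms are $w_1 = 0.1$, $w_2 = 0.1$, $w_3 = 0.8$, $w_4 = 0$. The lags $T - t$ are $3, 2, 1, 0$, so

$$\hat\rho_T = \frac{0.1\cdot 3 + 0.1\cdot 2 + 0.8\cdot 1 + 0\cdot 0}{0.1 + 0.1 + 0.8} = \frac{1.3}{1.0} = 1.3,$$

i.e. on this sequence the policy effectively looks about 1.3 steps into the past.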

💡 Deep Analysis

[Figure 1]

📄 Full Content

QUANTIFYING MEMORY USE IN REINFORCEMENT LEARNING WITH TEMPORAL RANGE

Rodney Lafuente-Mercado (MIT Lincoln Laboratory, Rodney.LafuenteMercado@ll.mit.edu), Daniela Rus (CSAIL, MIT, rus@csail.mit.edu), T. Konstantin Rusch (ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, and Tübingen AI Center, tkrusch@tue.ellis.eu)

1 INTRODUCTION

Reinforcement learning (RL) has a long-standing history of utilizing memory to improve performance in complex environments (Hausknecht & Stone, 2015; Berner et al., 2019; Chen et al., 2021; Lu et al., 2023). Examples of machine learning models that incorporate memory include classical Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) models (Hochreiter & Schmidhuber, 1997), Transformers (Vaswani et al., 2017), and, more recently, State-Space Models (SSMs) (Gu et al., 2021; Gu & Dao, 2023). However, a rigorous analysis and quantitative measure of the extent to which a trained policy utilizes historical information remains largely absent. This matters in partially observed settings: if effective history dependence is short, simpler architectures or shorter contexts suffice; if it is long, we should see it directly in the learned policy rather than infer it from task design or model choice. Current practice relies on indirect signals, such as model-class comparisons, environment-specific probes, or sample-complexity bounds (Williams, 2009; Efroni et al., 2022; Morad et al., 2023). These conflate optimization, inductive bias, and true memory demand, and they do not yield a comparable, sequence-level number.

To address this, we formalize a Temporal Range metric that aggregates vector-output Jacobians over lags into a magnitude-weighted average look-back, axiomatized for uniqueness. Concretely, for each position $t$ we average Jacobian norms $\|J_{s,t}\|_{\mathrm{mat}}$ over all subsequent final timesteps $s \in \{t+1,\dots,T\}$, forming per-step weights

$$w_t = \frac{1}{T-t}\sum_{s=t+1}^{T}\|J_{s,t}\|_{\mathrm{mat}},$$

and report

$$\hat\rho_T = \frac{\sum_{t=1}^{T} w_t\,(T-t)}{\sum_{t=1}^{T} w_t} \in [0,\, T-1], \qquad \text{with } \sum_{t=1}^{T} w_t > 0.$$
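The quantity above can be computed with a few lines of reverse-mode autodiff. The following is a minimal PyTorch sketch, not the authors' implementation; the toy `TinyGRUPolicy`, the choice of Frobenius matrix norm, and the function name `temporal_range` are assumptions made for illustration.

```python
# Minimal sketch: per-sequence Temporal Range via reverse-mode autodiff.
# Assumes a policy mapping an observation sequence (T, d) to per-step outputs (T, c).
import torch

class TinyGRUPolicy(torch.nn.Module):
    """Illustrative toy policy, not the paper's architecture."""
    def __init__(self, d=4, h=16, c=2):
        super().__init__()
        self.rnn = torch.nn.GRU(input_size=d, hidden_size=h)
        self.head = torch.nn.Linear(h, c)

    def forward(self, x):                      # x: (T, d) observation sequence
        out, _ = self.rnn(x.unsqueeze(1))      # (T, 1, h), batch of size 1
        return self.head(out.squeeze(1))       # (T, c) per-step outputs

def temporal_range(policy, x):
    """Normalized Temporal Range rho_hat_T in [0, T-1] for one sequence x of shape (T, d)."""
    T = x.shape[0]
    # Jacobian of every output step w.r.t. every input step: shape (T, c, T, d).
    J = torch.autograd.functional.jacobian(policy, x)
    # Per-step weights w_t: mean matrix norm of the blocks J_{s,t} over s > t.
    w = torch.zeros(T)
    for t in range(T - 1):
        blocks = J[t + 1:, :, t, :]                        # (T - 1 - t, c, d)
        w[t] = torch.linalg.matrix_norm(blocks).mean()     # Frobenius norm per block
    total = w.sum()
    if total == 0:
        return torch.tensor(0.0)               # no cross-time sensitivity detected
    lags = torch.arange(T - 1, -1, -1, dtype=w.dtype)      # lag T - t for t = 1..T
    return (w * lags).sum() / total

# Usage on an (untrained) toy policy and a random observation sequence:
policy = TinyGRUPolicy()
x = torch.randn(20, 4)
print(float(temporal_range(policy, x)))
```

Any differentiable policy that maps an observation sequence of shape (T, d) to per-step outputs of shape (T, c) can be dropped in for `TinyGRUPolicy`.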
Thus $\hat\rho_T$ answers "how far back is this policy looking here?" at the level of a specific rollout and timestep. Our theoretical starting point is the axioms of range from Bamberger et al. (2025); we specialize them to temporal lag, extend them to vector-valued outputs, and study their consequences for reinforcement learning agents.

For vector-output linear maps we give a short axiomatic derivation that fixes both the unnormalized and normalized forms: single-step calibration, additivity over disjoint time indices (or magnitude-weighted averaging), and absolute homogeneity identify the unique matrix-norm-weighted lag sums/averages. The same formulas applied to the local linearization yield our nonlinear policy metric. The normalized form is invariant to uniform input rescaling (change of units) and uniform output rescaling, making cross-agent and cross-environment comparisons straightforward.

Computationally, Temporal Range is inexpensive. The required Jacobian blocks are policy derivatives with respect to observation inputs and can be obtained with standard reverse-mode automatic differentiation on the policy alone. When direct auto-differentiation on a given model is impractical, we train a compact LEM policy on the same task and compute the same quantities on this proxy.

We validate Temporal Range across POPGym diagnostics and classic control with Multi-Layer Perceptrons (MLPs), gated RNNs (e.g., LSTMs, Gated Recurrent Units (GRUs) (Chung et al., 2014), and LEM), as well as SSMs. The metric (i) stays near zero in fully observed control...
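The window ablations referenced above evaluate return as a function of history length. As a lighter per-sequence stand-in (an assumption of this sketch, not the paper's return-based protocol), one can check how short a suffix of the observations already reproduces the full-history output; `min_sufficient_window` is an illustrative name.

```python
# Sketch of a behavioral window-ablation check for a torch policy that maps
# an observation sequence (T, d) to per-step outputs (T, c).
import torch

def min_sufficient_window(policy, x, tol=1e-3):
    """Smallest look-back k (in steps) such that feeding only the last k+1
    observations reproduces the final-step output within relative tolerance tol."""
    T = x.shape[0]
    with torch.no_grad():
        full = policy(x)[-1]                       # final-step output, full history
        for k in range(T):
            trunc = policy(x[T - 1 - k:])[-1]      # only the last k+1 observations
            if torch.norm(trunc - full) <= tol * (torch.norm(full) + 1e-8):
                return k
    return T - 1

# Example with the TinyGRUPolicy sketch above:
# print(min_sufficient_window(TinyGRUPolicy(), torch.randn(50, 4)))
```

The paper's claim is that $\hat\rho_T$ aligns with the minimum history window required for near-optimal return; a behavioral check like this is only a cheap per-sequence proxy for that return-based ablation.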

📸 Image Gallery

CartPole-v1_lem_temporal_profile.png CartPole-v1_window_curves.png NoisyStatelessCartPoleMedium_gru_temporal_profile.png NoisyStatelessCartPoleMedium_lem_temporal_profile.png NoisyStatelessCartPoleMedium_linoss_temporal_profile.png NoisyStatelessCartPoleMedium_lstm_temporal_profile.png RepeatPrevious10_gru_temporal_profile.png RepeatPrevious10_lem_temporal_profile.png RepeatPrevious10_linoss_temporal_profile.png RepeatPrevious10_lstm_temporal_profile.png RepeatPrevious10_window_curves.png RepeatPrevious1_lem_temporal_profile.png RepeatPrevious1_window_curves.png RepeatPrevious3_gru_temporal_profile.png RepeatPrevious3_lem_temporal_profile.png RepeatPrevious3_linoss_temporal_profile.png RepeatPrevious3_lstm_temporal_profile.png RepeatPrevious3_window_curves.png RepeatPrevious5_lem_temporal_profile.png RepeatPrevious5_window_curves.png StatelessCartPoleMedium_lem_temporal_profile.png StatelessCartPoleMedium_window_curves.png

Reference

This content is AI-processed based on open access ArXiv data.
