Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models
Learning from self-sampled data and sparse environmental feedback remains a fundamental challenge in training self-evolving agents. Temporal credit assignment mitigates this issue by transforming sparse feedback into dense supervision signals. However, previous approaches typically depend on learning task-specific value functions for credit assignment, which suffer from poor sample efficiency and limited generalization. In this work, we propose to leverage the pretrained knowledge of large language models (LLMs) to transform sparse rewards into dense training signals (i.e., the advantage function) through retrospective in-context learning (RICL). We further propose an online learning framework, RICOL, which iteratively refines the policy based on the credit-assignment results from RICL. We empirically demonstrate that RICL can accurately estimate the advantage function with limited samples and effectively identify critical states in the environment for temporal credit assignment. Evaluations on four BabyAI scenarios show that RICOL achieves convergent performance comparable to that of traditional online RL algorithms, with significantly higher sample efficiency. Our findings highlight the potential of leveraging LLMs for temporal credit assignment, paving the way for more sample-efficient and generalizable RL paradigms.
💡 Research Summary
The paper tackles the long‑standing problem of learning from sparse environmental feedback, which is especially acute in multi‑turn decision‑making tasks where an agent must execute a long sequence of actions before receiving any reward. Traditional temporal credit assignment methods rely on learning value or advantage functions from scratch, demanding large amounts of interaction data and often failing to generalize across tasks.
To overcome these limitations, the authors propose Retrospective In‑Context Learning (RICL), a method that leverages the world knowledge already embedded in large language models (LLMs). An LLM serves directly as the policy: a state is encoded as a textual prompt and an action as a token sequence. After the policy generates a trajectory, a separate “reflector” LLM (π_reflect) receives the full hindsight trajectory and produces a natural‑language feedback sentence fₜ for each visited state sₜ. This feedback is appended to the original prompt, yielding an in‑context updated policy π′(·|sₜ) = π₀(·|sₜ, fₜ).
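The in-context update described above can be made concrete with a small sketch. Everything here is illustrative: the scorer and reflector are toy stand-ins for the actual LLM calls, and all function names (`base_policy_logprob`, `reflector`, `updated_policy_logprob`) are our own, not the paper's.

```python
# Toy sketch of RICL's in-context policy update. Stub functions stand in
# for the real LLM policy pi_0 and the reflector pi_reflect.

def base_policy_logprob(prompt: str, action: str) -> float:
    """Stand-in for log pi_0(a|s): favors actions mentioned in the prompt.
    A real system would score the action tokens with an LLM."""
    return 0.0 if action in prompt else -2.0

def reflector(trajectory: list) -> dict:
    """Stand-in for pi_reflect: maps each visited state s_t to a
    natural-language feedback sentence f_t, given the full trajectory."""
    return {s: f"Hint: prefer '{a}' here." for (s, a, _) in trajectory}

def updated_policy_logprob(state: str, feedback: str, action: str) -> float:
    """log pi'(a|s) = log pi_0(a | s, f_t): feedback appended to the prompt."""
    return base_policy_logprob(state + " " + feedback, action)

# One hindsight trajectory of (state, action, reward) triples.
traj = [("s0", "go_left", 0.0), ("s1", "open_door", 1.0)]
fb = reflector(traj)

# The quantity RICL cares about: how much the feedback shifts the
# log-probability of an action at a state.
delta = (updated_policy_logprob("s0", fb["s0"], "go_left")
         - base_policy_logprob("s0", "go_left"))
```

In this toy, the feedback sentence mentions the action, so the "updated" prompt raises its score; with a real LLM, the shift comes from the model conditioning on the reflector's hindsight feedback.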
The key theoretical insight (Theorem 4.1) shows that for any two policies π₀ and π′ there exists a reward function r such that
β · log π′(a|s) − β · log π₀(a|s) ∝ A^{r}_{π₀}(s,a),
where A^{r}_{π₀} is the advantage function under r and β > 0 is a scaling constant. Consequently, the log-probability difference between the original and the in-context-updated policy directly estimates the advantage function without training any additional value network. By collecting n trajectories, applying RICL to each visited state, and averaging the log-probability differences, the algorithm obtains a sample-based estimate Ā_{π₀}(s,a).
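The sample-based estimator is just an average of scaled log-probability differences. A minimal sketch, assuming a fixed β (the concrete value below is ours, not the paper's) and log-probabilities already computed for each of the n RICL applications:

```python
BETA = 1.0  # scaling constant beta > 0; the value is an assumption of this sketch

def estimate_advantage(logp_updated: list, logp_base: list) -> float:
    """A_bar(s,a): average of beta * (log pi'(a|s) - log pi_0(a|s))
    over n RICL applications, estimating the advantage up to the
    proportionality constant in Theorem 4.1."""
    n = len(logp_base)
    return sum(BETA * (lu - lb) for lu, lb in zip(logp_updated, logp_base)) / n

# Toy numbers: feedback raised the action's log-probability in all 3 samples,
# so the estimated advantage is positive.
a_bar = estimate_advantage([-0.5, -0.7, -0.6], [-1.5, -1.6, -1.7])
```

With these toy inputs the three differences are 1.0, 0.9, and 1.1, so the average is 1.0; a negative average would indicate the reflector's feedback steered probability away from the action.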
Building on this estimate, the authors introduce RICOL (Retrospective In‑Context Online Learning), an online RL algorithm that iteratively refines the policy. Each iteration k proceeds as follows:
- Trajectory collection – policy πₖ interacts with the environment to generate a batch of trajectories.
- Policy evaluation – for every state sₜ in the batch, RICL is applied to produce π′ₖ₊₁(·|sₜ) and the corresponding advantage estimate Ā_{πₖ}(sₜ,·).
- Policy improvement – using advantage‑weighted regression, πₖ is updated by minimizing a KL‑regularized objective that mixes the original policy logits and the logits of the in‑context‑updated policy:
min_π E_{s∼τₖ}
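The three steps above can be sketched as a skeletal loop. This is illustrative scaffolding only: environment and LLM calls are stubbed out as injected callables, and the `improve` step stands in for the paper's KL-regularized, advantage-weighted objective, whose exact form is not reproduced here.

```python
# Skeletal RICOL iteration: collect, evaluate with RICL, improve.
# All callables are stand-ins supplied by the caller; names are ours.

def ricol(policy, env_rollout, ricl_advantage, improve, iters=3):
    for k in range(iters):
        # 1) Trajectory collection: policy pi_k interacts with the environment.
        batch = env_rollout(policy)
        # 2) Policy evaluation: RICL yields an advantage estimate per (s, a).
        adv = {(s, a): ricl_advantage(policy, s, a)
               for traj in batch for (s, a, _) in traj}
        # 3) Policy improvement: stand-in for the KL-regularized,
        #    advantage-weighted update of pi_k.
        policy = improve(policy, adv)
    return policy
```

Because the components are injected, the loop can be exercised with trivial stubs (a counter for the policy, a constant advantage) before wiring in real LLM rollouts.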