TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design
Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value-function loss. Building on these approaches, we introduce the transition-prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called Co-Learnability. By combining these two measures, we present Transition-aware Regret Approximation with Co-learnability for Environment Design (TRACED). Empirical evaluations show that TRACED produces curricula that improve zero-shot generalization over strong baselines across multiple benchmarks. Ablation studies confirm that the transition-prediction error drives rapid complexity ramp-up and that Co-Learnability delivers additional gains when paired with the transition-prediction error. These results demonstrate how refined regret approximation and explicit modeling of task relationships can be leveraged for sample-efficient curriculum design in UED. Project Page: https://geonwoo.me/traced/
💡 Research Summary
The paper tackles the challenge of generalizing deep reinforcement‑learning agents to unseen environments by improving Unsupervised Environment Design (UED), a co‑evolutionary framework in which a teacher generates curricula and a student learns a robust policy. Existing UED methods estimate the "regret" of a task (the gap between optimal and current performance) using coarse proxies such as Positive Value Loss (PVL) or the maximum observed return, which ignore errors in the learned dynamics model.
TRACED introduces a more faithful regret approximation by adding a transition‑prediction loss (TPL) to the PVL. The authors decompose regret into three components: value‑estimation error, reward gap, and future‑value gap. The future‑value gap depends not only on value‑function error but also on the mismatch between the learned transition model (\hat P) and the true dynamics (P). By training a recurrent transition model and measuring its one‑step reconstruction error, they compute an Average Transition‑Prediction Loss (ATPL) for each episode and combine it with PVL as a weighted sum,

\[ \text{Score}(\tau) \;=\; \text{PVL}(\tau) \;+\; \beta \, \text{ATPL}(\tau), \]

where \beta balances the value‑based and transition‑based terms.
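To make the scoring recipe concrete, here is a minimal sketch of how such a combined regret proxy could be computed per episode. The function names, the GAE‑style positive clipping for PVL, the mean‑squared one‑step error for ATPL, and the weighting coefficient `beta` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def positive_value_loss(td_errors, gamma=0.99, lam=0.95):
    """PVL sketch: mean of positively clipped GAE-style advantages
    accumulated backward over one episode's TD errors."""
    T = len(td_errors)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        gae = td_errors[t] + gamma * lam * gae
        advantages[t] = gae
    return float(np.mean(np.maximum(advantages, 0.0)))

def average_transition_prediction_loss(pred_next_obs, true_next_obs):
    """ATPL sketch: mean one-step reconstruction error of the learned
    transition model's predicted next observations (MSE is an assumption)."""
    per_step = np.mean((np.asarray(pred_next_obs) - np.asarray(true_next_obs)) ** 2, axis=-1)
    return float(np.mean(per_step))

def task_score(td_errors, pred_next_obs, true_next_obs, beta=1.0):
    """Hypothetical combined regret proxy: PVL + beta * ATPL."""
    return positive_value_loss(td_errors) + beta * average_transition_prediction_loss(
        pred_next_obs, true_next_obs
    )
```

In a UED loop, the teacher would compute this score for each candidate task after a student rollout and prioritize tasks with the highest values; the hyperparameters (`gamma`, `lam`, `beta`) shown here are placeholders.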