Exploring compact reinforcement-learning representations with linear regression
This paper presents a new algorithm for online linear regression whose efficiency guarantees satisfy the requirements of the KWIK (Knows What It Knows) framework. The algorithm improves on the complexity bounds of the current state-of-the-art procedure in this setting. We explore several applications of this algorithm for learning compact reinforcement-learning representations. We show that KWIK linear regression can be used to learn the reward function of a factored MDP and the probabilities of action outcomes in Stochastic STRIPS and Object Oriented MDPs, none of which have been proven to be efficiently learnable in the RL setting before. We also combine KWIK linear regression with other KWIK learners to learn larger portions of these models, including experiments on learning factored MDP transition and reward functions together.
💡 Research Summary
The paper introduces a novel online linear regression algorithm that satisfies the KWIK (Knows What It Knows) framework, addressing long‑standing inefficiencies of existing KWIK linear regression methods. The authors first identify the computational bottlenecks of prior approaches, which typically require O((d³/ε³) log(1/δ)) samples and O(d³ log(1/δ)) arithmetic operations per step, where d is the feature dimension, ε the desired accuracy, and δ the failure probability. Their new algorithm, dubbed KWIK‑Linear‑Regression (KLR), reduces both sample and time complexity by (1) maintaining an incremental QR decomposition to avoid full matrix inversions, and (2) employing a residual‑based "I don't know" trigger that incorporates only data points whose prediction error exceeds ε. Theoretical analysis establishes three key results: (i) KLR is KWIK‑compliant, guaranteeing at most O((d/ε) log(1/δ)) "don't‑know" responses; (ii) each update runs in O(d² log(1/δ)) time, a substantial improvement over the cubic dependence of earlier methods; and (iii) the overall sample requirement drops to O((d/ε) log(1/δ)), making the algorithm practical for moderate‑to‑high‑dimensional problems.
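The core KWIK contract described above can be illustrated with a minimal sketch: a learner that either commits to an ε‑accurate prediction or explicitly answers "I don't know" and requests a label. This is NOT the paper's KLR algorithm — for brevity it uses a leverage‑style confidence test on a Gram matrix in place of KLR's incremental QR decomposition and residual trigger, and the threshold `tau` and ridge term are invented parameters.

```python
import numpy as np

class KWIKLinearRegressor:
    """Illustrative KWIK-style online linear regressor (a sketch,
    not the paper's KLR): predicts only when the query direction is
    well covered by past data, otherwise returns None ("I don't know")."""

    def __init__(self, dim, tau=0.1, ridge=1e-6):
        self.A = ridge * np.eye(dim)   # Gram matrix X^T X (+ small ridge)
        self.b = np.zeros(dim)         # X^T y
        self.tau = tau                 # hypothetical uncertainty threshold

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        A_inv = np.linalg.inv(self.A)
        # Leverage-like uncertainty: large when x leaves the span of the data.
        if x @ A_inv @ x > self.tau:
            return None                # KWIK "don't know" response
        w = A_inv @ self.b             # least-squares weight estimate
        return float(w @ x)

    def observe(self, x, y):
        # Incorporate a labeled example via a rank-one Gram update.
        x = np.asarray(x, dtype=float)
        self.A += np.outer(x, x)
        self.b += y * x
```

In a KWIK analysis, the quantity being bounded is the number of times `predict` returns `None`; the sketch makes that event explicit so the two failure modes (silence vs. an inaccurate answer) cannot be confused.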
To demonstrate the practical impact of KLR, the authors apply it to three structured reinforcement‑learning (RL) settings that have previously resisted efficient learning guarantees. First, they consider Factored Markov Decision Processes (FMDPs), where the reward function is a linear combination of factor‑specific contributions. By treating each factor's weight as a linear parameter, KLR rapidly converges to an accurate reward model, requiring far fewer samples than generic RL approaches. Second, they address Stochastic STRIPS, a planning language in which actions have probabilistic logical effects. The authors linearize the probability of each effect and use KLR to estimate these probabilities, achieving ε‑accurate estimates from fewer than 5,000 samples in a domain with dozens of stochastic rules. Third, they target Object‑Oriented MDPs (OOMDPs), where objects possess attributes and interact via relational dynamics. By encoding object attributes and pairwise relations as features, KLR learns the stochastic outcome distribution of actions, scaling gracefully as the number of objects grows.
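The factored‑reward reduction is the simplest of the three to make concrete: if the reward decomposes as a sum of per‑factor contributions, then it is linear in a one‑hot encoding of the state factors, so any linear regressor (including a KWIK one) applies directly. The toy domain below is invented for illustration — the factor sizes and weight vector are not from the paper.

```python
import numpy as np

# Hypothetical toy domain: reward r(s) = sum_i w_i[s_i] is linear in
# the concatenated one-hot encoding of the state factors.

def one_hot_features(state, factor_sizes):
    """Concatenate one-hot encodings of each discrete state factor."""
    feats = []
    for value, size in zip(state, factor_sizes):
        v = np.zeros(size)
        v[value] = 1.0
        feats.append(v)
    return np.concatenate(feats)

# Two factors taking 2 and 3 values; per-factor reward contributions
# (unknown to the learner, used only to generate labels).
factor_sizes = [2, 3]
true_w = np.array([0.5, -0.5,  1.0, 0.0, 2.0])

# Enumerate all states, generate labeled examples, fit by least squares.
states = [(a, b) for a in range(2) for b in range(3)]
X = np.stack([one_hot_features(s, factor_sizes) for s in states])
y = X @ true_w
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Note that the one‑hot design matrix is rank‑deficient (each block of columns sums to one), so `w_hat` is the minimum‑norm solution rather than `true_w` itself; the fitted rewards `X @ w_hat` nonetheless reproduce the factored reward exactly, which is all an RL agent needs.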
Beyond isolated applications, the paper showcases how KLR can be combined with other KWIK learners (such as KWIK decision trees for transition dynamics) to construct a unified learning pipeline. In this hybrid scheme, the decision tree supplies a piecewise‑constant estimate of the transition function while KLR supplies a linear estimate of the reward. Each component requests additional data only when its confidence interval is violated, leading to coordinated exploration that reduces overall sample consumption. Empirical evaluation across three benchmark domains (grid‑world FMDPs, block‑stacking Stochastic STRIPS, and robot‑manipulation OOMDPs) confirms the theoretical claims: KLR cuts "don't‑know" queries by 40–60%, lowers cumulative regret by roughly 10–15%, and speeds up per‑step computation by a factor of two to three relative to the prior state‑of‑the‑art KWIK linear regression.
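The composition pattern behind this hybrid scheme can be sketched in a few lines: the joint model answers only when every sub‑learner is confident, and a "don't know" from any component identifies where exploration should be directed. The interfaces below are hypothetical stand‑ins (the `KWIKUnion` and `CountingStub` names are invented), not the paper's implementation.

```python
class KWIKUnion:
    """Sketch of KWIK learner composition: predict only when every
    component predicts; otherwise report which component is unsure."""

    def __init__(self, components):
        self.components = components   # dict: name -> KWIK learner

    def predict(self, x):
        out = {}
        for name, learner in self.components.items():
            p = learner.predict(x)
            if p is None:              # any "don't know" makes the joint
                return None, name      # prediction unknown; name the gap
            out[name] = p
        return out, None


class CountingStub:
    """Stand-in KWIK learner: unsure until it has seen `k` labels."""

    def __init__(self, value, k=1):
        self.value, self.k, self.n = value, k, 0

    def predict(self, x):
        return self.value if self.n >= self.k else None

    def observe(self, x, y):
        self.n += 1
```

Because each sub‑learner's "don't‑know" budget is bounded individually, the union's budget is at most the sum of the components' budgets, which is the property that lets the paper stitch per‑component guarantees into a guarantee for the whole model.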
The authors discuss limitations, noting that KLR’s guarantees rely on linearity of the target function. Extending the approach to non‑linear settings could involve kernelized KWIK regression or neural‑network‑based KWIK learners, topics earmarked for future work. They also acknowledge that feature engineering remains a domain‑specific step; integrating automatic representation learning could further broaden applicability.
In summary, the paper makes three substantive contributions: (1) a provably more efficient KWIK‑compatible online linear regression algorithm, (2) the first efficient KWIK‑based learning results for reward functions in factored MDPs, stochastic effect probabilities in STRIPS, and outcome distributions in OOMDPs, and (3) a modular framework for combining KWIK learners to jointly learn transition and reward models. Together, these advances make sample‑efficient, model‑based reinforcement learning in structured environments considerably more practical.