The Thing That We Tried Didn't Work Very Well: Deictic Representation in Reinforcement Learning

Most reinforcement learning methods operate on propositional representations of the world state. Such representations are often intractably large and generalize poorly. Deictic representations are widely believed to be a viable alternative: they promise generalization while allowing the use of existing reinforcement-learning methods. Yet few experiments on learning with deictic representations have been reported in the literature. In this paper we explore the effectiveness of two forms of deictic representation and a naïve propositional representation in a simple blocks-world domain. We find, empirically, that the deictic representations actually worsen learning performance. We conclude with a discussion of possible causes of these results and strategies for more effective learning in domains with objects.


💡 Research Summary

The paper investigates whether deictic state representations—those that refer to objects by pointers such as “this” or “that”—can improve reinforcement‑learning (RL) performance compared with a naïve propositional encoding. The authors focus on a simple blocks‑world domain where a robot must stack a target block onto a specified location. Three state encodings are evaluated: (1) a full propositional representation that enumerates every possible block‑position relation as binary features, (2) a fixed‑set deictic representation that uses a small number of pointers (e.g., “the block currently held”, “the topmost block”), and (3) a dynamic deictic representation that allocates pointers on‑the‑fly to the current goal object. For learning they employ standard tabular Q‑learning and SARSA with an ε‑greedy exploration policy, keeping all other algorithmic details identical across conditions.
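The learning setup described above — tabular Q-learning with ε-greedy exploration — can be sketched as follows. This is a minimal illustration on a hypothetical three-state chain task, not the authors' blocks-world code; all function and variable names are our own.

```python
import random
from collections import defaultdict

def q_learning(step, actions, episodes=500, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # (state, action) -> estimated return
    for _ in range(episodes):
        s = 0  # every episode starts in state 0
        for _ in range(100):  # cap episode length
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = step(s, a)
            # one-step Q-learning backup
            target = r + (0.0 if done else
                          gamma * max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if done:
                break
            s = s2
    return Q

# Toy 3-state chain standing in for the task: action 1 moves right,
# action 0 moves left; reaching state 2 yields reward 1 and ends the episode.
def chain(s, a):
    s2 = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 2 else 0.0), s2 == 2

Q = q_learning(chain, actions=[0, 1])
```

The same update rule applies unchanged across all three encodings; only the state (and, for the deictic variants, the action) representation differs, which is exactly the experimental control the paper relies on.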

Empirical results show that, contrary to the common belief that deictic representations yield better generalization and smaller state spaces, both deictic variants actually degrade learning speed and final performance. The propositional baseline, although high‑dimensional, converges more reliably after sufficient episodes. The fixed deictic encoding sometimes exhibits a brief early advantage but soon suffers from large variance and slower asymptotic convergence. The dynamic deictic encoding performs the worst, with unstable learning curves caused by frequent remapping of pointers to different objects.

The authors attribute these failures to three intertwined factors. First, partial observability: a deictic state only contains information about the object(s) currently referenced by the pointers, so the same deictic vector can correspond to many underlying world configurations. This ambiguity forces the learner to associate the same state with different optimal actions, slowing value‑function learning. Second, action‑space explosion: because actions must be conditioned on the current pointer(s), the effective action set becomes a Cartesian product of primitive actions and pointer selections, dramatically increasing the number of state‑action pairs that must be explored. Third, inadequate exploration: ε‑greedy injects random primitive actions but does not explicitly encourage pointer changes, so the agent spends many steps exploring irrelevant parts of the state space while rarely trying new pointer assignments that could reveal useful structure. The dynamic pointer scheme exacerbates this problem because the mapping between pointers and objects is itself unstable during early learning, leading to non‑stationary transition dynamics.
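The action-space explosion described above can be made concrete: conditioning each primitive action on a pointer choice multiplies the two sets. A hypothetical illustration (the action and pointer names are ours, not the paper's):

```python
from itertools import product

# Hypothetical primitive actions and deictic pointers
# (names are illustrative, not taken from the paper):
primitive_actions = ["pick-up", "put-down", "look-left", "look-right"]
pointers = ["focus", "held-block", "marker"]

# Conditioning every primitive on a pointer choice yields the
# Cartesian product: 4 primitives x 3 pointers = 12 joint actions,
# and the count multiplies again with every additional pointer.
deictic_actions = list(product(pointers, primitive_actions))
print(len(deictic_actions))  # 12
```

Each of these joint actions needs its own Q-value per state, so the pointer mechanism that was meant to shrink the state space inflates the state-action space instead.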

To address these issues, the paper proposes several avenues for future work. (a) Meta‑learning of pointer‑object bindings: a separate network could learn to predict which object a pointer should refer to given the current observation, thereby reducing ambiguity. (b) Hybrid exploration strategies that treat pointer changes as distinct exploratory actions, assigning them a separate probability or using intrinsic‑motivation bonuses when a new pointer configuration is encountered. (c) Graph‑based hybrid representations that retain a full relational graph of objects while using deictic pointers as attention mechanisms, allowing the agent to exploit both the compactness of pointers and the completeness of relational information. The authors argue that such hybrids could preserve the generalization benefits of deictic representations while mitigating the partial‑observability and exploration problems identified in their experiments.
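Suggestion (b), treating pointer changes as distinct exploratory actions, could be realized as a two-level ε-greedy policy. The sketch below is our own illustrative design under that assumption, not code from the paper:

```python
import random

def explore_action(rng, Q, state, primitive_actions, pointer_actions,
                   eps_primitive=0.1, eps_pointer=0.2):
    """Two-level epsilon-greedy: pointer re-bindings get their own
    exploration rate, so the agent tries new pointer configurations
    more often than a single flat epsilon would allow."""
    roll = rng.random()
    if roll < eps_pointer:
        # explore a pointer re-binding
        return rng.choice(pointer_actions)
    if roll < eps_pointer + eps_primitive:
        # explore a primitive action
        return rng.choice(primitive_actions)
    # otherwise act greedily over the joint action set
    all_actions = primitive_actions + pointer_actions
    return max(all_actions, key=lambda a: Q.get((state, a), 0.0))
```

An intrinsic-motivation variant would keep the same structure but replace the fixed `eps_pointer` with a bonus that decays as each pointer configuration is revisited.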

In summary, the study provides a rare empirical counter‑example to the assumption that deictic representations automatically improve RL in object‑oriented domains. It highlights that simply swapping propositional vectors for pointer‑based encodings is insufficient; careful handling of observability, action conditioning, and exploration is essential. The paper’s analysis and suggested research directions lay groundwork for more sophisticated object‑centric RL architectures that can truly leverage deictic abstractions without sacrificing learning stability or efficiency.