Exploring Deep and Recurrent Architectures for Optimal Control
Sophisticated multilayer neural networks have achieved state-of-the-art results on multiple supervised tasks. However, successful applications of such multilayer networks to control have so far been limited largely to the perception portion of the control pipeline. In this paper, we explore the application of deep and recurrent neural networks to a continuous, high-dimensional locomotion task, where the network is used to represent a control policy that maps the state of the system (represented by joint angles) directly to the torques at each joint. By using a recent reinforcement learning algorithm called guided policy search, we can successfully train neural network controllers with thousands of parameters, allowing us to compare a variety of architectures. We discuss the differences between the locomotion control task and previous supervised perception tasks, present experimental results comparing various architectures, and discuss future directions in the application of techniques from deep learning to the problem of optimal control.
💡 Research Summary
The paper investigates whether deep learning techniques, which have revolutionized perception tasks, can also improve the representation of control policies themselves. Using a continuous, high‑dimensional bipedal locomotion problem, the authors map a 30‑dimensional state vector (joint angles, velocities, contact flags, and relative body positions) directly to six joint torques. The learning algorithm is Guided Policy Search (GPS), a reinforcement‑learning method that alternates between trajectory optimization (via a variant of Differential Dynamic Programming) and supervised policy updates. GPS starts from a set of demonstrations, builds a distribution over high‑reward trajectories, and uses samples from this distribution to initialize the policy before iteratively improving it with either LBFGS or stochastic gradient descent.
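The GPS alternation described above can be sketched in miniature. Everything in this snippet is a hypothetical stand-in: 1-D double-integrator dynamics, a quadratic cost, a hand-tuned linear feedback law in place of the paper's DDP-based trajectory optimizer, and a linear least-squares fit in place of the LBFGS/SGD policy update.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(policy, T=50, noise=0.0):
    """Roll out a state-feedback policy on toy 1-D double-integrator dynamics."""
    x = np.array([1.0, 0.0])                     # position, velocity
    states, actions = [], []
    for _ in range(T):
        u = policy(x) + noise * rng.standard_normal()
        states.append(x.copy())
        actions.append(u)
        x = np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * u])   # Euler step
    return np.array(states), np.array(actions)

def cost(states, actions):
    # Quadratic cost: drive position to zero with a small action penalty.
    return np.sum(states[:, 0] ** 2) + 1e-3 * np.sum(actions ** 2)

# "Trajectory optimization": a fixed stabilizing feedback law standing in
# for DDP. Noisy rollouts of it play the role of the high-reward
# trajectory distribution that GPS samples from (cf. the 80 samples per
# terrain in the paper).
guide = lambda x: -4.0 * x[0] - 2.5 * x[1]

X, U = [], []
for _ in range(20):
    s, a = rollout(guide, noise=0.1)
    X.append(s)
    U.append(a)
X, U = np.concatenate(X), np.concatenate(U)

# Supervised policy update: fit a linear policy u = K @ x to the samples.
K, *_ = np.linalg.lstsq(X, U, rcond=None)
policy = lambda x: K @ x

s, a = rollout(policy)
print(cost(s, a))   # the fitted policy should roughly track the guide
```

In the paper this loop is iterated, with on-policy samples added each round and a neural network in place of the linear fit; the sketch only shows one pass of the sample-then-regress structure.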
Three families of neural‑network policies are compared: (1) shallow single‑layer networks with 50 or 100 hidden units, (2) deep two‑layer networks with 20 or 50 units per layer, and (3) single‑layer recurrent networks (RNNs) with 20 or 50 hidden units. For each architecture the authors test both soft ReLU (the softplus, log(1 + e^z)) and hard ReLU (max(0, z)) activations; hard ReLU requires plain SGD because LBFGS performs poorly at its nondifferentiable points. Training is performed on 1, 5, or 10 randomly generated terrains whose slopes range from –10° to +10°, each terrain consisting of many 0.5–1.0 m segments. For each terrain, 80 trajectory samples (700 time steps each) are drawn from the GPS‑generated distribution, plus 10 on‑policy samples per iteration. Evaluation consists of five rollouts on each of ten held‑out test terrains, measuring the fraction of trials that finish without falling or stalling.
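The two activation choices can be made concrete with a toy forward pass. The dimensions follow the summary (30-dimensional state in, 6 torques out, two 50-unit hidden layers), but the random weights are placeholders, not trained parameters, and this is only a sketch of the policy's shape, not the paper's implementation.

```python
import numpy as np

def soft_relu(z):
    # Softplus, log(1 + e^z), computed stably for large |z|.
    return np.logaddexp(0.0, z)

def hard_relu(z):
    # Piecewise-linear rectifier, max(0, z).
    return np.maximum(0.0, z)

def policy_forward(x, weights, act):
    """Map a state vector to joint torques through two hidden layers."""
    (W1, b1), (W2, b2), (W3, b3) = weights
    h1 = act(W1 @ x + b1)
    h2 = act(W2 @ h1 + b2)
    return W3 @ h2 + b3          # linear output layer: joint torques

rng = np.random.default_rng(0)
dims = [30, 50, 50, 6]           # state -> hidden -> hidden -> torques
weights = [(0.1 * rng.standard_normal((o, i)), np.zeros(o))
           for i, o in zip(dims[:-1], dims[1:])]

x = rng.standard_normal(30)      # an example state vector
tau_soft = policy_forward(x, weights, soft_relu)
tau_hard = policy_forward(x, weights, hard_relu)
print(tau_soft.shape, tau_hard.shape)   # both (6,)
```

The smoothness difference is the practical point: softplus is differentiable everywhere, which is why LBFGS can be used with it, while the kink in max(0, z) pushes the authors toward plain SGD.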
Results show clear trends. Increasing the number of training terrains improves generalization for all architectures, as expected. Deep and recurrent networks tend to outperform shallow ones when enough training data (5–10 terrains) is available, but the advantage is modest. With soft ReLU units, deep networks achieve the highest success rates, likely because LBFGS can exploit the smooth gradients of soft ReLU to train deeper structures effectively. With hard ReLU units, the small recurrent network (20 hidden units) performs best, suggesting that piecewise‑linear activations mitigate vanishing‑gradient problems during backpropagation through time, even without second‑order optimization.
However, larger networks also exhibit over‑fitting and susceptibility to local optima. The 100‑unit shallow network, despite having many parameters, performs poorly when trained on few terrains. Regularization techniques common in vision (sparsity penalties, denoising autoencoders) did not help; the authors argue that state‑feature vectors in control demand precise absolute values, making such regularizers ineffective without adaptation. Consequently, the paper highlights the need for regularization methods tailored to optimal‑control settings.
The authors conclude that deep and recurrent policies can be trained for challenging continuous control tasks, but success hinges on careful algorithmic choices: the type of activation, the optimizer (LBFGS vs. SGD), the amount of diverse training data, and appropriate regularization. They propose future work on extending variational GPS to recurrent policies—allowing minibatch SGD—and on hierarchical or meta‑learning approaches that could reuse low‑level balance controllers across many high‑level behaviors. In sum, the study provides an empirical baseline for applying modern deep‑learning architectures to control policy learning, while underscoring the practical challenges that remain.