The goal of the Learning Locomotion Project was to use machine learning techniques to create autonomous control software for a quadruped robot so that it can traverse unknown, rugged, and complex terrain. The LittleDog robot was chosen as the experimental platform; it is about 30 cm long and 20 cm high, with three degrees of freedom per leg. The project specifications required the robot to reach a speed of at least 7.2 cm/s and to climb over obstacles up to 10.7 cm high (for a human, this would correspond to traversing obstacles of about 50% body height at slow walking speed). For additional information, follow the link to this article and the official project website.
The technical components of our approach included:
Bayesian Learning for state estimation and outlier removal
Imitation learning of behaviors
Reinforcement learning for behavior improvement
Learning foothold templates for planning where to step
Compliant floating base inverse dynamics control with force control at the feet
Predictive control of contact forces
Convex optimization for ZMP balancer
High speed semi-dynamic walking gait
Below are publications that describe our research in more detail. The ICRA 2010 paper was an Overall Best Paper Award finalist, i.e., among the best 4 papers out of 2000 submissions.
Path Integral Reinforcement Learning
Problem Formulation. Reinforcement learning (RL) is, in theory, one of the most general approaches to learning control, as it can learn from rather uninformative rewards how to generate optimal actions for a given task [1]. Among the main problems of RL are the rather slow convergence to an optimal solution, the number of open parameters to be tuned, and the restriction to rather low-dimensional learning problems. Until recently, it has been unclear whether RL would ever scale to learning in normal movement systems, which easily have tens to hundreds of dimensions.
Approach. With the goal of generating more scalable algorithms with higher efficiency and fewer open parameters, RL has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this project investigated how to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While the approach is solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into the problem of approximating a path integral, which has no open algorithmic parameters other than the magnitude of the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model-free, depending on how the learning problem is structured. The update equations carry no danger of numerical instabilities, as neither matrix inversions nor gradient learning rates are required.
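The idea of replacing a gradient step with a probability-weighted average over noisy roll-outs can be illustrated with a minimal sketch. This is not the project's implementation: it is a simplified, episodic variant that assigns a single weight to each roll-out, whereas the full path integral formulation works with time-indexed costs-to-go. The function name `pi2_style_update`, the sharpness constant `h`, the cost normalization, and the quadratic toy cost are all assumptions made for illustration.

```python
import numpy as np


def pi2_style_update(theta, cost_fn, n_rollouts=20, noise_std=0.5, h=10.0, rng=None):
    """One simplified, episodic path-integral-style update (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    # Exploration: perturb the policy parameters once per roll-out.
    eps = rng.normal(0.0, noise_std, size=(n_rollouts, theta.size))
    costs = np.array([cost_fn(theta + e) for e in eps])
    # Map costs to [0, 1] so the exponentiation does not depend on cost scale
    # (an assumed normalization; h controls how sharply good roll-outs win).
    spread = costs.max() - costs.min()
    normalized = (costs - costs.min()) / (spread + 1e-12)
    # Low-cost roll-outs receive exponentially larger probability.
    weights = np.exp(-h * normalized)
    weights /= weights.sum()
    # Probability-weighted average of the exploration noise:
    # no gradient learning rate and no matrix inversion is involved.
    return theta + weights @ eps


# Hypothetical toy problem: move a 5-dimensional parameter vector to a target.
target = np.array([1.0, -2.0, 0.5, 3.0, -1.0])


def toy_cost(params):
    return float(np.sum((params - target) ** 2))


rng = np.random.default_rng(0)
theta = np.zeros(5)
initial_cost = toy_cost(theta)
for _ in range(50):
    theta = pi2_style_update(theta, toy_cost, rng=rng)
final_cost = toy_cost(theta)
print(f"cost before: {initial_cost:.2f}, cost after: {final_cost:.2f}")
```

In this sketch the only quantity that must be chosen with care is `noise_std`, the magnitude of the exploration noise, which mirrors the claim in the text; `h` and `n_rollouts` are conveniences of the toy setup rather than parts of the published algorithm.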
Results. Our new algorithm, called Policy Improvement with Path Integrals (PI2), demonstrated interesting similarities with previous RL research in the framework of probability matching, and it provides intuition as to why the somewhat heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrated significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Figure 7.10 illustrates an experiment in which a simulated 12 degree-of-freedom robot dog learns to jump over a gap in front of it in a remarkably small number of trials.
Figure 7.10: The top images show a real robot dog jumping over a gap in the ground, and a simulated robot dog attempting the same task. The bottom figure illustrates the learning curve of the simulated robot dog learning how to jump as far and effectively as possible over the gap. Within about 10-20 trials, the robot learned to improve its jump by about one body length, which made jumping very effective and 100% successful. A learning speed of about 10-20 trials for tuning about 600 open parameters with reinforcement learning is quite remarkable.
Discussion and Future Outlook. We believe that PI2 is currently one of the most efficient, numerically robust, and easy-to-implement algorithms for RL based on trajectory roll-outs. Future work will explore this algorithm in many more learning control scenarios, in particular how to learn feedback controllers from arbitrary sensory feedback variables.
[1] Sutton, R.S. & Barto, A.G., 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.