Combining Local and Global Direct Derivative-Free Optimization for Reinforcement Learning
Abstract
We consider the problem of optimization in policy space for reinforcement learning. While a plethora of methods have been applied to this problem, only a narrow category of them has proved feasible in robotics. We consider the peculiar characteristics of reinforcement learning in robotics, and devise a combination of two algorithms from the derivative-free optimization literature. The proposed combination is well suited to robotics, as it involves both off-line learning in simulation and on-line learning in the real environment. We demonstrate our approach on a real-world task, in which an Autonomous Underwater Vehicle has to survey a target area under potentially unknown environmental conditions. We start from a given controller, which can perform the task under foreseeable conditions, and make it adaptive to the actual environment.
© 2013 Matteo Leonetti, Petar Kormushev, Simone Sagratella, published by Bulgarian Academy of Sciences, Institute of Information and Communication Technologies
This work is licensed under the Creative Commons License.
