A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 1 Issue 3
Jul.  2014

IEEE/CAA Journal of Automatica Sinica

Citation: Girish Chowdhary, Miao Liu, Robert Grande, Thomas Walsh, Jonathan How and Lawrence Carin, "Off-Policy Reinforcement Learning with Gaussian Processes," IEEE/CAA J. of Autom. Sinica, vol. 1, no. 3, pp. 227-238, 2014.

Off-Policy Reinforcement Learning with Gaussian Processes

Funds:

This work was supported by the Office of Naval Research Science of Autonomy Program (N000140910625).

  • Abstract—An off-policy Bayesian nonparametric approximate reinforcement learning framework, termed GPQ, that employs a Gaussian process (GP) model of the value (Q) function is presented in both the batch and online settings. Sufficient conditions on GP hyperparameter selection are established to guarantee convergence of off-policy GPQ in the batch setting, and theoretical and practical extensions are provided for the online case. Empirical results demonstrate that GPQ has competitive learning speed in addition to its convergence guarantees and its ability to automatically choose its own basis locations.
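
    To make the idea in the abstract concrete, the sketch below illustrates batch GP-based Q-learning in the spirit of GPQ: a Gaussian process regressor models Q over (state, action) pairs, and Bellman targets are bootstrapped off-policy from a fixed batch of transitions. This is a minimal illustration under stated assumptions (scikit-learn's GaussianProcessRegressor, an RBF-plus-noise kernel, discrete actions), not the authors' exact algorithm, which additionally covers online sparse GP updates and the hyperparameter conditions analyzed in the paper.

    ```python
    # Minimal batch GPQ-style sketch (illustrative assumption, not the paper's
    # exact algorithm): model Q(s, a) with a GP and iterate off-policy Bellman
    # backups over a fixed batch of transitions.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    def batch_gpq(transitions, actions, gamma=0.95, n_iters=20):
        """transitions: list of (state, action, reward, next_state), where states
        are 1-D numpy arrays and actions are numeric indices from `actions`."""
        X = np.array([np.append(s, a) for s, a, _, _ in transitions])
        rewards = np.array([r for _, _, r, _ in transitions])
        next_states = [s2 for _, _, _, s2 in transitions]

        # GP prior over Q; the RBF length-scale and noise level stand in for the
        # hyperparameters whose selection the paper gives conditions on.
        gp = GaussianProcessRegressor(
            kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2),
            normalize_y=True)

        y = rewards.copy()  # initial targets: immediate rewards
        for _ in range(n_iters):
            gp.fit(X, y)
            # Off-policy backup: greedy max over actions at each next state.
            q_next = np.array([
                max(gp.predict(np.append(s2, a).reshape(1, -1))[0] for a in actions)
                for s2 in next_states
            ])
            y = rewards + gamma * q_next  # Bellman targets for the next fit
        return gp
    ```

    A fitted model can then be queried greedily, e.g. picking the action a that maximizes gp.predict on np.append(state, a); the GP's kernel plays the role of the automatically chosen basis locations mentioned in the abstract.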

     

