IEEE/CAA Journal of Automatica Sinica
Citation: Dimitri Bertsekas, "Multiagent Reinforcement Learning: Rollout and Policy Iteration," IEEE/CAA J. Autom. Sinica, vol. 8, no. 2, pp. 249-272, Feb. 2021. doi: 10.1109/JAS.2021.1003814
[1] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, 4th ed. Belmont, USA: Athena Scientific, 2017.
[2] D. P. Bertsekas, Reinforcement Learning and Optimal Control. Belmont, USA: Athena Scientific, 2019.
[3] D. P. Bertsekas, Rollout, Policy Iteration, and Distributed Reinforcement Learning. Belmont, USA: Athena Scientific, 2020.
[4] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv: 1712.01815, 2017.
[5] D. P. Bertsekas, "Multiagent value iteration algorithms in dynamic programming and reinforcement learning," arXiv: 2005.01627, 2020.
[6] D. P. Bertsekas, "Constrained multiagent rollout and multidimensional assignment with the auction algorithm," arXiv: 2002.07407, 2020.
[7] D. P. Bertsekas, "Distributed dynamic programming," IEEE Trans. Autom. Control, vol. 27, no. 3, pp. 610-616, Jun. 1982.
[8] D. P. Bertsekas, "Asynchronous distributed computation of fixed points," Math. Programming, vol. 27, no. 1, pp. 107-120, Sep. 1983.
[9] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, USA: Prentice-Hall, 1989.
[10] D. P. Bertsekas and H. Z. Yu, "Asynchronous distributed policy iteration in dynamic programming," in Proc. 48th Annu. Allerton Conf. Communication, Control, and Computing, Allerton, USA, 2010, pp. 1368-1374.
[11] D. P. Bertsekas and H. Z. Yu, "Q-learning and enhanced policy iteration in discounted dynamic programming," Math. Oper. Res., vol. 37, pp. 66-94, Feb. 2012.
[12] H. Z. Yu and D. P. Bertsekas, "Q-learning and policy iteration algorithms for stochastic shortest path problems," Ann. Oper. Res., vol. 208, no. 1, pp. 95-132, Sep. 2013.
[13] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, 4th ed. Belmont, USA: Athena Scientific, 2012.
[14] D. P. Bertsekas, Abstract Dynamic Programming. Belmont, USA: Athena Scientific, 2018.
[15] S. Bhattacharya, S. Badyal, T. Wheeler, S. Gil, and D. P. Bertsekas, "Reinforcement learning for POMDP: Partitioned rollout and policy iteration with application to autonomous sequential repair problems," IEEE Rob. Autom. Lett., vol. 5, no. 3, pp. 3967-3974, Jul. 2020.
[16] H. S. Witsenhausen, "A counterexample in stochastic optimum control," SIAM J. Control, vol. 6, no. 1, pp. 131-147, 1968.
[17] H. S. Witsenhausen, "Separation of estimation and control for discrete time systems," Proc. IEEE, vol. 59, no. 11, pp. 1557-1566, Nov. 1971.
[18] J. Marschak, "Elements for a theory of teams," Manage. Sci., vol. 1, no. 2, pp. 127-137, Jan. 1955.
[19] R. Radner, "Team decision problems," Ann. Math. Statist., vol. 33, no. 3, pp. 857-881, Sep. 1962.
[20] H. S. Witsenhausen, "On information structures, feedback and causality," SIAM J. Control, vol. 9, no. 2, pp. 149-160, 1971.
[21] J. Marschak and R. Radner, Economic Theory of Teams. New Haven, USA: Yale University Press, 1976.
[22] N. Sandell, P. Varaiya, M. Athans, and M. Safonov, "Survey of decentralized control methods for large scale systems," IEEE Trans. Autom. Control, vol. 23, no. 2, pp. 108-128, Apr. 1978.
[23] T. Yoshikawa, "Decomposition of dynamic team decision problems," IEEE Trans. Autom. Control, vol. 23, no. 4, pp. 627-632, Aug. 1978.
[24] Y. C. Ho, "Team decision theory and information structures," Proc. IEEE, vol. 68, no. 6, pp. 644-654, Jun. 1980.
[25] D. Bauso and R. Pesenti, "Generalized person-by-person optimization in team problems with binary decisions," in Proc. American Control Conf., Seattle, USA, 2008, pp. 717-722.
[26] D. Bauso and R. Pesenti, "Team theory and person-by-person optimization with binary decisions," SIAM J. Control Optim., vol. 50, no. 5, pp. 3011-3028, Jan. 2012.
[27] A. Nayyar, A. Mahajan, and D. Teneketzis, "Decentralized stochastic control with partial history sharing: A common information approach," IEEE Trans. Autom. Control, vol. 58, no. 7, pp. 1644-1658, Jul. 2013.
[28] A. Nayyar and D. Teneketzis, "Common knowledge and sequential team problems," IEEE Trans. Autom. Control, vol. 64, no. 12, pp. 5108-5115, Dec. 2019.
[29] Y. Y. Li, Y. J. Tang, R. Y. Zhang, and N. Li, "Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach," arXiv: 1912.09135, 2019.
[30] G. Qu and N. Li, "Exploiting fast decaying and locality in multi-agent MDP with tree dependence structure," in Proc. IEEE Conf. Decision and Control, Nice, France, 2019.
[31] A. Gupta, "Existence of team-optimal solutions in static teams with common information: A topology of information approach," SIAM J. Control Optim., vol. 58, no. 2, pp. 998-1021, Apr. 2020.
[32] F. Bullo, J. Cortes, and S. Martinez, Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms. Princeton, USA: Princeton University Press, 2009.
[33] M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton, USA: Princeton University Press, 2010.
[34] M. S. Mahmoud, Multiagent Systems: Introduction and Coordination Control. Boca Raton, USA: CRC Press, 2020.
[35] R. Zoppoli, M. Sanguineti, G. Gnecco, and T. Parisini, Neural Approximations for Optimal Control and Decision. Springer, 2020.
[36] F. A. Oliehoek and C. Amato, A Concise Introduction to Decentralized POMDPs. Springer International Publishing, 2016. doi: 10.1007/978-3-319-28929-8
[37] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: Dealing with non-stationarity," arXiv: 1707.09183, 2017.
[38] K. Q. Zhang, Z. R. Yang, and T. Başar, "Multi-agent reinforcement learning: A selective overview of theories and algorithms," arXiv: 1911.10635, 2019.
[39] L. S. Shapley, "Stochastic games," Proc. Natl. Acad. Sci., vol. 39, no. 10, pp. 1095-1100, 1953.
[40] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Machine Learning Proceedings 1994, W. W. Cohen and H. Hirsh, Eds. Amsterdam, The Netherlands: Elsevier, 1994, pp. 157-163.
[41] K. P. Sycara, "Multiagent systems," AI Mag., vol. 19, no. 2, pp. 79-92, Jun. 1998.
[42] P. Stone and M. Veloso, "Multiagent systems: A survey from a machine learning perspective," Auton. Rob., vol. 8, no. 3, pp. 345-383, Jun. 2000.
[43] L. Panait and S. Luke, "Cooperative multi-agent learning: The state of the art," Auton. Agent. Multi-Agent Syst., vol. 11, no. 3, pp. 387-434, Nov. 2005.
[44] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst., Man, Cybern., Part C, vol. 38, no. 2, pp. 156-172, Mar. 2008.
[45] L. Busoniu, R. Babuška, and B. De Schutter, "Multi-agent reinforcement learning: An overview," in Innovations in Multi-Agent Systems and Applications-1, D. Srinivasan and L. C. Jain, Eds. Berlin, Germany: Springer, 2010, pp. 183-221.
[46] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, "Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems," Knowl. Eng. Rev., vol. 27, no. 1, pp. 1-31, Feb. 2012.
[47] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, "A survey and critique of multiagent deep reinforcement learning," Auton. Agent. Multi-Agent Syst., vol. 33, no. 6, pp. 750-797, Oct. 2019.
[48] A. OroojlooyJadid and D. Hajinezhad, "A review of cooperative multi-agent deep reinforcement learning," arXiv: 1908.03963, 2019.
[49] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, "Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications," IEEE Trans. Cybern., vol. 50, no. 9, pp. 3826-3839, Sep. 2020.
[50] G. Tesauro, "Extending Q-learning to general adaptive multi-agent systems," in Proc. 16th Int. Conf. Neural Information Processing Systems, 2004, pp. 871-878.
[51] F. A. Oliehoek, J. F. P. Kooij, and N. Vlassis, "The cross-entropy method for policy search in decentralized POMDPs," Informatica, vol. 32, no. 4, pp. 341-357, 2008.
[52] P. Pennesi and I. C. Paschalidis, "A distributed actor-critic algorithm and applications to mobile sensor network coordination problems," IEEE Trans. Autom. Control, vol. 55, no. 2, pp. 492-497, Feb. 2010.
[53] I. C. Paschalidis and Y. W. Lin, "Mobile agent coordination via a distributed actor-critic algorithm," in Proc. 19th Mediterranean Conf. Control Automation, Corfu, Greece, 2011, pp. 644-649.
[54] S. Kar, J. M. F. Moura, and H. V. Poor, "QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1848-1862, Apr. 2013.
[55] J. N. Foerster, Y. M. Assael, N. De Freitas, and S. Whiteson, "Learning to communicate with deep multi-agent reinforcement learning," in Proc. 30th Int. Conf. Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 2137-2145.
[56] S. Omidshafiei, A. A. Agha-Mohammadi, C. Amato, S. Y. Liu, J. P. How, and J. Vian, "Graph-based cross entropy method for solving multi-robot decentralized POMDPs," in Proc. IEEE Int. Conf. Robotics and Automation, Stockholm, Sweden, 2016, pp. 5395-5402.
[57] J. K. Gupta, M. Egorov, and M. Kochenderfer, "Cooperative multi-agent control using deep reinforcement learning," in Proc. Int. Conf. Autonomous Agents and Multiagent Systems (Best Papers), Brazil, 2017, pp. 66-83.
[58] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6379-6390.
[59] M. Zhou, Y. Chen, Y. Wen, Y. D. Yang, Y. F. Su, W. N. Zhang, D. Zhang, and J. Wang, "Factorized Q-learning for large-scale multi-agent systems," arXiv: 1809.03738, 2018.
[60] K. Q. Zhang, Z. R. Yang, H. Liu, T. Zhang, and T. Başar, "Fully decentralized multi-agent reinforcement learning with networked agents," arXiv: 1802.08757, 2018.
[61] Y. Zhang and M. M. Zavlanos, "Distributed off-policy actor-critic reinforcement learning with policy consensus," in Proc. IEEE 58th Conf. Decision and Control, Nice, France, 2019, pp. 4674-4679.
[62] C. S. de Witt, J. N. Foerster, G. Farquhar, P. H. S. Torr, W. Boehmer, and S. Whiteson, "Multi-agent common knowledge reinforcement learning," in Proc. 33rd Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2019, pp. 9927-9939.
[63] D. P. Bertsekas, "Multiagent rollout algorithms and reinforcement learning," arXiv: 1910.00120, 2019.
[64] S. Bhattacharya, S. Kailas, S. Badyal, S. Gil, and D. P. Bertsekas, "Multiagent rollout and policy iteration for POMDP with application to multi-robot repair problems," in Proc. Conf. Robot Learning, 2020; also arXiv preprint arXiv: 2011.04222.
[65] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, USA: Athena Scientific, 1996.
[66] G. Tesauro and G. R. Galperin, "On-line policy improvement using Monte-Carlo search," in Proc. 9th Int. Conf. Neural Information Processing Systems, Denver, USA, 1996, pp. 1068-1074.
[67] D. P. Bertsekas, Nonlinear Programming, 3rd ed. Belmont, USA: Athena Scientific, 2016.
[68] M. G. Lagoudakis and R. Parr, "Reinforcement learning as classification: Leveraging modern classifiers," in Proc. 20th Int. Conf. Machine Learning, Washington, USA, 2003, pp. 424-431.
[69] C. Dimitrakakis and M. G. Lagoudakis, "Rollout sampling approximate policy iteration," Mach. Learn., vol. 72, no. 3, pp. 157-171, Jul. 2008.
[70] A. Lazaric, M. Ghavamzadeh, and R. Munos, "Analysis of a classification-based policy iteration algorithm," in Proc. 27th Int. Conf. Machine Learning, Haifa, Israel, 2010.
[71] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Proc. 21st Int. Conf. Machine Learning, Banff, Canada, 2004.
[72] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Rob. Auton. Syst., vol. 57, no. 5, pp. 469-483, May 2009.
[73] G. Neu and C. Szepesvari, "Apprenticeship learning using inverse reinforcement learning and gradient methods," arXiv: 1206.5264, 2012.
[74] H. Ben Amor, D. Vogt, M. Ewerton, E. Berger, B. Jung, and J. Peters, "Learning responsive robot behavior by imitation," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Tokyo, Japan, 2013, pp. 3257-3264.
[75] J. Lee, "A survey of robot learning from demonstrations for human-robot collaboration," arXiv: 1710.08789, 2017.
[76] M. K. Hanawal, H. Liu, H. H. Zhu, and I. C. Paschalidis, "Learning policies for Markov decision processes from data," IEEE Trans. Autom. Control, vol. 64, no. 6, pp. 2298-2309, Jun. 2019.
[77] D. Gagliardi and G. Russo, "On a probabilistic approach to synthesize control policies from example datasets," arXiv: 2005.11191, 2020.
[78] T. T. Xu, H. H. Zhu, and I. C. Paschalidis, "Learning parametric policies and transition probability models of Markov decision processes from data," Eur. J. Control, 2020.
[79] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, USA: MIT Press, 2018.
[80] D. P. Bertsekas, "Feature-based aggregation and deep reinforcement learning: A survey and some new implementations," IEEE/CAA J. Autom. Sinica, vol. 6, no. 1, pp. 1-31, Jan. 2019.
[81] D. P. Bertsekas, "Approximate policy iteration: A survey and some new methods," J. Control Theory Appl., vol. 9, no. 3, pp. 310-335, Jul. 2011; expanded version appears as Lab. for Information and Decision Systems Report LIDS-2833, MIT, 2011.
[82] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Mach. Learn., vol. 16, no. 3, pp. 185-202, Sep. 1994.
[83] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.