IEEE/CAA Journal of Automatica Sinica
Volume 8, Issue 2, February 2021

Citation: Dimitri Bertsekas, "Multiagent Reinforcement Learning: Rollout and Policy Iteration," IEEE/CAA J. Autom. Sinica, vol. 8, no. 2, pp. 249-272, Feb. 2021. doi: 10.1109/JAS.2021.1003814

Multiagent Reinforcement Learning: Rollout and Policy Iteration

doi: 10.1109/JAS.2021.1003814
Abstract: We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program.

In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components, each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function and shared, perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach whereby at every stage, the agents sequentially (one-at-a-time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees an improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents.

For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, which is executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and approximate forms, are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
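As a concrete illustration of the one-agent-at-a-time rollout step described in the abstract, the following is a minimal Python sketch (not the paper's code). It assumes a generic finite-control interface; the names `agent_action_sets`, `base_policy`, and `simulate_cost` are hypothetical placeholders, and in practice `simulate_cost` would be estimated by Monte Carlo simulation of the base policy.

```python
def multiagent_rollout_step(state, agent_action_sets, base_policy, simulate_cost):
    """One stage of one-agent-at-a-time (multiagent) rollout.

    agent_action_sets: list of finite action sets [U_1, ..., U_m], one per agent.
    base_policy(state): returns a tuple of m control components (the base policy's control).
    simulate_cost(state, control): estimated cost of applying `control` at `state`
        and following the base policy thereafter (e.g., via rollout simulation).
    """
    # Start from the base policy's control; components are improved one at a time,
    # with earlier agents fixed at their new choices and later agents at the base policy.
    control = list(base_policy(state))
    for agent, actions in enumerate(agent_action_sets):
        best_u, best_cost = control[agent], float("inf")
        for u in actions:
            trial = list(control)
            trial[agent] = u
            cost = simulate_cost(state, tuple(trial))
            if cost < best_cost:
                best_u, best_cost = u, cost
        control[agent] = best_u  # fix this agent's component before moving on
    return tuple(control)
```

Per stage this uses on the order of |U_1| + ... + |U_m| cost estimates, whereas standard rollout over the joint control space would need about |U_1| x ... x |U_m|, which is the linear-versus-exponential comparison made above.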

     

Recommended by Associate Editor Qinglai Wei.
