IEEE/CAA Journal of Automatica Sinica
Volume 11, Issue 12, Dec. 2024

S. Cao, X. Wang, and Y. Cheng, “Robust offline actor-critic with on-policy regularized policy evaluation,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 12, pp. 2497–2511, Dec. 2024. doi: 10.1109/JAS.2024.124494

Robust Offline Actor-Critic With On-policy Regularized Policy Evaluation

doi: 10.1109/JAS.2024.124494
Funds:  This work was supported in part by the National Natural Science Foundation of China (62176259, 62373364) and the Key Research and Development Program of Jiangsu Province (BE2022095)
  • To alleviate the extrapolation error and instability inherent in the Q-function directly learned by off-policy Q-learning (QL-style) on static datasets, this article leverages on-policy state-action-reward-state-action (SARSA-style) learning to develop an offline reinforcement learning (RL) method termed robust offline Actor-Critic with on-policy regularized policy evaluation (OPRAC). With the help of SARSA-style bootstrap actions, a conservative on-policy Q-function and a penalty term that matches the on-policy and off-policy actions are jointly constructed to regularize the optimal Q-function of the off-policy QL-style. This naturally equips off-policy QL-style policy evaluation with the intrinsic pessimistic conservatism of on-policy SARSA-style learning, thus facilitating a stable estimate of the Q-function. Even in the presence of sampling errors induced by limited data, the convergence of the Q-function learned by OPRAC and the controllability of the upper bound on the bias between the learned Q-function and its true Q-value are theoretically guaranteed. In addition, the sub-optimality of the learned optimal policy stems solely from sampling errors. Experiments on the well-known D4RL Gym-MuJoCo benchmark demonstrate that OPRAC rapidly learns robust and effective task-solving policies owing to its stable Q-value estimates, outperforming state-of-the-art offline RL methods by at least 15%.
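The sketch below illustrates the kind of on-policy regularized policy evaluation the abstract describes: an off-policy (QL-style) bootstrap target is blended with an on-policy (SARSA-style) target built from the dataset's logged next action, and the QL-style bootstrap is penalized where the policy's action deviates from the logged one. This is a minimal PyTorch illustration based only on the abstract; the network class (MLP), the mixing weight lambda_reg, the penalty weight eta_pen, and the exact form of the penalty are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of on-policy regularized policy evaluation (not the
# authors' code): blend a QL-style target with a SARSA-style target and
# penalize the QL bootstrap where the policy strays from the logged action.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)

def regularized_critic_loss(critic, target_critic, actor, batch,
                            gamma=0.99, lambda_reg=0.5, eta_pen=1.0):
    """Hypothetical critic loss: mix an off-policy (QL-style) target with an
    on-policy (SARSA-style) target built from the dataset's next action."""
    s, a, r, s2, a2, done = (batch[k] for k in ("s", "a", "r", "s2", "a2", "done"))

    with torch.no_grad():
        pi_a2 = actor(s2)
        # Penalty for mismatch between the policy's action and the logged
        # next action (a proxy for out-of-distribution bootstrapping risk).
        penalty = eta_pen * ((pi_a2 - a2) ** 2).sum(dim=-1, keepdim=True)
        # QL-style bootstrap with the current policy's action, made conservative.
        q_ql = target_critic(torch.cat([s2, pi_a2], dim=-1)) - penalty
        # SARSA-style bootstrap with the dataset's own next action.
        q_sarsa = target_critic(torch.cat([s2, a2], dim=-1))
        # Regularized target: interpolate toward the conservative SARSA value.
        target = r + gamma * (1.0 - done) * (
            (1.0 - lambda_reg) * q_ql + lambda_reg * q_sarsa)

    q = critic(torch.cat([s, a], dim=-1))
    return F.mse_loss(q, target)
```

As a hypothetical usage, critic = MLP(state_dim + action_dim, 1), actor = MLP(state_dim, action_dim), target_critic is a slowly updated copy of critic, and batch is a dictionary of D4RL transitions that includes the logged next action a2. Because the targets are computed under torch.no_grad(), the returned loss backpropagates through the critic only.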

     

  • [1]
    R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” Cambridge, MA, USA: MIT Press, 2018.
    [2]
    Y. Cheng, L. Huang, C. L. P. Chen, and X. Wang, “Robust Actor-Critic with relative entropy regulating actor,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 11, pp. 9054–9063, Nov. 2023. doi: 10.1109/TNNLS.2022.3155483
    [3]
    D. Wang, N. Gao, D. Liu, J. Li, and F. L. Lewis, “Recent progress in reinforcement learning and adaptive dynamic programming for advanced control applications,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 1, pp. 18–36, Jan. 2024.
    [4]
    J. Wang, Y. Hong, J. Wang, J. Xu, Y. Tang, Q. L. Han, and J. Kurths, “Cooperative and competitive multi-agent systems: From optimization to games,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 5, pp. 763–783, May 2022. doi: 10.1109/JAS.2022.105506
    [5]
    Y. Yang, H. Modares, K. G. Vamvoudakis, and F. L. Lewis, “Cooperative finitely excited learning for dynamical games,” IEEE Trans. Cybern., vol. 54, no. 2, pp. 797–810, Feb. 2024.
    [6]
    Y. Cheng, L. Huang, and X. Wang, “Authentic boundary proximal policy optimization,” IEEE Trans. Cybern., vol. 52, no. 9, pp. 9428–9438, Sep. 2022. doi: 10.1109/TCYB.2021.3051456
    [7]
    Y. Yang, Z. Ding, R. Wang, H. Modares, and D. C. Wunsch, “Data-driven human-robot interaction without velocity measurement using off-policy reinforcement learning,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 1, pp. 47–63, Jan. 2022. doi: 10.1109/JAS.2021.1004258
    [8]
    X. Wang, L. Wang, C. Dong, H. Ren, and K. Xing, “An online deep reinforcement learning-based order recommendation framework for rider-centered food delivery system,” IEEE Trans. Intell. Transp. Syst., vol. 24, no. 5, pp. 5640–5654, May 2023. doi: 10.1109/TITS.2023.3237580
    [9]
    Y. Yang, B. Kiumarsi, H. Modares, and C. Xu, “Model-free λ-policy iteration for discrete-time linear quadratic regulation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 2, pp. 635–649, Feb. 2023. doi: 10.1109/TNNLS.2021.3098985
    [10]
    C. Mu, Y. Zhang, G. Cai, R. Liu, and C. Sun, “A data-based feedback relearning algorithm for uncertain nonlinear systems,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 5, pp. 1288–1303, May 2023. doi: 10.1109/JAS.2023.123186
    [11]
    Z. Peng, C. Han, Y. Liu, and Z. Zhou, “Weighted policy constraints for offline reinforcement learning,” in Proc. AAAI Conf. Artif. Intell., 2023, pp. 9435–9443.
    [12]
    S. Levine, A. Kumar, G. Tucker, and J. Fu, “Offline reinforcement learning: Tutorial, review, and perspectives on open problems,” arXiv preprint arXiv: 2005.01643, 2020.
    [13]
    S. Lange, T. Gabel, and M. Riedmiller, “Batch reinforcement learning,” Reinforcement Learning: State-of-the-Art, Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 45–73.
    [14]
    R. F. Prudencio, M. R. Maximo, and E. L. Colombini, “A survey on offline reinforcement learning: Taxonomy, review, and open problems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 8, pp. 10237–10257, Aug. 2024.
    [15]
    A. A. Abdellatif, N. Mhaisen, A. Mohamed, A. Erbad, and M. Guizani, “Reinforcement learning for intelligent healthcare systems: A review of challenges, applications, and open research issues,” IEEE Internet Things J., vol. 10, no. 24, pp. 21982–22007, Dec. 2023. doi: 10.1109/JIOT.2023.3288050
    [16]
    Q. Zhang, Y. Gao, Y. Zhang, Y. Guo, D. Ding, Y. Wang, P. Sun, and D. Zhao, “TrajGen: Generating realistic and diverse trajectories with reactive and feasible agent behaviors for autonomous driving,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 24474–24487, Dec. 2022. doi: 10.1109/TITS.2022.3202185
    [17]
    J. Wang, Q. Zhang, and D. Zhao, “Highway lane change decision-making via attention-based deep reinforcement learning,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 3, pp. 567–569, Mar. 2022. doi: 10.1109/JAS.2021.1004395
    [18]
    H. Wang, Z. Liu, Z. Han, Y. Wu, and D. Liu, “Rapid adaptation for active pantograph control in high-speed railway via deep meta reinforcement learning,” IEEE Trans. Cybern., vol. 54, no. 5, pp. 2811–2823, May 2024.
    [19]
    K. Zhang, R. Su, and H. Zhang, “A novel resilient control scheme for a class of Markovian jump systems with partially unknown information,” IEEE Trans. Cybern., vol. 52, no. 8, pp. 8191–8200, Aug. 2022. doi: 10.1109/TCYB.2021.3050619
    [20]
    C. Mu, Z. Liu, J. Yan, H. Jia, and X. Zhang, “Graph multi-agent reinforcement learning for inverter-based active voltage control,” IEEE Trans. Smart Grid, vol. 15, no. 2, pp. 1399–1409, Mar. 2024.
    [21]
    H. Ren, H. Ma, H. Li, and Z. Wang, “Adaptive fixed-time control of nonlinear MASs with actuator faults,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 5, pp. 1252–1262, May 2023. doi: 10.1109/JAS.2023.123558
    [22]
    H. Ma, H. Ren, Q. Zhou, H. Li, and Z. Wang, “Observer-based neural control of N-link flexible-joint robots,” IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 4, pp. 5295–5305, Apr. 2024.
    [23]
    S. Tosatto, J. Carvalho, and J. Peters, “Batch reinforcement learning with a nonparametric off-policy policy gradient,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 5996–6010, Oct. 2022. doi: 10.1109/TPAMI.2021.3088063
    [24]
    X. Hu, Y. Ma, C. Xiao, Y. Zheng, and J. Hao, “In-sample policy iteration for offline reinforcement learning,” arXiv preprint arXiv: 2306.05726, 2023.
    [25]
    S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 2052–2062.
    [26]
    S. Fujimoto and S. Gu, “A minimalist approach to offline reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 20132–20145.
    [27]
    A. Kumar, A. Zhou, G.Tucker, and S. Levine, “Conservative Q-learning for offline reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 1179–1191.
    [28]
    L. Huang, B. Dong, J. Lu, and W. Zhang, “Mild policy evaluation for offline Actor-Critic,” IEEE Trans. Neural Netw. Learn. Syst., early access, Sept. 2023, DOI: 10.1109/TNNLS.2023.3309906.
    [29]
    I. Kostrikov, A. Nair, and S. Levine, “Offline reinforcement learning with implicit Q-learning,” arXiv preprint arXiv: 2110.06169, 2021.
    [30]
    A. Kumar, J. Fu, M Soh, G. Tucker, and S. Levine, “Stabilizing off-policy Q-learning via bootstrapping error reduction,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 11784–11794.
    [31]
    Y. Wu, T. George, and O. Nachum, “Behavior regularized offline reinforcement learning,” arXiv preprint arXiv: 1911.11361, 2019.
    [32]
    F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from observation,” in Proc. Joint Conf. Artif. Intell., 2018, pp. 4950–4957.
    [33]
    A. Nair, Dalal M, Gupta A, and S. Levine, “Accelerating online reinforcement learning with offline datasets,” arXiv preprint arXiv: 2006.09359.
    [34]
    P. Daoudi, M. Barlier, L. D. Santos, and V. Virmaux, “Density estimation for conservative Q-learning,” in Proc. ICLR Workshop Generalizable Policy Learning in Physical World, 2022.
    [35]
    J. Lyu, X. Ma, X. Li, and Z. Lu, “Mildly conservative Q-learning for offline reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 1711–1724.
    [36]
    Y. Wu, S. Zhai, N. Srivastava, J. Susskind, J. Zhang, R. Salakhutdinov, and H. Goh, “Uncertainty weighted Actor-Critic for offline reinforcement learning,” arXiv preprint arXiv: 2105.08140, 2021.
    [37]
    C. Bai, L. Wang, Z. Yang, Z. Deng, A. Garg, P. Liu, and Z. Wang, “Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning,” arXiv preprint arXiv: 2202.11566, 2022.
    [38]
    H. Xu, L. Jiang, J. Li, Z. Yang, Z. Wang, V. W. K. Chan, and X. Zhan, “Offline RL with no OOD actions: In-sample learning via implicit value regularization,” arXiv preprint arXiv: 2303.15810, 2023.
    [39]
    C. Xiao, H. Wang, Y. Pan, A. White, and M. White, “The in-sample softmax for offline reinforcement learning,” arXiv preprint arXiv: 2302.14372, 2023.
    [40]
    D. Brandfonbrener, W Whitney, R. Ranganath, and J. Bruna, “Offline rl without off-policy evaluation,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 4933–4946.
    [41]
    Z. Zhuang, K. Lei, J. Liu, D. Wang, and Y. Guo, “Behavior proximal policy optimization,” arXiv preprint arXiv: 2302.11312, 2023.
    [42]
    L. Shi, R. Dadashi, Y. Chi, P. S. Castro, and M. Geist, “Offline reinforcement learning with on-policy Q-function regularization,” arXiv preprint arXiv: 2307.13824, 2023.
    [43]
    J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine, “D4RL: Datasets for deep data-driven reinforcement learning,” arXiv preprint arXiv: 2004.07219, 2020.
    [44]
    E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in Proc. Int. Conf. Intell. Rob. Syst., 2012, pp. 5026–5033.
    [45]
    A. Farahmand, C. Szepesvári, and R. Munos, “Error propagation for approximate policy and value iteration,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2010, pp. 568–576.
    [46]
    L. Huang, B. Dong, and W. Zhang, “Efficient offline reinforcement learning with relaxed conservatism,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 8, pp. 5260–5272, Aug. 2024.
    [47]
    C. Dann, Y. Mansour, M. Mohri, A. Sekhari, and K. Sridharan, “Guarantees for epsilon-greedy reinforcement learning with function approximation,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 4666–4689.
    [48]
    T. Xu, Z. Li, and Y. Yu, “Error bounds of imitating policies and environments for reinforcement learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 6968–6980, Oct. 2022. doi: 10.1109/TPAMI.2021.3096966
    [49]
    S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1587–1596.
    [50]
    D. Silver and G. Tesauro, “Monte-Carlo simulation balancing,” in Proc. Int. Conf. Mach. Learn., 2009, pp. 945–952.





    Highlights

    • Develop an on-policy regularized policy evaluation method for offline RL
    • Integrate the intrinsic conservatism of on-policy SARSA into off-policy Q-learning
    • Promote stable Q-function estimation without accessing a pre-trained behavior policy
