IEEE/CAA Journal of Automatica Sinica
Volume 11, Issue 12, Dec. 2024

S. Cao, X. Wang, and Y. Cheng, “Robust offline actor-critic with on-policy regularized policy evaluation,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 12, pp. 2497–2511, Dec. 2024. doi: 10.1109/JAS.2024.124494

Robust Offline Actor-Critic With On-policy Regularized Policy Evaluation

doi: 10.1109/JAS.2024.124494
Funds:  This work was supported in part by the National Natural Science Foundation of China (62176259, 62373364) and the Key Research and Development Program of Jiangsu Province (BE2022095)
  • To alleviate the extrapolation error and instability inherent in the Q-function directly learned by off-policy Q-learning (QL-style) on static datasets, this article leverages on-policy state-action-reward-state-action (SARSA-style) learning to develop an offline reinforcement learning (RL) method termed robust offline Actor-Critic with on-policy regularized policy evaluation (OPRAC). With the help of SARSA-style bootstrap actions, a conservative on-policy Q-function and a penalty term that matches the on-policy and off-policy actions are jointly constructed to regularize the optimal Q-function of the off-policy QL-style. This naturally equips off-policy QL-style policy evaluation with the intrinsic pessimistic conservatism of on-policy SARSA-style learning, thus facilitating a stable estimate of the Q-function. Even in the presence of sampling errors induced by limited data, the convergence of the Q-function learned by OPRAC and the controllability of the upper bound on the bias between the learned Q-function and its true Q-value are theoretically guaranteed. In addition, the sub-optimality of the learned optimal policy stems solely from sampling errors. Experiments on the well-known D4RL Gym-MuJoCo benchmark demonstrate that OPRAC rapidly learns robust and effective task-solving policies owing to its stable Q-value estimates, outperforming state-of-the-art offline RL methods by at least 15%.
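The sketch below illustrates the kind of on-policy regularized policy evaluation the abstract describes: an off-policy (QL-style) bootstrap target is blended with an on-policy (SARSA-style) target built from the dataset's logged next action, and the QL-style bootstrap is penalized where the policy's action deviates from the logged one. This is a minimal PyTorch illustration based only on the abstract; the network class (MLP), the mixing weight lambda_reg, the penalty weight eta_pen, and the exact form of the penalty are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of on-policy regularized policy evaluation (not the
# authors' code): blend a QL-style target with a SARSA-style target and
# penalize the QL bootstrap where the policy strays from the logged action.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)

def regularized_critic_loss(critic, target_critic, actor, batch,
                            gamma=0.99, lambda_reg=0.5, eta_pen=1.0):
    """Hypothetical critic loss: mix an off-policy (QL-style) target with an
    on-policy (SARSA-style) target built from the dataset's next action."""
    s, a, r, s2, a2, done = (batch[k] for k in ("s", "a", "r", "s2", "a2", "done"))

    with torch.no_grad():
        pi_a2 = actor(s2)
        # Penalty for mismatch between the policy's action and the logged
        # next action (a proxy for out-of-distribution bootstrapping risk).
        penalty = eta_pen * ((pi_a2 - a2) ** 2).sum(dim=-1, keepdim=True)
        # QL-style bootstrap with the current policy's action, made conservative.
        q_ql = target_critic(torch.cat([s2, pi_a2], dim=-1)) - penalty
        # SARSA-style bootstrap with the dataset's own next action.
        q_sarsa = target_critic(torch.cat([s2, a2], dim=-1))
        # Regularized target: interpolate toward the conservative SARSA value.
        target = r + gamma * (1.0 - done) * (
            (1.0 - lambda_reg) * q_ql + lambda_reg * q_sarsa)

    q = critic(torch.cat([s, a], dim=-1))
    return F.mse_loss(q, target)
```

As a hypothetical usage, critic = MLP(state_dim + action_dim, 1), actor = MLP(state_dim, action_dim), target_critic is a slowly updated copy of critic, and batch is a dictionary of D4RL transitions that includes the logged next action a2. Because the targets are computed under torch.no_grad(), the returned loss backpropagates through the critic only.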

     

  • [1]
    R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” Cambridge, MA, USA: MIT Press, 2018.
    [2]
    Y. Cheng, L. Huang, C. L. P. Chen, and X. Wang, “Robust Actor-Critic with relative entropy regulating actor,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 11, pp. 9054–9063, Nov. 2023. doi: 10.1109/TNNLS.2022.3155483
    [3]
    D. Wang, N. Gao, D. Liu, J. Li, and F. L. Lewis, “Recent progress in reinforcement learning and adaptive dynamic programming for advanced control applications,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 1, pp. 18–36, Jan. 2024.
    [4]
    J. Wang, Y. Hong, J. Wang, J. Xu, Y. Tang, Q. L. Han, and J. Kurths, “Cooperative and competitive multi-agent systems: From optimization to games,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 5, pp. 763–783, May 2022. doi: 10.1109/JAS.2022.105506
    [5]
    Y. Yang, H. Modares, K. G. Vamvoudakis, and F. L. Lewis, “Cooperative finitely excited learning for dynamical games,” IEEE Trans. Cybern., vol. 54, no. 2, pp. 797–810, Feb. 2024.
    [6]
    Y. Cheng, L. Huang, and X. Wang, “Authentic boundary proximal policy optimization,” IEEE Trans. Cybern., vol. 52, no. 9, pp. 9428–9438, Sep. 2022. doi: 10.1109/TCYB.2021.3051456
    [7]
    Y. Yang, Z. Ding, R. Wang, H. Modares, and D. C. Wunsch, “Data-driven human-robot interaction without velocity measurement using off-policy reinforcement learning,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 1, pp. 47–63, Jan. 2022. doi: 10.1109/JAS.2021.1004258
    [8]
    X. Wang, L. Wang, C. Dong, H. Ren, and K. Xing, “An online deep reinforcement learning-based order recommendation framework for rider-centered food delivery system,” IEEE Trans. Intell. Transp. Syst., vol. 24, no. 5, pp. 5640–5654, May 2023. doi: 10.1109/TITS.2023.3237580
    [9]
    Y. Yang, B. Kiumarsi, H. Modares, and C. Xu, “Model-free λ-policy iteration for discrete-time linear quadratic regulation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 2, pp. 635–649, Feb. 2023. doi: 10.1109/TNNLS.2021.3098985
    [10]
    C. Mu, Y. Zhang, G. Cai, R. Liu, and C. Sun, “A data-based feedback relearning algorithm for uncertain nonlinear systems,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 5, pp. 1288–1303, May 2023. doi: 10.1109/JAS.2023.123186
    [11]
    Z. Peng, C. Han, Y. Liu, and Z. Zhou, “Weighted policy constraints for offline reinforcement learning,” in Proc. AAAI Conf. Artif. Intell., 2023, pp. 9435–9443.
    [12]
    S. Levine, A. Kumar, G. Tucker, and J. Fu, “Offline reinforcement learning: Tutorial, review, and perspectives on open problems,” arXiv preprint arXiv: 2005.01643, 2020.
    [13]
    S. Lange, T. Gabel, and M. Riedmiller, “Batch reinforcement learning,” Reinforcement Learning: State-of-the-Art, Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 45–73.
    [14]
    R. F. Prudencio, M. R. Maximo, and E. L. Colombini, “A survey on offline reinforcement learning: Taxonomy, review, and open problems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 8, pp. 10237–10257, Aug. 2024.
    [15]
    A. A. Abdellatif, N. Mhaisen, A. Mohamed, A. Erbad, and M. Guizani, “Reinforcement learning for intelligent healthcare systems: A review of challenges, applications, and open research issues,” IEEE Internet Things J., vol. 10, no. 24, pp. 21982–22007, Dec. 2023. doi: 10.1109/JIOT.2023.3288050
    [16]
    Q. Zhang, Y. Gao, Y. Zhang, Y. Guo, D. Ding, Y. Wang, P. Sun, and D. Zhao, “TrajGen: Generating realistic and diverse trajectories with reactive and feasible agent behaviors for autonomous driving,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 24474–24487, Dec. 2022. doi: 10.1109/TITS.2022.3202185
    [17]
    J. Wang, Q. Zhang, and D. Zhao, “Highway lane change decision-making via attention-based deep reinforcement learning,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 3, pp. 567–569, Mar. 2022. doi: 10.1109/JAS.2021.1004395
    [18]
    H. Wang, Z. Liu, Z. Han, Y. Wu, and D. Liu, “Rapid adaptation for active pantograph control in high-speed railway via deep meta reinforcement learning,” IEEE Trans. Cybern., vol. 54, no. 5, pp. 2811–2823, May 2024.
    [19]
    K. Zhang, R. Su, and H. Zhang, “A novel resilient control scheme for a class of Markovian jump systems with partially unknown information,” IEEE Trans. Cybern., vol. 52, no. 8, pp. 8191–8200, Aug. 2022. doi: 10.1109/TCYB.2021.3050619
    [20]
    C. Mu, Z. Liu, J. Yan, H. Jia, and X. Zhang, “Graph multi-agent reinforcement learning for inverter-based active voltage control,” IEEE Trans. Smart Grid, vol. 15, no. 2, pp. 1399–1409, Mar. 2024.
    [21]
    H. Ren, H. Ma, H. Li, and Z. Wang, “Adaptive fixed-time control of nonlinear MASs with actuator faults,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 5, pp. 1252–1262, May 2023. doi: 10.1109/JAS.2023.123558
    [22]
    H. Ma, H. Ren, Q. Zhou, H. Li, and Z. Wang, “Observer-based neural control of N-link flexible-joint robots,” IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 4, pp. 5295–5305, Apr. 2024.
    [23]
    S. Tosatto, J. Carvalho, and J. Peters, “Batch reinforcement learning with a nonparametric off-policy policy gradient,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 5996–6010, Oct. 2022. doi: 10.1109/TPAMI.2021.3088063
    [24]
    X. Hu, Y. Ma, C. Xiao, Y. Zheng, and J. Hao, “In-sample policy iteration for offline reinforcement learning,” arXiv preprint arXiv: 2306.05726, 2023.
    [25]
    S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 2052–2062.
    [26]
    S. Fujimoto and S. Gu, “A minimalist approach to offline reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 20132–20145.
    [27]
    A. Kumar, A. Zhou, G.Tucker, and S. Levine, “Conservative Q-learning for offline reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 1179–1191.
    [28]
    L. Huang, B. Dong, J. Lu, and W. Zhang, “Mild policy evaluation for offline Actor-Critic,” IEEE Trans. Neural Netw. Learn. Syst., early access, Sept. 2023, DOI: 10.1109/TNNLS.2023.3309906.
    [29]
    I. Kostrikov, A. Nair, and S. Levine, “Offline reinforcement learning with implicit Q-learning,” arXiv preprint arXiv: 2110.06169, 2021.
    [30]
    A. Kumar, J. Fu, M Soh, G. Tucker, and S. Levine, “Stabilizing off-policy Q-learning via bootstrapping error reduction,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 11784–11794.
    [31]
    Y. Wu, T. George, and O. Nachum, “Behavior regularized offline reinforcement learning,” arXiv preprint arXiv: 1911.11361, 2019.
    [32]
    F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from observation,” in Proc. Joint Conf. Artif. Intell., 2018, pp. 4950–4957.
    [33]
    A. Nair, Dalal M, Gupta A, and S. Levine, “Accelerating online reinforcement learning with offline datasets,” arXiv preprint arXiv: 2006.09359.
    [34]
    P. Daoudi, M. Barlier, L. D. Santos, and V. Virmaux, “Density estimation for conservative Q-learning,” in Proc. ICLR Workshop Generalizable Policy Learning in Physical World, 2022.
    [35]
    J. Lyu, X. Ma, X. Li, and Z. Lu, “Mildly conservative Q-learning for offline reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 1711–1724.
    [36]
    Y. Wu, S. Zhai, N. Srivastava, J. Susskind, J. Zhang, R. Salakhutdinov, and H. Goh, “Uncertainty weighted Actor-Critic for offline reinforcement learning,” arXiv preprint arXiv: 2105.08140, 2021.
    [37]
    C. Bai, L. Wang, Z. Yang, Z. Deng, A. Garg, P. Liu, and Z. Wang, “Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning,” arXiv preprint arXiv: 2202.11566, 2022.
    [38]
    H. Xu, L. Jiang, J. Li, Z. Yang, Z. Wang, V. W. K. Chan, and X. Zhan, “Offline RL with no OOD actions: In-sample learning via implicit value regularization,” arXiv preprint arXiv: 2303.15810, 2023.
    [39]
    C. Xiao, H. Wang, Y. Pan, A. White, and M. White, “The in-sample softmax for offline reinforcement learning,” arXiv preprint arXiv: 2302.14372, 2023.
    [40]
    D. Brandfonbrener, W Whitney, R. Ranganath, and J. Bruna, “Offline rl without off-policy evaluation,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 4933–4946.
    [41]
    Z. Zhuang, K. Lei, J. Liu, D. Wang, and Y. Guo, “Behavior proximal policy optimization,” arXiv preprint arXiv: 2302.11312, 2023.
    [42]
    L. Shi, R. Dadashi, Y. Chi, P. S. Castro, and M. Geist, “Offline reinforcement learning with on-policy Q-function regularization,” arXiv preprint arXiv: 2307.13824, 2023.
    [43]
    J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine, “D4RL: Datasets for deep data-driven reinforcement learning,” arXiv preprint arXiv: 2004.07219, 2020.
    [44]
    E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in Proc. Int. Conf. Intell. Rob. Syst., 2012, pp. 5026–5033.
    [45]
    A. Farahmand, C. Szepesvári, and R. Munos, “Error propagation for approximate policy and value iteration,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2010, pp. 568–576.
    [46]
    L. Huang, B. Dong, and W. Zhang, “Efficient offline reinforcement learning with relaxed conservatism,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 8, pp. 5260–5272, Aug. 2024.
    [47]
    C. Dann, Y. Mansour, M. Mohri, A. Sekhari, and K. Sridharan, “Guarantees for epsilon-greedy reinforcement learning with function approximation,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 4666–4689.
    [48]
    T. Xu, Z. Li, and Y. Yu, “Error bounds of imitating policies and environments for reinforcement learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 6968–6980, Oct. 2022. doi: 10.1109/TPAMI.2021.3096966
    [49]
    S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1587–1596.
    [50]
    D. Silver and G. Tesauro, “Monte-Carlo simulation balancing,” in Proc. Int. Conf. Mach. Learn., 2009, pp. 945–952.





    Highlights

    • Develop an on-policy regularized policy evaluation method for offline RL
    • Integrate the intrinsic conservatism of on-policy SARSA into off-policy Q-learning
    • Promote stable Q-function estimation without accessing a pre-trained behavior policy
