Citation: S. Cao, X. Wang, and Y. Cheng, “Robust offline actor-critic with on-policy regularized policy evaluation,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 12, pp. 2497–2511, Dec. 2024. doi: 10.1109/JAS.2024.124494
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[2] Y. Cheng, L. Huang, C. L. P. Chen, and X. Wang, “Robust Actor-Critic with relative entropy regulating actor,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 11, pp. 9054–9063, Nov. 2023. doi: 10.1109/TNNLS.2022.3155483
[3] D. Wang, N. Gao, D. Liu, J. Li, and F. L. Lewis, “Recent progress in reinforcement learning and adaptive dynamic programming for advanced control applications,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 1, pp. 18–36, Jan. 2024.
[4] J. Wang, Y. Hong, J. Wang, J. Xu, Y. Tang, Q. L. Han, and J. Kurths, “Cooperative and competitive multi-agent systems: From optimization to games,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 5, pp. 763–783, May 2022. doi: 10.1109/JAS.2022.105506
[5] Y. Yang, H. Modares, K. G. Vamvoudakis, and F. L. Lewis, “Cooperative finitely excited learning for dynamical games,” IEEE Trans. Cybern., vol. 54, no. 2, pp. 797–810, Feb. 2024.
[6] Y. Cheng, L. Huang, and X. Wang, “Authentic boundary proximal policy optimization,” IEEE Trans. Cybern., vol. 52, no. 9, pp. 9428–9438, Sep. 2022. doi: 10.1109/TCYB.2021.3051456
[7] Y. Yang, Z. Ding, R. Wang, H. Modares, and D. C. Wunsch, “Data-driven human-robot interaction without velocity measurement using off-policy reinforcement learning,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 1, pp. 47–63, Jan. 2022. doi: 10.1109/JAS.2021.1004258
[8] X. Wang, L. Wang, C. Dong, H. Ren, and K. Xing, “An online deep reinforcement learning-based order recommendation framework for rider-centered food delivery system,” IEEE Trans. Intell. Transp. Syst., vol. 24, no. 5, pp. 5640–5654, May 2023. doi: 10.1109/TITS.2023.3237580
[9] Y. Yang, B. Kiumarsi, H. Modares, and C. Xu, “Model-free λ-policy iteration for discrete-time linear quadratic regulation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 2, pp. 635–649, Feb. 2023. doi: 10.1109/TNNLS.2021.3098985
[10] C. Mu, Y. Zhang, G. Cai, R. Liu, and C. Sun, “A data-based feedback relearning algorithm for uncertain nonlinear systems,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 5, pp. 1288–1303, May 2023. doi: 10.1109/JAS.2023.123186
[11] Z. Peng, C. Han, Y. Liu, and Z. Zhou, “Weighted policy constraints for offline reinforcement learning,” in Proc. AAAI Conf. Artif. Intell., 2023, pp. 9435–9443.
[12] S. Levine, A. Kumar, G. Tucker, and J. Fu, “Offline reinforcement learning: Tutorial, review, and perspectives on open problems,” arXiv preprint arXiv:2005.01643, 2020.
[13] S. Lange, T. Gabel, and M. Riedmiller, “Batch reinforcement learning,” in Reinforcement Learning: State-of-the-Art. Berlin, Heidelberg: Springer, 2012, pp. 45–73.
[14] R. F. Prudencio, M. R. Maximo, and E. L. Colombini, “A survey on offline reinforcement learning: Taxonomy, review, and open problems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 8, pp. 10237–10257, Aug. 2024.
[15] A. A. Abdellatif, N. Mhaisen, A. Mohamed, A. Erbad, and M. Guizani, “Reinforcement learning for intelligent healthcare systems: A review of challenges, applications, and open research issues,” IEEE Internet Things J., vol. 10, no. 24, pp. 21982–22007, Dec. 2023. doi: 10.1109/JIOT.2023.3288050
[16] Q. Zhang, Y. Gao, Y. Zhang, Y. Guo, D. Ding, Y. Wang, P. Sun, and D. Zhao, “TrajGen: Generating realistic and diverse trajectories with reactive and feasible agent behaviors for autonomous driving,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 24474–24487, Dec. 2022. doi: 10.1109/TITS.2022.3202185
[17] J. Wang, Q. Zhang, and D. Zhao, “Highway lane change decision-making via attention-based deep reinforcement learning,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 3, pp. 567–569, Mar. 2022. doi: 10.1109/JAS.2021.1004395
[18] H. Wang, Z. Liu, Z. Han, Y. Wu, and D. Liu, “Rapid adaptation for active pantograph control in high-speed railway via deep meta reinforcement learning,” IEEE Trans. Cybern., vol. 54, no. 5, pp. 2811–2823, May 2024.
[19] K. Zhang, R. Su, and H. Zhang, “A novel resilient control scheme for a class of Markovian jump systems with partially unknown information,” IEEE Trans. Cybern., vol. 52, no. 8, pp. 8191–8200, Aug. 2022. doi: 10.1109/TCYB.2021.3050619
[20] C. Mu, Z. Liu, J. Yan, H. Jia, and X. Zhang, “Graph multi-agent reinforcement learning for inverter-based active voltage control,” IEEE Trans. Smart Grid, vol. 15, no. 2, pp. 1399–1409, Mar. 2024.
[21] H. Ren, H. Ma, H. Li, and Z. Wang, “Adaptive fixed-time control of nonlinear MASs with actuator faults,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 5, pp. 1252–1262, May 2023. doi: 10.1109/JAS.2023.123558
[22] H. Ma, H. Ren, Q. Zhou, H. Li, and Z. Wang, “Observer-based neural control of N-link flexible-joint robots,” IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 4, pp. 5295–5305, Apr. 2024.
[23] S. Tosatto, J. Carvalho, and J. Peters, “Batch reinforcement learning with a nonparametric off-policy policy gradient,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 5996–6010, Oct. 2022. doi: 10.1109/TPAMI.2021.3088063
[24] X. Hu, Y. Ma, C. Xiao, Y. Zheng, and J. Hao, “In-sample policy iteration for offline reinforcement learning,” arXiv preprint arXiv:2306.05726, 2023.
[25] S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 2052–2062.
[26] S. Fujimoto and S. Gu, “A minimalist approach to offline reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 20132–20145.
[27] A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative Q-learning for offline reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 1179–1191.
[28] L. Huang, B. Dong, J. Lu, and W. Zhang, “Mild policy evaluation for offline Actor-Critic,” IEEE Trans. Neural Netw. Learn. Syst., early access, Sep. 2023. doi: 10.1109/TNNLS.2023.3309906
[29] I. Kostrikov, A. Nair, and S. Levine, “Offline reinforcement learning with implicit Q-learning,” arXiv preprint arXiv:2110.06169, 2021.
[30] A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine, “Stabilizing off-policy Q-learning via bootstrapping error reduction,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 11784–11794.
[31] Y. Wu, G. Tucker, and O. Nachum, “Behavior regularized offline reinforcement learning,” arXiv preprint arXiv:1911.11361, 2019.
[32] F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from observation,” in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 4950–4957.
[33] A. Nair, M. Dalal, A. Gupta, and S. Levine, “Accelerating online reinforcement learning with offline datasets,” arXiv preprint arXiv:2006.09359, 2020.
[34] P. Daoudi, M. Barlier, L. D. Santos, and V. Virmaux, “Density estimation for conservative Q-learning,” in Proc. ICLR Workshop Generalizable Policy Learning in Physical World, 2022.
[35] J. Lyu, X. Ma, X. Li, and Z. Lu, “Mildly conservative Q-learning for offline reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 1711–1724.
[36] Y. Wu, S. Zhai, N. Srivastava, J. Susskind, J. Zhang, R. Salakhutdinov, and H. Goh, “Uncertainty weighted Actor-Critic for offline reinforcement learning,” arXiv preprint arXiv:2105.08140, 2021.
[37] C. Bai, L. Wang, Z. Yang, Z. Deng, A. Garg, P. Liu, and Z. Wang, “Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning,” arXiv preprint arXiv:2202.11566, 2022.
[38] H. Xu, L. Jiang, J. Li, Z. Yang, Z. Wang, V. W. K. Chan, and X. Zhan, “Offline RL with no OOD actions: In-sample learning via implicit value regularization,” arXiv preprint arXiv:2303.15810, 2023.
[39] C. Xiao, H. Wang, Y. Pan, A. White, and M. White, “The in-sample softmax for offline reinforcement learning,” arXiv preprint arXiv:2302.14372, 2023.
[40] D. Brandfonbrener, W. Whitney, R. Ranganath, and J. Bruna, “Offline RL without off-policy evaluation,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 4933–4946.
[41] Z. Zhuang, K. Lei, J. Liu, D. Wang, and Y. Guo, “Behavior proximal policy optimization,” arXiv preprint arXiv:2302.11312, 2023.
[42] L. Shi, R. Dadashi, Y. Chi, P. S. Castro, and M. Geist, “Offline reinforcement learning with on-policy Q-function regularization,” arXiv preprint arXiv:2307.13824, 2023.
[43] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine, “D4RL: Datasets for deep data-driven reinforcement learning,” arXiv preprint arXiv:2004.07219, 2020.
[44] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in Proc. Int. Conf. Intell. Rob. Syst., 2012, pp. 5026–5033.
[45] A. Farahmand, C. Szepesvári, and R. Munos, “Error propagation for approximate policy and value iteration,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2010, pp. 568–576.
[46] L. Huang, B. Dong, and W. Zhang, “Efficient offline reinforcement learning with relaxed conservatism,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 8, pp. 5260–5272, Aug. 2024.
[47] C. Dann, Y. Mansour, M. Mohri, A. Sekhari, and K. Sridharan, “Guarantees for epsilon-greedy reinforcement learning with function approximation,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 4666–4689.
[48] T. Xu, Z. Li, and Y. Yu, “Error bounds of imitating policies and environments for reinforcement learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 6968–6980, Oct. 2022. doi: 10.1109/TPAMI.2021.3096966
[49] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1587–1596.
[50] D. Silver and G. Tesauro, “Monte-Carlo simulation balancing,” in Proc. Int. Conf. Mach. Learn., 2009, pp. 945–952.