Citation: M. Zhao, D. Wang, S. Song, and J. Qiao, “Safe Q-learning for data-driven nonlinear optimal control with asymmetric state constraints,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 12, pp. 2408–2422, Dec. 2024. doi: 10.1109/JAS.2024.124509