A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 11 Issue 12
Dec. 2024

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 15.3, Top 1 (SCI Q1)
  • CiteScore: 23.5, Top 2% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: M. Zhao, D. Wang, S. Song, and J. Qiao, “Safe Q-learning for data-driven nonlinear optimal control with asymmetric state constraints,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 12, pp. 2408–2422, Dec. 2024. doi: 10.1109/JAS.2024.124509

Safe Q-Learning for Data-Driven Nonlinear Optimal Control With Asymmetric State Constraints

doi: 10.1109/JAS.2024.124509
Funds: This work was supported in part by the National Science and Technology Major Project (2021ZD0112302) and the National Natural Science Foundation of China (62222301, 61890930-5, 62021003).
More Information
  • This article develops a novel data-driven safe Q-learning method to design a safe optimal controller that guarantees the constrained states of nonlinear systems always remain in the safe region while providing optimal performance. First, we design an augmented utility function consisting of an adjustable positive definite control obstacle function and a quadratic form of the next state to ensure safety and optimality (a rough illustration of this form is sketched below). Second, by exploiting a pre-designed admissible policy for initialization, an off-policy stabilizing value iteration Q-learning (SVIQL) algorithm is presented to seek the safe optimal policy using offline data collected within the safe region rather than the mathematical model. Third, the monotonicity, safety, and optimality of the SVIQL algorithm are theoretically proven. To obtain the initial admissible policy for SVIQL, an offline VIQL algorithm with zero initialization is constructed and a new admissibility criterion is established for immature iterative policies. Moreover, critic and action networks with precise approximation ability are established to support the operation of the VIQL and SVIQL algorithms. Finally, three simulation experiments are conducted to demonstrate the effectiveness and superiority of the developed safe Q-learning method.
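
As a rough, illustrative sketch only (the notation below is ours and does not reproduce the paper's exact definitions), an augmented utility of the kind described above can be written as

$$ U(x_k, u_k) = x_k^{\top} Q x_k + u_k^{\top} R u_k + \lambda\, B(x_{k+1}) + x_{k+1}^{\top} P x_{k+1}, $$

where $Q$, $R$, and $P$ are positive definite weighting matrices, $B(\cdot)$ is a positive definite barrier-type term that vanishes only at the equilibrium and grows without bound as the state approaches the boundary of the asymmetric safe set, and $\lambda > 0$ is the adjustable weight. The associated iterative Q-learning recursion then takes the standard deterministic form

$$ Q^{(i+1)}(x_k, u_k) = U(x_k, u_k) + \min_{u} Q^{(i)}(x_{k+1}, u), \qquad u_k^{(i)} = \arg\min_{u} Q^{(i)}(x_k, u). $$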

     




    Highlights

    • A novel augmented utility function is established by incorporating an adjustable positive definite control barrier function (CBF) and a quadratic form of the next state. First, the proposed augmented utility function better ensures the safety of the next state when the current state is within the safe region. Second, the adjustable CBF is positive definite and equals zero only at the equilibrium point. Finally, comparison results under six utility functions are provided to demonstrate the superiority of the proposed augmented utility function.
    • The SVIQL algorithm is proposed for the first time to pursue the safe optimal control policy by utilizing filtered offline data without the system model. SVIQL draws on the advantages of PIQL, with its stable iterative policies, and of VIQL, with its simple policy evaluation. Hence, each iterative policy resulting from SVIQL is valid, and the computational complexity is lower than that of PIQL.
    • The monotonicity, safety, and optimality of the SVIQL algorithm are analyzed. By means of a pre-designed initial admissible policy, we theoretically show that the iterative Q-function sequence is bounded and monotonically nonincreasing, which guarantees that the system states always remain in the safe region.
    • Previous literature does not describe how to acquire the initial safe and stable policy for unknown nonlinear systems when only offline data are available. Hence, we provide a new stability criterion for the offline VIQL algorithm with zero initialization to attain the initial stable and safe policy required by SVIQL (a minimal illustrative sketch of such an offline VIQL loop follows this list).
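
The following is a minimal, self-contained sketch (not the authors' implementation) of the offline, zero-initialized value-iteration Q-learning loop described above. It assumes a quadratic-plus-barrier utility, a dataset of transitions collected inside an asymmetric safe region, a simple quadratic-feature critic fitted by least squares, and a finite candidate-action grid; the names barrier, utility, features, and offline_viql are ours. SVIQL would reuse the same loop but start from a pre-designed admissible policy rather than the zero Q-function.

```python
import numpy as np

def barrier(x, x_lo, x_hi, lam=0.1):
    """Barrier-style penalty for asymmetric bounds x_lo < x < x_hi (elementwise).
    Zero at the origin, positive elsewhere, unbounded near either bound.
    A simple stand-in for the paper's adjustable control barrier function."""
    b = np.where(x >= 0.0, -np.log(1.0 - x / x_hi), -np.log(1.0 - x / x_lo))
    return lam * np.sum(b)

def utility(x, u, x_next, Q, R, P, x_lo, x_hi):
    """Quadratic cost on (x, u) plus barrier and quadratic terms on the next state."""
    return x @ Q @ x + u @ R @ u + barrier(x_next, x_lo, x_hi) + x_next @ P @ x_next

def features(x, u):
    """Quadratic critic basis: upper-triangular products of the stacked vector [x; u]."""
    z = np.concatenate([x, u])
    return np.outer(z, z)[np.triu_indices(z.size)]

def offline_viql(data, Q, R, P, x_lo, x_hi, u_grid, iters=50):
    """Offline VIQL with zero initialization: alternate least-squares critic fits
    (policy evaluation) with greedy minimization over a candidate-action grid
    (policy improvement), using only stored safe transitions (x, u, x_next)."""
    w = np.zeros(features(data[0][0], data[0][1]).size)  # zero initial Q-function
    for _ in range(iters):
        Phi, targets = [], []
        for x, u, x_next in data:
            q_next = min(features(x_next, ua) @ w for ua in u_grid)  # greedy target
            Phi.append(features(x, u))
            targets.append(utility(x, u, x_next, Q, R, P, x_lo, x_hi) + q_next)
        w, *_ = np.linalg.lstsq(np.asarray(Phi), np.asarray(targets), rcond=None)
    return w  # greedy policy: u(x) = argmin over u_grid of features(x, u) @ w
```

For instance, with x_lo = np.array([-1.0]) and x_hi = np.array([3.0]), the barrier term is finite only inside the asymmetric interval (-1, 3), so transitions whose next state leaves that interval are heavily penalized, mirroring the role of the adjustable CBF in the augmented utility.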
