Xueli Wang, Derui Ding, Hongli Dong, and Xian-Ming Zhang, "Neural-Network-Based Control for Discrete-Time Nonlinear Systems with Input Saturation Under Stochastic Communication Protocol," IEEE/CAA J. Autom. Sinica, vol. 8, no. 4, pp. 766-778, Apr. 2021. doi: 10.1109/JAS.2021.1003922

Neural-Network-Based Control for Discrete-Time Nonlinear Systems with Input Saturation Under Stochastic Communication Protocol

doi: 10.1109/JAS.2021.1003922
Funds:  This work was supported in part by the Australian Research Council Discovery Early Career Researcher Award (DE200101128), and Australian Research Council (DP190101557)
Abstract: In this paper, an adaptive dynamic programming (ADP) strategy is investigated for discrete-time nonlinear systems with unknown nonlinear dynamics subject to input saturation. To save the communication resources between the controller and the actuators, stochastic communication protocols (SCPs) are adopted to schedule the control signal, and therefore the closed-loop system is essentially a protocol-induced switching system. A neural network (NN)-based identifier with a robust term is exploited for approximating the unknown nonlinear system, and a set of switch-based updating rules for the NN weights, involving an additional tunable parameter, is developed with the help of gradient descent. By virtue of a novel Lyapunov function, a sufficient condition is proposed to achieve the stability of both the system identification errors and the update dynamics of the NN weights. Then, an offline value-iteration ADP algorithm is proposed to solve the optimal control of protocol-induced switching systems with saturation constraints, and its convergence is discussed in depth in light of mathematical induction. Furthermore, an actor-critic NN scheme is developed to approximate the control law and the proposed performance index function in the framework of ADP, and the stability of the closed-loop system is analyzed in view of Lyapunov theory. Finally, numerical simulation results are presented to demonstrate the effectiveness of the proposed control scheme.


OPTIMAL control has been one of the main focuses of the control field due to its wide applications in various emerging industrial systems, such as electrical power systems, industrial control systems, and spacecraft attitude control systems [1]-[7]. Solving an optimal control problem is usually equivalent to solving the well-known Hamilton-Jacobi-Bellman (HJB) equation, which is a critical challenge for nonlinear systems [8]. Fortunately, the adaptive dynamic programming (ADP) algorithm, as an efficient tool, has been developed to address various suboptimal control issues with known or unknown system dynamics [9]-[11], by virtue of both its ability to effectively approximate correlation functions and the characteristics of iterative forward transfer. The main idea of ADP algorithms is to utilize two function sequences to iteratively approximate the cost and value functions corresponding to the solution of the HJB equation in a forward-in-time manner [12]. It should be pointed out that the value iteration technique developed in [13], [14] is one of the most important iterative ADP algorithms, and its convergence has been thoroughly discussed in [15]-[17]. Furthermore, some representative algorithms, including heuristic dynamic programming (HDP), dual heuristic dynamic programming (DHP), as well as globalized DHP, have been proposed and implemented for various control issues, benefiting from the well-known actor-critic structure; see [18]-[20]. It is noteworthy that the obtained controller is usually a suboptimal one because of the approximation errors inherent in such a structure, and therefore the corresponding control is also regarded as near-optimal control.

In engineering practice, actuator saturation is pervasive due mainly to facility protection or physical limits of the actuators. If the saturation of the actuator is not adequately considered, the performance of the closed-loop system is often severely degraded [21]. As a result, it is of tremendous significance to study the influence of the input saturation phenomenon. Under the framework of optimal control, a bounded and invertible one-to-one function in a nonquadratic performance functional is usually exploited to evaluate the cost of saturated inputs, and the analytical solution of the optimal controller can be obtained although it still depends on the cost functional [8], [22], [23]. Inspired by these works, near-optimal control for various networked control systems has been investigated and some interesting results have been preliminarily reported in the literature; see [24]-[27], for instance. Near-optimal regulation under the actor-critic framework has been investigated in [26] for discrete-time nonlinear systems subject to quantization effects, where the quantization errors can be eliminated via a dynamic quantizer with an adaptive step size. Furthermore, an online policy iteration algorithm has been presented in [28] to learn the optimal solution for a class of unknown constrained-input systems. Obviously, compared with the case without control constraints, near-optimal control subject to constrained inputs and various network-induced phenomena remains at an infant stage and thus requires further research efforts.

On another research frontier, the past few years have witnessed the persistent development of network technologies, which has been attracting recurring attention to networked control systems [29]. In order to effectively utilize the limited resource or reduce the switching frequency so as to prolong the service life of the equipment, only one (or a limited number of) sensor/control node, governed by various protocols, is permitted to access the communication network. These protocols include, but are not limited to, the round-robin protocol [30], the try-once-discard protocol [31], the stochastic communication protocol [32], and the event-triggered protocol [33], [34]. There is no doubt that the utilization of these protocols substantially increases the complexity and difficulty of both the stability analysis and the design of weight updating rules, which is the main reason why results on this topic are sparse. Very recently, consensus control with the help of reconstructed dynamics of the local tracking errors has been investigated in [35] for multi-agent systems with an event-triggered mechanism and input constraints, where the effect of the adopted triggering scheme on the local cost has been investigated. The critic and actor networks, combined with an identifier network, have been simultaneously designed in [27] to deal with a constrained-input control issue with unknown drift dynamics and event-triggered communication protocols. Unfortunately, so far, near-optimal control for discrete-time nonlinear systems subject to input saturation has not yet been adequately investigated, not to mention the case where the stochastic communication protocol (SCP) is also a concern, which constitutes the motivation of this paper.

    The addressed system with unknown nonlinear dynamics is essentially a protocol-induced switching system when SCP is employed to govern the data transmission or update between the controller and the actuator. Usually, SCP can be modeled by a Markov chain and the relative networked control issues can be effectively handled via the switching system theory combined with Lyapunov approaches. It is worth noting that this is a nontrivial topic for optimal control issues due mainly to the challenge of the cost function from such a switch. Recently, two typical approaches have been, respectively, developed in [36] via a combined cost function related to transition probabilities and in [37] via the dynamic programming principle [38]. However, when an identifier is designed to approximate the unknown nonlinear dynamics, there exists a great challenge to disclose the influence on the updating rules of the identifier’s weights and the identification errors. Furthermore, the convergence of the designed ADP algorithm and the practical execution with critic and actor networks should be further inspected. As such, motivated by the above discussions, the focus of this paper is to handle the neural networks (NN)-based near-optimal control problem for a discrete-time nonlinear system subject to constrained-inputs and SCPs. This appears to be nontrivial due to the following essential difficulties: 1) how to design an NN-based identifier under SCPs to estimate system dynamics, 2) how to perform the convergence analysis of the ADP algorithm, and 3) how to disclose the performance of the closed-loop system in the framework of critic and actor networks.

    In response to the above discussions, this paper is concerned with the near-optimal control problem for a class of discrete-time nonlinear systems with constrained inputs and SCPs, and hence its main contributions are highlighted as follows: 1) an NN-based identifier with a robust term is presented to approximate the unknown nonlinear system, where novel weight updating rules are constructed by virtue of an additional tunable parameter; 2) a set of conditions are derived to check the stability of both identification error dynamics and updated error dynamics of NN weights; 3) the convergence of proposed value iterative ADP algorithm, which solves the optimal control issue of protocol-induced switching systems with saturation constraints in an off-line way, is profoundly discussed in light of mathematical induction; and 4) an actor-critic NN scheme is employed to perform the addressed near-optimal control issue.

The rest of this paper is organized as follows: the problem formulation and preliminaries are presented in Section II. For the addressed control issue, four subsections are involved in Section III: an NN-based identifier with a robust modification term is designed in Section III-A to identify discrete-time systems with unknown nonlinear dynamics; the value iterative ADP algorithm with convergence analysis is developed in Section III-B; the implementation of the ADP algorithm with actor-critic networks is described in Section III-C; and the performance of the closed-loop system is discussed in Section III-D. Furthermore, a numerical example is given in Section IV to demonstrate the effectiveness of the proposed algorithms. Finally, the conclusion is given in Section V.

Notation: The notation used in this paper is standard. {\mathbb{N}} denotes the set of nonnegative integers. {\mathbb{R}}^{N} denotes the set of N-dimensional real vectors. For a matrix Q, Q^{T} and {\rm tr}\{Q\} denote the transpose and the trace of Q, respectively. {\rm diag}\{Q_1,Q_2,\ldots,Q_n\} stands for a block-diagonal matrix with the square matrices Q_i on the corresponding main diagonal blocks. For a vector x, \|x\| denotes the Euclidean norm.

In this paper, the investigated networked control system consists of a nonlinear plant, sensors, an identifier, a controller, as well as actuators. We assume that the system state x_k can be measured directly by sensors and then sent to the controller via shared networks. By using NNs, the identifier, along with the controller, is utilized to approximate the nonlinear system based on the received signals. To reduce the communication burden, SCPs are employed to schedule the information transmission between the controller and the actuator.

    Consider the unknown discrete-time nonlinear system with the following form:

x_{k+1} = f(x_k)+g(x_k)\bar{u}_k (1)

where x_k\in{\mathbb{R}}^{M} is the system state directly measured by sensors, \bar{u}_k = [\bar{u}_{1,k},\bar{u}_{2,k},\ldots,\bar{u}_{N,k}]^{T}\in{\mathbb{R}}^{N} is the actuator input scheduled by SCPs, and f(x_k) and g(x_k) are, respectively, the unknown nonlinear functions with f(0) = 0 and g(0) = 0. Assume that f+gu is Lipschitz continuous on a set \Omega_x containing the origin. Furthermore, the actuator input \bar{u}_k belongs to \Omega_u = \{\bar{u}_k \mid |\bar{u}_{i,k}|\leq\bar{u}\} where \bar{u} is a positive scalar representing the saturation level of the actuator.

In light of the unknown nonlinearities, an NN-based system identifier via x_k, which will be designed in the next section, needs to be adopted to obtain the ideal control signal u_k = [u_{1,k},u_{2,k},\ldots,u_{N,k}]^{T}. For convenience of analysis, the saturation constraint can be transferred from \bar{u}_k to the control signal u_k, that is, u_k\in\Omega_u. In what follows, SCP scheduling is performed to reduce the switching frequency and alleviate the communication burden between the controller and the actuator. To model this process, let us introduce the scheduling signal \xi_k\in\{1,2,\ldots,N\} to describe the selected element obtaining access to the network at time instant k. Under SCPs, \xi_k, a random variable, is modeled by a Markov chain with the known transition probability

p_{ij} = {\rm Prob}\{\xi_{k+1} = j|\xi_k = i\} (2)

where p_{ij}\geq 0 for i,j\in\{1,2,\ldots,N\} and \sum\nolimits_{j = 1}^{N}p_{ij} = 1. By means of the above variable, the signal \bar{u}_k received by the actuator is expressed as

\bar{u}_{i,k} = \left\{ {\begin{array}{*{20}{l}} u_{i,k}, & {\rm{if}}\;\; i = \xi_k \\ \bar{u}_{i,k-1}, & {\rm{otherwise}} \end{array}} \right. (3)

    where zero-order-holders are utilized in the viewpoint of practical engineering.

The actuator input can be further expressed as

\bar{u}_k = \Phi(\xi_k)u_k (4)

with \Phi(i) = {\rm diag}\{\sigma_{1i},\sigma_{2i},\ldots,\sigma_{Ni}\} where \sigma_{ni}\triangleq\sigma(i-n)\in\{0,1\}\;(n = 1,2,\ldots,N) is the Kronecker delta function, i.e., \sigma(i-n) equals 1 if i = n and equals 0 otherwise.

    Thus, the closed-loop system is as follows:

x_{k+1} = f(x_k)+g(x_k)\Phi(\xi_k)u_k := f_{\xi_k}(x_k)+\tilde{g}_{\xi_k}(x_k)u_k. (5)

    Remark 1: The main idea of SCP is to assign the access privilege for each node in a random manner. The “random switch” behavior of the node scheduling can be usually characterized by a Markov chain, see the corresponding research in [39]. Obviously, the addressed system (5) is essentially a protocol-induced switching system.
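To make the scheduling mechanism concrete, the following sketch simulates the Markov-chain scheduling (2) together with the zero-order-hold update (3); the transition matrix, horizon, and control values are illustrative assumptions rather than the paper's simulation setup.

```python
import random

# Toy simulation of SCP scheduling: xi_k evolves as a Markov chain (2), and
# only the scheduled actuator channel is refreshed, as in (3).
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.3, 0.3, 0.4]]          # assumed p_ij = Prob{xi_{k+1} = j | xi_k = i}
N = len(P)

def next_node(xi):
    """Sample xi_{k+1} from row xi of P (nodes are 1-based)."""
    u, acc = random.random(), 0.0
    for j, p in enumerate(P[xi - 1], start=1):
        acc += p
        if u < acc:
            return j
    return N                    # guard against floating-point round-off

def scp_update(u_bar_prev, u, xi):
    """Equation (3): the scheduled channel takes the fresh control input,
    the other channels are held by zero-order-holders."""
    return [u[n] if n == xi - 1 else u_bar_prev[n] for n in range(len(u))]

random.seed(0)
xi, u_bar = 1, [0.0] * N
for k in range(5):
    u = [0.5, -1.2, 0.8]        # controller output u_k (kept fixed here)
    u_bar = scp_update(u_bar, u, xi)
    xi = next_node(xi)
print(u_bar)                    # channels visited so far hold their last update
```

Only one channel changes per step, which is exactly why the closed-loop system behaves as a protocol-induced switching system.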

To quantify the control performance, the associated cost of each scheduling is employed as follows:

J_i(x_k) = \sum\limits_{j = k}^{\infty}\ell_i(x_j,u_j) = \sum\limits_{j = k}^{\infty}\left(Q_i(x_j)+S_i(u_j)\right) (6)

where \ell_i(x_j,u_j) is the cost function, in which Q_i(x_j) is positive and usually a quadratic function, and S_i(u_j) is generally a positive nonquadratic function to evaluate the constrained control input. In this paper, S_i(u_k) is selected as

S_i(u_k) = \int_{0}^{u_k}2\bar{u}\left(\tanh^{-1}(\nu/\bar{u})\right)^{T}R_i\,d\nu = \sum\limits_{n = 1}^{N}\int_{0}^{u_{n,k}}2\bar{u}\tanh^{-1}(\nu_n/\bar{u})\,r_{i,n}\,d\nu_n (7)

where \tanh(\nu/\bar{u}) stands for the hyperbolic tangent function; \nu = [\nu_1,\nu_2,\ldots,\nu_N]^{T} is the integration vector; and R_i\triangleq{\rm diag}\{r_{i,1},r_{i,2},\ldots,r_{i,N}\} is a known positive definite diagonal matrix with appropriate dimension. The operator \tanh^{-1}(\nu/\bar{u}) means

\tanh^{-1}(\nu/\bar{u}) = [\tanh^{-1}(\nu_1/\bar{u}),\ldots,\tanh^{-1}(\nu_N/\bar{u})]^{T}.

Following the same approach as in [27], S_i(u_k) can also be expressed as

S_i(u_k) = 2\bar{u}u_k^{T}R_i\tanh^{-1}\left(\frac{u_k}{\bar{u}}\right)+\bar{u}^{2}\bar{R}_i\ln\left(1-\frac{u_k^{2}}{\bar{u}^{2}}\right)

where \bar{R}_i\triangleq[r_{i,1},r_{i,2},\ldots,r_{i,N}].
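As a numerical sanity check on the two expressions for S_i(u_k), the scalar-channel integral in (7) can be compared against the closed form; the saturation level, weight, and input below are assumed values.

```python
import math

# Scalar-channel check that the integral form of S_i(u_k) matches the
# closed form 2*ubar*r*u*atanh(u/ubar) + ubar^2*r*ln(1 - u^2/ubar^2).
ubar, r, u = 1.0, 0.5, 0.7      # assumed saturation level, weight, and input

def S_closed(u):
    return (2 * ubar * r * u * math.atanh(u / ubar)
            + ubar**2 * r * math.log(1 - (u / ubar) ** 2))

def S_numeric(u, steps=100_000):
    h = u / steps               # midpoint rule on the integrand 2*ubar*atanh(nu/ubar)*r
    return h * sum(2 * ubar * r * math.atanh((i + 0.5) * h / ubar)
                   for i in range(steps))

print(S_closed(u), S_numeric(u))  # the two values agree closely
```

Note that the cost is positive for nonzero inputs and grows steeply as |u| approaches the saturation level, which is what penalizes near-saturated control.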

Remark 2: In the framework of optimal control, the term S_i(u_k) in the associated cost should satisfy the following three conditions: 1) it is a continuous and positive function for performance evaluation; 2) it is monotonic in each component; and 3) it is a differentiable function whose derivative is invertible so that the analytic solution of the optimal control law can be obtained. Obviously, the adopted S_i(u_k) in (7) satisfies all three conditions and is thus well suited to the control u_k subject to input saturation.

In order to disclose the effect of the statistical characteristics of SCPs, similar to the scheme in [36], we reconstruct the performance index function (6) by embedding the transition probability matrix as follows:

\left\{ {\begin{aligned} &J^{I}(x_k) = p_{11}J_1(x_k)+p_{12}J_2(x_k)+\cdots+p_{1N}J_N(x_k) \\ &J^{II}(x_k) = p_{21}J_1(x_k)+p_{22}J_2(x_k)+\cdots+p_{2N}J_N(x_k) \\ &\qquad\vdots \\ &J^{N}(x_k) = p_{N1}J_1(x_k)+p_{N2}J_2(x_k)+\cdots+p_{NN}J_N(x_k). \end{aligned}} \right.

    By virtue of the weighted sum technique, a combined performance index is constructed as

J(x_k) = \lambda_1J^{I}(x_k)+\lambda_2J^{II}(x_k)+\cdots+\lambda_NJ^{N}(x_k) (8)

where \lambda_i>0\;(i = 1,2,\ldots,N) are the weighting scalars satisfying \sum\nolimits_{i = 1}^{N}\lambda_i = 1.

    Define

\Gamma = [\Gamma_1,\Gamma_2,\ldots,\Gamma_N]^{T},\quad L(x_k,u_k) = [\ell_1(x_k,u_k),\ldots,\ell_N(x_k,u_k)]^{T}

where \Gamma_i = \sum\nolimits_{s = 1}^{N}\lambda_sp_{si}>0. It follows from (6) and (8) that

J(x_k) = \sum\limits_{i = 1}^{N}\Gamma_iJ_i(x_k) = \sum\limits_{i = 1}^{N}\Gamma_i\ell_i(x_k,u_k)+\sum\limits_{i = 1}^{N}\Gamma_iJ_i(x_{k+1}) = \Gamma^{T}L(x_k,u_k)+J(x_{k+1}). (9)
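The combined weighting can be computed directly from the transition matrix and the weights \lambda_i; in the sketch below (assumed values), the \Gamma_i are positive and, since each row of the transition matrix sums to one, the \Gamma_i themselves sum to one.

```python
# Combined weighting Gamma_i = sum_s lambda_s * p_si from (8)-(9);
# the transition matrix P and the weights lam are assumed values.
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.3, 0.3, 0.4]]
lam = [0.5, 0.3, 0.2]
N = len(lam)

Gamma = [sum(lam[s] * P[s][i] for s in range(N)) for i in range(N)]
print(Gamma)   # every Gamma_i > 0, and sum(Gamma) == 1
```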

    Before proceeding further, let us introduce the following definition.

Definition 1: A law u_k is an admissible control if u_k is continuous on the compact set \Omega_u\subset{\mathbb{R}}^{N}, stabilizes the closed-loop system (5) for all x_0\in\Omega_x, and J(x_k) is finite for all x_k\in\Omega_x.

    The purpose of this paper is to find a suboptimal control law uk to optimize the combined performance index (9), which consists of the following three aspects:

1) Designing an NN-based identifier to identify the unknown nonlinear dynamics;

    2) Developing a value iterative ADP algorithm to solve the optimal control of protocol-induced switching systems with saturation constraints in an off-line way;

3) In light of the obtained value iterative ADP algorithm, proposing an actor-critic NN scheme to perform the addressed near-optimal control.

The following assumption is needed in order to reveal the boundedness of the developed approximation scheme in the sequel.

Assumption 1: The cost function \ell_i(x,u)\;(i = 1,2,\ldots,N) satisfies the following conditions:

1) \ell_i(x,u) is continuously differentiable in u and its derivative is denoted as \psi_{\ell,i}(u):= \partial\ell_i(x,u)/\partial u ;

2) The derivative function \psi_{\ell,i}(u) is invertible, with its inverse function denoted as u_i(x) = \psi_{\ell,i}^{-1}(\partial\ell_i(x,u)/\partial u) ;

3) The inverse function satisfies a bound characterized by a known positive constant \gamma .

Note that the function \ell_i(x,u) is quite general, with examples including 1) \kappa\sum\nolimits_{s = 1}^p\int_0^{u}\tanh^{-1}(\nu/\kappa)R_id\nu for nonlinear systems with input constraints; and 2) x^T{P}_ix+u^TR_iu for linear systems.

Four subsections are embodied in this section: the design of the NN-based identifier, the iterative ADP algorithm, and the actor-critic NN scheme, together with the performance analysis of the identification errors, the iterative ADP algorithm, and the closed-loop system.

In this paper, an NN-based approximator is utilized to identify discrete-time nonlinear systems without knowledge of the system dynamics in order to solve the optimal control issue. Specifically, to learn the unknown nonlinear functions, a stable adaptive weight updating law is proposed for tuning the nonlinear identifier, and a robust modification term, a function of the estimation error and an additional tunable parameter, is also introduced to guarantee asymptotic stability of the proposed nonlinear identification scheme.

    To start the development of NN-based identifier, the system dynamic (5) is rewritten as

    x_{k+1} = F_{\xi_k}(x_k, u_{k}) (10)

    where F_{i}(x_k, u_{k})\triangleq f(x_k)+\tilde{g}_{i}(x_k){u}_k .

    According to the universal approximation property of NNs, there exists an NN representation of the function F_{\xi_k}(x_k, u_{k}) on a compact set \Omega_x . In this paper, a three-layer NN is considered as the function approximation structure, under which the number of neurons in the hidden layer is r, the weight matrix (a predetermined constant matrix) between the input and hidden layers is denoted by W_{1} , and the weight matrix between the hidden layer and output layer is denoted as W_{2,\xi_k} , which needs to be estimated during the training process. In this case, the closed-loop system (10) is further described as

    x_{k+1} = W_{2,\xi_k}^{T}\phi_x(\omega_{k})+\varepsilon_{k} (11)

where \omega_{k} = W_{1}^{T}[x_{k}^{T}\; u_{k}^{T}]^{T} is the hidden layer input, \phi_x(\omega_{k}) is the bounded activation function satisfying \|\phi_x(\omega_{k})\|\leq\phi_{x,m} , and \varepsilon_{k} is the approximation error satisfying a general assumption to be provided below.

    For the NN represented closed-loop system (11), an identifier is designed to estimate the system state, which is described by

    \hat{x}_{k+1} = (\hat{W}^{k}_{2,\xi_k})^{T}\phi_x(\omega_{k})-q_k (12)

where \hat{W}^{k}_{2,\xi_k} denotes the estimate of the ideal weight matrix W_{2,\xi_k} at time k , and q_k is a robust term introduced to reduce the effect of the approximation error.

    Define the identification error and the estimated error of weight matrix as follows:

    \tilde{x}_{k} = \hat{x}_{k}-x_{k},\; \tilde{W}^{k}_{2,\xi_k} = \hat{W}^{k}_{2,\xi_k}-W_{2,\xi_k}. (13)

Then, subtracting (11) from (12) yields the following identification error dynamics:

    \tilde{x}_{k+1} = \hat{x}_{k+1}-x_{k+1} = (\tilde{W}^{k}_{2,\xi_k})^{T} \phi_x(\omega_k)-\varepsilon_k-q_k. (14)

    Considering this error dynamics, the robust term inspired by the work of [40] is constructed as

    q_k = \frac{\nu_k \tilde{x}_k }{ \tilde{x}_k^{T}\tilde{x}_k+c_2}

    where c_2>1 is a given constant, \nu_k is an additional tunable parameter to be designed subsequently. Therefore, the system dynamics (14) can be further rewritten as

    \begin{split} \tilde{x}_{k+1} = \;& (\tilde{W}^{k}_{2,\xi_k})^{T}\phi_x(\omega_k)-\frac{{\nu}_k \tilde{x}_k}{\tilde{x}_k^{T}\tilde{x}_k+c_2}-\varepsilon_k \\ = \;& \Phi^{k}_{1,\xi_k}-\Phi^{k}_{2,\xi_k}-\varepsilon_k \end{split} (15)

    where \Phi^{k}_{1,i} and \Phi^{k}_{2,i} are introduced for brevity in writing.

For the adopted communication protocol, \xi_{k} , modeled by a Markov chain with known transition probabilities, is usually available via the communication coding. To minimize the squared residual error E_{k+1} = ({1}/{2})\tilde{x}_{k+1}^{T} \tilde{x}_{k+1}, the tuning law of \hat{W}^{k}_{2,i} is given as follows:

    \hat{W}^{k+1}_{2,i} = \left\{ {\begin{array}{*{20}{l}} \hat{W}^{k}_{2,i}-\gamma_w\phi_x(\omega_k)\tilde{x}_{k+1}^{T}, & {\rm{if}}\;\; \xi_{k} = i,\; & \; \xi_{k-1} = i \\ \hat{W}^{k}_{2,i},& {\rm{otherwise}} \end{array}} \right. (16)

and the tuning law of the additional tunable parameter \nu_k is introduced as

    \begin{split} \nu_{k+1} = \;& \alpha_\nu\nu_{k}+\frac{\gamma_\nu }{\tilde{x}_k^{T}\tilde{x}_k+c_2}\tilde{x}_{k+1}^{T} \tilde{x}_{k} \\ = \;& \alpha_\nu\nu_{k}+\gamma_\nu\Phi^{k}_{3}\tilde{x}_{k+1}^{T} \tilde{x}_{k} \end{split} (17)

    where \gamma_w>0 is the NN learning rate, and \alpha_\nu>0 and \gamma_\nu>0 are the designed parameters.
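As a minimal illustration of the identifier (12) with the tuning laws (16) and (17), the following sketch trains on an assumed scalar toy plant with a single actuator node (so the two-successive-scheduling condition in (16) is always met); the plant, the fixed weights W_1, and all gains are assumptions, not the paper's example.

```python
import math

# Toy identifier sketch for (12), (16), (17): scalar state, one actuator node,
# r = 3 hidden neurons; all numerical values are illustrative assumptions.
r, gamma_w, gamma_nu, alpha_nu, c2 = 3, 0.05, 0.05, 0.9, 2.0
W1 = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.4]]   # fixed input-to-hidden weights
W2 = [0.0] * r                                 # trainable output weights
nu = 0.0                                       # tunable parameter of q_k

def f_true(x, u):                              # stand-in for the unknown plant (1)
    return 0.8 * math.sin(x) + 0.5 * u

def phi(x, u):                                 # bounded hidden-layer activation
    return [math.tanh(w[0] * x + w[1] * u) for w in W1]

x, x_hat = 0.5, 0.0
for k in range(200):
    u = -0.3 * x                               # simple stabilizing feedback
    p = phi(x, u)
    x_tilde = x_hat - x                        # identification error (13)
    q = nu * x_tilde / (x_tilde ** 2 + c2)     # robust term
    x_hat_next = sum(W2[j] * p[j] for j in range(r)) - q   # identifier (12)
    x_next = f_true(x, u)
    e_next = x_hat_next - x_next
    # tuning law (16); with one node, two successive schedulings always hold
    W2 = [W2[j] - gamma_w * p[j] * e_next for j in range(r)]
    # tuning law (17) for the additional parameter nu
    nu = alpha_nu * nu + gamma_nu * e_next * x_tilde / (x_tilde ** 2 + c2)
    x, x_hat = x_next, x_hat_next
print(abs(x_hat - x))                          # identification error shrinks
```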

    Remark 3: The proposed updating rule (16) is novel and nontrivial. First, a zero-order holder is adopted to keep the weights of unactivated subsystems. Specifically, it can be found from the second case that the weights are unchanged along with the time k . Second, the update of weights is performed only when two successive schedulings are satisfied in order to avoid the fluctuation of weight updates.

    The following assumption and lemma are used to prove the convergence of the error dynamics.

    Assumption 2: The NN approximation error \varepsilon_{k} is upper bounded by a function of identification error \tilde{x}_k , that is

    \varepsilon_{k}^{T}\varepsilon_{k}\leq\bar{\vartheta}\tilde{x}_{k}^{T}\tilde{x}_{k} (18)

    where \bar\vartheta is a known constant.

    Lemma 1: For any positive definite matrix \Pi\in {\mathbb{R}}^{n\times n} , vectors x,y\in {\mathbb{R}}^{n} and scalar a>0 , the following inequality is true

    2x^{T}\Pi y\leq ax^{T}\Pi x+a^{-1}y^{T}\Pi y. (19)
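Lemma 1 is a standard Young-type inequality; a quick numeric spot-check with an assumed positive definite \Pi , vectors x , y , and scalar a :

```python
# Numeric spot-check of Lemma 1 (19) with assumed Pi, x, y, and a.
Pi = [[2.0, 0.5], [0.5, 1.0]]   # symmetric positive definite
x, y, a = [1.0, -2.0], [0.5, 3.0], 0.7

def quad(v, M, w):              # v^T M w
    return sum(v[i] * M[i][j] * w[j] for i in range(2) for j in range(2))

lhs = 2 * quad(x, Pi, y)
rhs = a * quad(x, Pi, x) + quad(y, Pi, y) / a
print(lhs, rhs)                 # lhs <= rhs, as (19) asserts
```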

Theorem 1: Let the identifier (12) be used to identify the nonlinear system (10), where the parameter updating laws given in (16) and (17) are used to tune the NN weights and the robust modification term, respectively. The estimation error \tilde{x}_k in (14) is asymptotically stable and the weights \hat{W}^{k}_{2,i} and the additional tunable parameter \nu_k are convergent if the learning rate \gamma_w satisfies 6\gamma_w\phi_{x,m}^{2}\leq\theta_1 , and the parameters \gamma_\nu and \alpha_\nu satisfy \gamma_\nu = \gamma_w\phi_{x,m}^{2} and

    \left\{ {\begin{array}{*{20}{l}} \; 0<\theta_1< {\dfrac{1}{2}}\\ \; 0<\varepsilon < {\dfrac{1}{4}}\\ \; 0<\bar\vartheta <1\\ \; \alpha_\nu<\sqrt {\dfrac{7}{8}}. \end{array}} \right. (20)

    Proof: Consider the following Lyapunov function candidate

    \begin{split} L^{k} = \;& L^{k}_{1}+\sum\limits_{s = 1}^NL^{k}_{2,s}+L^{k}_{3} \\ = \;& \tilde{x}_{k}^{T}\tilde{x}_{k} +\frac{1}{\gamma_w}\sum\limits_{s = 1}^N {\rm tr}\left\{(\tilde{W}^{k}_{2,s})^{T}\tilde{W}^{k}_{2,s}\right\} +\frac{1}{\gamma_\nu}{\nu}_k^{2}. \end{split} (21)

    Taking the first-order difference of L^{k}_{1} along with the dynamics (15) yields

    \begin{split} &{{E}}\{\Delta L^{k}_{1}|\xi_k = i,x_k\} \\ &\triangleq {{E}}\{\tilde{x}_{k+1}^{T}\tilde{x}_{k+1}|\xi_k = i,x_k\}-\tilde{x}_{k}^{T}\tilde{x}_{k} \\ &= \sum\nolimits^{N}\limits_{j = 1}p_{i,j}\tilde{x}_{k+1}^{T}\tilde{x}_{k+1}-\tilde{x}_{k}^{T}\tilde{x}_{k} \\ &= (\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}+(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}+\varepsilon_k^{T}\varepsilon_k-\tilde{x}_{k}^{T}\tilde{x}_{k} \\ &\quad-2(\Phi^{k}_{1,i})^{T}\Phi^{k}_{2,i}-2(\Phi^{k}_{1,i})^{T}\varepsilon_k+2(\Phi^{k}_{2,i})^{T}\varepsilon_k. \end{split} (22)

Similarly, taking the first-order difference of L^{k}_{2} along the dynamics (16) results in

    \begin{split} &\sum\limits_{s = 1}^N{{E}}\left\{\Delta L^{k}_{2,s}|\xi_k = i,x_k\right\} \\ &\triangleq \sum\limits^{N}\limits_{j = 1}{{E}}\Big\{\frac{ p_{i,j}}{\gamma_w}{\rm tr}\left((\tilde{W}^{k+1}_{2,i})^{T}\tilde{W}^{k+1}_{2,i}\right)\Big|\xi_k = i,x_k\Big\}\\ &\quad+\sum\limits_{s = 1,s\not = i}^N \sum\limits^{N}\limits_{j = 1}\frac{ p_{i,j}}{\gamma_w}{{E}}\Big\{{\rm tr}\left((\tilde{W}^{k+1}_{2,s})^{T}\tilde{W}^{k+1}_{2,s}\right)\Big|\xi_k = i,x_k\Big\} \\ &\quad-\frac{1}{\gamma_w}\sum\limits_{s = 1}^N {\rm tr}\left((\tilde{W}^{k}_{2,s})^{T}\tilde{W}^{k}_{2,s}\right) \\ &= \frac{1}{\gamma_w}{{E}}\Big\{{\rm tr}\Big((\tilde{W}^{k}_{2,i}-\gamma_w\phi_x(\omega_k)\tilde{x}_{k+1}^{T})^{T}\\ &\quad\times(\tilde{W}^{k}_{2,i} -\gamma_w\phi_x(\omega_k)\tilde{x}_{k+1}^{T})\Big)|\xi_k = i,x_k\Big\}\\ &\quad-\frac{1}{\gamma_w}{\rm tr}\left((\tilde{W}^{k}_{2,i})^{T}\tilde{W}^{k}_{2,i}\right). \end{split} (23)

Noting \|\phi_x(\omega_k)\|\leq\phi_{x,m} , one has

\begin{split} & \sum\limits_{s = 1}^N{{E}}\{\Delta L^{k}_{2,s}|\xi_k = i,x_k\} \\ &\quad\leq -2(\Phi^{k}_{1,i})^{T}\tilde{x}_{k+1}+\gamma_w\phi_{x,m}^{2}\tilde{x}_{k+1}^{T}\tilde{x}_{k+1} \\ &\quad= -2(\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}+2(\Phi^{k}_{1,i})^{T}\Phi^{k}_{2,i}+2(\Phi^{k}_{1,i})^{T}\varepsilon_k \\ &\quad\quad+3\gamma_w\phi_{x,m}^{2}\left((\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}+(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}+\varepsilon_k^{T}\varepsilon_k\right). \end{split} (24)

    Furthermore, it is not difficult to calculate that

    \begin{split} & {{E}}\left\{\Delta L^{k}_{3}|\xi_k = i,x_k\right\} \\ &\quad= \frac{1}{\gamma_\nu}{{E}}\left\{{\nu}_{k+1}^{2}|\xi_k = i,x_k\right\}-\frac{1}{\gamma_\nu} {\nu}_{k}^{2} \\ &\quad= \frac{1}{\gamma_\nu}\left((\alpha_\nu{\nu}_{k}+\gamma_\nu\ \Phi^{k}_{3}\tilde{x}_{k+1}^{T} \tilde{x}_{k})^2-{\nu}_{k}^{2}\right) \\ &\quad= 2(\Phi^{k}_{2,i})^{T}\tilde{x}_{k+1}+\gamma_\nu(\Phi^{k}_{3}\tilde{x}_{k+1}^{T}\tilde{x}_{k})^{2} \\ &\quad\quad-\gamma_\nu^{-1}(1-\alpha_\nu^2) {\nu}_{k}^{2} \\ &\quad\leq -2(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}+ 2(\Phi^{k}_{1,i})^{T}\Phi^{k}_{2,i}-2(\Phi^{k}_{2,i})^{T}\varepsilon_k \\ &\quad\quad +3\gamma_\nu(\Phi^{k}_{3})^{2}\tilde{x}_{k}^{T}\tilde{x}_{k}\Big((\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}+(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}\\ &\quad \quad+\varepsilon_k^{T}\varepsilon_k\Big) -\gamma_\nu^{-1}(1-\alpha_\nu^2) {\nu}_{k}^{2}. \end{split} (25)

    Denote the first-order difference of \Delta L^{k} as

    \Delta L^{k} = \Delta L^{k}_{1}+\sum\limits_{s = 1}^N\Delta L^{k}_{2,s}+\Delta L^{k}_{3}. (26)

Considering (22), (24) and (25), the difference (26) can be bounded as

\begin{split} \; &{{E}}\left\{\Delta L^{k}|\xi_k = i,x_k\right\}\\ &\quad\leq -(\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}-(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}-\tilde{x}_{k}^{T}\tilde{x}_{k}\\ &\quad \quad+\varepsilon_k^{T}\varepsilon_k+2(\Phi^{k}_{1,i})^{T}\Phi^{k}_{2,i} -\gamma_\nu^{-1}(1-\alpha_\nu^2) {\nu}_{k}^{2}\\ &\quad\quad +3\left(\gamma_w\phi_{x,m}^{2}+\gamma_\nu(\Phi^{k}_{3})^{2}\tilde{x}_{k}^{T}\tilde{x}_{k}\right)\\ &\quad\quad\times\left((\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}+(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}+\varepsilon_k^{T}\varepsilon_k\right). \end{split}

    Then, considering Assumption 2 and (\Phi^{k}_{3})^{2}\tilde{x}_{k}^{T}\tilde{x}_{k}\leq\tilde{x}_{k}^{T}\tilde{x}_{k} , one has

\begin{split} \; &{{E}}\left\{\Delta L^{k}|\xi_k = i,x_k\right\} \\ &\quad\leq -(1-3\gamma_w\phi_{x,m}^{2}-3\gamma_\nu)(\parallel\Phi^{k}_{1,i}\parallel^{2}+\parallel\Phi^{k}_{2,i}\parallel^{2}) \\ &\quad \quad-(1-\bar\vartheta-3\bar\vartheta(\gamma_w\phi_{x,m}^{2}+\gamma_\nu))\parallel\tilde{x}_{k}\parallel^{2} \\ &\quad\quad+2\parallel\Phi^{k}_{1,i}\parallel\; \parallel\Phi^{k}_{2,i}\parallel-\gamma_\nu^{-1}(1-\alpha_\nu^2) {\nu}_{k}^{2}. \end{split} (27)

    Furthermore, noting

    \|\Phi^{k}_{2,i}\|^2 = \Big\|\frac{{\nu}_k \tilde{x}_k}{\tilde{x}_k^{T}\tilde{x}_k+c_2}\Big\|^2\leq {\nu}_{k}^{2} (28)

    one has

    \begin{split} 2\parallel\Phi^{k}_{1,i}\parallel\; \parallel\Phi^{k}_{2,i}\parallel \; \leq \;& \theta_1\parallel\Phi^{k}_{1,i}\parallel ^2 +\;\theta_1^{-1}\parallel\Phi^{k}_{2,i}\parallel ^2 \nonumber\\ \leq \;& \theta_1\parallel\Phi^{k}_{1,i}\parallel ^2 +\;\varepsilon\theta_1^{-1}\parallel\Phi^{k}_{2,i}\parallel ^2 \nonumber\\ &+\theta_1^{-1}(1-\varepsilon){\nu}_{k}^{2} \end{split}

    where the scalar \varepsilon belongs to (0,\;1) .

    Furthermore, select the parameters as \gamma_\nu = \gamma_w\phi_{x,m}^{2} , and 6\gamma_w\phi_{x,m}^{2}\leq\theta_1 . Applying Lemma 1, it follows from (27) that

\begin{split} &{{E}}\{\Delta L^{k}|\xi_k = i,x_k\} \\ &\quad\leq -(1-3\gamma_w\phi_{x,m}^{2}-3\gamma_\nu-\theta_1)\parallel\Phi^{k}_{i}\parallel^{2} \\ & \quad\quad-(1-3\gamma_w\phi_{x,m}^{2}-3\gamma_\nu-\varepsilon\theta_1^{-1})\parallel q_k\parallel^{2} \\ & \quad \quad-\Big(1-\bar\vartheta-3\bar\vartheta(\gamma_w\phi_{x,m}^{2}+\gamma_\nu)\Big)\parallel\tilde{x}_{k}\parallel^{2} \\ & \quad\quad-\Big(\gamma_\nu^{-1}(1-\alpha_\nu^2)-\theta_1^{-1}(1-\varepsilon)\Big) {\nu}_{k}^{2} \\ &\quad= -(1-6\gamma_\nu-\theta_1)\parallel\Phi^{k}_{i}\parallel^{2}-(1-6\gamma_\nu-\varepsilon\theta_1^{-1})\parallel q_k\parallel^{2} \\ &\quad \quad -(1-\bar\vartheta-6\bar\vartheta\gamma_\nu)\parallel\tilde{x}_{k}\parallel^{2}-\Big(\gamma_\nu^{-1}(1-\alpha_\nu^2) \\ &\quad\quad -\theta_1^{-1}(1-\varepsilon)\Big) {\nu}_{k}^{2} \\ &\quad\leq -(1-2\theta_1)\parallel\Phi^{k}_{i}\parallel^{2}-(1-\theta_1-\varepsilon\theta_1^{-1})\parallel q_k\parallel^{2} \\ & \quad\quad-(1-\bar\vartheta(1+\theta_1))\parallel\tilde{x}_{k}\parallel^{2}-\Big(\Big(\frac{\theta_1}{6}\Big)^{-1}(1-\alpha_\nu^2) \\ &\quad \quad-\theta_1^{-1}(1-\varepsilon)\Big) {\nu}_{k}^{2}. \end{split} (29)

    Therefore, one has {{E}}\{\Delta L^{k}|\xi_k = i,x_k\} < 0 if the following inequalities hold

    \left\{ \begin{aligned} & 1-2\theta_1>0\\ & 1-\theta_1-\varepsilon\theta_1^{-1}>0\\ & 1-\bar\vartheta(1+\theta_1)>0\\ & \Big(\frac{\theta_1}{6}\Big)^{-1}(1-\alpha_\nu^2)-\theta_1^{-1}(1-\varepsilon) \geq 0 \end{aligned} \right. (30)

    which yields

    \left\{ \begin{aligned} & 0<\theta_1<\frac{1}{2} \\ & \theta_1^{2}-\theta_1<\varepsilon<\theta_1-\theta_1^{2}\\& \bar\vartheta<\frac{1}{1+\theta_1}\\ & 6(1-\alpha_\nu^2)>(1-\varepsilon). \end{aligned} \right. (31)

that is, the inequalities in (20). This completes the proof.
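For concreteness, one assumed parameter choice, \theta_1 = 0.4 , \varepsilon = 0.2 , \bar\vartheta = 0.6 , \alpha_\nu = 0.9 , can be checked directly against the conditions in (30):

```python
# Verify one assumed parameter set against the stability conditions (30).
theta1, eps, vartheta, alpha = 0.4, 0.2, 0.6, 0.9

c1 = 1 - 2 * theta1 > 0                                   # theta1 < 1/2
c2 = 1 - theta1 - eps / theta1 > 0                        # eps < theta1 - theta1^2
c3 = 1 - vartheta * (1 + theta1) > 0                      # vartheta < 1/(1+theta1)
c4 = (6 / theta1) * (1 - alpha**2) - (1 - eps) / theta1 >= 0
print(c1, c2, c3, c4)   # -> True True True True
```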

Remark 4: It should be pointed out that the approximation error \varepsilon_k of NNs depends on the system states and tends to zero as the states approach the origin. As such, this feature should be adequately taken into consideration. In this paper, a robust term q_{k} with an additional tunable parameter \nu_k , inspired by the work of [40], is employed to improve the system performance while guaranteeing asymptotic stability. Furthermore, the update of \hat{W}^{k}_{2,i} in (16) is affected by the Markov jump, and therefore a novel Lyapunov function candidate is constructed by adding the term ({1}/{\gamma_w})\sum\nolimits_{s = 1}^N {\rm tr}\left\{(\tilde{W}^{k}_{2,s})^{T}\tilde{W}^{k}_{2,s}\right\} to establish the desired condition on the identifier dynamics.

    According to Bellman’s optimality principle, the optimal performance index function J^{\ast}(x_k) satisfies the discrete-time HJB equation

    J^{\ast}(x_k) = \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J^{\ast}(x_{k+1})\} (32)

    and the corresponding optimal control strategy is given by

    u_k^{\ast} = \arg\min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J^{\ast}(x_{k+1})\}. (33)

    Assume that the minimum on the right-hand side of (32) exists and is unique. Taking the first derivative of the right-hand side with respect to u_k and setting it to zero, the ideal optimal control u_k^{\ast} is given by

    \begin{split} u_k^{\ast} = \;& -\bar u\tanh\Big(\frac{1}{2\bar u}R^{-1}g_{i}^{T}(x_k)\nabla J^{\ast}(x_{k+1})\Big) \\ = \;& -\bar u\tanh\Big(\frac{1}{2\bar u}R^{-1}g_{i}^{T}(x_k)\nabla J^{\ast}\big(\Gamma_{i}(x_k)+g_{i}(x_k){u}_k\big)\Big). \end{split} (34)
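The tanh structure in (34) keeps the control within the saturation level automatically. A minimal numerical sketch (with assumed values for \nabla J^{\ast} , g_{i}(x_k) and R , not taken from the paper) illustrates this:

```python
import numpy as np

def saturated_control(grad_J_next, g_x, R_inv, u_bar=5.0):
    # Evaluate the tanh-saturated law (34) for assumed inputs grad_J_next,
    # g_x and R_inv; u_bar is the saturation level.
    return -u_bar * np.tanh(R_inv @ g_x.T @ grad_J_next / (2.0 * u_bar))

# Illustrative inputs: identity input matrix, R = 0.2 I, a large gradient
# component in the first channel and a small one in the second.
u = saturated_control(np.array([100.0, -0.1]), np.eye(2), np.eye(2) / 0.2)
# |u| never exceeds u_bar; the first channel is pinned near -u_bar.
```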

    Since directly solving the HJB equation for nonlinear systems is computationally intensive, a value iteration algorithm, usually referred to as an ADP algorithm, is developed in light of Bellman’s principle of optimality. Initializing the value function as J_0(x_k) = \Sigma(x_k) , one constructs the following iterative algorithm:

    \begin{split} u_s(x_k) = \; &\arg\min_{u_k}\{{\Gamma^{T}L(x_k,u_k)+J_s(x_{k+1})}\} \\ = \;& \arg\min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_s(F_{i}(x_k, {u}_k))\} \end{split} (35)

    and

    \begin{split} J_{s+1}(x_{k}) = \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_s(x_{k+1})\} \\ = \;& \Gamma^{T}L(x_k,u_s(x_k))+J_s(F_{i}(x_k, u_s(x_k))) \end{split} (36)

    where s is the iteration index, and J_s(x_k) and u_s(x_k) approximate J^{\ast}(x_k) and u_k^{\ast} , respectively, as s\rightarrow\infty .

    Inspired by [17], [41], we further demonstrate the convergence of the developed scheme with the help of a “functional bound” method.

    Lemma 2: Consider the sequences J_{s}(x_{k}) and u_s(x_k) introduced by (36) and (35), respectively. Given the initial value function J_{0}(x) for x\in {\mathbb{R}}^{M+N} , the value function J_{s}(x) is monotone in s . Specifically, if J_{0}(x)\leq J_{1}(x) , then J_{s}(x) is a monotonically nondecreasing sequence, i.e., J_{s}(x)\leq J_{s+1}(x) ; otherwise, J_{s}(x) is a monotonically nonincreasing sequence, i.e., J_{s}(x)\geq J_{s+1}(x) .

    Proof: Let us first prove the case of J_{0}(x)\leq J_{1}(x) by mathematical induction. Considering (36) together with J_{0}(x)\leq J_{1}(x) , one has

    \begin{split} J_{2}(x_{k}) = \;& \min_{u_k}\{{\Gamma^{T}L(x_k,u_k)+J_1(x_{k+1})}\}\\ \geq \;& \min_{u_k}\{{\Gamma^{T}L(x_k,u_k)+J_0(x_{k+1})}\}\\ = \; &J_{1}(x_{k}). \end{split}

    Assume that J_{q-1}(x)\leq J_{q}(x) holds when s = q-1 . Then, for s = q , one can conclude that

    \begin{split} J_{q+1}(x_{k}) = \; &\min_{u_k}\{{\Gamma^{T}L(x_k,u_k)+J_q(x_{k+1})}\}\\ \geq \;& \min_{u_k}\{{\Gamma^{T}L(x_k,u_k)+J_{q-1}(x_{k+1})}\}\\ = \;& J_{q}(x_{k}). \end{split}

    Therefore, the claim holds in this case. Similarly, one can conclude that J_{s}(x)\geq J_{s+1}(x) when J_{0}(x)\geq J_{1}(x) .
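The monotonicity in Lemma 2 can be observed on a simple scalar linear-quadratic example, where each value-iteration step (36) takes the closed form of a Riccati difference equation. This is an illustrative sketch with assumed scalar dynamics, not the system of the paper:

```python
def vi_step(p, a=0.9, b=1.0, q=1.0, r=1.0):
    # One step of (36) for J_s(x) = p x^2 on x_{k+1} = a x_k + b u_k with
    # stage cost q x^2 + r u^2 (scalar Riccati difference equation).
    return q + a**2 * p - (a * b * p) ** 2 / (r + b**2 * p)

p, seq = 0.0, []   # J_0 = 0 gives J_0 <= J_1, so the sequence is nondecreasing
for _ in range(50):
    seq.append(p)
    p = vi_step(p)
# seq increases monotonically and settles at the Riccati fixed point.
```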

    Theorem 2: Consider the sequences J_s(x_k) and u_s(x_k) introduced in (36) and (35), respectively. If there exist four constants \underline{\rho} , \bar{\rho} , \underline{\varrho} and \bar{\varrho} satisfying 0<\underline{\rho}\leq\bar{\rho} and 0\leq\underline{\varrho}\leq\bar{\varrho} such that

    \underline{\rho}\,\Gamma^{T}L(x_k,u_k)\leq J^{\ast}(x_k)\leq \bar{\rho}\,\Gamma^{T}L(x_k,u_k) (37)

    and

    \underline{\varrho}J^{\ast}(x_k)\leq J_0(x_k)\leq \bar{\varrho}J^{\ast}(x_k) (38)

    hold uniformly, then the iterative value function J_s(x_k) converges to the optimal value J^{\ast}(x_k) as s\rightarrow\infty , i.e.,

    \lim_{s\rightarrow\infty}J_s(x_k) = J^{\ast}(x_k). (39)

    Proof: To verify this result, we will first prove the following assertion by using the mathematical induction method.

    Assertion: Case I: For parameters 0\leq\underline{\varrho}\leq\bar{\varrho}<1 , the iterative value function J_s(x_k) satisfies

    \begin{split} \Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{{s}}}\Big)\;&J^{\ast}(x_k)\leq J_{s}(x_k)\; \\ & \leq\Big(1+\frac{\bar{\varrho}-1}{(1+\underline{\rho}^{-1})^{{s}}}\Big)J^{\ast}(x_k). \end{split} (40)

    Case II: For parameters 0\leq\underline{\varrho}\leq1\leq\bar{\varrho}\leq\infty , the iterative value function J_s(x_k) satisfies

    \begin{split} \Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{{s}}}\Big)\;& J^{\ast}(x_k)\leq J_{s}(x_k)\; \\ & \leq\Big(1+\frac{\bar{\varrho}-1}{(1+\bar{\rho}^{-1})^{{s}}}\Big)J^{\ast}(x_k). \end{split} (41)

    Case III: For parameters 1\leq\underline{\varrho}\leq\bar{\varrho}\leq\infty , the iterative value function J_s(x_k) satisfies (40).

    Owing to space limitations, we only prove the left-hand side of the inequality (40) in Case I and the right-hand side of the inequality (41) in Case II. The proof of Case III is similar to those of the first two cases and is hence omitted.

    Obviously, the left-hand side of the inequality (40) in Case I holds for s = 0 . Then, combining it with the condition (37), one can derive that

    \begin{split} J_1(x_k) = \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_0(x_{k+1})\}\\ \geq \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+\underline{\varrho}J^{\ast}(x_{k+1})\}\\ \geq \;& \min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\underline{\varrho}J^{\ast}(x_{k+1})\\ &+\frac{\underline{\varrho}-1}{1+\bar{\rho}} \big(\bar{\rho}\Gamma^{T}L(x_k,u_k)-J^{\ast}(x_{k+1})\big) \Big\}\\ \geq \;& \min_{u_k}\Big\{\Big(1+\frac{\bar{\rho}(\underline{\varrho}-1)}{1+\bar{\rho}}\Big)\Gamma^{T}L(x_k,u_k)\\ &+\Big(\underline{\varrho}-\frac{\underline{\varrho}-1}{1+\bar{\rho}}\Big)J^{\ast}(x_{k+1})\Big\}\\ = \;& \Big(1+\frac{\underline{\varrho}-1}{1+\bar{\rho}^{-1}}\Big)\min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J^{\ast}(x_{k+1})\}\\ = \; &\Big(1+\frac{\underline{\varrho}-1}{1+\bar{\rho}^{-1}}\Big)J^{\ast}(x_k). \end{split}

    Furthermore, assume that the conclusion holds for s = q-1 , that is

    \Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{{q-1}}}\Big)J^{\ast}(x_k)\leq J_{q-1}(x_k).

    When s =q, combining with the condition (37) again, one has

    \begin{split} J_{q}(x_k) = \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_{q-1}(x_{k+1})\}\\ \geq \;& \min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{q-1}}\Big)J^{\ast}(x_{k+1})\Big\}\\ \geq \;& \min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{q-1}}\Big)J^{\ast}(x_{k+1})\\ &+\frac{(\underline{\varrho}-1) (\bar{\rho}\Gamma^{T}L(x_k,u_k)-J^{\ast}(x_{k+1})) }{(1+\bar{\rho})(1+\bar{\rho}^{-1})^{q-1}}\Big\}\\ = \; &\Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{q}}\Big)\min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J^{\ast}(x_{k+1})\}\\ = \;& \Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{q}}\Big)J^{\ast}(x_k). \end{split}

    According to the mathematical induction method, the left-hand side of the inequality (40) holds.

    In what follows, let us prove the right-hand side of the inequality (41) in Case II. Obviously, it is not difficult to find that

    \begin{split} J_1(x_k) = \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_0(x_{k+1})\}\\ \leq \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+\bar{\varrho}J^{\ast}(x_{k+1})\}\\ \leq \;& \min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\bar{\varrho}J^{\ast}(x_{k+1})\\ &+\frac{\bar{\varrho}-1}{1+\bar{\rho}}\big(\bar{\rho}\Gamma^{T}L(x_k,u_k)-J^{\ast}(x_{k+1})\big)\Big\}\\ = \;& \Big(1+\frac{\bar{\varrho}-1}{1+\bar{\rho}^{-1}}\Big)J^{\ast}(x_k) \end{split}

    where the term \,\bar{\rho}\Gamma^{T}L(x_k,u_k)-J^{\ast}(x_{k+1}) , which is nonnegative in view of the condition (37), is added.

    Furthermore, assume that the conclusion holds for s = q-1 , that is

    J_{q-1}(x_k)\leq \; \Big(1+\frac{\bar{\varrho}-1}{(1+\bar{\rho}^{-1})^{{q-1}}}\Big)J^{\ast}(x_k).

    When s = q , combining with the condition (37), one has

    \begin{split} J_{q}(x_k) = \; &\min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_{q-1}(x_{k+1})\}\\ \leq \; &\min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\Big(1+\frac{\bar{\varrho}-1}{(1+\bar{\rho}^{-1})^{{q-1}}}\Big)J^{\ast}(x_{k+1})\Big\}\\ \leq \;& \min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\Big(1+\frac{\bar{\varrho}-1}{(1+\bar{\rho}^{-1})^{{q-1}}}\Big)J^{\ast}(x_{k+1})\\ &+\frac{(\bar{\varrho}-1)(\bar{\rho}\Gamma^{T}L(x_k,u_k)-J^{\ast}(x_{k+1}))}{(1+\bar{\rho})(1+\bar{\rho}^{-1})^{q-1}}\Big\}\\ = \;& \Big(1+\frac{\bar{\varrho}-1}{(1+\bar{\rho}^{-1})^{q}}\Big)J^{\ast}(x_k). \end{split}

    In light of the mathematical induction method, the right-hand side of the inequality (41) holds.

    Combining the above conclusions, the assertion is true. Finally, letting s\to\infty , the terms ({\underline{\varrho}}-1)/(1+\bar{\rho}^{-1})^{s} and (\bar{\varrho}-1)/(1+\underline{\rho}^{-1})^{s} in the bounds vanish, and the convergence (39) follows.
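The geometric squeezing behind the bounds (40) and (41) is easy to visualize numerically. With illustrative constants \rho = 2 and (\underline{\varrho},\bar{\varrho}) = (0.2, 3) (assumed values, not from the paper), both envelope factors contract to 1 at the rate (1+\rho^{-1})^{-s} :

```python
def envelope(varrho, rho, s):
    # Factor multiplying J*(x_k) in the bounds (40)/(41) of the assertion.
    return 1.0 + (varrho - 1.0) / (1.0 + 1.0 / rho) ** s

lower = [envelope(0.2, 2.0, s) for s in range(30)]   # lower envelope, varrho < 1
upper = [envelope(3.0, 2.0, s) for s in range(30)]   # upper envelope, varrho > 1
# Both sequences approach 1, i.e., J_s(x_k) is squeezed toward J*(x_k).
```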

    Remark 5: The above theorem discloses the convergence of the developed ADP scheme with the help of a “functional bound” method, which comes from [17], [41]. An intermediate assertion has been introduced to facilitate the proof. In practical applications, a termination condition (or a fixed iteration number s^\ast ) is adopted, and the related algorithm (i.e., ADP Algorithm 1) is provided as follows.

    Algorithm 1 ADP algorithm

    Initialization: Set the value J_0(x_k)=\Sigma(x_k) , the error threshold \varpi > 0 , and s=0 .

    1: while s = 0 or |J_{s}(x_k)-J_{s-1}(x_k)|>\varpi do

    2: Solve u_s(x_k) according to (35);

    3: Update the value J_{s+1}(x_k) according to (36);

    4: Set s=s+1 ;

    5: end while

    6: Output the control strategy u_{s-1}(x_k) .
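Algorithm 1 can be sketched as a tabular value-iteration loop. The following toy implementation (an assumed scalar test system with grid interpolation, not the system of the simulation section) illustrates the stopping rule |J_{s}(x_k)-J_{s-1}(x_k)|\leq\varpi :

```python
import numpy as np

def adp_value_iteration(f, stage_cost, x_grid, u_grid, varpi=1e-6, max_iter=500):
    """Tabular sketch of Algorithm 1: value iteration on a state grid.

    f(x, u) -> next state and stage_cost(x, u) -> Gamma^T L(x, u) are
    placeholders for the system and cost of the paper.
    """
    J = np.zeros_like(x_grid)              # J_0(x) = Sigma(x) = 0 here
    u_star = np.zeros_like(x_grid)
    for _ in range(max_iter):
        J_new = np.empty_like(J)
        for i, x in enumerate(x_grid):
            costs = [stage_cost(x, u) + np.interp(f(x, u), x_grid, J)
                     for u in u_grid]
            j = int(np.argmin(costs))
            J_new[i], u_star[i] = costs[j], u_grid[j]
        if np.max(np.abs(J_new - J)) <= varpi:   # |J_s - J_{s-1}| <= varpi
            return J_new, u_star
        J = J_new
    return J, u_star

# Assumed scalar test system x_{k+1} = 0.5 x_k + u_k with saturation |u| <= 1.
J, u = adp_value_iteration(
    f=lambda x, u: np.clip(0.5 * x + u, -2.0, 2.0),
    stage_cost=lambda x, u: x**2 + 0.2 * u**2,
    x_grid=np.linspace(-2, 2, 41),
    u_grid=np.linspace(-1, 1, 21),
)
```

The returned policy pushes the state toward the origin, and the value vanishes at the origin, as expected for a positive definite stage cost.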

    Due to the unknown J_{s}(x_{k+1}) , an approximation structure via NNs is employed to approximate both J^{\ast}(x_k) and u^{\ast}(x_k) . Such a structure consists of a critic network and an actor network, both chosen as three-layer feedforward NNs, and their implementation process is shown in Fig. 1. In light of the above conception, the optimal value function (32) and the control input (34) can be described by the following critic NN and actor NN:

    Figure  1.  Neural network structure of the proposed ADP approach.
    J^{\ast}(x_{k}) = W_{2c}^{T}\phi_c(W_{1c}^{T}{\textit{z}}_{k})+\theta_c({\textit{z}}_{k}) (42)

    and

    u^{\ast}(x_k) = \phi_{2a}({W}^{T}_{2a}\phi_{1a}(W_{1a}^{T}x_{k}))+\theta_a(x_{k}) (43)

    with {\textit{z}}_{k} = [x^{T}_k\; u^{T}_k]^{T} , where W_{2c} and W_{2a} are the ideal weights of the designed NNs and are bounded, respectively, by two positive scalars W_{2cM} and W_{2aM} , i.e., \|W_{2c}\| \leq W_{2cM} and \|W_{2a}\|\leq W_{2aM} ; \theta_c({\textit{z}}_{k}) and \theta_a(x_{k}) are the bounded approximation errors, i.e., \|\theta_c({\textit{z}}_{k})\|\leq\theta_{cM} and \|\theta_a(x_{k})\|\leq\theta_{aM} ; W_{1c} and W_{1a} are the known weight matrices between the input layer and the hidden layer; and \phi_{c}(\cdot) , \phi_{1a}(\cdot) and \phi_{2a}(\cdot) are the activation functions satisfying \|\phi_{c}(\cdot)\|\leq\phi_{c,m} , \|\phi_{1a}(\cdot)\|\leq\phi_{1a,m} and \|\phi_{2a}(\cdot)\|\leq\phi_{2a,m} .

    In order to identify the ideal weight W_{2c} , the following approximation is developed by virtue of the ADP algorithm

    \hat{J}_{s}(x_{k}) = \hat{W}^{T}_{2c,s}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k}) (44)

    where {\textit{z}}_{s,k} = [x^{T}_k\; u^{T}_{s,k}]^{T}.

    Substituting the above approximation into (36), in general one has

    \hat{J}_{s}(x_{k})\not = \Gamma^TL(x_k,u_{s-1}(x_k))+\hat{J}_{s-1}(x_{k+1})

    that is

    \hat{W}^{T}_{2c,s}\phi(W_{1c}^{T}{\textit{z}}_{s,k})\not = \Gamma^TL(x_k,u_{s-1}(x_k))+\hat{J}_{s-1}(x_{k+1}) . (45)

    Introduce the gap

    \begin{split} \Delta{J}_{s}(x_{k}) = \;& \hat{W}^{T}_{2c,s}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})-\hat{J}_{s-1}(x_{k+1}) \\ &-\Gamma^TL(x_k,u_{s-1}(x_k)) \end{split} (46)

    and then define the cost function

    e_{c,s} = \frac{1}{2}\Delta{J}^{2}_{s}(x_{k}).

    Minimizing such a function results in the updating rule of the weights of the critic network

    \begin{split} \hat{W}_{2c,s+1} = \; &\hat{W}_{2c,s}-\varepsilon_c\frac{\partial e_{c,s}}{\partial \hat{W}_{2c,s}} \\ = \; &\hat{W}_{2c,s}-\varepsilon_c\frac{\partial e_{c,s}}{\partial \Delta{J}_{s}(x_{k})}\frac{\partial \Delta{J}_{s}(x_{k})}{\partial \hat{W}_{2c,s}} \\ = \; &\hat{W}_{2c,s}-\varepsilon_c\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})\Delta{J}^{T}_{s}(x_{k}) \end{split} (47)

    where \varepsilon_c>0 is the learning rate of the critic network. The weights of the model network are kept unchanged after the training process is finished.
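For a frozen regressor and target, the rule (47) reduces to plain gradient descent on the squared Bellman residual. A minimal sketch with assumed toy dimensions and a fixed residual target (not the paper's networks):

```python
import numpy as np

def critic_update(W2c, phi, delta_J, eps_c=0.4):
    # One step of (47): W <- W - eps_c * phi_c(.) * Delta_J^T.
    return W2c - eps_c * np.outer(phi, delta_J)

W2c = np.array([[0.1], [-0.2]])   # toy (hidden, output) = (2, 1) weights
phi = np.array([0.5, -0.3])       # frozen hidden-layer output phi_c(W_1c^T z)
target = 1.5                      # frozen right-hand side of the residual (46)
for _ in range(1000):
    delta = W2c.T @ phi - target  # Bellman residual (46)
    W2c = critic_update(W2c, phi, delta)
# The residual is driven to zero since eps_c * ||phi||^2 < 2 here,
# consistent with the learning-rate condition (54).
```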

    In the actor network, x_k is used as the input, while the control input is approximated by

    \hat{u}_s(x_k) = \phi_{2a}\left(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})\right). (48)

    On the other hand, it follows from (34) that the approximated value is also obtained by

    u_s(x_k) = \; -\bar u\tanh\left(\frac{1}{2\bar u}R^{-1}g_{i}^{T}(x_k)\nabla \hat{J}^{T}_{s}(x_{k+1})\right). (49)

    Denote \Upsilon_i = ({1}/{2\bar u})R^{-1}\tilde{g}_{i}^{T}(x_k) and then introduce the gap

    \begin{split} \Delta{u}_{s}(x_{k}) = \;& \hat{u}_s(x_k)-u_s(x_k) \\ = \;& \phi_{2a}\left(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})\right) \\ &+\bar u\tanh\big(\Upsilon_i\nabla\phi_c^{T}(W_{1c}^{T}{\textit{z}}_{s,k+1})\hat{W}^{T}_{2c,s}\big). \end{split} (50)

    In what follows, define the cost of this gap

    e_{a,s} = \frac{1}{2}\Delta{u}^{T}_{s}(x_{k})\Delta{u}_{s}(x_{k}).

    By employing the gradient descent approach again to minimize e_{a,s} , one has the updating rule of the weights of the actor network

    \begin{split} \hat{W}_{2a,s+1} = \;& \hat{W}_{2a,s}-\varepsilon_a\frac{\partial e_{a,s}}{\partial \hat{W}_{2a,s}} \\ = \; &\hat{W}_{2a,s}-\varepsilon_a\frac{\partial e_{a,s}}{\partial \Delta{u}_{s}(x_{k})} \frac{\partial \Delta{u}_{s}(x_{k})}{\partial \hat{u}_s(x_k)} \frac{\partial \hat{u}_s(x_k)}{\partial \hat{W}_{2a,s}} \\ = \;& \hat{W}_{2a,s}-\frac{1}{2}\varepsilon_a\phi_{1a}(W_{1a}^{T}x_k) \\ &\times\Big(1-\phi_{2a}^{T}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})) \\ &\times\phi_{2a}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k}))\Big)\Delta{u}^{T}_{s}(x_{k}) \end{split} (51)

    where \varepsilon_a> 0 is the learning rate of the action network.
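Likewise, (51) is gradient descent through the tanh output layer, with the factor (1-\phi_{2a}^{T}\phi_{2a}) playing the role of the output-layer derivative. A scalar-output sketch with assumed toy values and \bar u = 1 (not the paper's networks):

```python
import numpy as np

def actor_update(W2a, psi, delta_u, eps_a=0.4):
    # One step of (51) for a scalar tanh output layer (u_bar = 1 assumed).
    phi2a = np.tanh(W2a.T @ psi)            # actor output, shape (1,)
    deriv = 1.0 - float(phi2a @ phi2a)      # the (1 - phi_2a^T phi_2a) factor
    return W2a - 0.5 * eps_a * deriv * np.outer(psi, delta_u)

W2a = np.zeros((2, 1))       # toy (hidden, output) dimensions
psi = np.array([0.5, -0.3])  # frozen phi_1a(W_1a^T x_k)
u_target = 0.5               # frozen target control playing the role of (49)
for _ in range(5000):
    delta = np.tanh(W2a.T @ psi) - u_target   # actor gap (50)
    W2a = actor_update(W2a, psi, delta)
# The actor output tanh(W2a^T psi) converges to the target.
```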

    Defining the estimation errors of weight matrices

    \tilde{W}_{2c,s} = \hat{W}_{2c,s}-W_{2c},\; \tilde{W}_{2a,s} = \hat{W}_{2a,s}-W_{2a}

    one has

    \begin{split} \tilde{W}_{2c,s+1} = \;& \tilde{W}_{2c,s}-\varepsilon_c\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})\Delta{J}^{T}_{s}(x_{k}) \\ = \; &\tilde{W}_{2c,s}-\varepsilon_c\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})\Big( \hat{W}^{T}_{2c,s}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k}) \\ &-\hat{J}_{s-1}(x_{k+1})-\Gamma^TL(x_k,u_{s-1}(x_k))\Big)^T \\ = \;& \tilde{W}_{2c,s}-\varepsilon_c\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})\Big( \tilde{W}^{T}_{2c,s}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k}) \\ &+{W}^{T}_{2c}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})-\hat{W}^{T}_{2c,s-1}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k+1}) \\ & -\Gamma^TL(x_k,u_{s-1}(x_k))\Big)^T \end{split} (52)

    and

    \begin{split} \tilde{W}_{2a,s+1} = \;& \tilde{W}_{2a,s}-\frac{1}{2}\varepsilon_a\phi_{1a}(W_{1a}^{T}x_k) \\ &\times\Big(1-\phi_{2a}^{T}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})) \\ &\times\phi_{2a}\big(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})\big)\Big)\Delta{u}^{T}_{s}(x_{k}) \\ = \;& \tilde{W}_{2a,s}-\frac{1}{2}\varepsilon_a\phi_{1a}(W_{1a}^{T}x_k) \\ &\times\Big(1-\phi_{2a}^{T}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})) \\ &\times\phi_{2a}\big(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})\big)\Big) \\ &\times\Big(\phi_{2a}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})) \\ &+\bar u\tanh\big(\Upsilon\nabla\phi_c^{T}(W_{1c}^{T}{\textit{z}}_{s,k+1})\hat{W}^{T}_{2c,s}\big)\Big). \end{split} (53)

    It is easily seen that the estimation errors of weights in actor and critic networks will inevitably affect the performance of the above ADP algorithm. Thus, it is necessary to prove the boundedness of the critic and actor NN weights.

    Theorem 3: Consider the discrete-time Markov jump system (MJS) (5), the critic NN (44) and the actor NN (48). Then, for the fixed time k , the weight estimation error \tilde{W}_{2c,s} in (52) of the critic NN and the weight estimation error \tilde{W}_{2a,s} in (53) of the actor NN are all UUB, if the following conditions on the learning rates are satisfied

    \; 0<\varepsilon_c\leq \phi_{c,m}^{-2},\;\;0<\varepsilon_a\leq \phi_{1a,m}^{-2}. (54)

    Proof: In order to show the boundedness, we introduce a Lyapunov function candidate

    \begin{split} L_{\tilde{W}_s} = \; &L_{\tilde{W}_{2c},s}+L_{\tilde{W}_{2a},s}\\ = \;& \frac{1}{\alpha_c}{\rm tr}\left\{\tilde{W}^{T}_{2c,s}\tilde{W}_{2c,s}\right\}+\frac{1}{\alpha_a}{\rm tr}\left\{\tilde{W}^{T}_{2a,s}\tilde{W}_{2a,s}\right\}. \end{split}

    In what follows, the proof is similar to that in [42], and therefore its details are omitted; the corresponding learning rates need to satisfy

    \begin{split} &0<\varepsilon_c\leq \frac{1}{\parallel \phi_c(W_{1c}^{T}{\textit{z}}_{k})\parallel^{2}}\\ &0<\varepsilon_a\leq \frac{\parallel\phi_{1a}(W_{1a}^{T}x_{k})\parallel^{-2}}{1-\parallel \phi_{2a}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k}))\parallel^{2}}. \end{split}

    Since the activation functions \phi_c(\cdot) , \phi_{1a}(\cdot) and \phi_{2a}(\cdot) are bounded, the learning-rate bounds in (54) can be obtained.

    Assumption 3: The function \|g(x_k)\| in (1) is bounded, and therefore the function \|g_{i}(x_k)\| is also bounded.

    Theorem 4: Let the initial control input be admissible and the initial actor-NN and critic-NN weights be selected from a compact set which includes the ideal weights. Suppose that the NN weight updating laws (47) and (51) are adopted in an offline way for the critic network (44) and the actor network (48), and that the updating law (16) with (17) is employed in an online way for the identifier (12). Then, the closed-loop system (5) (or (10)) with the control law \hat{u}(x_k) = \phi_{2a}(\hat{W}^{T}_{2a,\infty}\phi_{1a}(W_{1a}^{T}x_{k})) obtained from (48) is ultimately bounded in the mean-square sense if all conditions in Theorems 1 and 3 hold.

    Proof: In the framework of identifier-based control, taking the control policy (48) into account, the actual closed-loop system is as follows:

    \begin{split} x_{k+1} = \;& F_{\xi_k}(x_k, \hat{u}_{\infty}(x_k))\\ = \; &F_{\xi_k}(x_k, u^{\ast}(x_k))+ g_{\xi_k}(x_k)(\hat{u}_{\infty}(x_k)- u^{\ast}(x_k))\\ = \;& F_{\xi_k}(x_k, u^{\ast}(x_k)) +g_{\xi_k}(x_k)(\phi_{2a}(\tilde{W}^{T}_{2a}\psi_{k})-\theta_a(x_{k})) \end{split}

    where

    \begin{split} &\psi_{k} = \; \phi_{1a}(W_{1a}^{T}x_{k})\\ &\phi_{2a}(\tilde{W}^{T}_{2a}\psi_{k}) = \; \phi_{2a}(\hat{W}^{T}_{2a,\infty} \psi_{k})-\phi_{2a}({W}^{T}_{2a} \psi_{k}). \end{split}

    Obviously, considering the property of activation functions of NNs, one has that \|\phi_{2a}(\tilde{W}^{T}_{2a}\psi_{k})-\theta_a(x_{k})\| is bounded. Furthermore, benefiting from Assumption 3, the additional term g_{\xi_k}(x_k)(\phi_{2a}(\tilde{W}^{T}_{2a}\psi_{k})-\theta_a(x_{k})) is also bounded.

    On the other hand, according to optimal control theory, the policy (43) stabilizes the system (11) (i.e., (10)) on the compact set. With the same approach as in [37], it is clear that there exists a constant H^{\ast} such that

    {E}\parallel\sum\limits_{j = 1}^{N} p_{ij}F_{i}(x_k, u^{\ast}(x_k))\parallel^{2}\leq H^{\ast}{E} \parallel x_k \parallel^{2}. (55)

    By virtue of the input-to-state stability or the similar line in [37], one can conclude that the actual closed-loop system is ultimately bounded in mean-square sense.

    Remark 6: In the above subsections, a set of critic and actor networks are designed to approximate the performance index function sequence J_{s}(x_{k}) and the control law sequence u_s(x_k) for the fixed x_k , where the updating rules of NN weights are derived via the gradient descent. By means of the well-known Lyapunov stability theory, we obtain the conditions on the learning rates of the neural networks, under which both the weight error dynamics and the closed-loop system are ultimately bounded.

    Remark 7: In almost all ADP-based suboptimal control issues for nonlinear systems, NNs are widely utilized to approximate the unknown nonlinear dynamics as well as the actor and critic functions. Such a structure, named the actor-critic structure, provides the capability of forward calculation while avoiding the curse of dimensionality. Inspired by the idea in [43], a tuning parameter q_k has been employed in the identification of unknown nonlinear systems to adjust the approximation error \varepsilon_{k} . Furthermore, three-layer feedforward NNs have been adopted to approximate the actor and critic functions, where the approximation capability is enhanced due to the utilization of a hidden layer.

    Remark 8: To date, two typical iteration strategies of ADP algorithms are utilized to obtain the desired controller parameters and the associated utility, namely, policy iteration (PI) and value iteration (VI). One major difference between the PI and VI strategies is that PI requires an initial admissible control policy that stabilizes the system states [44]. From a mathematical point of view, the initial admissible control can be regarded as a suboptimal control, which requires solving the nonlinear partial differential equations (PDEs) analytically. To overcome this shortcoming, a VI-based strategy has been developed in this paper to deal with the control issue with input saturation and communication scheduling.

    In this section, we use a simulation example to show that the proposed suboptimal control is effective for discrete-time nonlinear systems with input saturation under SCPs.

    Consider the following nonlinear system:

    x_{k+1} = \left[ {\begin{array}{*{20}{c}} -0.5x_{1,k}+0.1x_{2,k} \\ 0.1\sin(x_{1,k})\exp(|x_{2,k}|)+1.2x_{2,k} \\ \end{array}} \right]+\left[ {\begin{array}{*{20}{c}} \bar{u}_{1,k} \\ \bar{u}_{2,k} \end{array}} \right]

    where x_{i,k} ( i = 1,2 ) stands for the i -th element of vector x_k with the initial value x_0 = [-0.5,\;-0.2]^{T} , and \bar{u}_{i,k} ( i = 1,2 ) is the actuator input scheduled by SCPs, where the scheduling probabilities are p_{11} = 0.65 and p_{22} = 0.6 . Its initial value is u_0 = [-0.1,0.5]^{T} and the saturation level \bar u is 5.
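The open-loop instability reported later in Fig. 3 can be checked directly: with zero input, the x_{2} dynamics are dominated by the factor 1.2, so the state norm grows. A short simulation sketch:

```python
import numpy as np

def plant(x, u_bar_k):
    # The example plant; u_bar_k is the SCP-scheduled actuator input.
    x1, x2 = x
    f = np.array([-0.5 * x1 + 0.1 * x2,
                  0.1 * np.sin(x1) * np.exp(abs(x2)) + 1.2 * x2])
    return f + u_bar_k

x = np.array([-0.5, -0.2])         # initial state x_0 from the example
for _ in range(12):
    x = plant(x, np.zeros(2))      # open loop: zero actuator input
# After a dozen steps the state norm has grown severalfold,
# confirming that the open-loop system is divergent.
```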

    In this example, three-layer feedforward NNs are chosen for the identifier, the critic network and the actor network with structures 4-4-2 , 4-2-1 , and 2-2-2 , respectively. Furthermore, the activation functions are selected as follows

    \begin{split} &\phi_x(\ast) = \; \frac{2(e^{\ast}-e^{-\ast })} {e^{\ast}+e^{-\ast}}\;\;\\ &\phi_{c}(\ast) = \; \phi_{1a}(\ast) = \frac{e^{\ast}-e^{-\ast }} {e^{\ast}+e^{-\ast}}\;\;\\ &\phi_{2a}(\ast) = \; \frac{\bar {u}(e^{\ast}-e^{-\ast })} {e^{\ast}+e^{-\ast}}. \end{split}

    Thus, the bounds of \phi_x , \phi_{c} , \phi_{1a} and \phi_{2a} are \phi_{x,m} = 2 , \phi_{c,m} = \phi_{1a,m} = 1 and \phi_{2a,m} = \bar{u} , respectively.
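These activation choices are scaled hyperbolic tangents, so the stated bounds follow from |\tanh(\cdot)|<1 . A quick numerical confirmation (sampling an interval, purely illustrative):

```python
import numpy as np

u_bar = 5.0  # saturation level of the example

def phi_x(s):
    return 2.0 * np.tanh(s)        # identifier activation, bound phi_{x,m} = 2

def phi_c(s):
    return np.tanh(s)              # critic / first actor layer, bound 1

def phi_2a(s):
    return u_bar * np.tanh(s)      # actor output layer, bound u_bar

s = np.linspace(-50.0, 50.0, 10001)
# Sampled suprema respect phi_{x,m} = 2, phi_{c,m} = phi_{1a,m} = 1,
# and phi_{2a,m} = u_bar.
```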

    By virtue of Theorem 1, we can employ the learning rates \gamma_\nu = \gamma_w = 0.1 and the parameters \alpha_\nu = 0.9 , c_2 = 1 , and \nu_{0} = 0.1 in the tuning law of \hat{W}^{k}_{2,i} and the additional tunable parameter \nu_k . Furthermore, the weight matrix W_{1} between the input and hidden layers is taken as 1.2I , and the initial weight matrices \hat{W}^{0}_{2,i} ( i = 1, 2 ) in (12) between the hidden layer and the output layer are selected as

    \hat{W}^{0}_{2,i} = \left[ {\begin{array}{*{20}{c}} -0.50, &0.1, &1.0, &0 \\ 0.02, &1.2, &0, &1.0 \end{array}} \right]^{T}, \;\;\; i = 1,2.

    In what follows, we consider the matrices Q_1 = Q_2 = 0.5I and R_1 = R_2 = 0.2I in the cost function (6) and the corresponding scalars \lambda_1 = 0.4 and \lambda_2 = 0.6 in the combined performance index (8). Furthermore, the parameters in the updating rules (47) and (51) are chosen as \varepsilon_a = \varepsilon_c = 0.4 for the adopted critic-actor network with the help of Theorem 3, and the initial weight matrices of this critic-actor network are selected as

    \hat{W}_{2c,0} = \; \left[ {\begin{array}{*{20}{c}} 1.00, & 1.05 \end{array}} \right]^{T},\;\;\\ \hat{W}_{2a,0} = \; \left[ {\begin{array}{*{20}{c}} -0.04, & -1.16 \\ -0.01, & -0.134 \end{array}} \right]^{T}.

    In addition, the weight matrices W_{1a} and W_{1c} between the input and hidden layers are

    {W}_{1a} = 0.2I,\;\; {W}_{1c} = \left[ {\begin{array}{*{20}{c}} 2, & 0, & 0.01, & 0 \\ 0, & 2.5, & 0, & 0.01 \end{array}} \right]^{T}.

    The weight matrices of the critic-actor networks are trained at instant k = 4 with 200 steps. After being trained, the weights are kept unchanged. The training process is shown in Fig. 2, and the trajectories are convergent, which verifies the effectiveness of the developed ADP scheme.

    Figure  2.  The iterative trajectories of the weight matrices \hat{W}_{2a,s} and \hat{W}_{2c,s} .

    The simulation results are presented in Figs. 3-7. The state trajectories x_k of the open-loop system are depicted in Fig. 3 to reveal that the open-loop system is divergent. With the help of the designed controller, the state trajectories x_k of the closed-loop system and the corresponding trajectories \hat{x}_k of the identifier are respectively shown in Figs. 4 and 5. For this control issue, the scheduled node is shown in Fig. 6, which clearly discloses that the system randomly jumps due to the utilization of different actuator units, and the weight matrices of the identifier \hat{W}^k_{2,i} are plotted in Fig. 7, all of which are eventually convergent. It is not difficult to see that the closed-loop system is stable, and therefore the developed control strategy is effective.

    Figure  3.  State trajectories x_k of the open-loop system.
    Figure  7.  The iterative trajectories of the weight matrices \hat{W}^{k}_{2,i} .
    Figure  4.  State trajectories x_k of the closed-loop system.
    Figure  5.  State trajectories \hat{x}_k of the identifier.
    Figure  6.  The scheduling of actuator units.

    In this paper, we have developed a suboptimal control strategy in the framework of ADP for a class of unknown nonlinear discrete-time systems subject to input constraints. An identifier with a robust term, based on a three-layer neural network whose weight update relies on protocol-induced jumps, has been established to approximate the nonlinear systems, and the corresponding stability analysis has been provided. Then, the value iterative ADP algorithm has been developed to solve the suboptimal control problem with boundedness analysis, and the convergence of the iterative algorithm, as well as the boundedness of the estimation errors of the critic and actor NN weights, has been analyzed. Furthermore, an actor-critic NN scheme has been developed to approximate the control law and the proposed performance index function, and the stability of the closed-loop system has been discussed. Finally, a numerical simulation has been utilized to demonstrate the effectiveness of the proposed control scheme.

    [1]
    M. Mazouchi, M. B. N. Sistani, and S. K. H. Sani, “A novel distributed optimal adaptive control algorithm for nonlinear multi-agent differential graphical games,” IEEE/CAA J. Autom. Sinica, vol. 5, no. 1, pp. 331–341, Jan. 2018.
    [2]
    Y. J. Liu, L. Tang, S. Tong, C. L. Chen, and D. J. Li, “Reinforcement learning design-based adaptive tracking control with less learning parameters for nonlinear discrete-time MIMO systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 1, pp. 165–176, Jan. 2015.
    [3]
    R. Song and L. Zhu, “Optimal fixed-point tracking control for discrete-time nonlinear systems via ADP,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 3, pp. 657–666, May 2019.
    [4]
    L. Sun, and Z. Zheng, “Disturbance-observer-based robust backstepping attitude stabilization of spacecraft under input saturation and measurement uncertainty,” IEEE Trans. Ind. Electron., vol. 64, no. 10, pp. 7994–8002, 2017.
    [5]
    D. Wang, H. He, X. Zhong, and D. Liu, “Event-driven nonlinear discounted optimal regulation involving a power system application,” IEEE Trans. Ind. Electron., vol. 64, no. 10, pp. 8177–8186, 2017.
    [6]
    H. Li, Y. Wu and M. Chen, “Adaptive fault-tolerant tracking control for discrete-time multi-agent systems via reinforcement learning algorithm,” IEEE Trans. Cybern., to be published. DOI: 10.1109/TCYB.2020.2982168.
    [7]
    T. Wang, H. Gao, and J. Qiu, “A combined adaptive neural network and nonlinear model predictive control for multirate networked industrial process control,” IEEE Trans. Ind. Electron., vol. 27, no. 2, pp. 416–425, 2016.
    [8]
    H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, 2009.
    [9]
    Z. Shi and Z. Wang, “Optimal control for a class of complex singular system based on adaptive dynamic programming,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 1, pp. 188–197, Jan. 2019.
    [10]
    R. Song, Q. Wei, H. Zhang, and F. L. Lewis, “Discrete-time non-zero-sum games with completely unknown dynamics,” IEEE Trans. Cybern., vol. 99, pp. 1–15, 2019. doi: 10.1109/TCYB.2019.2957406
    [11]
    Q. Wei and D. Liu, “Data-driven neuro-optimal temperature control of water-gas shift reaction using stable iterative adaptive dynamic programming,” IEEE Trans. Ind. Electron., vol. 61, no. 11, pp. 6399–6408, 2014.
    [12]
    P. J. Werbos, “Foreword-ADP: the key direction for future research in intelligent control and understanding brain intelligence,” IEEE Trans. Syst., Man, Cybern., Part B, vol. 38, pp. 898–900, 2008.
    [13]
    D. P. Bertsekas and J. N. Tsitsiklis, “Neuro-Dynamic Programming,” Athena Scientific, USA, Belmont, MA, 1996.
    [14]
    R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction”, Cambridge, MA, USA: MIT Press, 1998.
    [15]
    A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof,” IEEE Trans. Syst., Man, Cybern., Part B, vol. 38, no. 4, pp. 943–949, 2008.
    [16]
    A. Heydari, “Stability analysis of optimal adaptive control under value iteration using a stabilizing initial policy,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 9, pp. 4522–4527, Sept. 2018.
    [17]
    Q. Wei, D. Liu, and H. Lin, “Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems,” IEEE Trans. Cybern., vol. 46, pp. 840–853, 2016.
    [18]
    W. B. Powell, “Approximate Dynamic Programming,” Hoboken, NJ, USA: Wiley, 2007.
    [19]
    D. V. Prokhorov and D. C. Wunsch, “Adaptive critic designs,” IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, 1997.
    [20]
    X. Zhong, N. Zhen, and H. He, “A theoretical foundation of goal representation heuristic dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst, vol. 27, no. 12, pp. 2513–2525, 2017.
    [21]
    Y. Yuan, Z. Wang, P. Zhang, and H. Liu, “Near-optimal resilient control strategy design for state-saturated networked systems under stochastic communication protocol,” IEEE Trans. Cybern., vol. 49, no. 8, pp. 1–13, 2018.
    [22]
    M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005.
    [23]
    X. Yang and B. Zhao, “Optimal neuro-control strategy for nonlinear systems with asymmetric input constraints,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 2, pp. 575–583, Mar. 2020.
    [24]
    D. Liu, X. Yang, D. Wang, and Q. Wei, “Reinforcement-learning-based robust controller design for continuous-time uncertain nonlinear systems subject to input constraints,” IEEE Trans. Cybern., vol. 45, no. 7, pp. 1372–1385, Jul. 2015.
    [25]
    Y. J. Liu, S. Li, S. Tong, and C. L. P. Chen, “Neural approximation-based adaptive control for a class of nonlinear nonstrict feedback discrete-time systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 7, pp. 1531–1541, Jul. 2017.
    [26]
    H. Xu, Q. Zhao, and S. Jagannathan, “Finite-horizon near-optimal output feedback neural network control of quantized nonlinear discrete-time systems with input constraint,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 8, pp. 1776–1788, Aug. 2015.
    [27]
    Y. Zhu, D. Zhao, H. He, and J. Ji, “Event-triggered optimal control for partially unknown constrained-input systems via adaptive dynamic programming,” IEEE Trans. Ind. Electron., vol. 64, no. 5, pp. 4101–4109, 2017.
    [28]
    H. Modares, F. L. Lewis, and M. Naghibi-Sistani, “Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 10, pp. 1513–1525, 2013.
    [29]
    D. Ding, Q. L. Han, X. Ge, and J. Wang, “Secure state estimation and control of cyber-physical systems: A survey,” IEEE Trans. Syst. Man, Cybern.: Syst., to be published. DOI: 10.1109/TSMC.2020.3041121.
    [30]
    V. Ugrinovskii and E. Fridman, “A round-robin type protocol for distributed estimation with H∞ consensus,” Syst. Control Lett., vol. 69, pp. 103–110, 2014.
    [31]
    G. Walsh, H. Ye, and L. Bushnell, “Stability analysis of networked control systems,” IEEE Trans. Control Syst. Tech., vol. 10, no. 3, pp. 438–446, 2002.
    [32]
    L. Zou, Z. Wang, and H. Gao, “Observer-based H∞ control of networked systems with stochastic communication protocol: The finite-horizon case,” Automatica, vol. 63, pp. 366–373, 2016.
    [33]
    H. Ma, H. Li, R. Lu, and T. Huang, “Adaptive event-triggered control for a class of nonlinear systems with periodic disturbances,” Sci. China Inf. Sci., vol. 63, no. 5, pp. 157–171, 2020.
    [34]
    Z. Wang, Q. Wei, and D. Liu, “Event-triggered adaptive dynamic programming for discrete-time multi-player games,” Inf. Sci., vol. 506, pp. 457–470, Jan. 2020.
    [35]
    D. Ding, Z. Wang, and Q. L. Han, “Neural-network-based consensus control for multi-agent systems with input constraints: The event-triggered case,” IEEE Trans. Cybern., vol. 50, no. 8, pp. 1–12, 2019.
    [36]
    X. Zhong, H. He, H. Zhang, and Z. Wang, “Optimal control for unknown discrete-time nonlinear Markov jump systems using adaptive dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 12, pp. 2141–2155, 2014.
    [37]
    D. Ding, Z. Wang, and Q. L. Han, “Neural-network-based output-feedback control with stochastic communication protocols,” Automatica, vol. 106, pp. 221–229, Aug. 2019.
    [38]
    N. Azevedo, D. Pinheiro, and G.-W. Weber, “Dynamic programming for a Markov-switching jump-diffusion,” J. Comput. Appl. Math., vol. 267, no. 6, pp. 1–19, Sep. 2014.
    [39]
    M. C. F. Donkers, W. P. M. H. Heemels, D. Bernardini, A. Bemporad, and V. Shneer, “Stability analysis of stochastic networked control systems,” Automatica, vol. 48, no. 4, pp. 917–925, 2012.
    [40]
    T. Dierks, B. T. Thumati, and S. Jagannathan, “Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence,” Neural Networks, vol. 22, no. 5–6, pp. 851–860, 2009.
    [41]
    B. Lincoln and A. Rantzer, “Relaxing dynamic programming,” IEEE Trans. Autom. Control, vol. 51, no. 8, pp. 1249–1260, Aug. 2006.
    [42]
    J. Song, Y. Niu, and Y. Zou, “Convergence analysis for an identifier-based adaptive dynamic programming algorithm,” in Proc. 34th Chinese Control Conf., 2015.
    [43]
    D. Liu, D. Wang, and X. Yang, “An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs,” Inf. Sci., vol. 220, no. 1, pp. 331–342, 2013.
    [44]
    D. Liu and Q. Wei, “Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 3, pp. 621–634, 2014.
    Highlights

    • An NN-based identifier with a robust term is presented to approximate the unknown nonlinear system, where the weight update rules incorporate an additional tunable parameter;
    • A value iterative ADP algorithm is proposed to solve, in an offline manner, the suboptimal control problem of protocol-induced switching systems with saturation constraints;
    • The convergence of the ADP algorithm is established, and the algorithm is implemented via an actor-critic NN scheme;
    • A set of conditions is derived to guarantee the stability of both the identification error dynamics and the NN weight update error dynamics.
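    The value-iteration step highlighted above can be illustrated with a minimal numerical sketch. This is not the paper's NN-based implementation: the plant f, the saturation bound lam, the grids, and the quadratic state cost are all illustrative assumptions, and the nonquadratic control penalty follows the standard saturating-actuator construction of Abu-Khalaf and Lewis [22] rather than the exact index used in the paper.

    ```python
    import numpy as np

    # Hypothetical scalar plant x_{k+1} = f(x_k, u_k); the paper identifies the
    # unknown dynamics with an NN, but here f is assumed known for illustration.
    def f(x, u):
        return 0.8 * np.sin(x) + u

    lam = 0.5                                    # saturation bound: |u| <= lam
    xs = np.linspace(-2.0, 2.0, 81)              # state grid (assumption)
    us = lam * np.tanh(np.linspace(-3, 3, 61))   # action grid, saturated via tanh

    # Nonquadratic utility for saturated inputs (Abu-Khalaf & Lewis style):
    # U(x,u) = x^2 + 2 * integral_0^u lam * atanh(s/lam) ds, evaluated in closed form.
    def utility(x, u):
        s = np.clip(u / lam, -0.999, 0.999)
        penalty = 2.0 * (lam * u * np.arctanh(s) + 0.5 * lam**2 * np.log(1.0 - s**2))
        return x**2 + penalty

    # Value iteration: V_0 = 0, V_{i+1}(x) = min_u { U(x,u) + V_i(f(x,u)) },
    # with V_i evaluated off-grid by linear interpolation.
    V = np.zeros_like(xs)
    for _ in range(200):
        Q = np.array([[utility(x, u) + np.interp(f(x, u), xs, V) for u in us]
                      for x in xs])
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-6:
            break
        V = V_new
    ```

    With V_0 = 0 and a nonnegative utility, the iterates are monotonically nondecreasing and bounded, which is the mechanism behind the convergence argument sketched in the second highlight.
    
    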
