
IEEE/CAA Journal of Automatica Sinica
Citation: Xueli Wang, Derui Ding, Hongli Dong, and Xian-Ming Zhang, "Neural-Network-Based Control for Discrete-Time Nonlinear Systems with Input Saturation Under Stochastic Communication Protocol," IEEE/CAA J. Autom. Sinica, vol. 8, no. 4, pp. 766-778, Apr. 2021. doi: 10.1109/JAS.2021.1003922
OPTIMAL control has long been one of the main focuses of the control field due to its wide applications in various emerging industrial systems, such as electrical power systems, industrial control systems, and spacecraft attitude control systems [1]-[7]. It is usually equivalent to solving the well-known Hamilton-Jacobi-Bellman (HJB) equation, which is a critical challenge for nonlinear systems [8]. Fortunately, the adaptive dynamic programming (ADP) algorithm, as one of the most efficient tools, has been developed to handle various suboptimal control problems with known or unknown system dynamics [9]-[11], by virtue of both its ability to effectively approximate the relevant functions and its iterative forward-in-time structure. The main idea of ADP algorithms is to utilize two function sequences to iteratively approximate the cost and value functions corresponding to the solution of the HJB equation in a forward-in-time manner [12]. It should be pointed out that the value iteration technique developed in [13], [14] is one of the most important iterative ADP algorithms, and its convergence has been thoroughly discussed in [15]-[17]. Furthermore, some representative algorithms, including heuristic dynamic programming (HDP), dual heuristic dynamic programming (DHP), as well as globalized DHP, have been proposed and implemented for various control problems benefiting from the well-known actor-critic structure, see [18]-[20]. It is noteworthy that the obtained controller is usually a suboptimal one because of the approximation errors inherent in such a structure, and therefore the corresponding control is also regarded as near-optimal control.
In engineering practice, actuator saturation is very pervasive, due mainly to facility protection or the physical limits of actuators. If actuator saturation is not adequately accounted for, the performance of the closed-loop system is often severely degraded [21]. As a result, it is of tremendous significance to examine the influence of the input saturation phenomenon. Under the framework of optimal control, a bounded and invertible one-to-one function in a nonquadratic performance functional is usually exploited to evaluate the cost of saturated inputs, and the analytical solution of the optimal controller can then be obtained, although it still depends on the cost functional [8], [22], [23]. Inspired by these works, near-optimal control for various networked control systems has been investigated and some interesting results have been preliminarily reported in the literature, see [24]-[27], for instance. Near-optimal regulation under the actor-critic framework has been investigated in [26] for discrete-time nonlinear systems subject to quantization effects, where the quantization errors can be eliminated via a dynamic quantizer with adaptive step size. Furthermore, an online policy iteration algorithm has been presented in [28] to learn the optimal solution for a class of unknown constrained-input systems. Obviously, compared with the case without control constraints, near-optimal control problems subject to constrained inputs and various network-induced phenomena remain at an infant stage and thus require further research efforts.
On another frontier of research, in the past few years, we have witnessed the persistent development of network technologies, which has been attracting recurring attention to networked control systems [29]. In order to effectively utilize the limited communication resource or reduce the switching frequency so as to prolong the service life of the equipment, only one (or a limited number of) sensor/control node, governed by a protocol, is permitted to access the communication network at each instant. These protocols include, but are not limited to, the round-robin protocol [30], the try-once-discard protocol [31], the stochastic communication protocol [32], and the event-triggered protocol [33], [34]. There is no doubt that the utilization of these protocols significantly increases the complexity and difficulty of both the stability analysis and the design of weight updating rules, which is the main reason why results on this topic remain sparse. Very recently, consensus control with the help of reconstructed dynamics of the local tracking errors has been investigated in [35] for multi-agent systems with an event-triggered mechanism and input constraints, where the effect of the adopted triggering scheme on the local cost has been investigated. The critic and actor networks, combined with an identifier network, have been simultaneously designed in [27] to deal with a constrained-input control problem with unknown drift dynamics and event-triggered communication protocols. Unfortunately, so far, near-optimal control for discrete-time nonlinear systems subject to input saturation has not yet been adequately investigated, not to mention the case where the stochastic communication protocol (SCP) is also a concern, which constitutes the motivation of this paper.
The addressed system with unknown nonlinear dynamics is essentially a protocol-induced switching system when the SCP is employed to govern the data transmission or update between the controller and the actuator. Usually, the SCP can be modeled by a Markov chain, and the related networked control problems can be effectively handled via switching system theory combined with Lyapunov approaches. It is worth noting that this is a nontrivial topic for optimal control problems, due mainly to the challenge that such switching poses to the cost function. Recently, two typical approaches have been developed, respectively, in [36] via a combined cost function related to transition probabilities, and in [37] via the dynamic programming principle [38]. However, when an identifier is designed to approximate the unknown nonlinear dynamics, it is a great challenge to disclose the influence on the updating rules of the identifier’s weights and on the identification errors. Furthermore, the convergence of the designed ADP algorithm and its practical execution with critic and actor networks should be further inspected. As such, motivated by the above discussions, the focus of this paper is to handle the neural-network (NN)-based near-optimal control problem for a discrete-time nonlinear system subject to constrained inputs and SCPs. This appears to be nontrivial due to the following essential difficulties: 1) how to design an NN-based identifier under SCPs to estimate the system dynamics; 2) how to perform the convergence analysis of the ADP algorithm; and 3) how to disclose the performance of the closed-loop system in the framework of critic and actor networks.
In response to the above discussions, this paper is concerned with the near-optimal control problem for a class of discrete-time nonlinear systems with constrained inputs and SCPs, and its main contributions are highlighted as follows: 1) an NN-based identifier with a robust term is presented to approximate the unknown nonlinear system, where novel weight updating rules are constructed by virtue of an additional tunable parameter; 2) a set of conditions is derived to check the stability of both the identification error dynamics and the updated error dynamics of the NN weights; 3) the convergence of the proposed value-iteration ADP algorithm, which solves the optimal control problem of protocol-induced switching systems with saturation constraints in an off-line manner, is profoundly discussed in light of mathematical induction; and 4) an actor-critic NN scheme is employed to perform the addressed near-optimal control.
The rest of this paper is organized as follows: the problem formulation and preliminaries are presented in Section II. For the addressed control issue, four subsections are involved in Section III: an NN-based identifier with a robust modification term is designed in Section III-A to identify discrete-time systems with unknown nonlinear dynamics; the value-iteration ADP algorithm with convergence analysis is developed in Section III-B; the implementation of the ADP algorithm with actor-critic networks is presented in Section III-C; and the performance of the closed-loop system is discussed in Section III-D. Furthermore, a numerical example is given in Section IV to demonstrate the effectiveness of the proposed algorithms. Finally, the conclusion is given in Section V.
Notation: The notation used in this paper is standard.
In this paper, the investigated networked control system consists of a nonlinear plant, sensors, identifier, controller, as well as actuator. We assume that the system states
Consider the unknown discrete-time nonlinear system with the following form:
x_{k+1} = f(x_k)+g(x_k)\bar{u}_k | (1) |
where
In light of unknown nonlinearities, an NN-based system identifier via
p_{ij} = {\rm Prob}\{\xi_{k+1} = j|\xi_k = i\} | (2) |
where
\bar{u}_{i,k} = \left\{ {\begin{array}{*{20}{l}} u_{i,k}, & {\rm{if}}\;\; i = \xi_k \\ \bar{u}_{i,k-1}, & {\rm{otherwise}} \end{array}} \right. | (3) |
where zero-order-holders are utilized in the viewpoint of practical engineering.
The actuator is further denoted as
\bar{u}_k = \Phi(\xi_k)u_k | (4) |
with
Thus, the closed-loop system is as follows:
x_{k+1} = f(x_k)+g(x_k)\Phi(\xi_k)u_k := f_{\xi_k}(x_k)+\tilde{g}_{\xi_k}(x_k)u_k. | (5) |
Remark 1: The main idea of SCP is to assign the access privilege for each node in a random manner. The “random switch” behavior of the node scheduling can be usually characterized by a Markov chain, see the corresponding research in [39]. Obviously, the addressed system (5) is essentially a protocol-induced switching system.
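For illustration, the "random switch" of SCP scheduling can be simulated directly as a Markov chain governed by the transition probabilities in (2); the two-node transition matrix and horizon below are hypothetical values chosen purely for this sketch, not taken from the paper.

```python
import numpy as np

# Hypothetical two-node SCP: P[i, j] = Prob{xi_{k+1} = j | xi_k = i} as in (2)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
rng = np.random.default_rng(0)

xi = 0                        # node 0 initially holds the access privilege
schedule = [xi]
for k in range(20):           # simulate 20 scheduling instants
    xi = int(rng.choice(2, p=P[xi]))
    schedule.append(xi)

# Under (3), a node's control input is refreshed only at instants where it is
# scheduled; otherwise a zero-order holder keeps the previously received value.
```

Each row of `P` sums to one, so `schedule` is one sample path of the protocol-induced switching signal driving system (5).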
To quantify the control performance, the associate utility of each scheduling is employed as follows:
J_i(x_k) = \sum\limits_{j = k}^{\infty}\ell_i(x_j,u_j) = \sum\limits_{j = k}^{\infty}\left(Q_i(x_j)+S_i(u_j)\right) | (6) |
where
S_i(u_k) = \int_{0}^{u_k}2\bar{u}\left(\tanh^{-1}\Big(\frac{\nu}{\bar{u}}\Big)\right)^{T}R_i\,d\nu = \sum\limits_{i = 1}^{N}\int_{0}^{u_{i,k}}2\bar{u}\left(\tanh^{-1}\Big(\frac{\nu_i}{\bar{u}}\Big)\right)^{T}R_i\,d\nu_i | (7) |
where
\tanh^{-1}(\nu/\bar{u}) = [\tanh^{-1}(\nu_1/\bar{u}),\ldots,\tanh^{-1}(\nu_N/\bar{u})]^{T}.
Following the same approach as in [27],
S_i(u_k) = 2\bar{u}u_k^{T}R_i\tanh^{-1}\Big(\frac{u_k}{\bar{u}}\Big)+\bar{u}^2\bar{R}_i\ln\Big(1-\frac{u_k^2}{\bar{u}^2}\Big)
where
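As a numerical sanity check, the scalar case of the closed-form cost above can be compared against direct quadrature of the integral in (7); the values ū = 1, R = 1 and the midpoint-rule resolution below are hypothetical choices for illustration only.

```python
import math

def S_closed(u, ubar=1.0, R=1.0):
    # closed-form nonquadratic cost:
    # 2*ubar*u*R*atanh(u/ubar) + ubar^2 * R * ln(1 - u^2/ubar^2)
    return 2*ubar*u*R*math.atanh(u/ubar) + ubar**2 * R * math.log(1 - (u/ubar)**2)

def S_quad(u, ubar=1.0, R=1.0, n=20000):
    # direct quadrature of (7): integral_0^u 2*ubar*atanh(v/ubar)*R dv
    # (midpoint rule with n subintervals)
    h = u / n
    return sum(2*ubar*math.atanh(((j + 0.5)*h)/ubar)*R for j in range(n)) * h

# the two expressions agree to quadrature accuracy
assert abs(S_closed(0.7) - S_quad(0.7)) < 1e-6
```

The agreement reflects the antiderivative identity d/dv[v·atanh(v/ū) + (ū/2)·ln(1 − v²/ū²)] = atanh(v/ū), which is exactly how the closed form is obtained from (7).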
Remark 2: In the framework of optimal control, the term
In order to disclose the effect of the statistical characteristics of SCPs, similarly to the scheme in [36], the performance index function (6) is reconstructed by embedding the transition probability matrix as follows:
\left\{ \begin{array}{l} J_{\rm I}(x_k) = p_{11}J_1(x_k)+p_{12}J_2(x_k)+\cdots+p_{1N}J_N(x_k) \\ J_{\rm II}(x_k) = p_{21}J_1(x_k)+p_{22}J_2(x_k)+\cdots+p_{2N}J_N(x_k) \\ \quad\quad\vdots \\ J_{\rm N}(x_k) = p_{N1}J_1(x_k)+p_{N2}J_2(x_k)+\cdots+p_{NN}J_N(x_k). \end{array} \right.
By virtue of the weighted sum technique, a combined performance index is constructed as
J(x_k) = \lambda_1 J_{\rm I}(x_k)+\lambda_2 J_{\rm II}(x_k)+\cdots+\lambda_N J_{\rm N}(x_k) | (8) |
where
Define
\Gamma = [\Gamma_1,\Gamma_2,\ldots,\Gamma_N]^{T},\quad L(x_k,u_k) = [l_1(x_k,u_k),\ldots,l_N(x_k,u_k)]
where
J(x_k) = \sum\limits_{i = 1}^{N}\Gamma_i J_i(x_k) = \sum\limits_{i = 1}^{N}\Gamma_i l_i(x_k,u_k)+\sum\limits_{i = 1}^{N}\Gamma_i J(x_{k+1}) = \Gamma^{T}L(x_k,u_k)+J(x_{k+1}). | (9) |
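Reading (8) together with the reconstructed indices, each node cost J_i receives the combined weight Γ_i = Σ_m λ_m p_{mi}, i.e., Γ = Pᵀλ. The sketch below checks this with a hypothetical two-node transition matrix, weights λ, and per-node index values, all invented for illustration.

```python
import numpy as np

# hypothetical two-node example: transition matrix P and weights lambda in (8)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
lam = np.array([0.5, 0.5])

# substituting the reconstructed indices into (8):
# J = sum_i Gamma_i * J_i with Gamma_i = sum_m lam_m * p_{mi}
Gamma = P.T @ lam                    # here [0.55, 0.45]; components sum to 1

J_node = np.array([2.0, 3.0])        # hypothetical per-node indices J_i(x_k)
J_combined = Gamma @ J_node
```

Because each row of P sums to one and the λ weights sum to one, the Γ components also sum to one, so J remains a convex combination of the node costs.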
Before proceeding further, let us introduce the following definition.
Definition 1: A law
The purpose of this paper is to find a suboptimal control law
1) Designing an NN-based identifier to identify the unknown nonlinear dynamics;
2) Developing a value-iteration ADP algorithm to solve the optimal control of protocol-induced switching systems with saturation constraints in an off-line way;
3) In light of the obtained value-iteration ADP algorithm, proposing an actor-critic NN scheme to perform the addressed near-optimal control.
The following assumption is needed in order to reveal the boundedness of the developed approximation scheme in the sequel.
Assumption 1: The cost function
1)
2) The derivative function
3) The inverse function satisfies
Note that the function
Four subsections are embodied in this section, including the design of the NN-based identifier, the iterative ADP algorithm, and the actor-critic NN scheme, as well as the performance analysis of the identification errors, the iterative ADP algorithm, and the closed-loop system.
In this paper, an NN-based approximator is utilized to identify discrete-time nonlinear systems without knowledge of the system dynamics, so as to solve the optimal control issue. Specifically, to learn the unknown nonlinear functions, a stable adaptive weight updating law is proposed for tuning the nonlinear identifier, and a robust modification term, which is a function of the estimation error and an additional tunable parameter, is also introduced to guarantee the asymptotic stability of the proposed nonlinear identification scheme.
To start the development of the NN-based identifier, the system dynamics (5) are rewritten as
x_{k+1} = F_{\xi_k}(x_k, u_{k}) | (10) |
where
According to the universal approximation property of NNs, there exists an NN representation of the function
x_{k+1} = W_{2,\xi_k}^{T}\phi_x(\omega_{k})+\varepsilon_{k} | (11) |
where
For the NN represented closed-loop system (11), an identifier is designed to estimate the system state, which is described by
\hat{x}_{k+1} = (\hat{W}^{k}_{2,\xi_k})^{T}\phi_x(\omega_{k})-q_k | (12) |
where
Define the identification error and the estimated error of weight matrix as follows:
\tilde{x}_{k} = \hat{x}_{k}-x_{k},\; \tilde{W}^{k}_{2,\xi_k} = \hat{W}^{k}_{2,\xi_k}-W_{2,\xi_k}. | (13) |
Then, subtracting (11) from (12) yields the following identification error dynamics:
\tilde{x}_{k+1} = \hat{x}_{k+1}-x_{k+1} = (\tilde{W}^{k}_{2,\xi_k})^{T} \phi_x(\omega_k)-\varepsilon_k-q_k. | (14) |
Considering this error dynamics, the robust term inspired by the work of [40] is constructed as
q_k = \frac{\nu_k \tilde{x}_k }{ \tilde{x}_k^{T}\tilde{x}_k+c_2} |
where
\begin{split} \tilde{x}_{k+1} = \;& (\tilde{W}^{k}_{2,\xi_k})^{T}\phi_x(\omega_k)-\frac{{\nu}_k \tilde{x}_k}{\tilde{x}_k^{T}\tilde{x}_k+c_2}-\varepsilon_k \\ = \;& \Phi^{k}_{1,\xi_k}-\Phi^{k}_{2,\xi_k}-\varepsilon_k \end{split} | (15) |
where
For the adopted communication protocol,
\hat{W}^{k+1}_{2,i} = \left\{ {\begin{array}{*{20}{l}} \hat{W}^{k}_{2,i}-\gamma_w\phi_x(\omega_k)\tilde{x}_{k+1}^{T}, & {\rm{if}}\;\; \xi_{k} = i,\; & \; \xi_{k-1} = i \\ \hat{W}^{k}_{2,i},& {\rm{otherwise}} \end{array}} \right. | (16) |
and the tuning law of additional tunable parameter
\begin{split} \nu_{k+1} = \;& \alpha_\nu\nu_{k}+\frac{\gamma_\nu }{\tilde{x}_k^{T}\tilde{x}_k+c_2}\tilde{x}_{k+1}^{T} \tilde{x}_{k} \\ = \;& \alpha_\nu\nu_{k}+\gamma_\nu\Phi^{k}_{3}\tilde{x}_{k+1}^{T} \tilde{x}_{k} \end{split} | (17) |
where
Remark 3: The proposed updating rule (16) is novel and nontrivial. First, a zero-order holder is adopted to keep the weights of unactivated subsystems. Specifically, it can be found from the second case that the weights are unchanged along with the time
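As a sketch (not the exact implementation), one step of the identifier combining (12)-(17) might look as follows; the gains γ_w, γ_ν, α_ν, c₂ and all dimensions are hypothetical placeholder values, and the activation vector φ is assumed to be supplied by the caller.

```python
import numpy as np

gamma_w, gamma_nu, alpha_nu, c2 = 0.05, 0.05, 0.9, 1.0   # hypothetical gains

def identifier_step(x_tilde, x_next, phi, W2_hat_i, nu, scheduled_twice):
    """One identifier step: state estimate (12) with robust term, weight
    rule (16), and tunable-parameter rule (17). `scheduled_twice` encodes
    the condition xi_k = xi_{k-1} = i under which (16) updates the weights."""
    q = nu * x_tilde / (x_tilde @ x_tilde + c2)      # robust term q_k
    x_hat_next = W2_hat_i.T @ phi - q                # identifier (12)
    x_tilde_next = x_hat_next - x_next               # identification error (13)
    if scheduled_twice:                              # SCP-gated update (16)
        W2_hat_i = W2_hat_i - gamma_w * np.outer(phi, x_tilde_next)
    # tunable-parameter rule (17)
    nu = alpha_nu * nu + gamma_nu * (x_tilde_next @ x_tilde) / (x_tilde @ x_tilde + c2)
    return x_tilde_next, W2_hat_i, nu
```

When `scheduled_twice` is false, the weights are simply held, mirroring the zero-order-holder branch of (16); only ν and the error propagate.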
The following assumption and lemma are used to prove the convergence of the error dynamics.
Assumption 2: The NN approximation error
\varepsilon_{k}^{T}\varepsilon_{k}\leq\bar{\vartheta}\tilde{x}_{k}^{T}\tilde{x}_{k} | (18) |
where
Lemma 1: For any positive definite matrix
2x^{T}\Pi y\leq ax^{T}\Pi x+a^{-1}y^{T}\Pi y. | (19) |
Theorem 1: Let the identifier (12) be used to identify the nonlinear system (10), where the parameter updating laws given in (16) and (17) are used to tune the NN weights and the robust modification term, respectively. The estimation error
\left\{ {\begin{array}{*{20}{l}} \; 0<\theta_1< {\dfrac{1}{2}}\\ \; 0<\varepsilon < {\dfrac{1}{4}}\\ \; 0<\bar\vartheta <1\\ \; \alpha_\nu<\sqrt {\dfrac{7}{8}}. \end{array}} \right. | (20) |
Proof: Consider the following Lyapunov function candidate
\begin{split} L^{k} = \;& L^{k}_{1}+\sum\limits_{s = 1}^NL^{k}_{2,s}+L^{k}_{3} \\ = \;& \tilde{x}_{k}^{T}\tilde{x}_{k} +\frac{1}{\gamma_w}\sum\limits_{s = 1}^N {\rm tr}\left\{(\tilde{W}^{k}_{2,s})^{T}\tilde{W}^{k}_{2,s}\right\} +\frac{1}{\gamma_\nu}{\nu}_k^{2}. \end{split} | (21) |
Taking the first-order difference of
\begin{split} &{{E}}\{\Delta L^{k}_{1}|\xi_k = i,x_k\} \\ &\triangleq {{E}}\{\tilde{x}_{k+1}^{T}\tilde{x}_{k+1}|\xi_k = i,x_k\}-\tilde{x}_{k}^{T}\tilde{x}_{k} \\ &= \sum\nolimits^{N}\limits_{j = 1}p_{i,j}\tilde{x}_{k+1}^{T}\tilde{x}_{k+1}-\tilde{x}_{k}^{T}\tilde{x}_{k} \\ &= (\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}+(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}+\varepsilon_k^{T}\varepsilon_k-\tilde{x}_{k}^{T}\tilde{x}_{k} \\ &\quad-2(\Phi^{k}_{1,i})^{T}\Phi^{k}_{2,i}-2(\Phi^{k}_{1,i})^{T}\varepsilon_k+2(\Phi^{k}_{2,i})^{T}\varepsilon_k. \end{split} | (22) |
Similarly, taking the first-order difference of
\begin{split} &\sum\limits_{s = 1}^N{{E}}\left\{\Delta L^{k}_{2,s}|\xi_k = i,x_k\right\} \\ &\triangleq \sum\limits^{N}\limits_{j = 1}{{E}}\Big\{\frac{ p_{i,j}}{\gamma_w}{\rm tr}\left((\tilde{W}^{k+1}_{2,i})^{T}\tilde{W}^{k+1}_{2,i}\right)\Big|\xi_k = i,x_k\Big\}\\ &\quad+\sum\limits_{s = 1,s\not = i}^N \sum\limits^{N}\limits_{j = 1}\frac{ p_{i,j}}{\gamma_w}{{E}}\Big\{{\rm tr}\left((\tilde{W}^{k+1}_{2,s})^{T}\tilde{W}^{k+1}_{2,s}\right)\Big|\xi_k = i,x_k\Big\} \\ &\quad-\frac{1}{\gamma_w}\sum\limits_{s = 1}^N {\rm tr}\left((\tilde{W}^{k}_{2,s})^{T}\tilde{W}^{k}_{2,s}\right) \\ &= \frac{1}{\gamma_w}{{E}}\Big\{{\rm tr}\Big((\tilde{W}^{k}_{2,i}-\gamma_w\phi_x(\omega_k)\tilde{x}_{k+1}^{T})^{T}\\ &\quad\times(\tilde{W}^{k}_{2,i} -\gamma_w\phi_x(\omega_k)\tilde{x}_{k+1}^{T})\Big)|\xi_k = i,x_k\Big\}\\ &\quad-\frac{1}{\gamma_w}{\rm tr}\left((\tilde{W}^{k}_{2,i})^{T}\tilde{W}^{k}_{2,i}\right). \end{split} | (23) |
Noting
\begin{split} & \sum\limits_{s = 1}^N{{E}}\{\Delta L^{k}_{2,s}|\xi_k = i,x_k\} \\ &\quad\leq -2\Phi^{k}_{1,i}\tilde{x}_{k+1}+\gamma_w\phi_{x,m}^{2}\tilde{x}_{k+1}^{T}\tilde{x}_{k+1} \\ &\quad= -2(\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}+2(\Phi^{k}_{1,i})^{T}\Phi^{k}_{2,i}+2(\Phi^{k}_{1,i})^{T}\varepsilon_k \\ &\quad\quad+3\gamma_w\phi_{x,m}^{2}\left((\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}+(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}+\varepsilon_k^{T}\varepsilon_k\right). \end{split} | (24) |
Furthermore, it is not difficult to calculate that
\begin{split} & {{E}}\left\{\Delta L^{k}_{3}|\xi_k = i,x_k\right\} \\ &\quad= \frac{1}{\gamma_\nu}{{E}}\left\{{\nu}_{k+1}^{2}|\xi_k = i,x_k\right\}-\frac{1}{\gamma_\nu} {\nu}_{k}^{2} \\ &\quad= \frac{1}{\gamma_\nu}\left((\alpha_\nu{\nu}_{k}+\gamma_\nu\ \Phi^{k}_{3}\tilde{x}_{k+1}^{T} \tilde{x}_{k})^2-{\nu}_{k}^{2}\right) \\ &\quad= 2(\Phi^{k}_{2,i})^{T}\tilde{x}_{k+1}+\gamma_\nu(\Phi^{k}_{3}\tilde{x}_{k+1}^{T}\tilde{x}_{k})^{2} \\ &\quad\quad-\gamma_\nu^{-1}(1-\alpha_\nu^2) {\nu}_{k}^{2} \\ &\quad\leq -2(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}+ 2(\Phi^{k}_{1,i})^{T}\Phi^{k}_{2,i}-2(\Phi^{k}_{2,i})^{T}\varepsilon_k \\ &\quad\quad +3\gamma_\nu(\Phi^{k}_{3})^{2}\tilde{x}_{k}^{T}\tilde{x}_{k}\Big((\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}+(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}\\ &\quad \quad+\varepsilon_k^{T}\varepsilon_k\Big) -\gamma_\nu^{-1}(1-\alpha_\nu^2) {\nu}_{k}^{2}. \end{split} | (25) |
Denote the first-order difference of
\Delta L^{k} = \Delta L^{k}_{1}+\sum\limits_{s = 1}^N\Delta L^{k}_{2,s}+\Delta L^{k}_{3}. | (26) |
Considering (22), (24) and (25), (26) can be bounded as
\begin{split} \; &{{E}}\left\{\Delta L^{k}|\xi_k = i,x_k\right\}\\ &\quad\leq -(\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}-(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}-\tilde{x}_{k}^{T}\tilde{x}_{k}\\ &\quad \quad+\varepsilon_k^{T}\varepsilon_k+2\Phi^{k}_{2,i}(\Phi^{k}_{1,i})^{T} -\gamma_\nu^{-1}(1-\alpha_\nu^2) {\nu}_{k}^{2}\\ &\quad\quad +3\left(\gamma_w\phi_{x,m}^{2}+\gamma_\nu(\Phi^{k}_{3})^{2}\tilde{x}_{k}^{T}\tilde{x}_{k}\right)\\ &\quad\quad\times\left((\Phi^{k}_{1,i})^{T}\Phi^{k}_{1,i}+(\Phi^{k}_{2,i})^{T}\Phi^{k}_{2,i}+\varepsilon_k^{T}\varepsilon_k\right). \end{split} |
Then, considering Assumption 2 and
\begin{split} \; &{{E}}\left\{\Delta L^{k}|\xi_k = i,x_k\right\} \\ &\quad\leq -(1-3\gamma_w\phi_{x,m}^{2}-3\gamma_\nu)(\parallel\Phi^{k}_{1,i}\parallel^{2}+\parallel\Phi^{k}_{2,i}\parallel^{2}) \\ &\quad \quad-(1-\bar\vartheta-3\bar\vartheta(\gamma_w\phi_{x,m}^{2}+\gamma_\nu)\parallel\tilde{x}_{k}\parallel^{2} \\ &\quad\quad+2\parallel\Phi^{k}_{1,i}\parallel\; \parallel\Phi^{k}_{2,i}\parallel-\gamma_\nu^{-1}(1-\alpha_\nu^2) {\nu}_{k}^{2}. \end{split} | (27) |
Furthermore, noting
\|\Phi^{k}_{2,i}\|^2 = \Big\|\frac{{\nu}_k \tilde{x}_k}{\tilde{x}_k^{T}\tilde{x}_k+c_2}\Big\|^2\leq {\nu}_{k}^{2} | (28) |
one has
\begin{split} 2\parallel\Phi^{k}_{1,i}\parallel\; \parallel\Phi^{k}_{2,i}\parallel \; \leq \;& \theta_1\parallel\Phi^{k}_{1,i}\parallel ^2 +\;\theta_1^{-1}\parallel\Phi^{k}_{2,i}\parallel ^2 \nonumber\\ \leq \;& \theta_1\parallel\Phi^{k}_{1,i}\parallel ^2 +\;\varepsilon\theta_1^{-1}\parallel\Phi^{k}_{2,i}\parallel ^2 \nonumber\\ &+\theta_1^{-1}(1-\varepsilon){\nu}_{k}^{2} \end{split} |
where the scalar
Furthermore, select the parameters as
\begin{split} &{{E}}\{\Delta L^{k}|\xi_k = i,x_k\} \\ &\quad\leq -(1-3\gamma_w\phi_{x,m}^{2}-3\gamma_\nu-\theta_1)\parallel\Phi^{k}_{i}\parallel^{2} \\ & \quad\quad-(1-3\gamma_w\phi_{x,m}^{2}-3\gamma_\nu-\varepsilon\theta_1^{-1})\parallel q_k\parallel^{2} \\ & \quad \quad-\Big(1-\bar\vartheta-3\bar\vartheta(\gamma_w\phi_{x,m}^{2}+\gamma_\nu)\parallel\tilde{x}_{k}\parallel^{2} \\ & \quad\quad-(\gamma_\nu^{-1}(1-\alpha_\nu^2)-\theta_1^{-1}(1-\varepsilon)\Big) {\nu}_{k}^{2} \\ &\quad= -(1-6\gamma_\nu-\theta_1)\parallel\Phi^{k}_{i}\parallel^{2}-(1-6\gamma_\nu-\varepsilon\theta_1^{-1})\parallel q_k\parallel^{2} \\ &\quad \quad -(1-\bar\vartheta-6\bar\vartheta\gamma_\nu)\parallel\tilde{x}_{k}\parallel^{2}-(\gamma_\nu^{-1}(1-\alpha_\nu^2) \\ &\quad\quad -\theta_1^{-1}(1-\varepsilon)) {\nu}_{k}^{2} \\ &\quad\leq -(1-2\theta_1)\parallel\Phi^{k}_{i}\parallel^{2}-(1-\theta_1-\varepsilon\theta_1^{-1})\parallel q_k\parallel^{2} \\ & \quad\quad-(1-\bar\vartheta(1+\theta_1))\parallel\tilde{x}_{k}\parallel^{2}-\Big((\frac{\theta_1}{6})^{-1}(1-\alpha_\nu^2) \\ &\quad \quad-\theta_1^{-1}(1-\varepsilon)\Big) {\nu}_{k}^{2}. \end{split} | (29) |
Therefore, one has
\left\{ \begin{aligned} & 1-2\theta_1>0\\ & 1-\theta_1-\varepsilon\theta_1^{-1}>0\\ & 1-\bar\vartheta(1+\theta_1)>0\\ & \Big(\frac{\theta_1}{6}\Big)^{-1}(1-\alpha_\nu^2)-\theta_1^{-1}(1-\varepsilon) \geq 0 \end{aligned} \right. | (30) |
which yields
\left\{ \begin{aligned} & 0<\theta_1<\frac{1}{2} \\ & \theta_1^{2}-\theta_1<\varepsilon<\theta_1-\theta_1^{2}\\& \bar\vartheta<\frac{1}{1+\theta_1}\\ & 6(1-\alpha_\nu^2)>(1-\varepsilon). \end{aligned} \right. | (31) |
which are exactly the inequalities in (20). This completes the proof.
Remark 4: It should be pointed out that the approximation error
According to the Bellman’s optimality principle, the optimal performance index function
J^{\ast}(x_k) = \min_{u_k}\{\Gamma^{^{T}}L(x_k,u_k)+J^{\ast}(x_{k+1})\} | (32) |
and the corresponding optimal control strategy is given by
u_k^{\ast} = \arg\min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J^{\ast}(x_{k+1})\}. | (33) |
Assume that the minimum on the right-hand side of (32) exists and is unique. Taking the first derivative of the right-hand side, the ideal optimal control
\begin{split} u_k^{\ast} = \;& -\bar u\tanh\Big(\frac{1}{2\bar u}R^{-1}g_{i}^{T}(x_k)\nabla J^{\ast}(x_{k+1})\Big) \\ = \;& -\bar u\tanh\Big(\frac{1}{2\bar u}R^{-1}g_{i}^{T}(x_k)\nabla J^{\ast}(\Gamma_{i}(x_k)+g_{i}(x_k){u}_k\Big). \end{split} | (34) |
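In scalar form, the saturated structure of (34) can be illustrated as follows; the values of ū, R, g, and the costate are hypothetical and merely show that the tanh keeps the control strictly within the bound.

```python
import numpy as np

ubar, R = 1.0, 1.0                   # hypothetical saturation level and weight

def u_star(grad_J_next, g):
    # scalar form of (34): -ubar * tanh((1/(2*ubar)) * R^{-1} * g * dJ/dx);
    # the tanh keeps the control strictly inside (-ubar, ubar)
    return -ubar * np.tanh(0.5 * (1.0 / R) * g * grad_J_next / ubar)

# even for a fairly large costate the control stays inside the bound
assert abs(u_star(grad_J_next=4.0, g=1.0)) < ubar
```

This is precisely why the nonquadratic cost (7) is paired with a tanh-shaped control law: the saturation constraint is satisfied by construction rather than by clipping.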
Since directly solving the HJB equation for nonlinear systems is computationally intensive, the value iteration algorithm, usually referred to as an ADP algorithm, is developed in light of Bellman's principle of optimality. Initializing the value function
\begin{split} u_s(x_k) = \; &\arg\min_{u_k}\{{\Gamma^{T}L(x_k,u_k)+J_s(x_{k+1})}\} \\ = \;& \arg\min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_s(F_{i}(x_k, {u}_k)\} \end{split} | (35) |
and
\begin{split} J_{s+1}(x_{k}) = \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_s(x_{k+1})\} \\ = \;& \Gamma^{T}L(x_k,u_s(x_k))+J_s(F_{i}(x_k, {u}_k) \end{split} | (36) |
where
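The iteration (35)-(36) can be sketched on a toy discretized problem. Everything below, namely the scalar dynamics x_{k+1} = 0.8x_k + u_k, the state and control grids, and the quadratic stage cost standing in for Γᵀ L(x_k, u_k), is a hypothetical illustration rather than the system of this paper.

```python
import numpy as np

xs = np.linspace(-1, 1, 41)          # state grid
us = np.linspace(-0.5, 0.5, 21)      # admissible (bounded) controls
J = np.zeros_like(xs)                # J_0 = 0

for s in range(200):                 # value iteration sweep, as in (35)-(36)
    J_new = np.empty_like(J)
    for ix, x in enumerate(xs):
        best = np.inf
        for u in us:
            x_next = np.clip(0.8 * x + u, -1.0, 1.0)
            # stage cost + previous value function at the successor state
            cand = x * x + u * u + np.interp(x_next, xs, J)
            best = min(best, cand)
        J_new[ix] = best
    J = J_new

# the limit is nonnegative, zero at the origin, and symmetric in x
```

The minimizing `u` at each grid point plays the role of u_s(x_k) in (35), and `J_new` plays the role of J_{s+1} in (36); Lemma 2 corresponds to the observation that, starting from a suitable J_0, the sweeps produce a monotone sequence of value functions.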
Inspired by [17], [41], we further demonstrate the convergence of the developed scheme with the help of a “functional bound” method.
Lemma 2: Consider the sequences
Proof: Let us first prove the case of
\begin{split} J_{2}(x_{k}) = \;& \min_{u_k}\{{\Gamma^{T}L(x_k,u_k)+J_1(x_{k+1})}\}\\ \leq \;& \min_{u_k}\{{\Gamma^{T}L(x_k,u_k)+J_0(x_{k+1})}\}\\ = \; &J_{1}(x_{k}). \end{split} |
Assume that
\begin{split} J_{q+1}(x_{k}) = \; &\min_{u_k}\{{\Gamma^{T}L(x_k,u_k)+J_q(x_{k+1})}\}\\ \leq \;& \min_{u_k}\{{\Gamma^{T}L(x_k,u_k)+J_{q-1}(x_{k+1})}\}\\ = \;& J_{q}(x_{k}). \end{split} |
Therefore, this case is true. Similarly, one can conclude that
Theorem 2: Consider the sequences
\underline{\rho}\{\Gamma^{T}L(x_k,u_k)\}\leq J^{\ast}(x_k)\leq \bar{\rho}\{\Gamma^{T}L(x_k,u_k)\} | (37) |
and
\underline{\varrho}J^{\ast}(x_k)\leq J_0(x_k)\leq \bar{\varrho}J^{\ast}(x_k) | (38) |
hold uniformly, then the iterative value function
\lim_{s\rightarrow\infty}J_s(x_k) = J^{\ast}(x_k). | (39) |
Proof: To verify this result, we will first prove the following assertion by using the mathematical induction method.
Assertion: Case I: For parameters
\begin{split} \Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{{s}}}\Big)\;&J^{\ast}(x_k)\leq J_{s}(x_k)\; \\ & \leq\Big(1+\frac{\bar{\varrho}-1}{(1+\underline{\rho}^{-1})^{{s}}}\Big)J^{\ast}(x_k). \end{split} | (40) |
Case II: For parameters
\begin{split} \Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{{s}}}\Big)\;& J^{\ast}(x_k)\leq J_{s}(x_k)\; \\ & \leq\Big(1+\frac{\bar{\varrho}-1}{(1+\bar{\rho}^{-1})^{{s}}}\Big)J^{\ast}(x_k). \end{split} | (41) |
Case III: For parameters
Due to space limitations, we only prove the left-hand side of inequality (40) in Case I and the right-hand side of inequality (41) in Case II. Furthermore, the proof of Case III is similar to those of the first two cases and hence is omitted.
Obviously, the left-hand side of the inequality (40) in Case I holds for
\begin{split} J_1(x_k) = \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_0(x_{k+1})\}\\ \geq \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+\underline{\varrho}J^{\ast}(x_{k+1})\}\\ \geq \;& \min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\underline{\varrho}J^{\ast}(x_{k+1})\\ & +\frac{(\underline{\varrho}-1)}{1+\bar{\rho}} (\bar{\rho}\Gamma^{T}L(x_k,u_k)-J^{\ast}(x_{k+1})) \Big\}\\ \geq \;& \min_{u_k}\Big\{\Big(1+\frac{\bar{\rho}(\underline{\varrho}-1)}{1+\bar{\rho}}\Big)\Gamma^{T}L(x_k,u_k)\\ &+\Big(\underline{\varrho}-\frac{\underline{\varrho}-1}{1+\bar{\rho}}\Big)J^{\ast}(x_{k+1})\Big\}\\ = \;& \Big(1+\frac{\underline{\varrho}-1}{1+\bar{\rho}^{-1}}\Big)\min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J^{\ast}(x_{k+1})\}\\ = \; &\Big(1+\frac{\underline{\varrho}-1}{1+\bar{\rho}^{-1}}\Big)J^{\ast}(x_k). \end{split}
Furthermore, assume that the conclusion holds for
\Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{{q-1}}}\Big)J^{\ast}(x_k)\leq J_{q-1}(x_k). |
When
\begin{split} J_{q}(x_k) = \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_{q-1}(x_{k+1})\}\\ \geq \;& \min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{q-1}}\Big)J^{\ast}(x_{k+1})\Big\}\\ \geq \;& \min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{q-1}}\Big)J^{\ast}(x_{k+1})\\ &+\frac{(\underline{\varrho}-1) (\bar{\rho}\Gamma^{T}L(x_k,u_k)-J^{\ast}(x_{k+1})) }{(1+\bar{\rho})(1+\bar{\rho}^{-1})^{q-1}}\Big\}\\ = \; &\Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{q}}\Big)\min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J^{\ast}(x_{k+1})\}\\ = \;& \Big(1+\frac{\underline{\varrho}-1}{(1+\bar{\rho}^{-1})^{q}}\Big)J^{\ast}(x_k). \end{split} |
According to the mathematical induction method, the left-hand side of the inequality (40) holds.
In what follows, let us prove the right-hand side of the inequality (41) in Case II. Obviously, it is not difficult to find that
\begin{split} J_1(x_k) = \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_0(x_{k+1})\}\\ \leq \;& \min_{u_k}\{\Gamma^{T}L(x_k,u_k)+\bar{\varrho}J^{\ast}(x_{k+1})\}\\ \leq \;& \min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\bar{\varrho}J^{\ast}(x_{k+1})\\ &+\frac{\bar{\varrho}-1}{1+\bar{\rho}}\big(\bar{\rho}\Gamma^{T}L(x_k,u_k)-J^{\ast}(x_{k+1})\big)\Big\}\\ = \;& \Big(1+\frac{\bar{\varrho}-1}{1+\bar{\rho}^{-1}}\Big)J^{\ast}(x_k) \end{split} |
where the term
Furthermore, assume that the conclusion holds for
J_{q-1}(x_k)\leq \; \Big(1+\frac{\bar{\varrho}-1}{(1+\bar{\rho}^{-1})^{{q-1}}}\Big)J^{\ast}(x_k). |
When
\begin{split} J_{q}(x_k) = \; &\min_{u_k}\{\Gamma^{T}L(x_k,u_k)+J_{q-1}(x_{k+1})\}\\ \leq \; &\min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\Big(1+\frac{\bar{\varrho}-1}{(1+\bar{\rho}^{-1})^{{q-1}}}\Big)J^{\ast}(x_{k+1})\Big\}\\ \leq \;& \min_{u_k}\Big\{\Gamma^{T}L(x_k,u_k)+\Big(1+\frac{\bar{\varrho}-1}{(1+\bar{\rho}^{-1})^{{q-1}}}\Big)J^{\ast}(x_{k+1})\\ &+\frac{(\bar{\varrho}-1)\big(\bar{\rho}\Gamma^{T}L(x_k,u_k)-J^{\ast}(x_{k+1})\big)}{(1+\bar{\rho})(1+\bar{\rho}^{-1})^{q-1}}\Big\}\\ = \;& \Big(1+\frac{\bar{\varrho}-1}{(1+\bar{\rho}^{-1})^{q}}\Big)J^{\ast}(x_k). \end{split}
In light of the mathematical induction method, the right-hand side of the inequality (41) holds.
Combining the above conclusions, we can obtain that this assertion is true. Finally, letting
Remark 5: The above theorem discloses the convergence of the developed ADP scheme with the help of a “functional bound” method, which comes from [17], [41]. The assertion has been stated separately for convenience of presentation. For practical application, a terminal condition (or a fixed size number
Algorithm 1 ADP algorithm
Initialization: Value function and iteration index.
1: while the terminal condition is not satisfied do
2: Solve the control strategy according to (35);
Update the value function according to (36);
Advance the iteration index;
3: end while
4: Output: Control strategy.
Due to the unknown
J^{\ast}(x_{k}) = W_{2c}^{T}\phi_c(W_{1c}^{T}{\textit{z}}_{k})+\theta_c({\textit{z}}_{k}) | (42) |
and
u^{\ast}(x_k) = \phi_{2a}({W}^{T}_{2a}\phi_{1a}(W_{1a}^{T}x_{k}))+\theta_a(x_{k}) | (43) |
with
In order to identify the ideal weight
\hat{J}_{s}(x_{k}) = \hat{W}^{T}_{2c,s}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k}) | (44) |
where
Substituting the above equation into (36), in general one finds
\hat{J}_{s}(x_{k})\not = \Gamma^TL(x_k,u_{s-1}(x_k))+\hat{J}_{s-1}(x_{k+1}) |
that is
\hat{W}^{T}_{2c,s}\phi(W_{1c}^{T}{\textit{z}}_{s,k})\not = \Gamma^TL(x_k,u_{s-1}(x_k))+\hat{J}_{s-1}(x_{k+1}) . | (45) |
Introduce the gap
\begin{split} \Delta{J}_{s}(x_{k}) = \;& \hat{W}^{T}_{2c,s}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})-\hat{J}_{s-1}(x_{k+1}) \\ &-\Gamma^TL(x_k,u_{s-1}(x_k)) \end{split} | (46) |
and then define the cost function
e_{c,s} = \frac{1}{2}\Delta{J}^{2}_{s}(x_{k}). |
Minimizing such a function results in the updating rule of the weights of the critic network
\begin{split} \hat{W}_{2c,s+1} = \; &\hat{W}_{2c,s}-\varepsilon_c\frac{\partial e_{c,s}}{\partial \hat{W}_{2c,s}} \\ = \; &\hat{W}_{2c,s}-\varepsilon_c\frac{\partial e_{c,s}(k)}{\partial \tilde{J}_{s}(x_{k})}\frac{\partial (\Delta{J}_{s}(x_{k}))}{\partial \hat{W}_{2c,s}} \\ = \; &\hat{W}_{2c,s}-\varepsilon_c\phi_c(W_{1c}^{T}z_{s,k})\Delta{J}^{T}_{s}(x_{k}) \end{split} | (47) |
where
In the actor network,
\hat{u}_s(x_k) = \phi_{2a}\left(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})\right). | (48) |
On the other hand, it follows from (34) that the approximated value is also obtained by
u_s(x_k) = \; -\bar u\tanh\left(\frac{1}{2\bar u}R^{-1}g_{i}^{T}(x_k)\nabla \hat{J}^{T}_{s}(x_{k+1})\right). | (49) |
Denote
\begin{split} \Delta{u}_{s}(x_{k}) = \;& \hat{u}_s(x_k)-u_s(x_k) \\ = \;& \phi_{2a}\left(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})\right) \\ &+\bar u\tanh\big(\Upsilon_i\nabla\phi_c^{T}(W_{1c}^{T}{\textit{z}}_{s,k+1})\hat{W}^{T}_{2c,s}\big). \end{split} | (50) |
In what follows, define the cost of this gap
e_{a,s} = \frac{1}{2}\Delta{u}^{T}_{s}(x_{k})\Delta{u}_{s}(x_{k}). |
By employing the gradient descent approach again to minimize
\begin{split} \hat{W}_{2a,s+1} = \;& \hat{W}_{2a,s}-\varepsilon_a\frac{\partial e_{a,s}(k)}{\partial \hat{W}_{2a,s}} \\ = \; &\hat{W}_{2a,s}-\varepsilon_a\frac{\partial e_{a,s}(k)}{\partial \Delta{u}_{s}(x_{k})} \frac{\partial \Delta{u}_{s}(x_{k})}{\partial \hat{u}_s(x_k)} \frac{\partial \hat{u}_s(x_k)}{\partial \hat{W}_{2a,s}} \\ = \;& \hat{W}_{2a,s}-\frac{1}{2}\varepsilon_a\phi_{1a}(W_{1a}^{T}x_k) \\ &\times\Big(1-\phi_{2a}^{T}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})) \\ &\times\phi_{2a}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k}))\Big)\Delta{u}^{T}_{s}(x_{k}) \end{split} | (51) |
where
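A minimal sketch of the two gradient steps (47) and (51) is given below, assuming hypothetical learning rates and a tanh-type output activation φ_{2a} (so that its derivative is 1 − û²); the layer sizes used in the test are likewise invented for illustration.

```python
import numpy as np

eps_c, eps_a = 0.05, 0.05            # hypothetical learning rates

def critic_update(W2c, phi_c, delta_J):
    # (47): W_{2c,s+1} = W_{2c,s} - eps_c * phi_c * (Delta J_s)^T
    return W2c - eps_c * np.outer(phi_c, delta_J)

def actor_update(W2a, phi_1a, u_hat, delta_u):
    # (51): gradient step through a tanh-type output layer, whose
    # elementwise derivative is 1 - u_hat**2
    return W2a - 0.5 * eps_a * np.outer(phi_1a, (1.0 - u_hat * u_hat) * delta_u)
```

Here `delta_J` and `delta_u` play the roles of the gaps Δ J_s(x_k) in (46) and Δ u_s(x_k) in (50), respectively, and the outer products reproduce the rank-one weight corrections of (47) and (51).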
Defining the estimation errors of weight matrices
\tilde{W}_{2c,s} = \hat{W}_{2c,s}-W_{2c},\; \tilde{W}_{2a,s} = \hat{W}_{2a,s}-W_{2a} |
one has
\begin{split} \tilde{W}_{2c,s+1} = \;& \tilde{W}_{2c,s}-\varepsilon_c\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})\Delta{J}^{T}_{s}(x_{k}) \\ = \; &\tilde{W}_{2c,s}-\varepsilon_c\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})\Big( \hat{W}^{T}_{2c,s}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k}) \\ &-\hat{J}_{s-1}(x_{k+1})-\Gamma^TL(x_k,u_{s-1}(x_k))\Big)^T \\ = \;& \tilde{W}_{2c,s}-\varepsilon_c\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})\Big( \tilde{W}^{T}_{2c,s}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k}) \\ &+{W}^{T}_{2c}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k})-\hat{W}^{T}_{2c,s-1}\phi_c(W_{1c}^{T}{\textit{z}}_{s,k+1}) \\ & -\Gamma^TL(x_k,u_{s-1}(x_k))\Big)^T \end{split} | (52) |
and
\begin{split} \tilde{W}_{2a,s+1} = \;& \tilde{W}_{2a,s}-\frac{1}{2}\varepsilon_a\phi_{1a}(W_{1a}^{T}x_k) \\ &\times\Big(1-\phi_{2a}^{T}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})) \\ &\times\phi_{2a}\big(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})\big)\Big)\Delta{u}^{T}_{s}(x_{k}) \\ = \;& \tilde{W}_{2a,s}-\frac{1}{2}\varepsilon_a\phi_{1a}(W_{1a}^{T}x_k) \\ &\times\Big(1-\phi_{2a}^{T}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})) \\ &\times\phi_{2a}\big(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})\big)\Big) \\ &\times\Big(\phi_{2a}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k})) \\ &+\bar u\tanh\big(\Upsilon\nabla\phi_c^{T}(W_{1c}^{T}{\textit{z}}_{s,k+1})\hat{W}^{T}_{2c,s}\big)\Big). \end{split} | (53) |
It is easily seen that the estimation errors of weights in actor and critic networks will inevitably affect the performance of the above ADP algorithm. Thus, it is necessary to prove the boundedness of the critic and actor NN weights.
Theorem 3: Consider the discrete-time Markov jump system (MJS) (5), the critic NN (44) and the actor NN (48). Then, for the fixed time
\; 0<\varepsilon_c\leq \phi_{c,m}^{-2},\;\;0<\varepsilon_a\leq \phi_{1a,m}^{-2}. | (54) |
Proof: In order to show the boundedness, we introduce a Lyapunov function candidate
\begin{split} L_{\tilde{W}_s} = \; &L_{\tilde{W}_{2c},s}+L_{\tilde{W}_{2a},s}\\ = \;& \frac{1}{\alpha_c}{\rm tr}\left\{\tilde{W}^{T}_{2c,s}\tilde{W}_{2c,s}\right\}+\frac{1}{\alpha_a}{\rm tr}\left\{\tilde{W}^{T}_{2a,s}\tilde{W}_{2a,s}\right\}. \end{split} |
The remainder of the proof follows the same line as that in [42] and is therefore omitted; the corresponding learning rates need to satisfy
\begin{split} &0<\varepsilon_c\leq \frac{1}{\parallel \phi_c(W_{1c}^{T}{\textit{z}}_{k})\parallel^{2}}\\ &0<\varepsilon_a\leq \frac{1}{\parallel\phi_{1a}(W_{1a}^{T}x_{k})\parallel^{2}\left(1-\parallel \phi_{2a}(\hat{W}^{T}_{2a,s}\phi_{1a}(W_{1a}^{T}x_{k}))\parallel^{2}\right)}. \end{split} |
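A practical consequence of the state-independent bounds in (54): with tanh activations (as in the simulation section), every component of the activation vector lies in (-1, 1), so its squared norm is below the hidden-layer width, and a learning rate equal to the reciprocal of that width is always admissible. A minimal sketch of this conservative choice (the function name is an assumption):

```python
import numpy as np

def conservative_rates(h_c, h_a):
    """Conservative learning rates for critic/actor hidden widths h_c, h_a.

    With tanh activations, ||phi||^2 < h for a hidden layer of width h,
    so eps = 1/h satisfies the bound (54) uniformly over all inputs.
    """
    return 1.0 / h_c, 1.0 / h_a
```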
Since the excitation functions
Assumption 3: The function
Theorem 4: Let the initial control input be admissible and the initial actor-NN and critic-NN weights be selected from a compact set which includes the ideal weights. The NN weight updating laws (47) and (51) are adopted in an off-line way for the critic network (44) and the actor network (48), and the updating law (50) with (17) is employed in an online way for the identifier (12). Then, the closed-loop system (5) (or (10)) with control law (48) selecting
Proof: In the framework of identifier-based control, taking the control policy (48) into account, the actual closed-loop system can be written as follows:
\begin{split} x_{k+1} = \;& F_{\xi_k}(x_k, \hat{u}_{\infty}(x_k))\\ = \; &F_{\xi_k}(x_k, u^{\ast}(x_k))+ g_{\xi_k}(x_k)(\hat{u}_{\infty}(x_k)- u^{\ast}(x_k))\\ = \;& F_{\xi_k}(x_k, u^{\ast}(x_k)) +g_{\xi_k}(x_k)(\phi_{2a}(\tilde{W}^{T}_{2a}\psi_{k})-\theta_a(x_{k})) \end{split} |
where
\begin{split} &\psi_{k} = \; \phi_{1a}(W_{1a}^{T}x_{k})\\ &\phi_{2a}(\tilde{W}^{T}_{2a}\psi_{k}) = \; \phi_{2a}(\hat{W}^{T}_{2a,\infty} \psi_{k})-\phi_{2a}({W}^{T}_{2a} \psi_{k}). \end{split} |
Obviously, considering the property of activation functions of NNs, one has that
On the other hand, according to the optimal control theory, the policy (43) stabilizes the system (11) (i.e., (10)) on the compact set. With the same approach in [37], it is clear that there exists a constant
{E}\parallel\sum\limits_{j = 1}^{N} p_{ij}F_{i}(x_k, u^{\ast}(x_k))\parallel^{2}\leq H^{\ast}{E} \parallel x_k \parallel^{2}. | (55) |
By virtue of input-to-state stability, or following a similar line to that in [37], one can conclude that the actual closed-loop system is ultimately bounded in the mean-square sense.
Remark 6: In the above subsections, a set of critic and actor networks are designed to approximate the performance index function sequence
Remark 7: In almost all ADP-based suboptimal control issues for nonlinear systems, NNs are widely utilized to approximate the unknown nonlinear dynamics as well as the actor and critic functions. Such a structure, named the actor-critic structure, provides the capability of forward-in-time calculation while avoiding the curse of dimensionality. Inspired by the idea in [43], a tuning parameter
Remark 8: To date, two typical iteration strategies of ADP algorithms are utilized to obtain the desired controller parameters and the associated utility, namely policy iteration (PI) and value iteration (VI). One major difference between the PI and VI strategies is that PI requires an initial admissible control policy that stabilizes the system states [44]. From a mathematical point of view, the initial admissible control can be regarded as a suboptimal control, which requires solving nonlinear partial differential equations (PDEs) analytically. To overcome this shortcoming, a VI-based strategy has been developed in this paper to directly deal with the control issue with input saturation and communication scheduling.
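The VI strategy described above can be illustrated on a toy problem: starting from a zero value function (no admissible initial policy is needed), the value function is updated forward in time until convergence. The scalar plant, cost, grids, and saturation bound below are all assumptions chosen only to keep the sketch short; they are not the paper's example:

```python
import numpy as np

# Toy value iteration: scalar plant x' = 0.8x + u, stage cost x^2 + u^2,
# saturated input |u| <= 0.5, value function stored on a state grid.
xs = np.linspace(-1.0, 1.0, 41)       # state grid
us = np.linspace(-0.5, 0.5, 21)       # saturated input grid
V = np.zeros_like(xs)                  # V_0 = 0: VI needs no admissible policy
for _ in range(500):
    x_next = np.clip(0.8 * xs[:, None] + us[None, :], -1.0, 1.0)
    Q = xs[:, None] ** 2 + us[None, :] ** 2 + np.interp(x_next, xs, V)
    V_new = Q.min(axis=1)              # greedy minimization over the input grid
    if np.max(np.abs(V_new - V)) < 1e-10:
        break                          # value iteration has converged
    V = V_new
```

Because staying at the origin is free, the converged value at x = 0 is zero, and the value grows away from the origin, mirroring the monotone convergence of VI from a zero initial value function.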
In this section, we use a simulation example to show that the proposed suboptimal control is effective for discrete-time nonlinear systems with input saturation under SCPs.
Consider the following nonlinear system:
x_{k+1} = \left[ {\begin{array}{*{20}{c}} -0.5x_{1,k}+0.1x_{2,k} \\ 0.1\sin(x_{1,k})\exp(|x_{2,k}|)+1.2x_{2,k} \\ \end{array}} \right]+\left[ {\begin{array}{*{20}{c}} \bar{u}_{1,k} \\ \bar{u}_{2,k} \end{array}} \right] |
where
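For reference, the example plant can be transcribed directly; here the saturated inputs are passed in as already-clipped values, and the function name is an assumption:

```python
import numpy as np

def plant(x, u_sat):
    """One step of the simulation example's dynamics."""
    x1, x2 = x
    return np.array([
        -0.5 * x1 + 0.1 * x2 + u_sat[0],
        0.1 * np.sin(x1) * np.exp(abs(x2)) + 1.2 * x2 + u_sat[1],
    ])
```

Note that the origin is an equilibrium of the unforced system, and the second state is open-loop unstable (coefficient 1.2 > 1), which is what makes the control design nontrivial.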
In this example, choose three-layer feedforward NNs in the identifier, the critic network and the action network with structures
\begin{split} &\phi_x(\ast) = \; \frac{2(e^{\ast}-e^{-\ast })} {e^{\ast}+e^{-\ast}}\;\;\\ &\phi_{c}(\ast) = \; \phi_{1a}(\ast) = \frac{e^{\ast}-e^{-\ast }} {e^{\ast}+e^{-\ast}}\;\;\\ &\phi_{2a}(\ast) = \; \frac{\bar {u}(e^{\ast}-e^{-\ast })} {e^{\ast}+e^{-\ast}}. \end{split} |
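Since (e^s - e^{-s})/(e^s + e^{-s}) = tanh(s), the three activations above are scaled hyperbolic tangents, which makes the saturation bound of the actor output explicit. A direct transcription, with an assumed bound value:

```python
import numpy as np

u_bar = 1.0                       # saturation level (assumed value)

def phi_x(s):                     # identifier activation: 2 * tanh(s)
    return 2.0 * np.tanh(s)

def phi_c(s):                     # critic / actor hidden layer: tanh(s)
    return np.tanh(s)

phi_1a = phi_c                    # actor hidden layer uses the same activation

def phi_2a(s):                    # actor output layer, bounded by u_bar
    return u_bar * np.tanh(s)
```

The output activation guarantees that the actor NN can never command an input beyond the saturation level, so the constraint is satisfied by construction.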
Thus, the bounds of
By virtue of Theorem 1, we can employ the learning rate
\hat{W}^{0}_{2,i} = \left[ {\begin{array}{*{20}{c}} -0.50, &0.1, &1.0, &0 \\ 0.02, &1.2, &0, &1.0 \end{array}} \right]^{T}, \;\;\; i = 1,2. |
In what follows, we consider the matrices
\hat{W}_{2c,0} = \; \left[ {\begin{array}{*{20}{c}} 1.00, & 1.05 \end{array}} \right]^{T},\;\;\\ \hat{W}_{2a,0} = \; \left[ {\begin{array}{*{20}{c}} -0.04, & -1.16 \\ -0.01, & -0.134 \end{array}} \right]^{T}. |
In addition, the weight matrices
{W}_{1a} = 0.2I,\;\; {W}_{1c} = \left[ {\begin{array}{*{20}{c}} 2, & 0, & 0.01, & 0 \\ 0, & 2.5, & 0, & 0.01 \end{array}} \right]^{T}. |
Training of the weight matrices of the critic-actor networks is performed at instant
The simulation results are presented in Figs. 3-7. The state trajectories
In this paper, we have developed a suboptimal control strategy in the framework of ADP for a class of unknown nonlinear discrete-time systems subject to input constraints. An identifier with a robust term, based on a three-layer neural network whose weight updates rely on protocol-induced jumps, has been established to approximate the nonlinear system, and the corresponding stability analysis has been provided. Then, a value-iteration ADP algorithm has been developed to solve the suboptimal control problem, and the convergence of the iterative algorithm, as well as the boundedness of the estimation errors of the critic and actor NN weights, has been analyzed. Furthermore, an actor-critic NN scheme has been developed to approximate the control law and the proposed performance index function, and the stability of the closed-loop system has been discussed. Finally, a numerical simulation has been utilized to demonstrate the effectiveness of the proposed control scheme.
[1] M. Mazouchi, M. B. N. Sistani, and S. K. H. Sani, “A novel distributed optimal adaptive control algorithm for nonlinear multi-agent differential graphical games,” IEEE/CAA J. Autom. Sinica, vol. 5, no. 1, pp. 331–341, Jan. 2018.
[2] Y. J. Liu, L. Tang, S. Tong, C. L. Chen, and D. J. Li, “Reinforcement learning design-based adaptive tracking control with less learning parameters for nonlinear discrete-time MIMO systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 1, pp. 165–176, Jan. 2015.
[3] R. Song and L. Zhu, “Optimal fixed-point tracking control for discrete-time nonlinear systems via ADP,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 3, pp. 657–666, May 2019.
[4] L. Sun and Z. Zheng, “Disturbance-observer-based robust backstepping attitude stabilization of spacecraft under input saturation and measurement uncertainty,” IEEE Trans. Ind. Electron., vol. 64, no. 10, pp. 7994–8002, 2017.
[5] D. Wang, H. He, X. Zhong, and D. Liu, “Event-driven nonlinear discounted optimal regulation involving a power system application,” IEEE Trans. Ind. Electron., vol. 64, no. 10, pp. 8177–8186, 2017.
[6] H. Li, Y. Wu, and M. Chen, “Adaptive fault-tolerant tracking control for discrete-time multi-agent systems via reinforcement learning algorithm,” IEEE Trans. Cybern., to be published. DOI: 10.1109/TCYB.2020.2982168.
[7] T. Wang, H. Gao, and J. Qiu, “A combined adaptive neural network and nonlinear model predictive control for multirate networked industrial process control,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 2, pp. 416–425, 2016.
[8] H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, 2009.
[9] Z. Shi and Z. Wang, “Optimal control for a class of complex singular system based on adaptive dynamic programming,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 1, pp. 188–197, Jan. 2019.
[10] R. Song, Q. Wei, H. Zhang, and F. L. Lewis, “Discrete-time non-zero-sum games with completely unknown dynamics,” IEEE Trans. Cybern., vol. 99, pp. 1–15, 2019. doi: 10.1109/TCYB.2019.2957406.
[11] Q. Wei and D. Liu, “Data-driven neuro-optimal temperature control of water–gas shift reaction using stable iterative adaptive dynamic programming,” IEEE Trans. Ind. Electron., vol. 61, no. 11, pp. 6399–6408, 2014.
[12] P. J. Werbos, “Foreword-ADP: The key direction for future research in intelligent control and understanding brain intelligence,” IEEE Trans. Syst., Man, Cybern. B, vol. 38, pp. 898–900, 2008.
[13] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA, USA: Athena Scientific, 1996.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[15] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Trans. Syst., Man, Cybern. B, vol. 38, no. 4, pp. 943–949, 2008.
[16] A. Heydari, “Stability analysis of optimal adaptive control under value iteration using a stabilizing initial policy,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 9, pp. 4522–4527, Sept. 2018.
[17] Q. Wei, D. Liu, and H. Lin, “Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems,” IEEE Trans. Cybern., vol. 46, pp. 840–853, 2016.
[18] W. B. Powell, Approximate Dynamic Programming. Hoboken, NJ, USA: Wiley, 2007.
[19] D. V. Prokhorov and D. C. Wunsch, “Adaptive critic designs,” IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, 1997.
[20] X. Zhong, N. Zhen, and H. He, “A theoretical foundation of goal representation heuristic dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 12, pp. 2513–2525, 2017.
[21] Y. Yuan, Z. Wang, P. Zhang, and H. Liu, “Near-optimal resilient control strategy design for state-saturated networked systems under stochastic communication protocol,” IEEE Trans. Cybern., vol. 49, no. 8, pp. 1–13, 2018.
[22] M. Abu-Khalaf and F. L. Lewis, “Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach,” Automatica, vol. 41, no. 5, pp. 779–791, 2005.
[23] X. Yang and B. Zhao, “Optimal neuro-control strategy for nonlinear systems with asymmetric input constraints,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 2, pp. 575–583, Mar. 2020.
[24] D. Liu, X. Yang, D. Wang, and Q. Wei, “Reinforcement-learning-based robust controller design for continuous-time uncertain nonlinear systems subject to input constraints,” IEEE Trans. Cybern., vol. 45, no. 7, pp. 1372–1385, Jul. 2015.
[25] Y. J. Liu, S. Li, S. Tong, and C. L. P. Chen, “Neural approximation-based adaptive control for a class of nonlinear nonstrict feedback discrete-time systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 7, pp. 1531–1541, Jul. 2017.
[26] H. Xu, Q. Zhao, and S. Jagannathan, “Finite-horizon near-optimal output feedback neural network control of quantized nonlinear discrete-time systems with input constraint,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 8, pp. 1776–1788, Aug. 2015.
[27] Y. Zhu, D. Zhao, H. He, and J. Ji, “Event-triggered optimal control for partially unknown constrained-input systems via adaptive dynamic programming,” IEEE Trans. Ind. Electron., vol. 64, no. 5, pp. 4101–4109, 2017.
[28] H. Modares, F. L. Lewis, and M. Naghibi-Sistani, “Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 10, pp. 1513–1525, 2013.
[29] D. Ding, Q. L. Han, X. Ge, and J. Wang, “Secure state estimation and control of cyber-physical systems: A survey,” IEEE Trans. Syst., Man, Cybern.: Syst., to be published. DOI: 10.1109/TSMC.2020.3041121.
[30] V. Ugrinovskii and E. Fridman, “A round-robin type protocol for distributed estimation with
[31] G. Walsh, H. Ye, and L. Bushnell, “Stability analysis of networked control systems,” IEEE Trans. Control Syst. Tech., vol. 10, no. 3, pp. 438–446, 2002.
[32] L. Zou, Z. Wang, and H. Gao, “Observer-based
[33] H. Ma, H. Li, R. Lu, and T. Huang, “Adaptive event-triggered control for a class of nonlinear systems with periodic disturbances,” Sci. China Inf. Sci., vol. 63, no. 5, pp. 157–171, 2020.
[34] Z. Wang, Q. Wei, and D. Liu, “Event-triggered adaptive dynamic programming for discrete-time multi-player games,” Inf. Sci., vol. 506, pp. 457–470, Jan. 2020.
[35] D. Ding, Z. Wang, and Q. L. Han, “Neural-network-based consensus control for multi-agent systems with input constraints: The event-triggered case,” IEEE Trans. Cybern., vol. 50, no. 8, pp. 1–12, 2019.
[36] X. Zhong, H. He, H. Zhang, and Z. Wang, “Optimal control for unknown discrete-time nonlinear Markov jump systems using adaptive dynamic programming,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 12, pp. 2141–2155, 2014.
[37] D. Ding, Z. Wang, and Q. L. Han, “Neural-network-based output-feedback control with stochastic communication protocols,” Automatica, vol. 106, pp. 221–229, Aug. 2019.
[38] N. Azevedo, D. Pinheiro, and G.-W. Weber, “Dynamic programming for a Markov-switching jump-diffusion,” J. Comput. Appl. Math., vol. 267, no. 6, pp. 1–19, Sep. 2014.
[39] M. C. F. Donkers, W. P. M. H. Heemels, D. Bernardini, A. Bemporad, and V. Shneer, “Stability analysis of stochastic networked control systems,” Automatica, vol. 48, no. 4, pp. 917–925, 2012.
[40] T. Dierks, B. T. Thumati, and S. Jagannathan, “Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence,” Neural Networks, vol. 22, no. 5–6, pp. 851–860, 2009.
[41] B. Lincoln and A. Rantzer, “Relaxing dynamic programming,” IEEE Trans. Autom. Control, vol. 51, no. 8, pp. 1249–1260, Aug. 2006.
[42] J. Song, Y. Niu, and Y. Zou, “Convergence analysis for an identifier-based adaptive dynamic programming algorithm,” in Proc. 34th Chinese Control Conf., 2015.
[43] D. Liu, D. Wang, and X. Yang, “An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs,” Inf. Sci., vol. 220, no. 1, pp. 331–342, 2013.
[44] D. Liu and Q. Wei, “Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 3, pp. 621–634, 2014.