Foreign-Language Article and Translation: A Survey of Adaptive Dynamic Programming

Original article: Adaptive Dynamic Programming: An Introduction

Abstract: In this article, we introduce some recent research trends within the field of adaptive/approximate dynamic programming (ADP), including the variations on the structure of ADP schemes, the development of ADP algorithms, and applications of ADP schemes. For ADP algorithms, the point of focus is that iterative algorithms of ADP can be sorted into two classes: one class is the iterative algorithm with an initial stable policy; the other is the one without the requirement of an initial stable policy. It is generally believed that the latter requires less computation, at the cost of losing the guarantee of system stability during the iteration process. In addition, many recent papers have provided convergence analysis associated with the algorithms developed. Furthermore, we point out some topics for future studies.

Introduction

As is well known, there are many methods for designing stable control for nonlinear systems. However, stability is only a bare minimum requirement in a system design. Ensuring optimality guarantees the stability of the nonlinear system. Dynamic programming is a very useful tool in solving optimization and optimal control problems by employing the principle of optimality. In [16], the principle of optimality is expressed as: "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." Dynamic programming can be considered along several spectra: one can consider discrete-time systems or continuous-time systems, linear systems or nonlinear systems, time-invariant systems or time-varying systems, deterministic systems or stochastic systems, etc.

We first take a look at nonlinear discrete-time (time-varying) dynamical (deterministic) systems. Time-varying nonlinear systems cover most of the application areas, and discrete time is the basic consideration for digital computation. Suppose that one is given a discrete-time nonlinear (time-varying) dynamical system

$$x(k+1) = F[x(k), u(k), k], \quad k = 0, 1, \ldots \tag{1}$$

where $x \in \mathbb{R}^n$ represents the state vector of the system, $u \in \mathbb{R}^m$ denotes the control action, and $F$ is the system function. Suppose that one associates with this system the performance index (or cost)

$$J[x(i), i] = \sum_{k=i}^{\infty} \gamma^{k-i}\, U[x(k), u(k), k] \tag{2}$$

where $U$ is called the utility function and $\gamma$ is the discount factor with $0 < \gamma \le 1$. Note that the function $J$ is dependent on the initial time $i$ and the initial state $x(i)$, and it is referred to as the cost-to-go of state $x(i)$. The objective of the dynamic programming problem is to choose a control sequence $u(k)$, $k = i, i+1, \ldots$, so that the function $J$ (i.e., the cost) in (2) is minimized. According to Bellman, the optimal cost from time $k$ is equal to

$$J^*(x(k)) = \min_{u(k)} \left\{ U(x(k), u(k)) + \gamma J^*(x(k+1)) \right\} \tag{3}$$

The optimal control $u^*(k)$ at time $k$ is the $u(k)$ which achieves this minimum, i.e.,

$$u^*(k) = \arg\min_{u(k)} \left\{ U(x(k), u(k)) + \gamma J^*(x(k+1)) \right\} \tag{4}$$

Equation (3) is the principle of optimality for discrete-time systems. Its importance lies in the fact that it allows one to optimize over only one control vector at a time by working backward in time.

In the nonlinear continuous-time case, the system can be described by

$$\dot{x}(t) = F[x(t), u(t), t], \quad t \ge t_0 \tag{5}$$

The cost in this case is defined as

$$J(x(t)) = \int_{t}^{\infty} U(x(\tau), u(\tau))\, d\tau \tag{6}$$

For continuous-time systems, Bellman's principle of optimality can be applied, too. The optimal cost $J^*(x_0) = \min J(x_0, u(t))$ will satisfy the Hamilton-Jacobi-Bellman equation

$$-\frac{\partial J^*(x(t))}{\partial t} = \min_{u(t)} \left\{ U(x(t), u(t), t) + \left( \frac{\partial J^*(x(t))}{\partial x(t)} \right)^{T} F(x(t), u(t), t) \right\} \tag{7}$$

Equations (3) and (7) are called the optimality equations of dynamic programming, which are the basis for implementation of dynamic programming. In the above, if the function $F$ in (1) or (5) and the cost function $J$ in (2) or (6) are known, the solution of $u(k)$ becomes a simple optimization problem. If the system is modeled by linear dynamics and the cost function to be minimized is quadratic in the state and control, then the optimal control is a linear feedback of the states, where the gains are obtained by solving a standard Riccati equation [47]. On the other hand, if the system is modeled by nonlinear dynamics or the cost function is nonquadratic, the optimal state feedback control will depend upon solutions to the Hamilton-Jacobi-Bellman (HJB) equation [48], which is generally a nonlinear partial differential equation or difference equation. However, it is often computationally untenable to run true dynamic programming due to the backward numerical process required for its solutions, i.e., as a result of the well-known "curse of dimensionality" [16], [28]. In [69], three curses are displayed in resource management and control problems to show that the cost function $J$, which is the theoretical solution of the Hamilton-Jacobi-Bellman equation, is very difficult to obtain, except for systems satisfying some very favorable conditions. Over the years, progress has been made to circumvent the "curse of dimensionality" by building a system, called the "critic", to approximate the cost function in dynamic programming (cf. [10], [60], [61], [63], [70], [78], [92], [94], [95]). The idea is to approximate dynamic programming solutions by using a function approximation structure such as neural networks to approximate the cost function.

The Basic Structures of ADP

In recent years, adaptive/approximate dynamic programming (ADP) has gained much attention from many researchers in order to obtain approximate solutions of the HJB equation, cf. [2], [3], [5], [8], [11]–[13], [21], [22], [25], [30], [31], [34], [35], [40], [46], [49], [52], [54], [55], [63], [70], [76], [80], [83], [95], [96], [99], [100]. In 1977, Werbos [91] introduced an approach for ADP that was later called adaptive critic designs (ACDs). ACDs were proposed in [91], [94], [97] as a way of solving dynamic programming problems forward in time. In the literature, there are several synonyms used for "Adaptive Critic Designs" [10], [24], [39], [43], [54], [70], [71], [87], including "Approximate Dynamic Programming" [69], [82], [95], "Asymptotic Dynamic Programming" [75], "Adaptive Dynamic Programming" [63], [64], "Heuristic Dynamic Programming" [46], [93], "Neuro-Dynamic Programming" [17], "Neural Dynamic Programming" [82], [101], and "Reinforcement Learning" [84].

Bertsekas and Tsitsiklis gave an overview of neuro-dynamic programming in their book [17]. They provided the background, gave a detailed introduction to dynamic programming, discussed the neural network architectures and methods for training them, and developed general convergence theorems for stochastic approximation methods as the foundation for analysis of various neuro-dynamic programming algorithms. They provided the core neuro-dynamic programming methodology, including many mathematical results and methodological insights. They suggested many useful methodologies for applications of neuro-dynamic programming, such as Monte Carlo simulation, on-line and off-line temporal difference methods, the Q-learning algorithm, optimistic policy iteration methods, Bellman error methods, approximate linear programming, approximate dynamic programming with cost-to-go function, etc. A particularly impressive success that greatly motivated subsequent research was the development of a backgammon playing program by Tesauro [85]. Here a neural network was trained to approximate the optimal cost-to-go function of the game of backgammon by using simulation, that is, by letting the program play against itself. Unlike chess programs, this program did not use lookahead of many steps, so its success can be attributed primarily to the use of a properly trained approximation of the optimal cost-to-go function.

To implement the ADP algorithm, Werbos [95] proposed a means to get around this numerical complexity by using "approximate dynamic programming" formulations. His methods approximate the original problem with a discrete formulation. The solution to the ADP formulation is obtained through a neural network based adaptive critic approach. The main idea of ADP is shown in Fig. 1.

He proposed two basic versions, which are heuristic dynamic programming (HDP) and dual heuristic programming (DHP). HDP is the most basic and widely applied structure of ADP [13], [38], [72], [79], [90], [93], [104], [106]. The structure of HDP is shown in Fig. 2. HDP is a method for estimating the cost function. Estimating the cost function for a given policy only requires samples from the instantaneous utility function $U$, while models of the environment and the instantaneous reward are needed to find the cost function corresponding to the optimal policy.

In HDP, the output of the critic network is $\hat{J}$, which is the estimate of $J$ in equation (2). This is done by minimizing the following error measure over time

$$\|E_h\| = \sum_{k} E_h(k) = \frac{1}{2} \sum_{k} \left[ \hat{J}(k) - U(k) - \gamma \hat{J}(k+1) \right]^2 \tag{8}$$

where $\hat{J}(k) = \hat{J}[x(k), u(k), k, W_C]$ and $W_C$ represents the parameters of the critic network. When $E_h = 0$ for all $k$, (8) implies that

$$\hat{J}(k) = U(k) + \gamma \hat{J}(k+1) = \sum_{i=k}^{\infty} \gamma^{i-k}\, U(i) \tag{9}$$

Dual heuristic programming is a method for estimating the gradient of the cost function, rather than $J$ itself. To do this, a function is needed to describe the gradient of the instantaneous cost function with respect to the state of the system. In the DHP structure, the action network remains the same as the one for HDP, but the second network, which is called the critic network, has the costate as its output and the state variables as its inputs. The critic network's training is more complicated than that in HDP since we need to take into account all relevant pathways of backpropagation. This is done by minimizing the following error measure over time

$$\|E_h\| = \sum_{k} E_h(k) = \frac{1}{2} \sum_{k} \left[ \frac{\partial \hat{J}(k)}{\partial x(k)} - \frac{\partial U(k)}{\partial x(k)} - \gamma \frac{\partial \hat{J}(k+1)}{\partial x(k)} \right]^2 \tag{10}$$

where $\partial \hat{J}(k) / \partial x(k) = \partial \hat{J}[x(k), u(k), k, W_C] / \partial x(k)$ and $W_C$ represents the parameters of
the critic network. When $E_h = 0$ for all $k$, (10) implies that

$$\frac{\partial \hat{J}(k)}{\partial x(k)} = \frac{\partial}{\partial x(k)} \left( U(k) + \gamma \hat{J}(k+1) \right) \tag{11}$$

2. Theoretical Developments

In [82], Si et al. summarize the cross-disciplinary theoretical developments of ADP, give an overview of DP and ADP, and discuss their relations to artificial intelligence, approximation theory, control theory, operations research, and statistics. In [69], Powell shows how ADP, when coupled with mathematical programming, can solve (approximately) deterministic or stochastic optimization problems that are far larger than anything that could be solved using existing techniques, and points out directions for improving ADP.

In [95], Werbos further gave two other versions called "action-dependent critics," namely ADHDP (also known as Q-learning [89]) and ADDHP. In these two ADP structures, the control is also an input of the critic networks. In 1997, Prokhorov and Wunsch [70] presented more algorithms according to ACDs. They discussed the design families of HDP, DHP, and globalized dual heuristic programming (GDHP). They suggested some new improvements to the original GDHP design, which promise to be useful for many engineering applications in the areas of optimization and optimal control. Based on one of these modifications, they present a unified approach to all ACDs. This leads to a generalized training procedure for ACDs. In [26], a realization of ADHDP was suggested: a least squares support vector machine (SVM) regressor is used for generating the control actions, while an SVM-based tree-type neural network (NN) is used as the critic.

The GDHP or ADGDHP structure minimizes the error with respect to both the cost and its derivatives. While it is more complex to do this simultaneously, the resulting behavior is expected to be superior. In [102], GDHP serves as a reconfigurable controller to deal with both abrupt and incipient changes in the plant dynamics due to faults. A novel fault tolerant control (FTC) supervisor is combined with GDHP for the purpose of improving the performance of GDHP for fault tolerant control. When the plant is affected by a known abrupt fault, the new initial conditions of GDHP are loaded from a dynamic model bank (DMB). On the other hand, if the fault is incipient, the reconfigurable controller maintains performance by continuously modifying itself without supervisor intervention. It is noted that the training of
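three networks used to implement the GDHP is in an online fashion by utilizing two distinct networks to implement the critic. The first critic network is trained at every iteration, while the second one is updated with a copy of the first one at a given period of iterations.

All the ADP structures can realize the same function, that is, to obtain the optimal control policy, while the computation precision and running time differ from one structure to another. Generally speaking, the computation burden of HDP is low but the computation precision is also low, while GDHP has better precision but the computation process takes longer; a detailed comparison can be found in [70].

In [30], [33] and [83], the schematic of direct heuristic dynamic programming is developed. Using the approach of [83], the model network in Fig. 1 is not needed anymore. Reference [101] makes significant contributions to model-free adaptive critic designs. Several practical examples are included in [101] for demonstration, including a single inverted pendulum and a triple inverted pendulum. A reinforcement learning-based controller design for nonlinear discrete-time systems with input constraints is presented in [36], where the nonlinear tracking control is implemented with filtered tracking error using direct HDP designs. For similar work, see also [37]. Reference [54] is also about model-free adaptive critic designs. Two approaches for the training of the critic network are provided in [54]: a forward-in-time approach and a backward-in-time approach. Fig. 4 shows the diagram of the forward-in-time approach. In this approach, we view $\hat{J}(k)$ in (8) as the output of the critic network to be trained and choose $U(k) + \gamma \hat{J}(k+1)$ as the training target. Note that $\hat{J}(k)$ and $\hat{J}(k+1)$ are obtained using state variables at different time instances. Fig. 5 shows the diagram of the backward-in-time approach. In this approach, we view $\hat{J}(k+1)$ in (8) as the output of the critic network to be trained and choose $(\hat{J}(k) - U(k))/\gamma$ as the training target. The training approach of [101] can be considered as a backward-in-time approach. In Fig. 4 and Fig. 5, $x(k+1)$ is the output of the model network.

An improvement and modification to the two-network architecture, called the "single network adaptive critic (SNAC)", was presented in [65], [66]. This approach eliminates the action network. As a consequence, the SNAC architecture offers three potential advantages: a simpler architecture, a lesser computational load (about half of that of the dual network algorithms), and no approximation error from the action network, since that network is eliminated. The SNAC approach is applicable to a wide class of nonlinear systems where the optimal control (stationary) equation can be explicitly expressed in terms of the state and the costate variables. Most of the problems in aerospace, automobile, robotics, and other engineering disciplines can be characterized by nonlinear control-affine equations that yield such a relation. SNAC-based controllers yield excellent tracking performance in applications to microelectronic mechanical systems, chemical reactors, and high-speed reentry problems. Padhi et al. [65] have proved that for linear systems (where the mapping between the costate at stage $k+1$ and the state at stage $k$ is linear), the solution obtained by the algorithm based on the SNAC structure converges to the solution of the discrete Riccati equation.

Translation: A Survey of Adaptive Dynamic Programming

Abstract: Adaptive dynamic programming (ADP) is a newly emerging approximate optimal method in the field of optimal control and a current research focus in the international optimization community. ADP uses a function approximation structure to approximate the solution of the Hamilton-Jacobi-Bellman (HJB) equation and obtains a near-optimal control policy for the system by offline iteration or online updating, so that optimal control problems for nonlinear systems can be solved effectively. This paper introduces ADP from three aspects: variations in the structure of ADP, the development of its algorithms, and its applications. It summarizes current research results on ADP and gives an outlook on the problems that remain to be solved and on future directions in this research area.

Keywords: adaptive dynamic programming, neural networks, nonlinear systems, stability

Introduction

Dynamic systems are ubiquitous in nature. Stability analysis of dynamic systems has long been a research focus, and a series of methods have been proposed. However, control researchers often require optimality in addition to guaranteed stability of the control system. In the 1950s and 1960s, driven by the development of space technology and the practical use of digital computers, the optimization theory of dynamic systems developed rapidly and formed an important branch: optimal control. It has found increasingly wide application in space technology, systems engineering, economic management and decision-making, population control, the optimization of multistage process equipment, and many other fields. In 1957 Bellman proposed an effective tool for solving optimal control problems: the dynamic programming (DP) method [1]. The core of this method is Bellman's principle of optimality: the optimal policy of a multistage decision process has the property that, whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the initial decision. This principle can be reduced to a basic recursive formula; when solving a multistage decision problem, one proceeds backward from the final stage to the initial stage. The principle applies very broadly, for example to discrete systems, continuous systems, linear systems, nonlinear systems, deterministic systems, and stochastic systems.

The basic principle of the DP method is explained below for the discrete and continuous cases in turn. First consider a discrete nonlinear system. Suppose the dynamic equation of the system is

$$x(k+1) = F(x(k), u(k), k)$$

where $x(k)$ is the state vector of the system and $u(k)$ is the control input vector. The corresponding cost function (or performance index function) has the form

$$J(x(k), k) = \sum_{i=k}^{\infty} \gamma^{i-k}\, l(x(i), u(i), i)$$

where the initial state $x(k) = x_k$ is given, $l(x(k), u(k), k)$ is the utility function, and $\gamma$ is the discount factor satisfying $0 < \gamma \le 1$. [...]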
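[...] obtained by solving a standard Riccati equation. If the system is nonlinear, or the cost function is not quadratic in the state and the control input, the optimal control policy must be obtained by solving the HJB equation. However, solving the HJB equation, a partial differential equation, is very difficult. In addition, the DP method has an obvious weakness: as the dimensions of $x$ and $u$ increase, the computation and storage requirements grow dramatically, which is the so-called "curse of dimensionality" [1-2]. To overcome these weaknesses, Werbos first proposed the framework of adaptive dynamic programming (ADP) [3]. Its main idea is to use a function approximation structure (for example neural networks, fuzzy models, or polynomials) to estimate the cost function, so that the DP problem can be solved forward in time.

In recent years, ADP has attracted wide attention and has given rise to a series of synonyms, for example: adaptive critic designs [4-7], heuristic dynamic programming [8-9], neuro-dynamic programming [10-11], adaptive dynamic programming [12], and reinforcement learning [13]. At the "2006 NSF Workshop and Outreach Tutorials on Approximate Dynamic Programming", organized by the US National Science Foundation in 2006, it was suggested that the method be referred to collectively as "Adaptive/Approximate dynamic programming". Bertsekas et al. summarized neuro-dynamic programming in [10-11], introduced dynamic programming and the structures and training algorithms of neural networks in detail, and proposed many effective methods for applying neuro-dynamic programming. Si et al. summarized the cross-disciplinary development of ADP and discussed the relations of DP and ADP to artificial intelligence, approximation theory, control theory, operations research, and statistics [14]. In [15], Powell showed how to use ADP to solve deterministic or stochastic optimization problems and pointed out directions for the development of ADP. In [16], Balakrishnan et al. summarized earlier methods for designing feedback controllers for dynamic systems with ADP, covering both the model-based and the model-free cases. Reference [17] introduced ADP from the perspective of whether or not initial stability is required. Building on our own research results and on earlier studies, this paper surveys the latest progress of ADP.

Structural Development of ADP

To implement ADP, Werbos proposed two basic structures: heuristic dynamic programming (HDP) and dual heuristic programming (DHP), whose structures are shown in Fig. 1 and Fig. 2 [4]. HDP is the most basic and most widely applied structure of ADP. Its purpose is to estimate the cost function of the system, and it generally uses three networks: a critic network, an action network, and a model network. The output of the critic network estimates the cost function $J(x(k))$; the action network maps the state variables to the control input; the model network estimates the system state at the next time step. DHP, in contrast, estimates the gradient of the cost function. The action network and model network of DHP are defined as in HDP, while the output of its critic network is the gradient of the cost function.

HDP is the most basic and widely applied structure of ADP [13], [38], [72], [79], [90], [93], [104], [106]. The structure of HDP is shown in Fig. 2. HDP is a method for estimating the cost function. Estimating the cost function for a given policy only requires samples from the instantaneous utility function $U$, while models of the environment and the instantaneous reward are needed to find the cost function corresponding to the optimal policy. In HDP, the output of the critic network is $\hat{J}$, which is the estimate of $J$ in equation (2). This is done by minimizing the following error measure over time

$$\|E_h\| = \sum_{k} E_h(k) = \frac{1}{2} \sum_{k} \left[ \hat{J}(k) - U(k) - \gamma \hat{J}(k+1) \right]^2 \tag{8}$$

where $\hat{J}(k) = \hat{J}[x(k), u(k), k, W_C]$ and $W_C$ represents the parameters of the critic network. When $E_h = 0$ for all $k$, (8) implies that

$$\hat{J}(k) = U(k) + \gamma \hat{J}(k+1) = \sum_{i=k}^{\infty} \gamma^{i-k}\, U(i) \tag{9}$$

Dual heuristic programming is a method for estimating the gradient of the cost function, rather than $J$ itself. To do this, a function is needed to describe the gradient of the instantaneous cost function with respect to the state of the system. In the DHP structure, the action network remains the same as the one for HDP, but the second network, called the critic network, has the costate as its output and the state variables as its inputs. The training of the critic network is more complicated than in HDP, because all relevant pathways of backpropagation must be taken into account. This is done by minimizing the following error measure over time

$$\|E_h\| = \sum_{k} E_h(k) = \frac{1}{2} \sum_{k} \left[ \frac{\partial \hat{J}(k)}{\partial x(k)} - \frac{\partial U(k)}{\partial x(k)} - \gamma \frac{\partial \hat{J}(k+1)}{\partial x(k)} \right]^2 \tag{10}$$

where $\partial \hat{J}(k) / \partial x(k) = \partial \hat{J}[x(k), u(k), k, W_C] / \partial x(k)$ and $W_C$ represents the parameters of the critic network. When $E_h = 0$ for all $k$, (10) implies that

$$\frac{\partial \hat{J}(k)}{\partial x(k)} = \frac{\partial}{\partial x(k)} \left( U(k) + \gamma \hat{J}(k+1) \right) \tag{11}$$

2. Theoretical Developments

In [82], Si et al. summarize the cross-disciplinary theoretical developments of ADP, give an overview of DP and ADP, and discuss their relations to artificial intelligence, approximation theory, control theory, operations research, and statistics. In [69], Powell shows that ADP, when coupled with mathematical programming, can solve (approximately) deterministic or stochastic optimization problems far larger than anything that could be solved with existing techniques, and points out directions for improving ADP.

In [95], Werbos further gave two other versions called "action-dependent critics", namely ADHDP (also known as Q-learning [89]) and ADDHP. In these two ADP structures, the control is also an input of the critic networks. In 1997, Prokhorov and Wunsch [70] presented more algorithms according to ACDs. They discussed the design families of HDP, DHP, and globalized dual heuristic programming (GDHP). They suggested some new improvements to the original GDHP design, which promise to be useful for many engineering applications in the areas of optimization and optimal control. Based on one of these modifications, they present a unified approach to all ACDs. This leads to a generalized training procedure for ACDs. In [26], a realization of ADHDP was suggested: a least squares support vector machine (SVM) regressor is used for generating the control actions, while an SVM-based tree-type neural network (NN) is used as the critic.

The GDHP or ADGDHP structure minimizes the error with respect to both the cost and its derivatives. While it is more complex to do this simultaneously, the resulting behavior is expected to be superior. In [102], GDHP serves as a reconfigurable controller to deal with both abrupt and incipient changes in the plant dynamics due to faults. A novel fault tolerant control (FTC) supervisor is combined with GDHP for the purpose of improving the performance of GDHP for fault tolerant control. When the plant is affected by a known abrupt fault, the new initial conditions of GDHP are loaded from a dynamic model bank (DMB). On the other hand, if the fault is incipient, the reconfigurable controller maintains performance by continuously modifying itself without supervisor intervention. It is noted that the three networks used to implement the GDHP are trained online, with two distinct networks implementing the critic. The first critic network is trained at every iteration, while the second one is updated with a copy of the first one at a given period of iterations.

All the ADP structures can realize the same function, that is, to obtain the optimal control policy, while the computation precision and running time differ from one structure to another. Generally speaking, the computation burden of HDP is low but the computation precision is also low, while GDHP has better precision but the computation process takes longer; a detailed comparison can be found in [70].

In [30], [33] and [83], the schematic of direct heuristic dynamic programming is developed. Using the approach of [83], the model network in Fig. 1 is not needed anymore. Reference [101] makes significant contributions to model-free adaptive critic designs. Several practical examples are included in [101] for demonstration, including a single inverted pendulum and a triple inverted pendulum. A reinforcement learning-based controller design for nonlinear discrete-time systems with input constraints is presented in [36], where the nonlinear tracking control is implemented with filtered tracking error using direct HDP designs. For similar work, see also [37]. Reference [54] is also about model-free adaptive critic designs. Two approaches for the training of the critic network are provided in [54]: a forward-in-time approach and a backward-in-time approach. Fig. 4 shows the diagram of the forward-in-time approach. In this approach, we view $\hat{J}(k)$ in (8) as the output of the critic network to be trained and choose $U(k) + \gamma \hat{J}(k+1)$ as the training target. Note that $\hat{J}(k)$ and $\hat{J}(k+1)$ are obtained using state variables at different time instances. Fig. 5 shows the diagram of the backward-in-time approach. In this approach, we view $\hat{J}(k+1)$ in (8) as the output of the critic network to be trained and choose $(\hat{J}(k) - U(k))/\gamma$ as the training target. The training approach of [101] can be considered as a backward-in-time approach. In Fig. 4 and Fig. 5, $x(k+1)$ is the output of the model network.

An improvement and modification to the two-network architecture, called the "single network adaptive critic (SNAC)", was presented in [65], [66]. This approach eliminates the action network. As a consequence, the SNAC architecture offers three potential advantages: a simpler architecture, a lesser computational load (about half of that of the dual network algorithms), and no approximation error from the action network, since that network is eliminated. The SNAC approach is applicable to a wide class of nonlinear systems where the optimal control (stationary) equation can be explicitly expressed in terms of the state and the costate variables. Most of the problems in aerospace, automobile, robotics, and other engineering disciplines can be characterized by nonlinear control-affine equations that yield such a relation. SNAC-based controllers yield excellent tracking performance in applications to microelectronic mechanical systems, chemical reactors, and high-speed reentry problems. Padhi et al. [65] have proved that for linear systems (where the mapping between the costate at stage $k+1$ and the state at stage $k$ is linear), the solution obtained by the algorithm based on the SNAC structure converges to the solution of the discrete Riccati equation.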
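
The forward-in-time critic target $U(k) + \gamma \hat{J}(k+1)$ and the link to the Riccati equation can be illustrated on a toy problem. The following is only a minimal sketch under stated assumptions, not the algorithm of any cited reference and not SNAC: it assumes a scalar linear system $x(k+1) = a\,x(k) + b\,u(k)$, a utility $U = q x^2 + r u^2$, and a one-parameter critic $\hat{J}(x) = w x^2$ fitted by least squares. All function and parameter names (`greedy_gain`, `train_critic`, `riccati_fixed_point`, `a`, `b`, `q`, `r`, `gamma`) are hypothetical and chosen for illustration. The iteration starts from $w = 0$, i.e., without an initial stable policy.

```python
import math


def greedy_gain(w, a, b, r, gamma):
    """Gain K of the control u = -K*x that minimizes
    r*u^2 + gamma*w*(a*x + b*u)^2 for the current critic weight w."""
    return gamma * w * a * b / (r + gamma * w * b * b)


def train_critic(a, b, q, r, gamma, states, sweeps=200):
    """Forward-in-time critic training for J_hat(x) = w * x^2 (assumed form).

    Each sweep uses the current critic to build the training target
    U(k) + gamma * J_hat(k+1) and refits w by least squares.
    """
    w = 0.0  # no initial stabilizing policy is assumed
    for _ in range(sweeps):
        K = greedy_gain(w, a, b, r, gamma)
        num = 0.0
        den = 0.0
        for x in states:
            u = -K * x
            x_next = a * x + b * u  # stand-in for the model network
            target = q * x * x + r * u * u + gamma * w * x_next * x_next
            num += x * x * target   # least-squares fit of w*x^2 to target
            den += x ** 4
        w = num / den
    return w


def riccati_fixed_point(a, b, q, r, gamma):
    """Positive root of the scalar discounted Riccati equation
    gamma*b^2*p^2 + (r - gamma*q*b^2 - gamma*a^2*r)*p - q*r = 0."""
    A = gamma * b * b
    B = r - gamma * q * b * b - gamma * a * a * r
    C = -q * r
    return (-B + math.sqrt(B * B - 4.0 * A * C)) / (2.0 * A)


if __name__ == "__main__":
    a, b, q, r, gamma = 1.2, 1.0, 1.0, 1.0, 0.95  # open loop is unstable
    w = train_critic(a, b, q, r, gamma, states=[-1.0, 0.5, 2.0])
    p = riccati_fixed_point(a, b, q, r, gamma)
    print("critic weight:", w)
    print("Riccati solution:", p)
```

In this linear-quadratic special case the fitted critic weight should agree with the positive solution of the scalar discounted Riccati equation, and the resulting greedy gain should stabilize the closed loop even though the open-loop system is unstable. For nonlinear systems the quadratic critic would be replaced by a neural network and the least-squares fit by gradient training on (8).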

Format: doc

Size: 587KB

Software: Word

Pages: 19

Category: Business Operations

Upload time: 2021-12-04
