References
Agre, P. E. (1988). The Dynamic Structure of Everyday Life. Ph.D. thesis, Massachusetts Institute of Technology. AI-TR 1085, MIT Artificial Intelligence Laboratory.
Agre, P. E., and Chapman, D. (1990). What are plans for? Robotics and Autonomous Systems, 6:17-34.
Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10:25-61.
Albus, J. S. (1981). Brains, Behavior, and Robotics. Byte Books, Peterborough, NH.
Anderson, C. W. (1986). Learning and Problem Solving with Multilayer Connectionist Systems. Ph.D. thesis, University of Massachusetts, Amherst.
Anderson, C. W. (1987). Strategy learning with multilayer connectionist representations. Proceedings of the Fourth International Workshop on Machine Learning, pp. 103-114. Morgan Kaufmann, San Mateo, CA.
Anderson, J. A., Silverstein, J. W., Ritz, S. A., and Jones, R. S. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84:413-451.
Andreae, J. H. (1963). STELLA: A scheme for a learning machine. In Proceedings of the 2nd IFAC Congress, Basle, pp. 497-502. Butterworths, London.
Andreae, J. H. (1969a). A learning machine with monologue. International Journal of Man-Machine Studies, 1:1-20.
Andreae, J. H. (1969b). Learning machines - a unified view. In A. R. Meetham and R. A. Hudson (eds.), Encyclopedia of Information, Linguistics, and Control, pp. 261-270. Pergamon, Oxford.
Andreae, J. H. (1977). Thinking with the Teachable Machine. Academic Press, London.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30-37. Morgan Kaufmann, San Francisco.
Bao, G., Cassandras, C. G., Djaferis, T. E., Gandhi, A. D., and Looze, D. P. (1994). Elevator dispatchers for down peak traffic. Technical report, ECE Department, University of Massachusetts, Amherst.
Barnard, E. (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23:357-365.
Barto, A. G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4:229-256.
Barto, A. G. (1986). Game-theoretic cooperativity in networks of self-interested units. In J. S. Denker (ed.), Neural Networks for Computing, pp. 41-46. American Institute of Physics, New York.
Barto, A. G. (1990). Connectionist learning for control: An overview. In T. Miller, R. S. Sutton, and P. J. Werbos (eds.), Neural Networks for Control, pp. 5-58. MIT Press, Cambridge, MA.
Barto, A. G. (1991). Some learning tasks from a control perspective. In L. Nadel and D. L. Stein (eds.), 1990 Lectures in Complex Systems, pp. 195-223. Addison-Wesley, Redwood City, CA.
Barto, A. G. (1992). Reinforcement learning and adaptive critic methods. In D. A. White and D. A. Sofge (eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pp. 469-491. Van Nostrand Reinhold, New York.
Barto, A. G. (1995a). Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, and D. G. Beiser (eds.), Models of Information Processing in the Basal Ganglia, pp. 215-232. MIT Press, Cambridge, MA.
Barto, A. G. (1995b). Reinforcement learning. In M. A. Arbib (ed.), Handbook of Brain Theory and Neural Networks, pp. 804-809. MIT Press, Cambridge, MA.
Barto, A. G., and Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15:360-375.
Barto, A. G., and Anderson, C. W. (1985). Structural learning in connectionist systems. In Program of the Seventh Annual Conference of the Cognitive Science Society, pp. 43-54.
Barto, A. G., Anderson, C. W., and Sutton, R. S. (1982). Synthesis of nonlinear control surfaces by a layered associative search network. Biological Cybernetics, 43:175-185.
Barto, A. G., Bradtke, S. J., and Singh, S. P. (1991). Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, Department of Computer and Information Science, University of Massachusetts, Amherst.
Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81-138.
Barto, A. G., and Duff, M. (1994). Monte Carlo matrix inversion and reinforcement learning. In J. D. Cowan, G. Tesauro, and J. Alspector (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1993 Conference, pp. 687-694. Morgan Kaufmann, San Francisco.
Barto, A. G., and Jordan, M. I. (1987). Gradient following without back-propagation in layered networks. In M. Caudill and C. Butler (eds.), Proceedings of the IEEE First Annual Conference on Neural Networks, pp. II629-II636. SOS Printing, San Diego, CA.
Barto, A. G., and Sutton, R. S. (1981a). Goal seeking components for adaptive intelligence: An initial assessment. Technical Report AFWAL-TR-81-1070, Air Force Wright Aeronautical Laboratories/Avionics Laboratory, Wright-Patterson AFB, OH.
Barto, A. G., and Sutton, R. S. (1981b). Landmark learning: An illustration of associative search. Biological Cybernetics, 42:1-8.
Barto, A. G., and Sutton, R. S. (1982). Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element. Behavioural Brain Research, 4:221-235.
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835-846. Reprinted in J. A. Anderson and E. Rosenfeld (eds.), Neurocomputing: Foundations of Research, pp. 535-549. MIT Press, Cambridge, MA, 1988.
Barto, A. G., Sutton, R. S., and Brouwer, P. S. (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics, 40:201-211.
Bellman, R. E. (1956). A problem in the sequential design of experiments. Sankhya, 16:221-229.
Bellman, R. E. (1957a). Dynamic Programming. Princeton University Press, Princeton.
Bellman, R. E. (1957b). A Markovian decision process. Journal of Mathematics and Mechanics, 6:679-684.
Bellman, R. E., and Dreyfus, S. E. (1959). Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, 13:247-251.
Bellman, R. E., Kalaba, R., and Kotkin, B. (1973). Polynomial approximation - a new computational technique in dynamic programming: Allocation processes. Mathematical Computation, 17:155-161.
Berry, D. A., and Fristedt, B. (1985). Bandit Problems. Chapman and Hall, London.
Bertsekas, D. P. (1982). Distributed dynamic programming. IEEE Transactions on Automatic Control, 27:610-616.
Bertsekas, D. P. (1983). Distributed asynchronous computation of fixed points. Mathematical Programming, 27:107-120.
Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ.
Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA.
Bertsekas, D. P., and Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.
Bertsekas, D. P., and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
Biermann, A. W., Fairfield, J. R. C., and Beres, T. R. (1982). Signature table systems and learning. IEEE Transactions on Systems, Man, and Cybernetics, 12:635-648.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon, Oxford.
Booker, L. B. (1982). Intelligent Behavior as an Adaptation to the Task Environment. Ph.D. thesis, University of Michigan, Ann Arbor.
Boone, G. (1997). Minimum-time control of the acrobot. In 1997 International Conference on Robotics and Automation, pp. 3281-3287. IEEE Robotics and Automation Society.
Boutilier, C., Dearden, R., and Goldszmidt, M. (1995). Exploiting structure in policy construction. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1104-1111. Morgan Kaufmann.
Boyan, J. A., and Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pp. 369-376. MIT Press, Cambridge, MA.
Boyan, J. A., Moore, A. W., and Sutton, R. S. (eds.). (1995). Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference 1995. Technical Report CMU-CS-95-206, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. In S. J. Hanson, J. D. Cowan, and C. L. Giles (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1992 Conference, pp. 295-302. Morgan Kaufmann, San Mateo, CA.
Bradtke, S. J. (1994). Incremental Dynamic Programming for On-Line Adaptive Optimal Control. Ph.D. thesis, University of Massachusetts, Amherst. Appeared as CMPSCI Technical Report 94-62.
Bradtke, S. J., and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57.
Bradtke, S. J., Ydstie, B. E., and Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. In Proceedings of the American Control Conference, pp. 3475-3479. American Automatic Control Council, Evanston, IL.
Bradtke, S. J., and Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In G. Tesauro, D. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pp. 393-400. MIT Press, Cambridge, MA.
Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimates of parameters. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems: Proceedings of the 1989 Conference, pp. 211-217. Morgan Kaufmann, San Mateo, CA.
Broomhead, D. S., and Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321-355.
Bryson, A. E., Jr. (1996). Optimal control - 1950 to 1985. IEEE Control Systems, 13(3):26-33.
Bush, R. R., and Mosteller, F. (1955). Stochastic Models for Learning. Wiley, New York.
Byrne, J. H., Gingrich, K. J., and Baxter, D. A. (1990). Computational capabilities of single neurons: Relationship to simple forms of associative and nonassociative learning in Aplysia. In R. D. Hawkins and G. H. Bower (eds.), Computational Models of Learning, pp. 31-63. Academic Press, New York.
Campbell, D. T. (1960). Blind variation and selective survival as a general strategy in knowledge-processes. In M. C. Yovits and S. Cameron (eds.), Self-Organizing Systems, pp. 205-231. Pergamon, New York.
Carlström, J., and Nordström, E. (1997). Control of self-similar ATM call traffic by reinforcement learning. In Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications 3, pp. 54-62. Erlbaum, Hillsdale, NJ.
Chapman, D., and Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pp. 726-731. Morgan Kaufmann, San Mateo, CA.
Chow, C.-S., and Tsitsiklis, J. N. (1991). An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Transactions on Automatic Control, 36:898-914.
Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 183-188. AAAI/MIT Press, Menlo Park, CA.
Christensen, J., and Korf, R. E. (1986). A unified theory of heuristic evaluation functions and its application to learning. In Proceedings of the Fifth National Conference on Artificial Intelligence, pp. 148-152. Morgan Kaufmann, San Mateo, CA.
Cichosz, P. (1995). Truncating temporal differences: On the efficient implementation of TD(λ) for reinforcement learning. Journal of Artificial Intelligence Research, 2:287-318.
Clark, W. A., and Farley, B. G. (1955). Generalization of pattern recognition in a self-organizing system. In Proceedings of the 1955 Western Joint Computer Conference, pp. 86-91.
Clouse, J. (1996). On Integrating Apprentice Learning and Reinforcement Learning. Ph.D. thesis, University of Massachusetts, Amherst. Appeared as CMPSCI Technical Report 96-026.
Clouse, J., and Utgoff, P. (1992). A teaching method for reinforcement learning systems. In Proceedings of the Ninth International Machine Learning Conference, pp. 92-101. Morgan Kaufmann, San Mateo, CA.
Colombetti, M., and Dorigo, M. (1994). Training agents to perform sequential behavior. Adaptive Behavior, 2(3):247-275.
Connell, J. (1989). A colony architecture for an artificial creature. Technical Report AI-TR-1151, MIT Artificial Intelligence Laboratory, Cambridge, MA.
Connell, J., and Mahadevan, S. (1993). Robot Learning. Kluwer Academic, Boston.
Craik, K. J. W. (1943). The Nature of Explanation. Cambridge University Press, Cambridge.
Crites, R. H. (1996). Large-Scale Dynamic Optimization Using Teams of Reinforcement Learning Agents. Ph.D. thesis, University of Massachusetts, Amherst.
Crites, R. H., and Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1017-1023. MIT Press, Cambridge, MA.
Curtiss, J. H. (1954). A theoretical comparison of the efficiencies of two classical methods and a Monte Carlo method for computing one component of the solution of a set of linear algebraic equations. In H. A. Meyer (ed.), Symposium on Monte Carlo Methods, pp. 191-233. Wiley, New York.
Cziko, G. (1995). Without Miracles: Universal Selection Theory and the Second Darwinian Revolution. MIT Press, Cambridge, MA.
Daniel, J. W. (1976). Splines and efficiency in dynamic programming. Journal of Mathematical Analysis and Applications, 54:402-407.
Dayan, P. (1991). Reinforcement comparison. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton (eds.), Connectionist Models: Proceedings of the 1990 Summer School, pp. 45-51. Morgan Kaufmann, San Mateo, CA.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341-362.
Dayan, P., and Hinton, G. E. (1993). Feudal reinforcement learning. In S. J. Hanson, J. D. Cowan, and C. L. Giles (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1992 Conference, pp. 271-278. Morgan Kaufmann, San Mateo, CA.
Dayan, P., and Sejnowski, T. (1994). TD(λ) converges with probability 1. Machine Learning, 14:295-301.
Dean, T., and Lin, S. H. (1995). Decomposition techniques for planning in stochastic domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1121-1127. Morgan Kaufmann. See also Technical Report CS-95-10, Brown University, Department of Computer Science, 1995.
DeJong, G., and Spong, M. W. (1994). Swinging up the acrobot: An example of intelligent control. In Proceedings of the American Control Conference, pp. 2158-2162. American Automatic Control Council, Evanston, IL.
Denardo, E. V. (1967). Contraction mappings in the theory underlying dynamic programming. SIAM Review, 9:165-177.
Dennett, D. C. (1978). Brainstorms, pp. 71-89. Bradford/MIT Press, Cambridge, MA.
Dietterich, T. G., and Flann, N. S. (1995). Explanation-based learning and reinforcement learning: A unified view. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 176-184. Morgan Kaufmann, San Francisco.
Doya, K. (1996). Temporal difference learning in continuous time and space. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1073-1079. MIT Press, Cambridge, MA.
Doyle, P. G., and Snell, J. L. (1984). Random Walks and Electric Networks. The Mathematical Association of America. Carus Mathematical Monograph 22.
Dreyfus, S. E., and Law, A. M. (1977). The Art and Theory of Dynamic Programming. Academic Press, New York.
Duda, R. O., and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
Duff, M. O. (1995). Q-learning for bandit problems. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 209-217. Morgan Kaufmann, San Francisco.
Estes, W. K. (1950). Toward a statistical theory of learning. Psychological Review, 57:94-107.
Farley, B. G., and Clark, W. A. (1954). Simulation of self-organizing systems by digital computer. IRE Transactions on Information Theory, 4:76-84.
Feldbaum, A. A. (1965). Optimal Control Systems. Academic Press, New York.
Friston, K. J., Tononi, G., Reeke, G. N., Sporns, O., and Edelman, G. M. (1994). Value-dependent selection in the brain: Simulation in a synthetic neural model. Neuroscience, 59:229-243.
Fu, K. S. (1970). Learning control systems - review and outlook. IEEE Transactions on Automatic Control, 15:210-221.
Galanter, E., and Gerstenhaber, M. (1956). On thought: The extrinsic theory. Psychological Review, 63:218-227.
Gallant, S. I. (1993). Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA.
Gällmo, O., and Asplund, L. (1995). Reinforcement learning by construction of hypothetical targets. In J. Alspector, R. Goodman, and T. X. Brown (eds.), Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications 2, pp. 300-307. Erlbaum, Hillsdale, NJ.
Gardner, M. (1973). Mathematical games. Scientific American, 228(1):108-115.
Gelperin, A., Hopfield, J. J., and Tank, D. W. (1985). The logic of Limax learning. In A. Selverston (ed.), Model Neural Networks and Behavior, pp. 247-261. Plenum Press, New York.
Gittins, J. C., and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In J. Gani, K. Sarkadi, and I. Vincze (eds.), Progress in Statistics, pp. 241-266. North-Holland, Amsterdam-London.
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA.
Goldstein, H. (1957). Classical Mechanics. Addison-Wesley, Reading, MA.
Goodwin, G. C., and Sin, K. S. (1984). Adaptive Filtering Prediction and Control. Prentice-Hall, Englewood Cliffs, NJ.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 261-268. Morgan Kaufmann, San Francisco. An expanded version was published as Technical Report CMU-CS-95-103, Carnegie Mellon University, Pittsburgh, PA, 1995.
Gordon, G. J. (1996). Stable fitted reinforcement learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1052-1058. MIT Press, Cambridge, MA.
Griffith, A. K. (1966). A new machine learning technique applied to the game of checkers. Technical Report Project MAC, Artificial Intelligence Memo 94, Massachusetts Institute of Technology, Cambridge, MA.
Griffith, A. K. (1974). A comparison and evaluation of three machine learning procedures as applied to the game of checkers. Artificial Intelligence, 5:137-148.