References
Agre, P. E. (1988). The Dynamic Structure of Everyday Life. Ph.D. thesis, Massachusetts Institute of Technology. AI-TR 1085, MIT Artificial Intelligence Laboratory.
Agre, P. E., and Chapman, D. (1990). What are plans for? Robotics and Autonomous Systems, 6:17-34.
Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10:25-61.
Albus, J. S. (1981). Brains, Behavior, and Robotics. Byte Books, Peterborough, NH.
Anderson, C. W. (1986). Learning and Problem Solving with Multilayer Connectionist Systems. Ph.D. thesis, University of Massachusetts, Amherst.
Anderson, C. W. (1987). Strategy learning with multilayer connectionist representations. Proceedings of the Fourth International Workshop on Machine Learning, pp. 103-114. Morgan Kaufmann, San Mateo, CA.
Anderson, J. A., Silverstein, J. W., Ritz, S. A., and Jones, R. S. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84:413-451.
Andreae, J. H. (1963). STELLA: A scheme for a learning machine. In Proceedings of the 2nd IFAC Congress, Basle, pp. 497-502. Butterworths, London.
Andreae, J. H. (1969a). A learning machine with monologue. International Journal of Man-Machine Studies, 1:1-20.
Andreae, J. H. (1969b). Learning machines - a unified view. In A. R. Meetham and R. A. Hudson (eds.), Encyclopedia of Information, Linguistics, and Control, pp. 261-270. Pergamon, Oxford.
Andreae, J. H. (1977). Thinking with the Teachable Machine. Academic Press, London.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30-37. Morgan Kaufmann, San Francisco.
Bao, G., Cassandras, C. G., Djaferis, T. E., Gandhi, A. D., and Looze, D. P. (1994). Elevator dispatchers for down peak traffic. Technical report, ECE Department, University of Massachusetts, Amherst.
Barnard, E. (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23:357-365.
Barto, A. G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4:229-256.
Barto, A. G. (1986). Game-theoretic cooperativity in networks of self-interested units. In J. S. Denker (ed.), Neural Networks for Computing, pp. 41-46. American Institute of Physics, New York.
Barto, A. G. (1990). Connectionist learning for control: An overview. In T. Miller, R. S. Sutton, and P. J. Werbos (eds.), Neural Networks for Control, pp. 5-58. MIT Press, Cambridge, MA.
Barto, A. G. (1991). Some learning tasks from a control perspective. In L. Nadel and D. L. Stein (eds.), 1990 Lectures in Complex Systems, pp. 195-223. Addison-Wesley, Redwood City, CA.
Barto, A. G. (1992). Reinforcement learning and adaptive critic methods. In D. A. White and D. A. Sofge (eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pp. 469-491. Van Nostrand Reinhold, New York.
Barto, A. G. (1995a). Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, and D. G. Beiser (eds.), Models of Information Processing in the Basal Ganglia, pp. 215-232. MIT Press, Cambridge, MA.
Barto, A. G. (1995b). Reinforcement learning. In M. A. Arbib (ed.), Handbook of Brain Theory and Neural Networks, pp. 804-809. MIT Press, Cambridge, MA.
Barto, A. G., and Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15:360-375.
Barto, A. G., and Anderson, C. W. (1985). Structural learning in connectionist systems. In Program of the Seventh Annual Conference of the Cognitive Science Society, pp. 43-54.
Barto, A. G., Anderson, C. W., and Sutton, R. S. (1982). Synthesis of nonlinear control surfaces by a layered associative search network. Biological Cybernetics, 43:175-185.
Barto, A. G., Bradtke, S. J., and Singh, S. P. (1991). Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, Department of Computer and Information Science, University of Massachusetts, Amherst.
Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81-138.
Barto, A. G., and Duff, M. (1994). Monte Carlo matrix inversion and reinforcement learning. In J. D. Cowan, G. Tesauro, and J. Alspector (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1993 Conference, pp. 687-694. Morgan Kaufmann, San Francisco.
Barto, A. G., and Jordan, M. I. (1987). Gradient following without back-propagation in layered networks. In M. Caudill and C. Butler (eds.), Proceedings of the IEEE First Annual Conference on Neural Networks, pp. II629-II636. SOS Printing, San Diego, CA.
Barto, A. G., and Sutton, R. S. (1981a). Goal seeking components for adaptive intelligence: An initial assessment. Technical Report AFWAL-TR-81-1070, Air Force Wright Aeronautical Laboratories/Avionics Laboratory, Wright-Patterson AFB, OH.
Barto, A. G., and Sutton, R. S. (1981b). Landmark learning: An illustration of associative search. Biological Cybernetics, 42:1-8.
Barto, A. G., and Sutton, R. S. (1982). Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element. Behavioural Brain Research, 4:221-235.
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835-846. Reprinted in J. A. Anderson and E. Rosenfeld (eds.), Neurocomputing: Foundations of Research, pp. 535-549. MIT Press, Cambridge, MA, 1988.
Barto, A. G., Sutton, R. S., and Brouwer, P. S. (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics, 40:201-211.
Bellman, R. E. (1956). A problem in the sequential design of experiments. Sankhya, 16:221-229.
Bellman, R. E. (1957a). Dynamic Programming. Princeton University Press, Princeton.
Bellman, R. E. (1957b). A Markovian decision process. Journal of Mathematics and Mechanics, 6:679-684.
Bellman, R. E., and Dreyfus, S. E. (1959). Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, 13:247-251.
Bellman, R. E., Kalaba, R., and Kotkin, B. (1973). Polynomial approximation - a new computational technique in dynamic programming: Allocation processes. Mathematical Computation, 17:155-161.
Berry, D. A., and Fristedt, B. (1985). Bandit Problems. Chapman and Hall, London.
Bertsekas, D. P. (1982). Distributed dynamic programming. IEEE Transactions on Automatic Control, 27:610-616.
Bertsekas, D. P. (1983). Distributed asynchronous computation of fixed points. Mathematical Programming, 27:107-120.
Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ.
Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA.
Bertsekas, D. P., and Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.
Bertsekas, D. P., and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
Biermann, A. W., Fairfield, J. R. C., and Beres, T. R. (1982). Signature table systems and learning. IEEE Transactions on Systems, Man, and Cybernetics, 12:635-648.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon, Oxford.
Booker, L. B. (1982). Intelligent Behavior as an Adaptation to the Task Environment. Ph.D. thesis, University of Michigan, Ann Arbor.
Boone, G. (1997). Minimum-time control of the acrobot. In 1997 International Conference on Robotics and Automation, pp. 3281-3287. IEEE Robotics and Automation Society.
Boutilier, C., Dearden, R., and Goldszmidt, M. (1995). Exploiting structure in policy construction. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1104-1111. Morgan Kaufmann.
Boyan, J. A., and Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pp. 369-376. MIT Press, Cambridge, MA.
Boyan, J. A., Moore, A. W., and Sutton, R. S. (eds.). (1995). Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference 1995. Technical Report CMU-CS-95-206, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. In S. J. Hanson, J. D. Cowan, and C. L. Giles (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1992 Conference, pp. 295-302. Morgan Kaufmann, San Mateo, CA.
Bradtke, S. J. (1994). Incremental Dynamic Programming for On-Line Adaptive Optimal Control. Ph.D. thesis, University of Massachusetts, Amherst. Appeared as CMPSCI Technical Report 94-62.
Bradtke, S. J., and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57.
Bradtke, S. J., Ydstie, B. E., and Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. In Proceedings of the American Control Conference, pp. 3475-3479. American Automatic Control Council, Evanston, IL.
Bradtke, S. J., and Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In G. Tesauro, D. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pp. 393-400. MIT Press, Cambridge, MA.
Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimates of parameters. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems: Proceedings of the 1989 Conference, pp. 211-217. Morgan Kaufmann, San Mateo, CA.
Broomhead, D. S., and Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321-355.
Bryson, A. E., Jr. (1996). Optimal control - 1950 to 1985. IEEE Control Systems, 13(3):26-33.
Bush, R. R., and Mosteller, F. (1955). Stochastic Models for Learning. Wiley, New York.
Byrne, J. H., Gingrich, K. J., and Baxter, D. A. (1990). Computational capabilities of single neurons: Relationship to simple forms of associative and nonassociative learning in Aplysia. In R. D. Hawkins and G. H. Bower (eds.), Computational Models of Learning, pp. 31-63. Academic Press, New York.
Campbell, D. T. (1960). Blind variation and selective survival as a general strategy in knowledge-processes. In M. C. Yovits and S. Cameron (eds.), Self-Organizing Systems, pp. 205-231. Pergamon, New York.
Carlström, J., and Nordström, E. (1997). Control of self-similar ATM call traffic by reinforcement learning. In Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications 3, pp. 54-62. Erlbaum, Hillsdale, NJ.
Chapman, D., and Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pp. 726-731. Morgan Kaufmann, San Mateo, CA.
Chow, C.-S., and Tsitsiklis, J. N. (1991). An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Transactions on Automatic Control, 36:898-914.
Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 183-188. AAAI/MIT Press, Menlo Park, CA.
Christensen, J., and Korf, R. E. (1986). A unified theory of heuristic evaluation functions and its application to learning. In Proceedings of the Fifth National Conference on Artificial Intelligence, pp. 148-152. Morgan Kaufmann, San Mateo, CA.
Cichosz, P. (1995). Truncating temporal differences: On the efficient implementation of TD(λ) for reinforcement learning. Journal of Artificial Intelligence Research, 2:287-318.
Clark, W. A., and Farley, B. G. (1955). Generalization of pattern recognition in a self-organizing system. In Proceedings of the 1955 Western Joint Computer Conference, pp. 86-91.
Clouse, J. (1996). On Integrating Apprentice Learning and Reinforcement Learning. Ph.D. thesis, University of Massachusetts, Amherst. Appeared as CMPSCI Technical Report 96-026.
Clouse, J., and Utgoff, P. (1992). A teaching method for reinforcement learning systems. In Proceedings of the Ninth International Machine Learning Conference, pp. 92-101. Morgan Kaufmann, San Mateo, CA.
Colombetti, M., and Dorigo, M. (1994). Training agents to perform sequential behavior. Adaptive Behavior, 2(3):247-275.
Connell, J. (1989). A colony architecture for an artificial creature. Technical Report AI-TR-1151, MIT Artificial Intelligence Laboratory, Cambridge, MA.
Connell, J., and Mahadevan, S. (1993). Robot Learning. Kluwer Academic, Boston.
Craik, K. J. W. (1943). The Nature of Explanation. Cambridge University Press, Cambridge.
Crites, R. H. (1996). Large-Scale Dynamic Optimization Using Teams of Reinforcement Learning Agents. Ph.D. thesis, University of Massachusetts, Amherst.
Crites, R. H., and Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1017-1023. MIT Press, Cambridge, MA.
Curtiss, J. H. (1954). A theoretical comparison of the efficiencies of two classical methods and a Monte Carlo method for computing one component of the solution of a set of linear algebraic equations. In H. A. Meyer (ed.), Symposium on Monte Carlo Methods, pp. 191-233. Wiley, New York.
Cziko, G. (1995). Without Miracles: Universal Selection Theory and the Second Darwinian Revolution. MIT Press, Cambridge, MA.
Daniel, J. W. (1976). Splines and efficiency in dynamic programming. Journal of Mathematical Analysis and Applications, 54:402-407.
Dayan, P. (1991). Reinforcement comparison. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton (eds.), Connectionist Models: Proceedings of the 1990 Summer School, pp. 45-51. Morgan Kaufmann, San Mateo, CA.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341-362.
Dayan, P., and Hinton, G. E. (1993). Feudal reinforcement learning. In S. J. Hanson, J. D. Cowan, and C. L. Giles (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1992 Conference, pp. 271-278. Morgan Kaufmann, San Mateo, CA.
Dayan, P., and Sejnowski, T. (1994). TD(λ) converges with probability 1. Machine Learning, 14:295-301.
Dean, T., and Lin, S. H. (1995). Decomposition techniques for planning in stochastic domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1121-1127. Morgan Kaufmann. See also Technical Report CS-95-10, Brown University, Department of Computer Science, 1995.
DeJong, G., and Spong, M. W. (1994). Swinging up the acrobot: An example of intelligent control. In Proceedings of the American Control Conference, pp. 2158-2162. American Automatic Control Council, Evanston, IL.
Denardo, E. V. (1967). Contraction mappings in the theory underlying dynamic programming. SIAM Review, 9:165-177.
Dennett, D. C. (1978). Brainstorms, pp. 71-89. Bradford/MIT Press, Cambridge, MA.
Dietterich, T. G., and Flann, N. S. (1995). Explanation-based learning and reinforcement learning: A unified view. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 176-184. Morgan Kaufmann, San Francisco.
Doya, K. (1996). Temporal difference learning in continuous time and space. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1073-1079. MIT Press, Cambridge, MA.
Doyle, P. G., and Snell, J. L. (1984). Random Walks and Electric Networks. The Mathematical Association of America. Carus Mathematical Monograph 22.
Dreyfus, S. E., and Law, A. M. (1977). The Art and Theory of Dynamic Programming. Academic Press, New York.
Duda, R. O., and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York.
Duff, M. O. (1995). Q-learning for bandit problems. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 209-217. Morgan Kaufmann, San Francisco.
Estes, W. K. (1950). Toward a statistical theory of learning. Psychological Review, 57:94-107.
Farley, B. G., and Clark, W. A. (1954). Simulation of self-organizing systems by digital computer. IRE Transactions on Information Theory, 4:76-84.
Feldbaum, A. A. (1965). Optimal Control Systems. Academic Press, New York.
Friston, K. J., Tononi, G., Reeke, G. N., Sporns, O., and Edelman, G. M. (1994). Value-dependent selection in the brain: Simulation in a synthetic neural model. Neuroscience, 59:229-243.
Fu, K. S. (1970). Learning control systems - review and outlook. IEEE Transactions on Automatic Control, 15:210-221.
Galanter, E., and Gerstenhaber, M. (1956). On thought: The extrinsic theory. Psychological Review, 63:218-227.
Gallant, S. I. (1993). Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA.
Gällmo, O., and Asplund, L. (1995). Reinforcement learning by construction of hypothetical targets. In J. Alspector, R. Goodman, and T. X. Brown (eds.), Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications 2, pp. 300-307. Erlbaum, Hillsdale, NJ.
Gardner, M. (1973). Mathematical games. Scientific American, 228(1):108-115.
Gelperin, A., Hopfield, J. J., and Tank, D. W. (1985). The logic of Limax learning. In A. Selverston (ed.), Model Neural Networks and Behavior, pp. 247-261. Plenum Press, New York.
Gittins, J. C., and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In J. Gani, K. Sarkadi, and I. Vincze (eds.), Progress in Statistics, pp. 241-266. North-Holland, Amsterdam-London.
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA.
Goldstein, H. (1957). Classical Mechanics. Addison-Wesley, Reading, MA.
Goodwin, G. C., and Sin, K. S. (1984). Adaptive Filtering Prediction and Control. Prentice-Hall, Englewood Cliffs, NJ.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 261-268. Morgan Kaufmann, San Francisco. An expanded version was published as Technical Report CMU-CS-95-103, Carnegie Mellon University, Pittsburgh, PA, 1995.
Gordon, G. J. (1996). Stable fitted reinforcement learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1052-1058. MIT Press, Cambridge, MA.
Griffith, A. K. (1966). A new machine learning technique applied to the game of checkers. Technical Report Project MAC, Artificial Intelligence Memo 94, Massachusetts Institute of Technology, Cambridge, MA.
Griffith, A. K. (1974). A comparison and evaluation of three machine learning procedures as applied to the game of checkers. Artificial Intelligence, 5:137-148.