NIPS Tutorial, December 2, 1996

REINFORCEMENT LEARNING

by Richard S. Sutton
Senior Research Scientist
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
rich@cs.umass.edu
http://www.cs.umass.edu/~rich

ABSTRACT

Reinforcement learning is learning about, from, and while interacting with an environment in order to achieve a goal. In other words, it is a relatively direct model of the learning that people and animals do in their normal lives. In the last two decades, this age-old problem has come to be much better understood by integrating ideas from psychology, optimal control, artificial neural networks, and artificial intelligence. New methods and combinations of methods have enabled much better solutions to large-scale applications than had been possible by any other means. This tutorial will provide a top-down introduction to the field, covering Markov decision processes and approximate value functions as the formulation of the problem, and dynamic programming, temporal-difference learning, and Monte Carlo methods as the principal solution methods. The role of neural networks, evolutionary methods, and planning will also be covered. The emphasis will be on understanding the capabilities and appropriate role of each class of methods within an integrated system for learning and decision making.

Suggested further readings

General Reinforcement Learning

Sutton, R.S., Barto, A.G. An Introduction to Reinforcement Learning. A nearly-completed textbook treatment.

Barto, A.G., Sutton, R.S., Watkins, C.J.C.H. (1990) "Learning and Sequential Decision Making". In Learning and Computational Neuroscience: Foundations of Adaptive Networks, M. Gabriel and J.W. Moore, Eds., pp. 539-602, MIT Press.

Kaelbling, L.P. (1996) Special triple issue on Reinforcement Learning of the journal Machine Learning, Vol. 22, Nos. 1/2/3.

Sutton, R.S. (1992) Special double issue on Reinforcement Learning of the journal Machine Learning, Vol. 8, Nos. 3/4.

Animal Learning Theory and Reinforcement Learning

Sutton, R.S., Barto, A.G. (1990) "Time-derivative models of Pavlovian reinforcement". In Learning and Computational Neuroscience: Foundations of Adaptive Networks, M. Gabriel and J. Moore, Eds., pp. 497-537, MIT Press.

Neuroscience and Reinforcement Learning

Houk, J.C., Adams, J.L., Barto, A.G. (1995) "A model of how the basal ganglia generate and use neural signals that predict reinforcement". In Models of Information Processing in the Basal Ganglia, J.C. Houk, J.L. Davis, and D.G. Beiser, Eds., pp. 249-270, MIT Press.

Montague, P.R., Dayan, P., Person, C., Sejnowski, T.J. (1995) "Bee foraging in uncertain environments using predictive Hebbian learning". Nature 377, 725-728.

Montague, P.R., Dayan, P., Sejnowski, T.J. (1996) "A framework for mesencephalic dopamine systems based on predictive Hebbian learning". Journal of Neuroscience 16, 1936-1947.

See also my web page, starting from http://www.cs.umass.edu/~rich.
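
To give a concrete feel for the temporal-difference methods mentioned in the abstract, below is a minimal tabular TD(0) sketch in Python on a small random-walk task. The chain length, step-size, and reward scheme are illustrative assumptions, not anything prescribed by the tutorial itself.

    import random

    # Minimal tabular TD(0) sketch for estimating state values on a small
    # random-walk chain (illustrative assumptions, not code from the tutorial).
    # States 0..6; states 0 and 6 are terminal; reward +1 only on reaching state 6.

    NUM_STATES = 7
    ALPHA = 0.1   # step-size parameter (assumed value)
    GAMMA = 1.0   # undiscounted episodic task

    values = [0.0] * NUM_STATES  # terminal states keep value 0

    for episode in range(1000):
        state = 3  # start in the middle of the chain
        while state not in (0, NUM_STATES - 1):
            next_state = state + random.choice((-1, 1))
            reward = 1.0 if next_state == NUM_STATES - 1 else 0.0
            # TD(0) update: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
            values[state] += ALPHA * (reward + GAMMA * values[next_state] - values[state])
            state = next_state

    # For the nonterminal states the estimates approach 1/6, 2/6, ..., 5/6.
    print([round(v, 2) for v in values[1:-1]])

Each update nudges the current state's value toward the one-step bootstrapped target, which is the basic mechanism shared by the temporal-difference methods covered in the tutorial.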