First, a quick guide to the highlights, roughly in order of the talk's potential current interest:

- The Future of Artificial Intelligence Belongs to Search and Learning

- RL is the First Computational Theory of Mind (short perspective)
- Are you Ready to Fully Embrace Approximation?

- Experience and Intelligence: Toward a Scalable AI-agent Architecture (shorter, earlier version at NeurIPS)

- Advice and Perspectives on RL Research Frontiers (long)

- Video of Interview on AI and society
- The Future of Artificial Intelligence: UBC, Puerto Rico, AI&Humanity

- The importance of Temporal-Difference Learning

- Reinforcement Learning Tutorial at NIPS-2015

- Reinforcement Learning and Psychology
- Emphatic TD learning

- Learning About Sensorimotor Data

- Toward Learning Human-level Predictive Knowledge
- Deconstructing Reinforcement Learning

- New gradient-TD algorithms

- Critterbot project

- Micro-stimulus model of Dopamine at CalTech
- Tracking talk at ICML'07

- Experience-Oriented Artificial Intelligence
- overview of predictive representations
- Step-size adaptation: IDBD

- Provocative remarks at the DARPA Cognitive Systems Conference, 5/20/05
- RL Past, Present, and Future (ICML/COLT/UAI 7/25/98, SEAL98)
- RL's Computational Theory of Mind 2/14/03 Rutgers/Indiana

- Artificial Intelligence Should Be About Predictions 12/7/01 AT&T, CEC2000
- From MDPs to AI 5/14/03 U Alberta

- Overcoming the Curse of Dimensionality with RL (MIT ORC 4/19/01)
- From Reflex to Reason (Cornell 12/8/00)
- Temporal Difference Networks, presented at NIPS-04
- Constructive Induction Needs a Methodology based on Continuing Learning (ICML94)
- RL Tutorial 7/23/98

The Future of Artificial Intelligence (University of British Columbia, Feb 25, 2016)

Video.
Pdf slides.

When mankind finally comes to understand the principles of intelligence and how they can be embodied in machines, it will be the most important discovery of our age, perhaps of any age. In recent years, with progress in deep learning and other areas, this great scientific prize has begun to appear almost within reach. Artificial superintelligences are not imminent, but they may well occur within our lifetimes. The consequences, benefits, and dangers for humanity have become popular topics in the press, at public policy meetings (e.g., Davos) and at scientific meetings; luminaries such as Stephen Hawking and Elon Musk have weighed in. Is it all hyperbole and fear mongering, or are there genuine scientific advances underlying the current excitement? In this talk, I try to provide some perspective on these issues, informed and undoubtedly biased by my 38 years of research in AI. I seek to contribute to the conversation in two ways: 1) by seeing current developments as part of the longest trend in AI---towards cheaper computation and thus a greater role for search, learning, and all things meta, and 2) by sketching one possible path to AI (the one I am currently treading) and what it might look like for it to succeed.

Are you Ready to Fully Embrace Approximation? (RLAI Tea-time Talk, June 8, 2020)

Video.
Pdf slides.

Approximation that scales with computational resources is what drives modern machine learning. The steady drumbeat of Moore’s law enables successes, such as those of deep learning and AlphaGo, that depend on scalable approximation, and will continue to do so for the foreseeable future. Are we ready to be part of this future? Fully embracing approximation imposes a challenging discipline under which we must do without so much of what reinforcement learning takes for granted, including:

• optimal policies

• the discounted control objective

• Markov state, and therefore:

• all probabilities and expectations

• all true value functions

• the mean square Bellman error

• the mean square value error

• convergence to anything

• off-line learning

• mapping from environment state to feature vectors.

If you are not ready to give up all these things, then you are not ready to fully embrace approximation, and you are not ready to take what are likely to be the most important next steps in machine intelligence.

The Future of Artificial Intelligence Belongs to Search and Learning (University of Toronto, Oct 27, 2016)

Pdf slides. Video.

When mankind finally comes to understand the principles of intelligence and how they can be embodied in machines, it will be the most important discovery of our age, perhaps of any age. In recent years, with progress in deep learning and other areas, this great scientific prize has begun to appear almost within reach. The consequences, benefits, and dangers for humanity have become popular topics in the press, at public policy meetings, and at scientific meetings. Is it all hyperbole and fear mongering, or are there genuine scientific advances underlying the current excitement? In this talk, I try to provide some perspective, informed and undoubtedly biased by my 38 years of research in AI. I seek to contribute to the conversation in two ways: 1) by seeing current developments as part of the longest trend in AI---towards cheaper computation and thus a greater role for search, learning, and all things scalable, and 2) by sketching one possible path to AI, based on prediction and reinforcement learning.

Introduction to Reinforcement Learning with Function Approximation (NIPS Dec 7, 2015)

Pdf slides. Video.

Reinforcement learning is a body of theory and techniques for optimal sequential decision making developed in the last thirty years primarily within the machine learning and operations research communities, and which has separately become important in psychology and neuroscience. This tutorial will develop an intuitive understanding of the underlying formal problem (Markov decision processes) and its core solution methods, including dynamic programming, Monte Carlo methods, and temporal-difference learning. It will focus on how these methods have been combined with parametric function approximation, including deep learning, to find good approximate solutions to problems that are otherwise too large to be addressed at all. Finally, it will briefly survey some recent developments in function approximation, eligibility traces, and off-policy learning.

The Future of Artificial Intelligence (LABMP class Sep 10, 2015)

More on the Future of AI (LABMP class Jan 14, 2016)

Creating Human-level AI: How and When? (Future of AI Conference, Puerto Rico, Jan 3, 2015)

Deep Questions (Neural Computation and Adaptive Perception Workshop, Dec 5, 2015)

Emphatic Temporal-difference Learning (Google Deepmind, July 16 2015)

This talk will present a new algorithmic
idea—emphasis—that may have wide-ranging implications for all
kinds of temporal-difference learning. With function
approximation, it is not possible to get the value estimates
of all states exactly correct. If some are more accurate, then
others must be less so. How are the function approximation
resources to be allocated? As I formulate it, this question
comes down to asking about the intensity or emphasis of the
learning update at each time step. The emphasis is a positive
number M_{t} multiplied by the
learning rate (step size) of each learning update. As the
emphasis varies from step to step, it changes the effective
distribution of the learning updates. In recent work, Mahmood,
White, Yu, and I have shown that emphasis alone can cause the
TD(lambda) algorithm to become stable and convergent under
general off-policy learning. This is possibly a landmark
result, but it is technical and already available on arxiv. In
this talk I will focus on a further interesting possibility
that is more accessible but has not yet been properly worked
out: that emphasis changes and may improve the asymptotic
solution of temporal-difference learning algorithms. Emphasis
is based on how the values of different states are
interrelated by bootstrapping; if this can be understood, it
seems likely that a bound on asymptotic mean-square error can
be found that improves on the classic bound by Tsitsiklis and
Van Roy (1997). This talk is an invitation to join with me in
thinking about this issue and discovering the better bound.
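
To make the role of the emphasis concrete, here is a minimal sketch of one emphatic update in the simplest setting: linear features and lambda = 0, where the emphasis M_{t} reduces to a "follow-on" trace. The function and argument names (emphatic_td0_step, followon, interest, rho) are illustrative assumptions, not taken from the talk, and the update form follows the published emphatic-TD formulation only as far as I understand it.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def emphatic_td0_step(theta, followon, x, reward, x_next, rho, rho_prev,
                      alpha=0.1, gamma=0.9, interest=1.0):
    """One emphatic TD(0) update with linear features (illustrative sketch).

    theta     -- current weight vector
    followon  -- follow-on trace from the previous step (0.0 before the first step)
    rho       -- importance-sampling ratio for the current action
    rho_prev  -- importance-sampling ratio for the previous action
    Returns (new_theta, new_followon).
    """
    # Follow-on trace: discounted, importance-weighted accumulation of interest.
    followon = gamma * rho_prev * followon + interest
    # With lambda = 0 the emphasis M_t is just the follow-on trace.
    emphasis = followon
    # Ordinary TD error under the current linear value estimate.
    delta = reward + gamma * dot(theta, x_next) - dot(theta, x)
    # The emphasis multiplies the step size of this update.
    step = alpha * emphasis * rho * delta
    new_theta = [t + step * xi for t, xi in zip(theta, x)]
    return new_theta, followon
```

In the on-policy case all ratios are 1 and the emphasis simply grows toward 1/(1 - gamma); off-policy, it rises and falls with the importance-sampling ratios, which is what changes the effective distribution of the updates.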

Multi-step Prediction (Neural Computation and Adaptive Perception Workshop, Dec 6, 2014)

Reinforcement Learning and Psychology (or Neuroscience): A Personal Story (Annual Meeting of the Society for Mathematical Psychology July 20 2014, and other places)

The modern field of
reinforcement learning (RL) has a long, intertwined
relationship with psychology. Almost all the powerful ideas of
RL came originally from psychology, and today they are
recognized as having significantly increased our ability to
solve difficult engineering problems such as playing
backgammon, flying helicopters, and optimal placement of
internet advertisements. Psychology should celebrate this and
take credit for it! RL has also begun to give something back
to the study of natural minds, as RL algorithms are providing
insights into classical conditioning, the neuroscience of
brain reward systems, and the role of mental replay in
thought. I have been working in the field of RL for much of
this journey, back and forth between nature and engineering,
and have played a role in some of the key steps. In this talk
I tell the story as it seemed to happen from my point of view,
summarizing it in four things that I think every psychologist
should know about RL: 1) that it is a formalization of
learning by trial and error, with engineering uses, 2) that it
is a formalization of the propagation of reward predictions
which closely matches behavioral and neuroscience data, 3)
that it is a formalization of thought as learning from
replayed experience that again matches data from natural
systems, and 4) that there is a beautiful confluence of
psychology, neuroscience, and computational theory on common
ideas and elegant algorithms.

The AI Singularity and Prospects for Digital Immortality. (Kim Solez's LABMP course Dec 3, 2013)

Gradient Temporal-Difference Learning Algorithms (Gatsby London 2011)

Learning About Sensorimotor Data (NIPS Dec 13, 2011 and JNNS Dec 15 2011)

Pdf slides. Quicktime. Full keynote source with videos.
Associated paper on the Horde architecture. Associated paper on on-policy nexting.
Maei's thesis on gradient-TD algorithms. On videolectures.net.

Temporal-difference (TD)
learning of reward predictions underlies both
reinforcement-learning algorithms and the standard dopamine
model of reward-based learning in the brain. This confluence
of computational and neuroscientific ideas is perhaps the most
successful since the Hebb synapse. Can it be extended beyond
reward? The brain certainly predicts many things other than
reward---such as in a forward model of the consequences of
various ways of behaving---and TD methods can be used to make
these predictions. The idea and advantages of using TD methods
to learn large numbers of predictions about many states and
stimuli, in parallel, have been apparent since the 1990s, but
technical issues have prevented this vision from being
practically implemented...until now. A key breakthrough was
the development of a new family of gradient-TD methods,
introduced at NIPS in 2008 (by Maei, Szepesvari, and myself).
Using these methods, and other ideas, we are now able to learn
thousands of non-reward predictions in real-time at 10Hz from
a single sensorimotor data stream from a physical robot. These
predictions are temporally extended (ranging up to tens of
seconds of anticipation), goal oriented, and policy
contingent. The new algorithms enable learning to be
off-policy and in parallel, resulting in dramatic increases in
the amount that can be learned in a given amount of time. Our
effective learning rate scales linearly with computational
resources. On a consumer laptop we can learn thousands of
predictions in real-time. On a larger computer, or on a
comparable laptop in a few years, the same methods could learn
millions of meaningful predictions about different alternate
ways of behaving. These predictions in aggregate constitute a
rich detailed model of the world that can support planning
methods such as approximate dynamic programming.

Toward Learning Human-level Predictive Knowledge (AGI keynote March 5, 2010)

Slides. Video of the talk part1, part2. This talk
worked out pretty well for content -- about handling knowledge
predictively with general value functions. The slides and
video are best viewed side by side. In part 2, you should skip
from 5:50 to 15:50, as this portion has bad sound and is
repeated again with better sound starting at 15:50. Perhaps
this was done to show the slides better during these 10
minutes, but this can be done better by downloading the
slides. The total talk lasts about one hour and 25 minutes (if
5:50-15:50 is skipped). The sound is very poor in the lengthy
question period starting at 27:20 of part 2.

Deconstructing Reinforcement Learning (MSRL June 18, 2009)

The premise of this symposium is
that the ideas of reinforcement learning have impacted many
fields, including artificial intelligence, neuroscience,
control theory, psychology, and economics. But what are these
ideas and which of them is key? Is it the idea of reward and
reward prediction as a way of structuring the problem facing
both natural and artificial systems? Is it temporal-difference
learning as a sample-based algorithm for approximating dynamic
programming? Or is it the idea of learning online, by trial
and error, searching to find a way of behaving that might not
be known by any human supervisor? Or is it all of these ideas
and others, all coming to renewed prominence and significance
as these fields focus on the common problem that faces
animals, machines, and societies---how to predict and control
a hugely complex world that can never be understood
completely, but only as a gross, ever-changing
approximation? In this talk I seek to start the process of
phrasing and answering these questions. In some cases, from my
own experience, I can identify which ideas have been the most
important, and guess which will be in the future. For others I
can only ask the other speakers and attendees to provide
informed perspectives from their own fields.

Fast gradient-descent methods for temporal-difference learning with linear function approximation (ICML-09 June 16, 2009)

Associated paper.

Sutton, Szepesvari and
Maei (2009) recently introduced the first
temporal-difference learning algorithm compatible with
both linear function approximation and off-policy
training, and whose complexity scales only linearly in
the size of the function approximator. Although their
gradient temporal difference (GTD) algorithm converges
reliably, it can be very slow compared to conventional
linear TD (on on-policy problems where TD is
convergent), calling into question its practical
utility. In this paper we introduce two new
related algorithms with better convergence rates.
The first algorithm, GTD2, is derived and proved
convergent just as GTD was, but uses a different
objective function and converges significantly faster
(but still not as fast as conventional TD). The second
new algorithm, linear TD with gradient correction, or
TDC, uses the same update rule as conventional TD except
for an additional term which is initially zero. In our
experiments on small test problems and in a Computer Go
application with a million features, the learning rate
of this algorithm was comparable to that of conventional
TD. This algorithm appears to extend linear TD to
off-policy learning with no penalty in performance while
only doubling computational requirements.
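
A minimal sketch of the two updates described above, assuming on-policy sampling (no importance-sampling ratios), linear features, and constant step sizes. The names tdc_step, gtd2_step, alpha, and beta are illustrative assumptions; w is the second weight vector that both algorithms maintain alongside theta.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def td_error(theta, x, reward, x_next, gamma):
    """TD error for a linear value estimate."""
    return reward + gamma * dot(theta, x_next) - dot(theta, x)


def tdc_step(theta, w, x, reward, x_next, alpha=0.05, beta=0.05, gamma=0.9):
    """Linear TD with gradient correction (sketch): the conventional TD
    update plus a correction term that vanishes while w is zero."""
    delta = td_error(theta, x, reward, x_next, gamma)
    xw = dot(x, w)
    new_theta = [t + alpha * (delta * xi - gamma * xw * xni)
                 for t, xi, xni in zip(theta, x, x_next)]
    new_w = [wi + beta * (delta - xw) * xi for wi, xi in zip(w, x)]
    return new_theta, new_w


def gtd2_step(theta, w, x, reward, x_next, alpha=0.05, beta=0.05, gamma=0.9):
    """GTD2 (sketch): same update for w, but theta moves along
    (x - gamma * x_next) scaled by the secondary estimate x.w."""
    delta = td_error(theta, x, reward, x_next, gamma)
    xw = dot(x, w)
    new_theta = [t + alpha * (xi - gamma * xni) * xw
                 for t, xi, xni in zip(theta, x, x_next)]
    new_w = [wi + beta * (delta - xw) * xi for wi, xi in zip(w, x)]
    return new_theta, new_w
```

Starting with w at zero, the first tdc_step coincides with a conventional linear TD(0) step, which is the sense in which the additional term is "initially zero". Each step costs a constant number of inner products over the feature vector, so the work stays linear in the number of features.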

Core learning algorithms for intrinsically motivated robots (Venice IM-CLeVeR Workshop Nov 17, 2009)

This talk will present recent
progress in the development of core learning algorithms that
may be useful in creating systems with nontrivial
cognitive-developmental trajectories. The most distinctive
feature of such systems is that their learning is continual
and cumulative. They never stop learning, and new learning
builds upon the old. For such learning, the algorithms must be
incremental and operate in real-time, and in the past the most
suitable algorithms for such cases have been gradient-descent
algorithms. Our new algorithms are extensions of
temporal-difference learning so that it is a true
gradient-descent algorithm, which greatly extends its
robustness and generality. In particular, we have obtained for
the first time temporal-difference methods for off-policy
learning with function approximation, including nonlinear
function approximation, and for intra-option learning. This
talk will not present these algorithms in technical detail,
but instead stress the several natural roles that they could
play in systems that set their own goals and explore the world
in a structured, intrinsically motivated way.

This is joint work with Hamid Maei, Csaba Szepesvari, Doina Precup, Shalabh Bhatnagar, David Silver, Michael Delp, and Eric Wiewiora.

Online Representation Learning in the State Update Function (Tea-time 2009)

New Temporal-Difference Methods Based on Gradient Descent (USC 2/18/09)

ABSTRACT: Temporal-difference
methods based on gradient descent and parameterized function
approximators form a core part of the modern field of
reinforcement learning and are essential to many of its
large-scale applications. However, the most popular
methods, including TD(lambda), Q-learning, and Sarsa, are not
true gradient-descent methods and, as a result, the conditions
under which they converge are narrower and less robust than
can usually be guaranteed for gradient-descent methods.
In this paper we introduce a new family of temporal-difference
algorithms whose expected updates are in the direction of the
gradient of a natural performance measure that we call the
"mean squared projected Bellman error". Because these
are true gradient-descent methods, we are able to apply
standard techniques to prove them convergent and stable under
general conditions including, for the first time, off-policy
training. The new methods are of the same order of complexity
as TD(lambda) and, when TD(lambda) converges, they converge at
a similar rate to the same fixpoints. The new methods
are similar to GTD(0) (Sutton, Szepesvari & Maei, 2009),
but based on a different objective function and much more
efficient, as we demonstrate in a series of computational
experiments.
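
For reference, the objective named above is usually written as follows for a linear approximator. The notation is an assumption, since the abstract defines no symbols: V_theta is the linear value estimate with features phi, T is the Bellman operator for the target policy, Pi is the projection onto the span of the features, and D weights states by how often they are visited.

```latex
\mathrm{MSPBE}(\theta)
  = \bigl\lVert V_\theta - \Pi\, T V_\theta \bigr\rVert_D^{2}
  = \mathbb{E}[\delta\,\phi]^{\top}\,
    \mathbb{E}[\phi\,\phi^{\top}]^{-1}\,
    \mathbb{E}[\delta\,\phi],
\qquad
V_\theta(s) = \theta^{\top}\phi(s),
\quad
\delta = r + \gamma\,\theta^{\top}\phi' - \theta^{\top}\phi .
```

The second form, written in terms of expectations over sampled transitions, is what makes stochastic gradient descent on this objective practical.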

Applications of Reinforcement Learning in the Power Systems Industry (Invited Talk at the First Annual Power and Energy Forum, Nov 6, 2008)

Critterbot Project Overview (Aug 6, 2008)

Mind and Time: A View of Constructivist Reinforcement Learning

This is an invited talk I gave at
the European Workshop on Reinforcement Learning in summer
2008. The basic idea is that in order to learn fast it
is necessary to learn slow, that the key to fast reinforcement
learning is to prepare for it by a slow continual process of
constructing a model of the world's state and dynamics.
Although I don't know exactly how to do this, I have many
ideas and suggestions, and an outline of how to proceed.
I try to communicate these in this talk.

How simple can mind be? (International Workshop on Natural and Artificial Cognition, University of Oxford 6/26/07)

On the Role of Tracking in Stationary Environments (ICML'07 6/21/07) Associated paper.

ABSTRACT: It is often thought
that learning algorithms that track the best solution, as
opposed to converging to it, are important only on
nonstationary problems. We present three results suggesting
that this is not so. First we illustrate in a simple concrete
example, the Black and White problem, that tracking can
perform better than any converging algorithm on a stationary
problem. Second, we show the same point on a larger, more
realistic problem, an application of temporal-difference
learning to computer Go. Our third result suggests that
tracking in stationary problems could be important for
meta-learning research (e.g., learning to learn, feature
selection, transfer). We apply a meta-learning algorithm for
step-size adaptation, IDBD, to the Black and White problem,
showing that meta-learning has a dramatic long-term effect on
performance whereas, on an analogous converging problem,
meta-learning has only a small second-order effect. This small
result suggests a way of eventually overcoming a major
obstacle to meta-learning research: the lack of an independent
methodology for task selection.
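
For readers unfamiliar with IDBD, here is a minimal sketch of one update for a linear predictor, written from the standard published form of the algorithm as I understand it. Each weight has its own step size exp(beta_i), which is itself adapted by a meta step size; the names idbd_step and meta_rate are illustrative assumptions.

```python
import math


def idbd_step(w, beta, h, x, target, meta_rate=0.01):
    """One IDBD update for a linear predictor (illustrative sketch).

    w     -- weights
    beta  -- log step sizes, one per weight
    h     -- per-weight memory traces of recent weight changes
    Returns (new_w, new_beta, new_h).
    """
    prediction = sum(wi * xi for wi, xi in zip(w, x))
    delta = target - prediction
    new_w, new_beta, new_h = [], [], []
    for wi, bi, hi, xi in zip(w, beta, h, x):
        # Meta-level step: raise the log step size when the current error
        # correlates with the recent direction of change for this weight.
        bi = bi + meta_rate * delta * xi * hi
        alpha_i = math.exp(bi)
        # Base-level step: delta rule with a per-weight step size.
        wi = wi + alpha_i * delta * xi
        # Decay the trace (clipped at zero) and add the latest change.
        hi = hi * max(0.0, 1.0 - alpha_i * xi * xi) + alpha_i * delta * xi
        new_w.append(wi)
        new_beta.append(bi)
        new_h.append(hi)
    return new_w, new_beta, new_h
```

Because the step sizes never have to decay to zero, a learner of this kind keeps tracking rather than converging, which is the behavior the abstract contrasts with converging algorithms.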

Stimulus Representation in Temporal-Difference Models of the Dopamine System (Cal Tech 6/4/07)

The neurotransmitter dopamine
plays an important role in the processing of reward-related
information in the brain. A prominent theory of this function
is that the phasic firing of dopamine neurons encodes a reward
prediction error as formalized by the temporal-difference (TD)
algorithm in reinforcement learning. Most of these TD models
of the dopamine system have assumed a "complete serial
compound" representation in which every moment within a trial
is represented distinctly with no similarity to neighboring
moments. In this paper we present a more realistic temporal
representation in which external stimuli spawn a series of
internal microstimuli which grow weaker and more diffuse over
time. We show that if these microstimuli are used as inputs to
the TD model, then its match to experimental data is improved
for hitherto problematic cases in which reward is omitted or
received early. We also note that the new model never
produces large negative errors, suggesting that a second
neurotransmitter for representing negative errors may not be
necessary. Generally, we conclude that choosing a stimulus
representation with a more realistic temporal profile can
significantly alter the predictions of the TD model of
dopamine function.

Experience-Oriented Artificial Intelligence (Machine Learning Seminar at the University of Toronto, 4/3/06)

If intelligence is a
computation, then the temporal stream of sensations is its
input, and the temporal stream of actions is its output. These
two intermingled time series make up experience. They
are the basis on which all intelligent decisions are made and
the basis on which those decisions are judged. A focus on
experience has implications for many aspects of AI; in this
talk we consider its implications for knowledge
representation. I propose that it is possible and desirable
for an AI agent's knowledge of the world to be expressed
entirely as predictions about its low-level experience. Even
abstract concepts, such as the concept of a chair, can be
expressed as predictions, e.g., about what will happen if we
try to sit. The predictive approach is appealing because it
connects knowledge directly to data, allowing knowledge to be
autonomously verified and tuned, perhaps even learned.
However, there is a tremendous gap between human-level
knowledge (e.g., about space, objects, people, or water) and
low-level experience. The purpose of this talk is to
present some recent work suggesting how this gap might someday
be bridged. I describe a series of small experiments in
which extensions of reinforcement learning methods are used to
learn predictive representations of abstract commonsense
knowledge in micro-worlds. These are first steps on a long
journey toward understanding how a mind might make sense of
the blooming, buzzing confusion of its sensori-motor
experience.

Predictive Representations of State and Knowledge (ICML'05 workshop on Rich Representations for Reinforcement Learning, 8/7/05)

What is knowledge? The
empiricist answer, dating back to the 19th century, is that
knowledge is the ability to predict. In a modern version of
this idea, reinforcement learning researchers have proposed
that artificial agents should represent their knowledge as
predictions of their low-level sensations and actions.
This predictive representations (PR) approach is appealing
because it connects knowledge directly to data, thereby
facilitating learning and clarifying semantics. Most PR
research has emphasized representing the world's state.
In this talk I will survey the main results and mathematical
ideas of that work. A natural follow on, just beginning to be
explored, is to use PRs for all kinds of world knowledge, of
dynamics as well as of state, of abstractions as well as
specifics. I will survey this work as well and attempt
to make vivid the potential of PRs for artificial
intelligence.

Grounding knowledge in subjective experience (provocative remarks at the 2nd Cognitive Systems Conference, 5/20/05)

Experience-Oriented Artificial Intelligence (McGill 11/30/05)

I propose that experience - the
explicit sequence of actions and sensations over an agent's
life - should play a central role in all aspects of artificial
intelligence. In particular:

1. Knowledge representation
should be in terms of experience. Recent work has shown that a
surprisingly wide range of world knowledge can be expressed as
predictions of experience, enabling it to be automatically
verified and tuned, and grounding its meaning in data rather
than in human understanding.

2. Planning/reasoning should be in terms of experience. It is natural to think of planning as comparing alternative future experiences. General methods, such as dynamic programming, can be used to plan using knowledge expressed in the aforementioned predictive form.

3. State representation should be in terms of experience. Rather than talk about objects and their metric or even topological relationships, we represent states by the predictions that can be made from them. For example, the state "John is in the coffee room" corresponds to the prediction that going to the coffee room will produce the sight of John.

Much here has yet to be worked
out. Each of the "should"s above can also be read as a
"could", or even a "perhaps could". I am optimistic and
enthusiastic because of the potential for developing a compact
and powerful theory of AI in the long run, and for many easy
experimental tests in the short run.

Grounding Commonsense Knowledge in Question Networks. (University of Michigan 9/28/04)

A long-standing challenge in
artificial intelligence has been to relate the kind of
commonsense knowledge that people have about the world (for
example, about space, objects, people, trees and water) to the
low-level stream of sensations and actions. In this
talk, we present new work that brings us a few steps closer to
realizing this goal. We introduce the idea of question
networks, a way of expressing arbitrary machine-readable
questions about future sensations and actions, and a
temporal-difference algorithm for learning answers to the
questions. In a series of small experiments, we
illustrate the learning efficiency of these methods and their
ability to handle non-Markov problems. Finally, we
present their extension to temporally abstract knowledge in
terms of closed-loop macro-actions known as options.
Overall, we argue that these steps bring us qualitatively
closer to understanding the blooming, buzzing confusion of
sensori-motor experience.

Temporal Difference Networks. Presented at NIPS-04. Larger Version.

We introduce a generalization of
temporal-difference (TD) learning to networks of interrelated
predictions. Rather than relating a single prediction to
itself at a later time, as in conventional TD methods, a TD
network relates each prediction in a set of predictions to
other predictions in the set at a later time. TD networks can
represent and apply TD learning to a much wider class of
predictions than has previously been possible. Using a
random-walk example, we show that these networks can be used
to learn to predict by a fixed interval, which is not possible
with conventional TD methods. Secondly, we show that when
actions are introduced, and the inter-prediction relationships
made contingent on them, the usual learning-efficiency
advantage of TD methods over Monte Carlo (supervised learning)
methods becomes particularly pronounced. Thirdly, we
demonstrate that TD networks can learn predictive state
representations that enable exact solution of a non-Markov
problem. A very broad range of inter-predictive temporal
relationships can be expressed in these networks. Overall we
argue that TD networks represent a substantial extension of
the abilities of TD methods and bring us closer to the goal of
representing world knowledge in entirely predictive, grounded
terms.
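
As a rough illustration of predicting a fixed interval ahead, the sketch below chains predictions on a small random walk: node 0 predicts the next observation, and each later node predicts the previous node's value at the next time step. It is a minimal tabular sketch, not the talk's implementation. The ring-shaped walk, the fully observable state, and the names `train_td_network` and `exact_predictions` are all assumptions made for the example.

```python
import random

def exact_predictions(n_states, n_nodes):
    """Exact P(o_{t+i} = 1 | s_t = s) for i = 1..n_nodes, computed by recursion."""
    prev = [1.0 if s == 0 else 0.0 for s in range(n_states)]  # observation per state
    out = []
    for _ in range(n_nodes):
        cur = [0.5 * prev[(s - 1) % n_states] + 0.5 * prev[(s + 1) % n_states]
               for s in range(n_states)]
        out.append(cur)
        prev = cur
    return out

def train_td_network(n_states=5, n_nodes=3, alpha=0.01, steps=100_000, seed=0):
    """Learn i-step-ahead predictions of the observation bit on a ring random walk."""
    rng = random.Random(seed)
    # y[i][s]: node i's prediction in state s (tabular: one weight per state).
    y = [[0.0] * n_states for _ in range(n_nodes)]
    s = 0
    for _ in range(steps):
        s_next = (s + rng.choice((-1, 1))) % n_states
        o_next = 1.0 if s_next == 0 else 0.0  # assumed observation: 1 only in state 0
        # Node 0 targets the next observation; node i targets node i-1's
        # prediction at the next time step.
        targets = [o_next] + [y[i - 1][s_next] for i in range(1, n_nodes)]
        for i in range(n_nodes):
            y[i][s] += alpha * (targets[i] - y[i][s])
        s = s_next
    return y
```

With a small step size, `train_td_network()` should approach the values returned by `exact_predictions(5, 3)`; node `i` approximates the probability of seeing the observation `i + 1` steps ahead.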

Knowledge Representation in TD Networks (AAAI Symposium on MDPs and POMDPs: Advances and Challenges, 7/26/04). Large (1024 x 768) version.

Toward a Computational Theory of Intelligence -- iCORE talk on Reinforcement Learning and Artificial Intelligence (University of Calgary 2/25/04). Video no longer available.

This talk was to a general university audience (videocast to U. Alberta and U. Lethbridge). To showcase the ideas and power of RL, I collected a bunch of videos from other people's work. It's not often you can do this appropriately, but I think it was ok this time, and certainly it was fun. The accompanying videos:

- Robot dogs learning to walk fast (Kohl & Stone, U. Texas, 2004)
- Finnegan Southey devilsticking
- Robot devilsticking (Stefan Schaal & Chris Atkeson, Univ. of Southern California)
- Helicopter by Ng (Stanford), Kim, Jordan, & Sastry (UC Berkeley) 2004
- Adversarial Robot Learning (Bowling & Veloso, CMU)

All save the last are
quicktimable and will play directly in Safari. The last seems
to require mplayer.

Adapting bias by gradient descent: An incremental version of delta-bar-delta (University of Alberta 2/2/04)

Appropriate bias is widely viewed as the key to efficient learning and generalization. I present a new algorithm, the Incremental Delta-Bar-Delta (IDBD) algorithm, for the learning of appropriate biases based on previous learning experience. The IDBD algorithm is developed for the case of a simple, linear learning system---the LMS or delta rule with a separate learning-rate parameter for each input. The IDBD algorithm adjusts the learning-rate parameters, which are an important form of bias for this system. Because bias in this approach is adapted based on previous learning experience, the appropriate testbeds are drifting or non-stationary learning tasks. For particular tasks of this type, I show that the IDBD algorithm performs better than ordinary LMS and in fact finds the optimal learning rates. The IDBD algorithm extends and improves over prior work by Jacobs and by me in that it is fully incremental and has only a single free parameter. This paper also extends previous work by presenting a derivation of the IDBD algorithm as gradient descent in the space of learning-rate parameters. Finally, I offer a novel interpretation of the IDBD algorithm as an incremental form of hold-one-out cross validation.
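
The per-input update described above can be sketched as follows. This is a minimal sketch of a single IDBD step for a linear unit, following the standard published form of the algorithm. The function name `idbd_update`, the list-based representation, and the example values are assumptions made for illustration.

```python
import math

def idbd_update(w, beta, h, x, target, theta):
    """One IDBD step for a linear (LMS) unit; w, beta, and h are updated in place.

    w:     weights
    beta:  log learning rates, one per input (alpha_i = exp(beta_i))
    h:     memory traces of recent weight changes, one per input
    theta: the single free parameter (meta step size)
    """
    y = sum(wi * xi for wi, xi in zip(w, x))
    delta = target - y
    for i, xi in enumerate(x):
        # Gradient step on the log learning rate: raise it when the current
        # correction agrees in sign with the trace of recent corrections.
        beta[i] += theta * delta * xi * h[i]
        alpha = math.exp(beta[i])
        # Ordinary LMS (delta rule) step with a per-input learning rate.
        w[i] += alpha * delta * xi
        # Decay the trace and add the latest weight change.
        h[i] = h[i] * max(0.0, 1.0 - alpha * xi * xi) + alpha * delta * xi
    return delta

# Hypothetical usage: two inputs, only the first is active.
w, beta, h = [0.0, 0.0], [math.log(0.1), math.log(0.1)], [0.0, 0.0]
for _ in range(2):
    idbd_update(w, beta, h, x=[1.0, 0.0], target=1.0, theta=0.01)
```

In this toy run the two successive errors on the active input have the same sign, so its learning rate increases slightly, while the rate for the inactive input is left unchanged.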

From Markov Decision Processes to Artificial Intelligence (University of Alberta 5/14/03)

The path to general, human-level intelligence may go through Markov decision processes (MDPs), a discrete-time, probabilistic formulation of sequential decision problems in terms of states, actions, and rewards. Developed in the 1950s, MDPs were extensively explored and applied in operations research and engineering before coming to the attention of artificial intelligence researchers about 15 years ago. Much of the new interest has come from the field of reinforcement learning, where novel twists on classical dynamic programming methods have enabled the solution of more and vastly larger problems, such as backgammon (Tesauro, 1995) and elevator control (Crites and Barto, 1996). Despite remaining technical issues, real progress seems to have been made toward general learning and planning methods relevant to artificial intelligence. We suggest that the MDP framework can be extended further, to the threshold of human-level intelligence, by abstracting and generalizing each of its three components - actions, states, and rewards. We briefly survey recent work on temporally abstract actions (Precup, 2000; Parr, 1998), predictive representations of state (Littman et al., 2002), and non-reward subgoals (Sutton, Precup & Singh, 1998) to make this suggestion.
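
For readers new to the formulation, here is a minimal sketch of an MDP as states, actions, and rewards, solved by classical dynamic programming (value iteration). The three-state chain and the names `value_iteration` and `chain` are hypothetical, chosen only to keep the example small.

```python
def value_iteration(P, gamma=0.9, tol=1e-10):
    """P[s][a] is a list of (probability, next_state, reward); {} marks a terminal state."""
    V = {s: 0.0 for s in P}
    while True:
        biggest_change = 0.0
        for s, actions in P.items():
            if not actions:  # terminal state keeps value 0
                continue
            # Bellman optimality backup: best expected one-step return.
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                       for outcomes in actions.values())
            biggest_change = max(biggest_change, abs(best - V[s]))
            V[s] = best
        if biggest_change < tol:
            return V

# Hypothetical chain: 0 -> 1 -> 2 (terminal), reward 1 on reaching state 2.
chain = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {},
}
values = value_iteration(chain)
```

Sweeps of this kind touch every state, which is what makes plain dynamic programming impractical on very large problems.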

RL's Computational Theory of Mind (Rutgers/Indiana 2/14/03)

The reinforcement learning
approach to understanding intelligence is now about 20 years
old, which should be time enough for a mature perspective on
what it is and what it has contributed. Reinforcement learning
methods, particularly temporal-difference learning, have been
widely used in control and robotics applications, in playing
games such as chess and backgammon, in operations research,
and as models of animal learning and neural reward systems.
Holding these diverse applications together, and posing as a
fundamental statement about cognition and decision-making, is
a computational theory (in the sense of Marr) of mind.
Reinforcement learning methods are centered around the
interaction and simultaneous evolution of two primary
functional objects, the policy, which says what to do in each
situation, and the value function, which says how desirable it
is to be in each situation. In this talk, I will survey
several examples of reinforcement learning in the attempt to
make this underlying theory vivid. Finally, I will mention
some of the theory's limitations and shortcomings, and ongoing
efforts to make it relevant to the extremely powerful and
flexible cognition that we see in humans.

Experience-Oriented Artificial Intelligence (Nov 2002)

I propose that experience - the
explicit sequence of actions and sensations over an agent's
life - should play a central role in all aspects of artificial
intelligence. In particular:

1. Knowledge representation
should be in terms of experience. Recent work has shown that
a surprisingly wide range of world knowledge can be
expressed as predictions of experience, enabling it to be
automatically verified and tuned, and grounding its meaning
in data rather than in human understanding.

2. Planning/reasoning should be in terms of experience. It is natural to think of planning as comparing alternative future experiences. General methods, such as dynamic programming, can be used to plan using knowledge expressed in the aforementioned predictive form.

3. State representation should be in terms of experience. Rather than talk about objects and their metric or even topological relationships, we represent states by the predictions that can be made from them. For example, the state "John is in the coffee room" corresponds to the prediction that going to the coffee room will produce the sight of John.

Much here has yet to be worked out. Each of the "should"s
above can also be read as a "could", or even a "perhaps
could". I am optimistic and enthusiastic because of the
potential for developing a compact and powerful theory of AI
in the long run, and for many easy experimental tests in the
short run.

[some of this is joint work with Doina Precup, Michael Littman, Satinder Singh & Peter Stone]

Artificial Intelligence Should Be About Predictions (AT&T 12/7/01)

What keeps the knowledge in an
AI system correct? Usually people do, but that is a dead end;
eventually the AI must do it itself. Building AIs that can
maintain their own knowledge is probably the greatest single
challenge facing AI today.

It would be relatively easy to self-maintain knowledge if it were expressed as predictions: you would predict something and then see what actually happened. In this talk I propose that much of our knowledge of the world can be expressed as predictions that can be verified in this way. Certainly much of our everyday decision-making is based on predictions about alternative courses of action. Even abstract concepts, such as the concept of a chair, can be expressed as predictions, e.g., about what would happen if we try to sit. Emphasizing ideas rather than technical details, I will describe some of the challenges to this predictive view and partial solutions. The main challenge is to be able to express in predictive form the wide variety of knowledge we have of the world. This can be done in large part by allowing the predictions to be conditional on action and to terminate flexibly, as in the "options" framework. A second challenge is to be fully grounded, to relate the meaning of predictions directly to data. Finally, we consider the pragmatic challenges: how to make progress with these ideas? Building a self-maintaining AI based on predictive knowledge is not difficult, but requires new ways of thinking, determination to do it right, and a willingness to proceed slowly.

We Have Not Yet Begun to Learn (19th Reinforcement Learning Workshop, AT&T 9/20/01)

Mind is About Predictions (Northeastern 7/31/01)

In this talk I will describe
recent research in artificial intelligence which has given
greater credence to the old idea that much of our knowledge of
the world is in the form of predictions. From the blooming,
buzzing confusion we extract what is predictable, and in so
doing discover useful concepts and ways of behaving.
Certainly, much of our everyday reasoning and decision making
is based on predictions about alternative courses of action.
Even abstract concepts, such as the concept of a chair, can be
expressed as predictions, e.g., about what will happen if we
try to sit. In this talk I will briefly cover three ideas: 1)
an expanded notion of prediction capable of expressing a broad
range of knowledge, 2) a kind of planning, or reasoning, as
the combination of predictions to yield new predictions, and
3) a way of representing the state of the world (as well as
its dynamics) as predictions. All this suggests that working
with predictions is what the mind is all about---that
predictions are the coin of the mental realm.

(Some of the newer bits of this are joint work with Michael Littman, Doina Precup, and Satinder Singh; also many thanks to David McAllester for constructive criticism.)

Off-policy temporal-difference learning with function approximation (ICML 7/1/01)

We introduce the first
algorithm for off-policy temporal-difference learning that is
stable with linear function approximation. Off-policy learning
is of interest because it forms the basis for popular
reinforcement learning methods such as Q-learning, which has
been known to diverge with linear function approximation, and
because it is critical to the practical utility of
multi-scale, multi-goal, learning frameworks such as options,
HAMs, and MAXQ. Our new algorithm combines TD(lambda) over
state-action pairs with importance sampling ideas from our
previous work. We prove that, given training under any
epsilon-soft policy, the algorithm converges w.p.1 to a close
approximation (as in Tsitsiklis and Van Roy, 1997; Tadic,
2001) to the action-value function for an arbitrary target
policy. Variations of the algorithm designed to reduce
variance introduce additional bias but are also guaranteed
convergent. We also illustrate our method empirically on a
small policy evaluation problem, showing reduced variance
compared to the most obvious importance sampling algorithm for
this problem. Our current results are limited to episodic
tasks with episodes of bounded length.
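
To make the role of importance sampling concrete, the sketch below reweights returns generated by a behavior policy so that their average estimates the value of a different target policy. It is a plain whole-episode importance-sampling estimator, not the TD(lambda) algorithm introduced in the talk. The single-state task and the names `importance_sampled_return` and `sample_episode` are assumptions made for the example.

```python
import random

def importance_sampled_return(episode, target, behavior, gamma=1.0):
    """Weight an episode's return by the product of target/behavior action probabilities.

    episode is a list of (state, action, reward); target[s][a] and behavior[s][a]
    are action probabilities. The behavior policy must give every action nonzero
    probability (for example, any epsilon-soft policy).
    """
    rho, ret, discount = 1.0, 0.0, 1.0
    for s, a, r in episode:
        rho *= target[s][a] / behavior[s][a]
        ret += discount * r
        discount *= gamma
    return rho * ret

def sample_episode(behavior, rng, length=2):
    """Hypothetical task: one state (0); action 1 pays reward 1, action 0 pays 0."""
    episode = []
    for _ in range(length):
        a = 1 if rng.random() < behavior[0][1] else 0
        episode.append((0, a, float(a)))
    return episode

target = {0: {0: 0.0, 1: 1.0}}    # always take action 1
behavior = {0: {0: 0.5, 1: 0.5}}  # uniformly random
rng = random.Random(0)
n = 20_000
estimate = sum(importance_sampled_return(sample_episode(behavior, rng), target, behavior)
               for _ in range(n)) / n
```

The estimate is unbiased for the target policy's value (2 in this toy task), but each weighted return here is either 0 or 8, which shows why variance is a concern for the simplest estimators.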

Overcoming the Curse of Dimensionality with Reinforcement Learning (MIT ORC 4/19/01)

Technological advances in the
last few decades have made computation and memory vastly
cheaper and thus available in massive quantities. The field of
reinforcement learning attempts to take advantage of this
trend when solving large-scale stochastic optimal control
problems. Dynamic programming can solve small instances of
such problems, but suffers from Bellman's "curse of
dimensionality," the tendency of the state space and thus
computational complexity to scale exponentially with the
number of state variables (and thus to quickly exceed even the
"massive" computational resources now available).
Reinforcement learning brings in two new techniques: 1)
parametric approximation of the value function, and 2)
sampling of state trajectories (rather than sweeps through the
state space). These enable finding approximate solutions,
improving in quality with the available computational
resources, on problems too large to even be attempted with
conventional dynamic programming. However, these techniques
also complicate theory, and there remain substantial gaps
between the reinforcement learning methods proven effective
and those that appear most effective in practice. In this
talk, I present results extending the convergence result of
Tsitsiklis and Van Roy for on-policy evaluation with linear
function approximation to the off-policy case, reviving the
possibility of convergence results for value-based off-policy
control methods such as Q-learning. I also present an
application to RoboCup soccer illustrating the linear approach
to function approximation. (This is joint work with Doina
Precup, Satinder Singh, Peter Stone, and Sanjoy Dasgupta.)
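
The two techniques named above, a parametric value function and sampled trajectories, can be combined in a few lines. The sketch below is on-policy TD(0) with a linear value function on a small random walk; it does not reproduce the off-policy result of the talk. The one-hot features, the five-state walk, and the names `features` and `linear_td0` are assumptions for the example.

```python
import random

def features(s, n_states):
    """One-hot feature vector; any other fixed feature map could be substituted."""
    phi = [0.0] * n_states
    phi[s] = 1.0
    return phi

def linear_td0(n_states=5, episodes=5000, alpha=0.02, seed=0):
    """Estimate state values as w . phi(s) from sampled episodes (undiscounted)."""
    rng = random.Random(seed)
    w = [0.0] * n_states
    for _ in range(episodes):
        s = n_states // 2
        while True:
            s2 = s + rng.choice((-1, 1))
            done = s2 < 0 or s2 >= n_states
            r = 1.0 if s2 >= n_states else 0.0  # reward only for exiting on the right
            phi = features(s, n_states)
            v = sum(wi * xi for wi, xi in zip(w, phi))
            v2 = 0.0 if done else sum(
                wi * xi for wi, xi in zip(w, features(s2, n_states)))
            delta = r + v2 - v  # TD error
            for i, xi in enumerate(phi):
                w[i] += alpha * delta * xi
            if done:
                break
            s = s2
    return w
```

Only the states actually visited along sampled trajectories are updated, and memory scales with the number of weights rather than the number of states. In this walk the true value of state `s` is `(s + 1) / (n_states + 1)`.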

The Right Way to do Reinforcement Learning with Function Approximation (NIPS'00 12/2/00)

From Reflex to Reason (Cornell 12/8/00)

How close are we to a
computational understanding of the mind? Perhaps closer than
is usually thought. In this talk I discuss a small set of
principles drawn from reinforcement learning and other parts
of artificial intelligence that cover a broad range of mental
phenomena, from reflexes through various kinds of learning,
planning, and reasoning. These principles include rewards,
value functions, state-space search, and, as I emphasize in
this talk, representing our knowledge of the world as
predictions of future observations. First, I show how
predictive representations provide a new theory of that
simplest of learning phenomena, Pavlovian conditioning or the
learning of reflexes. Second, I briefly outline how
model-based reinforcement learning with mental simulation can
serve as a theory of reasoning. I argue that representing
knowledge as predictions, including the possibility of
action-contingent and temporally indefinite predictions,
solves critical problems in the semantics and grounding of
classical symbolic approaches to knowledge representation.

Toward Grounding Knowledge in Prediction (CEC2000 7/18/00)

Any attempt to build
intelligent machines must come to grips with the question of
knowledge, of what kind of information about the world the
machine stores and manipulates. Traditionally there have been
two approaches, the horns of a dilemma. One uses verbal
statements like "John loves Mary" or "Socrates is a man" whose
meaning is clear only to people, not to machines; such
knowledge is ungrounded. The other uses mathematical
statements like differential equations or transition matrices
which, although clear and grounded, have never seemed adequate
for expressing the commonsense knowledge we all have about the
world and use every day. In this talk we suggest that this
dilemma can be broken by grounding knowledge in an enlarged
notion of conditional prediction. In particular, if we allow
predictions conditional on outcomes (as in Precup, 2000; Parr,
1999) then much more can be expressed as predictions without
losing grounding and mathematical clarity. In addition, this
approach suggests a radical theory of reasoning---combining
knowledge to yield new knowledge---as simple composition of
predictions.

A Least Common Denominator for Temporal Abstraction in Reinforcement Learning (NIPS workshop 12/5/98)

Improved Switching Among Temporally Abstract Actions (NIPS 12/2/98)

In robotics and other control
applications it is commonplace to have a pre-existing set of
controllers for solving subtasks, perhaps hand-crafted or
previously learned or planned, and still face a difficult
problem of how to choose and switch among the controllers to
solve an overall task as well as possible. In this paper we
present a framework based on Markov decision processes and
semi-Markov decision processes for phrasing this problem, a
basic theorem regarding the improvement in performance that
can be obtained by switching flexibly between given
controllers, and example applications of the theorem. In
particular, we show how an agent can plan with these
high-level controllers and then use the results of such
planning to find an even better plan, by modifying the
existing controllers, with negligible additional cost and no
re-planning. In one of our examples, the complexity of the
problem is reduced from 24 billion state-action pairs to less
than a million state-controller pairs.

Reinforcement Learning: How Far Can It Go? (Past, Present, and Future) (ICML/COLT/UAI 7/25/98, Extended abstract)

Between MDPs and Semi-MDPs (Stanford 3/5/98)

A key challenge for AI is how to
learn, plan, and represent knowledge at multiple levels of
temporal abstraction. In this talk I develop an approach based
on the mathematical framework of reinforcement learning and
Markov decision processes (MDPs). The usual framework is
extended to include closed-loop multi-step *options*---whole
courses of behavior that may be temporally extended,
stochastic, and contingent on events. Examples of options
include picking up an object, going to lunch, and traveling to
a distant city, as well as primitive actions such as muscle
twitches and joint torques. Options can be used
interchangeably with primitive actions in reinforcement
learning and planning methods, and can be analyzed in terms of
a generalized kind of MDP known as a semi-Markov decision
process (SMDP) (e.g., Puterman, 1994; Bradtke and Duff, 1995;
Parr, 1998; Precup and Sutton, 1997). In this talk I focus on
the interplay between the MDP and SMDP levels of analysis. I
show how a set of options can be improved by changing their
termination conditions to improve over SMDP planning methods
with no additional cost. I also present novel *intra-option*
temporal-difference methods that substantially improve
over SMDP methods. Finally, I discuss how options themselves
can be learned, introducing a new notion of subgoal and
subtask into reinforcement learning. Overall, I argue that
options and models of options provide hitherto missing aspects
of a powerful, clear, and expressive framework for
representing and organizing knowledge. (Joint work with Doina
Precup and Satinder Singh.)
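
A minimal sketch of how an option can stand in for a primitive action at the SMDP level: the option runs until it terminates, and the backup discounts by the number of steps it took. This shows SMDP-style Q-learning only, not the intra-option methods of the talk. The corridor task and the names `Option`, `run_option`, `smdp_q_update`, and `go_right` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    policy: Callable[[int], int]         # state -> primitive action
    termination: Callable[[int], float]  # state -> probability of terminating there

def run_option(option, state, step, gamma, rng):
    """Execute an option to termination; return (next_state, discounted reward, duration)."""
    total, discount, k = 0.0, 1.0, 0
    while True:
        state, reward = step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        k += 1
        if rng.random() < option.termination(state):
            return state, total, k

def smdp_q_update(Q, s, o, r, s2, k, alpha, gamma):
    """SMDP Q-learning backup: bootstrap from s2, discounted by gamma ** k."""
    best_next = max(Q[s2].values())
    Q[s][o] += alpha * (r + gamma ** k * best_next - Q[s][o])

def corridor_step(state, action):
    """Hypothetical corridor with states 0..4; reward 1 on first arriving at state 4."""
    nxt = min(4, max(0, state + action))
    return nxt, (1.0 if nxt == 4 and state != 4 else 0.0)

# Option: keep moving right, terminate on reaching the right end.
go_right = Option(policy=lambda s: 1,
                  termination=lambda s: 1.0 if s == 4 else 0.0)
```

Starting from state 0 with `gamma = 0.9`, `run_option(go_right, 0, corridor_step, 0.9, random.Random(0))` takes four steps and returns a discounted reward of `0.9 ** 3`; that single outcome can then be passed to `smdp_q_update` exactly as a one-step transition would be.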

Reinforcement Learning: A Tutorial (GP-98 7/23/98)

Reinforcement learning is
learning about, from, and while interacting with an environment
in order to achieve a goal. In other words, it is a relatively
direct model of the learning that people and animals do in
their normal lives. In the last two decades, this age-old
problem has come to be much better understood by integrating
ideas from psychology, optimal control, artificial neural
networks, and artificial intelligence. New methods and
combinations of methods have enabled much better solutions to
large-scale applications than had been possible by all other
means. This tutorial will provide a top-down introduction to
the field, covering Markov decision processes and approximate
value functions as the formulation of the problem, and dynamic
programming, temporal-difference learning, and Monte Carlo
methods as the principal solution methods. The role of neural
networks and planning will also be covered. The emphasis will
be on understanding the capabilities and appropriate role of
each class of methods within an integrated system for
learning and decision making.

Reinforcement Learning: Lessons for Artificial Intelligence (IJCAI 8/28/97)

The field of reinforcement
learning has recently produced world-class applications and,
as we survey in this talk, scientific insights that may be
relevant to all of AI. In my view, the main things that we
have learned from reinforcement learning are 1) the power of
learning from experience as opposed to labeled training
examples, 2) the central role of *modifiable* evaluation functions
in organizing sequential behavior, and 3) that learning and
planning could be radically similar.

Reinforcement Learning and Information Access (AAAI-SS 3/26/96)

Constructive Induction Needs a Methodology based on Continuing Learning (ICML94, Workshop on Constructive Induction, panel remarks)