 
 
 
 
 
 
 
  
TD methods learn their estimates in part on the basis of other estimates. They learn a guess from a guess--they bootstrap. Is this a good thing to do? What advantages do TD methods have over Monte Carlo and DP methods? Developing and answering such questions will take the rest of this book and more. In this section we briefly anticipate some of the answers.
Obviously, TD methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions.
The next most obvious advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an on-line, fully incremental fashion. With Monte Carlo methods one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step. Surprisingly often this turns out to be a critical consideration. Some applications have very long episodes, so that delaying all learning until an episode's end is too slow. Other applications are continuing tasks and have no episodes at all. Finally, as we noted in the previous chapter, some Monte Carlo methods must ignore or discount episodes on which experimental actions are taken, which can greatly slow learning. TD methods are much less susceptible to these problems because they learn from each transition regardless of what subsequent actions are taken.
But are TD methods sound?  Certainly it is convenient to learn one guess from the
next, without waiting for an actual outcome, but can we still guarantee
convergence to the correct answer?  Happily, the answer is yes.  For any fixed
policy  , the TD algorithm described above has been proved to converge to
, the TD algorithm described above has been proved to converge to
 , in the mean for a constant step-size parameter if it is sufficiently small,
and with probability 1 if the step-size parameter decreases according to the
usual stochastic approximation conditions (2.8).  Most
convergence proofs apply only to the table-based case of the algorithm presented
above (6.2), but some also apply to the case of general linear function
approximation.  These results are discussed in a more general setting in the
next two chapters.
, in the mean for a constant step-size parameter if it is sufficiently small,
and with probability 1 if the step-size parameter decreases according to the
usual stochastic approximation conditions (2.8).  Most
convergence proofs apply only to the table-based case of the algorithm presented
above (6.2), but some also apply to the case of general linear function
approximation.  These results are discussed in a more general setting in the
next two chapters.
If both TD and Monte Carlo methods converge asymptotically to the correct predictions, then
a natural next question is "Which gets there first?"  In other words, which
method learns faster?  Which makes the more efficient use of limited data?  At
the current time this is an open question in the sense that no one has been able
to prove mathematically that one method converges faster than the other.  In
fact, it is not even clear what is the most appropriate formal way to phrase
this question!  In practice, however, TD methods have usually been found to
converge faster than constant- MC methods on stochastic tasks, as illustrated
in the following example.
 MC methods on stochastic tasks, as illustrated
in the following example.
Example 6.2: Random Walk In this example we empirically compare the prediction abilities of TD(0) and constant-
 MC applied to the small Markov
process shown in Figure  
6.5.  All episodes start in
the center state,
 MC applied to the small Markov
process shown in Figure  
6.5.  All episodes start in
the center state,  , and proceed either left or right by one state on
each step, with equal probability.  This behavior is presumably due to the
combined effect of a fixed policy and an environment's state-transition
probabilities, but we do not care which;  we are concerned only with predicting
returns however they are generated.  Episodes terminate either on the extreme left
or the extreme right.  When an episode terminates on the right a reward of
, and proceed either left or right by one state on
each step, with equal probability.  This behavior is presumably due to the
combined effect of a fixed policy and an environment's state-transition
probabilities, but we do not care which;  we are concerned only with predicting
returns however they are generated.  Episodes terminate either on the extreme left
or the extreme right.  When an episode terminates on the right a reward of  occurs; all other rewards are zero. For example, a typical walk might consist of the
following state-and-reward sequence:
occurs; all other rewards are zero. For example, a typical walk might consist of the
following state-and-reward sequence:
 .  Because this task is undiscounted and episodic, the
true value of each state is the probability of terminating on the right if
starting from that state.  Thus, the true value of the center state is
.  Because this task is undiscounted and episodic, the
true value of each state is the probability of terminating on the right if
starting from that state.  Thus, the true value of the center state is
 .  The true values of all the states,
.  The true values of all the states,  through
 through  , are
, are  , and
, and  .  Figure  
6.6
shows the values learned by TD(0) approaching the true values as
more episodes are experienced.  Averaging over many episode sequences,
Figure  
6.7 shows the average error in the predictions found by
TD(0) and constant-
.  Figure  
6.6
shows the values learned by TD(0) approaching the true values as
more episodes are experienced.  Averaging over many episode sequences,
Figure  
6.7 shows the average error in the predictions found by
TD(0) and constant- MC, for a variety of values of
 MC, for a variety of values of  , as a function of number of
episodes. In all cases the approximate value function was initialized to the
intermediate value
, as a function of number of
episodes. In all cases the approximate value function was initialized to the
intermediate value
 , for all
, for all  .  The TD method is consistently better than the MC method on this task
over this number of episodes.
.  The TD method is consistently better than the MC method on this task
over this number of episodes.

 
 
 
Exercise 6.1 This is an exercise to help develop your intuition about why TD methods are often more efficient than Monte Carlo methods. Consider the driving home example and how it is addressed by TD and Monte Carlo methods. Can you imagine a scenario in which a TD update would be better on average than an Monte Carlo update? Give an example scenario--a description of past experience and a current state--in which you would expect the TD update to be better. Here's a hint: Suppose you have lots of experience driving home from work. Then you move to a new building and a new parking lot (but you still enter the highway at the same place). Now you are starting to learn predictions for the new building. Can you see why TD updates are likely to be much better, at least initially, in this case? Might the same sort of thing happen in the original task?
Exercise 6.2 From Figure 6.6, it appears that the first episode results in a change in only
 .  What does this tell you about what happened on the first
episode?  Why was only the estimate for this one state changed?  By exactly how much
was it changed?
.  What does this tell you about what happened on the first
episode?  Why was only the estimate for this one state changed?  By exactly how much
was it changed?
Exercise 6.3 Do you think that by choosing the step-size parameter,
 , differently,
either algorithm could have done significantly better than shown in
Figure  
6.7?  Why or why not?
, differently,
either algorithm could have done significantly better than shown in
Figure  
6.7?  Why or why not?
Exercise 6.4 In Figure 6.7, the RMS error of the TD method seems to go down and then up again, particularly at high
 's.  What could have caused this?  Do
you think this always occurs, or might it be a function of how the approximate
value function was initialized?
's.  What could have caused this?  Do
you think this always occurs, or might it be a function of how the approximate
value function was initialized?
Exercise 6.5 Above we stated that the true values for the random walk task are
 , and
, and  , for states
, for states  through
 through
 .  Describe at least two different ways that these could have been computed.
Which would you guess we actually used?  Why?
.  Describe at least two different ways that these could have been computed.
Which would you guess we actually used?  Why?
 
 
 
 
 
 
