What is the space of methods lying between Monte Carlo and TD methods? Consider
estimating $v_\pi$ from sample episodes generated using $\pi$. Monte Carlo
methods perform a backup for each state based on the entire sequence of observed
rewards from that state until the end of the episode. The backup of simple TD
methods, on the other hand, is based on just the one next reward, using the value
of the state one step later as a proxy for the remaining rewards. One kind of
intermediate method, then, would perform a backup based on an intermediate number
of rewards: more than one, but less than all of them until termination. For
example, a two-step backup would be based on the first two rewards and the estimated
value of the state two steps later. Similarly, we could have three-step backups,
four-step backups, and so on. Figure
7.1 diagrams the spectrum of
*$n$-step backups* for $v_\pi$, with one-step, simple TD backups on the left and
up-until-termination Monte Carlo backups on the right.

The methods that use $n$-step backups are still TD methods
because they still change an earlier estimate based on how it differs from a
later estimate. Now the later estimate is not one step later, but $n$
steps later. Methods in which the temporal difference extends over $n$ steps
are called *$n$-step TD methods*. The TD methods introduced in the
previous chapter all use one-step backups, and henceforth we call them *one-step TD methods*.

More formally, consider the backup
applied to state $S_t$ as a result of the state–reward sequence,
$S_t, R_{t+1}, S_{t+1}, R_{t+2}, \ldots, R_T, S_T$ (omitting the actions for
simplicity). We know that in Monte Carlo backups the estimate of
$v_\pi(S_t)$ is updated in the direction of the complete return:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T,$$

where $T$ is the last time step of the episode. Let us call this quantity the
*target* of the backup. Whereas in Monte Carlo backups the target is the
complete return, in one-step backups the target is the first reward plus the
discounted estimated value of the next state:

$$G_t^{(1)} = R_{t+1} + \gamma V_t(S_{t+1}).$$
This makes sense because $\gamma V_t(S_{t+1})$ takes the place of the remaining terms $\gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$, as we discussed in the previous chapter. Our point now is that this idea makes just as much sense after two steps as it does after one. The two-step target is

$$G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_t(S_{t+2}),$$

where now $\gamma^2 V_t(S_{t+2})$ takes the place of the terms $\gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots + \gamma^{T-t-1} R_T$. In general, the $n$-step target is

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_t(S_{t+n}).$$

This quantity is sometimes called the "corrected $n$-step truncated return" because it is a return truncated after $n$ steps and then approximately corrected for the truncation by adding the estimated value of the $n$th next state. That terminology is descriptive but a bit long. We instead refer to $G_t^{(n)}$ simply as the *$n$-step return* at time $t$.

Of course, if the episode ends in less than $n$ steps, then the truncation in an $n$-step return occurs at the episode's end, resulting in the conventional complete return. In other words, if $t + n \ge T$, then $G_t^{(n)} = G_t$. Thus, the last $n$ $n$-step returns of any episode are always complete returns, and an infinite-step return is always a complete return. This definition enables us to treat Monte Carlo methods as the special case of infinite-step returns. All of this is consistent with the tricks for treating episodic and continuing tasks equivalently that we introduced in Section 3.4. There we chose to treat the terminal state as a state that always transitions to itself with zero reward. Under this trick, all $n$-step returns that last up to or past termination have the same value as the complete return.
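The definition above, including its truncation at the episode's end, can be transcribed directly. This is a minimal sketch; the array conventions are assumptions for illustration: `rewards[k]` holds $R_{k+1}$, and `values[k]` holds the current estimate $V(S_k)$, with the terminal state's value fixed at zero.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute the n-step return G_t^(n) from a recorded episode.

    rewards[k] is R_{k+1}; values[k] is the estimate V(S_k), with
    values having one more entry than rewards (terminal value = 0).
    """
    T = len(rewards)              # last time step of the episode
    horizon = min(t + n, T)       # truncate at the episode's end
    g = 0.0
    for k in range(t, horizon):
        g += (gamma ** (k - t)) * rewards[k]
    if t + n < T:                 # correct the truncation with V(S_{t+n})
        g += (gamma ** n) * values[t + n]
    return g
```

Note that for $t + n \ge T$ the bootstrapping term is skipped and the function returns the complete return, exactly as the text requires.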

An *$n$-step backup* is defined to be a backup toward the $n$-step return. In
the tabular, state-value case, the increment to $V_t(S_t)$ (the estimated value of
$S_t$ at time $t$), due to an $n$-step backup of $S_t$, is defined by

$$\Delta_t(S_t) = \alpha \left[ G_t^{(n)} - V_t(S_t) \right],$$

where $\alpha$ is a positive step-size parameter, as usual. Of course, the increments to the estimated values of the other states are zero: $\Delta_t(s) = 0$, for all $s \neq S_t$. We define the $n$-step backup in terms of an increment, rather than as a direct update rule as we did in the previous chapter, in order to distinguish two different ways of making the updates. In *on-line updating*, the increments are applied as soon as they are computed, during the episode; in *off-line updating*, the increments are accumulated and the estimated values are changed only once, at the end of the episode.
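As a concrete sketch of tabular $n$-step TD prediction, the loop below backs each visited state up toward its $n$-step return on a toy random walk. The environment, function names, and parameters are assumptions for illustration, not from the text, and for simplicity the trajectory is recorded first and the increments are then applied in time order.

```python
import random

def generate_walk(rng, n_states=5):
    """Toy random walk (illustrative assumption): start in the middle,
    step left/right uniformly at random; terminate off either end, with
    reward +1 only for exiting on the right. Returns (states, rewards)
    where states = [S_0, ..., S_{T-1}] and rewards = [R_1, ..., R_T]."""
    s = n_states // 2
    states, rewards = [s], []
    while True:
        s += 1 if rng.random() < 0.5 else -1
        if s < 0:
            rewards.append(0.0); return states, rewards
        if s >= n_states:
            rewards.append(1.0); return states, rewards
        states.append(s); rewards.append(0.0)

def n_step_td_prediction(n, alpha, gamma, episodes, n_states=5, seed=0):
    """Tabular n-step TD: back each V(S_tau) up toward G_tau^(n)."""
    rng = random.Random(seed)
    V = [0.0] * n_states
    for _ in range(episodes):
        states, rewards = generate_walk(rng, n_states)
        T = len(rewards)
        for tau in range(T):                      # state being updated
            G = sum(gamma ** (k - tau) * rewards[k]
                    for k in range(tau, min(tau + n, T)))
            if tau + n < T:                       # correct the truncation
                G += gamma ** n * V[states[tau + n]]
            V[states[tau]] += alpha * (G - V[states[tau]])
    return V
```

With $\gamma = 1$ the true values of this five-state walk are $1/6, 2/6, \ldots, 5/6$, and the learned estimates should approach them, increasing from left to right.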

The expected value of all $n$-step returns is guaranteed to improve in a certain
way over the current value function as an approximation to the true value
function. For any $V$, the expected value of the $n$-step return
using $V$ is guaranteed to be a better estimate of $v_\pi$ than $V$ is, in a
worst-state sense. That is, the worst error under the new estimate is guaranteed
to be less than or equal to $\gamma^n$ times the worst error under $V$:

$$\max_s \left| \mathbb{E}_\pi\!\left[ G_t^{(n)} \,\middle|\, S_t = s \right] - v_\pi(s) \right| \;\le\; \gamma^n \max_s \left| V(s) - v_\pi(s) \right|.$$

This is called the *error reduction property* of $n$-step returns.
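The bound can be checked numerically on a small Markov reward process. The two-state chain below is an illustrative assumption; the check uses the recursion $\mathbb{E}[G^{(n)}] = r + \gamma P\, \mathbb{E}[G^{(n-1)}]$ with $\mathbb{E}[G^{(0)}] = V$, which follows from the definition of the $n$-step return.

```python
# Two-state MRP (illustrative assumption): from each state, switch with
# probability 0.7 and stay with probability 0.3; reward 1 in state 0.
P = [[0.3, 0.7], [0.7, 0.3]]
r = [1.0, 0.0]
gamma = 0.9

def mat_vec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def true_values(P, r, gamma, iters=1000):
    """Solve v = r + gamma * P v by value iteration (gamma < 1)."""
    v = [0.0] * len(r)
    for _ in range(iters):
        pv = mat_vec(P, v)
        v = [r[i] + gamma * pv[i] for i in range(len(v))]
    return v

def expected_n_step_return(P, r, gamma, V, n):
    """E[G^(n) | S = s] for each s, by backward recursion over n steps."""
    g = list(V)                      # value correcting the truncation
    for _ in range(n):
        pg = mat_vec(P, g)
        g = [r[i] + gamma * pg[i] for i in range(len(g))]
    return g

v = true_values(P, r, gamma)
V = [0.0, 0.0]                       # a deliberately poor estimate
for n in (1, 2, 5):
    g = expected_n_step_return(P, r, gamma, V, n)
    lhs = max(abs(g[i] - v[i]) for i in range(2))
    rhs = gamma ** n * max(abs(V[i] - v[i]) for i in range(2))
    assert lhs <= rhs + 1e-9         # error reduction property holds
```

Here the worst-state error of the expected $n$-step return shrinks by at least the factor $\gamma^n$ relative to the worst-state error of $V$, for every $n$ tried.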

Nevertheless, $n$-step TD methods are rarely used because they are inconvenient to implement. Computing $n$-step returns requires waiting $n$ steps to observe the resultant rewards and states. For large $n$, this can become problematic, particularly in control applications. The significance of $n$-step TD methods is primarily for theory and for understanding related methods that are more conveniently implemented. In the next few sections we use the idea of $n$-step TD methods to explain and justify eligibility trace methods.