7.7 Eligibility Traces for Actor-Critic Methods
In this section we describe how to extend the actor-critic methods
introduced in Section 6.6 to use eligibility traces.
This is fairly straightforward.
The critic part of an actor-critic method is simply on-policy
learning of 
.  The TD(
) algorithm can be used for that, with one
eligibility trace for each state. The actor part needs to use an eligibility trace
for each state-action pair.  Thus, an actor-critic method needs two sets of traces,
one for each state and one for each state-action pair.
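
To make the two sets of traces concrete, here is a minimal Python sketch (not the book's pseudocode) of the critic's TD($\lambda$) update with one accumulating trace per state, alongside a separate trace table over state-action pairs that the actor sketches below will use. The names num_states, num_actions, alpha (critic step size), gamma, and lam are assumed, illustrative hyperparameters.

```python
import numpy as np

num_states, num_actions = 10, 4
alpha, gamma, lam = 0.1, 0.99, 0.9

V = np.zeros(num_states)                      # critic: state-value estimates
e_state = np.zeros(num_states)                # one eligibility trace per state (critic)
e_sa = np.zeros((num_states, num_actions))    # one trace per state-action pair (actor)

def critic_update(s, r, s_next):
    """One TD(lambda) step for the critic; returns the TD error delta_t (7.6)."""
    delta = r + gamma * V[s_next] - V[s]      # TD error
    e_state[s] += 1.0                         # accumulating trace for the visited state
    V[:] += alpha * delta * e_state           # update all states through their traces
    e_state[:] *= gamma * lam                 # decay traces for the next step
    return delta
```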
Recall that the one-step actor-critic method updates the
actor by

$$
p_{t+1}(s,a) =
\begin{cases}
p_t(s,a) + \beta \delta_t & \text{if } a = a_t \text{ and } s = s_t; \\
p_t(s,a) & \text{otherwise,}
\end{cases}
$$

where $\delta_t$ is the TD($\lambda$) error (7.6), and
$p_t(s,a)$ is the preference for taking action $a$ at time $t$
if in state $s$. The preferences determine the policy via, for example, a
softmax method (Section 2.3). We generalize the
above equation to use eligibility traces as follows:

$$
p_{t+1}(s,a) = p_t(s,a) + \beta \, \delta_t \, e_t(s,a), \qquad (7.14)
$$
where $e_t(s,a)$ denotes the trace at time $t$ for state-action pair
$s,a$. For the simplest case mentioned above, the trace can be
updated as in Sarsa($\lambda$).
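
As one concrete reading of (7.14) with Sarsa($\lambda$)-style accumulating traces, here is a sketch of the actor's update. The step size beta, the preference table p, and the softmax_policy helper are illustrative assumptions; e_sa, gamma, and lam reuse the names from the sketch above.

```python
beta = 0.1
p = np.zeros((num_states, num_actions))       # action preferences p_t(s, a)

def softmax_policy(s):
    """Action probabilities pi_t(s, .) from the preferences (softmax, Section 2.3)."""
    prefs = p[s] - p[s].max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def actor_update(s, a, delta):
    """Apply (7.14) with a Sarsa(lambda)-style accumulating trace."""
    e_sa[s, a] += 1.0                         # increment the taken pair's trace by 1
    p[:] += beta * delta * e_sa               # p <- p + beta * delta_t * e_t for all pairs
    e_sa[:] *= gamma * lam                    # gamma*lambda decay for all pairs
```

An episode loop would call critic_update to obtain $\delta_t$ and then pass that error to actor_update for the state-action pair just taken.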
In Section 6.6 we also
discussed a more sophisticated actor-critic method that uses the
update

$$
p_{t+1}(s,a) =
\begin{cases}
p_t(s,a) + \beta \delta_t \bigl[1 - \pi_t(s_t,a_t)\bigr] & \text{if } a = a_t \text{ and } s = s_t; \\
p_t(s,a) & \text{otherwise.}
\end{cases}
$$

To generalize this equation to eligibility traces we can use the
same update (7.14) with a slightly different
trace. Rather than incrementing the trace by 1 each time a
state-action pair occurs, it is updated by $1 - \pi_t(s_t,a_t)$:

$$
e_t(s,a) =
\begin{cases}
\gamma \lambda e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s = s_t \text{ and } a = a_t; \\
\gamma \lambda e_{t-1}(s,a) & \text{otherwise,}
\end{cases} \qquad (7.15)
$$

for all $s$, $a$.
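
Under the same illustrative assumptions as above, the more sophisticated variant only changes how the trace of the taken pair is incremented, as in (7.15):

```python
def actor_update_modified(s, a, delta):
    """Same update (7.14), but the taken pair's trace grows by 1 - pi_t(s_t, a_t) per (7.15)."""
    pi_sa = softmax_policy(s)[a]              # current probability of the action taken
    e_sa[s, a] += 1.0 - pi_sa                 # increment by 1 - pi_t(s_t, a_t)
    p[:] += beta * delta * e_sa               # p <- p + beta * delta_t * e_t(s, a)
    e_sa[:] *= gamma * lam                    # gamma*lambda decay for all pairs
```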
 
 
 
  