Another class of effective learning methods for the $n$-armed bandit
problem are *pursuit* methods. Pursuit methods maintain both
action-value estimates *and* action preferences, with the preferences
continually "pursuing" the action that is greedy according to the
current action-value estimates. In the simplest pursuit method, the action
preferences are the probabilities, $\pi_t(a)$, with which each action, $a$, is
selected on play $t$.

After each play, the probabilities are updated so as to
make the greedy action more likely to be selected. After the $t$th play, let
$a^*_{t+1} = \arg\max_a Q_t(a)$ denote the greedy action (or a random sample
from the greedy actions if there is more than one) for the $(t+1)$st play. Then
the probability of selecting $a^*_{t+1}$
is incremented a fraction, $\beta$, of the way toward 1:

$$\pi_{t+1}(a^*_{t+1}) = \pi_t(a^*_{t+1}) + \beta \left[ 1 - \pi_t(a^*_{t+1}) \right],$$

while the probabilities of selecting the other actions are decremented toward zero:

$$\pi_{t+1}(a) = \pi_t(a) + \beta \left[ 0 - \pi_t(a) \right], \qquad \text{for all } a \neq a^*_{t+1}.$$

The action values, $Q_t(a)$, are updated in one of the ways discussed in the preceding sections, for example, to be sample averages of the observed rewards, using (2.1).
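The two probability updates can be sketched in code. The following is a minimal sketch (not the book's own implementation), assuming the probabilities and value estimates are held in NumPy arrays `pi` and `Q`:

```python
import numpy as np

def pursuit_update(pi, Q, beta, rng):
    """One pursuit update of the selection probabilities pi.

    Moves pi a fraction beta of the way toward the indicator
    vector of the greedy action under the current estimates Q.
    """
    # Greedy action for the next play (ties broken at random)
    greedy = rng.choice(np.flatnonzero(Q == Q.max()))
    # Decrement every probability toward 0: pi += beta * (0 - pi)
    pi += beta * (0.0 - pi)
    # Then add beta to the greedy action, giving the net effect
    # pi[greedy] += beta * (1 - old pi[greedy]), i.e. a step toward 1
    pi[greedy] += beta
    return pi
```

Because the same fraction `beta` of the total probability mass is removed and then returned to the greedy action, the entries of `pi` continue to sum to 1.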

Figure 2.6 shows the performance of the pursuit algorithm described above when the action values are estimated using sample averages (incrementally computed using (2.1)). In these results, the initial action probabilities were $\pi_0(a) = \frac{1}{n}$, for all $a$, and the parameter $\beta$ was 0.01. For comparison, we also show the performance of an $\varepsilon$-greedy method with action values also estimated using sample averages. The performance of the reinforcement comparison algorithm from the previous section is also shown. Although the pursuit algorithm performs the best of these three on this task at these parameter settings, the ordering could well be different in other cases. All three of these methods appear to have their uses and advantages.
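A complete run of this kind can be sketched as follows. This is a hypothetical illustration on a single randomly generated $n$-armed bandit task (it does not reproduce Figure 2.6, which averages over many tasks); the task setup, number of plays, and seed are assumptions:

```python
import numpy as np

def run_pursuit(n=10, plays=1000, beta=0.01, seed=0):
    """Pursuit method with sample-average value estimates on one
    randomly generated n-armed bandit task (illustrative only)."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, n)      # true action values
    Q = np.zeros(n)                       # sample-average estimates
    counts = np.zeros(n, dtype=int)
    pi = np.full(n, 1.0 / n)              # initial probabilities 1/n
    total_reward = 0.0
    for t in range(plays):
        a = rng.choice(n, p=pi)           # select action per pi_t
        r = rng.normal(q_true[a], 1.0)    # noisy reward
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]    # incremental sample average
        # Pursuit update: step pi toward the greedy action
        greedy = rng.choice(np.flatnonzero(Q == Q.max()))
        pi += beta * (0.0 - pi)
        pi[greedy] += beta
        total_reward += r
    return total_reward / plays, pi
```

Over many plays the probability of the action judged greedy is repeatedly pushed toward 1, so `pi` concentrates on one action while the others decay toward 0.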