We have presented in this chapter some simple ways of balancing exploration and exploitation. The ε-greedy methods choose a random action a small fraction of the time, the softmax methods grade their action probabilities according to the current action-value estimates, and the pursuit methods keep taking steps toward the current greedy action. Are these simple methods really the best we can do in terms of practically useful algorithms? So far, the answer appears to be "yes." Despite their simplicity, in our opinion the methods presented in this chapter can fairly be considered the state of the art. There are more sophisticated methods, but their complexity and assumptions make them impractical for the full reinforcement learning problem that is our real focus. Starting in Chapter 5 we present learning methods for solving the full reinforcement learning problem that use in part the simple methods explored in this chapter.

Although the simple methods explored in this chapter may be the best we can do at present, they are far from a fully satisfactory solution to the problem of balancing exploration and exploitation. We conclude this chapter with a brief look at some of the current ideas that, while not yet practically useful, may point the way toward better solutions.

One promising idea is to use estimates of the uncertainty of the action-value estimates to direct and encourage exploration. For example, suppose there are two actions estimated to have values slightly less than that of the greedy action, but that differ greatly in their degree of uncertainty. One estimate is nearly certain; perhaps that action has been tried many times and many rewards have been observed. The uncertainty for this action's estimated value is so low that its true value is very unlikely to be higher than the value of the greedy action. The other action is known less well, and the estimate of its value is very uncertain. The true value of this action could easily be better than that of the greedy action. Obviously, it makes more sense to explore the second action than the first.

This line of thought leads to *interval estimation* methods. These
methods estimate for each action a confidence interval of the action's
value. That is, rather than learning that the action's value is approximately
10, they learn that it is between 9 and 11 with, say, 95% confidence. The
action selected is then the action whose confidence interval has the highest
upper limit. This encourages exploration of actions that are uncertain *and* have a chance of ultimately being the best action. In some cases one can
obtain guarantees that the optimal action has been found with confidence equal
to the confidence factor (e.g., the 95%). Unfortunately, interval estimation
methods are problematic in practice because of the complexity of the
statistical methods used to estimate the confidence intervals. Moreover, the
underlying statistical assumptions required by these methods are often not
satisfied. Nevertheless, the idea of using confidence intervals, or some other
measure of uncertainty, to encourage exploration of particular actions is sound
and appealing.
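To make the interval estimation idea concrete, here is a minimal sketch, not an algorithm from the text: it approximates each action's 95% confidence interval with a normal upper limit (sample mean plus z standard errors) and selects the action whose interval has the highest upper limit. The function name, the normal approximation, and the Welford-style variance bookkeeping are our own illustrative choices.

```python
import math

def interval_estimation_action(counts, means, m2s, z=1.96):
    """Pick the action whose approximate confidence interval has the
    highest upper limit.

    counts[a] -- number of times action a has been tried
    means[a]  -- sample mean reward of action a
    m2s[a]    -- sum of squared deviations from the mean (Welford's M2)
    z         -- normal critical value (1.96 ~ 95% confidence)

    Actions tried fewer than twice get an infinite upper limit, so each
    action is explored before any interval is trusted.
    """
    best_action, best_upper = None, -math.inf
    for a, n in enumerate(counts):
        if n < 2:
            upper = math.inf  # too little data: force exploration
        else:
            variance = m2s[a] / (n - 1)                 # sample variance
            upper = means[a] + z * math.sqrt(variance / n)  # upper limit
        if upper > best_upper:
            best_action, best_upper = a, upper
    return best_action
```

Note how the rule favors an uncertain action over a slightly better-known one: a high-variance estimate widens the interval and raises the upper limit, exactly the behavior motivated above.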

There is also a well-known algorithm for computing the Bayes optimal
way to balance exploration and exploitation. This method is
computationally intractable when done exactly, but there may be efficient
ways to approximate it. In this method we assume that we know the
distribution of problem instances, that is, the probability of each
possible set of true action values. Given any action selection, we can
then compute the probability of each possible immediate reward and the
resultant posterior probability distribution over action values. This
evolving distribution becomes the *information state* of the
problem. Given a horizon, say 1000 plays, one can consider all possible
actions, all possible resulting rewards, all possible next actions, all
next rewards, and so on for all 1000 plays. Given the assumptions, the
rewards and probabilities of each possible chain of events can be
determined, and one need only pick the best. But the tree of
possibilities grows extremely rapidly; even if there are only two actions
and two rewards, the tree will have 2^2000
leaves (four branches per play, over 1000 plays). This approach effectively turns the bandit problem
into an instance of the full reinforcement learning problem. In the end,
we may be able to use reinforcement learning methods to approximate
this optimal solution. But that is a topic for current research and beyond
the scope of this introductory book.
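The information-state computation can be sketched for a tiny case. The following is our own illustration, not from the text: a two-armed Bernoulli bandit with uniform Beta(1,1) priors, where each information state records the successes and failures observed for each arm, and an exhaustive lookahead evaluates every possible chain of actions and rewards up to the horizon.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def bayes_value(state, horizon):
    """Expected total reward of the Bayes-optimal policy for a two-armed
    Bernoulli bandit, by exhaustive lookahead over information states.

    state   -- ((successes, failures) for arm 0, same for arm 1)
    horizon -- number of plays remaining

    With a uniform Beta(1,1) prior, the posterior mean of an arm with
    s successes and f failures is (s + 1) / (s + f + 2).
    """
    if horizon == 0:
        return 0.0
    best = 0.0
    for arm in range(2):
        s, f = state[arm]
        p = (s + 1) / (s + f + 2)  # posterior probability of reward 1
        win, lose = list(state), list(state)
        win[arm] = (s + 1, f)      # information state after a success
        lose[arm] = (s, f + 1)     # information state after a failure
        value = (p * (1 + bayes_value(tuple(win), horizon - 1))
                 + (1 - p) * bayes_value(tuple(lose), horizon - 1))
        best = max(best, value)
    return best
```

The raw lookahead tree has 4^h leaves for a horizon of h plays; memoizing repeated information states helps considerably, but exact computation remains practical only for small problems, which is the intractability noted above.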

The classical solution to balancing exploration and exploitation in n-armed
bandit problems is to compute special functions called *Gittins indices*.
These provide an optimal solution to a kind of bandit problem more general
than the one considered here, but one that assumes the prior distribution over
possible problems is known. Unfortunately, neither the theory nor the computational
tractability of this method appears to generalize to the full reinforcement
learning problem that we consider in the rest of the book.