Next: 2.6 Tracking a Nonstationary Up: 2. Evaluative Feedback Previous: 2.4 Evaluation Versus Instruction   Contents

2.5 Incremental Implementation

The action-value methods we have discussed so far all estimate action values as sample averages of observed rewards. The obvious implementation is to maintain, for each action , a record of all the rewards that have followed the selection of that action. Then, when the estimate of the value of action a is needed at time , it can be computed according to (2.1), which we repeat here:

where are all the rewards received following all selections of action prior to play . A problem with this straightforward implementation is that its memory and computational requirements grow over time without bound. That is, each additional reward following a selection of action requires more memory to store it and results in more computation being required to determine .

As you might suspect, this is not really necessary. It is easy to devise incremental update formulas for computing averages with small, constant computation required to process each new reward. For some action, let denote the average of its first rewards (not to be confused with , the average for action at the th play). Given this average and a st reward, , then the average of all rewards can be computed by

 (2.4)

which holds even for , obtaining for arbitrary . This implementation requires memory only for and , and only the small computation (2.4) for each new reward.

The update rule (2.4) is of a form that occurs frequently throughout this book. The general form is

 (2.5)

The expression is an error in the estimate. It is reduced by taking a step toward the "Target." The target is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the reward.

Note that the step-size parameter () used in the incremental method described above changes from time step to time step. In processing the th reward for action , that method uses a step-size parameter of . In this book we denote the step-size parameter by the symbol or, more generally, by . For example, the above incremental implementation of the sample-average method is described by the equation . Accordingly, we sometimes use the informal shorthand to refer to this case, leaving the action dependence implicit.

Exercise 2.5   Give pseudocode for a complete algorithm for the -armed bandit problem. Use greedy action selection and incremental computation of action values with step-size parameter. Assume a function that takes an action and returns a reward. Use arrays and variables; do not subscript anything by the time index . Indicate how the action values are initialized and updated after each reward.

Next: 2.6 Tracking a Nonstationary Up: 2. Evaluative Feedback Previous: 2.4 Evaluation Versus Instruction   Contents
Mark Lee 2005-01-04