The action-value methods we have discussed so far all estimate action
values as sample averages of observed rewards.
The obvious implementation is to maintain, for each action $a$, a record of all
the rewards that have followed the selection of that action. Then, when
the estimate of the value of action $a$ is needed at time $t$, it can be computed
according to (2.1), which we repeat here:

$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}, \tag{2.1}$$

where $r_1, \ldots, r_{k_a}$ are all the rewards received following all selections of action $a$ prior to play $t$. A problem with this straightforward implementation is that its memory and computational requirements grow over time without bound. That is, each additional reward following a selection of action $a$ requires more memory to store it and results in more computation being required to determine $Q_t(a)$.
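For contrast, a minimal sketch of this straightforward implementation (the names here are hypothetical, not from the text): it stores every reward received after selecting the action and recomputes the average (2.1) on demand, so both memory and per-query computation grow with the number of selections.

```python
# Naive sample-average estimate: keep every reward ever received for one action.
rewards_for_action = []  # grows without bound as the action is selected


def record_reward(r):
    """Store one more observed reward for this action."""
    rewards_for_action.append(r)


def estimate_value():
    """Recompute (2.1) from scratch: O(k_a) time and O(k_a) memory."""
    return sum(rewards_for_action) / len(rewards_for_action)


for r in [1.0, 0.0, 2.0]:
    record_reward(r)
print(estimate_value())  # (1.0 + 0.0 + 2.0) / 3 = 1.0
```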

As you might suspect, this is not really necessary. It is easy to devise
incremental update formulas for computing averages with small, constant
computation required to process each new reward. For some action, let $Q_k$ denote
the average of its first $k$ rewards (not to be confused with $Q_k(a)$, the
average for action $a$ at the $k$th *play*). Given this average and a
$(k+1)$st reward,
$r_{k+1}$, then the average of all $k+1$ rewards can be computed by

$$Q_{k+1} = Q_k + \frac{1}{k+1}\Bigl[r_{k+1} - Q_k\Bigr], \tag{2.4}$$

which holds even for $k = 0$, obtaining $Q_1 = r_1$ for arbitrary $Q_0$. This implementation requires memory only for $Q_k$ and $k$, and only the small computation (2.4) for each new reward.
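As a concrete sketch of rule (2.4) (the class name is my own, not from the text), the running average needs only the current estimate and a count, giving constant memory and constant computation per reward:

```python
class IncrementalAverage:
    """Running sample average via Q_{k+1} = Q_k + (1/(k+1)) * (r_{k+1} - Q_k)."""

    def __init__(self, q0=0.0):
        self.q = q0  # current estimate Q_k; Q_0 is arbitrary
        self.k = 0   # number of rewards averaged so far

    def update(self, reward):
        """Process one new reward in O(1) time and memory."""
        self.k += 1
        self.q += (reward - self.q) / self.k  # step toward reward, step size 1/k
        return self.q


avg = IncrementalAverage()
for r in [1.0, 0.0, 2.0]:
    avg.update(r)
print(avg.q)  # equals sum([1.0, 0.0, 2.0]) / 3 = 1.0
```

Note that the estimate agrees exactly with the batch average (2.1) at every step, which is what makes it safe to discard the individual rewards.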

The update rule (2.4) is of a form that
occurs frequently throughout this book. The general form is

$$\textit{NewEstimate} \leftarrow \textit{OldEstimate} + \textit{StepSize}\,\Bigl[\textit{Target} - \textit{OldEstimate}\Bigr]. \tag{2.5}$$

The expression $\bigl[\textit{Target} - \textit{OldEstimate}\bigr]$ is an *error* in the estimate. It is reduced by taking a step toward the target, which is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the $(k+1)$st reward.
Note that the step-size parameter ($\textit{StepSize}$) used in the incremental method described above changes from time step to time step. In processing the $k$th reward for action $a$, that method uses a step-size parameter of $\frac{1}{k}$. In this book we denote the step-size parameter by the symbol $\alpha$ or, more generally, by $\alpha_t(a)$. For example, the above incremental implementation of the sample-average method is described by the equation $\alpha_k(a) = \frac{1}{k}$. Accordingly, we sometimes use the informal shorthand $\alpha = \frac{1}{k}$ to refer to this case, leaving the action dependence implicit.
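The general form (2.5) can be sketched as a single function parameterized by the step size; the schedule $\alpha_k = 1/k$ then reproduces the sample average, while a constant $\alpha$ is another choice one could plug in (function names here are mine, and the constant value 0.1 is an arbitrary assumption for illustration):

```python
def general_update(old_estimate, target, step_size):
    """NewEstimate <- OldEstimate + StepSize * [Target - OldEstimate]  (2.5)."""
    return old_estimate + step_size * (target - old_estimate)


rewards = [2.0, 4.0, 6.0]

# Sample-average schedule alpha_k = 1/k: recovers the ordinary mean exactly.
q = 0.0
for k, r in enumerate(rewards, start=1):
    q = general_update(q, r, 1.0 / k)
print(q)  # mean of rewards: 4.0

# Constant step size (here alpha = 0.1): same general form, different schedule.
q_const = 0.0
for r in rewards:
    q_const = general_update(q_const, r, 0.1)
print(q_const)
```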