Almost all reinforcement learning algorithms are
based on estimating *value functions*--functions of states (or of
state-action pairs) that estimate *how good* it is for the agent to be in
a given state (or how good it is to perform a given action in a given
state). The notion of "how good" here is defined in terms of future rewards
that can be expected, or, to be precise, in terms of expected return. Of course the
rewards the agent can expect to receive in the future depend on what actions
it will take. Accordingly, value functions are defined with respect to
particular policies.

Recall that a policy, $\pi$, is a mapping from each state, $s \in \mathcal{S}$, and action, $a \in \mathcal{A}(s)$, to the probability $\pi(s,a)$ of taking action
$a$ when in state $s$. Informally, the *value* of a state $s$ under a
policy
$\pi$, denoted $V^\pi(s)$, is the expected return when starting in $s$ and
following $\pi$ thereafter. For MDPs, we can define $V^\pi(s)$ formally as

$$V^\pi(s) \;=\; E_\pi\{R_t \mid s_t = s\} \;=\; E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s \right\},$$

where $E_\pi\{\cdot\}$ denotes the expected value given that the agent follows policy $\pi$, and $t$ is any time step. Note that the value of the terminal state, if any, is always zero. We call the function $V^\pi$ the *state-value function for policy $\pi$*.

Similarly, we define the value of taking action $a$ in
state $s$ under a policy $\pi$, denoted $Q^\pi(s,a)$, as the expected
return starting from $s$, taking the action $a$, and thereafter following
policy $\pi$:

$$Q^\pi(s,a) \;=\; E_\pi\{R_t \mid s_t = s, a_t = a\} \;=\; E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s, a_t = a \right\}.$$

We call $Q^\pi$ the *action-value function for policy $\pi$*.
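Both definitions average the same quantity, the discounted return $\sum_{k} \gamma^k r_{t+k+1}$. For a finite reward sequence this return can be accumulated in a single backward pass; the sketch below is our own illustration (the reward sequence and discount rate are invented, not taken from any example in the text):

```python
def discounted_return(rewards, gamma):
    """Return R_t = sum_k gamma^k * r_{t+k+1} for a finite list of rewards."""
    R = 0.0
    for r in reversed(rewards):   # fold from the last reward backward
        R = r + gamma * R
    return R

# Rewards 1, 0, 0, 2 with gamma = 0.5: R = 1 + 0.5**3 * 2 = 1.25
print(discounted_return([1, 0, 0, 2], 0.5))
```

The backward fold is just Horner's rule applied to the power series in $\gamma$.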

The value functions $V^\pi$ and $Q^\pi$ can be estimated from
experience. For example, if an agent follows policy $\pi$
and maintains an average, for each state encountered, of the actual returns
that have followed that state, then the average will converge to
the state's value, $V^\pi(s)$, as the number of times that state is
encountered approaches infinity. If separate averages are kept for each action taken
in a state, then these averages will similarly converge to the action values,
$Q^\pi(s,a)$. We call estimation methods of this kind *Monte Carlo
methods* because they involve averaging over many random samples
of actual returns. These kinds of methods are presented in Chapter 5. Of
course, if there are very many states, then it may not be practical to keep
separate averages for each state individually. Instead, the agent would have to
maintain $V^\pi$ and $Q^\pi$ as parameterized functions and adjust the parameters to
better match the observed returns. This can also produce accurate estimates,
although much depends on the nature of the parameterized function approximator
(Chapter 8).
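The return-averaging idea can be sketched in a few lines. The toy episodic chain below is our own invention (not an example from this chapter): the agent passes through states 0 and 1 and then terminates, receiving +1 for the first step and a random reward of 0 or 2 for the second, so with $\gamma = 1$ the true values are $V^\pi(0) = 2$ and $V^\pi(1) = 1$:

```python
import random

def generate_episode(rng):
    """One episode of the toy chain: list of (state, reward) pairs."""
    return [(0, 1.0), (1, rng.choice([0.0, 2.0]))]

def mc_evaluate(num_episodes, gamma=1.0, seed=0):
    """Monte Carlo policy evaluation: average the returns observed
    after each visit to each state."""
    rng = random.Random(seed)
    returns = {0: [], 1: []}            # observed returns following each state
    for _ in range(num_episodes):
        G = 0.0
        for state, reward in reversed(generate_episode(rng)):
            G = reward + gamma * G      # return from this state onward
            returns[state].append(G)
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

V = mc_evaluate(10_000)
print(V)   # V[0] near 2.0 and V[1] near 1.0
```

The per-state lists of returns stand in for the running averages described above; with many states one would keep incremental averages (or a parameterized function) instead of full lists.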

A fundamental property of value functions used throughout reinforcement learning
and dynamic programming is that they satisfy
particular recursive relationships. For any
policy $\pi$ and any state
$s$, the following consistency condition holds between the value of $s$ and the value
of its possible successor states:

$$V^\pi(s) \;=\; \sum_a \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right], \tag{3.10}$$

where it is implicit that the actions, $a$, are taken from the set $\mathcal{A}(s)$, and the next states, $s'$, are taken from the set $\mathcal{S}$, or from $\mathcal{S}^+$ in the case of an episodic problem. Equation (3.10) is the *Bellman equation for $V^\pi$*. It expresses a relationship between the value of a state and the values of its successor states.

The value function $V^\pi$ is the unique solution to its Bellman
equation. We show in subsequent chapters
how this Bellman equation forms the basis of a number of ways to compute,
approximate, and learn $V^\pi$. We call diagrams like those shown in
Figure
3.4 *backup diagrams* because they diagram
relationships that form the basis of the update or *backup* operations that are
at the heart of reinforcement learning methods. These operations transfer value
information *back* to a state (or a state-action pair) from its successor states
(or state-action pairs). We use backup diagrams throughout the book to provide
graphical summaries of the algorithms we discuss. (Note that unlike transition
graphs, the state nodes of backup diagrams do not necessarily represent distinct
states; for example, a state might be its own successor. We also omit explicit
arrowheads because time always flows downward in a backup diagram.)
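The backup operation itself is just equation (3.10) applied as an update: replace the current estimate at each state by the right-hand side evaluated with the current estimates at the successor states. Repeated sweeps converge to the unique solution. A minimal sketch on a hypothetical two-state MDP (all transition probabilities and rewards below are invented for illustration):

```python
# P[s][a] lists the outcomes (probability, next_state, reward) of action a
# in state s; pi[s][a] is the policy's action probability.  Gamma = 0.9.
gamma = 0.9
P = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
pi = {0: {"stay": 0.5, "go": 0.5}, 1: {"stay": 0.5, "go": 0.5}}

def backup_all(V):
    """One synchronous sweep of the Bellman backup (3.10) over all states."""
    return {
        s: sum(
            pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for a, outcomes in P[s].items()
        )
        for s in P
    }

V = {0: 0.0, 1: 0.0}
for _ in range(1000):          # far more sweeps than needed at gamma = 0.9
    V = backup_all(V)

# At convergence V satisfies its own Bellman equation: backing it up
# changes nothing.
print(V, backup_all(V))
```

Solving the two linear equations by hand gives $V^\pi(0) = 7.25$ and $V^\pi(1) = 7.75$, which is what the iteration converges to; each sweep transfers value information back one step, exactly as the backup diagrams depict.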

Suppose the agent selects all four actions with equal probability in all states. Figure 3.5b shows the value function, $V^\pi$, for this policy, for the discounted reward case with $\gamma = 0.9$. This value function was computed by solving the system of equations (3.10). Notice the negative values near the lower edge; these are the result of the high probability of hitting the edge of the grid there under the random policy. State A is the best state to be in under this policy, but its expected return is less than 10, its immediate reward, because from A the agent is taken to A$'$, from which it is likely to run into the edge of the grid. State B, on the other hand, is valued more than 5, its immediate reward, because from B the agent is taken to B$'$, which has a positive value. From B$'$ the expected penalty (negative reward) for possibly running into an edge is more than compensated for by the expected gain for possibly stumbling onto A or B.
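That computation can be reproduced directly. The sketch below assumes the usual setup of this gridworld example as we reconstruct it (a 5x5 grid; every action from A yields +10 and moves the agent to A$'$, every action from B yields +5 and moves it to B$'$; stepping off the grid yields $-1$ and leaves the state unchanged; all other rewards are 0; $\gamma = 0.9$), with the particular cell coordinates being our assumption. It builds the 25x25 linear system $(I - \gamma P^\pi)\,v = r^\pi$ from equation (3.10) and solves it by plain Gaussian elimination:

```python
N, gamma = 5, 0.9
A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)      # assumed cell coordinates
states = [(i, j) for i in range(N) for j in range(N)]
idx = {s: k for k, s in enumerate(states)}

def step(s, move):
    """Deterministic dynamics: (next_state, reward) for one move."""
    if s == A: return A2, 10.0                      # all actions from A
    if s == B: return B2, 5.0                       # all actions from B
    i, j = s[0] + move[0], s[1] + move[1]
    if 0 <= i < N and 0 <= j < N:
        return (i, j), 0.0
    return s, -1.0                                  # bumped into the edge

# Build (I - gamma * P) and the expected one-step reward under the
# equiprobable random policy (each of the four moves with probability 1/4).
n = len(states)
M = [[0.0] * n for _ in range(n)]
r = [0.0] * n
for s in states:
    M[idx[s]][idx[s]] += 1.0
    for move in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        s2, rew = step(s, move)
        M[idx[s]][idx[s2]] -= gamma * 0.25
        r[idx[s]] += 0.25 * rew

def solve(M, r):
    """Gaussian elimination with partial pivoting (modifies M and r)."""
    n = len(r)
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[piv] = M[piv], M[col]
        r[col], r[piv] = r[piv], r[col]
        for row in range(col + 1, n):
            f = M[row][col] / M[col][col]
            for k in range(col, n):
                M[row][k] -= f * M[col][k]
            r[row] -= f * r[col]
    v = [0.0] * n
    for row in range(n - 1, -1, -1):
        v[row] = (r[row] - sum(M[row][k] * v[k]
                               for k in range(row + 1, n))) / M[row][row]
    return v

v = solve(M, r)
print(round(v[idx[A]], 1), round(v[idx[B]], 1))    # about 8.8 and 5.3
```

Under these assumed dynamics the solution shows exactly the effects described above: $V^\pi(\text{A})$ comes out below its immediate reward of 10, and $V^\pi(\text{B})$ above its immediate reward of 5.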

Give the equation corresponding to this intuition and diagram for the value at the root node, $V^\pi(s)$, in terms of the value at the expected leaf node, $Q^\pi(s,a)$, given $s_t = s$. This expectation depends on the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(s,a)$ such that no expected value notation appears in the equation.

Give the equation corresponding to this intuition and diagram for the action value, $Q^\pi(s,a)$, in terms of the expected next reward, $r_{t+1}$, and the expected next state value, $V^\pi(s_{t+1})$, given that $s_t = s$ and $a_t = a$. Then give a second equation, writing out the expected value explicitly in terms of $\mathcal{P}^a_{ss'}$ and $\mathcal{R}^a_{ss'}$, defined respectively by (3.6) and (3.7), such that no expected value notation appears in the equation.