Almost all reinforcement learning algorithms are based on estimating value functions--functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Of course the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular policies.
Recall that a policy, , is a mapping from each state, , and action, , to the probability of taking action
when in state . Informally, the value of a state under a
, denoted , is the expected return when starting in and
following thereafter. For MDPs, we can define formally as
Similarly, we define the value of taking action in
state under a policy , denoted , as the expected
return starting from , taking the action , and thereafter following
The value functions and can be estimated from experience. For example, if an agent follows policy and maintains an average, for each state encountered, of the actual returns that have followed that state, then the average will converge to the state's value, , as the number of times that state is encountered approaches infinity. If separate averages are kept for each action taken in a state, then these averages will similarly converge to the action values, . We call estimation methods of this kind Monte Carlo methods because they involve averaging over many random samples of actual returns. These kinds of methods are presented in Chapter 5. Of course, if there are very many states, then it may not be practical to keep separate averages for each state individually. Instead, the agent would have to maintain and as parameterized functions and adjust the parameters to better match the observed returns. This can also produce accurate estimates, although much depends on the nature of the parameterized function approximator (Chapter 8).
A fundamental property of value functions used throughout reinforcement learning
and dynamic programming is that they satisfy
particular recursive relationships. For any
policy and any state
, the following consistency condition holds between the value of and the value
of its possible successor states:
The value function is the unique solution to its Bellman equation. We show in subsequent chapters how this Bellman equation forms the basis of a number of ways to compute, approximate, and learn . We call diagrams like those shown in Figure 3.4 backup diagrams because they diagram relationships that form the basis of the update or backup operations that are at the heart of reinforcement learning methods. These operations transfer value information back to a state (or a state-action pair) from its successor states (or state-action pairs). We use backup diagrams throughout the book to provide graphical summaries of the algorithms we discuss. (Note that unlike transition graphs, the state nodes of backup diagrams do not necessarily represent distinct states; for example, a state might be its own successor. We also omit explicit arrowheads because time always flows downward in a backup diagram.)
Example 3.8: Gridworld Figure 3.5a uses a rectangular grid to illustrate value functions for a simple finite MDP. The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of . Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of and take the agent to . From state B, all actions yield a reward of and take the agent to .
Suppose the agent selects all four actions with equal probability in all states. Figure 3.5b shows the value function, , for this policy, for the discounted reward case with . This value function was computed by solving the system of equations (3.10). Notice the negative values near the lower edge; these are the result of the high probability of hitting the edge of the grid there under the random policy. State A is the best state to be in under this policy, but its expected return is less than 10, its immediate reward, because from A the agent is taken to , from which it is likely to run into the edge of the grid. State B, on the other hand, is valued more than 5, its immediate reward, because from B the agent is taken to , which has a positive value. From the expected penalty (negative reward) for possibly running into an edge is more than compensated for by the expected gain for possibly stumbling onto A or B.
Example 3.9: Golf To formulate playing a hole of golf as a reinforcement learning task, we count a penalty (negative reward) of for each stroke until we hit the ball into the hole. The state is the location of the ball. The value of a state is the negative of the number of strokes to the hole from that location. Our actions are how we aim and swing at the ball, of course, and which club we select. Let us take the former as given and consider just the choice of club, which we assume is either a putter or a driver. The upper part of Figure 3.6 shows a possible state-value function, , for the policy that always uses the putter. The terminal state in-the-hole has a value of . From anywhere on the green we assume we can make a putt; these states have value . Off the green we cannot reach the hole by putting, and the value is greater. If we can reach the green from a state by putting, then that state must have value one less than the green's value, that is, . For simplicity, let us assume we can putt very precisely and deterministically, but with a limited range. This gives us the sharp contour line labeled in the figure; all locations between that line and the green require exactly two strokes to complete the hole. Similarly, any location within putting range of the contour line must have a value of , and so on to get all the contour lines shown in the figure. Putting doesn't get us out of sand traps, so they have a value of . Overall, it takes us six strokes to get from the tee to the hole by putting.
Exercise 3.8 What is the Bellman equation for action values, that is, for ? It must give the action value in terms of the action values, , of possible successors to the state-action pair . As a hint, the backup diagram corresponding to this equation is given in Figure 3.4b. Show the sequence of equations analogous to (3.10), but for action values.
Exercise 3.9 The Bellman equation (3.10) must hold for each state for the value function shown in Figure 3.5b. As an example, show numerically that this equation holds for the center state, valued at , with respect to its four neighboring states, valued at , , , and . (These numbers are accurate only to one decimal place.)
Exercise 3.10 In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.2), that adding a constant to all the rewards adds a constant, , to the values of all states, and thus does not affect the relative values of any states under any policies. What is in terms of and ?
Exercise 3.11 Now consider adding a constant to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.
Exercise 3.12 The value of a state depends on the the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:
Exercise 3.13 The value of an action, , can be divided into two parts, the expected next reward, which does not depend on the policy , and the expected sum of the remaining rewards, which depends on the next state and the policy. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state-action pair) and branching to the possible next states: