The term "dynamic programming" is due to Bellman (1957a), who showed how these methods could be applied to a wide range of problems. Extensive treatments of DP can be found in many texts, including Bertsekas (1995), Bertsekas and Tsitsiklis (1996), Dreyfus and Law (1977), Ross (1983), White (1969), and Whittle (1982, 1983). Our interest in DP is restricted to its use in solving MDPs, but DP also applies to other types of problems. Kumar and Kanal (1988) provide a more general look at DP.

To the best of our knowledge, the first connection between DP and reinforcement learning was made by Minsky (1961) in commenting on Samuel's checkers player. In a footnote, Minsky mentioned that it is possible to apply DP to problems in which Samuel's backing-up process can be handled in closed analytic form. This remark may have misled artificial intelligence researchers into believing that DP was restricted to analytically tractable problems and therefore largely irrelevant to artificial intelligence. Andreae (1969b) mentioned DP in the context of reinforcement learning, specifically policy iteration, although he did not make specific connections between DP and learning algorithms. Werbos (1977) suggested an approach to approximating DP called "heuristic dynamic programming" that emphasizes gradient-descent methods for continuous-state problems (Werbos, 1982, 1987, 1988, 1989, 1992). These methods are closely related to the reinforcement learning algorithms that we discuss in this book. Watkins (1989) was explicit in connecting reinforcement learning to DP, characterizing a class of reinforcement learning methods as "incremental dynamic programming."

Iterative
policy evaluation is an example of a classical successive approximation
algorithm for solving a system of linear equations. The version of the algorithm
that uses two arrays, one holding the old values while the other is updated, is
often called a *Jacobi-style* algorithm, after Jacobi's classical use of
this method. It is also sometimes called a *synchronous* algorithm because
it can be performed in parallel, with separate processors simultaneously
updating the values of individual states using input from other processors. The
second array is needed to simulate this parallel computation sequentially. The
in-place version of the algorithm is often called a *Gauss-Seidel-style*
algorithm after the classical Gauss-Seidel algorithm for solving systems of
linear equations. In addition to iterative policy evaluation, other DP
algorithms can be implemented in these different versions. Bertsekas and
Tsitsiklis (1989) provide excellent coverage of
these variations and their performance differences.
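
The two-array and in-place versions can be sketched concretely. The following is a minimal illustration (the 4-state chain MDP, the function names, and the constants are our own toy assumptions, not from the text): under a fixed policy, each nonterminal state s moves deterministically to s+1 with reward -1, state 3 is terminal, and the discount is 1.

```python
# Toy MDP (illustrative assumption, not from the text): a 4-state chain
# where, under the fixed policy being evaluated, each nonterminal state s
# moves deterministically to s+1 with reward -1; state 3 is terminal.
N_STATES = 4
GAMMA = 1.0
THETA = 1e-8  # convergence threshold

def step(s):
    """Deterministic next state and reward under the fixed policy."""
    return s + 1, -1.0

def jacobi_policy_evaluation():
    """Two-array ('Jacobi-style') version: every update reads old values,
    so the sweep could be done in parallel; the second array simulates
    that parallel computation sequentially."""
    V = [0.0] * N_STATES
    while True:
        V_new = [0.0] * N_STATES
        for s in range(N_STATES - 1):       # terminal state stays at 0
            s2, r = step(s)
            V_new[s] = r + GAMMA * V[s2]    # reads only the old array
        if max(abs(a - b) for a, b in zip(V_new, V)) < THETA:
            return V_new
        V = V_new

def gauss_seidel_policy_evaluation():
    """In-place ('Gauss-Seidel-style') version: each updated value is
    available immediately to later updates in the same sweep."""
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES - 1):
            s2, r = step(s)
            v_old = V[s]
            V[s] = r + GAMMA * V[s2]        # may read already-updated entries
            delta = max(delta, abs(V[s] - v_old))
        if delta < THETA:
            return V

print(jacobi_policy_evaluation())        # [-3.0, -2.0, -1.0, 0.0]
print(gauss_seidel_policy_evaluation())  # [-3.0, -2.0, -1.0, 0.0]
```

Both versions converge to the same value function. Note that if the in-place sweep visited states in reverse order here, it would converge in a single sweep, a small hint of how state ordering can affect the performance of Gauss-Seidel-style algorithms.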