# Email discussion of numerical instabilities in the Gambler's problem

### Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto

A czech translation of this page is available. Many thanks to Olga Babenko.

The optimal policy shown in Figure 4.6 is not unique. There are many other optimal policies, all of which share the optimal function also shown in the figure. Which optimal policy you get from your value iteration algorithm depends on how it breaks ties within the numerical resolution of your computer. (Daniel Loeb, Daniel Polani, Francesco Decomite, and many students)

### The following is an email discussion of the issue, in chronological order:

Date: Mon, 1 Mar 1999 12:45:33 +0100 (MEZ)
From: Daniel Polani <polani@Informatik.Uni-Mainz.DE>
To: rich@cs.umass.edu

Dear Prof. Sutton,

I've been giving a course on Reinforcement Learning this winter and
there arose an interesting question in conjunction with the gambler's

There was some discussion in my course why the optimal policy as
shown in Fig. 4.6 should look like that. So I decided to code the
problem and noticed that the results are not always stable (you
indicate that in Exercise 4.8). In fact, the optimization is quite
stable when the problem size is odd, but it isn't for even problem
sizes.

The optimal policies in the even case usually assume a kind of fractal
characteristics, which is very volatile during iteration, while there
is some fixed underlying structure of filled triangles (of which your
Figure shows four). I noticed that the number of filled triangles is the
same as the power of 2 in the problem size. In other words, if problem
size is 2^n * k, where k is odd, then the solutions close to the
optimal policy have 2^n filled triangles in them plus some volatile
(instable) fractal structure. In particular, if problem size is odd,
then n = 0, 2^n = 1 and you find (stably) a single filled triangle in
it.

Now to my questions:

1. Can you confirm my above observations?
2. Do you know of any theoretical treatment of this question?
3. Although my observations seem generally consistent with your result in
Fig. 4.6, I could not reproduce your precise result for the optimal
policy, probably due to the precision loss in the calculation. Did
you have similar problems calculating the solution, or is your
result derived from theoretical considerations? If not, how did you
avoid precision problems?

--
Daniel Polani

```| Institut fuer Informatik                  |  Staudingerweg 9   | | Fachbereich Mathematik                    |  55099 Mainz       | | Johannes Gutenberg-Universitaet           |  Germany           | | ---------------------------------------------------------------| | polani@informatik.uni-mainz.de,  Tel. +6131/39-3608            | | http://www.informatik.uni-mainz.de/PERSONEN/Polani.html        |```

P.S.: Thank you very much for your smooth, consistent and unifying
introduction book into RL, it was very useful for me to prepare
my RL course.

To: polani@informatik.mathematik.uni-mainz.de

Dear Daniel Polani,

Thanks for your kind words and descriptions of your experiences teaching reinforcement learning and using our book.  I am afraid I can't give you any definitive answer on the gambler's problem.  I remember one student in a class I taught made a careful study of it, mainly computationally, but I don't recall his exact conclusions.  What I do remember, I think, is that the solution shown in the book is just one of many optimal policies for the task.  This is perhaps one of the main points of the exercise - to drive home the fact that there are potentially many optimal policies whereas the optimal values are unique.  The policy one finds often has to do with trivial things like numerical resolution or the order in which one's program breaks ties for maximal action value.  This is probably why you did not get necessarily exactly the result in Figure 4.6.  I believe this is still an optimal policy, however.  You might check this using the optimal value function you found.  See if my policy locally max's your value function to within the numerical resolution of your machine.  The code I used is available from the web page for the book if you want to check the order I used to break ties (or for a bug in my code!).

I don't know of any theoretical analysis of this problem.  We took the problem from a standard text, probably a DP text, but I couldn't say which now and I am disappointed to see that we didn't note which in our book.  You might check the exercises in some of the older Bertsekas DP texts.  But in any event, we slightly changed the problem so any prior analysis might not apply.  You probably now understand the problem as well as anybody does.

Best regards,
Rich Sutton

From: Daniel Loeb
To: Rich Sutton

I have a question about "The Gambler's Problem"
which appears at example 4.3 in your book
"Reinforcement Learning: An Introduction".

Your solution implies there is only one optimal solution
to the gambler's problem.
For example, starting with \$70 it is only optimal to bet \$5.
I believe that it is optimal to bet either \$5 or \$20 or \$30.
Am I missing something?

Here is my complete solution.

```o = your solution * = other solutions```

```            o            * *           *   *          *     *         *       *        *         *       o           o      * *         * *     *   *       *   *    o     o     o     o   o o   o o   o o   o o  o   o o   o o   o o   o \$0   \$25   \$50   \$75   \$100```

In particular, the simple policy of always making the
largest allowed bet is optimal.

It seems to be the complete set of optimal policy for any
p under 50%.
The number of levels of subdivisions seems to depend on
the number of powers of 2 which go into the goal (\$100).

I wonder if a complete solution is known for infinitely
divisible money. You can get a good idea what that might look
like by choosing a goal which is a large power of two.

Here is the code I used (in Matlab) to find my solutions:

`function [b, v] = optcasino(n,prob)`

```figure; title(prob); hold on;```

`v = zeros(1,n-1);`

```delta = 1; iter = 0;```

```while ( delta>1e-50 )    iter = iter + 1    [b,v, delta] = casino(v,prob);    plot(1:(n-1), v); end```

```hold on; % bar(b); plot(b(:,1), b(:,2), 'm*');```

`=======`

```function [bnew, vnew, delta] = casino(vold, prob) ```

`n = 1+length(vold);`

```vold = [ 0, vold, n ]; bnew = [];```

```for i = 1:(n-1)    bets = [1:min(i,n-i)];    vs = prob*vold(1+i+bets) + (1-prob)*vold(1+i-bets);    if ( i==15 )       vs - max(vs);    end      vnew(i) = max(vs);      goodbets = find( vs >= vnew(i)*(1-1e-9) );    gbets = ones(length(goodbets),2);    gbets(:,1) = i;    gbets(:,2) = goodbets';    bnew = [ bnew; gbets ]; end```

`delta = max( abs( [0,vnew,n] - vold ) )`

--
Yours, Daniel Loeb, Susquehanna International Group, LLP
http://dept-info.labri.fr/~loeb
Work loeb@sig.com     401 City Line, Bala Cynwyd PA 19004 610/617-2671
Home loeb@delanet.com

Date: Fri, 30 Nov 2001 09:00:47 -0500
To: Dan Loeb <loeb@sig.com>
From: Andy Barto <barto@cs.umass.edu>
Subject: Re: Reinforcement Learning: Example 4.3
Cc: Rich Sutton <sutton@research.att.com>

Dan,
Many thanks for comment about example 4.3. Indeed, we write as if the solution of Fig. 4.6 is the only one, when there is a whole class of them. You've done a nice job of characterizing the entire class. We are aware of this, and students in our course have similarly described all the solutions. But thanks very much for your message. A future edition of the book will do this better.
Sincrely,
A. Barto

PS: I haven't thought about the case of infinitely divisible money

From: Francesco.Decomite@lifl.fr
Subject: Gambler's problem
Date: May 26, 2004 6:46:57 AM MDT
To: rich@richsutton.com

dear Sir,

While studying your book, I am trying to rewrite and test the algorithms.
I am faced with a problem for about a week now, working on the gambler's problem (exemple 4.3 page 101) :

No matter how hard I tried, I never obtain the well-shaped lower graph on figure 4.6 page 103, but only approximations.

Even when using your code, the shape is approximative. Another point is that, as far as p<0.5, all the graphs are looking similar. I join you a graph obtained from your code with two values of p(I just added a function : lp to output the final policy) and divided epsilon by 100 : well, I attach also the modified code)
my question then is : do you obtain the graph by analytics methods, or other reasoning ways ? If so, can you give me some hints ?

Best regards

F. de Comite

--
Francesco De Comite
tel (33) 03 28 77 85 72
fax (33) 03 28 77 85 37
www.lifl.fr/~decomite

` ;;; Gambler's problem. The gambler has a stake s between 0 and 100. At each ;;; play he wagers an integer <= s. He wins that much with prob p, else he;;; loses that much. If he builds his stake to 100 he wins (thus he never ;;; wagers more than (- 100 s)); if his stake falls to 0 he loses.;;; Thus, the stake s is the state the actions are the size of the bid.;;; Here we implement value iteration.(defVar v (make-array 101 :initial-element 0))(setf (aref V 100) 1)(defVar p .12)(defun backup-action (s a)  (+ (* p (aref V (+ s a)))     (* (- 1 p) (aref V (- s a)))))(defun vi (&optional (epsilon .0000000001))    "Value iteration to the criterion epsilon"   (loop collect (loop for i from 1 to 99 collect (list i (aref v i)))         while (< epsilon                  (loop for s from 1 below 100                        for old-V = (aref V s)                        do (setf (aref V s)                                 (loop for a from 1 upto (min s (- 100 s))                                       maximize (backup-action s a)))                        maximize (abs (- old-V (aref V s)))))         ))(defun policy (s &optional (epsilon .0000000001))  (loop with best-value = -1        with best-action        for a from 1 upto (min s (- 100 s))        for this-value = (backup-action s a)        do (when (> this-value (+ best-value epsilon))             (setq best-value this-value)             (setq best-action a))        finally (return best-action)))(defun lp()  (loop for i from 1 to 99 collect (list i (policy i)))  )`