So far we have considered methods for estimating the value functions for a
policy given an infinite supply of episodes generated using that policy.
Suppose now that all we have are episodes generated from a *different*
policy. That is, suppose we wish to estimate $V^\pi$ or $Q^\pi$, but all we have
are episodes following $\pi'$, where $\pi' \neq \pi$. Can we learn the value function
for a policy given only experience "off" the policy?

Happily, in many cases we can. Of course, in order to use episodes from $\pi'$ to estimate values for $\pi$, we require that every action taken under $\pi$ is also taken, at least occasionally, under $\pi'$. That is, we require that $\pi(s,a) > 0$ implies $\pi'(s,a) > 0$. In the episodes generated using $\pi'$, consider the $i$th first visit to state $s$ and the complete sequence of states and actions following that visit. Let $p_i(s)$ and $p'_i(s)$ denote the probabilities of that complete sequence happening given policies $\pi$ and $\pi'$ and starting from $s$. Let $R_i(s)$ denote the corresponding observed return from state $s$. To average these to obtain an unbiased estimate of $V^\pi(s)$, we need only weight each return by its relative probability of occurring under $\pi$ and $\pi'$, that is, by $p_i(s)/p'_i(s)$. The desired Monte Carlo estimate after observing $n_s$ returns from state $s$ is then

$$V(s) = \frac{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)} R_i(s)}{\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}}. \tag{5.3}$$

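This equation involves the probabilities $p_i(s)$ and $p'_i(s)$,
which are normally considered unknown in applications of Monte Carlo methods.
Fortunately, here we need only their ratio, $p_i(s)/p'_i(s)$, which *can* be determined with no knowledge of the environment's dynamics. Let
$T_i(s)$ be the time of termination of the $i$th episode involving state
$s$, and let $t$ be the time of the first visit to $s$ in that episode, so that $s_t = s$. Then

$$p_i(s_t) = \prod_{k=t}^{T_i(s)-1} \pi(s_k,a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}}$$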

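and

$$\frac{p_i(s_t)}{p'_i(s_t)} = \frac{\prod_{k=t}^{T_i(s)-1} \pi(s_k,a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}}}{\prod_{k=t}^{T_i(s)-1} \pi'(s_k,a_k)\, \mathcal{P}^{a_k}_{s_k s_{k+1}}} = \prod_{k=t}^{T_i(s)-1} \frac{\pi(s_k,a_k)}{\pi'(s_k,a_k)}.$$

Thus the weight needed in (5.3), $p_i(s)/p'_i(s)$, depends only on the two policies and not at all on the environment's dynamics.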
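
As a rough illustration, the weighting scheme can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: policies are represented as functions returning the probability of taking action `a` in state `s`, each sample is the list of state-action pairs from a first visit to the state of interest up to termination, paired with the observed return, and the names `importance_ratio` and `weighted_estimate` are hypothetical, chosen for this example only.

```python
from typing import Callable, List, Sequence, Tuple

State = int
Action = int
# Assumed representation: policy(s, a) returns the probability of action a in state s.
Policy = Callable[[State, Action], float]
Trajectory = Sequence[Tuple[State, Action]]


def importance_ratio(trajectory: Trajectory, target: Policy, behavior: Policy) -> float:
    """Product over the trajectory of target(s, a) / behavior(s, a).

    The environment's transition probabilities cancel in the ratio,
    so only the two policies are needed.
    """
    ratio = 1.0
    for s, a in trajectory:
        b = behavior(s, a)
        if b <= 0.0:
            # An observed action cannot have zero probability under the behavior policy.
            raise ValueError("behavior policy assigns zero probability to an observed action")
        ratio *= target(s, a) / b
    return ratio


def weighted_estimate(samples: List[Tuple[Trajectory, float]],
                      target: Policy, behavior: Policy) -> float:
    """Weighted average of returns, each weighted by its importance ratio."""
    numerator = 0.0
    denominator = 0.0
    for trajectory, ret in samples:
        w = importance_ratio(trajectory, target, behavior)
        numerator += w * ret
        denominator += w
    # If every weight is zero the estimate is undefined; returning 0.0 is an
    # arbitrary default chosen for this sketch.
    return numerator / denominator if denominator > 0.0 else 0.0


# Toy data (made up for illustration): one state, two actions.
def target(s: State, a: Action) -> float:
    return 1.0 if a == 1 else 0.0   # target policy always takes action 1


def behavior(s: State, a: Action) -> float:
    return 0.5                      # behavior policy picks uniformly at random


samples = [
    ([(0, 1), (0, 1)], 4.0),  # ratio (1/0.5)^2 = 4
    ([(0, 0)], 10.0),         # ratio 0: impossible under the target policy
    ([(0, 1)], 2.0),          # ratio 1/0.5 = 2
]
estimate = weighted_estimate(samples, target, behavior)
print(estimate)  # (4*4 + 0*10 + 2*2) / (4 + 0 + 2) = 20/6
```

In this toy data the second episode contributes nothing, because its first action has probability zero under the target policy, while the first episode counts twice as much as the third.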