Reinforcement Learning and Artificial Intelligence (RLAI)
The value-function hypothesis
The ambition of this web page is to state, refine, clarify and, most of all, promote discussion of the following scientific hypothesis:
All efficient methods for solving sequential decision problems determine (learn or compute) value functions as an intermediate step.
Is this true? False? A definition? Unfalsifiable? You are encouraged to comment on the hypothesis, even in minimal ways. For example, you might submit an extension "Yes" to indicate that you believe the hypothesis, or similarly "No" or "Not sure". These minimal responses will be collected and tallied at some point, and you may want to change yours later, so please include your name in some way.
Definitions: A value function is an estimate of expected cumulative future reward, usually as a function of state or of state-action pair. The reward may be discounted, with less weight given to delayed reward, or it may be accumulated only within individual episodes of interaction with the environment. Finally, in the average-reward case, values are measured relative to the mean reward received when following the current policy.
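As a rough illustration of these definitions, the Python sketch below computes the three kinds of return for a finite reward sequence and averages discounted returns into a simple Monte Carlo estimate of V(s). The function names are hypothetical, the mean reward in the average-reward case is assumed to be known in advance, and a finite list of rewards stands in for the expectations and limits used in the formal definitions.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ...; later rewards get less weight."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def episodic_return(rewards: Sequence[float]) -> float:
    """Undiscounted sum of the rewards received within one episode."""
    return float(sum(rewards))


def differential_return(rewards: Sequence[float], mean_reward: float) -> float:
    """Average-reward case: each reward is measured relative to the policy's
    mean reward (assumed known here, for illustration only)."""
    return sum(r - mean_reward for r in rewards)


def mc_value_estimate(episodes_from_s: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of V(s): the average discounted return over
    sampled episodes that start in state s."""
    returns = [discounted_return(ep, gamma) for ep in episodes_from_s]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]
    print(discounted_return(rewards, 0.5))    # 1 + 0.5*0 + 0.25*2 = 1.5
    print(episodic_return(rewards))           # 3.0
    print(differential_return(rewards, 1.0))  # 0 - 1 + 1 = 0.0
    print(mc_value_estimate([rewards, [0.0, 0.0, 0.0]], 0.5))  # (1.5 + 0) / 2 = 0.75
```

The same reward sequence yields different numbers under each convention, which is why a value function is only meaningful once the kind of return it estimates is fixed.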
I think the necessity of value functions is something we have learned through long experience over the last 20 years. Dynamic programming computes value functions. All the most effective reinforcement learning methods estimate value functions. We can't prove directly that value functions are necessary; it is just an experience thing at this point.
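To make the contrast between computing and estimating value functions concrete, here is a minimal sketch assuming a hypothetical two-state episodic MDP. `value_iteration` is textbook dynamic programming: it repeatedly applies the Bellman optimality backup to a known model. `td0_update` is the tabular TD(0) rule, one standard way a reinforcement learning method estimates a value function from sampled transitions rather than from a model. Both algorithms, and all names here, are chosen for illustration; they are not a claim about which specific methods the author had in mind.

```python
from typing import Dict, List, Tuple

# Hypothetical toy MDP: P[state][action] = list of (probability, next_state, reward).
# State "T" is terminal: it has no actions and its value stays 0.
MDP = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

TOY_MDP: MDP = {
    "A": {"right": [(1.0, "B", 0.0)], "stay": [(1.0, "A", 0.0)]},
    "B": {"right": [(1.0, "T", 1.0)], "left": [(1.0, "A", 0.0)]},
    "T": {},
}


def value_iteration(mdp: MDP, gamma: float, tol: float = 1e-10) -> Dict[str, float]:
    """Dynamic programming: sweep V(s) <- max_a sum_{s'} p(s'|s,a) [r + gamma V(s')]
    over all states until the largest change in a sweep falls below tol."""
    v = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:  # terminal state keeps value 0
                continue
            best = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v


def td0_update(v: Dict[str, float], s: str, r: float, s2: str,
               alpha: float, gamma: float) -> None:
    """One tabular TD(0) step: move V(s) toward the sampled target r + gamma*V(s')."""
    v[s] += alpha * (r + gamma * v[s2] - v[s])


def td0_evaluate_right(num_episodes: int, alpha: float, gamma: float) -> Dict[str, float]:
    """Hypothetical demo: estimate the values of the 'always move right' policy
    from sampled episodes A -> B -> T, without using the model's probabilities."""
    v = {s: 0.0 for s in TOY_MDP}
    for _ in range(num_episodes):
        td0_update(v, "A", 0.0, "B", alpha, gamma)
        td0_update(v, "B", 1.0, "T", alpha, gamma)
    return v


if __name__ == "__main__":
    print(value_iteration(TOY_MDP, gamma=0.9))
    print(td0_evaluate_right(1000, alpha=0.1, gamma=0.9))
```

With gamma = 0.9, value iteration gives V(B) = 1.0 and V(A) = 0.9. The TD(0) estimates converge toward the same numbers only because "always move right" happens to be the optimal policy in this toy problem; in general TD(0) estimates the value of whatever policy generated the samples.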
People are eternally proposing that value functions aren't necessary, that policies can be found directly, as in "policy search" methods (don't ask me what this means), but in the end the systems that perform the best always use values. And not just relative values (of actions from each state, which are essentially a policy), but absolute values giving a genuine absolute estimate of the expected cumulative future reward.
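The distinction between relative and absolute values can be illustrated with a small hypothetical action-value table. In the Python sketch below, adding an arbitrary constant to every action value within a state leaves both the greedy policy and the relative values (each action's value minus the best action's value in that state) unchanged. Relative values therefore pin down the policy but not the expected cumulative reward itself. The tables and helper names are made up for illustration; the sketch shows the distinction, not an argument for the hypothesis.

```python
from typing import Dict

# Hypothetical action-value table: Q[state][action] = estimated value.
QTable = Dict[str, Dict[str, float]]

EXAMPLE_Q: QTable = {
    "s1": {"a": 2.0, "b": 5.0},
    "s2": {"a": 1.0, "b": 0.5},
}
# Arbitrary per-state offsets used to show what relative values cannot see.
EXAMPLE_OFFSETS: Dict[str, float] = {"s1": 100.0, "s2": -7.0}


def greedy_policy(q: QTable) -> Dict[str, str]:
    """Pick the highest-valued action in each state (first one wins on ties)."""
    return {s: max(acts, key=acts.get) for s, acts in q.items()}


def relative_values(q: QTable) -> QTable:
    """Action values relative to the best action in each state (all <= 0).
    They rank the actions but discard the state's absolute expected return."""
    return {
        s: {a: x - max(acts.values()) for a, x in acts.items()}
        for s, acts in q.items()
    }


def shift_per_state(q: QTable, offsets: Dict[str, float]) -> QTable:
    """Add the same constant to every action value within each state."""
    return {s: {a: x + offsets[s] for a, x in acts.items()} for s, acts in q.items()}


if __name__ == "__main__":
    shifted = shift_per_state(EXAMPLE_Q, EXAMPLE_OFFSETS)
    print(greedy_policy(EXAMPLE_Q), greedy_policy(shifted))        # same policy
    print(relative_values(EXAMPLE_Q) == relative_values(shifted))  # True
    print(max(EXAMPLE_Q["s1"].values()), max(shifted["s1"].values()))  # 5.0 vs 105.0
```

Two quite different absolute value functions collapse to the same relative values, which is the sense in which relative values are "essentially a policy".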
To prove the value-function hypothesis in any general sense would be big news. In my opinion this is one of the most important open problems in artificial intelligence, and one that could have an essentially analytic (mathematical) solution.
-Rich