 |
| Symbol | Meaning |
|---|---|
| $t$ | discrete time step |
| $T$ | final time step of an episode |
| $s_t$ | state at $t$ |
| $a_t$ | action at $t$ |
| $r_t$ | reward at $t$, dependent, like $s_t$, on $a_{t-1}$ and $s_{t-1}$ |
| $R_t$ | return (cumulative discounted reward) following $t$ |
| $R_t^{(n)}$ | $n$-step return (Section 7.1) |
| $R_t^{\lambda}$ | $\lambda$-return (Section 7.2; see the sketch after this table) |
| $\pi$ | policy, decision-making rule |
| $\pi(s)$ | action taken in state $s$ under deterministic policy $\pi$ |
| $\pi(s,a)$ | probability of taking action $a$ in state $s$ under stochastic policy $\pi$ |
| $\mathcal{S}$ | set of all nonterminal states |
| $\mathcal{S}^+$ | set of all states, including the terminal state |
| $\mathcal{A}(s)$ | set of actions possible in state $s$ |
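To make the return symbols above concrete, here is a minimal Python sketch. It assumes the rewards received after time $t$ are stored in a plain list (`rewards[k]` is the reward received $k+1$ steps after $t$, and `values[k]` is the current value estimate of the state visited $k$ steps after $t$); the function names are illustrative, not part of the notation.

```python
def full_return(rewards, gamma):
    """R_t-style return: sum of discounted rewards following a time step."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

def n_step_return(rewards, values, n, gamma):
    """n-step return: n discounted rewards plus the discounted value estimate
    of the state reached after n steps (truncated at the end of the episode)."""
    n = min(n, len(rewards))
    g = sum((gamma ** k) * rewards[k] for k in range(n))
    if n < len(values):                       # bootstrap only if episode not over
        g += (gamma ** n) * values[n]
    return g

def lambda_return(rewards, values, gamma, lam):
    """lambda-return: a (1 - lambda)-weighted average of the n-step returns,
    with the remaining weight given to the full return."""
    T = len(rewards)
    g = (1 - lam) * sum(
        (lam ** (n - 1)) * n_step_return(rewards, values, n, gamma)
        for n in range(1, T)
    )
    g += (lam ** (T - 1)) * full_return(rewards, gamma)
    return g
```

For example, `lambda_return([1.0, 0.0, 2.0], [0.5, 0.4, 0.3], gamma=0.9, lam=0.8)` blends the 1-, 2-, and 3-step returns of a three-step episode.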
 |
| Symbol | Meaning |
|---|---|
| $\mathcal{P}_{ss'}^{a}$ | probability of transition from state $s$ to state $s'$ under action $a$ |
| $\mathcal{R}_{ss'}^{a}$ | expected immediate reward on transition from $s$ to $s'$ under action $a$ |
| $V^{\pi}(s)$ | value of state $s$ under policy $\pi$ (expected return) |
| $V^{*}(s)$ | value of state $s$ under the optimal policy |
| $V$, $V_t$ | estimates of $V^{\pi}$ or $V^{*}$ |
| $Q^{\pi}(s,a)$ | value of taking action $a$ in state $s$ under policy $\pi$ |
| $Q^{*}(s,a)$ | value of taking action $a$ in state $s$ under the optimal policy |
| $Q$, $Q_t$ | estimates of $Q^{\pi}$ or $Q^{*}$ |
| $\theta_t$ | vector of parameters underlying $V_t$ or $Q_t$ (see the sketch after this table) |
| $\phi_s$ | vector of features representing state $s$ |
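In the function-approximation setting, the parameter and feature vectors combine linearly: the estimated value of a state is (roughly) the inner product $\theta_t^{\top} \phi_s$. Below is a toy sketch of that combination; the one-hot feature function and the names `phi`, `v_hat`, and `num_features` are assumptions for illustration only.

```python
import numpy as np

def phi(state, num_features=8):
    """Toy feature vector for an integer-indexed state (one-hot here)."""
    features = np.zeros(num_features)
    features[state % num_features] = 1.0
    return features

def v_hat(state, theta):
    """Linear estimate of the state's value: theta^T phi_s."""
    return float(np.dot(theta, phi(state)))

theta = np.zeros(8)        # parameters underlying the value estimate
print(v_hat(3, theta))     # 0.0 before any learning has adjusted theta
```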
 |
| Symbol | Meaning |
|---|---|
| $\delta_t$ | temporal-difference error at $t$ |
| $e_t(s)$ | eligibility trace for state $s$ at $t$ (see the sketch after this table) |
| $e_t(s,a)$ | eligibility trace for a state-action pair |
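A rough sketch of how $\delta_t$ and $e_t(s)$ interact in a single tabular TD($\lambda$) update with accumulating traces; the dictionary-based bookkeeping and the function name `td_lambda_step` are illustrative choices, not the book's pseudocode.

```python
from collections import defaultdict

def td_lambda_step(V, e, s, r, s_next, alpha, gamma, lam, terminal=False):
    """Apply one temporal-difference update with accumulating eligibility traces."""
    target = r if terminal else r + gamma * V[s_next]
    delta = target - V[s]          # temporal-difference error delta_t
    e[s] += 1.0                    # accumulate the trace for the visited state
    for state in list(e):
        V[state] += alpha * delta * e[state]   # credit each state by its trace
        e[state] *= gamma * lam                # decay all traces for the next step
    return delta

V, e = defaultdict(float), defaultdict(float)
td_lambda_step(V, e, s="A", r=1.0, s_next="B", alpha=0.1, gamma=0.9, lam=0.8)
```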
 |
| Symbol | Meaning |
|---|---|
| $\gamma$ | discount-rate parameter |
| $\varepsilon$ | probability of random action in $\varepsilon$-greedy policy (see the sketch after this table) |
| $\alpha$, $\beta$ | step-size parameters |
| $\lambda$ | decay-rate parameter for eligibility traces |
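As an illustration of how $\varepsilon$ is used, here is a minimal $\varepsilon$-greedy action-selection sketch, assuming action-value estimates are stored in a dictionary keyed by (state, action) pairs; the names are illustrative.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}
print(epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1))  # usually "right"
```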