| Symbol | Meaning |
| --- | --- |
| $t$ | discrete time step |
| $T$ | final time step of an episode |
| $S_t$ | state at $t$ |
| $A_t$ | action at $t$ |
| $R_t$ | reward at $t$, dependent, like $S_t$, on $A_{t-1}$ and $S_{t-1}$ |
| $G_t$ | return (cumulative discounted reward) following $t$ (see the definitions after this table) |
| $G_t^{(n)}$ | $n$-step return (Section 7.1) |
| $G_t^{\lambda}$ | $\lambda$-return (Section 7.2) |
| $\pi$ | policy, decision-making rule |
| $\pi(s)$ | action taken in state $s$ under deterministic policy $\pi$ |
| $\pi(a \mid s)$ | probability of taking action $a$ in state $s$ under stochastic policy $\pi$ |
| $\mathcal{S}$ | set of all nonterminal states |
| $\mathcal{S}^+$ | set of all states, including the terminal state |
| $\mathcal{A}(s)$ | set of actions possible in state $s$ |
| $p(s' \mid s, a)$ | probability of transition from state $s$ to state $s'$ under action $a$ |
| $r(s, a, s')$ | expected immediate reward on transition from $s$ to $s'$ under action $a$ |
| $v_\pi(s)$ | value of state $s$ under policy $\pi$ (expected return) |
| $v_*(s)$ | value of state $s$ under the optimal policy |
| $V$, $V_t$ | estimates of $v_\pi$ or $v_*$ |
| $q_\pi(s, a)$ | value of taking action $a$ in state $s$ under policy $\pi$ |
| $q_*(s, a)$ | value of taking action $a$ in state $s$ under the optimal policy |
| $Q$, $Q_t$ | estimates of $q_\pi$ or $q_*$ |
| $\theta$, $\theta_t$ | vector of parameters underlying $V_t$ or $Q_t$ |
| $\phi(s)$ | vector of features representing state $s$ |
| $\delta_t$ | temporal-difference error at $t$ (see the update sketch after this table) |
| $e_t(s)$ | eligibility trace for state $s$ at $t$ |
| $e_t(s, a)$ | eligibility trace for a state-action pair |
| $\gamma$ | discount-rate parameter |
| $\varepsilon$ | probability of random action in $\varepsilon$-greedy policy |
| $\alpha$, $\beta$ | step-size parameters |
| $\lambda$ | decay-rate parameter for eligibility traces |
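
For quick reference, the return symbols above combine as follows. This is a sketch of the standard definitions, assuming the reward-indexing convention given for $R_t$ in the table (so $R_{t+1}$ follows $S_t$ and $A_t$) and bootstrapping from the current estimate $V_t$; in an episode the sums run only to the final step $T$:

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \\
G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_t(S_{t+n}), \\
G_t^{\lambda} &= (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}.
\end{aligned}
$$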
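
The value symbols relate to the transition quantities $p(s' \mid s, a)$ and $r(s, a, s')$ through the usual Bellman expectation form; a sketch under the same conventions:

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma\, v_\pi(s') \right], \\
q_\pi(s, a) &= \mathbb{E}_\pi\!\left[ G_t \mid S_t = s,\, A_t = a \right], \qquad v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a).
\end{aligned}
$$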
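
Finally, $\delta_t$, $e_t(s)$, $\alpha$, and $\lambda$ typically appear together in a tabular TD($\lambda$) update. The sketch below assumes accumulating traces; replacing traces are a common variant:

$$
\begin{aligned}
\delta_t &= R_{t+1} + \gamma\, V_t(S_{t+1}) - V_t(S_t), \\
e_t(s) &= \begin{cases} \gamma \lambda\, e_{t-1}(s) + 1 & \text{if } s = S_t, \\ \gamma \lambda\, e_{t-1}(s) & \text{otherwise,} \end{cases} \\
V_{t+1}(s) &= V_t(s) + \alpha\, \delta_t\, e_t(s) \quad \text{for all } s \in \mathcal{S}.
\end{aligned}
$$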