Bellman Equation
The Bellman equation provides a recursive definition for the value of a state in a Markov Decision Process (MDP). It expresses the value of a state as the expected immediate reward plus the discounted value of the next state, when actions are chosen according to a particular policy. This recursive formulation is central to many reinforcement learning algorithms, including Q-learning and Value Iteration.
For the state-value function under a policy $\pi$, the equation reads:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\, R_{t+1} + \gamma\, V^{\pi}(S_{t+1}) \;\middle|\; S_t = s \,\right]$$

Where:
- $V^{\pi}(s)$: The value of state $s$ under policy $\pi$. It represents the expected total reward the agent can accumulate starting from state $s$ and following policy $\pi$.
- $\mathbb{E}_{\pi}[\cdot]$: The expectation over all possible actions and resulting transitions, assuming the agent follows policy $\pi$.
- $R_{t+1}$: The immediate reward received after taking an action in state $s$ at time $t$.
- $\gamma$: The discount factor, a number between 0 and 1 that reduces the importance of future rewards. A lower $\gamma$ prioritizes immediate rewards more heavily.
- $V^{\pi}(S_{t+1})$: The value of the next state, indicating how good it is to be in the state that follows the current one, assuming policy $\pi$ continues to be followed.
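For discrete state and action spaces, the expectation can be unpacked into a sum over the policy's action probabilities and the environment's transition probabilities. The transition function $P(s' \mid s, a)$ and reward function $R(s, a, s')$ are not defined above and are introduced here only to show this standard expanded form:

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[\, R(s, a, s') + \gamma\, V^{\pi}(s') \,\right]$$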
In other words, it's telling us: "What's the expected return if we start in state $s$, take an action according to policy $\pi$, receive an immediate reward, and then continue following $\pi$ from whatever next state we end up in?"
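To make the recursion concrete, here is a minimal sketch of iterative policy evaluation, which repeatedly applies the Bellman equation as an update rule. The environment (a small one-dimensional corridor), the fixed always-move-right policy, and names such as `n_states` are illustrative assumptions, not something defined above.

```python
import numpy as np

# Minimal sketch: iterative policy evaluation on a hypothetical 1-D corridor MDP.
# Assumptions (not from the text above): 5 states, state 4 is terminal,
# the fixed policy always moves right, and reaching the terminal state pays reward 1.
n_states = 5
gamma = 0.9                      # discount factor

V = np.zeros(n_states)           # initialize V(s) = 0 for every state

for _ in range(100):             # sweep until the values stop changing
    for s in range(n_states - 1):        # terminal state keeps value 0
        next_s = s + 1                   # deterministic "move right" policy
        reward = 1.0 if next_s == n_states - 1 else 0.0
        # Bellman backup: V(s) <- E_pi[ R_{t+1} + gamma * V(S_{t+1}) | S_t = s ]
        V[s] = reward + gamma * V[next_s]

print(V)   # approx. [0.729, 0.81, 0.9, 1.0, 0.0] after convergence
```

Each sweep replaces $V(s)$ with the right-hand side of the Bellman equation; repeating the sweep drives the estimates toward the fixed point $V^{\pi}$, which is exactly the idea behind policy evaluation and, with a maximization over actions, Value Iteration.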