Bellman Equation

The Bellman equation provides a recursive definition for the value of a state in a Markov Decision Process (MDP). It expresses the value of a state as the expected immediate reward of acting from that state plus the discounted value of the state that follows, assuming the agent continues to follow a particular policy. This recursive formulation is central to many reinforcement learning algorithms, including Q-learning and Value Iteration.

$$
\textcolor{magenta}{V^\pi(s)} = \mathbb{E}_\pi \left[ \textcolor{red}{R_{t+1}} + \textcolor{orange}{\gamma} \cdot \textcolor{blue}{V^\pi(S_{t+1})} \mid S_t = s \right]
$$

Where:

  • $\textcolor{magenta}{V^\pi(s)}$: The value of state $s$ under policy $\pi$. It represents the expected total reward the agent can accumulate starting from state $s$ and following policy $\pi$.

  • $\mathbb{E}_\pi[\dots]$: The expectation over all possible actions and resulting transitions, assuming the agent follows policy $\pi$.

  • $\textcolor{red}{R_{t+1}}$: The immediate reward received after taking an action in state $s$ at time $t$.

  • $\textcolor{orange}{\gamma}$: The discount factor, a number between 0 and 1, which reduces the importance of future rewards. A lower $\gamma$ prioritizes immediate rewards more heavily.

  • $\textcolor{blue}{V^\pi(S_{t+1})}$: The value of the next state, indicating how good it is to be in the state that follows the current one, assuming policy $\pi$ continues to be followed.

In other words, it's telling us: "What's the expected return if we start in state $s$, take an action according to policy $\pi$, receive an immediate reward, and then continue following policy $\pi$ from whatever next state we end up in?"
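
To make the recursion concrete, here is a minimal sketch of iterative policy evaluation, which repeatedly applies the Bellman expectation backup until the values stop changing. The 3-state MDP, its transition table `P`, and the uniform-random policy `pi` below are hypothetical, chosen purely for illustration.

```python
import numpy as np

n_states = 3
gamma = 0.9    # discount factor
theta = 1e-8   # convergence threshold

# Hypothetical dynamics: P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 2.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # state 2 is absorbing
}

# A fixed stochastic policy: pi[s][a] = probability of taking action a in state s.
pi = {
    0: {0: 0.5, 1: 0.5},
    1: {0: 0.5, 1: 0.5},
    2: {0: 0.5, 1: 0.5},
}

V = np.zeros(n_states)
while True:
    delta = 0.0
    for s in range(n_states):
        # Bellman expectation backup:
        # V(s) = sum_a pi(a|s) * sum_{s'} p(s', r | s, a) * (r + gamma * V(s'))
        v_new = sum(
            pi[s][a] * sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

print(V)  # approximate state values under the fixed policy
```

Each sweep replaces $V(s)$ with the right-hand side of the Bellman equation; because $\gamma < 1$, the updates contract toward the unique fixed point $V^\pi$.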