Solving the example using Value Iteration
VI should make even more sense once we work through an example problem, so let’s get back to our golf MDP. We have already formalised the problem as an MDP, but the agent doesn’t yet know the best strategy for playing golf, so let’s solve the golf MDP using VI.
We’ll start by defining our model parameters using fairly standard values:
γ = 0.9 // discount factor
θ = 0.01 // convergence threshold
We will then initialise our value table to 0 for all states in S:
// value table
V(s0) = 0
V(s1) = 0
V(s2) = 0
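As a concrete reference, here is how this setup could look in Python. This is a minimal sketch rather than the article's own code: the dictionary layout and variable names (transitions, GAMMA, THETA, V) are my own, but the probabilities and rewards are exactly the ones used in the calculations below.

GAMMA = 0.9   # discount factor
THETA = 0.01  # convergence threshold

# transitions[state][action] is a list of (next_state, probability, reward) triples
transitions = {
    "s0": {  # ball on the fairway
        "hit to green": [("s0", 0.1, 0), ("s1", 0.9, 0)],
    },
    "s1": {  # ball on the green
        "hit to fairway": [("s0", 0.9, 0), ("s1", 0.1, 0)],
        "hit in hole":    [("s1", 0.1, 0), ("s2", 0.9, 10)],
    },
    "s2": {},  # ball in the hole: terminal state, no actions
}

# value table, one entry per state, initialised to 0
V = {s: 0.0 for s in transitions}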
We can now start the outer loop:
Δ = 0
And three passes of the inner loop, one for each state in S:
// Bellman update rule
// V(s) ← maxₐ Σₛ′,ᵣ p(s′, r|s, a) * [r + γV(s′)]
//******************* state s0 *******************//
v = 0
// only one action is available from s0, so the max is over a single option
// transitions to any other state have probability 0, so their terms drop out of the sum
V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]
V(s0) = max[0.1 * (0 + 0.9 * 0) +
0.9 * (0 + 0.9 * 0)]
V(s0) = max[0] = 0
// Delta update rule
// Δ ← max(Δ, |v - V(s)|)
Δ = max[Δ, |v - V(s0)|] = max[0, |0 - 0|] = 0
//******************* state s1 *******************//
v = 0
// there are 2 available actions here
V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]
V(s1) = max[0.9 * (0 + 0.9 * 0) +
0.1 * (0 + 0.9 * 0),
0.1 * (0 + 0.9 * 0) +
0.9 * (10 + 0.9 * 0)]
V(s1) = max[0, 9] = 9
Δ = max[Δ, |v - V(s1)|] = max[0, |0 - 9|] = 9
//******************* state s2 *******************//
// terminal state with no actions
This gives us the following update to our value table:
V(s0) = 0
V(s1) = 9
V(s2) = 0
We don’t need to worry about s2 as this is a terminal state, meaning no actions are possible here.
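Here is how that inner loop might look in Python, continuing the sketch above. The helper names q_value and sweep are my own, and the update is done in place, exactly as in the hand calculation, so later states in the same pass see the values that were just refreshed.

def q_value(state, action, V, gamma=GAMMA):
    """Expected return of one (state, action) pair:
    sum over (s', r) of p(s', r | s, a) * (r + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s_next])
               for s_next, p, r in transitions[state][action])

def sweep(V, gamma=GAMMA):
    """One pass of the inner loop: Bellman update for every non-terminal state.
    Updates V in place and returns the largest change, delta."""
    delta = 0.0
    for state, actions in transitions.items():
        if not actions:          # s2 is terminal, nothing to update
            continue
        v = V[state]             # old value, for the delta update rule
        V[state] = max(q_value(state, a, V, gamma) for a in actions)
        delta = max(delta, abs(v - V[state]))
    return delta

Calling sweep(V) once on the all-zero table performs exactly the pass above: it returns Δ = 9 and leaves V(s0) = 0, V(s1) = 9, V(s2) = 0.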
We now break out of the inner loop and continue the outer loop, performing the convergence check:
Δ < θ = 9 < 0.01 = False
Since we have not converged, we do a second iteration of the outer loop:
Δ = 0
And another three passes of the inner loop, using the updated value table:
//******************* state s0 *******************//
v = 0
V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]
V(s0) = max[0.1 * (0 + 0.9 * 0) +
0.9 * (0 + 0.9 * 9)]
V(s0) = max[7.29] = 7.29
Δ = max[Δ, |v - V(s0)|] = max[0, |0 - 7.29|] = 7.29
//******************* state s1 *******************//
v = 9
V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]
V(s1) = max[0.9 * (0 + 0.9 * 7.29) +
0.1 * (0 + 0.9 * 9),
0.1 * (0 + 0.9 * 9) +
0.9 * (10 + 0.9 * 0)]
V(s1) = max[6.7149, 9.81] = 9.81
Δ = max[Δ, |v - V(s1)|] = max[7.29, |9 - 9.81|] = 7.29
//******************* state s2 *******************//
// terminal state with no actions
At the end of the second iteration, our values are:
V(s0) = 7.29
V(s1) = 9.81
V(s2) = 0
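As a quick check, two calls to the sweep function sketched earlier reproduce these first two passes; the returned Δ values and the table entries match the hand calculation up to floating-point rounding.

V = {s: 0.0 for s in transitions}
d1 = sweep(V)   # first pass:  delta = 9,    V(s0) = 0,    V(s1) = 9
d2 = sweep(V)   # second pass: delta = 7.29, V(s0) = 7.29, V(s1) = 9.81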
Check convergence once again:
Δ < θ = 7.29 < 0.01 = False
Still no convergence, so we continue the same process until Δ < θ. I won’t show all the calculations; the two iterations above are enough to understand the process.
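Putting the outer loop into the same Python sketch (value_iteration is my own name for it; it simply repeats the sweep until Δ < θ):

def value_iteration(V, gamma=GAMMA, theta=THETA):
    """Repeat full sweeps until the largest single-state change falls below theta."""
    while True:
        delta = sweep(V, gamma)
        if delta < theta:
            return V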
After 6 iterations, our values converge. Here are the values and Δ as they change over each iteration:
Iteration V(s0) V(s1) V(s2) Δ Converged
1 0 9 0 9 False
2 7.29 9.81 0 7.29 False
3 8.6022 9.8829 0 1.3122 False
4 8.779347 9.889461 0 0.177147 False
5 8.80060464 9.89005149 0 0.02125764 False
6 8.8029961245 9.8901046341 0 0.0023914845 True
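If you want to reproduce this table yourself, you can run the sweeps one at a time and print the values as you go. With the sketch above, the printed digits should match the table up to floating-point rounding.

V = {s: 0.0 for s in transitions}
iteration, converged = 0, False
while not converged:
    iteration += 1
    delta = sweep(V)
    converged = delta < THETA
    print(iteration, round(V["s0"], 10), round(V["s1"], 10), V["s2"],
          round(delta, 10), converged)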
Now we can extract our policy:
// Policy extraction rule
// π(s) = argmaxₐ Σₛ′,ᵣ p(s′, r|s, a) * [r + γV(s′)]
//******************* state s0 *******************//
// we know there is only one possible action from s0, but let's just do it anyway
π(s0) = argmax[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]
π(s0) = argmax[0.1 * (0 + 0.9 * 8.8029961245) +
0.9 * (0 + 0.9 * 9.8901046341)]
π(s0) = argmax[8.80325440483]
π(s0) = hit to green
//******************* state s1 *******************//
π(s1) = argmax[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]
π(s1) = argmax[0.9 * (0 + 0.9 * 8.8029961245) +
0.1 * (0 + 0.9 * 9.8901046341),
0.1 * (0 + 0.9 * 9.8901046341) +
0.9 * (10 + 0.9 * 0)]
π(s1) = argmax[8.02053627791, 9.89010941707]
π(s1) = hit in hole
Our final policy is:
π(s0) = hit to green
π(s1) = hit in hole
π(s2) = terminal state (no action)
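Policy extraction fits into the same Python sketch: for each non-terminal state we take the argmax over the same action values used during value iteration. Here extract_policy is my own helper name, not part of any library.

def extract_policy(V, gamma=GAMMA):
    """Greedy policy with respect to the converged value table."""
    policy = {}
    for state, actions in transitions.items():
        if not actions:
            policy[state] = None   # terminal state: no action to take
        else:
            policy[state] = max(actions, key=lambda a: q_value(state, a, V, gamma))
    return policy

V = value_iteration({s: 0.0 for s in transitions})
print(extract_policy(V))   # {'s0': 'hit to green', 's1': 'hit in hole', 's2': None}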
So, when our agent is in the Ball on fairway state (s0), the best action is to hit to green. This seems pretty obvious since that is the only available action. However, in s1, where there are two possible actions, our policy has learned to hit in hole. We can now give this learned policy to other agents who want to play golf!
And there you have it! We have just solved a very simple RL problem using Value Iteration.