In conclusion, the Q-learning agent converged to a suboptimal strategy, as mentioned previously. Moreover, a portion of the environment remains unexplored by the Q-function, which prevents the agent from finding the new optimal path when the purple portal appears after the 100th episode.
These performance limitations can be attributed to the relatively low number of training steps (400), which limits both the amount of interaction with the environment and the exploration induced by the ε-greedy policy.
Planning, an essential component of model-based reinforcement learning methods, is particularly useful for improving sample efficiency and the estimation of action values. Dyna-Q and Dyna-Q+ are good examples of TD algorithms that incorporate planning steps.
The Dyna-Q algorithm (Dynamic Q-learning) is a combination of model-based RL and TD learning.
Model-based RL algorithms rely on a model of the environment and use planning as their primary way of updating value estimates. In contrast, model-free algorithms rely on direct learning from real interactions.
“A model of the environment is anything that an agent can use to predict how the environment will respond to its actions” — Reinforcement Learning: An Introduction.
In the scope of this article, the model can be seen as an approximation of the transition dynamics p(s’, r|s, a). Here, p returns a single next-state and reward pair given the current state-action pair.
In environments where p is stochastic, we distinguish between distribution models and sample models: the former returns a distribution over next states and rewards, while the latter returns a single pair sampled from the estimated distribution.
Models are especially useful for simulating episodes, and therefore for training the agent by replacing real-world interactions with planning steps, i.e. interactions with the simulated environment.
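To make this concrete, here is a minimal sketch of a sample model for a deterministic environment. The `SampleModel` class and its method names are illustrative assumptions, not taken from any specific library:

```python
import random


class SampleModel:
    """Memorizes observed transitions and replays them on demand.

    In a deterministic environment, storing the single observed outcome
    of each state-action pair is enough to approximate p(s', r | s, a).
    """

    def __init__(self):
        # Maps an observed (state, action) pair to its (reward, next_state) outcome
        self.transitions = {}

    def update(self, state, action, reward, next_state):
        # Model training: store the outcome of a real interaction
        self.transitions[(state, action)] = (reward, next_state)

    def sample(self):
        # Pick a previously observed state-action pair at random...
        state, action = random.choice(list(self.transitions))
        # ...and return the simulated experience (s, a, r, s')
        reward, next_state = self.transitions[(state, action)]
        return state, action, reward, next_state
```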
Agents implementing the Dyna-Q algorithm belong to the class of planning agents, i.e. agents that combine direct reinforcement learning and model learning. They use direct interactions with the environment both to update their value function (as in Q-learning) and to learn a model of the environment. After each direct interaction, they can also perform planning steps to update their value function using simulated interactions.
A quick Chess example
Imagine playing a good game of chess. After each of your moves, the reaction of your opponent allows you to assess the quality of your move. This is similar to receiving a positive or negative reward, which allows you to “update” your strategy. If your move leads to a blunder, you probably wouldn’t repeat it given the same board configuration. So far, this is comparable to direct reinforcement learning.
Now let’s add planning to the mix. Imagine that after each of your moves, while your opponent is thinking, you mentally go back over your previous moves to reassess their quality. You might find weaknesses that you neglected at first sight, or discover that specific moves were better than you thought. These thoughts also allow you to update your strategy. This is exactly what planning is about: updating the value function without interacting with the real environment, but rather with a model of that environment.
Dyna-Q therefore contains some additional steps compared to Q-learning, summarized in the list below and sketched in code right after it:
- After each direct update of the Q-values, the model stores the state-action pair and the observed reward and next state. This step is called model training.
- After model training, Dyna-Q performs n planning steps:
  - A random state-action pair is selected from the model buffer (i.e. this state-action pair was observed during direct interactions)
  - The model generates the simulated reward and next state
  - The value function is updated using the simulated experience (s, a, r, s’)
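Putting these steps together, here is a minimal sketch of a single Dyna-Q update, assuming a tabular Q-function stored as a dict of dicts (with every state pre-initialized) and the `SampleModel` sketched earlier; `alpha`, `gamma`, and `n_planning_steps` are illustrative values:

```python
def dyna_q_update(Q, model, state, action, reward, next_state,
                  alpha=0.1, gamma=0.9, n_planning_steps=10):
    """One Dyna-Q step: direct RL update, model training, then n planning steps."""
    # Direct RL: standard Q-learning update from the real transition
    Q[state][action] += alpha * (
        reward + gamma * max(Q[next_state].values()) - Q[state][action]
    )

    # Model training: memorize the observed transition
    model.update(state, action, reward, next_state)

    # Planning: replay n simulated transitions sampled from the model
    for _ in range(n_planning_steps):
        s, a, r, s_next = model.sample()
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
```

Note that setting `n_planning_steps` to 0 recovers plain Q-learning, which makes the relationship between the two algorithms explicit.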