In conclusion, the Q-learning agent converged to a suboptimal strategy, as mentioned previously. Moreover, a portion of the environment remains unexplored by the Q-function, which prevents the agent from finding the new optimal path when the purple portal appears after the 100th episode.
These performance limitations can be attributed to the relatively low number of training steps (400), which limits both the amount of interaction with the environment and the exploration induced by the ε-greedy policy.
Planning, an essential component of model-based reinforcement learning methods, is particularly useful for improving sample efficiency and the estimation of action values. Dyna-Q and Dyna-Q+ are good examples of TD algorithms that incorporate planning steps.
The Dyna-Q algorithm (Dynamic Q-learning) is a combination of model-based RL and TD learning.
Model-based RL algorithms rely on a model of the environment and use planning as their primary way of updating value estimates. In contrast, model-free algorithms rely on direct learning from real interactions.
“A model of the environment is anything that an agent can use to predict how the environment will respond to its actions” — Reinforcement Learning: An Introduction.
In the scope of this article, the model can be seen as an approximation of the transition dynamics p(s’, r|s, a). Here, p returns a single next-state and reward pair given the current state-action pair.
In environments where p is stochastic, we distinguish between distribution models and sample models: the former returns a distribution over next states and rewards, while the latter returns a single pair sampled from the estimated distribution.
Models are especially useful for simulating episodes, and therefore for training the agent by replacing real-world interactions with planning steps, i.e. interactions with the simulated environment.
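To make this concrete, here is a minimal sketch of a sample model for a deterministic environment. The `SampleModel` class and its method names are illustrative assumptions, not taken from any specific library:

```python
import random


class SampleModel:
    """Memorizes observed transitions and replays them on demand.

    In a deterministic environment, storing the single observed outcome
    of each state-action pair is enough to approximate p(s', r | s, a).
    """

    def __init__(self):
        # Maps an observed (state, action) pair to its (reward, next_state) outcome
        self.transitions = {}

    def update(self, state, action, reward, next_state):
        # Model training: store the outcome of a real interaction
        self.transitions[(state, action)] = (reward, next_state)

    def sample(self):
        # Pick a previously observed state-action pair at random...
        state, action = random.choice(list(self.transitions))
        # ...and return the simulated experience (s, a, r, s')
        reward, next_state = self.transitions[(state, action)]
        return state, action, reward, next_state
```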
Agents implementing the Dyna-Q algorithm belong to the class of planning agents, i.e. agents that combine direct reinforcement learning and model learning. They use direct interactions with the environment both to update their value function (as in Q-learning) and to learn a model of the environment. After each direct interaction, they can also perform planning steps to update their value function using simulated interactions.
A quick Chess example
Imagine playing a good game of chess. After each of your moves, the reaction of your opponent allows you to assess the quality of your move. This is similar to receiving a positive or negative reward, which allows you to “update” your strategy. If your move leads to a blunder, you probably wouldn’t repeat it given the same board configuration. So far, this is comparable to direct reinforcement learning.
Now let’s add planning to the mix. Imagine that after each of your moves, while your opponent is thinking, you mentally go back over your previous moves to reassess their quality. You might find weaknesses that you neglected at first sight, or discover that specific moves were better than you thought. These thoughts also allow you to update your strategy. This is exactly what planning is about: updating the value function without interacting with the real environment, but rather with a model of that environment.
Dyna-Q therefore contains some additional steps compared to Q-learning, summarized in the list below and sketched in code right after it:
- After each direct update of the Q-values, the model stores the state-action pair and the observed reward and next state. This step is called model training.
- After model training, Dyna-Q performs n planning steps:
  - A random state-action pair is selected from the model buffer (i.e. this state-action pair was observed during direct interactions)
  - The model generates the simulated reward and next state
  - The value function is updated using the simulated experience (s, a, r, s’)
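Putting these steps together, here is a minimal sketch of a single Dyna-Q update, assuming a tabular Q-function stored as a dict of dicts (with every state pre-initialized) and the `SampleModel` sketched earlier; `alpha`, `gamma`, and `n_planning_steps` are illustrative values:

```python
def dyna_q_update(Q, model, state, action, reward, next_state,
                  alpha=0.1, gamma=0.9, n_planning_steps=10):
    """One Dyna-Q step: direct RL update, model training, then n planning steps."""
    # Direct RL: standard Q-learning update from the real transition
    Q[state][action] += alpha * (
        reward + gamma * max(Q[next_state].values()) - Q[state][action]
    )

    # Model training: memorize the observed transition
    model.update(state, action, reward, next_state)

    # Planning: replay n simulated transitions sampled from the model
    for _ in range(n_planning_steps):
        s, a, r, s_next = model.sample()
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
```

Note that setting `n_planning_steps` to 0 recovers plain Q-learning, which makes the relationship between the two algorithms explicit.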