The current state is represented by a tuple (alpha, beta), where alpha is the current on-hand inventory (items in stock) and beta is the current on-order inventory (items ordered but not yet received). The total initial inventory, init_inv, is simply the sum of alpha and beta.
Next, we simulate customer demand by drawing from a Poisson distribution with rate self.poisson_lambda. This draw captures the randomness of customer demand:
alpha, beta = state  # on-hand and on-order inventory
init_inv = alpha + beta  # total inventory before demand arrives
demand = np.random.poisson(self.poisson_lambda)  # sample random customer demand
Note: the Poisson distribution is used here to model demand, a common choice for random events such as customer arrivals. However, we could just as well train the agent on historical demand data, or through live interaction with the environment in real time. At its core, reinforcement learning is about learning from data, and it does not require prior knowledge of a model of the environment.
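To illustrate the historical-data alternative, here is a minimal sketch that bootstraps demand by resampling from an array of past daily demands. The array historical_demands and its values are purely hypothetical and not part of the original code:
import numpy as np
# Hypothetical record of past daily demands (for illustration only)
historical_demands = np.array([2, 0, 3, 1, 4, 2, 2, 5, 1, 3])
# Bootstrap one demand observation instead of sampling from a parametric Poisson model
demand = np.random.choice(historical_demands)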
Now, the “next alpha”, i.e., the on-hand inventory after demand is served, can be written as max(0, init_inv - demand). In other words, if demand exceeds the initial inventory, the new alpha is zero; otherwise, it is init_inv - demand.
The cost comes in two parts. The holding cost is calculated by multiplying the number of bikes left in the store by the per-unit holding cost. The second part is the stockout cost, the penalty we pay for each unit of missed demand. Together (with a negative sign) they form the “reward” that we try to maximize with reinforcement learning; put differently, we want to minimize the cost, so we maximize the reward.
new_alpha = max(0, init_inv - demand)  # on-hand inventory after demand is served
holding_cost = -new_alpha * self.holding_cost  # cost of carrying leftover bikes
stockout_cost = 0
if demand > init_inv:
    stockout_cost = -(demand - init_inv) * self.stockout_cost  # penalty for missed demand
reward = holding_cost + stockout_cost
next_state = (new_alpha, action)  # the order placed now becomes next period's beta
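To make the transition concrete, suppose the per-unit holding cost is 1 and the per-unit stockout cost is 5 (illustrative numbers, not taken from the original). With state (2, 3) and a sampled demand of 4, init_inv is 5, new_alpha is max(0, 5 - 4) = 1, the holding cost is -1, the stockout cost is 0, and the reward is -1. If demand had been 7 instead, new_alpha would be 0 and the reward would be 0 + (-(7 - 5) * 5) = -10.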
Exploration vs. Exploitation in Q-Learning
Choosing an action in Q-learning involves some degree of exploration, so that the agent gets a picture of the Q values across the states in the Q table. To do that, each time an action is chosen there is an epsilon (ϵ) chance that we explore and select an action at random, whereas with probability 1 - ϵ we exploit and take the best action according to the Q table.
def choose_action(self, state):
    # Epsilon-greedy action selection
    if np.random.rand() < self.epsilon:
        # Explore: order a random feasible quantity (0 up to the remaining capacity)
        return np.random.choice(self.user_capacity - (state[0] + state[1]) + 1)
    else:
        # Exploit: pick the action with the highest Q value for this state
        return max(self.Q[state], key=self.Q[state].get)
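Note that np.random.choice(n) samples uniformly from 0 to n - 1, so the exploration branch draws a feasible order quantity between 0 and the remaining capacity. For example, with a hypothetical user_capacity of 10 and state (3, 2), the agent can order anywhere from 0 to 5 bikes, and np.random.choice(6) picks one of those six options at random.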
Training the RL Agent
Training the RL agent is done by the “train” function, which works as follows. First, we initialize Q (an empty dictionary structure). Then, experiences are collected within each batch (self.batch.append((state, action, reward, next_state))), and the Q table is updated at the end of each batch (self.update_Q(self.batch)). The number of actions taken in each episode is capped at “max_actions_per_episode”, and the number of episodes is the number of times the agent interacts with the environment to learn the optimal policy.
Each episode starts from a randomly assigned state, and as long as the number of actions taken is below max_actions_per_episode, data collection for that batch continues.
def train(self):
    self.Q = self.initialize_Q()  # Reinitialize Q-table for each training run

    for episode in range(self.episodes):
        # Start each episode from a random feasible state (alpha_0 + beta_0 <= user_capacity)
        alpha_0 = random.randint(0, self.user_capacity)
        beta_0 = random.randint(0, self.user_capacity - alpha_0)
        state = (alpha_0, beta_0)
        self.batch = []  # Reset the batch at the start of each episode
        action_taken = 0
        while action_taken < self.max_actions_per_episode:
            action = self.choose_action(state)
            next_state, reward = self.simulate_transition_and_reward(state, action)
            self.batch.append((state, action, reward, next_state))  # Collect experience
            state = next_state
            action_taken += 1

        self.update_Q(self.batch)  # Update Q-table using the batch
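The initialize_Q and update_Q helpers are referenced above but not shown in this section. As a rough sketch only (the discount factor self.gamma and the learning rate self.learning_rate are assumed attribute names, not confirmed by the original code), initialize_Q could build a nested dictionary with one entry per feasible (alpha, beta) state and one value per feasible order quantity, while update_Q could apply the standard Q-learning update to every experience in the batch:
def initialize_Q(self):
    # One entry per feasible (alpha, beta) state, with a value for each feasible order size
    Q = {}
    for alpha in range(self.user_capacity + 1):
        for beta in range(self.user_capacity - alpha + 1):
            state = (alpha, beta)
            max_order = self.user_capacity - (alpha + beta)
            Q[state] = {action: 0.0 for action in range(max_order + 1)}
    return Q

def update_Q(self, batch):
    # Standard Q-learning update for each collected experience
    # (self.gamma and self.learning_rate are assumed hyperparameters)
    for state, action, reward, next_state in batch:
        best_next = max(self.Q[next_state].values())
        td_target = reward + self.gamma * best_next
        self.Q[state][action] += self.learning_rate * (td_target - self.Q[state][action])
Once training finishes, the greedy policy for any state can be read off the table with max(self.Q[state], key=self.Q[state].get), the same expression used in the exploitation branch of choose_action.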