The current state is represented by a tuple (alpha, beta), where alpha is the current on-hand inventory (items in stock) and beta is the current on-order inventory (items ordered but not yet received). The total initial inventory, init_inv, is simply the sum of alpha and beta.
Next, we simulate customer demand by drawing from a Poisson distribution with rate self.poisson_lambda. This draw captures the randomness of customer demand:
alpha, beta = state  # on-hand and on-order inventory
init_inv = alpha + beta  # total inventory before demand arrives
demand = np.random.poisson(self.poisson_lambda)  # sample random customer demand
Note: the Poisson distribution is used here to model demand, a common choice for random events such as customer arrivals. However, we could just as well train the agent on historical demand data, or through live interaction with the environment in real time. At its core, reinforcement learning is about learning from data, and it does not require prior knowledge of a model of the environment.
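To illustrate the historical-data alternative, here is a minimal sketch that bootstraps demand by resampling from an array of past daily demands. The array historical_demands and its values are purely hypothetical and not part of the original code:
import numpy as np
# Hypothetical record of past daily demands (for illustration only)
historical_demands = np.array([2, 0, 3, 1, 4, 2, 2, 5, 1, 3])
# Bootstrap one demand observation instead of sampling from a parametric Poisson model
demand = np.random.choice(historical_demands)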
Now, the “next alpha”, i.e., the on-hand inventory after demand is served, can be written as max(0, init_inv - demand). In other words, if demand exceeds the initial inventory, the new alpha is zero; otherwise, it is init_inv - demand.
The cost comes in two parts. The holding cost is calculated by multiplying the number of bikes left in the store by the per-unit holding cost. The second part is the stockout cost, the penalty we pay for each unit of missed demand. Together (with a negative sign) they form the “reward” that we try to maximize with reinforcement learning; put differently, we want to minimize the cost, so we maximize the reward.
new_alpha = max(0, init_inv - demand)  # on-hand inventory after demand is served
holding_cost = -new_alpha * self.holding_cost  # cost of carrying leftover bikes
stockout_cost = 0
if demand > init_inv:
    stockout_cost = -(demand - init_inv) * self.stockout_cost  # penalty for missed demand
reward = holding_cost + stockout_cost
next_state = (new_alpha, action)  # the order placed now becomes next period's beta
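To make the transition concrete, suppose the per-unit holding cost is 1 and the per-unit stockout cost is 5 (illustrative numbers, not taken from the original). With state (2, 3) and a sampled demand of 4, init_inv is 5, new_alpha is max(0, 5 - 4) = 1, the holding cost is -1, the stockout cost is 0, and the reward is -1. If demand had been 7 instead, new_alpha would be 0 and the reward would be 0 + (-(7 - 5) * 5) = -10.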
Exploration vs. Exploitation in Q-Learning
Choosing an action in Q-learning involves some degree of exploration, so that the agent gets a picture of the Q values across the states in the Q table. To do that, each time an action is chosen there is an epsilon (ϵ) chance that we explore and select an action at random, whereas with probability 1 - ϵ we exploit and take the best action according to the Q table.
def choose_action(self, state):
    # Epsilon-greedy action selection
    if np.random.rand() < self.epsilon:
        # Explore: order a random feasible quantity (0 up to the remaining capacity)
        return np.random.choice(self.user_capacity - (state[0] + state[1]) + 1)
    else:
        # Exploit: pick the action with the highest Q value for this state
        return max(self.Q[state], key=self.Q[state].get)
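Note that np.random.choice(n) samples uniformly from 0 to n - 1, so the exploration branch draws a feasible order quantity between 0 and the remaining capacity. For example, with a hypothetical user_capacity of 10 and state (3, 2), the agent can order anywhere from 0 to 5 bikes, and np.random.choice(6) picks one of those six options at random.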
Training the RL Agent
Training the RL agent is done by the “train” function, which works as follows. First, we initialize Q (an empty dictionary structure). Then, experiences are collected within each batch (self.batch.append((state, action, reward, next_state))), and the Q table is updated at the end of each batch (self.update_Q(self.batch)). The number of actions taken in each episode is capped at “max_actions_per_episode”, and the number of episodes is the number of times the agent interacts with the environment to learn the optimal policy.
Each episode starts from a randomly assigned state, and as long as the number of actions taken is below max_actions_per_episode, data collection for that batch continues.
def train(self):
    self.Q = self.initialize_Q()  # Reinitialize Q-table for each training run

    for episode in range(self.episodes):
        # Start each episode from a random feasible state (alpha_0 + beta_0 <= user_capacity)
        alpha_0 = random.randint(0, self.user_capacity)
        beta_0 = random.randint(0, self.user_capacity - alpha_0)
        state = (alpha_0, beta_0)
        self.batch = []  # Reset the batch at the start of each episode
        action_taken = 0
        while action_taken < self.max_actions_per_episode:
            action = self.choose_action(state)
            next_state, reward = self.simulate_transition_and_reward(state, action)
            self.batch.append((state, action, reward, next_state))  # Collect experience
            state = next_state
            action_taken += 1

        self.update_Q(self.batch)  # Update Q-table using the batch
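The initialize_Q and update_Q helpers are referenced above but not shown in this section. As a rough sketch only (the discount factor self.gamma and the learning rate self.learning_rate are assumed attribute names, not confirmed by the original code), initialize_Q could build a nested dictionary with one entry per feasible (alpha, beta) state and one value per feasible order quantity, while update_Q could apply the standard Q-learning update to every experience in the batch:
def initialize_Q(self):
    # One entry per feasible (alpha, beta) state, with a value for each feasible order size
    Q = {}
    for alpha in range(self.user_capacity + 1):
        for beta in range(self.user_capacity - alpha + 1):
            state = (alpha, beta)
            max_order = self.user_capacity - (alpha + beta)
            Q[state] = {action: 0.0 for action in range(max_order + 1)}
    return Q

def update_Q(self, batch):
    # Standard Q-learning update for each collected experience
    # (self.gamma and self.learning_rate are assumed hyperparameters)
    for state, action, reward, next_state in batch:
        best_next = max(self.Q[next_state].values())
        td_target = reward + self.gamma * best_next
        self.Q[state][action] += self.learning_rate * (td_target - self.Q[state][action])
Once training finishes, the greedy policy for any state can be read off the table with max(self.Q[state], key=self.Q[state].get), the same expression used in the exploitation branch of choose_action.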