In my previous article, I conducted a thorough analysis of the most popular strategies for tackling the dynamic pricing problem using simple Multi-armed Bandits. If you’ve come here from that piece, firstly, thank you. It’s by no means an easy read, and I truly appreciate your enthusiasm for the subject. Secondly, get ready, as this new article promises to be even more demanding. However, if this is your introduction to the topic, I strongly advise beginning with the previous article: it presents the foundational concepts that I’ll assume you’re familiar with throughout this discussion.
Anyway, a brief recap: the prior analysis aimed to simulate a dynamic pricing scenario. The main goal was to evaluate various price points as quickly as possible and identify the one yielding the highest cumulative reward. We explored four distinct algorithms: greedy, ε-greedy, Thompson Sampling, and UCB1, detailing the strengths and weaknesses of each. Although the methodology employed in that article is theoretically sound, it relies on simplifications that don’t hold up in more complex, real-world situations. The most problematic of these is the assumption that the underlying process is stationary, meaning the optimal price remains constant regardless of the external environment. This is clearly not the case. Consider, for example, fluctuations in demand during holiday seasons, sudden shifts in competitor pricing, or changes in raw material costs.
To address this limitation, Contextual Bandits come into play. They extend the Multi-armed Bandit problem so that the decision-making agent not only receives a reward for each action (or “arm”) but also has access to context, that is, environment-related information, before choosing an arm. The context can be any piece of information that might influence the outcome, such as customer demographics or external market conditions.
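To make the idea concrete, below is a minimal sketch of one common contextual bandit algorithm, LinUCB in its disjoint form, where each price point keeps its own linear model that predicts reward from the context vector. This is an illustrative example rather than the implementation used in the previous article; the class and method names, the toy context features, and the exploration parameter `alpha` are all my own assumptions.

```python
import numpy as np

class LinUCB:
    """Minimal disjoint LinUCB agent: one linear reward model per arm (price point)."""

    def __init__(self, n_arms, n_features, alpha=1.0):
        self.alpha = alpha  # controls how aggressively the agent explores
        # Ridge-regression sufficient statistics for each arm: A = X^T X + I, b = X^T r
        self.A = [np.eye(n_features) for _ in range(n_arms)]
        self.b = [np.zeros(n_features) for _ in range(n_arms)]

    def select_arm(self, context):
        """Return the arm with the highest upper confidence bound for this context."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                        # current estimate of the reward coefficients
            mean = context @ theta                   # predicted reward for this arm
            bonus = self.alpha * np.sqrt(context @ A_inv @ context)  # uncertainty bonus
            scores.append(mean + bonus)
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        """Incorporate the observed reward into the chosen arm's statistics."""
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context


# Illustrative usage: 5 candidate prices, a 3-dimensional context
agent = LinUCB(n_arms=5, n_features=3)
context = np.array([1.0, 0.3, 0.7])   # e.g. [bias term, seasonality index, competitor price signal]
arm = agent.select_arm(context)        # pick a price given the current context
reward = 1.0                           # observed outcome, e.g. revenue from that interaction
agent.update(arm, context, reward)
```

The loop is always the same: build the current context vector, ask the agent for a price, observe the resulting reward, and feed it back so the per-arm models improve over time.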
Here’s how they work: before deciding which arm to pull (or, in our case, which price to set), the agent observes the current…