Reinforcement Learning: The Markov Decision Process for feature selection
It has been demonstrated that reinforcement learning (RL) techniques can be very efficient for problems like game solving. The concept of RL is based on the Markov Decision Process (MDP). The point here is not to define the MDP in depth but to give a general idea of how it works and how it can be useful for our problem.
The naive idea behind RL is that an agent starts in an unknown environment. This agent has to take actions to complete a task. Depending on its current state and the actions it has selected previously, the agent will be more inclined to choose certain actions. At every new state reached and action taken, the agent receives a reward. Here are the main elements that we need to define for the feature selection problem:
- What is a state?
- What is an action?
- What are the rewards?
- How do we choose an action?
Firstly, a state is simply a subset of the features present in the data set. For example, if the data set has three features (Age, Gender, Height) plus one label, the possible states are:
[] --> Empty set
[Age], [Gender], [Height] --> 1-feature set
[Age, Gender], [Gender, Height], [Age, Height] --> 2-feature set
[Age, Gender, Height] --> All-feature set
Within a state, the order of the features does not matter; the reason will be explained a little later in the article. We have to consider it as a set of features, not a list.
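To make this representation concrete, here is a minimal sketch in Python. The feature names come from the example above, while the use of frozenset is my own choice for an unordered, hashable state; it is not imposed by the method itself.

```python
from itertools import combinations

features = ["Age", "Gender", "Height"]

# A state is an unordered subset of features, so a frozenset is a natural fit:
# it ignores ordering and is hashable, which lets us use it later as a cache key.
all_states = [
    frozenset(subset)
    for size in range(len(features) + 1)
    for subset in combinations(features, size)
]

print(len(all_states))   # 8 states: [], 3 singletons, 3 pairs, and the full set
```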
Concerning the actions, from a subset we can move to any subset that has exactly one more not-yet-selected feature than the current state. In the feature selection problem, an action therefore consists of selecting a feature that is not yet in the current state and adding it to form the next state. Here is a sample of possible actions:
[Age] -> [Age, Gender]
[Gender, Height] -> [Age, Gender, Height]
Here are examples of impossible actions:
[Age] -> [Age, Gender, Height]
[Age, Gender] -> [Age]
[Gender] -> [Gender, Gender]
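This rule can be encoded in a small helper; the function name and signature below are illustrative, not taken from any particular library.

```python
def possible_actions(current_state, all_features):
    """Valid actions: add exactly one feature that is not yet in the current state."""
    return [
        (feature, current_state | {feature})   # (chosen feature, resulting next state)
        for feature in all_features
        if feature not in current_state
    ]

# From [Age], the only reachable states are [Age, Gender] and [Age, Height]:
print(possible_actions(frozenset({"Age"}), ["Age", "Gender", "Height"]))
```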
We have defined the states and the actions, but not the reward. The reward is a real number used to evaluate the quality of a state. For example, if a robot trying to reach the exit of a maze chooses as its next action to move towards the exit, the reward associated with this action will be “good”. If it chooses as its next action to step into a trap, the reward will be “not good”. The reward is a value that brings information about the quality of the previous action taken.
In the feature selection problem, an interesting reward is the accuracy that a new feature adds to the model. Here is an example of how the reward is computed:
[Age] --> Accuracy = 0.65
[Age, Gender] --> Accuracy = 0.76
Reward(Gender) = 0.76 - 0.65 = 0.11
For each state visited for the first time, a classifier is trained on its set of features. The resulting accuracy is stored in the state, so the training of the classifier, which is very costly, only happens once, even if the state is reached again later. The classifier does not consider the order of the features; this is why we can see this problem as a graph and not a tree. In this example, the reward of the action of selecting Gender as a new feature for the model is the difference between the accuracy of the next state and that of the current state.
On the graph above, each feature has been mapped to a number (i.e. “Age” is 1, “Gender” is 2 and “Height” is 3). It is entirely possible to maximise other metrics to find the optimal set; in many business applications, recall matters more than accuracy.
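Here is a possible sketch of the reward computation and the accuracy cache, assuming the data lives in a pandas DataFrame X with named columns and a label vector y. The choice of RandomForestClassifier and of cross-validated accuracy is mine; the scoring could just as well be recall, as noted above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

accuracy_cache = {}   # frozenset of features -> accuracy, so each state is evaluated once

def state_accuracy(state, X, y):
    """Train and evaluate a classifier on the given feature subset, caching the result."""
    if state not in accuracy_cache:
        if not state:                      # empty feature set: nothing to train on
            accuracy_cache[state] = 0.0
        else:
            clf = RandomForestClassifier(random_state=0)
            accuracy_cache[state] = cross_val_score(clf, X[list(state)], y, cv=3).mean()
    return accuracy_cache[state]

def reward(current_state, next_state, X, y):
    """Reward of an action = accuracy gained by the feature added between the two states."""
    return state_accuracy(next_state, X, y) - state_accuracy(current_state, X, y)
```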
The next important question is how we select the next state from the current state, in other words how we explore our environment. We have to find an efficient way to do this, since it can quickly become a very complex problem. Indeed, if we naively explored every possible way of building the feature set one feature at a time in a problem with 10 features, the number of states to consider would be
10! + 2 = 3 628 802 possible states
The +2 is because we also count the empty state and the state that contains all the features. With this naive approach, we would have to train a model on every one of these states to find the set of features that maximises the accuracy. With the RL approach, we neither have to visit every state nor retrain a model each time we return to an already visited state.
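As a quick check on these numbers, the snippet below reproduces the count from the text and, for comparison, the number of distinct feature subsets once the order of selection is ignored; the second figure is my own addition, meant only to show why the set-based view combined with the accuracy cache pays off.

```python
from math import factorial

n_features = 10

# Naive exploration that distinguishes the order in which features are added,
# plus the empty state and the full state:
print(factorial(n_features) + 2)   # 3628802

# Distinct states once the order is ignored (sets, not lists); each of them is
# trained at most once thanks to the accuracy cache:
print(2 ** n_features)             # 1024
```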
We also had to determine some stop conditions for this problem; they are detailed later. For now, the epsilon-greedy state selection has been chosen. The idea is that, from the current state, we select the next action randomly with a probability epsilon (between 0 and 1, often around 0.2), and otherwise select the action that maximises a function. For feature selection, that function is the average reward that each feature has brought to the accuracy of the model.
The epsilon-greedy algorithm involves two phases:
- A random phase: with probability epsilon, we select the next state randomly among the possible neighbours of the current state (we can imagine either a uniform or a softmax selection).
- A greedy phase: we select the next state such that the feature added to the current state brings the maximal contribution to the model’s accuracy. To reduce the time complexity, we initialise a list containing this value for each feature, and it is updated every time a feature is chosen. The update is very efficient thanks to the following incremental formula (a code sketch of it is given after the list of terms):
- AORf: average reward brought by the feature “f”
- k: number of times that “f” has been selected
- V(F): value of the state associated with the set of features F (not detailed in this article for clarity)
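Assuming the update is the usual incremental (running) average of the state values observed each time a feature is selected, a minimal sketch of this bookkeeping and of the epsilon-greedy choice could look like this; all names are illustrative.

```python
import random

def update_aor(aor, counts, feature, state_value):
    """Running average: AOR_f <- (k * AOR_f + V(F)) / (k + 1), then k <- k + 1."""
    k = counts.get(feature, 0)
    aor[feature] = (k * aor.get(feature, 0.0) + state_value) / (k + 1)
    counts[feature] = k + 1

def choose_next_feature(current_state, all_features, aor, epsilon=0.2):
    """Epsilon-greedy selection of the feature to add to the current state."""
    candidates = [f for f in all_features if f not in current_state]
    if random.random() < epsilon:
        return random.choice(candidates)                      # random phase
    return max(candidates, key=lambda f: aor.get(f, 0.0))     # greedy phase
```

Keeping the update incremental means each greedy step only needs a constant-time lookup and update per feature.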
The overall idea is to find which feature has brought the most accuracy to the model. That is why we need to browse different states: it lets us evaluate, across many different contexts, a globally reliable estimate of how valuable a feature is for the model.
Finally, I will detail the two stop conditions. Since the goal is to minimise the number of states the algorithm visits, we need to be careful about them. The fewer never-before-visited states we reach, the fewer models we have to train on different sets of features. Training the model to get the accuracy is the most costly phase in terms of time and computing power.
- The algorithm stops, in any case, at the final state, which is the set containing all the features. We want to avoid reaching this state, since it is the most expensive one to train a model on.
- It also stops browsing the graph if the values of successively visited states keep degrading. A threshold has been set: once the number of successive degradations exceeds the square root of the total number of features in the dataset, it stops exploring. Both conditions are sketched in code below.
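Here is a hedged sketch of these two checks, assuming we keep the list of values of the states visited so far; the square-root threshold follows the rule described above, and the function name is illustrative.

```python
import math

def should_stop(current_state, all_features, visited_values):
    """Stop when the full feature set is reached, or when the state values have
    degraded for ~sqrt(number of features) successive visits."""
    if len(current_state) == len(all_features):
        return True
    patience = math.isqrt(len(all_features))   # threshold ~ sqrt of total feature count
    degradations = 0
    for previous, current in zip(visited_values, visited_values[1:]):
        # count the length of the trailing run of successive degradations
        degradations = degradations + 1 if current < previous else 0
    return degradations >= patience
```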
Now that the modelling of the problem has been explained, we will detail the implementation in Python.