There is a lot of hype about Large Language Models nowadays, but it doesn’t mean that old-school ML approaches now deserve extinction. I doubt that ChatGPT will be helpful if you give it a dataset with hundreds of numeric features and ask it to predict a target value.
Neural Networks are usually the best solution in case of unstructured data (for example, texts, images or audio). But, for tabular data, we can still benefit from the good old Random Forest.
The most significant advantages of Random Forest algorithms are the following:
- You only need to do a little data preprocessing.
- It’s rather difficult to screw up with Random Forests. You won’t face overfitting issues if you have enough trees in your ensemble since adding more trees decreases the error.
- It’s easy to interpret results.
That’s why Random Forest could be a good candidate for your first model when starting a new task with tabular data.
In this article, I would like to cover the basics of Random Forests and go through approaches to interpreting model results.
We will learn how to find answers to the following questions:
- What features are important, and which ones are redundant and can be removed?
- How does each feature value affect our target metric?
- What are the factors for each prediction?
- How to estimate the confidence of each prediction?
We will be using the Wine Quality dataset. It shows the relation between wine quality and physicochemical tests for different Portuguese “Vinho Verde” wine variants. We will try to predict wine quality based on wine characteristics.
With decision trees, we don’t need to do a lot of preprocessing:
- We don’t need to create dummy variables since the algorithm can handle it automatically.
- We don’t need to do normalisation or get rid of outliers because only ordering matters. So, Decision Tree based models are resistant to outliers.
However, the scikit-learn implementation of Decision Trees can’t work with categorical variables or Null values, so we have to handle them ourselves.
Fortunately, there are no missing values in our dataset.
df.isna().sum().sum()
0
And we only need to transform the type variable (‘red’ or ‘white’) from string to integer. We can use the pandas Categorical transformation for it.
import pandas as pd

categories = {}
cat_columns = ['type']
for p in cat_columns:
    # convert the column to the pandas Categorical type and remember the mapping
    df[p] = pd.Categorical(df[p])
    categories[p] = df[p].cat.categories

# replace categories with their integer codes
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
print(categories)
{'type': Index(['red', 'white'], dtype='object')}
Now, df['type'] equals 0 for red wines and 1 for white wines.
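If we ever need the original labels back, the saved categories dict lets us reverse the encoding. A minimal illustrative snippet:
# map the integer codes back to the original string labels
decoded = pd.Categorical.from_codes(df['type'], categories['type'])
print(decoded[:5])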
The other crucial part of preprocessing is to split our dataset into train and validation sets. So, we can use a validation set to assess our model’s quality.
import sklearn.model_selection

train_df, val_df = sklearn.model_selection.train_test_split(df, test_size=0.2)
train_X, train_y = train_df.drop(['quality'], axis = 1), train_df.quality
val_X, val_y = val_df.drop(['quality'], axis = 1), val_df.quality
print(train_X.shape, val_X.shape)
(5197, 12) (1300, 12)
We’ve finished the preprocessing step and are ready to move on to the most exciting part — training models.
Before jumping into the training, let’s spend some time understanding how Random Forests work.
Random Forest is an ensemble of Decision Trees. So, we should start with the elementary building block — Decision Tree.
In our example of predicting wine quality, we will be solving a regression task, so let’s start with it.
Decision Tree: Regression
Let’s fit a default decision tree model.
import sklearn.tree
import graphviz

# I've limited max_depth mostly for visualisation purposes
model = sklearn.tree.DecisionTreeRegressor(max_depth=3)
model.fit(train_X, train_y)
One of the most significant advantages of Decision Trees is that we can easily interpret these models — it’s just a set of questions. Let’s visualise it.
dot_data = sklearn.tree.export_graphviz(model, out_file=None,
                                        feature_names=train_X.columns,
                                        filled=True)
graph = graphviz.Source(dot_data)

# saving tree to png file
png_bytes = graph.pipe(format='png')
with open('decision_tree.png', 'wb') as f:
    f.write(png_bytes)
As you can see, the Decision Tree consists of binary splits. At each node, we split our dataset into two parts.
Finally, we calculate the prediction for a leaf node as the average of all data points in that node.
Side note: Because a Decision Tree returns the average of all data points in a leaf node, Decision Trees are pretty bad at extrapolation. So, you need to keep an eye on your feature distributions during training and inference.
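Here is a small illustrative check of this behaviour with the tree we’ve just fitted: even for an unrealistically high alcohol value, the prediction stays within the range of quality values seen in training.
# take one training row and set an alcohol level far beyond the observed range
extreme_row = train_X.iloc[[0]].copy()
extreme_row['alcohol'] = 20.0
# the tree still returns the average of some leaf node,
# so the prediction can't go beyond the target range seen in training
print(model.predict(extreme_row))
print(train_y.min(), train_y.max())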
Let’s brainstorm how to identify the best split for our dataset. We can start with one variable and define the optimal division for it.
Suppose we have a feature with four unique values: 1, 2, 3 and 4. Then, there are three possible thresholds between them.
We can take each threshold in turn and calculate the predicted values for our data as the average value of each leaf node. Then, we can use these predicted values to compute the MSE (Mean Square Error) for each threshold. The best split is the one with the lowest MSE. By default, DecisionTreeRegressor from scikit-learn works similarly and uses MSE as the criterion.
Let’s calculate the best split for the sulphates feature manually to understand better how it works.
def get_binary_split_for_param(param, X, y):
    uniq_vals = list(sorted(X[param].unique()))

    tmp_data = []
    for i in range(1, len(uniq_vals)):
        threshold = 0.5 * (uniq_vals[i-1] + uniq_vals[i])

        # split dataset by threshold
        split_left = y[X[param] <= threshold]
        split_right = y[X[param] > threshold]

        # calculate predicted values for each split
        pred_left = split_left.mean()
        pred_right = split_right.mean()

        num_left = split_left.shape[0]
        num_right = split_right.shape[0]

        mse_left = ((split_left - pred_left) * (split_left - pred_left)).mean()
        mse_right = ((split_right - pred_right) * (split_right - pred_right)).mean()
        mse = mse_left * num_left / (num_left + num_right) \
            + mse_right * num_right / (num_left + num_right)

        tmp_data.append(
            {
                'param': param,
                'threshold': threshold,
                'mse': mse
            }
        )

    return pd.DataFrame(tmp_data).sort_values('mse')
get_binary_split_for_param('sulphates', train_X, train_y).head(5)
| param | threshold | mse |
|:----------|------------:|---------:|
| sulphates | 0.685 | 0.758495 |
| sulphates | 0.675 | 0.758794 |
| sulphates | 0.705 | 0.759065 |
| sulphates | 0.715 | 0.759071 |
| sulphates | 0.635 | 0.759495 |
We can see that for sulphates, the best threshold is 0.685 since it gives the lowest MSE.
Now, we can use this function for all features we have to define the best split overall.
def get_binary_split(X, y):
    tmp_dfs = []
    for param in X.columns:
        tmp_dfs.append(get_binary_split_for_param(param, X, y))

    return pd.concat(tmp_dfs).sort_values('mse')

get_binary_split(train_X, train_y).head(5)
| param | threshold | mse |
|:--------|------------:|---------:|
| alcohol | 10.625 | 0.640368 |
| alcohol | 10.675 | 0.640681 |
| alcohol | 10.85 | 0.641541 |
| alcohol | 10.725 | 0.641576 |
| alcohol | 10.775 | 0.641604 |
We got absolutely the same result as our initial decision tree, with the first split on alcohol <= 10.625.
To build the whole Decision Tree, we could recursively calculate the best splits for each of the datasets alcohol <= 10.625 and alcohol > 10.625 and get the next level of the Decision Tree. Then, repeat.
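As a rough sketch (not how scikit-learn implements it internally), the recursion could look like the code below, reusing the get_binary_split helper defined above and using maximum depth as the stopping criterion. The build_tree name and the nested-dict representation are just for illustration.
def build_tree(X, y, depth=0, max_depth=3):
    # stop if we've reached the maximum depth or the node is pure
    if depth >= max_depth or y.nunique() <= 1:
        return {'prediction': y.mean()}

    # find the best binary split over all features
    best = get_binary_split(X, y).iloc[0]
    mask = X[best['param']] <= best['threshold']

    # recursively build the left and right subtrees
    return {
        'param': best['param'],
        'threshold': best['threshold'],
        'left': build_tree(X[mask], y[mask], depth + 1, max_depth),
        'right': build_tree(X[~mask], y[~mask], depth + 1, max_depth)
    }

tree = build_tree(train_X, train_y, max_depth=2)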
The stopping criteria for recursion could be either the depth or the minimal size of the leaf node. Here’s an example of a Decision Tree with at least 420 items in the leaf nodes.
model = sklearn.tree.DecisionTreeRegressor(min_samples_leaf=420)
model.fit(train_X, train_y)
Let’s calculate the mean absolute error on the validation set to understand how good our model is. I prefer MAE over MSE (Mean Squared Error) because it’s less affected by outliers.
import sklearn.metrics
print(sklearn.metrics.mean_absolute_error(model.predict(val_X), val_y))
0.5890557338155006
Decision Tree: Classification
We’ve looked at the regression example. In the case of classification, it’s a bit different. Even though we won’t go deep into classification examples in this article, it’s still worth discussing its basics.
For classification, instead of the average value, we use the most common class as a prediction for each leaf node.
We usually use the Gini coefficient to estimate the quality of a binary split for classification. Imagine drawing one random item from the sample and then another. The Gini coefficient equals the probability that these two items belong to different classes.
Let’s say we have only two classes, and the share of items from the first class is equal to p. Then we can calculate the Gini coefficient using the following formula:
gini = 1 - p^2 - (1 - p)^2 = 2p(1 - p)
If our classification model is perfect, the Gini coefficient equals 0. In the worst case (p = 0.5), the Gini coefficient equals 0.5.
To calculate the metric for a binary split, we calculate the Gini coefficients for both parts (left and right) and weight them by the number of samples in each partition.
Then, we can similarly calculate our optimisation metric for different thresholds and use the best option.
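Since our wine task is a regression, here is only an illustrative sketch of this per-threshold Gini calculation. It mirrors get_binary_split_for_param but uses a binary target and the Gini coefficient instead of MSE (using the type column as the target here is purely for demonstration).
import pandas as pd

def get_gini_split_for_param(param, X, y):
    # assumes y is a binary target with values 0 and 1
    def gini(labels):
        p = labels.mean()  # share of the positive class
        return 1 - p**2 - (1 - p)**2

    uniq_vals = list(sorted(X[param].unique()))
    tmp_data = []
    for i in range(1, len(uniq_vals)):
        threshold = 0.5 * (uniq_vals[i-1] + uniq_vals[i])
        split_left = y[X[param] <= threshold]
        split_right = y[X[param] > threshold]
        # weight each part's Gini coefficient by its share of samples
        weighted_gini = (gini(split_left) * len(split_left)
                         + gini(split_right) * len(split_right)) / len(y)
        tmp_data.append({'param': param, 'threshold': threshold, 'gini': weighted_gini})
    return pd.DataFrame(tmp_data).sort_values('gini')

# demo: how well does alcohol separate red wines from white ones?
get_gini_split_for_param('alcohol', train_X, train_X['type']).head(5)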
We’ve trained a simple Decision Tree model and discussed how it works. Now, we are ready to move on to the Random Forests.
Random Forests are based on the concept of Bagging. The idea is to fit a bunch of independent models and use an average prediction from them. Since models are independent, errors are not correlated. We assume that our models have no systematic errors, and the average of many errors should be close to zero.
How could we get lots of independent models? It’s pretty straightforward: we can train Decision Trees on random subsets of rows and features. It will be a Random Forest.
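That isn’t exactly how scikit-learn implements it (for example, it also subsamples features at each split), but a minimal sketch of the bagging idea, assuming bootstrap sampling of rows only, could look like this:
import numpy as np
import sklearn.tree
import sklearn.metrics

np.random.seed(42)
trees = []
for _ in range(100):
    # bootstrap sample: draw rows with replacement
    idx = np.random.choice(len(train_X), size=len(train_X), replace=True)
    tree = sklearn.tree.DecisionTreeRegressor(min_samples_leaf=100)
    tree.fit(train_X.iloc[idx], train_y.iloc[idx])
    trees.append(tree)

# the ensemble prediction is the average over all trees
manual_rf_pred = np.mean([tree.predict(val_X) for tree in trees], axis=0)
print(sklearn.metrics.mean_absolute_error(manual_rf_pred, val_y))
In practice, we just use RandomForestRegressor from scikit-learn.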
Let’s train a basic Random Forest with 100 trees and the minimal size of leaf nodes equal to 100.
import sklearn.ensemble
import sklearn.metrics

model = sklearn.ensemble.RandomForestRegressor(100, min_samples_leaf=100)
model.fit(train_X, train_y)
print(sklearn.metrics.mean_absolute_error(model.predict(val_X), val_y))
0.5592536196736408
With random forest, we’ve achieved a much better quality than with one Decision Tree: 0.5592 vs. 0.5891.
Overfitting
A natural question is whether a Random Forest can overfit.
Actually, no. Since we are averaging uncorrelated errors, we cannot overfit the model by adding more trees. Quality improves asymptotically as the number of trees increases.
However, you might face overfitting if you have deep trees and not enough of them. It’s easy to overfit one Decision Tree.
Out-of-bag error
Since each tree in a Random Forest uses only a subset of the rows, we can use the remaining rows to estimate the error. For each row, we can select only the trees that didn’t use it and make predictions with them. Then, we can calculate errors based on these predictions. This approach is called the “out-of-bag error”.
Looking at the numbers below, we can see that the OOB error (0.5571) is much closer to the validation error (0.5593) than the training error (0.5430) is, which means it’s a good approximation.
# we need to specify oob_score = True to be able to calculate the OOB error
model = sklearn.ensemble.RandomForestRegressor(100, min_samples_leaf=100,
                                               oob_score=True)
model.fit(train_X, train_y)
# error for validation set
print(sklearn.metrics.mean_absolute_error(model.predict(val_X), val_y))
0.5592536196736408
# error for training set
print(sklearn.metrics.mean_absolute_error(model.predict(train_X), train_y))
0.5430398596179975
# out-of-bag error
print(sklearn.metrics.mean_absolute_error(model.oob_prediction_, train_y))
0.5571191870008492
As I mentioned in the beginning, the big advantage of Decision Trees is that it’s easy to interpret them. Let’s try to understand our model better.
Feature importances
The calculation of the feature importance is pretty straightforward. We look at each decision tree in the ensemble and each binary split and calculate its impact on our metric (squared_error in our case).
Let’s look at the first split by alcohol for one of our initial decision trees.
Then, we can do the same calculations for all binary splits in all decision trees, add everything up, normalize and get the relative importance for each feature.
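If you’re curious, here is a rough sketch of that computation for a single tree of the ensemble, based on the tree_ attributes exposed by scikit-learn (the manual_feature_importance helper is just for illustration). After normalisation it should match the built-in importances of that tree almost exactly.
import numpy as np

def manual_feature_importance(decision_tree, n_features):
    t = decision_tree.tree_
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node, no split here
            continue
        # impurity decrease of this split, weighted by the number of samples
        importances[t.feature[node]] += (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
    return importances / importances.sum()  # normalise to sum to 1

tree0 = model.estimators_[0]
print(manual_feature_importance(tree0, train_X.shape[1]))
print(tree0.feature_importances_)  # should be (almost) identical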
If you use scikit-learn, you don’t need to calculate feature importance manually. You can just take model.feature_importances_.
import plotly.express as px

def plot_feature_importance(model, names, threshold=None):
    feature_importance_df = pd.DataFrame.from_dict({'feature_importance': model.feature_importances_,
                                                    'feature': names})\
        .set_index('feature').sort_values('feature_importance', ascending=False)

    if threshold is not None:
        feature_importance_df = feature_importance_df[feature_importance_df.feature_importance > threshold]

    fig = px.bar(
        feature_importance_df,
        text_auto='.2f',
        labels={'value': 'feature importance'},
        title='Feature importances'
    )
    fig.update_layout(showlegend=False)
    fig.show()
plot_feature_importance(model, train_X.columns)
We can see that the most important features overall are alcohol and volatile acidity.
Understanding how each feature affects our target metric is exciting and often useful, for example, whether quality increases or decreases with higher alcohol, or whether the relation is more complex.
We could just take the data from our dataset and plot averages by alcohol, but it wouldn’t be correct since there might be correlations. For example, higher alcohol in our dataset could also correspond to higher sugar and better quality.
To estimate the impact of alcohol alone, we can take all rows in our dataset and, using the ML model, predict the quality for each row with different values of alcohol: 9, 9.1, 9.2, etc. Then, we can average the results and get the actual relation between alcohol level and wine quality. This way, all the other data stays the same, and we are only varying the alcohol level.
This approach could be used with any ML model, not only Random Forest.
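Here is a minimal sketch of this procedure for the alcohol feature (the grid of alcohol values below is an arbitrary choice for illustration):
import numpy as np
import pandas as pd

alcohol_grid = np.arange(9, 13, 0.1)
pdp_values = []
for value in alcohol_grid:
    X_tmp = train_X.copy()
    X_tmp['alcohol'] = value  # set the same alcohol level for every row
    # average prediction over the whole dataset for this alcohol level
    pdp_values.append(model.predict(X_tmp).mean())

pdp_df = pd.DataFrame({'alcohol': alcohol_grid, 'avg_prediction': pdp_values})
print(pdp_df.head())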
We can use the sklearn.inspection module to easily plot these relations.
import sklearn.inspection

sklearn.inspection.PartialDependenceDisplay.from_estimator(model, train_X,
                                                           range(12))
We can gain quite a lot of insights from these graphs, for example:
- wine quality increases with the growth of free sulfur dioxide up to 30, but it’s stable after this threshold;
- with alcohol, the higher the level — the better the quality.
We can even look at relations between two variables. It can be pretty complex. For example, if the alcohol level is above 11.5, volatile acidity has no effect. But, for lower alcohol levels, volatile acidity significantly impacts quality.
sklearn.inspection.PartialDependenceDisplay.from_estimator(model, train_X,
                                                           [(1, 10)])
Confidence of predictions
Using Random Forests, we can also assess how confident each prediction is. For that, we could calculate predictions from each tree in the ensemble and look at variance or standard deviation.
import numpy as np

val_df['predictions_mean'] = np.stack([dt.predict(val_X.values)
                                       for dt in model.estimators_]).mean(axis=0)
val_df['predictions_std'] = np.stack([dt.predict(val_X.values)
                                      for dt in model.estimators_]).std(axis=0)

ax = val_df.predictions_std.hist(bins=10)
ax.set_title('Distribution of predictions std')
We can see that there are predictions with a low standard deviation (below 0.15) and ones with a std above 0.3.
If we use the model for business purposes, we can treat such cases differently. For example, we could disregard a prediction if its std is above X, or show the customer an interval instead (e.g. the 25th and 75th percentiles).
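Here is a sketch of how such intervals could be computed from the per-tree predictions (the 25%/75% choice is just an example):
import numpy as np

# predictions of every tree for every row of the validation set
all_tree_preds = np.stack([dt.predict(val_X.values) for dt in model.estimators_])

# per-row 25th and 75th percentiles across the trees
val_df['predictions_p25'] = np.percentile(all_tree_preds, 25, axis=0)
val_df['predictions_p75'] = np.percentile(all_tree_preds, 75, axis=0)
print(val_df[['predictions_mean', 'predictions_p25', 'predictions_p75']].head())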
How was each prediction made?
We can also use the treeinterpreter and waterfallcharts packages to understand how each prediction was made. This could be handy in some business cases, for example, when you need to tell customers why their credit application was rejected.
We will look at one of the wines as an example. It has relatively low alcohol and high volatile acidity.
from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall

row = val_X.iloc[[7]]
prediction, bias, contributions = treeinterpreter.predict(model, row.values)
waterfall(val_X.columns, contributions[0], threshold=0.03,
          rotation_value=45, formatting='{:,.3f}');
The graph shows that this wine is better than average. The main factor that increases quality is a low level of volatile acidity, while the main disadvantage is a low level of alcohol.
So, there are a lot of handy tools that could help you to understand your data and model much better.
The other cool feature of Random Forest is that we could use it to reduce the number of features for any tabular data. You can quickly fit a Random Forest and define a list of meaningful columns in your data.
More data doesn’t always mean better quality. Redundant features can also slow down your model during training and inference.
Since our initial wine dataset has only 12 features, for this case we will use a slightly bigger dataset, Online News Popularity.
Looking at feature importance
First, let’s build a Random Forest and look at feature importances. 34 out of 59 features have an importance lower than 0.01.
Let’s try to remove them and look at accuracy.
low_impact_features = feature_importance_df[feature_importance_df.feature_importance <= 0.01].index.values

train_X_imp = train_X.drop(low_impact_features, axis=1)
val_X_imp = val_X.drop(low_impact_features, axis=1)

model_imp = sklearn.ensemble.RandomForestRegressor(100, min_samples_leaf=100)
model_imp.fit(train_X_imp, train_y)
- MAE on validation set for all features: 2969.73
- MAE on validation set for 25 important features: 2975.61
The difference in quality is not so big, but we could make our model faster in the training and inference stages. We’ve already removed almost 60% of the initial features — good job.
Looking at redundant features
For the remaining features, let’s see whether there are redundant (highly correlated) ones. For that, we will use a Fast.AI tool:
import fastbook
fastbook.cluster_columns(train_X_imp)
We could see that the following features are close to each other:
- self_reference_avg_sharess and self_reference_max_shares
- kw_min_avg and kw_min_max
- n_non_stop_unique_tokens and n_unique_tokens
Let’s remove them as well.
non_uniq_features = ['self_reference_max_shares', 'kw_min_max',
'n_unique_tokens']
train_X_imp_uniq = train_X_imp.drop(non_uniq_features, axis = 1)
val_X_imp_uniq = val_X_imp.drop(non_uniq_features, axis = 1)

model_imp_uniq = sklearn.ensemble.RandomForestRegressor(100,
min_samples_leaf=100)
model_imp_uniq.fit(train_X_imp_uniq, train_y)
sklearn.metrics.mean_absolute_error(model_imp_uniq.predict(val_X_imp_uniq),
val_y)
2974.853274034488
The quality even improved a little bit. So, we’ve reduced the number of features from 59 to 22 and increased the error by only 0.17%. It proves that this approach works.
You can find the full code on GitHub.
In this article, we’ve discussed how Decision Tree and Random Forest algorithms work. Also, we’ve learned how to interpret Random Forests:
- How to use feature importance to get the list of the most significant features and reduce the number of parameters in your model.
- How to define the effect of each feature value on the target metric using partial dependence.
- How to estimate the impact of different features on each prediction using the treeinterpreter library.
Thank you very much for reading this article. I hope it was insightful for you. If you have any follow-up questions or comments, please leave them in the comments section.
Datasets
- Cortez, Paulo, Cerdeira, A., Almeida, F., Matos, T., and Reis, J. (2009). Wine Quality. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T
- Fernandes, Kelwin, Vinagre, Pedro, Cortez, Paulo, and Sernadela, Pedro. (2015). Online News Popularity. UCI Machine Learning Repository. https://doi.org/10.24432/C5NS3V
Sources
This article was inspired by Fast.AI Deep Learning Course