## REGRESSION ALGORITHM

Decision Trees aren’t limited to categorizing data — they’re equally good at predicting numerical values! Classification trees often steal the spotlight, but Decision Tree Regressors (or Regression Trees) are powerful and versatile tools in the world of continuous variable prediction.

While we’ll discuss the mechanics of regression tree construction (which are mostly similar to those of classification trees), here we’ll also go beyond the *pre*-pruning methods like “minimal sample leaf” and “max tree depth” introduced in the classifier article. We’ll explore the most common *post*-pruning method, **cost-complexity pruning**, which introduces a complexity parameter into the decision tree’s cost function.

A Decision Tree for regression is a model that predicts numerical values using a tree-like structure. It splits data based on key features, starting from a root question and branching out. Each node asks about a feature, dividing data further until reaching leaf nodes with final predictions. To get a result, you follow the path matching your data’s features from root to leaf.

To demonstrate our concepts, we’ll work with our standard dataset. This dataset is used to predict the number of golfers visiting on a given day and includes variables like weather outlook, temperature, humidity, and wind conditions.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temp.': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humid.': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='')

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
```

The Decision Tree for regression operates by recursively dividing the data based on features that best reduce prediction error. Here’s the general process:

- Begin with the entire dataset at the root node.
- Choose the feature that minimizes a specific error metric (such as mean squared error or variance) to split the data.
- Create child nodes based on the split, where each child represents a subset of the data aligned with the corresponding feature values.
- Repeat steps 2–3 for each child node, continuing to split the data until a stopping condition is reached.
- Assign a final predicted value to each leaf node, typically **the average of the target values** in that node.

We will focus on the regression part of the CART algorithm (Classification and Regression Trees). It builds binary trees and typically follows these steps:

1. Begin with all training samples in the root node.
2. For each feature in the dataset:
   a. Sort the feature values in ascending order.
   b. Consider all midpoints between adjacent values as potential split points.
3. For each potential split point:
   a. Calculate the mean squared error (MSE) of the current node.
   b. Compute the weighted average of errors for the resulting split.
4. After evaluating all features and split points, select the one with the lowest weighted average of MSE.
5. Create two child nodes based on the chosen feature and split point:
   – Left child: samples with feature value <= split point
   – Right child: samples with feature value > split point
6. Recursively repeat steps 2–5 for each child node. (Continue until a stopping criterion is met.)
7. At each leaf node, assign the average target value of the samples in that node as the prediction.
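The split-search loop in steps 2–4 can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn’s optimized implementation; `best_split`, `X_demo`, and `y_demo` are names invented for this sketch:

```python
import numpy as np

def best_split(X, y):
    """Find the (feature, threshold) pair with the lowest weighted MSE."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, weighted MSE)
    for j in range(n_features):
        values = np.sort(np.unique(X[:, j]))
        # Candidate thresholds: midpoints between adjacent distinct values
        for thresh in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= thresh], y[X[:, j] > thresh]
            # np.var of a node's targets is its MSE around the node mean;
            # weight each child's MSE by its sample count
            mse = (len(left) * np.var(left) + len(right) * np.var(right)) / n_samples
            if mse < best[2]:
                best = (j, thresh, mse)
    return best

# Toy example: one feature where a split at 5.0 separates the targets perfectly
X_demo = np.array([[1.0], [2.0], [8.0], [9.0]])
y_demo = np.array([10.0, 10.0, 30.0, 30.0])
feat, thresh, err = best_split(X_demo, y_demo)
print(feat, thresh, err)  # 0 5.0 0.0
```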

```python
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

# Train the model
regr = DecisionTreeRegressor(random_state=42)
regr.fit(X_train, y_train)

# Visualize the decision tree
plt.figure(figsize=(26, 8))
plot_tree(regr, feature_names=X.columns, filled=True, rounded=True, impurity=False, fontsize=16, precision=2)
plt.tight_layout()
plt.show()
```

Here’s how a regression tree makes predictions for new data:

1. Start at the top (root) of the tree.
2. At each decision point (node):
   – Look at the feature and split value.
   – If the data point’s feature value is smaller or equal, go left.
   – If it’s larger, go right.
3. Keep moving down the tree until you reach the end (a leaf).
4. The prediction is the average value stored in that leaf.
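Under the hood, scikit-learn stores a fitted tree in flat arrays (`tree_.feature`, `tree_.threshold`, `tree_.children_left`, `tree_.children_right`). The traversal above can be made explicit on a toy tree; `predict_one` is a hypothetical helper written for this sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Fit a tiny tree on toy data (stand-in for the golf dataset above)
X = np.array([[1.0], [2.0], [8.0], [9.0]])
y = np.array([10.0, 12.0, 30.0, 32.0])
regr = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X, y)

def predict_one(tree, x):
    """Walk from the root to a leaf, then return the leaf's stored mean."""
    t = tree.tree_
    node = 0
    while t.children_left[node] != -1:   # -1 marks a leaf node
        if x[t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]   # go left on <= threshold
        else:
            node = t.children_right[node]  # go right otherwise
    return t.value[node][0][0]             # mean target of the leaf

print(predict_one(regr, np.array([1.5])))  # matches regr.predict
```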

## Evaluation Step

After building the tree, the main remaining concern is how to make the tree smaller to prevent overfitting. In general, pruning methods fall into two categories:

## Pre-pruning

Pre-pruning, also known as early stopping, involves halting the growth of a decision tree during the training process based on certain predefined criteria. This approach aims to prevent the tree from becoming too complex and overfitting the training data. Common pre-pruning techniques include:

- **Maximum depth**: Limiting how deep the tree can grow.
- **Minimum samples for split**: Requiring a minimum number of samples to justify splitting a node.
- **Minimum samples per leaf**: Ensuring each leaf node has at least a certain number of samples.
- **Maximum number of leaf nodes**: Restricting the total number of leaf nodes in the tree.
- **Minimum impurity decrease**: Only allowing splits that decrease impurity by at least a specified amount.

These methods stop the tree’s growth when the specified conditions are met, effectively “pruning” the tree during its construction phase.

(We have discussed these methods before; they work exactly the same way in the regression case.)
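In scikit-learn, each of these criteria corresponds directly to a `DecisionTreeRegressor` constructor parameter. A quick sketch with illustrative values (the toy data below exists only to have something to fit):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Each pre-pruning criterion is a constructor parameter (values are illustrative)
regr = DecisionTreeRegressor(
    max_depth=3,                # maximum depth
    min_samples_split=4,        # minimum samples to split a node
    min_samples_leaf=2,         # minimum samples per leaf
    max_leaf_nodes=8,           # maximum number of leaf nodes
    min_impurity_decrease=0.1,  # minimum impurity decrease to allow a split
    random_state=42,
)

# Fit on toy data to see the constraints in action
X = np.arange(10, dtype=float).reshape(-1, 1)
y = X.ravel() ** 2
regr.fit(X, y)
print(regr.get_depth(), regr.get_n_leaves())  # never exceeds the caps above
```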

## Post-pruning

Post-pruning, on the other hand, allows the decision tree to grow to its full extent and then prunes it back to reduce complexity. This approach first builds a complete tree and then removes or collapses branches that don’t significantly contribute to the model’s performance. One common post-pruning technique is called **Cost-Complexity Pruning.**

## Step 1: Calculate the Impurity for Each Node

For each internal node, calculate its impurity (MSE, in the regression case). We then sort these values from lowest to highest.

```python
# Visualize the decision tree, this time showing the impurity (MSE) at each node
plt.figure(figsize=(26, 8))
plot_tree(regr, feature_names=X.columns, filled=True, rounded=True, impurity=True, fontsize=16, precision=2)
plt.tight_layout()
plt.show()
```

## Step 2: Create Subtrees by Trimming The Weakest Link

The goal is to gradually turn the internal nodes into leaves, starting from the **node with the lowest MSE** (= the weakest link). We can construct a pruning path based on that ordering.

## Step 3: Calculate Total Leaf Impurities for Each Subtree

For each subtree *T*, the total leaf impurity *R*(*T*) can be calculated as:

*R*(*T*) = (1/*N*) Σ *I*(*L*) · *n_L*

where:

- *L* ranges over all leaf nodes of *T*
- *n_L* is the number of samples in leaf *L*
- *N* is the total number of samples in the tree
- *I*(*L*) is the impurity (MSE) of leaf *L*
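As a sanity check, *R*(*T*) can be computed from a fitted tree’s `tree_` arrays; since each leaf predicts its mean, *R*(*T*) also equals the tree’s training MSE. A sketch on toy data (the data and variable names are invented for this illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Fit a small tree on toy data with three rough clusters of target values
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([3.0, 4.0, 3.5, 10.0, 11.0, 10.5, 20.0, 21.0])
regr = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

t = regr.tree_
leaves = t.children_left == -1      # boolean mask of leaf nodes
n_L = t.n_node_samples[leaves]      # samples per leaf
I_L = t.impurity[leaves]            # MSE per leaf
N = t.n_node_samples[0]             # total samples (root node)
R_T = (I_L * n_L).sum() / N         # R(T) = (1/N) Σ I(L)·n_L
print(R_T)
```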

## Step 4: Compute the Cost Function

To decide when to stop turning internal nodes into leaves, we first compute the cost complexity of each subtree *T* using the following formula:

Cost(*T*) = *R*(*T*) + *α* · |*T*|

where:

- *R*(*T*) is the total leaf impurity
- |*T*| is the number of leaf nodes in the subtree
- *α* is the complexity parameter

## Step 5: Select the Alpha

The value of *α* controls which subtree we end up with: the **subtree with the lowest cost will be the final tree**.

While we can set *α* freely, scikit-learn can also report the smallest value of *α* that yields each particular subtree. This is called the **effective α**.

```python
# Compute the cost-complexity pruning path
tree = DecisionTreeRegressor(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
effective_alphas, impurities = path.ccp_alphas, path.impurities

# Function to count leaf nodes
count_leaves = lambda tree: sum(tree.tree_.children_left[i] == tree.tree_.children_right[i] == -1 for i in range(tree.tree_.node_count))

# Train trees and count leaves for each complexity parameter
leaf_counts = [count_leaves(DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)) for alpha in effective_alphas]

# Create DataFrame with analysis results
pruning_analysis = pd.DataFrame({
    'total_leaf_impurities': impurities,
    'leaf_count': leaf_counts,
    'cost_function': [f"{imp:.3f} + {leaves}α" for imp, leaves in zip(impurities, leaf_counts)],
    'effective_α': effective_alphas
})
print(pruning_analysis)
```

## Final Remarks

Pre-pruning methods are generally faster and more memory-efficient, as they prevent the tree from growing too large in the first place.

Post-pruning can potentially create more optimal trees, as it considers the entire tree structure before making pruning decisions. However, it can be more computationally expensive.

Both approaches aim to find a balance between model complexity and performance, with the goal of creating a model that generalizes well to unseen data. The choice between pre-pruning and post-pruning (or a combination of both) often depends on the specific dataset, the problem at hand, and of course, computational resources available.

In practice, it’s common to use a combination of these methods, like applying some pre-pruning criteria to prevent excessively large trees, and then using post-pruning for fine-tuning the model’s complexity.
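As a rough sketch of that combination, one could cap the depth up front (pre-pruning) and then tune `ccp_alpha` by cross-validation (post-pruning); the grid values and toy data below are illustrative, not recommendations:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Toy data standing in for a real training set
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))
y = X[:, 0] * 3 + rng.normal(0, 1, size=100)

# Pre-pruning (max_depth) plus post-pruning (ccp_alpha), tuned together
search = GridSearchCV(
    DecisionTreeRegressor(max_depth=5, random_state=42),
    param_grid={'ccp_alpha': [0.0, 0.01, 0.1, 1.0]},
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(X, y)
print(search.best_params_)
```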

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Initialize Decision Tree Regressor
tree = DecisionTreeRegressor(random_state=42)

# Get the cost complexity path, impurities, and effective alphas
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas)
print(impurities)

# Train the final tree with the chosen alpha
final_tree = DecisionTreeRegressor(random_state=42, ccp_alpha=0.1)
final_tree.fit(X_train, y_train)

# Make predictions
y_pred = final_tree.predict(X_test)

# Calculate and print RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
```