Data science practitioners encounter numerous challenges when handling diverse data types across various projects, each demanding unique processing methods. A common obstacle is working with data formats that traditional machine learning models struggle to process effectively, resulting in subpar model performance. Since most machine learning algorithms are optimized for numerical data, transforming categorical data into numerical form is essential. However, this often oversimplifies complex categorical relationships, especially when the feature have high cardinality — meaning a large number of unique values — which complicates processing and impedes model accuracy.
High cardinality refers to the number of unique elements within a feature, specifically addressing the distinct count of categorical labels in a machine learning context. When a feature has many unique categorical labels, it has high cardinality, which can complicate model processing. To make categorical data usable in machine learning, these labels are often converted to numerical form using encoding methods based on data complexity. One popular method is One-Hot Encoding, which assigns each unique label a distinct binary vector. However, with high-cardinality data, One-Hot Encoding can dramatically increase dimensionality, leading to complex, high-dimensional datasets that require significant computational capacity for model training and potentially slow down performance.
Consider a dataset with 2,000 unique IDs, each ID linked to one of only three countries. In this case, while the ID feature has a cardinality of 2,000 (since each ID is unique), the country feature has a cardinality of just 3. Now, imagine a feature with 100,000 categorical labels that must be encoded using One-Hot Encoding. This would create an extremely high-dimensional dataset, leading to inefficiency and significant resource consumption.
A widely adopted solution among data scientists is K-Fold Target Encoding. This encoding method helps reduce feature cardinality by replacing categorical labels with target-mean values, based on K-Fold cross-validation. By focusing on individual data patterns, K-Fold Target Encoding lowers the risk of overfitting, helping the model learn specific relationships within the data rather than overly general patterns that can harm model performance.
K-Fold Target Encoding involves dividing the dataset into several equally-sized subsets, known as “folds,” with “K” representing the number of these subsets. By folding the dataset into multiple groups, this method calculates the cross-subset weighted mean for each categorical label, enhancing the encoding’s robustness and reducing overfitting risks.
Using an example from Fig 1. of a sample dataset of Indonesian domestic flights emissions for each flight cycle, we can put this technique into practice. The base question to ask with this dataset is “What is the weighted mean for each categorical labels in ‘Airlines’ by looking at feature ‘HC Emission’ ?”. However, you might come with the same question people been asking me about. “But, if you just calculated them using the targeted feature, couldn’t it result as another high cardinality feature?”. The simple answer is “Yes, it could”.
Why?
In cases where a large dataset has a highly random target feature without identifiable patterns, K-Fold Target Encoding might produce a wide variety of mean values for each categorical label, potentially preserving high cardinality rather than reducing it. However, the primary goal of K-Fold Target Encoding is to address high cardinality, not necessarily to reduce it drastically. This method works best when there is a meaningful correlation between the target feature and segments of the data within each categorical label.
How does K-Fold Target Encoding operate? The simplest way to explain this is that, in each fold, you calculate the mean of the target feature from the other folds. This approach provides each categorical label with a unique weight, represented as a numerical value, making it more informative. Let’s look at an example calculation using our dataset for a clearer understanding.
To calculate the weight of the ‘AirAsia’ label for the first observation, start by splitting the data into multiple folds, as shown in Fig 2. You can assign folds manually to ensure equal distribution, or automate this process using the following sample code:
import seaborn as sns
import matplotlib.pyplot as plt# In order to split our data into several parts equally lets assign KFold numbers to each of the data randomly.
# Calculate the number of samples per fold
num_samples = len(df) // 8
# Assign fold numbers
df['kfold'] = np.repeat(np.arange(1, 9), num_samples)
# Handle any remaining samples (if len(df) is not divisible by 8)
remaining_samples = len(df) % 8
if remaining_samples > 0:
df.loc[-remaining_samples:, 'kfold'] = np.arange(1, remaining_samples + 1)
# Shuffle again to ensure randomness
fold_df = df.sample(frac=1, random_state=42).reset_index(drop=True)