Introduction
Feature engineering is one of the most important aspects of the machine learning pipeline. It is the practice of creating and modifying features, or variables, for the purposes of improving model performance. Well-designed features can transform weak models into strong ones, and it is through feature engineering that models can become both more robust and accurate. Feature engineering acts as the bridge between the dataset and the model, giving the model everything it needs to effectively solve a problem.
This is a guide intended for new data scientists, data engineers, and machine learning practitioners. The objective of this article is to communicate fundamental feature engineering concepts and provide a toolbox of techniques that can be applied to real-world scenarios. My aim is that, by the end of this article, you will have enough working knowledge of feature engineering to apply it to your own datasets and be fully equipped to begin creating powerful machine learning models.
Understanding Features
Features are measurable characteristics of any phenomenon that we are observing. They are the granular elements that make up the data that models operate on to make predictions. Examples of features can include things like age, income, a timestamp, longitude, value, and almost anything else one can think of that can be measured or represented in some form.
There are different feature types, the main ones being:
- Numerical Features: Continuous or discrete numeric types (e.g. age, salary)
- Categorical Features: Qualitative values representing categories (e.g. gender, shoe size)
- Text Features: Words or strings of words (e.g. “this” or “that” or “even this”)
- Time Series Features: Data that is ordered by time (e.g. stock prices)
Features are crucial in machine learning because they directly influence a model’s ability to make predictions. Well-constructed features improve model performance, while bad features make it harder for a model to produce strong predictions. Feature selection and feature engineering are preprocessing steps in the machine learning process that are used to prepare the data for use by learning algorithms.
A distinction is made between feature selection and feature engineering, though both are crucial in their own right:
- Feature Selection: The selection of the most important features from the entire set of available features, thus reducing dimensionality and promoting model performance
- Feature Engineering: The creation of new features and subsequent changing of existing ones, all in the aid of making a model perform better
By keeping only the most important features, feature selection leaves behind the signal in the data, while feature engineering creates new features that help model the outcome better.
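To make the distinction concrete, here is a minimal feature selection sketch using scikit-learn's SelectKBest; the toy dataset, column names, and choice of k are assumptions for illustration only.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
# Hypothetical dataset with three candidate features and a binary target
X = pd.DataFrame({'feature_a': [1, 2, 3, 4, 5], 'feature_b': [5, 3, 4, 1, 2], 'feature_c': [2, 2, 3, 3, 4]})
y = [0, 0, 1, 1, 1]
# Keep the two features with the strongest univariate relationship to the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])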
Basic Techniques in Feature Engineering
While there is a wide range of basic feature engineering techniques at our disposal, we will walk through some of the most important and widely used of these.
Handling Missing Values
It is common for datasets to contain missing information. This can be detrimental to a model’s performance, which is why it is important to implement strategies for dealing with missing data. There are a handful of common methods for rectifying this issue:
- Mean/Median Imputation: Filling missing values in a column with the mean or median of that column
- Mode Imputation: Filling missing values in a column with the most common value in that column
- Interpolation: Estimating missing values from the surrounding data points
These fill-in methods should be applied based on the nature of the data and the potential effect that the method might have on the end model.
Dealing with missing information is crucial to keeping the integrity of the dataset intact. Here is an example Python code snippet that demonstrates various data filling methods using the pandas library.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Sample DataFrame
data = {'age': [25, 30, np.nan, 35, 40], 'salary': [50000, 60000, 55000, np.nan, 65000]}
df = pd.DataFrame(data)
# Fill in missing ages using the mean
mean_imputer = SimpleImputer(strategy='mean')
df['age'] = mean_imputer.fit_transform(df[['age']])
# Fill in the missing salaries using the median
median_imputer = SimpleImputer(strategy='median')
df['salary'] = median_imputer.fit_transform(df[['salary']])
print(df)
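The interpolation approach mentioned above is not shown in the snippet; here is a minimal sketch using pandas on a made-up series, assuming linear interpolation suits the data.
import pandas as pd
import numpy as np
# Sample Series with a gap in the middle
values = pd.Series([100, 110, np.nan, 130, 140])
# Linear interpolation estimates the gap from the neighboring points
print(values.interpolate(method='linear'))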
Encoding of Categorical Variables
Since most machine learning algorithms are best (or only) equipped to deal with numeric data, categorical variables must often be mapped to numerical values in order for said algorithms to interpret them. The most common encoding schemes are the following:
- One-Hot Encoding: Producing separate columns for each category
- Label Encoding: Assigning an integer to each category
- Target Encoding: Encoding categories by their individual outcome variable averages
Encoding categorical data is necessary for many machine learning models to make sense of it. The right encoding method is something you will select based on the specific situation, including both the algorithm in use and the dataset.
Below is an example Python script for the encoding of categorical features using pandas and elements of scikit-learn.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Sample DataFrame
data = {'color': ['red', 'blue', 'green', 'blue', 'red']}
df = pd.DataFrame(data)
# Implementing one-hot encoding
one_hot_encoder = OneHotEncoder()
one_hot_encoding = one_hot_encoder.fit_transform(df[['color']]).toarray()
df_one_hot = pd.DataFrame(one_hot_encoding, columns=one_hot_encoder.get_feature_names_out(['color']))
# Implementing label encoding
label_encoder = LabelEncoder()
df['color_label'] = label_encoder.fit_transform(df['color'])
print(df)
print(df_one_hot)
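Target encoding, the third scheme listed above, is not covered by the snippet; below is a minimal sketch using a plain pandas group-by, with a hypothetical binary target column named purchased added for illustration.
import pandas as pd
# Sample DataFrame with a categorical feature and a hypothetical binary target
data = {'color': ['red', 'blue', 'green', 'blue', 'red'], 'purchased': [1, 0, 1, 1, 0]}
df = pd.DataFrame(data)
# Replace each category with the mean of the target for that category
target_means = df.groupby('color')['purchased'].mean()
df['color_target_encoded'] = df['color'].map(target_means)
print(df)
In practice, this kind of encoding is usually computed on the training split only, or with cross-validation, to avoid leaking the target into the features.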
Scaling and Normalizing Data
For many machine learning methods to perform well, scaling and normalization need to be performed on your data. There are several methods for scaling and normalizing data, such as:
- Standardization: Transforming data so that it has a mean of 0 and a standard deviation of 1
- Min-Max Scaling: Scaling data to a fixed range, such as [0, 1]
- Robust Scaling: Scaling data using the median and interquartile range, which reduces the influence of outliers
The scaling and normalization of data is crucial for ensuring that feature contributions are equitable, allowing features with varying ranges of values to contribute to a model commensurately.
Below is an implementation, using scikit-learn, that shows how to scale and normalize data.
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Sample DataFrame
data = {'age': [25, 30, 35, 40, 45], 'salary': [50000, 60000, 55000, 65000, 70000]}
df = pd.DataFrame(data)
# Standardize data
scaler_standard = StandardScaler()
df['age_standard'] = scaler_standard.fit_transform(df[['age']])
# Min-Max Scaling
scaler_minmax = MinMaxScaler()
df['salary_minmax'] = scaler_minmax.fit_transform(df[['salary']])
# Robust Scaling
scaler_robust = RobustScaler()
df['salary_robust'] = scaler_robust.fit_transform(df[['salary']])
print(df)
The basic techniques above, along with the corresponding example code, provide pragmatic solutions for handling missing data, encoding categorical variables, and scaling and normalizing data using the powerhouse Python tools pandas and scikit-learn. These techniques can be integrated into your own feature engineering process to improve your machine learning models.
Advanced Techniques in Feature Engineering
We now turn our attention to more advanced feature engineering techniques, and include some sample Python code for implementing these concepts.
Feature Creation
With feature creation, new features are generated or modified to fashion a model with better performance. Some techniques for creating new features include:
- Polynomial Features: Creation of higher-order features from existing features to capture more complex relationships
- Interaction Terms: Features generated by combining several features to capture interactions between them
- Domain-Specific Feature Generation: Features designed using domain knowledge of the problem at hand
Creating new, meaningful features can greatly help to boost model performance. The next script showcases how feature engineering can be used to bring latent relationships in the data to light.
import pandas as pd
import numpy as np
# Sample DataFrame
data = {'x1': [1, 2, 3, 4, 5], 'x2': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Polynomial Features
df['x1_squared'] = df['x1'] ** 2
df['x1_x2_interaction'] = df['x1'] * df['x2']
print(df)
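The same expansion can be generated automatically; here is a minimal sketch using scikit-learn's PolynomialFeatures, which produces the squared and interaction terms for all input columns in one step.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
# Sample DataFrame
data = {'x1': [1, 2, 3, 4, 5], 'x2': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Degree-2 expansion: x1, x2, x1^2, x1*x2, x2^2 (bias column omitted)
poly = PolynomialFeatures(degree=2, include_bias=False)
df_poly = pd.DataFrame(poly.fit_transform(df), columns=poly.get_feature_names_out(['x1', 'x2']))
print(df_poly)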
Dimensionality Reduction
In order to simplify models and increase their performance, it can be useful to reduce the number of model features. Dimensionality reduction techniques that can help achieve this goal include:
- PCA (Principal Component Analysis): Transformation of predictors into a new, smaller set of uncorrelated features (principal components) that capture most of the variance in the data
- t-SNE (t-Distributed Stochastic Neighbor Embedding): Dimension reduction that is used for visualization purposes
- LDA (Linear Discriminant Analysis): Finding new combinations of model features that are effective for separating different classes
Dimensionality reduction techniques help shrink the size of your dataset while maintaining its relevant information. They were devised to tackle issues associated with high-dimensional data, such as overfitting and computational demand.
A demonstration of dimensionality reduction with PCA, implemented with scikit-learn, is shown next.
import pandas as pd
from sklearn.decomposition import PCA
# Sample DataFrame
data = {'feature1': [2.5, 0.5, 2.2, 1.9, 3.1], 'feature2': [2.4, 0.7, 2.9, 2.2, 3.0]}
df = pd.DataFrame(data)
# Use PCA for Dimensionality Reduction
pca = PCA(n_components=1)
df_pca = pca.fit_transform(df)
df_pca = pd.DataFrame(df_pca, columns=['principal_component'])
print(df_pca)
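PCA is unsupervised; when class labels are available, LDA can be used instead. Below is a minimal sketch with scikit-learn, using the same sort of toy data plus hypothetical class labels.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Sample DataFrame with hypothetical class labels
data = {'feature1': [2.5, 0.5, 2.2, 1.9, 3.1, 2.3], 'feature2': [2.4, 0.7, 2.9, 2.2, 3.0, 2.7]}
labels = [1, 0, 1, 0, 1, 1]
df = pd.DataFrame(data)
# LDA finds the combination of features that best separates the classes
lda = LinearDiscriminantAnalysis(n_components=1)
df_lda = pd.DataFrame(lda.fit_transform(df, labels), columns=['discriminant_1'])
print(df_lda)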
Time Series Feature Engineering
With time-based datasets, specific feature engineering techniques must be used, such as:
- Lag Features: Previous values in the series are used as features for predicting the current value
- Rolling Statistics: Statistics are calculated across sliding windows of the data, such as rolling means
- Seasonal Decomposition: The series is partitioned into trend, seasonal, and residual components
Time series data calls for feature engineering that goes beyond fitting a model directly on the raw values. These methods capture temporal dependence and patterns to make the predictive model sharper.
A demonstration of time series feature engineering using pandas is shown next.
import pandas as pd
import numpy as np
# Sample DataFrame
date_rng = pd.date_range(start="1/1/2022", end='1/10/2022', freq='D')
data = {'date': date_rng, 'value': [100, 110, 105, 115, 120, 125, 130, 135, 140, 145]}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)
# Lag Features
df['value_lag1'] = df['value'].shift(1)
# Rolling Statistics
df['value_rolling_mean'] = df['value'].rolling(window=3).mean()
print(df)
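Seasonal decomposition, the third technique listed above, is typically done with statsmodels rather than pandas; here is a minimal sketch on a made-up daily series, assuming a weekly (7-day) seasonal period.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
# Made-up daily series: an upward trend plus a repeating weekly pattern
date_rng = pd.date_range(start='1/1/2022', periods=28, freq='D')
values = np.arange(28) + np.tile([0, 2, 4, 1, 3, 5, 2], 4)
ts = pd.Series(values, index=date_rng)
# Split the series into trend, seasonal, and residual components
result = seasonal_decompose(ts, model='additive', period=7)
print(result.trend)
print(result.seasonal)
print(result.resid)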
The above examples demonstrate practical applications of advanced feature engineering techniques using pandas and scikit-learn. By employing these methods, you can enhance the predictive power of your model.
Practical Tips and Best Practices
Here are a few simple but important tips to keep in mind while working through your feature engineering process.
- Iteration: Feature engineering is a trial-and-error process, and you will get better at it with each iteration. Test different feature engineering ideas to find the best set of features.
- Domain Knowledge: Utilize expertise from those who know the subject matter well when creating features. Sometimes subtle relationships can be captured with domain-specific knowledge.
- Validation and Understanding of Features: By understanding which features are most important to your model, you are equipped to make better decisions. Tools for determining feature importance include the following (a brief SHAP sketch follows the list):
- SHAP (SHapley Additive exPlanations): Helping to quantify the contribution of each feature in predictions
- LIME (Local Interpretable Model-agnostic Explanations): Explaining individual predictions by approximating the model locally with a simpler, interpretable one
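As a concrete illustration of the first of these tools, below is a minimal SHAP sketch on a small random forest; the dataset and feature names are made up for demonstration, and the shap package must be installed separately.
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier
# Hypothetical dataset: two features and a binary target
X = pd.DataFrame({'age': [25, 32, 47, 51, 38, 29], 'income': [40000, 52000, 80000, 95000, 61000, 43000]})
y = [0, 0, 1, 1, 1, 0]
model = RandomForestClassifier(random_state=0).fit(X, y)
# TreeExplainer attributes each prediction to the contributing features
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(shap_values)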
An optimal mix of complexity and interpretability is necessary for producing results that are both good and simple to digest.
Conclusion
This short guide has addressed fundamental feature engineering concepts, as well as basic and advanced techniques, and practical recommendations and best practices. What many would consider some of the most important feature engineering practices — dealing with missing information, encoding of categorical data, scaling data, and creation of new features — were covered.
Feature engineering is a practice that becomes better with execution, and I hope you have been able to take something away with you that may improve your data science skills. I encourage you to apply these techniques to your own work and to learn from your experiences.
Remember that, while the exact percentage varies depending on who you ask, a majority of any machine learning project is spent in the data preparation and preprocessing phase. Feature engineering is a part of this lengthy phase, and as such should be viewed with the import that it demands. Learning to see feature engineering for what it is, a helping hand in the modeling process, should make it more digestible to newcomers.
Happy engineering!
Matthew Mayo (@mattmayo13) holds a Master’s degree in computer science and a graduate diploma in data mining. As Managing Editor, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.