How to Handle Missing Data with Scikit-learn’s Imputer Module


Let’s learn how to use Scikit-learn’s imputer for handling missing data.

Preparation

Make sure NumPy, pandas, and scikit-learn are installed in your environment. If not, you can install them via pip:

pip install numpy pandas scikit-learn

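Then, import the packages we need:

import numpy as np
import pandas as pd
# IterativeImputer is experimental, so this enabling import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer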

Handle Missing Data with Imputer

A scikit-learn imputer is a transformer class that replaces missing values with substitute values, which can streamline your data preprocessing. We will explore several strategies for handling missing data.

Let’s create a small sample dataset for our example:

sample_data = {
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
}
df = pd.DataFrame(sample_data)
print(df)

    First  Second
0    1.0     NaN
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     NaN
7    NaN     8.0
8    9.0     9.0
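
Before imputing, it can help to confirm where the gaps are. A minimal sketch, assuming the same sample data as above, that counts the missing values in each column with pandas:

```python
import numpy as np
import pandas as pd

sample_data = {
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
}
df = pd.DataFrame(sample_data)

# isna() marks NaN cells as True; sum() then counts them per column
missing_counts = df.isna().sum()
print(missing_counts)
```

Here First has one missing value and Second has two.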

You can fill each column’s missing values with that column’s mean using scikit-learn’s SimpleImputer.
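
A minimal sketch of this step, assuming the same sample data as above. The variable names here are illustrative choices, not necessarily the ones the author used:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
})

# strategy='mean' replaces each NaN with the mean of its own column
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns).round(2)

print(df_imputed)
```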

    First  Second
0   1.00    5.29
1   2.00    2.00
2   3.00    3.00
3   4.00    4.00
4   5.00    5.00
5   6.00    6.00
6   7.00    5.29
7   4.62    8.00
8   9.00    9.00

Note that the result is rounded to two decimal places.

It’s also possible to impute the missing data with the median using SimpleImputer.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
df_imputed = round(pd.DataFrame(imputer.fit_transform(df), columns=df.columns),2)

print(df_imputed)

   First  Second
0    1.0     5.0
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     5.0
7    4.5     8.0
8    9.0     9.0

Mean and median imputation are simple, but they can distort the data distribution (for example, by shrinking its variance) and bias the relationships between features.
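
The distortion can be seen directly by comparing the spread of each column before and after imputation. A minimal sketch, assuming the same sample data as above; filling gaps with the column mean adds points at the centre, so the standard deviation goes down:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
})

imputed = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(df), columns=df.columns
)

# pandas skips NaN by default, so std_before describes the observed values only
std_before = df.std()
std_after = imputed.std()

print(pd.DataFrame({'before': std_before, 'after': std_after}))
```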

It is also possible to use a K-NN imputer, which fills in the missing data using a nearest-neighbour approach.

from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2)
knn_imputed_data = knn_imputer.fit_transform(df)
knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=df.columns)

print(knn_imputed_df)

    First  Second
0    1.0     2.5
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     5.5
7    7.5     8.0
8    9.0     9.0

The KNN imputer fills each missing value with the mean of that feature’s values across the k nearest neighbours. By default the mean is unweighted, and it can optionally be weighted by distance.
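
To illustrate how the neighbours are combined, here is a minimal sketch, assuming the same sample data as above, that compares the default weights='uniform' setting with weights='distance'. It looks only at row 0, whose two nearest neighbours by the First column are rows 1 and 2:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
})

# Plain average of the two nearest neighbours
uniform = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(df)
# Closer neighbours count more (weight = 1 / distance)
weighted = KNNImputer(n_neighbors=2, weights='distance').fit_transform(df)

# Column index 1 is 'Second'; row 0 is the first missing entry
print('uniform :', uniform[0, 1])
print('weighted:', weighted[0, 1])
```

With uniform weights the result is the plain average of 2 and 3. With distance weights the closer neighbour (row 1) counts twice as much as row 2, so the imputed value moves towards 2.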

Lastly, there is iterative imputation, which models each feature with missing values as a function of the other features. IterativeImputer is still an experimental feature in scikit-learn, so it must be enabled explicitly by importing enable_iterative_imputer before it is used.

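from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

iterative_imputer = IterativeImputer(max_iter=10, random_state=0)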
iterative_imputed_data = iterative_imputer.fit_transform(df)
iterative_imputed_df = round(pd.DataFrame(iterative_imputed_data, columns=df.columns),2)

print(iterative_imputed_df)

    First  Second
0    1.0     1.0
1    2.0     2.0
2    3.0     3.0
3    4.0     4.0
4    5.0     5.0
5    6.0     6.0
6    7.0     7.0
7    8.0     8.0
8    9.0     9.0
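
To build intuition for "modelling a feature as a function of other features", here is a deliberately simplified sketch. It is not the IterativeImputer algorithm itself, which by default uses a BayesianRidge estimator and cycles over all features for several rounds. Instead, it assumes a single pass with a plain LinearRegression that predicts Second from First:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'First': [1, 2, 3, 4, 5, 6, 7, np.nan, 9],
    'Second': [np.nan, 2, 3, 4, 5, 6, np.nan, 8, 9],
})

# Rows where both columns are observed serve as training data
complete = df.dropna()
reg = LinearRegression().fit(complete[['First']], complete['Second'])

# Predict 'Second' only where it is missing and 'First' is known
mask = df['Second'].isna() & df['First'].notna()
filled = df.copy()
filled.loc[mask, 'Second'] = reg.predict(df.loc[mask, ['First']])

print(filled)
```

Because Second equals First in every complete row, the regression recovers that relationship and fills rows 0 and 6 accordingly. The missing First value in row 7 is left untouched here; the real imputer would handle it by modelling First from Second in the same way.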

Choosing an imputation strategy that suits your data can improve the quality of your data science project.
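
One common way to use an imputer inside a modelling workflow is to place it in a scikit-learn Pipeline, so that the fill values are learned from the training data and then reused on new data. A minimal sketch, where the toy arrays X_train, y_train, and X_new are hypothetical and exist only for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: one feature with a gap, and a target equal to 2 * x
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X_new = np.array([[np.nan], [3.0]])

# The imputer is fitted on X_train only, then applied unchanged to X_new
model = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
model.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
learned_mean = model.named_steps['simpleimputer'].statistics_[0]
preds = model.predict(X_new)

print('fill value learned from training data:', learned_mean)
print('predictions:', preds)
```

The missing entry in X_new is filled with the mean learned from X_train, so both new rows receive the same prediction.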


Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
