DATA PREPROCESSING
Let’s talk about something that every data scientist, analyst, or curious number-cruncher has to deal with sooner or later: missing values. Now, I know what you’re thinking — “Oh great, another missing value guide.” But hear me out. I’m going to show you how to tackle this problem using not one, not two, but six different imputation methods, all on a single dataset (with helpful visuals as well!). By the end of this, you’ll see why domain knowledge is worth its weight in gold (something even our AI friends might struggle to replicate).
Before we get into our dataset and imputation methods, let’s take a moment to understand what missing values are and why they’re such a common headache in data science.
What Are Missing Values?
Missing values, often represented as NaN (Not a Number) in pandas or NULL in databases, are essentially holes in your dataset. They’re the empty cells in your spreadsheet, the blanks in your survey responses, the data points that got away. In the world of data, not all absences are created equal, and understanding the nature of your missing values is crucial for deciding how to handle them.
Why Do Missing Values Occur?
Missing values can sneak into your data for a variety of reasons. Here are some common reasons:
- Data Entry Errors: Sometimes, it’s just human error. Someone might forget to input a value or accidentally delete one.
- Sensor Malfunctions: In IoT or scientific experiments, a faulty sensor might fail to record data at certain times.
- Survey Non-Response: In surveys, respondents might skip questions they’re uncomfortable answering or don’t understand.
- Merged Datasets: When combining data from multiple sources, some entries might not have corresponding values in all datasets.
- Data Corruption: During data transfer or storage, some values might get corrupted and become unreadable.
- Intentional Omissions: Some data might be intentionally left out due to privacy concerns or irrelevance.
- Sampling Issues: The data collection method might systematically miss certain types of data.
- Time-Sensitive Data: In time series data, values might be missing for periods when data wasn’t collected (e.g., weekends, holidays).
Types of Missing Data
Understanding the type of missing data you’re dealing with can help you choose the most appropriate imputation method. Statisticians generally categorize missing data into three types:
- Missing Completely at Random (MCAR): The missingness is totally random and doesn’t depend on any other variable. For example, if a lab sample was accidentally dropped.
- Missing at Random (MAR): The probability of missing data depends on other observed variables but not on the missing data itself. For example, men might be less likely to answer questions about emotions in a survey.
- Missing Not at Random (MNAR): The missingness depends on the value of the missing data itself. For example, people with high incomes might be less likely to report their income in a survey.
Why Care About Missing Values?
Missing values can significantly impact your analysis:
- They can introduce bias if not handled properly.
- Many machine learning algorithms can’t handle missing values out of the box.
- They can lead to loss of important information if instances with missing values are simply discarded.
- Improperly handled missing values can lead to incorrect conclusions or predictions.
That’s why it’s crucial to have a solid strategy for dealing with missing values. And that’s exactly what we’re going to explore in this article!
First things first, let’s introduce our dataset. We’ll be working with a golf course dataset that tracks various factors affecting the crowdedness of the course. This dataset has a bit of everything — numerical data, categorical data, and yes, plenty of missing values.
import pandas as pd
import numpy as np# Create the dataset as a dictionary
data = {
'Date': ['08-01', '08-02', '08-03', '08-04', '08-05', '08-06', '08-07', '08-08', '08-09', '08-10',
'08-11', '08-12', '08-13', '08-14', '08-15', '08-16', '08-17', '08-18', '08-19', '08-20'],
'Weekday': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5],
'Holiday': [0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
'Temp': [25.1, 26.4, np.nan, 24.1, 24.7, 26.5, 27.6, 28.2, 27.1, 26.7, np.nan, 24.3, 23.1, 22.4, np.nan, 26.5, 28.6, np.nan, 27.0, 26.9],
'Humidity': [99.0, np.nan, 96.0, 68.0, 98.0, 98.0, 78.0, np.nan, 70.0, 75.0, np.nan, 77.0, 77.0, 89.0, 80.0, 88.0, 76.0, np.nan, 73.0, 73.0],
'Wind': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, np.nan, 1.0, 0.0],
'Outlook': ['rainy', 'sunny', 'rainy', 'overcast', 'rainy', np.nan, 'rainy', 'rainy', 'overcast', 'sunny', np.nan, 'overcast', 'sunny', 'rainy', 'sunny', 'rainy', np.nan, 'rainy', 'overcast', 'sunny'],
'Crowdedness': [0.14, np.nan, 0.21, 0.68, 0.20, 0.32, 0.72, 0.61, np.nan, 0.54, np.nan, 0.67, 0.66, 0.38, 0.46, np.nan, 0.52, np.nan, 0.62, 0.81]
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Display basic information about the dataset
print(df.info())
# Display the first few rows of the dataset
print(df.head())
# Display the count of missing values in each column
print(df.isnull().sum())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 20 non-null object
1 Weekday 20 non-null int64
2 Holiday 19 non-null float64
3 Temp 16 non-null float64
4 Humidity 17 non-null float64
5 Wind 19 non-null float64
6 Outlook 17 non-null object
7 Crowdedness 15 non-null float64
dtypes: float64(5), int64(1), object(2)
memory usage: 1.3+ KBDate Weekday Holiday Temp Humidity Wind Outlook Crowdedness
0 08-01 0 0.0 25.1 99.0 0.0 rainy 0.14
1 08-02 1 0.0 26.4 NaN 0.0 sunny NaN
2 08-03 2 0.0 NaN 96.0 0.0 rainy 0.21
3 08-04 3 0.0 24.1 68.0 0.0 overcast 0.68
4 08-05 4 NaN 24.7 98.0 0.0 rainy 0.20
Date 0
Weekday 0
Holiday 1
Temp 4
Humidity 3
Wind 1
Outlook 3
Crowdedness 5
dtype: int64
As we can see, our dataset contains 20 rows and 8 columns:
- Date: The date of the observation
- Weekday: Day of the week (0–6, where 0 is Monday)
- Holiday: Boolean indicating if it’s a holiday (0 or 1)
- Temp: Temperature in Celsius
- Humidity: Humidity percentage
- Wind: Wind condition (0 or 1, possibly indicating calm or windy)
- Outlook: Weather outlook (sunny, overcast, or rainy)
- Crowdedness: Percentage of course occupancy
And look at that! We’ve got missing values in every column except Date and Weekday. Perfect for our imputation party.
Now that we have our dataset loaded, let’s tackle these missing values with six different imputation methods. We’ll use a different strategy for each type of data.
Listwise deletion, also known as complete case analysis, involves removing entire rows that contain any missing values. This method is simple and preserves the distribution of the data, but it can lead to a significant loss of information if many rows contain missing values.
👍 Common Use: Listwise deletion is often used when the number of missing values is small and the data is missing completely at random (MCAR). It’s also useful when you need a complete dataset for certain analyses that can’t handle missing values.
In Our Case: We’re using listwise deletion for rows that have at least 4 missing values. These rows might not provide enough reliable information, and removing them can help us focus on the more complete data points. However, we’re being cautious and only removing rows with significant missing data to preserve as much information as possible.
# Count missing values in each row
missing_count = df.isnull().sum(axis=1)# Keep only rows with less than 4 missing values
df_clean = df[missing_count < 4].copy()
We’ve removed 2 rows that had too many missing values. Now let’s move on to imputing the remaining missing data.
Simple imputation involves replacing missing values with a summary statistic of the observed values. Common approaches include using the mean, median, or mode of the non-missing values in a column.
👍 Common Use: Mean imputation is often used for continuous variables when the data is missing at random and the distribution is roughly symmetric. Mode imputation is typically used for categorical variables.
In Our Case: We’re using mean imputation for Humidity and mode imputation for Holiday. For Humidity, assuming the missing values are random, the mean provides a reasonable estimate of the typical humidity. For Holiday, since it’s a binary variable (holiday or not), the mode gives us the most common state, which is a sensible guess for missing values.
# Mean imputation for Humidity
df_clean['Humidity'] = df_clean['Humidity'].fillna(df_clean['Humidity'].mean())# Mode imputation for Holiday
df_clean['Holiday'] = df_clean['Holiday'].fillna(df_clean['Holiday'].mode()[0])
Linear interpolation estimates missing values by assuming a linear relationship between known data points. It’s particularly useful for time series data or data with a natural ordering.
👍 Common Use: Linear interpolation is often used for time series data, where missing values can be estimated based on the values before and after them. It’s also useful for any data where there’s expected to be a roughly linear relationship between adjacent points.
In Our Case: We’re using linear interpolation for Temperature. Since temperature tends to change gradually over time and our data is ordered by date, linear interpolation can provide reasonable estimates for the missing temperature values based on the temperatures recorded on nearby days.
df_clean['Temp'] = df_clean['Temp'].interpolate(method='linear')
Forward fill (or “last observation carried forward”) propagates the last known value forward to fill gaps, while backward fill does the opposite. This method assumes that the missing value is likely to be similar to the nearest known value.
👍 Common Use: Forward/backward fill is often used for time series data, especially when the value is likely to remain constant until changed (like in financial data) or when the most recent known value is the best guess for the current state.
In Our Case: We’re using a combination of forward and backward fill for Outlook. Weather conditions often persist for several days, so it’s reasonable to assume that a missing Outlook value might be similar to the Outlook of the previous or following day.
df_clean['Outlook'] = df_clean['Outlook'].fillna(method='ffill').fillna(method='bfill')
This method involves replacing all missing values in a variable with a specific constant value. This constant could be chosen based on domain knowledge or a safe default value.
👍 Common Use: Constant value imputation is often used when there’s a logical default value for missing data, or when you want to explicitly flag that a value was missing (by using a value outside the normal range of the data).
In Our Case: We’re using constant value imputation for the Wind column, replacing missing values with -1. This approach explicitly flags imputed values (since -1 is outside the normal 0–1 range for Wind) and it preserves the information that these values were originally missing.
df_clean['Wind'] = df_clean['Wind'].fillna(-1)
K-Nearest Neighbors (KNN) imputation estimates missing values by finding the K most similar samples in the dataset (just like KNN as Classification Algorithm) and using their values to impute the missing data. This method can capture complex relationships between variables.
👍 Common Use: KNN imputation is versatile and can be used for both continuous and categorical variables. It’s particularly useful when there are expected to be complex relationships between variables that simpler methods might miss.
In Our Case: We’re using KNN imputation for Crowdedness. Crowdedness likely depends on a combination of factors (like temperature, holiday status, etc.), and KNN can capture these complex relationships to provide more accurate estimates of missing crowdedness values.
from sklearn.impute import KNNImputer# One-hot encode the 'Outlook' column
outlook_encoded = pd.get_dummies(df_clean['Outlook'], prefix='Outlook')
# Prepare features for KNN imputation
features_for_knn = ['Weekday', 'Holiday', 'Temp', 'Humidity', 'Wind']
knn_features = pd.concat([df_clean[features_for_knn], outlook_encoded], axis=1)
# Apply KNN imputation
knn_imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(knn_imputer.fit_transform(pd.concat([knn_features, df_clean[['Crowdedness']]], axis=1)),
columns=list(knn_features.columns) + ['Crowdedness'])
# Update the original dataframe with the imputed Crowdedness values
df_clean['Crowdedness'] = df_imputed['Crowdedness']
So, there you have it! Six different ways to handle missing values, all applied to our golf course dataset.
Let’s recap how each method tackled our data:
- Listwise Deletion: Helped us focus on more complete data points by removing rows with extensive missing values.
- Simple Imputation: Filled in Humidity with average values and Holiday with the most common occurrence.
- Linear Interpolation: Estimated missing Temperature values based on the trend of surrounding days.
- Forward/Backward Fill: Guessed missing Outlook values from adjacent days, reflecting the persistence of weather patterns.
- Constant Value Imputation: Flagged missing Wind data with -1, preserving the fact that these values were originally unknown.
- KNN Imputation: Estimated Crowdedness based on similar days, capturing complex relationships between variables.
Each method tells a different story about our missing data, and the “right” choice depends on what we know about our golf course operations and what questions we’re trying to answer.
The key takeaway? Don’t just blindly apply imputation methods. Understand your data, consider the context, and choose the method that makes the most sense for your specific situation.