Ask any experienced data scientist or machine learning engineer what takes up the most time in their job, and I’d guess many of them will say data preprocessing: the step that cleans up the data and prepares it for subsequent analysis. The reason is simple: garbage in, garbage out. That is, if you don’t prepare the data correctly, your “insights” from the data can hardly be meaningful.
Although data preprocessing can be rather tedious, pandas provides the essential functions we need to complete our data clean-up job relatively easily. However, because the library is so versatile, not every user knows all the functionality it has to offer. In this article, I’d like to share 3 lesser-known, yet super useful, functions that you can try in your data science projects.
Without further ado, let’s dive in.
Note: To provide context, suppose that you’re responsible for data management and analysis at a clothing store. The examples shown below are based on this assumption.
The first function that I want to mention is explode. This function is useful when you deal with a column that contains lists. When you call explode on such a column, each element of a list is extracted into its own row, so one original row becomes multiple rows.
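To make the behavior concrete before the store example, here is a minimal sketch on a toy Series. The data and the names sizes, exploded, and renumbered are hypothetical and not part of the original example; the ignore_index parameter assumes pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical toy data (not from the article): each element is a list of sizes.
sizes = pd.Series([["S", "M"], ["L"], ["M", "L", "XL"]], name="size")

# explode() emits one row per list element and repeats the original index label,
# so you can still tell which original row each value came from.
exploded = sizes.explode()
print(exploded)

# ignore_index=True (assumes pandas >= 1.1) renumbers the result from 0 instead.
renumbered = sizes.explode(ignore_index=True)
print(renumbered)
```

Keeping the repeated index labels is handy when you later need to join the exploded values back to the original rows; renumbering is simpler when you only care about the flattened values.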
Here’s a simple code example to show you how to use the explode function. Suppose that you have a data frame that stores order information. In this table, one column (the order column) contains lists of items, as shown below:
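import pandas as pd

# Each customer's order is stored as a list of items.
order_data = {
    'customer': ['John', 'Zoe', 'Mike'],
    'order': [['Shoes', 'Pants', 'Caps'], ['Jackets', 'Shorts'], ['Ties', 'Hoodies']]
}
order_df = pd.DataFrame(order_data)
order_df  # in a notebook, this displays the data frame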
The needed operation is to split each item of the list into a separate row for further data processing. Without using explode, a naive solution may be the following. We simply iterate the original rows…