For data professionals, Pandas is a go-to package for any data manipulation activity because it is intuitive and easy to use. That's why many data science curricula include Pandas.
Pandas is built on top of the NumPy package, especially the NumPy array. Many NumPy functions and methods still work well with Pandas objects, so we can use NumPy to effectively improve our data analysis with Pandas.
This article will explore several examples of how NumPy can help our Pandas data analysis experience.
Let’s get into it.
Pandas Data Analysis Improvement with NumPy
Before proceeding with the tutorial, we should have all the required packages installed. If you haven't done so, you can install Pandas and NumPy with the following command.
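pip install pandas numpy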
We can start by explaining how Pandas and NumPy are connected. As mentioned above, Pandas is built on the NumPy package. Let’s see how they could complement each other to improve our data analysis.
First, let’s try to create a NumPy array and Pandas DataFrame with the respective packages.
import numpy as np
import pandas as pd
np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pandas_df = pd.DataFrame(np_array, columns=['A', 'B', 'C'])
print(np_array)
print(pandas_df)
Output>>
[[1 2 3]
[4 5 6]
[7 8 9]]
A B C
0 1 2 3
1 4 5 6
2 7 8 9
As you can see in the code above, we can create a Pandas DataFrame from a NumPy array, and the DataFrame keeps the same dimensional structure.
Next, we can use NumPy in the Pandas data processing and cleaning steps. For example, we can use the NumPy NaN object as the missing data placeholder.
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 3, 2],
    'C': [1, 2, 3, np.nan, 5]
})
print(df)
Output>>
A B C
0 1.0 5.0 1.0
1 2.0 NaN 2.0
2 NaN NaN 3.0
3 4.0 3.0 NaN
4 5.0 2.0 5.0
As you can see in the result above, the NumPy NaN object is what Pandas uses to represent missing data.
The following code examines the number of NaN objects in each Pandas DataFrame column.
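# Count the missing values in each column
print(df.isna().sum())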
Output>>
A 1
B 2
C 1
dtype: int64
The data source may also represent missing values in a DataFrame column as strings. If that happens, we can replace the string value with the NumPy NaN object.
df['A'] = df['A'].replace('missing data', np.nan)
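Here is a minimal sketch of that idea, using a hypothetical column where the string 'missing data' stands in for missing values (the column name and values are illustrative only).
# Hypothetical example: a column collected with the string 'missing data' as a placeholder
raw_df = pd.DataFrame({'A': [1, 'missing data', 3, 'missing data', 5]})

# Replace the placeholder with NumPy's NaN and convert the column back to a numeric type
raw_df['A'] = raw_df['A'].replace('missing data', np.nan)
raw_df['A'] = pd.to_numeric(raw_df['A'])
print(raw_df)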
NumPy can also be used for outlier detection. Let's see how we can do that.
df = pd.DataFrame({
    'A': np.random.normal(0, 1, 1000),
    'B': np.random.normal(0, 1, 1000)
})
df.loc[10, 'A'] = 100
df.loc[25, 'B'] = -100
def detect_outliers(data, threshold=3):
    z_scores = np.abs((data - data.mean()) / data.std())
    return z_scores > threshold

outliers = detect_outliers(df)
print(df[outliers.any(axis=1)])
Output>>
A B
10 100.000000 0.355967
25 0.239933 -100.000000
In the code above, we generate random numbers with NumPy and then create a function that flags outliers using the Z-score with a three-sigma threshold. The result is a DataFrame containing the rows that hold outliers.
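As a follow-up, one way to keep only the non-outlier rows is to invert the mask; this is a quick sketch rather than part of the original example.
# Keep only the rows where no column is flagged as an outlier
df_clean = df[~outliers.any(axis=1)]
print(df_clean.shape)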
We can also perform statistical analysis with Pandas, and NumPy can make the analysis more efficient during the aggregation process. For example, here is a statistical aggregation with Pandas and NumPy.
df = pd.DataFrame({
    'Category': [np.random.choice(['A', 'B']) for i in range(100)],
    'Values': np.random.rand(100)
})
print(df.groupby('Category')['Values'].agg([np.mean, np.std, np.min, np.max]))
Output>>
mean std amin amax
Category
A 0.524568 0.288471 0.025635 0.999284
B 0.525937 0.300526 0.019443 0.999090
Using NumPy, we can apply statistical functions to the Pandas DataFrame and acquire aggregate statistics like the output above.
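We can also wrap NumPy functions in custom aggregations. As a rough sketch (the 90th percentile here is an arbitrary choice), we could compute a percentile per group with np.percentile.
# A sketch: aggregate each group with a NumPy percentile wrapped in a lambda
print(df.groupby('Category')['Values'].agg(lambda x: np.percentile(x, 90)))
Depending on your Pandas version, passing NumPy functions directly to agg may emit a deprecation warning; string names such as 'mean' work as an alternative.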
Lastly, we will talk about vectorized operations with Pandas and NumPy. Vectorized operations perform computations on entire arrays at once rather than looping over the elements individually, which makes them faster and more memory-efficient.
For example, we can perform element-wise addition operations between DataFrame columns using NumPy.
data = {'A': [15, 20, 25, 30, 35], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df['C'] = np.add(df['A'], df['B'])
print(df)
Output>>
A B C
0 15 10 25
1 20 20 40
2 25 30 55
3 30 40 70
4 35 50 85
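To get a feel for the speed difference mentioned above, here is a rough timing sketch comparing a plain Python loop with the vectorized np.add call; the array size and exact timings are illustrative and will vary by machine.
import time

big_a = np.random.rand(1_000_000)
big_b = np.random.rand(1_000_000)

# Element-wise addition with a Python loop
start = time.perf_counter()
loop_result = [a + b for a, b in zip(big_a, big_b)]
loop_time = time.perf_counter() - start

# Vectorized element-wise addition with NumPy
start = time.perf_counter()
vector_result = np.add(big_a, big_b)
vector_time = time.perf_counter() - start

print(f"Loop: {loop_time:.4f}s, vectorized: {vector_time:.4f}s")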
We can also transform a DataFrame column with NumPy's mathematical functions.
df['B_exp'] = np.exp(df['B'])
print(df)
Output>>
A B C B_exp
0 15 10 25 2.202647e+04
1 20 20 40 4.851652e+08
2 25 30 55 1.068647e+13
3 30 40 70 2.353853e+17
4 35 50 85 5.184706e+21
We can also perform conditional replacement on a Pandas DataFrame with NumPy's where function.
df['A_replaced'] = np.where(df['A'] > 20, df['B'] * 2, df['B'] / 2)
print(df)
Output>>
A B C B_exp A_replaced
0 15 10 25 2.202647e+04 5.0
1 20 20 40 4.851652e+08 10.0
2 25 30 55 1.068647e+13 60.0
3 30 40 70 2.353853e+17 80.0
4 35 50 85 5.184706e+21 100.0
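If we need more than two outcomes, NumPy's select function handles several conditions at once. This is a small sketch that extends the example above (the thresholds and the new A_multi column are illustrative only).
# A sketch: multi-condition replacement with np.select
conditions = [df['A'] > 30, df['A'] > 20]
choices = [df['B'] * 3, df['B'] * 2]
df['A_multi'] = np.select(conditions, choices, default=df['B'] / 2)
print(df)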
Those are all the examples we have explored. These NumPy functions will undoubtedly help improve your data analysis process.
Conclusion
This article discusses how NumPy can help make data analysis with Pandas more efficient. We performed data preprocessing, data cleaning, statistical analysis, and vectorized operations with Pandas and NumPy.
I hope it helps!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.