For data professionals, Pandas is a go-to package for any data manipulation activity because it is intuitive and easy to use. That's why many data science curricula include Pandas.
Pandas is built on top of the NumPy package, especially the NumPy array. Many NumPy functions and methods still work well with Pandas objects, so we can use NumPy to effectively improve our data analysis with Pandas.
This article will explore several examples of how NumPy can help our Pandas data analysis experience.
Let’s get into it.
Pandas Data Analysis Improvement with NumPy
Before proceeding with the tutorial, we should have all the required packages installed. If you haven't done so, you can install Pandas and NumPy with the following command.
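pip install pandas numpy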
We can start by explaining how Pandas and NumPy are connected. As mentioned above, Pandas is built on the NumPy package. Let’s see how they could complement each other to improve our data analysis.
First, let’s try to create a NumPy array and Pandas DataFrame with the respective packages.
import numpy as np
import pandas as pd
np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pandas_df = pd.DataFrame(np_array, columns=['A', 'B', 'C'])
print(np_array)
print(pandas_df)
Output>>
[[1 2 3]
[4 5 6]
[7 8 9]]
A B C
0 1 2 3
1 4 5 6
2 7 8 9
As you can see in the code above, we can create a Pandas DataFrame from a NumPy array, and the DataFrame keeps the same dimensional structure.
Next, we can use NumPy in the Pandas data processing and cleaning steps. For example, we can use the NumPy NaN object as the missing data placeholder.
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 3, 2],
    'C': [1, 2, 3, np.nan, 5]
})
print(df)
Output>>
A B C
0 1.0 5.0 1.0
1 2.0 NaN 2.0
2 NaN NaN 3.0
3 4.0 3.0 NaN
4 5.0 2.0 5.0
As you can see in the result above, the NumPy NaN object is what Pandas uses to represent missing data.
The following code examines the number of NaN objects in each Pandas DataFrame column.
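# Count the missing values in each column
print(df.isna().sum())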
Output>>
A 1
B 2
C 1
dtype: int64
The data source may also represent missing values in a DataFrame column as strings. If that happens, we can replace the string value with the NumPy NaN object.
df['A'] = df['A'].replace('missing data', np.nan)
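Here is a minimal sketch of that idea, using a hypothetical column where the string 'missing data' stands in for missing values (the column name and values are illustrative only).
# Hypothetical example: a column collected with the string 'missing data' as a placeholder
raw_df = pd.DataFrame({'A': [1, 'missing data', 3, 'missing data', 5]})

# Replace the placeholder with NumPy's NaN and convert the column back to a numeric type
raw_df['A'] = raw_df['A'].replace('missing data', np.nan)
raw_df['A'] = pd.to_numeric(raw_df['A'])
print(raw_df)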
NumPy can also be used for outlier detection. Let's see how we can do that.
df = pd.DataFrame({
    'A': np.random.normal(0, 1, 1000),
    'B': np.random.normal(0, 1, 1000)
})
df.loc[10, 'A'] = 100
df.loc[25, 'B'] = -100
def detect_outliers(data, threshold=3):
    z_scores = np.abs((data - data.mean()) / data.std())
    return z_scores > threshold

outliers = detect_outliers(df)
print(df[outliers.any(axis=1)])
Output>>
A B
10 100.000000 0.355967
25 0.239933 -100.000000
In the code above, we generate random numbers with NumPy and then create a function that flags outliers using the Z-score with a three-sigma threshold. The result is a DataFrame containing the rows that hold outliers.
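As a follow-up, one way to keep only the non-outlier rows is to invert the mask; this is a quick sketch rather than part of the original example.
# Keep only the rows where no column is flagged as an outlier
df_clean = df[~outliers.any(axis=1)]
print(df_clean.shape)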
We can also perform statistical analysis with Pandas, and NumPy can make the analysis more efficient during the aggregation process. For example, here is a statistical aggregation with Pandas and NumPy.
df = pd.DataFrame({
    'Category': [np.random.choice(['A', 'B']) for i in range(100)],
    'Values': np.random.rand(100)
})
print(df.groupby('Category')['Values'].agg([np.mean, np.std, np.min, np.max]))
Output>>
mean std amin amax
Category
A 0.524568 0.288471 0.025635 0.999284
B 0.525937 0.300526 0.019443 0.999090
Using NumPy, we can apply statistical functions to the Pandas DataFrame and acquire aggregate statistics like the output above.
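We can also wrap NumPy functions in custom aggregations. As a rough sketch (the 90th percentile here is an arbitrary choice), we could compute a percentile per group with np.percentile.
# A sketch: aggregate each group with a NumPy percentile wrapped in a lambda
print(df.groupby('Category')['Values'].agg(lambda x: np.percentile(x, 90)))
Depending on your Pandas version, passing NumPy functions directly to agg may emit a deprecation warning; string names such as 'mean' work as an alternative.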
Lastly, we will talk about vectorized operations with Pandas and NumPy. Vectorized operations perform computations on entire arrays at once rather than looping over the elements individually, which makes them faster and more memory-efficient.
For example, we can perform element-wise addition operations between DataFrame columns using NumPy.
data = {'A': [15, 20, 25, 30, 35], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
df['C'] = np.add(df['A'], df['B'])
print(df)
Output>>
A B C
0 15 10 25
1 20 20 40
2 25 30 55
3 30 40 70
4 35 50 85
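To get a feel for the speed difference mentioned above, here is a rough timing sketch comparing a plain Python loop with the vectorized np.add call; the array size and exact timings are illustrative and will vary by machine.
import time

big_a = np.random.rand(1_000_000)
big_b = np.random.rand(1_000_000)

# Element-wise addition with a Python loop
start = time.perf_counter()
loop_result = [a + b for a, b in zip(big_a, big_b)]
loop_time = time.perf_counter() - start

# Vectorized element-wise addition with NumPy
start = time.perf_counter()
vector_result = np.add(big_a, big_b)
vector_time = time.perf_counter() - start

print(f"Loop: {loop_time:.4f}s, vectorized: {vector_time:.4f}s")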
We can also transform a DataFrame column with NumPy's mathematical functions.
df['B_exp'] = np.exp(df['B'])
print(df)
Output>>
A B C B_exp
0 15 10 25 2.202647e+04
1 20 20 40 4.851652e+08
2 25 30 55 1.068647e+13
3 30 40 70 2.353853e+17
4 35 50 85 5.184706e+21
We can also perform conditional replacement on a Pandas DataFrame with NumPy's where function.
df['A_replaced'] = np.where(df['A'] > 20, df['B'] * 2, df['B'] / 2)
print(df)
Output>>
A B C B_exp A_replaced
0 15 10 25 2.202647e+04 5.0
1 20 20 40 4.851652e+08 10.0
2 25 30 55 1.068647e+13 60.0
3 30 40 70 2.353853e+17 80.0
4 35 50 85 5.184706e+21 100.0
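If we need more than two outcomes, NumPy's select function handles several conditions at once. This is a small sketch that extends the example above (the thresholds and the new A_multi column are illustrative only).
# A sketch: multi-condition replacement with np.select
conditions = [df['A'] > 30, df['A'] > 20]
choices = [df['B'] * 3, df['B'] * 2]
df['A_multi'] = np.select(conditions, choices, default=df['B'] / 2)
print(df)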
Those are all the examples we have explored. These NumPy functions will undoubtedly help improve your data analysis process.
Conclusion
This article discusses how NumPy can help make data analysis with Pandas more efficient. We performed data preprocessing, data cleaning, statistical analysis, and vectorized operations with Pandas and NumPy.
I hope it helps!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.