CountVectorizer to Extract Features from Texts in Python, in Detail | by Rashida Nasrin Sucky | Oct, 2023

Everything you need to know to use CountVectorizer efficiently in Sklearn

The most basic data processing step in any Natural Language Processing (NLP) project is converting text data to numeric data. As long as the data is in text form, we cannot perform any computation on it.

There are multiple methods available for this text-to-numeric conversion. This tutorial explains one of the most basic vectorizers, the CountVectorizer class in the scikit-learn library.

This method is very simple. It takes the frequency of occurrence of each word as the numeric value. An example will make it clear.

In the following code block, we will:

Import the CountVectorizer class.
Create a CountVectorizer instance.
Fit it to the text data, transform the text into a count matrix, and convert that matrix to an array.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# This is the text to be vectorized
text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt. \
I am trying to learn how to use count vectorizer."]

# Create the vectorizer, learn the vocabulary, and count each word
cv = CountVectorizer()
count_matrix = cv.fit_transform(text)

# fit_transform returns a sparse matrix; convert it to a dense array
cnt_arr = count_matrix.toarray()
cnt_arr

Output:

array([[1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1]],
      dtype=int64)

Here I have the numeric values representing the text data above.
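
To see where these numbers come from, the counting can be approximated in plain Python. This is only a rough sketch of the default behaviour, not scikit-learn's actual implementation. It assumes the default settings of lowercasing the text and keeping only tokens of two or more word characters, which is why "I" and the "s" from "aunt's" do not get a value of their own.

```python
import re
from collections import Counter

text = ("Hello Everyone! This is Lilly. My aunt's name is also Lilly. "
        "I love my aunt. I am trying to learn how to use count vectorizer.")

# Assumption: mimic the defaults (lowercase=True and a token pattern
# that keeps only runs of two or more word characters).
tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())

# Count each token, then order the values alphabetically by token,
# the same order the fitted vocabulary uses.
counts = Counter(tokens)
vocab = sorted(counts)
row = [counts[word] for word in vocab]

print(len(vocab))  # number of distinct tokens, i.e. number of columns
print(row)
```

The printed list has 18 values and matches the array above, one value per distinct token.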

How do we know which values represent which words in the text?

To make that clear, it will be helpful to convert the array into a DataFrame where column names will be the words themselves.

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cnt_df = pd.DataFrame(data=cnt_arr, columns=cv.get_feature_names_out())
cnt_df

Now the mapping is clear. The value for the word ‘also’ is 1, which means ‘also’ appeared only once in the text. The word ‘aunt’ appeared twice in the text, so its value is 2.
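
If you only need the count for a few words, a DataFrame is not strictly required. The sketch below uses the fitted vectorizer's vocabulary_ attribute, which maps each word to its column index in the count matrix. The variable names also_col and aunt_col are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. "
        "I love my aunt. I am trying to learn how to use count vectorizer."]

cv = CountVectorizer()
row = cv.fit_transform(text).toarray()[0]  # the single row of counts

# vocabulary_ maps each word to its column index in the count matrix.
also_col = cv.vocabulary_["also"]
aunt_col = cv.vocabulary_["aunt"]

print("also:", row[also_col])
print("aunt:", row[aunt_col])
```

This gives the same two counts discussed above: 1 for ‘also’ and 2 for ‘aunt’.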

In the last example, all the sentences were in one string. So, we got only one row of data for four sentences. Let’s rearrange the text and…
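
The one-row-per-string behaviour described above can be checked with a small sketch. The split into four strings below is an assumption made for illustration, not necessarily the arrangement the author goes on to use.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical split of the example text into four separate strings.
sentences = [
    "Hello Everyone! This is Lilly.",
    "My aunt's name is also Lilly.",
    "I love my aunt.",
    "I am trying to learn how to use count vectorizer.",
]

cv = CountVectorizer()
count_matrix = cv.fit_transform(sentences)

# One row per input string, one column per distinct word.
print(count_matrix.shape)

# Summing over the rows recovers the counts for the whole text.
totals = count_matrix.toarray().sum(axis=0).tolist()
print(totals)
```

Each string now gets its own row, while the columns are still the words found across all the strings.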
