When I started learning HuggingFace, data collators were one of the least intuitive components for me. I had a hard time understanding them, and I could not find resources that explained them intuitively.
In this post, we take a look at what data collators are, how they differ, and how to write a customized data collator.
Data collators are an essential part of data processing in HuggingFace. We have all used them after tokenizing the data and before passing it to the Trainer object to train the model.
In a nutshell, they assemble a list of samples into a mini training batch. What they do depends on the task they are defined for, but at the very least they pad or truncate input samples to make sure all samples in a mini batch are of the same length. Typical mini-batch sizes range from 16 to 256 samples, depending on the model size, data, and hardware constraints.
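To make "collating" concrete, here is a minimal, hand-rolled sketch (the function name naive_collate and the token ids are made up for illustration): it pads a list of variable-length token-id lists to a common length and stacks them into a single tensor, which is essentially what the simplest collators do.

```python
import torch

def naive_collate(samples, pad_id=0):
    """Pad variable-length token-id lists and stack them into a (batch, max_len) tensor."""
    max_len = max(len(s) for s in samples)
    padded = [s + [pad_id] * (max_len - len(s)) for s in samples]
    return torch.tensor(padded)

# Two "tokenized" samples of different lengths (ids are illustrative)
batch = naive_collate([[101, 7592, 102], [101, 2129, 2024, 2017, 102]])
print(batch.shape)  # torch.Size([2, 5])
```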
Data collators are task-specific. There is a data collator for each of the following tasks (a quick mapping to the corresponding transformers classes is sketched right after this list):
- Causal language modeling (CLM)
- Masked language modeling (MLM)
- Sequence classification
- Seq2Seq
- Token classification
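For reference, these tasks map onto collator classes that ship with transformers. A rough import sketch (causal LM uses the same class as masked LM, just with mlm=False):

```python
from transformers import (
    DataCollatorForLanguageModeling,     # CLM (mlm=False) and MLM (mlm=True)
    DataCollatorWithPadding,             # sequence classification
    DataCollatorForSeq2Seq,              # seq2seq: also pads the labels / decoder side
    DataCollatorForTokenClassification,  # token classification: pads labels alongside tokens
)
```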
Some data collators are simple. For example, for the sequence classification task, the data collator just needs to pad all sequences in a mini batch to ensure they are of the same length, and then concatenate them into one tensor.
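As a sketch of this simple case, DataCollatorWithPadding can batch two differently sized tokenized sentences (the checkpoint name bert-base-uncased is only an example; any tokenizer with a pad token works):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
collator = DataCollatorWithPadding(tokenizer=tokenizer)

features = [tokenizer(t) for t in ["Hi there", "A much longer sentence than the first one"]]
batch = collator(features)

print(batch["input_ids"].shape)    # both rows padded to the length of the longer sample
print(batch["attention_mask"][0])  # trailing zeros mark the padded positions
```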
Some data collators are quite complex, as they also need to prepare the labels and other inputs required by their task, not just pad the samples.
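DataCollatorForLanguageModeling is a good example of the more involved kind: besides padding, it randomly masks tokens and builds the labels tensor for MLM. A minimal sketch (the checkpoint name and the 15% masking probability are just illustrative choices):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,              # mask random tokens for masked language modeling
    mlm_probability=0.15,  # fraction of tokens selected for masking
)

samples = [tokenizer(t) for t in ["Hello world", "Data collators build training batches."]]
batch = collator(samples)

# labels are -100 everywhere except at the masked positions
print(batch["input_ids"].shape, batch["labels"].shape)
```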
Two of the most basic data collators are the following:
1) DefaultDataCollator: This one does no padding or truncation. It assumes all input samples are already of the same length. If your input samples are not of the same length, it will throw errors.
```python
import torch
from transformers import DefaultDataCollator

texts = ["Hello world", "How are you?"]

# Tokenize
from transformers import AutoTokenizer
tokenizer = …
```