When I started learning HuggingFace, data collators were one of the least intuitive components for me. I had a hard time understanding them, and I could not find resources that explained them intuitively.
In this post, we take a look at what data collators are, how they differ, and how to write a customized data collator.
Data collators are an essential part of data processing in HuggingFace. We have all used them after tokenizing the data and before passing it to the Trainer object to train the model.
In a nutshell, they assemble a list of samples into a mini training batch. What they do depends on the task they are defined for, but at the very least they pad or truncate input samples to make sure all samples in a mini batch are of the same length. Typical mini-batch sizes range from 16 to 256 samples, depending on the model size, data, and hardware constraints.
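To make "collating" concrete, here is a minimal, hand-rolled sketch (the function name naive_collate and the token ids are made up for illustration): it pads a list of variable-length token-id lists to a common length and stacks them into a single tensor, which is essentially what the simplest collators do.

```python
import torch

def naive_collate(samples, pad_id=0):
    """Pad variable-length token-id lists and stack them into a (batch, max_len) tensor."""
    max_len = max(len(s) for s in samples)
    padded = [s + [pad_id] * (max_len - len(s)) for s in samples]
    return torch.tensor(padded)

# Two "tokenized" samples of different lengths (ids are illustrative)
batch = naive_collate([[101, 7592, 102], [101, 2129, 2024, 2017, 102]])
print(batch.shape)  # torch.Size([2, 5])
```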
Data collators are task-specific. There is a data collator for each of the following tasks (a quick mapping to the corresponding transformers classes is sketched right after this list):
- Causal language modeling (CLM)
- Masked language modeling (MLM)
- Sequence classification
- Seq2Seq
- Token classification
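For reference, these tasks map onto collator classes that ship with transformers. A rough import sketch (causal LM uses the same class as masked LM, just with mlm=False):

```python
from transformers import (
    DataCollatorForLanguageModeling,     # CLM (mlm=False) and MLM (mlm=True)
    DataCollatorWithPadding,             # sequence classification
    DataCollatorForSeq2Seq,              # seq2seq: also pads the labels / decoder side
    DataCollatorForTokenClassification,  # token classification: pads labels alongside tokens
)
```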
Some data collators are simple. For example, for the sequence classification task, the data collator just needs to pad all sequences in a mini batch to ensure they are of the same length, and then concatenate them into one tensor.
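As a sketch of this simple case, DataCollatorWithPadding can batch two differently sized tokenized sentences (the checkpoint name bert-base-uncased is only an example; any tokenizer with a pad token works):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
collator = DataCollatorWithPadding(tokenizer=tokenizer)

features = [tokenizer(t) for t in ["Hi there", "A much longer sentence than the first one"]]
batch = collator(features)

print(batch["input_ids"].shape)    # both rows padded to the length of the longer sample
print(batch["attention_mask"][0])  # trailing zeros mark the padded positions
```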
Some data collators are quite complex, as they also need to prepare the labels and other inputs required by their task, not just pad the samples.
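DataCollatorForLanguageModeling is a good example of the more involved kind: besides padding, it randomly masks tokens and builds the labels tensor for MLM. A minimal sketch (the checkpoint name and the 15% masking probability are just illustrative choices):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,              # mask random tokens for masked language modeling
    mlm_probability=0.15,  # fraction of tokens selected for masking
)

samples = [tokenizer(t) for t in ["Hello world", "Data collators build training batches."]]
batch = collator(samples)

# labels are -100 everywhere except at the masked positions
print(batch["input_ids"].shape, batch["labels"].shape)
```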
Two of the most basic data collators are the following:
1) DefaultDataCollator: This one does no padding or truncation. It assumes all input samples are already of the same length. If your input samples are not of the same length, it will throw errors.
```python
import torch
from transformers import DefaultDataCollator

texts = ["Hello world", "How are you?"]

# Tokenize
from transformers import AutoTokenizer
tokenizer = …
```