Load and Prepare the Pre-training Data
To build our domain-specific dataset, we'll use the Kaggle arXiv dataset, which comprises roughly 1.7M scholarly STEM papers sourced from arXiv, the established electronic preprint platform. Besides title, abstract, and authors, each article carries a significant amount of metadata. However, here we are concerned only with the titles.
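The snippets in this section rely on a handful of libraries. Roughly, these are the imports I'm assuming throughout (a sketch of my setup; adjust it to your own environment):

import json
from typing import List

import pandas as pd
from pylatexenc.latex2text import LatexNodes2Text
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, datasets, losses, models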
After download, I'll select the mathematics preprints. Given the hefty size of the Kaggle file, I've added a reduced version of the mathematics papers file to GitHub for easier access. However, if you're inclined towards a different subject, download the dataset and replace math with your desired topic in the code below:
# Collect the papers with subject "math"
def extract_entries_with_math(filename: str) -> List[dict]:
    """
    Function to extract the entries that contain the string 'math' in the 'id'.
    """
    # Initialize an empty list to store the extracted entries.
    entries_with_math = []
    with open(filename, 'r') as f:
        for line in f:
            try:
                # Load the JSON object from the line
                data = json.loads(line)
                # Check if the "id" key exists and if it contains "math"
                if "id" in data and "math" in data["id"]:
                    entries_with_math.append(data)
            except json.JSONDecodeError:
                # Print an error message if this line isn't valid JSON
                print(f"Couldn't parse: {line}")
    return entries_with_math
# Extract the mathematics papers
entries = extract_entries_with_math(arxiv_full_dataset)

# Save the dataset as a JSON object
arxiv_dataset_math = file_path + "/data/arxiv_math_dataset.json"
with open(arxiv_dataset_math, 'w') as fout:
    json.dump(entries, fout)
I've loaded our dataset into a Pandas dataframe, df. A quick inspection shows that the reduced dataset contains 55,497 preprints, a more practical size for our experiment. While the [tsdae_article] suggests that around 10K entries are adequate, I'll keep the entire reduced dataset.
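The loading step itself isn't shown above; one straightforward way to do it, assuming pandas and the JSON file we just wrote, would be:

# Load the saved JSON array of papers into a dataframe (one possible way)
df = pd.read_json(arxiv_dataset_math)
print(df.shape)  # should report 55,497 rows for the reduced mathematics dataset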
Mathematics titles might have LaTeX code, which I'll swap for ISO code to optimize processing:

# Replace LaTeX script in the titles with ISO code
parsed_titles = []
for i, a in df.iterrows():
    try:
        parsed_titles.append(LatexNodes2Text().latex_to_text(a['title']).replace('\\n', ' ').strip())
    except Exception:
        parsed_titles.append(a['title'].replace('\\n', ' ').strip())

# Create a new column with the parsed titles
df['parsed_title'] = parsed_titles
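To see what that conversion does, here's a quick illustration on a made-up title (not from the dataset):

# pylatexenc turns LaTeX markup into plain Unicode text
print(LatexNodes2Text().latex_to_text(r"On $\alpha$-stable L\'evy processes"))
# expected output (roughly): On α-stable Lévy processes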
I'll use the parsed_title entries for training, so let's extract them as a list:
# Extract the parsed titles as a list
train_sentences = df.parsed_title.to_list()
Next, let’s form the corrupted sentences by removing approximately 60% of tokens from each entry. If you’re interested in exploring further or trying different deletion ratios, check out the denoising script.
# Add noise to the data
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
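If you want to try a different deletion ratio, the dataset accepts a custom noise function. The sketch below assumes the DenoisingAutoEncoderDataset.delete helper available in recent sentence-transformers releases (it tokenizes with NLTK under the hood):

# Same dataset, but with an explicit, tunable deletion ratio
train_dataset = datasets.DenoisingAutoEncoderDataset(
    train_sentences,
    noise_fn=lambda s: datasets.DenoisingAutoEncoderDataset.delete(s, del_ratio=0.6),
)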
Let’s take a look at what happened to one entry after processing:
print(train_dataset[2010])

initial text: "On solutions of Bethe equations for the XXZ model"
corrupted text: "On solutions of for the XXZ"
As you can see, Bethe equations and model were removed from the initial text.
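Under the hood (at least in the sentence-transformers versions I've used), each dataset item is an InputExample that pairs the corrupted sentence with the original, which is exactly what the denoising loss below expects:

item = train_dataset[2010]
print(item.texts[0])  # corrupted sentence
print(item.texts[1])  # original sentence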
The last step in our data processing is to load the dataset in batches:
train_dataloader = DataLoader(train_dataset, batch_size=8,
                              shuffle=True, drop_last=True)
TSDAE Training
While I'll be following the approach from the train_tsdae_from_file.py script, I'll construct it step by step for better understanding.
Start by selecting a pre-trained transformer checkpoint; I'll stick with the default option:
model_name = 'bert-base-uncased'
word_embedding_model = models.Transformer(model_name)
Choose CLS as the pooling method and specify the dimension of the vectors to be constructed:
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               'cls')
Next, build the sentence transformer by combining the two layers:
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
Lastly, specify the loss function and tie the encoder-decoder parameters for the training phase.
train_loss = losses.DenoisingAutoEncoderLoss(model,
                                             decoder_name_or_path=model_name,
                                             tie_encoder_decoder=True)
Now, we’re set to invoke the fit method and train the model. I’ll also store it for the subsequent steps. You’re welcome to tweak the hyperparameters to optimize your experiment.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True,
    use_amp=True  # set to False if GPU does not support FP16 cores
)

pretrained_model_save_path = 'output/tsdae-bert-uncased-math'
model.save(pretrained_model_save_path)
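As an optional sanity check, you can load the saved model back and embed a couple of titles to confirm everything is wired up (a minimal sketch):

# Reload the TSDAE pre-trained model and encode two titles
tsdae_model = SentenceTransformer(pretrained_model_save_path)
embeddings = tsdae_model.encode(train_sentences[:2])
print(embeddings.shape)  # (2, 768) for a BERT-base encoder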
The pre-training stage took about 15 minutes on a Google Colab Pro instance with an A100 GPU set on High-RAM.