Load and Prepare the Pre-training Data
To build our domain-specific dataset, we'll use the Kaggle arXiv dataset, which comprises roughly 1.7M scholarly STEM papers sourced from arXiv, the established electronic preprint platform. Besides title, abstract, and authors, each article carries a significant amount of metadata. However, here we are concerned only with the titles.
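The snippets in this section rely on a handful of libraries. Roughly, these are the imports I'm assuming throughout (a sketch of my setup; adjust it to your own environment):

import json
from typing import List

import pandas as pd
from pylatexenc.latex2text import LatexNodes2Text
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, datasets, losses, models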
After download, I'll select the mathematics preprints. Given the hefty size of the Kaggle file, I've added a reduced version of the mathematics papers file to GitHub for easier access. However, if you're inclined towards a different subject, download the dataset and replace math with your desired topic in the code below:
# Collect the papers with subject "math"
def extract_entries_with_math(filename: str) -> List[dict]:
    """
    Function to extract the entries that contain the string 'math' in the 'id'.
    """
    # Initialize an empty list to store the extracted entries.
    entries_with_math = []
    with open(filename, 'r') as f:
        for line in f:
            try:
                # Load the JSON object from the line
                data = json.loads(line)
                # Check if the "id" key exists and if it contains "math"
                if "id" in data and "math" in data["id"]:
                    entries_with_math.append(data)
            except json.JSONDecodeError:
                # Print an error message if this line isn't valid JSON
                print(f"Couldn't parse: {line}")
    return entries_with_math
# Extract the mathematics papers
entries = extract_entries_with_math(arxiv_full_dataset)

# Save the dataset as a JSON object
arxiv_dataset_math = file_path + "/data/arxiv_math_dataset.json"
with open(arxiv_dataset_math, 'w') as fout:
    json.dump(entries, fout)
I've loaded our dataset into a Pandas dataframe, df. A quick inspection shows that the reduced dataset contains 55,497 preprints, a more practical size for our experiment. While the [tsdae_article] suggests that around 10K entries are adequate, I'll keep the entire reduced dataset.
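The loading step itself isn't shown above; one straightforward way to do it, assuming pandas and the JSON file we just wrote, would be:

# Load the saved JSON array of papers into a dataframe (one possible way)
df = pd.read_json(arxiv_dataset_math)
print(df.shape)  # should report 55,497 rows for the reduced mathematics dataset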
Mathematics titles might have LaTeX code, which I'll swap for ISO code to optimize processing:

# Replace LaTeX script in the titles with ISO code
parsed_titles = []
for i, a in df.iterrows():
    try:
        parsed_titles.append(LatexNodes2Text().latex_to_text(a['title']).replace('\\n', ' ').strip())
    except Exception:
        parsed_titles.append(a['title'].replace('\\n', ' ').strip())

# Create a new column with the parsed titles
df['parsed_title'] = parsed_titles
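To see what that conversion does, here's a quick illustration on a made-up title (not from the dataset):

# pylatexenc turns LaTeX markup into plain Unicode text
print(LatexNodes2Text().latex_to_text(r"On $\alpha$-stable L\'evy processes"))
# expected output (roughly): On α-stable Lévy processes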
I'll use the parsed_title entries for training, so let's extract them as a list:
# Extract the parsed titles as a list
train_sentences = df.parsed_title.to_list()
Next, let’s form the corrupted sentences by removing approximately 60% of tokens from each entry. If you’re interested in exploring further or trying different deletion ratios, check out the denoising script.
# Add noise to the data
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
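If you want to try a different deletion ratio, the dataset accepts a custom noise function. The sketch below assumes the DenoisingAutoEncoderDataset.delete helper available in recent sentence-transformers releases (it tokenizes with NLTK under the hood):

# Same dataset, but with an explicit, tunable deletion ratio
train_dataset = datasets.DenoisingAutoEncoderDataset(
    train_sentences,
    noise_fn=lambda s: datasets.DenoisingAutoEncoderDataset.delete(s, del_ratio=0.6),
)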
Let’s take a look at what happened to one entry after processing:
print(train_dataset[2010])

initial text: "On solutions of Bethe equations for the XXZ model"
corrupted text: "On solutions of for the XXZ"
As you can see, Bethe equations and model were removed from the initial text.
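Under the hood (at least in the sentence-transformers versions I've used), each dataset item is an InputExample that pairs the corrupted sentence with the original, which is exactly what the denoising loss below expects:

item = train_dataset[2010]
print(item.texts[0])  # corrupted sentence
print(item.texts[1])  # original sentence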
The last step in our data processing is to load the dataset in batches:
train_dataloader = DataLoader(train_dataset, batch_size=8,
                              shuffle=True, drop_last=True)
TSDAE Training
While I'll be following the approach from the train_tsdae_from_file.py script, I'll construct it step by step for better understanding.
Start by selecting a pre-trained transformer checkpoint; I'll stick with the default option:
model_name = 'bert-base-uncased'
word_embedding_model = models.Transformer(model_name)
Choose CLS as the pooling method and specify the dimension of the vectors to be constructed:
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               'cls')
Next, build the sentence transformer by combining the two layers:
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
Lastly, specify the loss function and tie the encoder-decoder parameters for the training phase.
train_loss = losses.DenoisingAutoEncoderLoss(model,
                                             decoder_name_or_path=model_name,
                                             tie_encoder_decoder=True)
Now, we’re set to invoke the fit method and train the model. I’ll also store it for the subsequent steps. You’re welcome to tweak the hyperparameters to optimize your experiment.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True,
    use_amp=True  # set to False if GPU does not support FP16 cores
)

pretrained_model_save_path = 'output/tsdae-bert-uncased-math'
model.save(pretrained_model_save_path)
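As an optional sanity check, you can load the saved model back and embed a couple of titles to confirm everything is wired up (a minimal sketch):

# Reload the TSDAE pre-trained model and encode two titles
tsdae_model = SentenceTransformer(pretrained_model_save_path)
embeddings = tsdae_model.encode(train_sentences[:2])
print(embeddings.shape)  # (2, 768) for a BERT-base encoder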
The pre-training stage took about 15 minutes on a Google Colab Pro instance with an A100 GPU set on High-RAM.