We rely on software on our phones and computers every day. Applications such as e-commerce sites, movie streaming services, and game platforms have changed how we live by making everyday tasks easier. To improve the experience further, businesses often add features that recommend items based on data.
The basis of recommendation systems is to predict what the user might be interested in based on the input. The system returns the closest items based on either the similarity between the items (content-based filtering) or the behavior of users (collaborative filtering).
Among the many approaches to building a recommendation system, one option is the Hugging Face Transformers package. If you didn’t know, Hugging Face Transformers is an open-source Python package that provides APIs for easily accessing pre-trained NLP models that support tasks such as text processing, generation, and many others.
This article will use the Hugging Face Transformers package to develop a simple recommendation system based on embedding similarity. Let’s get started.
Develop a Recommendation System with Hugging Face Transformers
Before we start the tutorial, we need to install the required packages. To do that, you can use the following code:
pip install transformers torch pandas scikit-learn
For the Torch installation, you can select the version suitable for your environment on the PyTorch website.
For the dataset, we will use the Anime recommendation dataset from Kaggle.
Once the environment and the dataset are ready, we can start the tutorial. First, we need to read the dataset and prepare it.
import pandas as pd
df = pd.read_csv('anime.csv')
df = df.dropna()
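# Cast episodes to str so the concatenation also works if the column is read as numeric
df['description'] = df['name'] + ' ' + df['genre'] + ' ' + df['type'] + ' episodes: ' + df['episodes'].astype(str)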
In the code above, we read the dataset with Pandas and dropped all the rows with missing data. Then, we created a feature called “description” that combines the information available in the data: name, genre, type, and episode number. The new column becomes the basis for our recommendation system. It would be better to have more complete information, such as the anime plot and summary, but let’s be content with this one for now.
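To make the combined text concrete, here is a minimal sketch of the same concatenation applied to a single record. The row values are invented for illustration and are not taken from the dataset.

```python
# Hypothetical row, invented for illustration; not taken from the dataset
row = {"name": "Example Anime", "genre": "Comedy, School", "type": "TV", "episodes": "12"}

# Same concatenation as the "description" column, applied to one record
description = (
    row["name"] + " " + row["genre"] + " " + row["type"]
    + " episodes: " + row["episodes"]
)
print(description)  # Example Anime Comedy, School TV episodes: 12
```

This single string is what the embedding model sees for each anime, which is why richer text such as a plot summary would likely help.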
Next, we will use Hugging Face Transformers to load an embedding model and transform the text into numerical vectors. Specifically, we will use sentence embeddings, which represent a whole sentence as a single vector.
The recommendation system will be based on the embeddings of all the anime “description” values, which we will compute shortly. We will use cosine similarity, which measures how similar two vectors are by the angle between them. By measuring the similarity between each anime “description” embedding and the embedding of the user’s query, we can find the closest items to recommend.
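As a rough illustration of cosine similarity, the sketch below computes it by hand for tiny vectors. The helper name cosine_sim and the three-dimensional vectors are assumptions made up for this example; real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", made up for illustration only
query = [1.0, 2.0, 0.0]
similar_item = [2.0, 4.0, 0.0]    # points in the same direction as the query
unrelated_item = [0.0, 0.0, 3.0]  # orthogonal to the query

print(cosine_sim(query, similar_item))    # very close to 1
print(cosine_sim(query, unrelated_item))  # 0.0
```

A score near 1 means the vectors point in the same direction regardless of their length, while a score of 0 means they share no direction at all.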
The embedding similarity approach sounds simple, but it can be powerful compared to the classic recommendation system model, as it can capture the semantic relationship between words and provide contextual meaning for the recommendation process.
For this tutorial, we will use a sentence-transformers embedding model from Hugging Face. To transform sentences into embeddings, we can use the following code.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
def get_embeddings(sentences):
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        model_output = model(**encoded_input)
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings
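The mean_pooling step averages the token embeddings while ignoring padding tokens. The sketch below shows the same idea on plain Python lists; the helper masked_mean and the toy vectors are hypothetical and stand in for the tensors the model produces.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        for i in range(dim):
            totals[i] += vector[i] * keep
    # Guard against an all-zero mask, mirroring the clamp in mean_pooling
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]

# Hypothetical sentence: two real tokens followed by one padding token
token_vectors = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_vectors, attention_mask)
print(pooled)  # [2.0, 4.0]
```

The padding vector does not affect the result because its mask value is 0, which is why sentences of different lengths can be batched together safely.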
Try the embedding process and see the resulting vectors with the following code. I will not show the output here, as it’s pretty long.
sentences = ['Some great movie', 'Another funny movie']
result = get_embeddings(sentences)
print("Sentence embeddings:")
print(result)
To make things easier, Hugging Face maintains a Python package for sentence transformers, which reduces the whole transformation process to 3 lines of code. Install the package using the command below.
pip install -U sentence-transformers
Then, we can transform the whole anime “description” column with the following code.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
anime_embeddings = model.encode(df['description'].tolist())
With the embedding database ready, we can create a function that takes the user input and uses cosine similarity to produce recommendations.
from sklearn.metrics.pairwise import cosine_similarity
def get_recommendations(query, embeddings, df, top_n=5):
    query_embedding = model.encode([query])
    similarities = cosine_similarity(query_embedding, embeddings)
    top_indices = similarities[0].argsort()[-top_n:][::-1]
    return df.iloc[top_indices]
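The expression similarities[0].argsort()[-top_n:][::-1] in get_recommendations can be hard to read at first. A plain-Python sketch of the same ranking idea is shown below; the helper top_n_indices and the scores are made up for illustration, and ties may be ordered differently than in NumPy.

```python
def top_n_indices(scores, top_n=5):
    """Return the positions of the top_n highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]

# Hypothetical similarity scores for five items
scores = [0.12, 0.87, 0.45, 0.91, 0.30]
print(top_n_indices(scores, top_n=3))  # [3, 1, 2]
```

The returned positions are what df.iloc uses to pick the matching rows, so the embeddings must be in the same order as the rows of the DataFrame.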
Now that everything is ready, we can try the recommendation system. Here is an example of acquiring the top five anime recommendations from the user input query.
query = "Funny anime I can watch with friends"
recommendations = get_recommendations(query, anime_embeddings, df)
print(recommendations[['name', 'genre']])
Output>>
      name                                      genre
7363  Sentou Yousei Shoujo Tasukete! Mave-chan  Comedy, Parody, Sci-Fi, Shounen, Super Power
8140  Anime TV de Hakken! Tamagotchi            Comedy, Fantasy, Kids, Slice of Life
4294  SKET Dance: SD Character Flash Anime      Comedy, School, Shounen
1061  Isshuukan Friends.                        Comedy, School, Shounen, Slice of Life
2850  Oshiete! Galko-chan                       Comedy, School, Slice of Life
The results are all comedy anime, which matches our request for funny anime. Judging by their genres, most of them are also suitable to watch with friends. Of course, the recommendations would be even better if we had more detailed information.
Conclusion
A Recommendation System is a tool for predicting what users might be interested in based on the input. Using Hugging Face Transformers, we can build a recommendation system that uses the embedding and cosine similarity approach. The embedding approach is powerful as it can account for the text’s semantic relationship and contextual meaning.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.