When preparing data for embedding and retrieval in a RAG system, splitting the text into appropriately sized chunks is crucial. This process is guided by two main factors: Model Constraints and Retrieval Effectiveness.
Model Constraints
Embedding models have a maximum token length for input; anything beyond this limit gets truncated. Be aware of your chosen model’s limitations and ensure that each data chunk doesn’t exceed this max token length.
Multilingual models, in particular, often have shorter sequence limits compared to their English counterparts. For instance, the widely used Paraphrase multilingual MiniLM-L12 v2 model has a maximum context window of just 128 tokens.
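As a minimal sketch of checking chunks against such a limit, the snippet below flags any chunk that would be truncated. The helper names are hypothetical, and the whitespace-based counter is only a stand-in assumption: in practice you would pass a function that counts tokens with your embedding model's own tokenizer, which typically yields more tokens than a whitespace split.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    # Stand-in tokenizer (an assumption): real embedding models use subword
    # tokenizers, which usually produce more tokens than a whitespace split.
    return len(text.split())


def find_oversized_chunks(
    chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[int]:
    """Return the indices of chunks longer than max_tokens (these would be truncated)."""
    return [i for i, chunk in enumerate(chunks) if count_tokens(chunk) > max_tokens]


chunks = ["a short chunk", "word " * 200]
print(find_oversized_chunks(chunks, max_tokens=128))  # [1]
```

Running a check like this before embedding makes silent truncation visible, so oversized chunks can be split further instead of losing their tail.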
Also, consider the text length the model was trained on. Some models technically accept longer inputs but were trained on shorter chunks, which can degrade performance on longer texts. One such example is the Multi QA base model from SBERT.
Retrieval Effectiveness
While chunking data to the model’s maximum length seems logical, it might not always lead to the best retrieval outcomes. Larger chunks offer more context for the LLM but can obscure key details, making it harder to retrieve precise matches. Conversely, smaller chunks can enhance match accuracy but might lack the context needed for complete answers. Hybrid approaches use smaller chunks for search but include surrounding context at query time for balance.
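The hybrid idea can be sketched roughly as follows, assuming the chunks are stored in document order and the search step returns the index of the best-matching chunk. The function name and the `window` parameter are hypothetical, chosen for illustration only.

```python
from typing import List


def expand_with_neighbors(chunks: List[str], hit_index: int, window: int = 1) -> str:
    """Return the matched chunk joined with up to `window` chunks on each side."""
    if not 0 <= hit_index < len(chunks):
        raise IndexError("hit_index is outside the chunk list")
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return "\n".join(chunks[start:end])


chunks = ["Intro.", "Key detail.", "Follow-up.", "Conclusion."]
# Search matched the small chunk at index 1; the LLM receives it with its neighbors.
print(expand_with_neighbors(chunks, hit_index=1))
```

The small chunk keeps the match precise, while the neighbors give the LLM the surrounding context it may need to form a complete answer.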
There isn’t a definitive answer regarding chunk size, but the considerations remain the same whether you’re working on multilingual or English projects. I recommend reading further on the topic in resources such as “Evaluating the Ideal Chunk Size for RAG System using Llamaindex” or “Building RAG-based LLM Applications for Production”.
Text splitting: Methods for splitting text
Text can be split using various methods, which fall mainly into two categories: rule-based (focusing on character analysis) and machine learning-based. ML approaches, from simple NLTK and spaCy tokenizers to advanced transformer models, often depend on language-specific training, primarily in English. Although simple models like NLTK and spaCy support multiple languages, they mainly address sentence splitting, not semantic sectioning.
Since ML-based sentence splitters currently work poorly for most non-English languages and are compute-intensive, I recommend starting with a simple rule-based splitter. If you’ve preserved the relevant syntactic structure from the original data and formatted the data correctly, the result will be of good quality.
A common and effective method is a recursive character text splitter, like those used in LangChain or LlamaIndex, which shortens sections by finding the nearest split character in a prioritized sequence (e.g., \n\n, \n, ., ?, !).
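To make the idea concrete, here is a simplified sketch of recursive splitting. It is not the LangChain or LlamaIndex implementation: it drops the separator at chunk boundaries, has no overlap support, and the function name is hypothetical. Length is measured in characters by default, but any length function can be passed in.

```python
from typing import Callable, List


def recursive_split(
    text: str,
    max_length: int,
    separators: List[str],
    length_function: Callable[[str], int] = len,
) -> List[str]:
    """Split text so each piece fits max_length, trying separators in priority order."""
    if length_function(text) <= max_length:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: return the oversized piece rather than cut mid-word.
        return [text]
    separator, remaining = separators[0], separators[1:]
    if separator not in text:
        return recursive_split(text, max_length, remaining, length_function)

    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if length_function(candidate) <= max_length:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if length_function(piece) <= max_length:
            current = piece
        else:
            # The piece alone is too long: fall back to lower-priority separators.
            chunks.extend(recursive_split(piece, max_length, remaining, length_function))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


text = "aaaa bbbb\n\ncccc dddd eeee\n\nffff"
print(recursive_split(text, max_length=10, separators=["\n\n", " "]))
```

Because higher-priority separators are tried first, sections and paragraphs stay intact whenever they fit, and only oversized pieces are broken at sentence or word level.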
Taking the formatted text from the previous section, an example of using LangChain's recursive character splitter would look like:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# Load the tokenizer of the embedding model so length is measured in its tokens.
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")

def token_length_function(text_input):
    # Count tokens without the special tokens the model adds around the input.
    return len(tokenizer.encode(text_input, add_special_tokens=False))

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=128,
    chunk_overlap=0,
    length_function=token_length_function,
    separators=["\n\n", "\n", ". ", "? ", "! "],
)

# split_text returns a list of strings, one per chunk.
split_texts = text_splitter.split_text(formatted_document['Boosting RAG: Picking the Best Embedding & Reranker models'])
Note that the tokenizer should be the one belonging to the embedding model you intend to use, since different models tokenize text differently and therefore measure its length differently. The splitter will now, in prioritized order, split any text longer than 128 tokens first at the \n\n we introduced at the end of sections and, if that is not possible, at the end of paragraphs delimited by \n, and so forth. The first 3 chunks will be:
Token of text: 111
UPDATE: The pooling method for the Jina AI embeddings has been adjusted to use mean pooling, and the results have been updated accordingly. Notably, the JinaAI-v2-base-en with bge-reranker-large now exhibits a Hit Rate of 0.938202 and an MRR (Mean Reciprocal Rank) of 0.868539 and with CohereRerank exhibits a Hit Rate of 0.932584, and an MRR of 0.873689.
-----------
Token of text: 112
When building a Retrieval Augmented Generation (RAG) pipeline, one key component is the Retriever. We have a variety of embedding models to choose from, including OpenAI, CohereAI, and open-source sentence transformers. Additionally, there are several rerankers available from CohereAI and sentence transformers.
But with all these options, how do we determine the best mix for top-notch retrieval performance? How do we know which embedding model fits our data best? Or which reranker boosts our results the most?
-----------
Token of text: 54
In this blog post, we’ll use the Retrieval Evaluation module from LlamaIndex to swiftly determine the best combination of embedding and reranker models. Let's dive in!
Let’s first start with understanding the metrics available in Retrieval Evaluation
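A listing like the one above could be produced with a small helper along these lines. This is a sketch rather than the exact code behind the output: the function name is hypothetical, and the whitespace counter in the usage line is a stand-in for the token_length_function defined earlier.

```python
from typing import Callable, List


def format_chunks(
    chunks: List[str],
    length_function: Callable[[str], int],
    limit: int = 3,
) -> str:
    """Render the first `limit` chunks with their token counts, one block per chunk."""
    blocks = [
        f"Token of text: {length_function(chunk)}\n{chunk}\n-----------"
        for chunk in chunks[:limit]
    ]
    return "\n".join(blocks)


# Whitespace counting is a stand-in here; with the splitter example you would
# pass split_texts and token_length_function instead.
print(format_chunks(["first chunk of text", "second chunk"], lambda t: len(t.split())))
```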
Now that we have successfully split the text in a semantically meaningful way, we can move on to the final part: embedding these chunks for storage.