At a high level, Transformers need two pieces of information about their input: the token embeddings and the positional encodings. A tokenizer such as tiktoken uses a fixed-size vocabulary to map each piece of text to a unique token ID, and the model stores a learned embedding vector for each ID. Through training, the model also learns the projections that turn those embeddings into the queries, keys, and values used by attention, so that it can generate the next token successfully.
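To make the split between token IDs and embeddings concrete, here is a minimal sketch. The four-word vocabulary, the whitespace tokenizer, and the randomly initialised table are all stand-ins invented for illustration; a real tokenizer such as tiktoken uses a much larger subword vocabulary, and real embeddings are learned during training.

```python
import random

# Hypothetical toy vocabulary; real vocabularies have tens of thousands of entries.
VOCAB = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
EMBED_DIM = 4

random.seed(0)
# One vector per vocabulary entry. Randomly initialised here; learned in a real model.
EMBEDDING_TABLE = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in VOCAB
]


def tokenize(text):
    """Map whitespace-separated words to integer IDs (toy stand-in for subword tokenization)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def embed(token_ids):
    """Look up the embedding vector for each token ID."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


ids = tokenize("The cat sat")
vectors = embed(ids)
print(ids, len(vectors), len(vectors[0]))
```

Note that nothing in the lookup depends on where a token appears in the sentence, which is why positional information has to be supplied separately.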
In addition to the embeddings, we also need positional information to tell the LLM where in a sequence each token sits. The equations above show the most abstract view of how positional information is passed along. We have three functions, one each for the query, key, and value, and two word embedding vectors (Xm and Xn, where m and n denote the positions of the two tokens in the sequence).
One approach is to simply create a new vector for each position, so that every position is perfectly unique. The trade-off is that fully unique vectors make it hard for the model to see similarities across positions in the training data, which degrades performance.
A second approach would be to create positional vectors that share a measurable similarity with one another. This way we still capture information about how similar one situation is to another distinct situation. Nevertheless, because these vectors can collide, this approach can introduce confusion of its own.
How do we find the best combination of these approaches?
The industry has largely settled on RoPE as a way to get the best of both worlds. Without going too deep into the mathematics, RoPE uses sine and cosine functions to rotate each query and key vector by angles that grow with the token's position. Because sinusoidal functions are periodic by design, some positions end up with values very similar to others. Consequently, positions that are similar carry a quantitative value indicating just how similar they are.
As you can see from the equation above, we have a sparse matrix of sine and cosine terms, each built from the position and a value θ. Sharing the same set of θ values is what keeps all of the positional encodings related.
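The rotation idea can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes the common convention of rotating consecutive pairs of dimensions with frequencies base^(-2i/d), and the vectors `q` and `k` are made-up values.

```python
import math


def rope_angles(position, dim, base=10000.0):
    """Rotation angle for each pair of dimensions: position * base^(-2i/dim)."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by that pair's angle."""
    rotated = []
    for i, angle in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up query and key vectors, for illustration only.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same relative offset (3 positions apart) at two different absolute positions.
score_near_start = dot(apply_rope(q, 5), apply_rope(k, 2))
score_further_on = dot(apply_rope(q, 105), apply_rope(k, 102))
print(score_near_start, score_further_on)
```

The two scores agree up to floating-point error, which is the property that makes RoPE attractive: the attention score between a rotated query and key depends on how far apart the tokens are, not on where they sit in absolute terms.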
The exact way these θ are related is shown below:
The most critical part of this equation for context size is the value 10,000. As we have tried to build bigger context windows, this base has become a limiting factor: it fixes the range of rotation frequencies, so there are only so many distinguishable positional vectors you can create with it.
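One way to see the role of the base is to look at how many positions each pair of dimensions needs before its rotation wraps around. The sketch below assumes the usual definition θ_i = base^(-2i/d); the head dimension of 128 and the larger base of 500,000 are arbitrary values chosen for illustration.

```python
import math


def rope_thetas(dim, base=10000.0):
    """Per-pair rotation frequencies: theta_i = base^(-2i/dim)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def wavelengths(dim, base=10000.0):
    """Number of positions each pair needs to complete one full rotation."""
    return [2 * math.pi / theta for theta in rope_thetas(dim, base)]


DIM = 128  # assumed head dimension, for illustration only

for base in (10000.0, 500000.0):
    w = wavelengths(DIM, base)
    print(f"base={base:>8.0f}  shortest={w[0]:.2f}  longest={w[-1]:.0f}")
```

The fastest pair always repeats after about 2π positions, while the slowest pair's wavelength grows with the base. A larger base therefore stretches the range of positions the slow dimensions can tell apart, which is why the base matters for context length.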
While you could train a new model from scratch using a larger base value for your positional encodings, there are a few reasons stopping people at large from doing this. First, there is a huge cost associated with training from scratch. As only a few organizations in the world have the resources to do so currently, the burden to do this is great. Second, it is incredibly difficult to find a large volume of high quality long text. As the training requires trillions of tokens, finding quality long-data at that scale is a major challenge.
Consequently, researchers have put forward different methods for extending RoPE to longer contexts without training from scratch.
The first method is linear positional interpolation (PI), where you expand the number of usable positions by scaling the rotation angles down by some factor λ, the ratio between the new and the original context length. The equation below uses β to represent θ^(2/d), where θ here is the base of 10,000 that connects all of the rotation frequencies from before.
While this works, the authors of the paper note that there is a crowding effect where some of the information ends up getting lost after the reduction.
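A small sketch shows both the trick and the crowding effect. It assumes PI is applied by dividing the position index by λ before computing the angles; the dimension of 8 and the 4k to 8k extension are illustrative values.

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """RoPE angles with the position divided by `scale` (linear interpolation)."""
    return [(position / scale) * base ** (-2 * i / dim) for i in range(dim // 2)]


DIM = 8
TRAINED_LEN = 4096
TARGET_LEN = 8192
scale = TARGET_LEN / TRAINED_LEN  # lambda = 2

# The last position of the extended window maps onto the last trained position.
extended_last = rope_angles(TARGET_LEN, DIM, scale=scale)
trained_last = rope_angles(TRAINED_LEN, DIM)

# Crowding: the angular step between neighbouring positions shrinks by 1/lambda.
step_original = rope_angles(1, DIM)[0]
step_scaled = rope_angles(1, DIM, scale=scale)[0]

print(extended_last == trained_last, step_original, step_scaled)
```

Every position in the longer window now falls inside the range the model saw during training, but neighbouring tokens sit closer together in angle, so the model has less resolution to tell them apart.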
The second method is YaRN (Yet another RoPE extensioN method), where we divide the RoPE dimensions into 3 groups based on their frequency and apply a different interpolation strategy to each. The basic idea is that high-frequency dimensions should not be altered (their λ := 1), while lower-frequency dimensions are interpolated. From the graph below, we can see that this works well at expanding up to 128k context length. The issue at play here is determining the groupings. The groups are chosen by people, so sub-optimal decisions can be made that reduce performance.
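The grouping idea can be sketched as a per-dimension rescale factor. This is a simplified illustration in the spirit of YaRN's frequency-based grouping, not the method's exact recipe: the thresholds `low_rotations` and `high_rotations` are assumed values, and the middle group is handled here with a plain linear blend.

```python
import math


def yarn_style_factors(dim, trained_len, scale, base=10000.0,
                       low_rotations=1.0, high_rotations=32.0):
    """Per-pair rescale factor: 1 for fast pairs, `scale` for slow pairs,
    and a blend in between. Thresholds are illustrative assumptions."""
    factors = []
    for i in range(dim // 2):
        wavelength = 2 * math.pi * base ** (2 * i / dim)
        # How many full turns this pair completes inside the trained window.
        rotations = trained_len / wavelength
        # ramp = 1 keeps the pair untouched, ramp = 0 interpolates it fully.
        ramp = (rotations - low_rotations) / (high_rotations - low_rotations)
        ramp = min(1.0, max(0.0, ramp))
        # Blend the interpolated frequency (1/scale) with the original one (1).
        factors.append(1.0 / ((1.0 - ramp) / scale + ramp))
    return factors


factors = yarn_style_factors(dim=128, trained_len=4096, scale=32.0)
print(factors[0], factors[-1])
```

The hand-picked thresholds are exactly the kind of human decision the paragraph above refers to: move them and different dimensions land in different groups.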
Thus, while both YaRN and linear positional interpolation (PI) work, they have limitations that hold them back. LongRoPE takes the best of each idea and finds a clever way to combine them.
The LongRoPE researchers improved on previous methods with two key ideas: (1) the distribution of good λ values is irregular, so searching for λ is better than assuming a correct answer, and (2) there is a subset of tokens, those at the start of the sequence, whose positions should simply not be changed.
Both of these ideas appear in the formula below. To find the optimal λ, they created a loss function that they could minimize. The formula is a reformulated version of RoPE in which the output of 𝕀 multiplies n/β^i, giving the scaling applied to our positional vector. They then choose the λ that produces the smallest loss.
The 𝕀 step function is how we realize the subset of tokens that should not be altered. When it returns 1, the positional encoding at that position stays the same. To keep the search limited, they only considered n-hat values of {0, 1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 64, 128, 256}. The higher the value of n-hat, the more tokens keep their original positional encodings.
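The rescaling described above can be sketched as follows. It assumes the step function returns 1 for the first n-hat positions and 1/λ_i otherwise; the `lambdas` list and the n-hat value are hypothetical placeholders, not factors found by the paper's search.

```python
def longrope_angles(position, dim, lambdas, n_hat, base=10000.0):
    """Angle for each dimension pair i: n / beta^i, divided by lambdas[i]
    unless the position is among the first n_hat tokens."""
    beta = base ** (2 / dim)
    angles = []
    for i in range(dim // 2):
        keep_original = position < n_hat
        rescale = 1.0 if keep_original else 1.0 / lambdas[i]
        angles.append(rescale * position / beta ** i)
    return angles


DIM = 8
LAMBDAS = [1.0, 1.5, 3.0, 8.0]  # hypothetical search result, one factor per pair
N_HAT = 4                       # hypothetical: first 4 positions stay untouched

early = longrope_angles(2, DIM, LAMBDAS, N_HAT)      # inside the protected prefix
late = longrope_angles(100, DIM, LAMBDAS, N_HAT)     # rescaled per dimension
plain_late = longrope_angles(100, DIM, [1.0] * 4, 0)  # ordinary RoPE for comparison

print(early)
print(late)
```

Unlike linear PI, each dimension pair gets its own factor, and the earliest positions are left exactly as the original model saw them.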
Now that we’ve covered the theory, let’s see the results!
LongRoPE works both with and without fine-tuning. The graph above shows the performance of LongRoPE when applied to LLaMA2-7B. The original context for that model was 4k. By finding the optimal λ, they were able to expand the context window to 32k tokens without a noticeable change in perplexity! What makes this so striking is that the compute needed for a change like this is almost negligible compared to the cost of fine-tuning. An 8x expansion without major compute spend is incredible.
Getting a much larger expansion does require a combination of fine-tuning and searching for the optimal λ. The researchers in the paper got a 512x expansion following this methodology. They first searched for the factors that take the model to 128k and 256k. They fine-tuned for 400 steps with the 128k factors and then switched to the 256k factors for an additional 600 steps. As this worked better than directly fine-tuning at 256k, it appears that learning a more general distribution, rather than just one of the scaled ones, gives better performance. They then searched for the best λ again and reached a context window of 2048k, a 512x increase over the original 4k context window!
One of the difficulties of a larger context is a loss of performance for tasks with small contexts. This behavior has been seen before, and the theory is that data at the beginning gets condensed into a smaller range, resulting in some attention loss.
They resolved this in the 2048k context window model by finding the ideal λ for shorter lengths (in the paper this was 4k and 8k). During inference, if the context is determined to be small, the LLM will dynamically shift to using the smaller λ for positional encoding data.
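The switching logic might look something like the sketch below. The table of factors is entirely hypothetical (the numbers are placeholders, not values from the paper), and the rule of picking the smallest window that fits the input is an assumption about how such a selection could be implemented.

```python
# Hypothetical table: one set of rescale factors per target window size.
# The values are placeholders for illustration, not searched factors.
FACTOR_SETS = {
    4 * 1024: [1.0, 1.0, 1.2, 1.5],
    8 * 1024: [1.0, 1.1, 1.6, 2.0],
    2048 * 1024: [1.0, 4.0, 90.0, 512.0],
}


def select_factors(sequence_length, factor_sets=FACTOR_SETS):
    """Return (window, factors) for the smallest window that fits the input."""
    for window in sorted(factor_sets):
        if sequence_length <= window:
            return window, factor_sets[window]
    raise ValueError(f"sequence of {sequence_length} tokens exceeds every window")


for length in (3_000, 5_000, 100_000):
    window, factors = select_factors(length)
    print(length, "->", window, factors)
```

Short prompts get the gentler factors searched for short windows, and only genuinely long inputs pay the cost of the aggressive rescaling.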
LLMs are tremendous at reasoning and they continue to amaze us with their applications in the real world. With a larger context window, especially one that can be obtained at limited cost with still high performance, we will only see their applications grow.
One interesting question is whether dynamic positional encoding calculations are the way of the future. If you can fine-tune on multiple position encodings and get quality performance for 2 λ’s, then it may be that we have 1 model that can seamlessly switch between multiple λ’s at inference time.
One of the things I find most exciting about the LLM space is the potential to sift through data. While the internet has done an amazing job democratizing access to information, it has unfortunately also inundated our lives with noise. There are many things we are shown online that have almost no consequence to us. With a tool that can pull out the important information from the mundane and even deleterious, we can use the internet to its full potential.
With larger context windows, the LLM’s capacity to summarize and condense information can be used to even greater effect. There may even come a time when great leaps forward come from giving LLMs two seemingly disparate sets of information and having them figure out something new that can be reasoned given the premises in each set.
It’s an exciting time to be building.