Let’s get started with our main activity.
The design decisions left for us in the model architecture are typically expressed as hyperparameters. For LoRA specifically, we can define which modules to adapt and how large r should be for each module’s adapter.
In the last article we only suggested selecting these modules based on our understanding of the task and the architecture.
Now, we’ll dive deeper. Where should we apply finetuning at all?
In the illustration above, you can see all the potential modules that we could finetune–including the classifier and the embeddings–on the left. On the right, I’ve made a sample selection for the illustration. But how do we arrive at an actual selection?
Let’s look at our options from a high level:
- Classifier
It is clear that we absolutely need to train the classifier. This is because it has not been trained during pre-training and, hence, for our finetuning, it is randomly initialized. Furthermore, its central position makes it highly impactful on the model performance, as all information must flow through it. It also has the most immediate impact on the loss, since the loss calculation starts right at the classifier’s output. Lastly, it has few parameters, so it is efficient to train.
In conclusion, we always finetune the classifier, but do not adapt it (with LoRA).
- Embeddings
The embeddings reside at the bottom–close to the inputs–and carry the semantic meaning of the tokens. This is important for our downstream task. However, they are not “empty”: even without finetuning, we get everything that was learned during pre-training. The question at this point is whether finetuning the embeddings directly would give us additional abilities, and whether our downstream task would benefit from a refined understanding of the token meanings.
Let’s reflect. If this were the case, could this additional knowledge not also be learned in one of the layers above the embeddings, perhaps even more efficiently?
Finally, the embeddings typically have a lot of parameters, so if we trained them at all, we would have to adapt them with LoRA rather than finetune them directly.
Taking both aspects together, we decided to pass on this option and not make the embeddings trainable (and consequently not apply LoRA to them).
- Transformer Layers
Finetuning all parameters in the transformer layers would be inefficient. Therefore, we at least need to adapt them with LoRA to become parameter-efficient. This leads us to consider whether we should train all layers and all components within each layer, or only some layers, some components, or specific combinations of both.
There is no general answer here. We’ll adapt these layers and their modules and explore the details further in this article; a configuration sketch follows right after this list.
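To make this concrete, here is a minimal sketch of how such a selection could be expressed with Hugging Face’s peft library. This is an illustration, not the exact training code used for the experiments; the module and parameter names are assumptions based on the standard RoBERTa implementation in Transformers:

    from transformers import AutoModelForSequenceClassification
    from peft import LoraConfig, get_peft_model

    model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

    lora_config = LoraConfig(
        r=8,                                # rank of each adapter
        lora_alpha=16,                      # scaling factor (illustrative value)
        target_modules=["query", "value"],  # which linear modules to adapt; the design decision we explore below
        modules_to_save=["classifier"],     # train the classifier head fully, without a LoRA adapter
        # the embeddings are neither targeted nor saved, so they remain frozen
    )

    peft_model = get_peft_model(model, lora_config)
    peft_model.print_trainable_parameters()

The interesting part is target_modules: most of this article is about which module names to put there, and which r to choose.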
In the illustration above, on the right, you can see an exemplary selection of modules to finetune. This is just one combination, but many other combinations are possible. Keep in mind as well that the illustration only shows five layers, while your model likely has more. For instance, the RoBERTa base model–used in our example–has 12 layers, a number that is considered small by today’s standards. Each layer also has 6 components:
- Attention: Query, Key, Value, Output
- Feed Forward: Up, Down
Even if we disregard that we also want to tune r and, for now, just focus on the binary decision of which modules to include, this leaves us with 64 (2**6) combinations per layer. That only covers the combinations within one layer; since we have 12 layers that can be combined, we end up with more than a sextillion combinations:
In [1]: (2**6)**12.
Out[1]: 4.722366482869645e+21
It’s easy to see that we can’t exhaustively compute all combinations, let alone explore the space manually.
Typically in computer science, we turn to the dice when we want to explore a space that is too large to fully investigate. We could sample from that space, but how would we interpret the results? We would get back a number of arbitrary combinations of layers and components (at least 12*6=72 individual modules, following the small example above). How would we generalize from these details to find higher-level rules that align with our natural understanding of the problem space? We need to align these details with our conceptual understanding on a more abstract level.
Hence, we need to consider groups of modules and look for structures or patterns that we can use in our experiments, rather than operating on a collection of individual components or layers. We need to develop an intuition about how things should work, and then formulate and test hypotheses.
Question: Does it help to experiment on defined groups of parameters in isolation? The answer is yes. These isolated groups of parameters can lead the way even though we may need to combine some of them later to achieve the best results. Testing in isolation allows us to see patterns of impact more clearly.
However, there is a risk. When these patterns are used in combination, their impact may change. That’s not perfect, but let’s not be so negative about it 🙂 We need to start somewhere, and then refine our approach if needed.
Ready? Let’s try this out.
Tuning Vertically / Layer-wise
I suspect that the upper layers, closer to the classification head, will be more impactful than the lower layers. Here is my thinking: Our task is sentiment analysis. It would make sense, wouldn’t it, that most of the specific decisions have to be made either in the classification head or close to it? Like recognizing certain phrases (“I needed that like a hole in my head”) or composed constructs (“The check-in experience negated the otherwise wonderful service”). This would suggest that it is crucial to finetune the parameters of our network that define how different tokens are used together–in context–to create a sentiment as opposed to changing the meaning of words (in the embeddings) compared to their meaning during the pre-training.
Even if that’s not always the case, adapting the upper layers still provides the opportunity to override or refine decisions from the lower layers and the embeddings. On the other hand, this suggests that finetuning the lower layers is less important.
That sounds like a solid hypothesis to try out (Oops. Message from future Mariano: Don’t stop reading here).
As an aside, we are not reflecting on the general necessity of the embeddings or any of the transformer layers. That decision has already been made: all of them were part of the pre-training and will be part of our finetuned model. What we’re considering at this point is how we can best help the model learn about our downstream task, which is sentiment analysis. The question we’re asking is: which weights should we finetune for impact and to achieve parameter efficiency?
Let’s put this to the test.
To clearly see the effect of our hypothesis, what do we test it against? Let’s design experiments that should exaggerate the effect:
- In our first experiment we finetune and adapt all components of the upper half of the model, namely layers 7–12 in our example. This is our hypothesis (a configuration sketch follows this list).
- In contrast, we run another experiment where we only finetune the layers in the lower half of the model. Specifically, we train layers 1–6 with all components. That’s the opposite of our hypothesis.
- Let’s consider another contrastive hypothesis as well: that a light touch to all layers is more beneficial than just tuning the top layers. So, let’s also include a third scenario where we finetune half of the layers but spread them out evenly.
- Let’s also include an experiment where we tune all layers (not depicted in the illustration above). This is not a fair performance comparison as we train twice as many parameters as in the first three experiments. However, for that reason, it highlights how much performance we potentially lose in the previous scenarios where we were tuning only half the number of parameters.
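For illustration, here is a hedged sketch of how these four layer selections could be expressed with the peft library’s layers_to_transform option. The layer indices (0 for the lowest layer) and the layers_pattern value are assumptions about RoBERTa’s module naming; the actual experiment code may differ:

    from peft import LoraConfig

    # Layer selections for the four scenarios (0 = lowest of the 12 RoBERTa base layers)
    scenarios = {
        "lower": list(range(0, 6)),      # layers 1-6
        "upper": list(range(6, 12)),     # layers 7-12, our hypothesis
        "even":  list(range(0, 12, 2)),  # half the layers, spread out evenly
        "all":   list(range(0, 12)),     # all layers, twice the trainable parameters
    }

    configs = {
        name: LoraConfig(
            r=8,
            # all six components per layer: attention Q/K/V/O plus the feed-forward up/down projections
            target_modules=["query", "key", "value", "attention.output.dense",
                            "intermediate.dense", "output.dense"],
            layers_to_transform=layers,
            layers_pattern="layer",          # assumption: RoBERTa stacks its blocks under encoder.layer.<i>
            modules_to_save=["classifier"],  # the classifier is always trained
        )
        for name, layers in scenarios.items()
    }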
In summary, we have 3+1 scenarios that we want to run as experiments. Here are the results:
Execution of Experiments:
We start by using the already tuned learning rate and number of epochs. Then, we run trials (training runs) with different values for the scenario settings, such as lower, upper, even, all. Within AMT, we run these experiments as a Grid Search.
Question: Grid Search is known to be simple, but inefficient in finding the best solution. So why are we using it?
Let’s take a step back. If we were to run a few trials with Bayesian Search, we’d quickly learn about hyperparameter values that perform well. This would bias the subsequent trials to focus on these values, i.e., predominantly stay closer to known good values. While increasingly exploiting what we learn about the search space is a good strategy for finding the best values, its bias makes it difficult to understand the explored space, as we under-sample in areas that showed low performance early on.
With Grid Search, we can precisely define which parameter values to explore, making the results easier to interpret.
In fact, if you were to look at the provided code, you’d see that AMT would normally reject sampling the same hyperparameter values more than once. But repeated samples are exactly what we want here; hence, we introduce a dummy variable with values from 0 to the number of trials we want to conduct. This allows us to repeat the trials with the same hyperparameter values and estimate the standard deviation of each combination.
Above, we used 5 trials per combination for the already tuned baseline scenario, to see how well we can reproduce a chosen combination of hyperparameter values. Here, we use 7 trials per combination to get a slightly more precise estimate of each combination’s variance, so that we can see tiny differences.
The same principles apply to the following two scenarios in this article and will not be mentioned again.
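As a rough sketch of what this setup looks like with SageMaker AMT (assuming an estimator and metric definitions that are configured elsewhere; the hyperparameter names and metric name are illustrative):

    from sagemaker.tuner import HyperparameterTuner, CategoricalParameter

    # Grid over the four layer-selection scenarios. The dummy "trial" parameter makes each
    # combination unique, so AMT runs it repeatedly and we can estimate its variance.
    hyperparameter_ranges = {
        "layer_scenario": CategoricalParameter(["lower", "upper", "even", "all"]),
        "trial": CategoricalParameter([str(i) for i in range(7)]),  # 7 repetitions per scenario
    }

    tuner = HyperparameterTuner(
        estimator=estimator,                     # assumed: a pre-configured Hugging Face estimator
        objective_metric_name="valid_accuracy",  # assumed to match an entry in metric_definitions
        metric_definitions=metric_definitions,   # assumed to be defined elsewhere
        hyperparameter_ranges=hyperparameter_ranges,
        strategy="Grid",                         # evaluate every combination exactly once
        objective_type="Maximize",
    )
    tuner.fit(inputs)  # inputs: the training/validation channels, defined elsewhere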
Let’s get the easy thing out of the way first: as expected, tuning all layers, and consequently using double the number of parameters, improves performance the most. This improvement is evident in the bottom figure.
Also, the peaks of all scenarios, as shown in the density plots on the right of the individual figures, are relatively close. When comparing these peaks, which represent the most frequently observed performance, we only see an improvement of ~0.08 in validation accuracy between the worst and best scenario. That’s not much. Therefore, we consider it a wash.
Regardless, let’s still examine our original hypothesis: We (me, really) expected that finetuning the upper six layers would yield better performance than finetuning the lower six layers. However, the data disagrees. For this task it makes no difference. Hence, I need to update my understanding.
We have two potential takeaways:
- Spreading the layers evenly is a little better than focusing on the top or bottom layers. That said, the improvement is so small that this insight may be brittle and might not generalize well, not even to new runs of the same model. Hence, we will discard our “discovery”.
- Tuning all layers, at double the cost, produces marginally better results. This outcome, however, surprises no one. It is still good to see it confirmed, though, as we otherwise would have found an opportunity to save trainable parameters, i.e., cost.
Overall, good to know all of that, but as we do not consider it actionable, we are moving on. If you are interested, you can find more details in this notebook.
Tuning Horizontally / Component-wise
Within each transformer layer, we have four learned projections used for attention that can be adapted during finetuning:
- Q — Query, 768 -> 768
- K — Key, 768 -> 768
- V — Value, 768 -> 768
- O — Output, 768 -> 768
In addition to these, we use two linear modules in each position-wise feedforward layer that live within the same transformer layer as the projections from above:
- Up — Up projection, 768 -> 3072
- Down — Down projection, 3072 -> 768
We can already see from the numbers above that the feedforward (ff) modules are four times as large as the Q, K, V, and O projections we previously discussed. Hence, the ff components will potentially have a larger impact, and will certainly have a higher cost.
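To put numbers on that size difference, compare the weight matrices of the original modules (ignoring biases):

In [2]: 768 * 768, 768 * 3072
Out[2]: (589824, 2359296)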
Besides this, what other expectations could we have? It’s hard to say. We know from Multi-Query Attention [3] that the query projection is particularly important, but does this importance hold when finetuning with an adapter on our task (as opposed to, for example, pre-training)? Instead, let’s try out what the impact of the individual components is and proceed based on those results. We will be able to see which components are the strongest and maybe this will allow us to just pick those for tuning going forward.
Let’s run these experiments and inspect the results:
As was to be expected, the ff layers use their four-times size advantage to outperform the attention projections. Still, we can see that there are differences within these two groups. These differences are relatively minor, and if you want to leverage them, it’s necessary to validate their applicability for your specific task.
An important observation is that by merely tuning one of the ff layers (~0.943), we could almost achieve the performance of tuning all modules from the “LoRA Base” scenario (~0.946). Consequently, if we’re looking to balance between overall performance and the parameter count, this could be a good strategy. We’ll keep this in mind for the final comparison.
Within the attention projections (middle figure), it turns out that the query projection was not as impactful as expected. In contrast, the output and value projections proved more useful. However, on their own, they were not that impressive.
So far, we have looked at the individual contributions of the components. Let’s also check if their impact overlaps or if combining components can improve the results.
Let’s run some of the possible combinations and see if this is informative. Here are the results:
Looking at the numbers charted above, the first takeaway is that we see no performance regressions. Given that we added more parameters and combined existing components, that’s how it should be. Nevertheless, there is always the chance that, when design decisions are combined, their combined performance is worse than their individual performance. Not here though, good!
We should not over-interpret the results, but it is interesting to recognize that, when we tested our hypotheses individually, the output projection’s performance was slightly ahead of the value projection’s. Here, in combination with the position-wise feed-forward up projection, this relationship is reversed (now: o+up ~0.945, v+up ~0.948).
We also recognize from the previous experiment that the up projection was already performing at almost that level on its own. Therefore, we keep our enthusiasm in check, but include this scenario in our final comparison; if only because we get a performance that is slightly better than when tuning and adapting all components in all layers (“LoRA Base”), but with far fewer parameters.
You can find more details in this notebook.
We know from the literature [2] that it is recommended to use a small r value, meaning that r is only a fraction of the minimum dimension of the original module, e.g. to use 8 instead of 768. However, let’s validate this for ourselves and get some empirical feedback. Could it be worth investigating a larger value for r, despite the conventional wisdom?
For the previous trials, we used r=8 and invested more time to tune the learning rate and the number of epochs for this value. Trying different values for r will significantly alter the capacity of the linear modules. Ideally, we would re-tune the learning rate for each value of r, but we aim to be frugal. Consequently, for now, we stick to the same learning rate. However, the farther we move away from our tuned r=8 value, the stronger the need to re-tune the other hyperparameters mentioned above.
Keep this consideration in mind when reviewing the results.
In the first figure, we see that the model performance is not particularly sensitive to additional capacity, with good performances at r=4 and r=8. r=16 was a tiny bit better, but is also more expensive in terms of parameter count. So let’s keep r=4 and r=8 in mind for our final comparison.
To see the effect of r on the parameter count, we will also include r=1 in the final comparison.
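As a back-of-the-envelope check on how the adapter parameter count scales with r (counting only the LoRA matrices for all six linear modules in all 12 layers, ignoring biases and the classifier head):

    # LoRA adapter parameters: each adapted module of shape (d_in, d_out) adds r*(d_in + d_out) weights
    def lora_params(r, layers=12, d=768, d_ff=3072):
        attn = 4 * r * (d + d)                 # Q, K, V, O adapters
        ff = r * (d + d_ff) + r * (d_ff + d)   # up and down projection adapters
        return layers * (attn + ff)

    for r in (1, 4, 8, 16, 32):
        print(r, lora_params(r))
    # 1 165888
    # 4 663552
    # 8 1327104
    # 16 2654208
    # 32 5308416

Relative to the roughly 125M parameters of RoBERTa base, even r=32 stays in the low single-digit percent range; the adapter parameter count grows linearly with r.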
One odd thing to observe in the figures above is that the performance falls off sharply at r=32. Giving a model that uses residual connections more capacity should yield the same or better performance than a lower capacity. This is clearly not the case here. But as we tuned the learning rate for r=8, and we now have many more learnable parameters with r=32 (see the upper right panel in the preceding figure), we should also reduce the learning rate, or ideally, re-tune the learning rate and number of epochs to adapt to the much larger capacity. Looking at the lower right panel in the previous figure, we should then also consider adding more regularization to deal with the more pronounced overfitting we see.
Despite the general potential for improvement when providing the model with more capacity, the other values of r we observed did not indicate that more capacity would improve performance without also markedly increasing the number of parameters. Therefore, we’ll skip chasing an even larger r.
More details in this notebook.
Throughout this long article, we have gathered numerous analytical results. To consolidate these findings, let’s explore and compare several interesting combinations of hyperparameter values in one place. For our purposes, a result is considered interesting if it either improves the overall performance of the model or gives us additional insights about how the model works, to ultimately strengthen our intuitive understanding.
All experiments finetune the sst2 task on RoBERTa base as seen in the RoBERTa paper [1].
Execution of Experiments:
As before, when I show the results of a scenario (reported as the “target_tuner_name” column in the table above, and as labels on the y-axis in the graph), it’s based on executing the same combination of hyperparameter values five times. This allows me to report the mean and standard deviation of the objective metric.
Now, let’s discuss some observations from the scenarios depicted in the graph above.
Classifier Only
This baseline—where we only train the classifier head—has the lowest cost. Refer to parameters_relative, which indicates the percentage of parameters needed, compared to a full finetuning. This is illustrated in the second panel, showing that ~0.5% is the lowest parameter count of all scenarios.
This has a beneficial impact on the “GPU Memory” panel (where lower is better) and markedly in the “Train Speed” panel (where higher is better). The latter indicates that this scenario is the fastest to train, because of the lower parameter count, and also because there are fewer modules to handle, as we do not add additional modules in this scenario.
This serves as an informative bare-bones baseline to see relative improvements in training speed and GPU memory use, but also highlights a tradeoff: the model performance (first panel) is the lowest by a wide margin.
This scenario also reveals that 0.48% of the full finetuning parameters is our minimum parameter count; that fraction is allocated exclusively to the classifier. As all other scenarios tune the classifier as well, they consistently include that 0.48% in addition to whatever parameters they further tune.
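That figure is plausible from a quick back-of-the-envelope calculation, assuming the standard RoBERTa classification head (a dense 768-to-768 layer plus a 768-to-2 output projection, with biases) and roughly 125M total parameters:

    head_params = (768 * 768 + 768) + (768 * 2 + 2)
    print(head_params)                # 592130
    print(head_params / 125_000_000)  # ~0.0047, i.e., roughly 0.5% of the model

The exact percentage depends on how the total parameter count is measured, but it lands in the same ballpark as the reported 0.48%.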
LoRA Base
This scenario serves as the foundation for all experiments beyond the baselines. We use r=8 and adapt and finetune all linear modules across all layers.
We can observe that the model performance matches the full finetuning performance. We might have been lucky in this case, but the literature suggests that we can expect to nearly match the full finetuning performance with just about 1% of the parameters. We can see evidence of this here.
Additionally, because of adapting all linear modules, we see that the train speed is the lowest of all experiments and the GPU memory utilization is amongst the highest, but in line with most of the other scenarios.
LoRA all, r={1,4,8}
Overall, these scenarios are variations of “LoRA Base” but with different values of r. There is only a small difference in the performance. However, as expected, there is a positive correlation between r and the parameter count, and a slightly positive correlation between r and GPU memory utilization. Despite the latter, the value of r remains so low that this does not have a substantial impact on the bottom line, specifically the GPU memory usage. This confirms what we explored in the original, component-wise experiments discussed above.
When reviewing r=1, however, we see that this is a special case. With 0.61% for the relative parameter count, we are just a smidgen above the 0.48% of the “Classifier Only” scenario. But we see a validation accuracy of ~0.94 with r=1, compared to ~0.82 with “Classifier Only”. With just 0.13% of the total parameters, adapted solely in the transformer layers, we can elevate the model’s validation accuracy by ~0.12. Bam! This is impressive, and hence, if we are interested in a low parameter count, this could be our winner.
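This is consistent with the back-of-the-envelope estimate from the discussion of r above: with r=1, the adapters for all six linear modules across the 12 layers come to roughly 166k parameters.

    # Using the lora_params helper sketched earlier; ~125M total parameters assumed
    print(lora_params(1), lora_params(1) / 125_000_000)  # 165888, ~0.0013 (about 0.13%)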
Regarding GPU memory utilization, we’ll review this a bit later. But briefly, besides allocating memory for each parameter in the model, the optimizer, and the gradients, we also need to keep the activations around to calculate the gradients during backpropagation.
Additionally, larger models will show a bigger impact of choosing a small value for r.
For what it’s worth, the scenario “LoRA all, r=8” used identical hyperparameter values to “LoRA Base”, but was executed independently. It was still evaluated to make it easier to compare r=1, r=4, and r=8 side by side.
LoRA ff_u
In this scenario we are tuning only the position-wise feed forward up projections, across all layers. This leads to a reduction in both the number of parameters and the number of modules to adapt. Consequently, the data shows an improvement in training speed and a reduction in GPU memory utilization.
But we also see a small performance hit. For “LoRA Base” we saw ~0.946, while in this scenario we only see ~0.942, a drop of ~0.004.
Details on the comparisons in this notebook.
When looking at the GPU memory panel above, two things become obvious:
One — LoRA, on its own, does not dramatically reduce the memory footprint
This is especially true when we adapt small models like RoBERTa base with its 125M parameters.
In the previous article’s section on intrinsic dimensionality, we learned that for current generation models (e.g., with 7B parameters), the absolute value of r can be even smaller than for smaller capacity models. Hence, the memory-saving effect will become more pronounced with larger models.
Additionally, using LoRA makes quantization easier and more efficient – a perfect match. With LoRA, only a small percentage of parameters needs to be processed at high precision, because we update the parameters of the adapters, not the weights of the original modules. Hence, the majority of the model weights can be quantized and used at much lower precision.
Furthermore, we typically use AdamW as our optimizer. Unlike SGD, which tracks only a single global learning rate, AdamW tracks moving averages of both the gradients and the squares of the gradients for each parameter. This implies that for each trainable parameter, we need to keep track of two values, which could potentially be in FP32. This can be quite costly. However, as described in the previous paragraph, when using LoRA, we only have a few trainable parameters. This significantly reduces the cost, so that we can use the typically parameter-intensive AdamW, even with large r values.
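To get a feeling for the magnitude, here is a rough sketch that only counts AdamW’s two FP32 moment estimates per trainable parameter (weights, gradients, and activations are ignored; the trainable-parameter counts are approximate):

    # AdamW keeps two FP32 moments (4 bytes each) per trainable parameter
    def adamw_state_mib(trainable_params):
        return trainable_params * 2 * 4 / 2**20

    print(adamw_state_mib(125_000_000))  # full finetuning: ~954 MiB of optimizer state
    print(adamw_state_mib(1_900_000))    # LoRA Base (~1.9M trainable parameters): ~14 MiB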
We may look into these aspects in part four of our article series, given enough interest from you, dear reader.
Two – GPU memory utilization is only indirectly correlated with parameter count
Wouldn’t it be great if there were a direct linear relationship between the parameter count and the needed GPU memory? Unfortunately, there are several findings in the diagrams above that illustrate that it is not that easy. Let’s find out why.
First we need to allocate memory for the model itself, i.e., storing all parameters. Then, for the trainable parameters, we also need to store the optimizer state and gradients (for each trainable parameter individually). In addition we need to consider memory for the activations, which not only depends on the parameters and layers of the model, but also on the input sequence length. Plus, it’s crucial to remember that we need to maintain those activations from the forward pass in order to apply the chain rule during the backward pass to do backpropagation.
If, during backpropagation, we were to re-calculate the activations for each layer when calculating the gradients for that layer, we would not need to maintain the activations for so long and could save memory at the cost of increased computation.
This approach is known as gradient checkpointing. The amount of memory that can be saved depends on how much additional memory for activations needs to be retained. It’s important to remember that backpropagation involves repeatedly applying the chain rule, step by step, layer by layer:
Recap — Chain Rule during Back Propagation
During backpropagation, we calculate the error at the top of the network (in the classifier) and then propagate the error back to all trainable parameters that were involved. These parameters are adjusted based on their contributions to the error, to do better in the future. We calculate the parameters’ contributions by repeatedly applying the chain rule, starting at the top and traversing the computation graph towards the inputs. This is necessary because any change in a parameter on a lower layer can potentially impact the parameters in all the layers above.
To calculate the local gradients (for each step), we may need the values of the activations for all the steps between the respective trainable parameter and the top (the loss function which is applied at the classification head). Thus, if we have a parameter in one of the top layers (close to the head), we need to maintain fewer activations compared to when training a parameter in the lower layers. For those lower layer parameters, we need to traverse a much longer graph to reach the classification head and, hence, need to maintain more memory to keep the activations around.
In our specific model and task, you can see the effect illustrated below. We train an individual model for each layer, in which only that particular layer undergoes training. This way, we can isolate the effect of the layer’s relative position. We then plot the amount of GPU memory required for each model, and therefore for each layer, during training.
In the graph below (see left panel) you can see that if we are closer to the bottom of the model (i.e., low layer number) the GPU memory requirement is lower than if we are close to the top of the model (i.e., high layer number) where the loss originates.
With gradient checkpointing enabled (see right panel), we can no longer recognize this effect. Instead of saving the activations until backpropagation, we re-calculate them when needed. Hence, the difference in memory usage between the left and right panels is the activations that we maintain for the backward pass.
Execution of Experiments:
As with previous experiments, I used AMT with Grid Search to provide unbiased results.
It is important to remember that recalculating the activations during backpropagation is slow, so we are trading computational speed for memory usage.
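For reference, gradient checkpointing can typically be switched on in Hugging Face Transformers in one of two ways; this is a generic sketch, not necessarily the exact code behind the figures above:

    from transformers import TrainingArguments

    # Option 1: enable it directly on an already loaded model
    # (model: any PreTrainedModel, e.g. loaded via AutoModelForSequenceClassification)
    model.gradient_checkpointing_enable()

    # Option 2: let the Trainer handle it via its arguments
    training_args = TrainingArguments(
        output_dir="out",
        gradient_checkpointing=True,  # trade extra compute for lower activation memory
    )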
More details on the testing can be found in this notebook.
As an aside, to the best of my understanding, using Gradient Checkpointing should only have a non-functional impact, i.e., affect speed and memory but not the computed results. Unfortunately, this is not what I am seeing (issue). I may be misunderstanding how to use Hugging Face’s Transformers library. If anyone has an idea why this may be the case, please let me know.
Consequently, take the graphs from above with a bit of caution.
We may revisit the topic of memory in part four of this article series, although it’s not strictly a LoRA topic. If you’re interested, please let me know in the comments below.