Best tutorial on LoRA if you are interested in in-depth knowledge. The way you present the paper is simple yet effective.
@hiranhasanka2164
A year ago
Thanks for the explanation. I was looking for an in-depth explanation and couldn't find anywhere that explains LoRA like you have here.
@JM-tu8mg
A year ago
This video saves me a lot of time. Great work my friend. Appreciate it
@amirarsalanrajabi5171
A year ago
Great explanation, thank you!
@user-wr4yl7tx3w
A year ago
The diagrams sketched out are so helpful.
@kevon217
A year ago
Great explanation, thanks!
@thanosqin2005
11 months ago
Thank you so much for the video! It helped a ton! Do you have any plans for more related videos, such as on Adam or similar topics?
@gabrielmongaras
11 months ago
Yep, I plan on continuing to make videos! My goal is to make videos at least once or twice a week. I don't plan on making a video on Adam since it's been covered in great detail and the theory is pretty static at the moment. I'm thinking of covering newer papers or papers whose theory is currently relevant, like LLMs and diffusion models.
@fredrelec9311
11 months ago
Thanks Gabriel! It would be nice if you did a live coding session of your work; it would be very helpful for others.
@hawkoli1987
2 months ago
Thanks
@pavanbuduguppa2427
A year ago
Thanks for the great explanation! One question regarding the matrix 'B': when we initialize its weights to zero, won't that cause the gradients of matrix B to always be zero, preventing it from learning?
@gabrielmongaras
A year ago
Generally, if we multiply two variables: z = x*y, then the partial derivatives are dz/dx = y and dz/dy = x. In our case, if x = 0 and y = 1, then dz/dx = 1 and dz/dy = 0. So, x would update and y would not. But after the first update, since both variables are now non-zero, both gradients are non-zero. The same holds for matrices in a similar way. If we have z = A*B and B is all zeros, then its gradient is (simplifying for readability) dz/dB = A, which is nonzero since A is nonzero. We do run into an issue where every entry of B's gradient is basically the same value because we initialized B to the zero matrix. But since we multiply z by some random-like matrix, the values of the gradient are no longer all equal, so A and B update in a way where not all values are equal and the values in B are no longer 0. The next update proceeds like normal matrix optimization. Hope this clarifies things! If not, you could always try a little toy example with matrices in an autograd framework and see how the gradients of z = A@B compare with z = A@B@[1,2,3].
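The toy experiment suggested in the reply above can be sketched directly, with the gradients derived by hand instead of an autograd framework (numpy assumed; the sizes and loss here are illustrative choices, not from the video):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4, 2
A = rng.standard_normal((d, r))   # random init, like LoRA's A
B = np.zeros((r, d))              # zero init, like LoRA's B
x = rng.standard_normal(d)
ones = np.ones(d)

# For the scalar loss L = sum(A @ B @ x), the gradients (derived by hand) are:
grad_A = np.outer(ones, B @ x)     # dL/dA = 1 (Bx)^T -> all zeros, since B = 0
grad_B = np.outer(A.T @ ones, x)   # dL/dB = (A^T 1) x^T -> nonzero, since A is random

print(np.abs(grad_A).sum())  # 0.0: A does not move on the very first step
print(np.abs(grad_B).sum())  # > 0: B moves first, after which both update normally
```

This matches the reply: the zero-initialized matrix still receives a nonzero gradient because its gradient depends on the other (randomly initialized) matrix.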
@pavanbuduguppa2427
A year ago
@@gabrielmongaras very nicely explained! Thanks a ton
@davidromero1373
7 months ago
Hi, a question: can we use LoRA just to reduce the size of a model and run inference, or do we always have to train it?
@bryanw7174
A year ago
Thanks for the explanation. What's the name of the notetaking app you are using here?
@gabrielmongaras
A year ago
I'm just using the default note taking app for Samsung Note tablets. So far, it's my favorite note taking app!
@jamesyang5187
A year ago
Hello, can anyone help me answer my question about section 7.3, “HOW DOES THE ADAPTATION MATRIX ∆W COMPARE TO W?”: “…∆W only amplifies directions that are not emphasized in W. Third, the amplification factor is rather huge…”. An example: if we use a 1000-entry dataset to do a LoRA fine-tune, we get a new weight W1. Based on this W1, if we do the LoRA fine-tune again with the same dataset, will it be re-amplified (re-emphasized) once again, or remain the same? Thanks
@gabrielmongaras
A year ago
It's hard to say without experimenting with LoRA a little, but I'd imagine fine-tuning with LoRA a second time on the same dataset would result in little difference from the first set of LoRA weights: h = W0 + ∆W. Let's say we fine-tuned using LoRA on a dataset, resulting in new weights h = W0 + ∆W. Fine-tuning a second time would be fine-tuning (what I'm defining as) h2 = h + ∆∆W = W0 + ∆W + ∆∆W, where ∆W are the LoRA weights for the first fine-tuning stage and ∆∆W are the LoRA weights for the second. If we fine-tune for a long period of time, resulting in a near-optimal h matrix, then ∆∆W would likely not need to add very much to h, since h is already near optimal. If h is already near optimal, then h2 = h + ∆∆W with ∆∆W = 0 is also near optimal. Changing ∆∆W may produce a more optimal matrix, but I'd imagine that fine-tuning a second time would give similar results to just running a single LoRA fine-tuning stage for longer. So, I'm thinking that fine-tuning a second time may emphasize features (probably the same ones that were emphasized initially), but not by very much, since ∆W is already near-optimal, keeping the amplification coefficient very similar to the first LoRA fine-tuning.
@jamesyang5187
A year ago
@@gabrielmongaras Appreciate your explanation, Gabriel
@jamesyang5187
A year ago
@@gabrielmongaras I'd like to ask more about the meaning in section 7.3; there is one statement, "the amplification factor is huge". I am not familiar with the authors' examples, so my question is: suppose the original weight W is pre-trained on a dataset of 400,000 entries. Now I plan to emphasize 2,000 of them, so I use those 2,000 entries (which have already been used in pre-training) to do LoRA, and the 2,000 entries will be amplified (emphasized). (1) What is the difference between the response before and after LoRA? LoRA can compress more details from the 2,000 entries into ∆W, so will the LoRA response give more details than the pre-trained response? (2) If so, when we want to emphasize something, can we do LoRA with those entries again to increase their weight? Is this correct? Thank you.
@gabrielmongaras
A year ago
@@jamesyang5187 If W was pre-trained on a dataset of 400,000 entries and 2,000 of them are used for LoRA, then I'd imagine the LoRA matrix ∆W would emphasize some directions/features in the original W. To me, this means the model emphasizes some parts of a sequence much more than others to do whatever downstream task you fine-tune it for. So while W may not emphasize any of its features much more than others, ∆W emphasizes several features much more than W does, as it needs to specialize on a task rather than generalize to all tasks. When you fine-tune a model on a downstream task, it no longer cares about the original task, which is why it amplifies several of the features that weren't amplified in W. If I were to compute a second LoRA, ∆∆W, then I'm thinking it would be the same as just running the first LoRA for longer, in that the same features would be emphasized in ∆∆W as in ∆W if ∆W were trained longer. However, I think the main point of section 7.3 is that the model has already learned most of the features it needs for any task in the pre-training stage with W, and the LoRA matrix just emphasizes some of the less important features in W rather than finding completely new features to fine-tune the model.
@jamesyang5187
A year ago
@@gabrielmongaras Agree, your detailed explanation is so clear, thanks a lot👍👍👍
@talharuzgarakkus7768
A year ago
LoRA doesn't train the entire model, so I came up with an idea: divide the model into parts and train each part with LoRA in each epoch. Wouldn't this approach, which requires less RAM like LoRA but amounts to full fine-tuning, yield the same results?
@gabrielmongaras
A year ago
Since a transformer is only linear/affine layers, then LoRA can train all parts of the model if you put a LoRA adapter at each layer. Splitting up the model would likely produce invalid output unless you're saying you freeze a certain part of the model and train each frozen part individually. That would probably work, but may produce worse results than training the entire model with LoRA. The RAM usage wouldn't be much lower as most of the RAM goes to the model itself, not the LoRA adapters. QLoRA attempts to fix the model size constraint though.
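The reply's point that "LoRA can train all parts of the model if you put a LoRA adapter at each layer" can be sketched as a minimal LoRA-wrapped linear layer, forward pass only (numpy assumed; the class name, sizes, and hyperparameter values are illustrative, not from the video):

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer W0 plus a trainable low-rank LoRA update B @ A."""
    def __init__(self, W0, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W0.shape
        self.W0 = W0                                    # frozen pretrained weight
        self.A = rng.standard_normal((r, d_in)) * 0.01  # trainable, random init
        self.B = np.zeros((d_out, r))                   # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # Frozen path plus the scaled low-rank adapter path.
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

W0 = np.eye(6)          # stand-in for one pretrained weight matrix
layer = LoRALinear(W0)
x = np.arange(6.0)
# At init B is zero, so the adapter contributes nothing to the output:
assert np.allclose(layer(x), W0 @ x)
```

Wrapping every linear/affine layer of a transformer this way is what lets LoRA touch the whole model while keeping only A and B trainable.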
@talharuzgarakkus7768
A year ago
@@gabrielmongaras What I wanted to say is: in the first epoch I put an adapter on one part of the model and freeze the rest; in the second epoch, another part; so over all the epochs we train the whole model, and the model's performance changes in proportion to how we divide the adapters, but a more successful result than plain LoRA emerges. Am I wrong? Is it possible?
@talharuzgarakkus7768
A year ago
@@gabrielmongaras And this new technique would have the same RAM usage as LoRA.
@gabrielmongaras
A year ago
@@talharuzgarakkus7768 oh yeah that's definitely possible. It may require longer training to achieve similar results as normal LoRA because each adapter isn't updated per epoch, but rather every few epochs meaning each adapter sees a lot less data than usual. Though the memory constraint doesn't usually come from the adapters themselves, rather it's the model that's the memory bottleneck. Would try it out though!
@talharuzgarakkus7768
A year ago
@@gabrielmongaras Oh right, how can we try this?
@agdsam
9 months ago
Can you clarify whether all the benefits of LoRA are during fine-tuning time, with no benefits accruing during inference time?
@gabrielmongaras
9 months ago
Depends on what you mean by benefits. The LoRA adapters are used to fine-tune the model, but since there are fewer parameters than when fine-tuning all model parameters, this fine-tuning is quicker and has a smaller memory footprint. As for inference, the LoRA adapters are used to convert the weights to their fine-tuned versions, allowing the model to perform better on the downstream task you fine-tuned it for. However, there are no speed benefits during inference, as the model has the exact same number of parameters as before fine-tuning (assuming the LoRA weights were merged with the original). So, LoRA has benefits for both!
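The merge mentioned in this reply can be sketched in a few lines (numpy assumed; the sizes and the alpha/r scaling value are illustrative, though the scaling form follows the LoRA paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W0 = rng.standard_normal((d, d))  # frozen pretrained weight
B = rng.standard_normal((d, r))   # LoRA B after fine-tuning (no longer zero)
A = rng.standard_normal((r, d))   # LoRA A after fine-tuning
alpha = 16                        # LoRA scaling hyperparameter (illustrative value)

# Merge once offline; serving W_merged then costs exactly as much as serving W0.
W_merged = W0 + (alpha / r) * (B @ A)

x = rng.standard_normal(d)
# The merged forward pass matches the unmerged adapter forward pass:
assert np.allclose(W_merged @ x, W0 @ x + (alpha / r) * (B @ (A @ x)))
```

Since W_merged has the same shape as W0, inference is neither faster nor slower than the original model, which is the point made above.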
@agdsam
9 months ago
@@gabrielmongaras thanks Gabriel
@RadiCho
A year ago
ABx is different from BAx, isn't it? When you're writing ABx you are multiplying (r x d) with (d x r) to get (r x r), but you should instead get (d x d) with the reverse order. Actually, the confusion probably comes from you denoting A as (d x r), while in the paper it is (r x k), with k=d in your specific context.
@gabrielmongaras
A year ago
In the paper's case, since they use B in (d x r) and A in (r x d), the matrix multiplication must be BA. However, since I transpose these matrices to help visualize them better, with A in (d x r) and B in (r x d), the matrix multiplication is AB. Also, I use k=d as it helps simplify things, but in the case where dimensionality changes due to W, A (in the paper) would be in (r x k), and B (in my formulation) would be in (r x k).
@RadiCho
A year ago
@@gabrielmongaras But you're not transposing them in your example (12:45). You draw A as (1 x 512) which is (r x d) isn't it?
@gabrielmongaras
A year ago
@@RadiCho That's a good point. I usually draw my matrices in whichever permuted way is easiest for me to visualize. In this case, I thought it'd be easier to draw A as a 1x512 matrix and B as a 512x1 matrix without really thinking about the dimension ordering. I try to label the dimensions of the input and output of the transformation so that the dimension along which the matrix multiplication happens (in any permuted layout) can be inferred from the dimensions of the output matrix.
@amelieschreiber6502
A year ago
@@gabrielmongaras At 14:23 if you do A, then B, you need BAx. You've got B applied first, then A, that is A(B(x)). Regardless of your dimensions, you've applied B first, which does not match your diagram.
@gabrielmongaras
A year ago
@@amelieschreiber6502 @RadiCho I see what y'all are saying. Thanks for letting me know! I need to remember that ABx means (AB)x, which applies B to x first; applying A first and then B is BAx. Gonna add a card stating this issue 😅
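The shape mismatch this thread is pointing at can be checked in a couple of lines (numpy assumed, using the paper's layout with r=1 and d=512 as in the video's example):

```python
import numpy as np

d, r = 512, 1
A = np.ones((r, d))   # the paper's A: (r x d)
B = np.ones((d, r))   # the paper's B: (d x r)

# BA gives the (d x d) update ΔW, and (BA)x applies A to x first, then B:
assert (B @ A).shape == (d, d)
# Multiplying the other way collapses to (r x r), the wrong shape for ΔW:
assert (A @ B).shape == (r, r)
```

So with the paper's shapes, only BA produces an update that can be added to the (d x d) weight W.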
@mahandolatabadi2600
A year ago
why don't you make more videos?
@gabrielmongaras
A year ago
Sorry, been really busy these past few weeks. However, I'm going to record a video on Drag GAN today!
Comments: 46