The KV cache accounts for the bulk of GPU memory used during inference in large language models such as GPT-4. Learn how the KV cache works in this video!
0:00 - Introduction
1:15 - Review of self-attention
4:07 - How the KV cache works
5:55 - Memory usage and example
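As a companion to the memory-usage section, here is a minimal sketch of how KV-cache size is commonly estimated: each layer stores one key and one value vector per token per attention head. The function name and the GPT-3-like configuration below are illustrative assumptions, not taken from the video.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2: each layer caches both a key and a value tensor.
    # Each is of shape (batch, n_heads, seq_len, head_dim).
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical GPT-3-175B-like config: 96 layers, 96 heads,
# head_dim 128, fp16 (2 bytes per element), 2048-token context.
gib = kv_cache_bytes(96, 96, 128, seq_len=2048, batch=1) / 1024**3
print(f"{gib:.1f} GiB")  # → 9.0 GiB for a single 2048-token sequence
```

Note that the cache grows linearly with both sequence length and batch size, which is why long contexts and large batches are so memory-hungry at inference time.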
Further reading:
* Speeding up the GPT - KV cache (www.dipkumar.dev/becoming-the...)
* Transformer Inference Arithmetic (kipp.ly/transformer-inference...)
* Efficiently Scaling Transformer Inference (arxiv.org/pdf/2211.05102.pdf)
The KV Cache: Memory Usage in Transformers