The KV cache accounts for the bulk of GPU memory used during inference in large language models such as GPT-4. Learn how the KV cache works in this video!
0:00 - Introduction
1:15 - Review of self-attention
4:07 - How the KV cache works
5:55 - Memory usage and example
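As a companion to the memory-usage section, here is a minimal sketch of how KV-cache size is commonly estimated: each layer stores one key and one value vector per token per attention head. The function name and the GPT-3-like configuration below are illustrative assumptions, not taken from the video.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2: each layer caches both a key and a value tensor.
    # Each is of shape (batch, n_heads, seq_len, head_dim).
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical GPT-3-175B-like config: 96 layers, 96 heads,
# head_dim 128, fp16 (2 bytes per element), 2048-token context.
gib = kv_cache_bytes(96, 96, 128, seq_len=2048, batch=1) / 1024**3
print(f"{gib:.1f} GiB")  # → 9.0 GiB for a single 2048-token sequence
```

Note that the cache grows linearly with both sequence length and batch size, which is why long contexts and large batches are so memory-hungry at inference time.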
Further reading:
* Speeding up the GPT - KV cache (www.dipkumar.dev/becoming-the...)
* Transformer Inference Arithmetic (kipp.ly/transformer-inference...)
* Efficiently Scaling Transformer Inference (arxiv.org/pdf/2211.05102.pdf)
The KV Cache: Memory Usage in Transformers