Recording of a presentation I delivered on 28 February 2024 for the Winter 2024 course CS 886: Recent Advances on Foundation Models at the University of Waterloo. The talk delves into novel techniques and recent research aimed at significantly improving the efficiency and scalability of Large Language Model (LLM) inference.
This lecture covers the following topics:
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Flash-Decoding for long-context inference
- Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding