Recording of a presentation I delivered on 28 February 2024 for the Winter 2024 course CS 886: Recent Advances on Foundation Models at the University of Waterloo. The talk delves into novel techniques and recent research aimed at significantly improving the efficiency and scalability of Large Language Model (LLM) inference.
This lecture covers the following topics:
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Flash-Decoding for long-context inference
- Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding