Ollama 0.2 is here! Concurrency is now enabled by default.
ollama.com/download
This unlocks two major features:
Parallel requests
Ollama can now serve multiple requests at the same time, using only a small amount of additional memory per request, as the sketch after this list illustrates. This enables use cases such as:
- Handling multiple chat sessions at the same time
- Hosting code completion LLMs for your team
- Processing different parts of a document simultaneously
- Running multiple agents at the same time
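As a rough illustration, the sketch below fires three requests at a local Ollama server at once from Python. With 0.2's default concurrency the server answers them in parallel instead of queueing them one after another. The model name llama3 is just a placeholder, substitute any model you have pulled.

```python
import concurrent.futures

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def generate(prompt: str) -> str:
    """Send one non-streaming generation request to Ollama."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [
    "Summarize the plot of Hamlet in one sentence.",
    "Write a haiku about GPUs.",
    "Explain what a mutex is in one paragraph.",
]

# Fire the requests concurrently; each thread blocks on its own HTTP call
# while the Ollama server handles the generations in parallel.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, answer in zip(prompts, pool.map(generate, prompts)):
        print(f"> {prompt}\n{answer}\n")
```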
Run multiple models
Ollama now supports loading different models at the same time, as the RAG sketch after this list shows. This improves several use cases:
- Retrieval-Augmented Generation (RAG): the embedding model and the text completion model can be loaded into memory simultaneously
- Agents: multiple versions of an agent can run simultaneously
- Running large and small models side by side
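As a concrete example, the sketch below keeps an embedding model and a completion model resident side by side for a tiny RAG flow, so neither step triggers a model swap. The model names nomic-embed-text and llama3 are assumptions, any pulled embedding and chat models will do.

```python
import requests

BASE = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"  # placeholder embedding model
CHAT_MODEL = "llama3"             # placeholder completion model

def embed(text: str) -> list[float]:
    """Embed text with the embedding model (stays loaded alongside the chat model)."""
    r = requests.post(
        f"{BASE}/api/embeddings",
        json={"model": EMBED_MODEL, "prompt": text},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["embedding"]

def answer(question: str, context: str) -> str:
    """Answer a question with the completion model, grounded in the given context."""
    r = requests.post(
        f"{BASE}/api/generate",
        json={
            "model": CHAT_MODEL,
            "prompt": f"Context:\n{context}\n\nQuestion: {question}",
            "stream": False,
        },
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["response"]

doc = "Ollama 0.2 enables concurrency by default."
print(len(embed(doc)), "embedding dimensions")     # served by the embedding model
print(answer("What changed in Ollama 0.2?", doc))  # served by the chat model, no reload
```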
Models are automatically loaded and unloaded based on incoming requests and the amount of GPU memory available.
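To verify which models are resident at any moment, you can ask the server itself. Below is a minimal sketch using the /api/ps endpoint, which returns the same information the ollama ps command prints; the exact response fields may vary by version, so treat the names here as assumptions to check against your install.

```python
import requests

# List the models currently loaded into memory, mirroring `ollama ps`.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    # "size_vram" (bytes in GPU memory) may be absent on CPU-only setups.
    print(model["name"], "-", model.get("size_vram", model.get("size")), "bytes")
```

If the defaults do not fit your hardware, the concurrency limits can be tuned with the OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS environment variables.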
❤️ If you want to support the channel ❤️
Support here:
Patreon - patreon.com/1littlecoder
Ko-Fi - ko-fi.com/1littlecoder
🧭 Follow me on 🧭
Twitter - twitter.com/1littlecoder
LinkedIn - linkedin.com/in/amrrs