vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests (see the sketch after this list)
Optimized CUDA kernels
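As a minimal sketch of this (the model name and prompts are just examples), vLLM's Python API applies PagedAttention and continuous batching internally when it is handed a batch of prompts:

from vllm import LLM, SamplingParams

# Any supported Hugging Face model ID works here; opt-125m is a small example.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM batches these prompts internally and manages the attention
# KV cache with PagedAttention.
prompts = ["The capital of France is", "vLLM is"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)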
vLLM is flexible and easy to use with:
Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more (decoding sketch after this list)
Tensor parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server (launch sketch below)
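A hedged sketch of the decoding options and tensor parallelism listed above; the flags (notably use_beam_search) match the vLLM versions current when this video was made, and the model name and GPU count are assumptions:

from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across GPUs (assumes 2 GPUs here).
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)

# Parallel sampling: three independent samples per prompt.
parallel = SamplingParams(n=3, temperature=0.8)

# Beam search with three beams (beam search requires temperature 0).
beam = SamplingParams(n=3, use_beam_search=True, temperature=0.0)

for params in (parallel, beam):
    for output in llm.generate(["The meaning of life is"], params):
        for candidate in output.outputs:
            print(candidate.text)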
vLLM seamlessly supports many HuggingFace models
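The OpenAI-compatible server mentioned above is started with the bundled entrypoint (the model name is an example; port 8000 is the default):

python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m

Once it is running, any HTTP client can hit the OpenAI-style endpoints; a sketch in Python:

import requests

# Query the local vLLM server's completions endpoint.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # must match the served model
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["text"])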
vLLM - github.com/vll...
Google Colab - colab.research...
❤️ If you want to support the channel ❤️
Support here:
Patreon - /1littlecoder
Ko-Fi - ko-fi.com/1lit...