It's not clear whether I can run NIM locally and get the claimed 5x performance improvement or not.
@petergasparik924
3 months ago
I'm curious too.
@engineerprompt
3 months ago
Here are the configurations they used for running the tests on the H100:
- Model: Llama 3-70B-Instruct, input token length: 7,000, output token length: 1,000
- Concurrent client requests: 100, on 4x H100 SXM with NVLink
- NIM off (FP16): TTFT ~120s, ITL ~180ms
- NIM on (FP8): TTFT ~4.5s, ITL ~70ms
You can run NIM locally on a Tensor Core GPU, but the performance you get depends on your configuration and hardware, so your mileage may vary.
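For anyone who wants to try it locally: a running NIM container serves an OpenAI-compatible API, so a minimal smoke test can look like the sketch below. The port (8000) and the model identifier are assumptions based on common NIM setups, not values from the video; check your own deployment's docs.

```python
# Minimal sketch of querying a locally running NIM container.
# Assumptions (not from the video): the container serves the
# OpenAI-compatible API on port 8000, and the model identifier
# is "meta/llama3-70b-instruct". Adjust both for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="meta/llama3-70b-instruct",
    messages=[{"role": "user", "content": "Summarize NIM in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```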
@rubencontesti221
2 months ago
@engineerprompt Could you provide the link to those metrics? Thank you!
@jonathanfranklin461
A month ago
I think Triplex/R2R might be more affordable, but it would be interesting to see a comparison of result quality with GraphRAG and others. Thanks for the video.
@zikwin
3 months ago
I don't have a friend kind enough to give me access to an H100.
@petergasparik924
3 months ago
Hi, are you sure the inference speed on the H100 is correct? On my RTX 4090 with Llama 3 Instruct 8B Q8_0, inference speed is about 72 t/s, so you're getting a lower speed than I am.
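Worth noting when comparing these numbers: the ITL in the quoted benchmark is measured per stream under 100 concurrent requests, and the model is 70B FP16 rather than 8B Q8_0, so the runs aren't directly comparable. A rough back-of-the-envelope sketch, assuming the configuration quoted above:

```python
# Back-of-the-envelope math, assuming the quoted test configuration
# (Llama 3 70B FP16, 100 concurrent requests). ITL is per stream, so
# single-stream speed looks low next to a one-user 4090 run, but
# aggregate throughput across all streams is much higher.
itl_seconds = 0.180    # ~180 ms inter-token latency per stream (NIM off)
concurrency = 100      # concurrent client requests in the test

per_stream_tps = 1 / itl_seconds                # ~5.6 tokens/s per request
aggregate_tps = per_stream_tps * concurrency    # ~556 tokens/s overall

print(f"per stream: {per_stream_tps:.1f} t/s, aggregate: {aggregate_tps:.0f} t/s")
```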
@rousabout7578
3 months ago
Is this correct? For production use, NIM is part of NVIDIA AI Enterprise, which has different pricing models:
- On Microsoft Azure, there's a promotional price of $1 per GPU per hour, though this is subject to change.
- For on-premises or other cloud deployments, NVIDIA AI Enterprise is priced at $4,500 per year per GPU.
@engineerprompt
3 months ago
Here is the info: resources.nvidia.com/en-us-ai-enterprise/en-us-nvidia-ai-enterprise/nvidia-ai-enterprise-licensing-guide
@قيمنقعبود-ب2ل
3 months ago
Thank you. Amazing channel
@engineerprompt
3 months ago
Thanks
@orlingueorguiev
3 months ago
Can you provide a benchmark comparison for when using the Ollama server? I really want to see whether the claimed performance improvement is actually there.
@engineerprompt
3 months ago
Let me see if I can do a comparison between the different options (Ollama, llama.cpp, vLLM, and NIM). Here is a blog post from NVIDIA that might be helpful (note: the numbers there are for the 8B model; the results I showed in the video are for a different configuration, 70B): tinyurl.com/as7uvbv8
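If anyone wants to try such a comparison themselves, here is a rough sketch: Ollama, vLLM, and NIM all expose OpenAI-compatible endpoints (llama.cpp's server does too), so one script can measure TTFT and ITL against each. The URL and model name at the bottom are placeholders for your own setup, not values from the video.

```python
# Hedged sketch for measuring TTFT (time to first token) and ITL
# (inter-token latency) against any OpenAI-compatible endpoint.
# Endpoint URLs and model names are placeholders; swap in your own.
import time
from openai import OpenAI

def measure(base_url: str, model: str, prompt: str, max_tokens: int = 256):
    client = OpenAI(base_url=base_url, api_key="not-needed-locally")
    start = time.perf_counter()
    first_token_time = None
    token_count = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            now = time.perf_counter()
            if first_token_time is None:
                first_token_time = now
            token_count += 1
    end = time.perf_counter()
    ttft = first_token_time - start
    # Average gap between tokens after the first one arrives.
    itl = (end - first_token_time) / max(token_count - 1, 1)
    return ttft, itl

# Example against a local Ollama server (default port 11434).
ttft, itl = measure("http://localhost:11434/v1", "llama3", "Hello!")
print(f"TTFT: {ttft * 1000:.0f} ms, ITL: {itl * 1000:.1f} ms")
```

Note this counts streamed chunks rather than exact tokens, so treat the ITL figure as approximate; it is still good enough for a relative comparison between backends.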
@Nihilvs
3 months ago
Thanks! What do you actually pay for when buying NIM?
@engineerprompt
3 months ago
You are paying a license fee. My understanding is that you can run this on your own hardware, but you pay the licensing fee for using the software stack.
@Nihilvs
3 months ago
@engineerprompt Good to know! Thanks.
@chirwatra
2 months ago
How do you get/build the Grafana dashboard?
@AnilKumar-im2ur
3 ай бұрын
Is it possible to deploy Llama 3 in SageMaker? I mean, can you download it as a NIM and use it within SageMaker? Let me know if this works out.
@zerosypher0114
A month ago
So not everyone can use these?
@RickySupriyadi
3 months ago
How does NVIDIA optimize it for their software only? I'm curious what the difference is when both are using the same CUDA.
@engineerprompt
3 months ago
I think it's also that they are using TensorRT-LLM along with some other tweaks.
@RickySupriyadi
3 months ago
@engineerprompt I see. If so, then we might see another open-source framework capable of that too... can't wait for inference to get faster...
@eod9910
3 months ago
So I'll say this because evidently other people are too polite: this is absolute garbage. Who has an H100 hanging around to do this? Don't post stuff that 99% of people can't do. If you want to post stuff that only people with tens of thousands of dollars and access to this type of hardware can use, go work for one of those companies. Otherwise, you're wasting everybody's time.
@christosmelissourgos2757
3 months ago
Actually, I don't agree. We are building a product, and this is something we are really interested in.
@vitalis
3 months ago
Dude, why are you so bitter? Go out and touch grass for a bit. Have you learnt nothing from the last two decades of tech history? All industrial tech seeps through to prosumers and then the mainstream. Your local GPU's performance would have been considered alien tech not too long ago. Sheesh.
Comments: 25