WHICH LLM Will Reign Supreme? Here's an LLM Benchmark you can FEEL
Ever wondered how leading language models stack up in a head-to-head challenge? In our latest video, we simplify the complex world of LLM benchmarks with a BATTLE ROYALE that pits Groq, Gemini Pro, GPT-3.5 Turbo, and Claude Sonnet against each other. Get ready for a showdown that will change the way you look at language models.
The rules are simple...
LAST LLM STANDING WINS
Join us as we show off a unique benchmarking tool built in a couple of hours thanks to AI coding assistants like Aider and Cursor. Last LLM Standing Wins is a simple LLM benchmarking tool: you use the prompt-testing framework promptfoo to generate your tests, then visualize how the models perform in a deterministic, visual way. We set up several battles where Groq, Gemini Pro, GPT-3.5 Turbo, and Claude Sonnet fight it out over a series of NLQ-to-SQL prompt-testing challenges. Experience the thrill of live competition and gain insights like never before.
Step into the arena with us as we handpick top language models to face off on speed, accuracy, and cost. The 'Last LLM Standing Wins' benchmark makes it easy to compare the performance of different models, revealing their strengths and weaknesses. The best part: it's built on top of promptfoo, a test-driven prompt engineering framework that lets you test your prompts and truly KNOW whether your LLM is performing well for your specific use case.
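For a feel of what this looks like in practice, here's a minimal promptfoo config sketch for an NLQ-to-SQL battle. The model IDs and the `nlq_to_sql.txt` prompt file are illustrative placeholders, not the exact setup from the video:

```yaml
# promptfooconfig.yaml — a minimal NLQ-to-SQL benchmark sketch.
# Provider IDs and the prompt file are assumptions; swap in the
# models you want to pit against each other.
prompts:
  - file://nlq_to_sql.txt   # e.g. "Convert this question to SQL: {{question}}"

providers:
  - openai:gpt-3.5-turbo
  - anthropic:messages:claude-3-sonnet-20240229
  - vertex:gemini-pro
  - groq:mixtral-8x7b-32768

tests:
  - vars:
      question: "How many users signed up last month?"
    assert:
      - type: icontains
        value: "SELECT"
      - type: icontains
        value: "COUNT"
```

Running `promptfoo eval` then executes every test against every provider, and the resulting per-model pass/fail, latency, and cost numbers are exactly the kind of data a tool like 'Last LLM Standing Wins' can visualize.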
The battlefield is set, and the contenders are ready. With each model running the same tests, we meticulously evaluate their performance, uncovering strengths and weaknesses. Groq's LPU-powered Mixtral model FLASHES through the challenges, bagging the frugal award for lowest cost, while others vie for the bullseye award for error-free executions. We witness Anthropic's Claude models struggling with rate limits and GPT-3.5 Turbo striking an impressive balance between speed and accuracy.
The suspense is real as we tally the scores, revealing which LLMs stand tall and which crumble under the pressure. With a transparent, easy-to-understand benchmarking tool, we shine a light on the performance of each model, bestowing awards for efficiency and accuracy. Every moment is packed with action and enlightening insights, making it impossible to look away.
The prompt is the new fundamental unit of knowledge work and programming.
Master the prompt, and you master the LLM.
🛠️ Prompt Testing
promptfoo.dev/
🧪 Testing Driven Prompt Engineering
• Test Driven PROMPT Eng...
🆚 Gemini Pro vs GPT 3.5 turbo LLM Benchmarks
• Test Driven PROMPT Eng...
💬 Text to SQL to Results
talktoyourdatabase.com/
🏁 Claude Benchmarks
www.anthropic.com/news/claude...
📖 Chapters
00:00 LLM Benchmarks are biased and complex
00:36 Last LLM Standing Wins
02:20 GPT-4 vs Claude Opus
03:44 Promptfoo Testing Framework
05:44 Speedster Test - GPT-3.5 Turbo vs GROQ
09:00 Final Mega LLM Battle
10:50 Groq LPU Mixtral is insanely fast
12:20 Benchmark your personal prompts
13:22 A real production use case
15:10 KNOW that your prompts are working
#llmbenchmark #bestllm #openai
Last LLM Standing WINS: Groq LPU - Anthropic OPUS - OpenAI - Gemini Pro - LLM Benchmarks