7 Popular LLM Benchmarks Explained [OpenLLM Leaderboard & Chatbot Arena]

Check out my website here! leaderboard.bycloud.ai/
In this video, I go through and explain the benchmarks used by Chatbot Arena and the Open LLM Leaderboard. These are general benchmarks for text-based LLMs, so HumanEval is not covered. Let me know which other benchmarks you want me to explain in the future!
[Chatbot Arena] huggingface.co/spaces/lmsys/c...
[Open LLM Leaderboard] huggingface.co/spaces/Hugging...
[MMLU] huggingface.co/datasets/cais/...
[ARC] huggingface.co/datasets/ai2_arc
[Winogrande] huggingface.co/datasets/winog...
[TruthfulQA] huggingface.co/datasets/truth...
[GSM8K] huggingface.co/datasets/gsm8k
[MT-Bench] huggingface.co/datasets/Huggi...
This video is supported by the kind Patrons & YouTube Members:
🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO, Richárd Nagyfi
[Discord] / discord
[Twitter] / bycloudai
[Patreon] / bycloud
[Profile & Banner Art] / pygm7
[Video Editor] Silas
0:00 Intro
0:57 MMLU
1:41 ARC
2:10 HellaSwag
2:57 Winogrande
3:27 TruthfulQA
3:52 GSM8K
4:26 MT-Bench
5:05 Outro


Comments: 28

@simonstrandgaard5503
5 months ago
Another benchmark, unfortunately also named ARC, is the "Abstraction and Reasoning Corpus". The best models solve 30% of the tasks.
@Yottenburgen
5 months ago
What do you think about the discovered unreliability of the MMLU?
@arkurianstormblade4109
5 months ago
Sometimes 1 point makes a big difference! Like when Fallout: New Vegas got an 84 instead of an 85 on Metacritic and thus Bethesda denied them a cash bonus...
@theresalwaysanotherway3996
5 months ago
I'd suggest mentioning HumanEval, HumanEval+ and TACO to cover the current programming benchmarks.
@zyxwvutsrqponmlkh
5 months ago
3:28 damn, this makes GPT-3 seem like a gigachad.
@Synthetiks
5 months ago
Sorry, I've been enjoying your videos for a few months and hadn't even subscribed. Now subscribed, with notifications on. Keep making awesome videos, my dude.
@harnageaa
5 months ago
Nice editing in this video.
@BrunoSabadini7
5 months ago
Nice topic!
@Y0UT0PIA
3 months ago
Nice. You mostly bracketed the "safety" benchmark datasets that seem to be more important in other contexts (end users mostly don't care, but governments and consequently big corporations do). Maybe you could go into some of those in a future video?
@luislozano2896
5 months ago
So what benchmarks are available for stable diffusion? And how long are we going to have Bad Hands?
@inout3394
2 months ago
Once an LLM has been run on these tasks, its creators know the tasks and can prepare for the same questions. That is, next time the creators will tune the LLM to be ready for them. It's like going to an exam, writing down the questions, and solving them 100% the next time.
@wuspoppin6564
2 months ago
3:28 a 9/11 question in an LLM benchmark is crazy
@luciengrondin5802
5 months ago
What about the performance of the foundation model itself? That is, its ability to predict the next token in a large text corpus? It seems to me this metric is never mentioned.
@Veptis
5 months ago
I am working on a "creative" code (shader code) evaluation benchmark for my thesis. After reading quite a bit of the literature on code evals... it's hella messy and actually really bad. For example, I recently found that the Mixtral paper uses pass@25 for some benchmarks and pass@4 for others just to be slightly better than Llama 2... I guess they weren't better at 0-shot, 1-shot or 5-shot etc. HumanEval and all pass@k metrics are stupid, because pass@10 or pass@100 only means that the model solved the problem at least once in that many attempts. And guess what, larger models sample more diversely at high temperature, so they do better with pass@k where k is high. My thesis isn't about debunking the usefulness of pass@k, but I will have some comments about that metric. The leaderboard looks like a merging circus to me, because they fine-tune on almost the same number and barely differentiate. It's all just random jiggle based on their initial numbers.
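For readers unfamiliar with the metric discussed in the comment above, here is a minimal Python sketch of how pass@k is commonly estimated: draw n samples per problem, count the c that pass the unit tests, and compute the probability that a random subset of k samples contains at least one pass. The function name `pass_at_k` and the example numbers (100 samples, 5 correct) are illustrative assumptions, not taken from the video or from any specific leaderboard.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which passed the tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k samples include a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 100 samples were generated and 5 of them were correct.
scores = {k: pass_at_k(100, 5, k) for k in (1, 10, 100)}
for k, value in scores.items():
    print(f"pass@{k} = {value:.3f}")
```

With the same set of samples, the score can only go up as k grows, which is why results reported at different k values are not directly comparable.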
@mrrespected5948
5 months ago
Nice
@rajatajayakumar7598
3 months ago
Which website has all the 100+ benchmarks listed?
@gametophacker5047
5 months ago
Can u explain midjourney? Can u explain video generator? Can u explain plzz
@DrW1ne
5 months ago
NEW VIDEOOOO
@alexxx4434
5 months ago
You should mention the danger of training on test data, which is called 'benchmark poisoning'. A lot of recent models were removed from the HF leaderboard for that.
@I.____.....__...__
5 months ago
5:23 "It's available in both English and Japanese and Chinese".
@felipevaldes7679
5 months ago
TLDW: Here is a summary of the key points from the video. It discusses the 7 most popular benchmarks used to evaluate text-based large language models. These benchmarks are used to rank models on leaderboards and test different capabilities.
Massive Multitask Language Understanding (MMLU) - Multiple-choice questions testing knowledge across many domains. Scores models by averaging performance per category.
AI2 Reasoning Challenge (ARC) - Multiple-choice questions testing reasoning abilities at a 3rd-9th grade level. Focuses on scientific reasoning.
HellaSwag - Choose the most plausible continuation out of 4 sentences, with 3 being adversarially generated wrong answers. Tests common sense.
Winogrande - Fill-in-the-blank problems with a binary choice, in the style of the Winograd Schema Challenge. Tests common sense reasoning.
TruthfulQA - Answer questions correctly without generating conspiracy theories or false facts. Checks against spreading misinformation.
Grade School Math (GSM8K) - Multi-step math word problems testing logic and math capabilities.
MT-Bench - Multi-turn benchmark with 80 two-turn conversational questions (160 prompts in total) to test instruction following and conversational abilities. Used for chatbot leaderboards.
The benchmarks test different capabilities of language models and are used to rank them on public leaderboards like the Open LLM Leaderboard and Chatbot Arena.
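To make "averaging performance per category" concrete, here is a small Python sketch that computes accuracy per subject and then takes the unweighted mean across subjects (a macro average), next to the plain per-question accuracy (a micro average) for comparison. The record layout, the subject names, and the function names are assumptions made for illustration; which of the two averages a given leaderboard reports is not stated in the summary above.

```python
from collections import defaultdict


def per_subject_accuracy(records):
    """Return {subject: accuracy} for (subject, predicted, gold) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    return {subject: correct[subject] / total[subject] for subject in total}


def macro_average(records):
    """Unweighted mean of per-subject accuracies (each subject counts once)."""
    accuracies = per_subject_accuracy(records)
    return sum(accuracies.values()) / len(accuracies)


def micro_average(records):
    """Plain accuracy over all questions (large subjects weigh more)."""
    return sum(predicted == gold for _, predicted, gold in records) / len(records)


# Hypothetical multiple-choice results: (subject, model answer, correct answer).
records = [
    ("astronomy", "A", "A"),
    ("astronomy", "B", "C"),
    ("law", "A", "A"),
    ("law", "B", "B"),
    ("law", "C", "C"),
    ("law", "D", "A"),
]

print(per_subject_accuracy(records))
print(macro_average(records), micro_average(records))
```

The two averages differ whenever subjects have different numbers of questions, so it matters which one a reported score uses.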
@Guedez1
5 months ago
Imagine if the AI started outputting conspiracy theories like jews digging tunnels in NY city :^)
@jeffsanaraujo
5 months ago
Second 😅
@shApYT
5 months ago
And a significant portion of these benchmarks are just wrong. I think real human evaluation with an Elo system is the only real way to assess models. There is no ground truth for generative AI to be compared to.
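As a rough illustration of the Elo system mentioned in the comment above, here is a minimal Python sketch of the standard Elo update applied to one pairwise human vote between two models. The starting ratings of 1000 and the K-factor of 32 are illustrative assumptions, and the function names are hypothetical; this is not a description of how any particular leaderboard computes its ratings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the sum of ratings stays constant.
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; a human voter prefers model A.
model_a, model_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(model_a, model_b)
```

Because each vote only compares two answers, no ground-truth label is needed, which is the property the commenter is pointing at.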
@dogecoinx3093
5 months ago
Benchmarks are dumb. LLMs are supposed to learn over time, so they should pass any test eventually. It's like teaching a kid a lifetime of data on day one.
@mattweger437
5 months ago
First
