How are LLMs evaluated?
00:00 - Introduction and motivation for looking at LLM benchmarks
00:38 - HumanEval benchmark for code synthesis
02:27 - Exploring the HumanEval dataset
03:24 - MMLU (Massive Multitask Language Understanding) benchmark
04:37 - Exploring the MMLU dataset
05:58 - BIG-bench meta-benchmark with 200+ tasks
06:50 - Exploring a logical reasoning task in BIG-bench
08:13 - BIG-Bench Hard, a subset of tasks that remain challenging for LLMs
08:46 - Example tasks from BIG-Bench Hard
10:21 - Wrap up and other notable benchmarks not covered
github.com/openai/human-eval
github.com/google/BIG-bench
github.com/suzgunmirac/BIG-Be...
github.com/hendrycks/test (MMLU)
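
If you want to poke at these benchmarks yourself, here are a few minimal sketches (not from the video). First, HumanEval: each of its 164 problems is a Python function signature plus docstring, graded by unit tests. This sketch assumes the Hugging Face `datasets` mirror (`openai_humaneval`); the openai/human-eval repo linked above also ships the data as JSONL with its own loader.

```python
# Sketch: peek at HumanEval via the Hugging Face mirror (an assumption;
# the openai/human-eval repo distributes the same data as JSONL).
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")
print(len(humaneval))  # 164 problems

ex = humaneval[0]
print(ex["task_id"])             # e.g. "HumanEval/0"
print(ex["prompt"])              # signature + docstring the model must complete
print(ex["canonical_solution"])  # reference implementation
print(ex["entry_point"])         # function name the unit tests call
```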
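MMLU is 4-way multiple choice across 57 subjects. A sketch assuming the `cais/mmlu` mirror on the Hugging Face Hub; the hendrycks/test repo linked above distributes the same questions as CSV files.

```python
# Sketch: sample one MMLU question. The dataset id and field names
# assume the cais/mmlu mirror; verify against the hendrycks/test CSVs.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")
ex = mmlu[0]
print(ex["question"])
for letter, choice in zip("ABCD", ex["choices"]):
    print(f"  ({letter}) {choice}")
print("answer:", "ABCD"[ex["answer"]])  # `answer` is an index, 0-3
```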
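BIG-Bench Hard tasks are plain JSON files of input/target pairs, so they are easy to read directly. This sketch assumes the truncated link above points to suzgunmirac/BIG-Bench-Hard and that its bbh/<task>.json files hold an "examples" list; check both before relying on it.

```python
# Sketch: fetch one BIG-Bench Hard task straight from GitHub. The repo
# path and JSON layout ({"examples": [{"input", "target"}, ...]}) are
# assumptions based on the truncated link above.
import json
import urllib.request

URL = ("https://raw.githubusercontent.com/suzgunmirac/"
       "BIG-Bench-Hard/main/bbh/boolean_expressions.json")
with urllib.request.urlopen(URL) as resp:
    task = json.load(resp)

ex = task["examples"][0]
print(ex["input"])   # a Boolean expression to evaluate
print(ex["target"])  # "True" or "False"
```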
vivekhaldar.com
x.com/vivekhaldar