Let's evaluate deepeval on a known dataset.
Previous episode: • Evaluating deepeval fr...
The hallucination metric checks whether the LLM output contains claims that are not present in the context supplied with a question. To evaluate how accurate this metric is, I chose the SQuAD2 question-answering dataset. Since the dataset was manually annotated, running the hallucination metric on it should produce an accuracy of 100%.
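For reference, here is a minimal sketch of scoring one SQuAD2-style example with deepeval's HallucinationMetric (the example text is illustrative, and the exact API may differ between deepeval versions):

from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# A SQuAD2-style example: the answer is annotated as grounded in the
# context, so a perfect hallucination metric should not flag it.
test_case = LLMTestCase(
    input="In what country is Normandy located?",
    actual_output="Normandy is located in France.",
    context=[
        "The Normans were the people who in the 10th and 11th centuries "
        "gave their name to Normandy, a region in France."
    ],
)

metric = HallucinationMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score, metric.reason)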
Results:
When testing the accuracy of the metric on 100 examples, we get an accuracy of 73%.
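As a rough sketch, an accuracy number like this can be computed as follows (score_example is a hypothetical helper wrapping the HallucinationMetric call above; the loop assumes the Hugging Face squad_v2 schema):

from datasets import load_dataset  # Hugging Face datasets

squad = load_dataset("squad_v2", split="validation")

evaluated = agreements = 0
for row in squad:
    if not row["answers"]["text"]:  # skip SQuAD2's unanswerable questions
        continue
    evaluated += 1
    # score_example (hypothetical) returns True when the metric agrees
    # with the annotation, i.e. reports no hallucination for a gold answer.
    agreements += score_example(
        row["question"], row["answers"]["text"][0], row["context"]
    )
    if evaluated == 100:
        break

print(f"accuracy: {agreements / evaluated:.0%}")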
Why?
Watch the video for a quick error analysis.
-- Watch live at www.twitch.tv/...
Libraries I used as tools (a sketch of how they fit together follows the list):
1. uv, a fast Python package manager: github.com/ast...
2. Invoke, a Python command-line task runner: www.pyinvoke.org/
3. VisiData, a command-line CSV viewer: www.visidata.org/
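For example, a tasks.py for Invoke can drive the pipeline through uv (task names and scripts here are illustrative, not the ones from the video):

# tasks.py -- run with `invoke preprocess` or `invoke evaluate`
from invoke import task

@task
def preprocess(c):
    """Preprocess SQuAD2 into the CSV the evaluation expects."""
    c.run("uv run python preprocess_squad2.py")

@task
def evaluate(c):
    """Run the deepeval hallucination metric over the preprocessed CSV."""
    c.run("uv run python run_eval.py")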
Timestamps
0:00 Intro
3:20 Python Invoke library usage
4:00 Python uv package manager usage
4:51 Preprocessing SQuAD2
8:51 VisiData command-line usage
13:00 Setting up deepeval
14:00 Fixing the preprocessing
30:00 Finishing the pipeline
37:00 Parallelising deepeval calls (see the sketch after this list)
40:00 Error analysis
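The parallelisation step at 37:00 can be sketched with a thread pool around the metric calls; this is a generic pattern, not necessarily the code from the stream, and evaluate_one is a hypothetical wrapper that builds an LLMTestCase and calls measure on it:

from concurrent.futures import ThreadPoolExecutor

def run_parallel(examples, evaluate_one, max_workers=8):
    # The metric calls are I/O-bound (LLM API requests), so a thread
    # pool is enough to overlap them and speed up the run.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_one, examples))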