Let's evaluate deepeval on a known dataset.
Previous episode: • Evaluating deepeval fr...
The hallucination metric checks whether the LLM output contains claims that are not present in the context supplied with a question. To evaluate how accurate this metric is, I chose the SQuAD2 question-answering dataset. Since the dataset was manually annotated, running the hallucination metric on it should produce an accuracy of 100%.
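For reference, here is a minimal sketch of scoring one SQuAD2-style example with deepeval's HallucinationMetric (the example text is illustrative, and the exact API may differ between deepeval versions):

from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# A SQuAD2-style example: the answer is annotated as grounded in the
# context, so a perfect hallucination metric should not flag it.
test_case = LLMTestCase(
    input="In what country is Normandy located?",
    actual_output="Normandy is located in France.",
    context=[
        "The Normans were the people who in the 10th and 11th centuries "
        "gave their name to Normandy, a region in France."
    ],
)

metric = HallucinationMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score, metric.reason)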
Results:
When testing the accuracy of the metric on 100 examples, we get an accuracy of 73%.
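As a rough sketch, an accuracy number like this can be computed as follows (score_example is a hypothetical helper wrapping the HallucinationMetric call above; the loop assumes the Hugging Face squad_v2 schema):

from datasets import load_dataset  # Hugging Face datasets

squad = load_dataset("squad_v2", split="validation")

evaluated = agreements = 0
for row in squad:
    if not row["answers"]["text"]:  # skip SQuAD2's unanswerable questions
        continue
    evaluated += 1
    # score_example (hypothetical) returns True when the metric agrees
    # with the annotation, i.e. reports no hallucination for a gold answer.
    agreements += score_example(
        row["question"], row["answers"]["text"][0], row["context"]
    )
    if evaluated == 100:
        break

print(f"accuracy: {agreements / evaluated:.0%}")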
Why?
Watch the video for a quick error analysis.
-- Watch live at www.twitch.tv/...
Libraries I used as tools (a sketch of how they fit together follows the list):
1. uv, a fast Python package manager: github.com/ast...
2. Invoke, a Python command-line task runner: www.pyinvoke.org/
3. VisiData, a command-line CSV viewer: www.visidata.org/
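For example, a tasks.py for Invoke can drive the pipeline through uv (task names and scripts here are illustrative, not the ones from the video):

# tasks.py -- run with `invoke preprocess` or `invoke evaluate`
from invoke import task

@task
def preprocess(c):
    """Preprocess SQuAD2 into the CSV the evaluation expects."""
    c.run("uv run python preprocess_squad2.py")

@task
def evaluate(c):
    """Run the deepeval hallucination metric over the preprocessed CSV."""
    c.run("uv run python run_eval.py")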
Timestamps
0:00 Intro
3:20 Python Invoke library usage
4:00 Python uv package manager usage
4:51 Preprocessing SQuAD2
8:51 VisiData command-line usage
13:00 Setting up deepeval
14:00 Fixing the preprocessing
30:00 Finishing the pipeline
37:00 Parallelising deepeval calls (see the sketch after this list)
40:00 Error analysis
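The parallelisation step at 37:00 can be sketched with a thread pool around the metric calls; this is a generic pattern, not necessarily the code from the stream, and evaluate_one is a hypothetical wrapper that builds an LLMTestCase and calls measure on it:

from concurrent.futures import ThreadPoolExecutor

def run_parallel(examples, evaluate_one, max_workers=8):
    # The metric calls are I/O-bound (LLM API requests), so a thread
    # pool is enough to overlap them and speed up the run.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_one, examples))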