Try Empirical: github.com/empirical-run/empi... | HumanEval example: github.com/empirical-run/empi...
----
New LLMs showcase their performance through LLM benchmarks like HumanEval. But for us, and for other devs building LLMs into their applications, these benchmarks have never meant much: they are just numbers on a blog post.
Instead, we end up relying on playgrounds where we can try a few scenarios and do a "vibe check" on the model outputs. Vibe checks are great - because they are "real" - but they only give us anecdotal confidence.
What if we could combine the systematic validation of a scientific benchmark, which runs hundreds of scenarios, with the hands-on understanding of model behavior that vibe checking gives us? What's needed is tooling that makes it easy to run these benchmarks, iterate quickly on model and prompt changes, and then build benchmarks of our own.
Watch the video to learn about the HumanEval benchmark and see it run across LLMs from OpenAI, Anthropic, and Databricks using Empirical, an open source testing framework for LLM applications.
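For context, here is a minimal sketch of how a HumanEval-style task is scored: each task provides a function signature plus docstring as the prompt, the model generates a completion, and the benchmark's unit tests are executed against it. The `generate_completion` function below is a hypothetical placeholder for whichever model you call; a real harness (like Empirical) adds sandboxing, concurrency, and reporting on top of this.

```python
def generate_completion(prompt: str) -> str:
    """Placeholder: send the function signature + docstring to an LLM
    and return the generated function body. Swap in your own model call."""
    raise NotImplementedError


def passes_tests(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Run the task's unit tests against the model's completion."""
    # HumanEval tasks ship a `test` field defining check(candidate) and an `entry_point` name.
    program = prompt + completion + "\n" + test_code + f"\ncheck({entry_point})\n"
    try:
        exec(program, {})  # the benchmark scores by executing tests; real harnesses sandbox this
        return True
    except Exception:
        return False


def pass_at_1(tasks: list[dict]) -> float:
    """pass@1: fraction of tasks solved with a single completion per task."""
    solved = 0
    for task in tasks:
        completion = generate_completion(task["prompt"])
        if passes_tests(task["prompt"], completion, task["test"], task["entry_point"]):
            solved += 1
    return solved / len(tasks)
```

The per-model numbers you see in benchmark posts are this pass@1 (or pass@k) figure; a framework like Empirical lets you run the same loop against multiple providers and inspect the individual outputs instead of only the aggregate score.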