Automating Tests for your RAG Chatbot or Other Generative Tool by Abigail Haddad
Visit rstats.ai for information on upcoming conferences.
Abstract: Building a Retrieval Augmented Generation (RAG) chatbot that answers questions about a specific set of documents is straightforward. But how do you tell if it's working? Automated evaluation of generative tools for specific use cases is tricky, but it's also important if you want to easily compare performance across different underlying LLMs, system prompts, temperatures, or other parameters -- or just make sure you're not breaking something when you push your code. In this talk, I'll discuss why this kind of evaluation is challenging and review a few options for the kinds of assessments you can create, including using an LLM to evaluate your LLM-based tool. We'll then look at several ways to write automated LLM-led evaluations, including with a library that lets you create complex grading rubrics for your tests with very little code.
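As a rough illustration of the LLM-as-judge idea mentioned in the abstract (not the library or code from the talk), here is a minimal Python sketch that asks a judge model to grade one chatbot answer against a simple pass/fail rubric. It assumes the openai package and an OPENAI_API_KEY in the environment; the function name, inputs, and model choice are hypothetical examples.

# Minimal sketch of an LLM-led evaluation: a judge model grades one
# RAG chatbot answer against a pass/fail rubric.
from openai import OpenAI

client = OpenAI()

def grade_answer(question: str, reference_answer: str, chatbot_answer: str) -> str:
    """Return PASS or FAIL for one chatbot answer, as judged by an LLM."""
    rubric = (
        "You are grading a RAG chatbot. Reply with exactly PASS if the "
        "chatbot's answer is factually consistent with the reference answer "
        "and addresses the question; otherwise reply with exactly FAIL."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model; swap in whatever you use
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Reference answer: {reference_answer}\n"
                f"Chatbot answer: {chatbot_answer}"
            )},
        ],
        temperature=0,  # keep grading as deterministic as possible
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    # Run as part of an automated test suite so a FAIL flags a regression
    # when a prompt, temperature, or model change degrades answers.
    verdict = grade_answer(
        "What year was the policy adopted?",
        "The policy was adopted in 2019.",
        "It was adopted in 2019.",
    )
    print(verdict)  # expect PASS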
Bio: Abigail Haddad is a data scientist working on automating LLM evaluations. Previously, she did research and data science for the Department of Defense, including at the RAND Corporation and as a Department of the Army civilian. Her hobbies include analyzing federal job listings and co-organizing Data Science DC. She blogs at The Present of Coding.
Twitter: @abbystat
Presented at the 2024 New York R Conference (May 16, 2024)
Hosted by Lander Analytics (landeranalytic...)