Benchmarking Question/Answering Over CSV Data

Views: 6,372

LangChain

Blog: blog.langchain...
Benchmarking Repo: github.com/lan...
LangSmith: smith.langchai...

Comments: 20

@jaimerv19
1 year ago
Really really helpful! It would be great to see more videos in this format with more LangSmith and maybe integration with other sources like Notion, gDrive,...
@guilhermeveiga9345
1 year ago
Reaaaally helpful, tnks!
@ilianos
7 months ago
🎯 Key Takeaways for quick navigation:

00:04 🎥 *Introduction to the new video format and blog post overview.*
- Introduction of a new video format accompanying a deep dive blog post.
- Mention of raw, one-take video due to lack of fancy editing equipment.
- Overview of the blog's focus on benchmarking question answering over CSV data.

00:32 🚀 *Background and motivation for the project.*
- Explanation of why benchmarking question answering over CSV data is interesting.
- The significance of addressing question answering over tabular or CSV data.

01:14 🧰 *Initial steps and debugging process.*
- The starting point of the project with minimal resources.
- Utilization of Langsmith for debugging and aiming to improve question answering over CSV data.
- Description of the evaluation setup and initial solution attempts.

02:08 📊 *Question Answering for Unstructured vs. Tabular Data.*
- Comparison of question answering processes for unstructured data and tabular (CSV) data.
- The challenge of improving question answering over CSV data.

03:02 📘 *Writing the blog post: Motivation and challenges.*
- Reasons behind writing the blog post, including the representation of tasks in LM applications and the challenges in evaluation and improvement.

04:15 🌱 *Starting from scratch: Data collection and evaluation.*
- The absence of an initial dataset for evaluation and the approach to gather evaluation data.
- The creation of a simple application to collect real questions and feedback on the Titanic dataset.

05:37 🔍 *Gathering real-world data through user interaction.*
- Setup of a user interface for querying the Titanic dataset and collecting feedback.
- How user feedback and questions were used to build and refine the evaluation dataset.

07:41 📈 *Constructing the evaluation dataset from user feedback.*
- Use of Langsmith to log user interactions and feedback for dataset construction.
- The process of refining the dataset based on user feedback and identified issues.

10:33 💡 *Initial solution and areas for improvement.*
- The initial approach to question answering over CSV data and the identification of two main areas for improvement.
- Discussion on the need for running queries against the data frame and the challenges faced.

14:56 🐞 *Debugging and insights from the project.*
- A specific debugging example related to pandas' display settings and its impact on the project.
- General insights on the importance of data processing and debugging in LM applications.

18:55 📚 *Conclusion and future directions.*
- Summary of insights and experiences from the project.
- Announcement of future additions to the benchmarking repository, including SQL and other formats.

19:08 📊 *Evaluation Process and Challenges*
- Discussion on the challenges of evaluating language model outputs for question answering over CSV data.
- Highlighting the semantic equivalence challenge in evaluation.
- Use of a language model for evaluating other models’ responses.

20:45 🤖 *Final Custom Agent Design*
- Introduction of the final custom agent and its setup.
- The inclusion of a retriever tool and Python REPL tool for the agent.
- Adjustments to the pandas display options to improve data representation.

25:38 🧪 *Evaluation Examples and Insights*
- Examples of evaluation outcomes and the insights drawn from them.
- The importance of thorough evaluation in identifying areas for improvement.
- Discussion on the limitations and potentials of using language models for evaluation.

28:10 🛠️ *General Takeaways and Future Directions*
- Reflections on the project’s specific and general implications for LM applications.
- Acknowledgment of the project's limitations and the potential for future improvements.
- Invitation for contributions to the benchmarking repository and LangChain.

Made with HARPA AI
@mg4u4ever
1 year ago
Was watching bits here and there from this channel, but the content lately has been getting really good and I'm becoming more of a regular here. Kudos and keep it up
@surfbort
1 year ago
Sick!
@anthonydattolo6297
1 year ago
Would like to see a performance deep dive on your listed vector store providers based on task. Like top 5-10 most popular ones
@adilrerhrhaye3421
1 year ago
For an unedited video, that was very great. It was to the point and very smooth. Please, more of them. Thanks again!!!
@happyday.mjohnson
1 year ago
Thank you. It would be better if you were using an open source LLM instead of OpenAI. Or at least have the option.
@jorgefelipegaviriafierro705
1 year ago
This is really interesting and useful, benchmarking is something really needed! Thanks, looking forward to the SQL example!
@AlbertoChillon
1 year ago
Thank you very much, super helpful! Please keep on with these kinds of videos!!
@DreamsAPI
1 year ago
Love it, so that is four likes. Please keep it raw, that is real life. Seeing how you solve problems is very important
@ahmedzahid8354
1 year ago
Thanks for these videos, it really helps to see how it works in the background.
@asatorftw
1 year ago
Love it! Learned a bunch and hope you do more deep dives! I like the more interview style videos too, but they lack details sometimes, so this is great!
@AI-LLM
1 year ago
Great idea, focused deep dives. Thank you.
@angelochu3156
1 year ago
Is it possible to use cases other than the typical Titanic dataset?
@roberth8737
1 year ago
Excellent - more of these videos!
@toastrecon
1 year ago
This is so cool! A few ideas and questions: is there a way to display “max rows” to get around the pandas summary view like you did with max columns? What about figuring out a more granular feedback mechanism? Right now, it’s just thumbs up or down, and that would only be really helpful if you had a LOT of feedback to eventually arrive at the right behavior. It’s almost like the implementation of the functions is like supervised learning in this case. If you somehow dedicated human time into shaping the right queries, you might get pretty good at answering most questions. Maybe shaping the data, too? Like having tables with male/female split or something that would be easier for the llm to find reliably. I didn’t see quark(?) used or mentioned much, is there more info on that?
@LangChain
1 year ago
kork is linked to in the blog! Working on examples to add more granular feedback
@toastrecon
1 year ago
@@LangChain thank you! I was watching this around midnight and didn’t think to check the blog.
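
Note on the "max rows" question in this thread: pandas does have a `display.max_rows` option alongside `display.max_columns`. Below is a minimal sketch of how both could be lifted so a DataFrame prints in full. It is not the code from the video or the benchmarking repo, and the DataFrame here is a hypothetical stand-in for a wide, long CSV such as the Titanic dataset.

```python
import pandas as pd

# Hypothetical stand-in table (not the real Titanic data): 100 rows x 30 columns,
# large enough that pandas truncates its default printout.
df = pd.DataFrame({f"col_{i}": range(100) for i in range(30)})

# With default settings, repr(df) elides the middle rows and appends a
# "[100 rows x 30 columns]" footer to signal the truncation.
default_repr = repr(df)

# option_context applies the options only inside the with-block.
# Setting a limit to None removes it.
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full_repr = repr(df)

print(len(default_repr.splitlines()), "lines with default settings")
print(len(full_repr.splitlines()), "lines with limits removed")
```

To change the setting for a whole session, `pd.set_option("display.max_rows", None)` does the same thing globally. Printing every row of a large table can produce a very long string, which matters if the output is fed back to a model with a limited context window, so a finite limit may be the safer choice in an agent setup.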