What is the Role of Data Quality and Diversity in LLM Systems?

Check out my essays: aisc.substack....
OR book me to talk: calendly.com/a...
OR subscribe to our event calendar: lu.ma/aisc-llm...
AF: A few years ago, there was a whole movement around data-centric machine learning, which I think Andrew Ng started talking about: you don't need the largest datasets if you sit down, really clean them up, and make sure the signals you're interested in are amplified. LLMs have taken this to a very different level. Because of their generative capabilities, we no longer even need many examples, as long as we capture the right diversity. Because of the pre-training that most of these models go through, they already have the majority of the skills we need. We just need to nudge them a little further to get specific.
PJ: This is the question that I often ask my researchers: how do we increase the amount of data without sacrificing quality and diversity? A lot of the time, that has been very hard.
So having subject matter experts, if you can afford them, to validate the data, and having that human in the loop, is very important as well.
We cannot say that a large amount of data is not important, but in the case of large language models, we obviously want to put more emphasis on the data that brings value.
AF: Going back to the topic of hallucination, a big part of the role that the extra data plays is keeping the large language model grounded. And if you cannot swear by the quality of the extra information you're providing to the system through this data, then you're not going to get much better results.
PJ: Yeah, definitely. It's more important to add more diverse data into the model to be able to potentially tackle the hallucination problem.
AF: You also made comments about validation and evaluation. In a lot of cases, when you're doing those types of exercises, you run into specific problems that the system has: it is getting the entities wrong, or it is getting the math wrong. You probably want to create datasets specifically for those types of edge cases and problems. And again, quality is going to be the number one factor you look at, because you're trying to solve specific edge cases.
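
Editor's sketch: one way to act on the point about edge cases is to group failed evaluation examples by the kind of error, so each group can seed a small, carefully reviewed dataset. The record fields (`prompt`, `expected`, `passed`, `error_type`), the function name, and the sample records below are all hypothetical, chosen only for illustration; the speakers do not describe a specific implementation.

```python
from collections import defaultdict


def build_edge_case_sets(eval_records):
    """Group failed evaluation records by error category.

    Each record is assumed (hypothetically) to be a dict with the keys
    'prompt', 'expected', 'passed', and 'error_type'.
    """
    buckets = defaultdict(list)
    for record in eval_records:
        if record["passed"]:
            continue  # only failures point at edge cases worth collecting
        buckets[record["error_type"]].append(
            {"prompt": record["prompt"], "expected": record["expected"]}
        )
    return dict(buckets)


# Illustrative records only; not real evaluation results.
eval_records = [
    {"prompt": "What is 17 * 23?", "expected": "391",
     "passed": False, "error_type": "math"},
    {"prompt": "Which city is named in: 'She flew to Lima.'",
     "expected": "Lima", "passed": False, "error_type": "entity"},
    {"prompt": "What is 2 + 2?", "expected": "4",
     "passed": True, "error_type": None},
]

edge_case_sets = build_edge_case_sets(eval_records)
for error_type, examples in edge_case_sets.items():
    print(error_type, len(examples))
```

Each resulting group would still need review by a subject matter expert before use, since quality is the deciding factor for these targeted sets.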
