Check out my essays: aisc.substack....
OR book me to talk: calendly.com/a...
OR subscribe to our event calendar: lu.ma/aisc-llm...
AF: A few years ago, there was this whole movement around data-centric machine learning, which I think Andrew Ng started talking about: the idea that you don't need the largest datasets if you sit down, really clean them up, and make sure that the signals you're interested in are amplified. LLMs have brought this to a very different level. Because of their generative capabilities, we no longer even need many examples, as long as we capture the right diversity. Because of the pre-training that most of these models go through, they already have the majority of the skills we need. We just need to nudge them a little further to get specific.
PJ: This is the question that I often ask my researchers: how do we increase the amount of data without sacrificing quality and diversity? And a lot of the time it has been very hard.
So having subject matter experts, if you can afford them, to validate that data, and having that human in the loop, are very important as well.
We cannot say that a large amount of data is not important, but in the case of large language models, we obviously want to put more emphasis on the data that brings value.
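As a rough illustration of growing a dataset without letting quality and diversity slip, here is a minimal Python sketch that drops repeated examples and caps how many examples any one topic can contribute. The field names (`text`, `topic`), the cap, and the sample records are hypothetical, and this stands in for, rather than replaces, the expert review mentioned above.

```python
from collections import Counter

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())

def select_diverse(examples, max_per_topic=2):
    """Keep the first copy of each example and at most
    `max_per_topic` examples per topic (hypothetical policy)."""
    seen = set()
    per_topic = Counter()
    kept = []
    for ex in examples:
        key = normalize(ex["text"])
        if key in seen:
            continue  # duplicate after normalization
        if per_topic[ex["topic"]] >= max_per_topic:
            continue  # this topic is already well represented
        seen.add(key)
        per_topic[ex["topic"]] += 1
        kept.append(ex)
    return kept

# Hypothetical sample data for illustration only.
examples = [
    {"text": "Reset my password", "topic": "account"},
    {"text": "reset  my PASSWORD", "topic": "account"},
    {"text": "Change my email", "topic": "account"},
    {"text": "Close my account", "topic": "account"},
    {"text": "Refund my order", "topic": "billing"},
]
kept = select_diverse(examples)
```

In practice, the examples that survive a filter like this would still go to a human reviewer before being used.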
AF: Going back to the topic of hallucination, a big part of the role that the extra data plays is keeping the large language model grounded. And if you cannot swear by the quality of the extra information you're providing to the system through this data, then you're not going to get much better results.
PJ: Yeah, definitely. It's more important to add more diverse data into the model to be able to potentially tackle the hallucination problem.
AF: And also, you made comments about validation and evaluation. In a lot of cases, when you're doing those types of exercises, you run into specific problems that the system has: for example, it is getting the entities wrong or it is getting the math wrong. And you probably want to create datasets specifically for those types of edge cases and problems. And again, quality is going to be the number one factor that you look at, because you're trying to solve some specific edge cases.
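To make the edge-case idea concrete, here is a minimal Python sketch that groups failed evaluation records by error category, so each group can seed a small, carefully reviewed dataset. The record fields (`prompt`, `error_type`, `passed`), the category labels, and the `min_failures` threshold are hypothetical and not taken from the conversation.

```python
from collections import defaultdict

def collect_edge_cases(eval_results, min_failures=2):
    """Group failed prompts by error type and keep only the
    categories that fail at least `min_failures` times."""
    buckets = defaultdict(list)
    for record in eval_results:
        if not record["passed"]:
            buckets[record["error_type"]].append(record["prompt"])
    return {kind: prompts for kind, prompts in buckets.items()
            if len(prompts) >= min_failures}

# Hypothetical evaluation log for illustration only.
eval_results = [
    {"prompt": "Who founded Acme?", "error_type": "entity", "passed": False},
    {"prompt": "What is 17 * 23?", "error_type": "math", "passed": False},
    {"prompt": "Which city hosts Acme HQ?", "error_type": "entity", "passed": False},
    {"prompt": "Summarize the memo.", "error_type": "none", "passed": True},
    {"prompt": "What is 12% of 250?", "error_type": "math", "passed": False},
    {"prompt": "Translate 'hello'.", "error_type": "format", "passed": False},
]
targeted = collect_edge_cases(eval_results)
```

Each resulting bucket is only a starting point: the prompts would still need high-quality, verified answers before they are useful as targeted data.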
What is the Role of Data Quality and Diversity in LLM Systems?