Reinforcement Learning from Human Feedback (RLHF) & Direct Preference Optimization (DPO) Explained

Learn how Reinforcement Learning from Human Feedback (RLHF) actually works and why Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are changing the game.
This video doesn't go deep on math. Instead, I provide a high-level overview of each technique to help you make practical decisions about where to focus your time and energy.
0:52 The Idea of Reinforcement Learning
1:55 Reinforcement Learning from Human Feedback (RLHF)
4:21 RLHF in a Nutshell
5:06 RLHF Variations
6:11 Challenges with RLHF
7:02 Direct Preference Optimization (DPO)
7:47 Preferences Dataset Example
8:29 DPO in a Nutshell
9:25 DPO Advantages over RLHF
10:32 Challenges with DPO
10:50 Kahneman-Tversky Optimization (KTO)
11:39 Prospect Theory
13:35 Sigmoid vs Value Function
13:49 KTO Dataset
15:28 KTO in a Nutshell
15:54 Advantages of KTO
18:03 KTO Hyperparameters
These are the three papers referenced in the video:
1. Deep reinforcement learning from human preferences (arxiv.org/abs/...)
2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arxiv.org/abs/...)
3. KTO: Model Alignment as Prospect Theoretic Optimization (arxiv.org/abs/...)
The Hugging Face TRL library offers implementations for PPO, DPO, and KTO:
huggingface.co...
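To illustrate how a DPO preferences dataset differs from a KTO dataset, here is a minimal sketch that converts paired preference records into unpaired, binary-labeled ones. The field names (`prompt`, `chosen`, `rejected`, `completion`, `label`) are assumptions modeled on common conventions, so check the TRL documentation for the exact format it expects. The `pairs_to_kto` helper and the sample record are hypothetical.

```python
from typing import TypedDict


class PreferencePair(TypedDict):
    """DPO-style record: one prompt with a preferred and a dispreferred answer."""
    prompt: str
    chosen: str
    rejected: str


class KTOExample(TypedDict):
    """KTO-style record: one prompt, one completion, and a thumbs up/down label."""
    prompt: str
    completion: str
    label: bool


def pairs_to_kto(pairs: list[PreferencePair]) -> list[KTOExample]:
    """Split each preference pair into one desirable and one undesirable example.

    Hypothetical helper for illustration only; field names are assumptions.
    """
    examples: list[KTOExample] = []
    for pair in pairs:
        # The preferred answer becomes a desirable (label=True) example.
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True}
        )
        # The dispreferred answer becomes an undesirable (label=False) example.
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False}
        )
    return examples


# Made-up sample record, not taken from the video.
pairs: list[PreferencePair] = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
]
kto_examples = pairs_to_kto(pairs)
```

The conversion only goes one way: KTO data does not need to come in pairs, so in general it cannot be turned back into a preferences dataset.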
Want to prototype with prompts and supervised fine-tuning? Try Entry Point AI:
www.entrypoint...
How about connecting? I'm on LinkedIn:
/ markhennings


Comments: 10

@liberate7604
3 months ago
Great video. Is it better to use KTO as the optimizer for binary classification?
@EntryPointAI
3 months ago
I couldn't say for sure. Binary classification is a fairly simple task, so I would start with supervised fine-tuning.
2 months ago
Awesome. Thanks
@VerdonTrigance
3 months ago
Hey! Thanks for the video! I've never used these techniques, but what I really want to do is train a base or chat LLM like Llama or Phi-3 on some big text (The Lord of the Rings, for example). All the techniques I've seen so far require a properly prepared dataset, but who can prepare that, and how? Ask every possible question and answer them all as well? That's impossible! Do you know how I can prepare a dataset to later train a model on?
@EntryPointAI
3 months ago
Besides including the big text in a model's pretraining, you can fine-tune on it using empty prompts, which will make the model more likely to respond in a style similar to the writing. That doesn't necessarily make it an expert on the contents. In order to answer questions about a corpus, the typical approach is to chunk it up and use RAG. I have another video on the difference between RAG and fine-tuning.
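The two options in the reply above can be sketched roughly as follows: split the corpus into overlapping chunks (the usual first step before indexing for RAG), and wrap the chunks as empty-prompt records for style fine-tuning. The chunk sizes, the word-based splitting, and the `prompt`/`completion` field names are all illustrative assumptions, as are both helper functions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words, overlapping by `overlap` words.

    Hypothetical helper; real pipelines often split by tokens or sentences instead.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def empty_prompt_records(chunks: list[str]) -> list[dict[str, str]]:
    """Wrap each chunk as a fine-tuning record with an empty prompt.

    The field names are assumptions; use whatever your training tool expects.
    """
    return [{"prompt": "", "completion": chunk} for chunk in chunks]


# Tiny made-up "corpus" of ten words to keep the example readable.
sample = " ".join(str(i) for i in range(10))
chunks = chunk_text(sample, chunk_size=4, overlap=1)
records = empty_prompt_records(chunks)
```

For RAG, the chunks would be embedded and stored in a search index rather than used as training data; that part is outside this sketch.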
@iasplay224
3 months ago
Thank you for the info, it was very well explained for an introduction.
