I am literally crying, what a wonderful explanation 😭
@gemini_537
5 months ago
You are a genius at explaining complex concepts in simple terms!
@testme2026
7 months ago
Seriously mate, you are annoyingly good! like off the charts amazing! Thank you so much Luis Serrano.
@jff711
7 months ago
Thank you! Looking forward to watching your DPO video.
@ብርቱሰው
3 months ago
I would like to say thank you for the wonderful video. I want to learn reinforcement learning for my future studies in the field of robotics. I have seen that you only have 4 videos about RL, and I am hungry for more. I find your videos easier to understand because you explain well. Please add more RL videos. Thank you 🙏
@hoseinalavi3916
4 months ago
Your explanation is so great. Keep going, my friend. I am waiting for your next video.
@vigneshram5193
3 months ago
Thank you Luis Serrano for this super explanatory video
@asma5179
7 months ago
Thanks a lot for sharing your knowledge
@dragolov
7 months ago
Deep respect, Luis Serrano!
@SerranoAcademy
7 months ago
Thank you Ivan! Deep respect to you too!
@omsaikommawar
7 months ago
Been waiting for your video for a very long time 😁
@SerranoAcademy
7 months ago
Thank you! Finally here! There's one on DPO coming out soon too!
@sgrimm7346
7 months ago
Great presentation, thank you.
@SerranoAcademy
7 months ago
Thank you! Glad you liked it! :)
@gergerger53
5 months ago
Great explanation. Loved the Simpsons reference 🤣
@SerranoAcademy
5 months ago
LOL! Yay, someone got the reference!!! :)
@gemini_537
5 months ago
Gemini: This video is about reinforcement learning with human feedback (RLHF), a method used to train large language models (LLMs). Specifically, it covers how to fine-tune LLMs after they've been trained. Here are the key points of the video:

* **Reinforcement learning with human feedback (RLHF):**
  * RLHF is a method for training LLMs.
  * It involves human annotators rating the responses generated by a large language model to a specific prompt.
  * The LLM is then trained to get high scores from the human annotators.
* **Review of reinforcement learning (RL):**
  * The video reviews the basics of RL using a grid-world example.
  * An agent moves around a grid trying to collect points and avoid getting eaten by a dragon.
  * The agent learns the optimal policy through trial and error, which is to move toward the squares with the most points.
  * A value neural network and a policy neural network are introduced to approximate the values and the policy, respectively.
* **Proximal Policy Optimization (PPO):**
  * PPO is an algorithm for training RL agents.
  * It approximates the value and policy functions using neural networks.
  * The agent learns by moving around the state space and getting points based on the actions it takes.
* **Transformers:**
  * Transformers are neural networks used to generate text.
  * They are trained on a massive amount of text data.
  * They generate text one word at a time by predicting the next word in a sequence.
* **Fine-tuning transformers with RLHF:**
  * The core idea of RLHF is to combine RL with human feedback to fine-tune transformers.
  * Imagine the agent moving around a grid of sentences, adding one word at a time.
  * The goal is to generate coherent sentences.
  * The agent generates multiple possible continuations for a sentence.
  * Human annotators then rate these continuations, and the agent is trained to favor generating the higher-rated ones.

In essence, the value neural network mimics the human evaluator, assigning scores to responses, while the policy neural network learns the probabilities of transitioning between states (sentences), which is similar to what transformers do. The video concludes by mentioning that this is the third video in a series of four about reinforcement learning.
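The fine-tuning loop summarized above can be sketched in a few lines of toy Python. This is only an illustration of the idea, not the video's actual method: the hand-written `reward` function stands in for the human annotators (or a learned reward model), and a tabular softmax "policy" over a handful of candidate continuations stands in for the transformer; real RLHF uses neural networks and PPO updates.

```python
import math

# Toy RLHF sketch: a "policy" assigns probabilities to candidate
# continuations, a stand-in reward function plays the role of the human
# annotators, and training shifts probability mass toward the
# higher-rated continuations. All names and scores are made up.

candidates = [
    "the cat sat on the mat",
    "mat the on sat cat the",
    "the cat sat sat sat",
]

def reward(sentence: str) -> float:
    """Hypothetical annotator: likes 'the' before a noun, dislikes repetition."""
    words = sentence.split()
    score = 0.0
    for a, b in zip(words, words[1:]):
        if a == "the" and b in {"cat", "mat"}:
            score += 1.0
        if a == b:
            score -= 1.0
    return score

# Policy: one logit per candidate continuation.
logits = {c: 0.0 for c in candidates}

def probs() -> dict:
    z = sum(math.exp(v) for v in logits.values())
    return {c: math.exp(v) / z for c, v in logits.items()}

lr = 0.5
for _ in range(200):
    p = probs()
    # Expected reward under the current policy acts as a value-like baseline.
    baseline = sum(p[c] * reward(c) for c in candidates)
    for c in candidates:
        # REINFORCE-style softmax gradient: push up continuations
        # that score above the baseline, push down the rest.
        logits[c] += lr * p[c] * (reward(c) - baseline)

best = max(probs(), key=probs().get)
print(best)  # the policy concentrates on the coherent sentence
```

The baseline here plays the role the value network plays in PPO: it centers the reward so the policy only moves toward continuations that beat its current expectation.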
@KumR
7 months ago
Thanks a lot Luis.
@SerranoAcademy
7 months ago
Thank you @KumR!
@HamidrezaFarhidzadeh
7 months ago
You are amazing. Thanks
@RunchuTian
5 months ago
Some questions about RLHF:
- The value model in RLHF is not like the typical PPO value model, which assigns a value to each move on the grid. The value model in RLHF only gives a value to a complete chain of moves, so it is really more of a "reward model".
- What does the loss function for the policy model look like in RLHF? Does it still follow the PPO loss, or is it modified?
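A hedged note on the question above: in the standard RLHF setup (as described in the InstructGPT-style literature, not necessarily verbatim in this video), the policy is still trained with the usual PPO clipped surrogate objective:

```latex
% PPO clipped surrogate objective, with probability ratio r_t:
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\left[\min\Big(r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

The main RLHF-specific change is to the reward rather than the loss: the reward-model score for a completed response is typically combined with a per-token KL penalty, $r_{\mathrm{total}} = r_{\mathrm{RM}} - \beta \log\big(\pi_\theta / \pi_{\mathrm{ref}}\big)$, which keeps the fine-tuned policy close to the reference model.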
@paveltsvetkov7948
3 months ago
Why do you need the value neural network? Why can't you train the policy neural network alone? Is it because the value network can replace the human evaluator and provide more training samples for the policy network without needing human input?
@meme31382
7 months ago
Thanks for the great video! Could you make one explaining graph neural networks? Thanks in advance.
@SerranoAcademy
7 months ago
Thanks for the message and the suggestion! Yes that topic has been coming up, and it looks super interesting!
@jeffpan2785
5 months ago
Could you please give an introduction to DPO (Direct Preference Optimization) as well? Thanks a lot!
@SerranoAcademy
5 months ago
Thanks! Absolutely, I'm working on a DPO video, but tbh, I haven't yet fully understood the loss function the way I want to. I'll get it out hopefully very soon!
@lrostagno2000
6 months ago
Love the video, but could you please remove the loud music at the beginning of the sections?
@tutolopezgonzalez1106
7 months ago
Love your videos ❤ Thank you for sharing and bringing us light. Would you explain how RLHF is relevant to aligning AI systems?
@SerranoAcademy
7 months ago
Thank you so much, I'm glad you liked it! Yes, great question! Here I mostly talked about fine-tuning, which is getting models to give accurate responses, whether in general or for a specific dataset. Aligning them goes deeper, as it requires them to be ethical, responsible, etc. I would say the process is similar in general; the difference lies in the goals being pursued. But I don't think there's a huge difference in the reward model, etc. I'll check, and if there's a big difference, I'll add it in the next video. Thanks for the suggestion!
@minditon3264
3 months ago
Direct Preference Optimization (DPO) video??
@SerranoAcademy
3 months ago
Working on it, almost there! :)
@pushpakagrawal7292
6 months ago
Great! When is DPO coming?
@SerranoAcademy
6 months ago
Thanks! Soon, working on it :)
@romanemul1
7 months ago
So it's actually just pushing the model's learning in the right direction?
@SerranoAcademy
7 months ago
Great question, exactly! The model is already trained; this improves its results.
@sainulia
6 months ago
Amazing explanation!
@itsSandraKublik
7 months ago
Such a great video! ❤ So intuitive as always ❤
@SerranoAcademy
7 months ago
Yayyyy! Thank you Sandra!!!! 🤗
@somerset006
7 months ago
Thanks for the great video! Is it part of a playlist? You seem to be missing a playlist of the 4 videos at the end of this one.
@SerranoAcademy
7 months ago
Thanks for pointing it out! Yes, I forgot that part, I'll add it now!
@SerranoAcademy
7 months ago
And it's been added! Here's the playlist (1 more video to come): kzitem.info/door/PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-
@RoyDipta
3 months ago
Where is the DPO video? 🥹
@SerranoAcademy
3 months ago
Thanks for your interest! I’m working on it, but it still hasn’t fully clicked in my head. Hopefully soon! ☺️
@azurewang
3 months ago
@@SerranoAcademy Please let us know when it comes out! We are all waiting for it; it's very informative.
@SerranoAcademy
3 months ago
Hello! DPO is almost ready, coming out in a few days!
Comments: 48