Python code for "Reinforcement Learning from Human Feedback" (RLHF) on a LLama 2 model with 4-bit quantization, LoRA, and the new DPO method from Stanford University (instead of the older PPO). Fine-tune LLama 2 with DPO.
A1. Code for supervised fine-tuning (SFT) of a LLama 2 model with 4-bit quantization.
A2. Code for Hugging Face's DPO trainer with PEFT, LoRA, 4-bit bitsandbytes, ...
B1. Code for supervised fine-tuning of a LLama 1 model with 4-bit quantization and LoRA.
B2. Code for reward modelling of a LLama 1 model with 4-bit quantization.
B3. Code for reinforcement learning (RL) training of a LLama 1 model with 4-bit quantization.
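The key difference between the A and B pipelines above: DPO removes the separate reward-modelling and RL stages (B2/B3) and trains directly on preference pairs. As a minimal pure-Python sketch (function name and numbers are illustrative, not from the video's code), the per-pair DPO loss compares the policy's chosen-vs-rejected log-probability ratio against a frozen reference model's ratio:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * (policy_logratio - reference_logratio))."""
    policy_logratio = policy_logp_chosen - policy_logp_rejected
    ref_logratio = ref_logp_chosen - ref_logp_rejected
    margin = beta * (policy_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When policy and reference agree, the margin is 0 and the loss is log(2).
loss_neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# When the policy prefers the chosen answer more than the reference does,
# the loss drops below log(2).
loss_better = dpo_loss(-10.0, -20.0, -15.0, -15.0)
```

Lowering this loss pushes the policy to rank the human-preferred completion above the rejected one, without ever fitting an explicit reward model.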
All rights remain with the authors of the .py files and Hugging Face, as listed:
--------------------------------------------------------------------------------
LLama 2 model RLHF with DPO in 4-bit with LoRA:
github.com/huggingface/trl/tr...
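Both linked scripts apply LoRA on top of the 4-bit base model. As a framework-free sketch of the idea (sizes and values are illustrative): the frozen weight W is never updated; instead two small matrices A (r x d) and B (d x r) with rank r much smaller than d are trained, and the effective weight is W + (alpha / r) * B @ A.

```python
# Pure-Python LoRA sketch: train only the low-rank update, keep W frozen.
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha, r):
    delta = matmul(B, A)            # d x d update built from d*r + r*d params
    scale = alpha / r               # standard LoRA scaling
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

d, r, alpha = 4, 1, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
A = [[0.5] * d]                     # r x d, trainable
B = [[1.0] for _ in range(d)]       # d x r, trainable
W_eff = lora_effective_weight(W, A, B, alpha, r)
```

Here only 2 * r * d = 8 numbers are trained instead of d * d = 16; at LLama scale (d in the thousands, r around 8-64) that gap is what makes fine-tuning fit on a single GPU next to the 4-bit base weights.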
LLama 1 model RLHF with PPO in 4-bit with LoRA:
github.com/huggingface/trl/tr...
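For contrast with DPO, the B3 stage maximizes a PPO-style objective using rewards from the B2 reward model. A minimal pure-Python sketch of the standard PPO-clip surrogate for a single action (not the video's code; in full RLHF the reward also carries a KL penalty against the reference model):

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-clip surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r is the new/old policy probability ratio."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# If the policy is unchanged (ratio = 1), the objective equals the advantage.
same = ppo_clipped_objective(0.0, 0.0, 1.0)
# A large policy shift gets clipped: with ratio e ~ 2.72 and eps = 0.2,
# the positive-advantage objective is capped at 1.2 * advantage.
capped = ppo_clipped_objective(1.0, 0.0, 1.0)
```

The clipping is why PPO needs the old-policy log-probs and an advantage estimate per step; DPO avoids all of this machinery, which is the main point of the A pipeline.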
#llama2
#reinforcementlearning
#aieducation