Python code for "Reinforcement Learning from Human Feedback" (RLHF) on a LLama 2 model with 4-bit quantization, LoRA, and the new DPO method from Stanford University (instead of the older PPO). Fine-tune LLama 2 with DPO.
A1. Code for supervised fine-tuning (SFT) of a LLama 2 model with 4-bit quantization.
A2. Code for Hugging Face's DPO trainer with PEFT, LoRA, 4-bit bitsandbytes, ...
B1. Code for supervised fine-tuning of a LLama 1 model with 4-bit quantization and LoRA.
B2. Code for reward modelling of a LLama 1 model with 4-bit quantization.
B3. Code for reinforcement learning (RL) training of a LLama 1 model with 4-bit quantization.
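The key difference between the A and B pipelines above: DPO removes the separate reward-modelling and RL stages (B2/B3) and trains directly on preference pairs. As a minimal pure-Python sketch (function name and numbers are illustrative, not from the video's code), the per-pair DPO loss compares the policy's chosen-vs-rejected log-probability ratio against a frozen reference model's ratio:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * (policy_logratio - reference_logratio))."""
    policy_logratio = policy_logp_chosen - policy_logp_rejected
    ref_logratio = ref_logp_chosen - ref_logp_rejected
    margin = beta * (policy_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When policy and reference agree, the margin is 0 and the loss is log(2).
loss_neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# When the policy prefers the chosen answer more than the reference does,
# the loss drops below log(2).
loss_better = dpo_loss(-10.0, -20.0, -15.0, -15.0)
```

Lowering this loss pushes the policy to rank the human-preferred completion above the rejected one, without ever fitting an explicit reward model.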
All rights remain with the authors of the .py files and Hugging Face, as listed:
--------------------------------------------------------------------------------
LLama 2 model RLHF with DPO in 4-bit with LoRA:
github.com/huggingface/trl/tr...
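Both linked scripts apply LoRA on top of the 4-bit base model. As a framework-free sketch of the idea (sizes and values are illustrative): the frozen weight W is never updated; instead two small matrices A (r x d) and B (d x r) with rank r much smaller than d are trained, and the effective weight is W + (alpha / r) * B @ A.

```python
# Pure-Python LoRA sketch: train only the low-rank update, keep W frozen.
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha, r):
    delta = matmul(B, A)            # d x d update built from d*r + r*d params
    scale = alpha / r               # standard LoRA scaling
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

d, r, alpha = 4, 1, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
A = [[0.5] * d]                     # r x d, trainable
B = [[1.0] for _ in range(d)]       # d x r, trainable
W_eff = lora_effective_weight(W, A, B, alpha, r)
```

Here only 2 * r * d = 8 numbers are trained instead of d * d = 16; at LLama scale (d in the thousands, r around 8-64) that gap is what makes fine-tuning fit on a single GPU next to the 4-bit base weights.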
LLama 1 model RLHF with PPO in 4-bit with LoRA:
github.com/huggingface/trl/tr...
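For contrast with DPO, the B3 stage maximizes a PPO-style objective using rewards from the B2 reward model. A minimal pure-Python sketch of the standard PPO-clip surrogate for a single action (not the video's code; in full RLHF the reward also carries a KL penalty against the reference model):

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-clip surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r is the new/old policy probability ratio."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# If the policy is unchanged (ratio = 1), the objective equals the advantage.
same = ppo_clipped_objective(0.0, 0.0, 1.0)
# A large policy shift gets clipped: with ratio e ~ 2.72 and eps = 0.2,
# the positive-advantage objective is capped at 1.2 * advantage.
capped = ppo_clipped_objective(1.0, 0.0, 1.0)
```

The clipping is why PPO needs the old-policy log-probs and an advantage estimate per step; DPO avoids all of this machinery, which is the main point of the A pipeline.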
#llama2
#reinforcementlearning
#aieducation