Hey Nathan, your research seems to defend PPO over DPO, but the most recent large models, Llama 3.1 and Nemotron-4, don't use PPO. They just use DPO with rejection sampling. In fact, the Llama 3.1 paper chooses DPO mainly for ease of compute. What are your thoughts on this? Is PPO more relevant for small- to medium-sized LLMs? At large scale, is DPO (with clever rejection sampling) enough?
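For context, the standard DPO objective (Rafailov et al., 2023) is an offline loss over preference pairs $(x, y_w, y_l)$, which makes the "ease of compute" point concrete: unlike PPO, training requires no reward model queries or on-policy rollouts.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Here $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\beta$ controls the strength of the implicit KL constraint, and $\sigma$ is the sigmoid function.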
@natolambert
2 months ago
@sumanthbalaji1768 I'll write an update on this soon at www.interconnects.ai/ :)