This video is really helpful. Thanks for sharing 🙌 But could you please share with us which versions of the libraries you are using?
@WhisperingAI
A year ago
All the libraries are the up-to-date ones, but I guess transformers was 4.29.
@_SHRUTIDAYAMA
9 months ago
Hey, this video is really helpful. Can you please tell how to give input and generate output after step 3? Also, when we create a UI, how will feedback from the UI be given to the policy model? Can you please make a video on it? It would be really helpful! Thanks :)
@WhisperingAI
9 months ago
Sure, I will try creating a short video for it within a couple of days.
@_SHRUTIDAYAMA
9 months ago
That will really be helpful! Thanks :) @@WhisperingAI
@shrutidayama8193
8 months ago
Hey, it would really be helpful if you make it... please help me.
@WhisperingAI
8 months ago
@@shrutidayama8193 It will be uploaded tomorrow. Thanks.
@mohamedsatti3038
A year ago
Thank you.
@WhisperingAI
A year ago
Glad you liked it.
@talhaanwar2911
A year ago
Thanks.
@dibyanshuchatterjee4126
6 months ago
Great video. Just a quick question: is it possible to intercept just the reward model's output for an LLM response, before the reward produced for each response goes back into the LLM from the reward model? Meaning, is there any way to use just the reward model to see which LLM responses were good vs. bad and store those results?
@WhisperingAI
6 months ago
Yes, you can. In step 3 there is a line that takes the result from the policy model and passes it to the reward model for a score. You can print that output.
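A minimal sketch of the step-3 PPO loop showing where those scores can be printed or stored before the update (variable names such as reward_model and reward_tokenizer are illustrative, not the notebook's exact code):

```python
# Sketch: intercept the reward scores inside the PPO loop before the policy update.
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Policy model generates a response for each query
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=64)
    batch["response"] = ppo_trainer.tokenizer.batch_decode(response_tensors)

    # Reward model scores each (query, response) pair
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    enc = reward_tokenizer(texts, return_tensors="pt", padding=True)
    rewards = [score for score in reward_model(**enc).logits[:, 0]]

    # Intercept here: print or persist the scores to see good vs. bad responses
    for resp, score in zip(batch["response"], rewards):
        print(f"reward={score.item():.3f}  response={resp[:80]}")

    ppo_trainer.step(query_tensors, response_tensors, rewards)
```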
@Joshua-m1j
A year ago
Thanks for your insightful sharing. The fine-tuned Llama 2 model returns an incomplete last sentence. Do you have a way to solve this?
@WhisperingAI
A year ago
Try increasing the max length during inference.
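A minimal sketch of what increasing the max length at inference time can look like (parameter values are illustrative):

```python
# Raise the generation budget so the last sentence is not cut off.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,  # increase this if the output is still truncated
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```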
@HaroldKouadio-gj7uw
3 months ago
What about doing a translation task with the LLMs and reinforcing it with RLHF?
@WhisperingAI
3 months ago
We can do that.
@rajeepthapa5426
3 months ago
Are you Nepali?
@ivanleung6034
8 months ago
I notice the reward model structure is the same as the fine-tuned model. As someone said, we could use a small model with much fewer parameters and layers for the reward model; that would work too, right?
@WhisperingAI
8 months ago
That works, but the reward model is basically a sequence classification model with one head, so the output produced is only one logit. I guess that is handled internally by the trl library.
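A sketch of that idea, assuming the architecture exposes a sequence-classification head in transformers (model name taken from the video; a smaller backbone could be swapped in the same way):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# num_labels=1 gives a single head, so the model returns one logit per input,
# and that logit is used directly as the reward score.
reward_tokenizer = AutoTokenizer.from_pretrained("bigcode/tiny_starcoder_py")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "bigcode/tiny_starcoder_py", num_labels=1
)

enc = reward_tokenizer("post text ... TL;DR: a candidate summary", return_tensors="pt")
score = reward_model(**enc).logits[0, 0]  # scalar reward score
```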
@mahmoudmohamed-lr9ql
4 months ago
Does this use a reference model and KL divergence?
@WhisperingAI
4 months ago
Yes, it uses both.
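Roughly how that looks in the trl PPO setup (values are illustrative; trl can also build the reference model automatically if you pass None):

```python
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, create_reference_model

policy = AutoModelForCausalLMWithValueHead.from_pretrained("summarization_policy_new/")
ref_model = create_reference_model(policy)  # frozen copy, used only for the KL term
tokenizer = AutoTokenizer.from_pretrained("summarization_policy_new/")

config = PPOConfig(
    batch_size=8,
    init_kl_coef=0.2,  # weight of the KL penalty between policy and reference model
    target=6.0,        # target KL for the adaptive KL controller
)
ppo_trainer = PPOTrainer(config, policy, ref_model, tokenizer)
```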
@cookiearmy2960
7 months ago
How is the reward model trained? Can anyone explain in detail? I know we used the StarCoder model with chosen and rejected input ids, but how are these mapped to a particular score, since the output of the reward model is not always binary and it returns logits as its output? How is it done here?
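A hedged sketch of the pairwise ranking objective that trl's RewardTrainer applies to chosen/rejected pairs: no target score is ever given, the model only has to score the chosen completion higher than the rejected one (column names follow trl's convention; this is an illustration, not the notebook's exact code):

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, batch):
    # One scalar logit per sequence (the reward model has num_labels=1)
    chosen_scores = reward_model(
        input_ids=batch["input_ids_chosen"],
        attention_mask=batch["attention_mask_chosen"],
    ).logits.squeeze(-1)
    rejected_scores = reward_model(
        input_ids=batch["input_ids_rejected"],
        attention_mask=batch["attention_mask_rejected"],
    ).logits.squeeze(-1)

    # Bradley-Terry style loss: push the chosen score above the rejected score
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```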
@talhaanwar2911
A year ago
Can you create a tutorial on inference for this?
@WhisperingAI
A year ago
Sure.
@ManethGamage
10 months ago
Can you share the Jupyter folder? I don't have any idea about the paths.
@WhisperingAI
10 months ago
I have updated the code base path; let me know if you still don't understand.
@ManethGamage
10 months ago
@@WhisperingAI How do I train a chatbot iteratively with user feedback, i.e. train the chatbot over time from users' interactions and feedback? How do I do that with pretrained models?
@WhisperingAI
10 months ago
@@ManethGamage The answer is simple: you must keep retraining the model if you wish to train it with user feedback. Take the model out of production and train it on the data, including older versions of the data with timestamps. Evaluate the model's performance and, if it is good, move the retrained model back into production. If you don't want to do that, the RAG method can be used in this case; you can check this video of mine if it helps: kzitem.info/news/bejne/pZhpvKGsf6CpbH4
@denidugamage2096
10 months ago
@@WhisperingAI My idea is creating a law-providing chatbot. It's a group project; others are doing the chatbot part and the NLP part, and my part is RLHF. The chatbot dataset is the constitution. How do I train my chatbot with user feedback? I'm asking you because I don't have any idea about it 🥲
@WhisperingAI
10 months ago
Please watch my earlier video kzitem.info/news/bejne/o2uazpWQiXyCq2U if you want to do it with feedback. I used the Amazon reviews there.
@sanduntharaka4256
7 months ago
Can we use the same code for Llama 2?
@WhisperingAI
7 months ago
Yes, you can, but I guess you can't run it on Google Colab unless you use LoRA or 4-bit.
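A sketch of what 4-bit loading plus a LoRA adapter could look like for Llama 2 on a free GPU (model name and hyperparameters are illustrative, not the notebook's exact code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",       # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```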
@sanduntharaka4256
7 months ago
@@WhisperingAI I'm using Kaggle notebooks. I have created the policy model, but the reward training gives IndexError: index out of range in self. Why?
@sanduntharaka4256
7 months ago
@@WhisperingAI I have also executed your exact code in a high-RAM environment, but it gives the same error: IndexError: index out of range in self. I want to apply RLHF to Llama 2, and your video is the only one I found that relates to RLHF.
@WhisperingAI
7 months ago
There might be some issue while loading the dataset or tokenizing. Can you share which step you are facing this issue on?
@WhisperingAI
7 months ago
Please check your dataloader and try running each step individually.
@sym-t5k
A year ago
First, thank you for the good video. I would like to ask you two questions. 1) In the third part of the Colab code, kzitem.info/news/bejne/s2imx6minGWBZYI , I am confused about which model goes into the "model_path" of "starcoder_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_path)". Is "model_path" "bigcode/tiny_starcoder_py" or "summarization_policy_new/"? Which one is correct? 2) Can I think of the first part ("Creating the policy model for human Evaluation") of the Colab code from the previous video as SFT training, and can I think of the resulting policy model as the SFT model?
@WhisperingAI
A year ago
That's actually the policy model we trained in the first step, i.e. summarization_policy_new/, since in step 3 we are refining the model from step 1 with the reward. Hope that clarifies it. If you have any questions, feel free to ask; I would love to help.
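In other words, the step-3 loading looks like this (paths as discussed above):

```python
from trl import AutoModelForCausalLMWithValueHead

model_path = "summarization_policy_new/"  # policy fine-tuned in step 1, not the base checkpoint
starcoder_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_path)
```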
@sym-t5k
A year ago
@@WhisperingAI Thank you, it's clear. Could you please answer the second question regarding the SFT?
@WhisperingAI
A year ago
For the second question: don't think of it that way. In the first step we are simply fine-tuning the model (the model can be anything, like Llama, GPT, or StarCoder). SFT is just the library we are using to avoid writing the PyTorch code for the dataloader and training loop. So the policy model that results from the first step is a fine-tuned model, not an "SFT model".
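A minimal sketch of that first step, using trl's SFTTrainer purely as a convenience wrapper around the usual PyTorch training loop (dataset and arguments are illustrative):

```python
from transformers import TrainingArguments
from trl import SFTTrainer

trainer = SFTTrainer(
    model="bigcode/tiny_starcoder_py",
    train_dataset=train_dataset,          # assumed: a dataset with a "text" column
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(output_dir="summarization_policy_new/", num_train_epochs=1),
)
trainer.train()
trainer.save_model("summarization_policy_new/")  # fine-tuned policy used in later steps
```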
@sayansamanta3775
11 months ago
@@WhisperingAI Hey, can you please tell which model we are using in the second step where we have MODEL_PATH = "model/"? Is it bigcode/tiny_starcoder_py or the policy model trained in the first step?
Comments: 61