Olivier Sigaud


12:07
21 hours ago

Direct policy search and reinforcement learning: taking better steps

9:03
1 day ago

Direct policy search and reinforcement learning: details about the policy gradient

14:29
1 day ago

Direct policy search and reinforcement learning: a quick overview of direct policy search methods

9:34
14 days ago

Direct policy search and reinforcement learning: introduction

12:52
28 days ago

Goal-conditioned reinforcement learning: state-based goal reachers

7:48
1 month ago

Goal-conditioned reinforcement learning: hindsight experience replay

14:05
1 month ago

Goal-conditioned reinforcement learning: curriculum

12:34
1 month ago

Goal-conditioned reinforcement learning: skill learners

11:12
1 month ago

Goal-conditioned reinforcement learning: typology of setters

15:37
1 month ago

Goal-conditioned reinforcement learning: frameworks and core concepts

10:47
1 month ago

Goal-conditioned reinforcement learning: Introduction

39:30
1 year ago

IMOL 2023 presentation: Towards Inferential Social Learning in Teachable Autotelic Agents

28:17
2 years ago

Data collection in SB3

9:29
2 years ago

Advantage Actor Critic

5:57
3 years ago

From Policy Gradient to Actor-Critic: Introduction (RLVS 2021 version)

4:46
3 years ago

Policy Gradient and Actor-Critic: wrap-up (RLVS 2021 version)

4:23
3 years ago

Policy Gradient and Reward Weighted Regression (RLVS 2021 version)

14:17
3 years ago

SAC and TQC (RLVS 2021 version)

16:53
3 years ago

DDPG and TD3 (RLVS 2021 version)

8:43
3 years ago

Proximal Policy Optimization (RLVS 2021 version)

11:05
3 years ago

TRPO and ACKTR (RLVS 2021 version)

12:50
3 years ago

On-Policy versus Off-Policy (RLVS 2021 version)

9:44
3 years ago

The bias-variance trade-off in Reinforcement Learning (RLVS 2021 version)

9:42
3 years ago

From Policy Gradient with baseline to Actor-Critic (RLVS 2021 version)

6:56
3 years ago

Policy Gradient Derivation (part 3/3) (RLVS 2021 version)

9:43
3 years ago

Policy Gradient Derivation (part 2/3) (RLVS 2021 version)

12:18
3 years ago

Policy Gradient derivation (part 1/3) (RLVS 2021 version)

7:53
3 years ago

The Policy Search Problem (RLVS 2021 version)

41:09
3 years ago

Coding tips for the Basic Policy Gradient lab

Comments

@sumertuncay
4 days ago
very dense and insightful, thanks for uploading!
@OlivierSigaud
3 days ago
Thanks for your kind comment, this is very rewarding
@forheuristiclifeksh7836
9 days ago
2:50 Sequential decision making ... Policy search
@ArtfulBenutzer
14 days ago
Re-watched the series already. I look forward to finding time to read your SVGG paper.
@ArtfulBenutzer
14 days ago
Thanks for the lesson, Olivier.
@ArtfulBenutzer
18 days ago
Slide 12 on Sequential Setters is especially interesting. I am looking forward to your course in HRL!
@ArtfulBenutzer
25 days ago
Thanks. This went a bit too fast for me, but it is good to have the references.
@OlivierSigaud
25 days ago
Thanks for watching, and thanks for the feedback. Unfortunately, the next videos will probably take significantly longer to come.
@ArtfulBenutzer
25 days ago
@OlivierSigaud No problem: there is plenty to digest in this series about GCRL. I intend to re-watch the series after this semester is over.
@ArtfulBenutzer
26 days ago
Great. Thanks.
@ArtfulBenutzer
26 days ago
Thank you for the info.
@ArtfulBenutzer
28 days ago
Very useful info. Thanks, Olivier.
@ArtfulBenutzer
29 days ago
I'm looking forward to watching the rest of this series.
@hirakjyotibasumatary5760
1 month ago
Great series Sir. Awaiting a lot more from you 🙂
@jiangpengli86
2 months ago
Thank you for this tutorial. It’s really clear and helpful 🎉.
@priya_cast_23
3 months ago
Could you explain why some agents fail when using TD3 and DDPG?
@OlivierSigaud
3 months ago
As explained in some of my videos, these algorithms suffer from bias: they update an estimate based on a previous (potentially wrong) estimate. So if you are unlucky with the data collected by the agent, or if the hyper-parameters were not tuned properly, the agent can go wrong.
@priya_cast_23
3 months ago
@OlivierSigaud Thank you very much for your input, professor.
@priya_cast_23
3 months ago
Thanks for all your explanations. They were detailed. Could you explain why the agent sometimes fails when using DDPG and TD3?
@naeemajilforoushan5784
5 months ago
This is another great one of your lectures. Thank you.
@naeemajilforoushan5784
5 months ago
Thank you again.
@naeemajilforoushan5784
5 months ago
Hello Professor, thank you for your time. This lecture on TRPO, ACKTR and PPO is still the best and most concise one on deep stochastic RL on KZitem.
@OlivierSigaud
5 months ago
Thank you very much for the rewarding comment :)
@finarwa3602
5 months ago
Very helpful lecture. Thank you!
@liyanglim2139
6 months ago
Thank you so much for taking the time to make the video. I was super confused by the derivation process when I came across your video, and it really helped me understand the idea behind policy gradient.
@SinaEbrahimi-ee3fq
6 months ago
Thanks. Very informative.
@hamzehal-hallaj5890
7 months ago
Where can I find the "how to code A2C" video? Also, you mentioned previously that PPO is explained explicitly. Is that video available on this channel? Thank you.
@takieddinesoualhi5827
7 months ago
Amazing presentation! Thanks for all the videos you are sharing with us.
@marioalan1547
7 months ago
Quick question regarding the final loss of the policy. Is the target critic used, or only the local one? The original papers don't clarify whether they use the target critic or the other one for the policy loss. Another question, also regarding the policy loss: is the input of the critic s_(t+1) or s_(t)? In the original paper, the critic in the policy loss takes s_(t) as input.
@OlivierSigaud
7 months ago
To compute the policy loss, it makes more sense to use the current critic rather than the target critic, as the former is more up-to-date than the latter (the latter is a slow tracker). And you should use s_t rather than s_{t+1}.
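A minimal sketch of the answer above, assuming a hypothetical 1-D linear actor and a hand-written critic (these toy functions are illustrative only, not the code from the videos or from any library): the policy loss is the negated mean of the current critic evaluated at s_t and at the action the actor proposes in s_t. The target critic would only appear in the critic's own TD target, which is not shown here.

```python
# Toy sketch with hypothetical functions; not the author's implementation.

def actor(state, theta):
    """Deterministic policy a = theta * s (hypothetical 1-D linear actor)."""
    return theta * state

def critic(state, action, w):
    """Hypothetical Q(s, a) = -(a - w * s)^2: the best action in s is w * s."""
    return -(action - w * state) ** 2

def actor_loss(states_t, theta, w_current):
    """DDPG/TD3-style policy loss: -mean of Q_current(s_t, actor(s_t))."""
    q_values = [critic(s, actor(s, theta), w_current) for s in states_t]
    return -sum(q_values) / len(q_values)

batch_s_t = [0.5, -1.0, 2.0]  # states s_t sampled from the replay buffer
w_current = 1.5               # parameters of the current (up-to-date) critic

loss_far = actor_loss(batch_s_t, theta=0.0, w_current=w_current)
loss_matched = actor_loss(batch_s_t, theta=1.5, w_current=w_current)
print(loss_far, loss_matched)
```

The loss is lowest when the actor's action in s_t matches what the current critic considers best, which is why the more recent critic is the natural choice here.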
@marioalan1547
7 months ago
@OlivierSigaud Thanks for the answer, Dr.!
@cagedgandalf3472
9 months ago
Would it be unnecessary to use HER to solve the inverted pendulum problem?
@OlivierSigaud
9 months ago
To use HER, you need to specify goals: the goals you want to achieve, and the goals you achieve in practice. So you have to add this notion of goal to the inverted pendulum environment.
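To make the two notions of goal concrete, here is a rough sketch of hindsight relabeling, assuming a hypothetical transition format (dicts with "achieved_goal", "desired_goal" and "reward" keys), goals expressed as pendulum angles, and a sparse reward with an arbitrary tolerance. It uses the simple "final" strategy, where the goal is replaced by what the episode actually achieved at its end.

```python
# Sketch under stated assumptions; field names and tolerance are hypothetical.

def sparse_reward(achieved_goal, desired_goal, tolerance=0.05):
    """0 when the achieved angle is close enough to the goal, -1 otherwise."""
    return 0.0 if abs(achieved_goal - desired_goal) <= tolerance else -1.0

def relabel_with_final_goal(episode):
    """Pretend the goal was the angle actually reached at the end of the episode."""
    new_goal = episode[-1]["achieved_goal"]
    return [
        {**tr,
         "desired_goal": new_goal,
         "reward": sparse_reward(tr["achieved_goal"], new_goal)}
        for tr in episode
    ]

# A failed episode: the desired angle was 0.0 rad but the pendulum ended at 0.6 rad.
episode = [
    {"achieved_goal": angle, "desired_goal": 0.0,
     "reward": sparse_reward(angle, 0.0)}
    for angle in (1.2, 0.9, 0.6)
]
relabeled = relabel_with_final_goal(episode)
print([tr["reward"] for tr in episode], [tr["reward"] for tr in relabeled])
```

After relabeling, the last transition carries a success signal for the goal 0.6 rad, even though the original goal was never reached.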
@cagedgandalf3472
9 months ago
@OlivierSigaud I am currently working on my thesis as an undergraduate in electronics engineering. I don't have much knowledge of machine learning, so I apologize if I have the concepts wrong. I understand what you are saying, but in the inverted pendulum environment the reward function is a cosine function. I was wondering whether the cosine function would count as the "goal" in HER. From what I understand, HER sets a goal and the goal changes; in this case it depends on the angle of the pendulum, and the goal moves towards 0 deg. Is this not the same as the cosine function, which increases the reward the closer the angle is to 0 deg?
@OlivierSigaud
9 months ago
@cagedgandalf3472 You may set different goals as different angles, and the agent would have to reach a particular angle. But this is quite artificial. You should first try your favorite "standard RL" algorithm without trying to introduce a goal. There is already a lot to be learned about standard RL before moving to goal-conditioned RL...
@cagedgandalf3472
9 months ago
@OlivierSigaud Yes, I understand. I have tried TD3 and SAC on this problem and found that TD3 works better and learns faster, since the environment is deterministic. Then I tried the paper on FORK (forward-looking actor; I don't know if I can send links on YT, but it is by Wei et al.), which trained faster than vanilla TD3. I then stumbled upon HER, and this is where I am now. So far, TD3 with FORK has worked in simulation, and I am now at the real-world stage. I was just wondering whether HER would work better than FORK, and I am planning to test it. Edit: I guess I am trying to find which algorithm is the most sample efficient.
@OlivierSigaud
9 months ago
@cagedgandalf3472 OK, please keep me posted, I'd be glad to know the results of your attempts. You may have to design a curriculum with more and more difficult goals. You may have a look at these slides (after slide 22) for basic notions about GCRL (goal-conditioned RL): master-dac.isir.upmc.fr/rl/advanced.pdf
@nad4153
9 months ago
thank you
@shuozheli5723
9 months ago
Prof. Sigaud, you are a role model in my mind.
@OlivierSigaud
9 months ago
Thanks for your nice comment, this is very rewarding.
@sameerap8943
9 months ago
Slang is degrading your video. My advice is to use normal pronunciation without any slang.
@imamad
9 months ago
The content is good, but the accent makes it hard to follow.
@OlivierSigaud
9 months ago
Sorry, I'm doing my best 😋
@jasonp-m169
6 months ago
@OlivierSigaud Content is good, and accent is good too! Thanks for these amazing resources.
@OlivierSigaud
6 months ago
@jasonp-m169 Thanks for the support :)
@Matellois1996
10 months ago
On slide 9 it should be grad Loss (theta), right? (super tutorial BTW). The Loss = -J(theta)
@OlivierSigaud
9 months ago
Hi Jim, thanks for the nice comment. I'm not sure what you mean: in pytorch, you compute the loss, and then loss.backward() applies the gradient of the loss to the network parameters. Here, I specify the loss computation in the pytorch sense, see slide 8 for the relationship to the gradient...
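The relationship discussed in this exchange can be illustrated without any deep learning library. The sketch below assumes a hypothetical toy objective J(theta) = -(theta - 3)^2 with hand-written gradients, and shows that minimizing loss = -J by gradient descent performs exactly the same updates as maximizing J by gradient ascent.

```python
# Toy objective (hypothetical): J(theta) = -(theta - 3)^2, maximal at theta = 3.

def grad_J(theta):
    return -2.0 * (theta - 3.0)

def grad_loss(theta):
    # loss(theta) = -J(theta), so its gradient is the opposite of grad_J.
    return 2.0 * (theta - 3.0)

learning_rate = 0.1
theta_ascent = 0.0   # updated by gradient ascent on J
theta_descent = 0.0  # updated by gradient descent on loss = -J

for _ in range(50):
    theta_ascent += learning_rate * grad_J(theta_ascent)
    theta_descent -= learning_rate * grad_loss(theta_descent)

print(theta_ascent, theta_descent)
```

Both parameters follow the same trajectory towards theta = 3, which is why a framework that only minimizes can still be used to maximize J.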
@edk.2302
10 months ago
Hi! I have been watching your RL video series with the slides printed out. For this particular video, though, the slides do not seem to match the lecture exactly. Lecture: RL 5 off policy vs on policy. Slides: From Policy Gradient to Actor-Critic methods, On-policy versus Off-policy. I see the overlap in the titles, so could it be that you updated the slides somewhat, and that these are the slides I can use to take notes while watching this lecture?
@OlivierSigaud
10 months ago
Yes, some of the videos are now quite old, while my teaching slides are continuously improving, so there is often a discrepancy between the slides and the videos. At some point I should redo the videos, but this is very time consuming, sorry about that.
@edk.2302
10 months ago
@OlivierSigaud Got it! Thank you for your reply!
@Krath1988
11 months ago
Thank you for the info. You helped me realize that my learning operations were fundamentally 1-step but the entire system was built to be N-step. Which explains a lot.
@alexanderdimov7329
1 year ago
I don't understand how the entropy on slide 19 can be negative: H = -|A|, where |A| is the number of actions. For example, if it is 1, then H = -1? And what about loss(alpha)? It looks like a linear function of alpha, so alpha will always increase. Or am I misunderstanding something?
@OlivierSigaud
1 year ago
\bar{H} in slide 19 is the target entropy. In a continuous action space, your target entropy should be minus the dimension of the action space, computed e.g. with: -np.prod(env.action_space.shape).astype(np.float32) You can have a look at this paper, end of section 3: arxiv.org/pdf/2209.10081.pdf The loss on the alpha optimizer is entropy_coef_loss = -(log_entropy_coef * (action_logprobs + target_entropy)).mean(), so it is not trivially linear
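A rough pure-Python sketch of these two formulas, assuming a hypothetical 2-dimensional action space and made-up batches of log-probabilities (the helper names are illustrative, not from a library). It also computes the gradient of the loss with respect to log_entropy_coef by hand, to show that its sign depends on whether the policy's estimated entropy, mean(-logprob), is above or below the target.

```python
import math

def target_entropy(action_shape):
    """Minus the dimension of the action space, e.g. (2,) -> -2.0."""
    return -float(math.prod(action_shape))

def entropy_coef_loss(log_entropy_coef, action_logprobs, target_ent):
    """-(log_alpha * (logprob + target_entropy)).mean(), written with lists."""
    terms = [log_entropy_coef * (lp + target_ent) for lp in action_logprobs]
    return -sum(terms) / len(terms)

def grad_wrt_log_coef(action_logprobs, target_ent):
    """d loss / d log_alpha = -mean(logprob + target_entropy)."""
    return -sum(lp + target_ent for lp in action_logprobs) / len(action_logprobs)

target = target_entropy((2,))   # hypothetical 2-D action space
peaked_policy = [3.0, 3.5]      # high log-probs: entropy estimate -3.25 < target
spread_policy = [0.5, 1.0]      # lower log-probs: entropy estimate -0.75 > target

# Negative gradient: gradient descent raises log_alpha (more entropy bonus).
print(grad_wrt_log_coef(peaked_policy, target))
# Positive gradient: gradient descent lowers log_alpha (less entropy bonus).
print(grad_wrt_log_coef(spread_policy, target))
```

So, for fixed data the loss is linear in log_alpha, but the slope changes sign as the policy's entropy crosses the target, which is what keeps alpha from growing without bound.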
@alexanderdimov7329
1 year ago
@OlivierSigaud Well, I understand: it's like a regulator that increases or decreases depending on the current entropy. But why is the entropy for continuous actions negative, or is it just a name?
@alexanderdimov7329
1 year ago
@OlivierSigaud I set something like this for discrete actions: H = tf.reduce_sum(y_pi1*logpi,axis=-1) lossa = - self.alphav*(H-tf.math.log(pt)*0.98). Let's see what happens.
@followsufism
1 year ago
Worst explanation of DDPG!
@OlivierSigaud
1 year ago
Thank you :) Don't you want to elaborate on your criticism?
@bertobertoberto242
1 year ago
2023 Deep RL student here. Thank you for what you have published. However, in slide 8 it is not clear how you obtain those matrices: did you fix the other 2 dimensions arbitrarily, or is it a partial dependence, so that for each velocity/angle pair you average over the other 2 dimensions?
@OlivierSigaud
1 year ago
Thank you for the good question. For the other two dimensions, I take small random numbers (close to 0).
@bertobertoberto242
1 year ago
@OlivierSigaud Thank you so much, that makes sense... Thank you again for the material. Are you planning to release new material? I can't find any video on "open problems in DeepRL"; it would be very interesting for new practitioners to know where the boundary can be pushed further.
@OlivierSigaud
1 year ago
@bertobertoberto242 Thanks for the suggestion, this is a good idea. My plan is to release a set of lessons about goal-conditioned reinforcement learning this summer. I'm working hard on it, but this is a vast domain...
@rezarawassizadeh4601
1 year ago
Very good explanation, thanks
@wxcvbnazerty7159
1 year ago
Hello, thank you for the video. Do you offer courses in French?
@OlivierSigaud
1 year ago
For that, you have to come and attend my courses at Sorbonne Université :)
@wxcvbnazerty7159
1 year ago
@OlivierSigaud Noted! Are you planning to run a 5-day reinforcement learning program again in 2023?
@OlivierSigaud
1 year ago
@wxcvbnazerty7159 Yes, in June. This time we are even planning 6 days (3 times Monday-Tuesday), but participants can attend 2, 4 or 6 days. The website should be updated soon...
@wxcvbnazerty7159
1 year ago
@OlivierSigaud Thank you, I will make sure I am available to apply. Have a nice end of the day! Marc
@OlivierSigaud
1 year ago
@wxcvbnazerty7159 You will need to contact the continuing education office and send your CV, as indicated on the website. Enjoy the end of the year :)
@hongkyulee9724
1 year ago
Your explanations and slides are very intuitive for me :D Thank you very much for the nice video. I wish you happiness :D
@OlivierSigaud
1 year ago
Thanks for your kind message
@faroukdeutsch4116
1 year ago
Hello, thank you for sharing the videos. I have a question about slide 9 (11:30). Why does the actor diagram take 5 states, and why does the critic diagram take three actions? I had understood (surely mistakenly) that the actor takes the current state as input and outputs the distribution over the best possible actions, and that the critic takes the best action together with the current state to give an estimate of Q(s,a). Thank you for your help.
@OlivierSigaud
1 year ago
Hello Farouk, the 5 inputs of the network do not represent 5 states, but one state with 5 dimensions. For example, for CartPole, the state has 4 dimensions: the position, the velocity, the angle of the pole and its angular velocity... Likewise, the output is an action with 3 dimensions: for example, the thrusts of 3 engines in LunarLander...
@faroukdeutsch4116
1 year ago
@OlivierSigaud Ah, OK. In that case I understand. Thank you very much for taking the time to answer me. Have a pleasant end of the day.
@卢成龙
1 year ago
Why did ICLR reject such excellent work? Twice!!! I was frustrated.
@卢成龙
1 year ago
Very nice job!!!
@sleepingbeauty7911
2 years ago
These lectures are great! Is there a rule of thumb as to what the value of lambda should be? I understand it will be on a case-by-case basis, but what should we typically start with and then tune from that point? Thanks!
@OlivierSigaud
2 years ago
Sorry for the delay. I usually take lambda ~= 0.9, following hyper-parameters from the RL baselines3 zoo here : github.com/DLR-RM/rl-baselines3-zoo/blob/f1064a7e8c4c19d6599e84fd73c684b158de1e56/hyperparams/a2c.yml#L54
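To show what this parameter trades off, here is a minimal sketch assuming that lambda is the λ of generalized advantage estimation (called gae_lambda in Stable-Baselines3) and that the rollout contains no episode termination. The function and the toy numbers are illustrative, not taken from the videos.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.9):
    """Generalized advantage estimates for one rollout without termination.

    lam = 0 gives one-step TD errors (low variance, more bias);
    lam = 1 gives Monte Carlo returns minus the value (no bootstrap bias
    inside the rollout, more variance).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error at step t
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]   # hypothetical critic outputs V(s_t)

print(gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=0.0))
print(gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=1.0))
print(gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=0.9))
```

A value around 0.9 sits between the two extremes, leaning towards the Monte Carlo side.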
@jpark7636
2 years ago
Thank you for your great lectures and insights regarding RL algorithms. You said, "if you start with a replay buffer full of ___ samples then those algorithms will fail" (10:25). What is ___? I couldn't hear clearly what you said, and this sentence seems very important! :)
@OlivierSigaud
2 years ago
It was "full of random (or uniformly distributed) samples". In practice, I'm drawing random states and random actions in those states. Sorry for the imperfect sound in my videos...
@nantunest
2 years ago
Excellent class, very well explained! Please, make a playlist for the reinforcement learning subject. That would be great!
@OlivierSigaud
2 years ago
Well, the lesson you found is the second in my tabular reinforcement learning playlist... So I'm quite sure that with a short search, your wish will be fulfilled. :)
@aytackas4977
2 years ago
Assalamu alaykum (peace be upon you). It's amazing that you explain the details, which are usually foundational points, along with the development history of these algorithms. Your videos are really underrated.
@OlivierSigaud
2 years ago
Thank you for your kind comment. If you think these videos are underrated, don't hesitate to advertise them to people who might be interested.
@aytackas4977
2 years ago
@OlivierSigaud You're welcome. Indeed, I'm planning to promote these videos whenever I come across someone who might be interested.
@RaoBlackWellizedArman
2 years ago
Thank you for the great lecture. I have a question though... In slide 4, you mentioned that the critic is thrown away after each iteration (rollout) in approach two. I do not understand why we should throw away the critic. Isn't it true that we can always improve the critic using MC updates in the next rollout (not bootstrapping) without introducing bootstrap bias? If we throw away the critic, training a fresh DNN with random weights will take a whole lot of computation and time! What is the advantage of throwing it away?
@OlivierSigaud
2 years ago
Excellent question, thank you for asking. In on-policy methods, you should compute the gradient of your policy using only data from this policy. Therefore, for the policy update, you need to throw away the previous data. So it might seem logical to also compute a temporary critic just from the same data and throw away the previous one. But actually, it is not necessary. And in general, on-policy methods do not do so. So you are completely right and in that respect, my video is inaccurate. So the only difference that remains between policy-gradient-with-baseline and actor-critic methods is that in the latter, the critic must be updated with a bootstrap (temporal difference) mechanism whereas in the former, it is most often updated with a Monte Carlo estimation approach.
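The distinction drawn at the end of this answer can be sketched on a single toy rollout. The helper names and numbers below are hypothetical; the point is only that Monte Carlo targets depend on observed rewards alone, while TD(0) targets also depend on the critic's current (possibly wrong) estimates of the next states.

```python
def monte_carlo_targets(rewards, gamma):
    """Critic targets from full discounted returns (no bootstrapping)."""
    targets, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    return targets[::-1]

def td0_targets(rewards, next_values, gamma):
    """Critic targets r_t + gamma * V(s_{t+1}), using the critic's own estimates."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]

rewards = [0.0, 0.0, 1.0]        # one episode, reward only at the end
next_values = [0.0, 0.0, 0.0]    # V(s_{t+1}) from a hypothetical untrained critic
                                 # (the last next state is terminal)

print(monte_carlo_targets(rewards, gamma=0.5))
print(td0_targets(rewards, next_values, gamma=0.5))
```

With an untrained critic, the TD(0) targets for the first two steps carry no information about the final reward yet; that information only propagates backwards over repeated updates, which is the source of the bootstrap bias mentioned above.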
@RaoBlackWellizedArman
2 years ago
Thank you very much for your prompt reply and thank you for all the effort you put into making these videos. Keep up the good work
@aytackas4977
2 years ago
God willing, I'll cite this video in my thesis if I can make it on time.
@OlivierSigaud
2 years ago
I'm happy it was useful
@astaragmohapatra9
2 years ago
Thank you for such amazing lectures. Your RL lectures are seminal for me. Can you release the video with English translation?
@OlivierSigaud
2 years ago
I don't know how to do this. If there is an automated way to do it, I'm interested. Otherwise, I won't find time to translate it myself.
@astaragmohapatra9
2 years ago
@OlivierSigaud Alright, I will try searching for it. Or I can do it from the closed captions. Thanks
@OlivierSigaud
2 years ago
@astaragmohapatra9 Yes, at the lower right of the video there is a button for subtitles (the second one); then, in the settings, you choose English. The French subtitles are not very accurate, so neither are the English ones, but I hope they will suffice...
@etaifour2
2 years ago
You’ve explained this in a fantastic, structured way. Kudos. Hats off. Thank you.
@OlivierSigaud
2 years ago
Thanks, I'm glad you like it
@ThePRASANTHof1994
2 years ago
@Olivier I've never come across 10-minute episodes where such dense material is covered so elegantly. Your slides are well structured to give a pictorial understanding of the concepts. You recap the previous topics every time and summarize after every episode. This is one of the best RL tutorials and deserves more attention!
@OlivierSigaud
2 years ago
Thanks a lot for these kind comments, I'm glad you like it and it motivates me a lot to continue adding content on my channel. If you think these videos deserve more attention, don't hesitate to advertise them to your colleagues and friends... ;)
@ryadhcherifi7897
2 years ago
Thank you for this simple and effective explanation. Could DDPG be effective in problems involving graphs? For example, in a problem of mapping a node onto a graph, where the number of nodes in the graph varies and changes each time, which implies that the action set (the set of nodes of the graph) changes.
@OlivierSigaud
2 years ago
Hello. Reinforcement learning algorithms are generally not designed to handle the case where the action set is variable. See Matthieu Seurin's thesis to learn more.
@ryadhcherifi7897
2 years ago
@OlivierSigaud Thank you for your answer; a really interesting thesis on the different aspects of defining the action set!