Thanks for that! It's so much better than any graduate-level RL intro I've watched.
@prerakmathur1431
2 years ago
This guy is seriously the god of reinforcement learning. He and Andrew Ng have single-handedly transformed ML. Kudos to you, Pieter.
@florentinrieger5306
1 year ago
Don't forget David Silver!
@unionsafetymatch
2 years ago
I don't believe what I've stumbled upon. This is amazing!
@danielmoreiradesousa185
6 months ago
This is some of the best content I've seen in a long time. Congratulations, and thank you so much!
@Prokage
3 years ago
Thank you for everything you've done for the field over the years, Dr. Abbeel.
@Shah_Khan
3 years ago
Thank you Pieter for bringing the latest lecture series on Deep RL. I was looking for just that.
@henriquepett2124
1 year ago
Nice explanation of RL, Pieter! Will be watching your updates more closely now.
@赵小鹏-w9s
3 years ago
Hi Pieter, thank you very much for this great lecture! I found a mistake on p. 54 of the attached slides: in the policy evaluation expression, the term in the last bracket should be "s'" instead of "s".
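For reference, a standard form of the policy evaluation backup with that correction applied (this is the textbook convention, not a reproduction of the slide):

```latex
V^\pi_{k+1}(s) \;=\; \sum_{s'} P\big(s' \mid s, \pi(s)\big)\Big[ R\big(s, \pi(s), s'\big) + \gamma\, V^\pi_k(s') \Big]
```

The bracketed term is evaluated at the successor state s', as the comment points out.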
@OK-bt6lu
1 year ago
This was the best video lecture intro to deep RL that I have ever watched. Thanks a lot for sharing, Prof. Abbeel! Please post more :)
@hongkyulee9724
6 months ago
This lecture is my first and best RL lecture. ❤❤
@junghwanro4829
11 months ago
Thank you for the great lecture. It was super helpful even after taking an RL course.
@BruinChang
2 years ago
Many thanks, no other words.
@itepsilon
3 years ago
Thanks so much for sharing! Awesome!
@stevens68dev75
3 years ago
Really an excellent lecture series! Thanks a lot!! Just one question regarding the example at 22:30: shouldn't the values of the terminal states V(4,3) and V(4,2) both be 0, because in terminal states the expected future reward is 0 (there is no future)? The value iteration algorithm at 30:20 also implies this.
@saihemanthiitkgp
2 years ago
I think it has to do with the environment setup. In the gridworld example, the agent can get rewarded just for being in a specific state, and the action of collecting the gem doesn't require any timestep. Value iteration is formalized more for practical scenarios where the agent is rewarded for the decisions (transitions) it makes. That's why V^*_0(s) = 0 for all s instead of R(s, phi, s), meaning no reward for just being in a state.
@wireghost897
1 year ago
Yeah, exactly. This was quite confusing, because in every other book I have seen, the terminal state has a value of 0.
@许翔宇
2 months ago
I agree with you. Did the lecturer make a mistake here?
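For readers following this thread, a minimal value iteration sketch in Python (the tiny MDP, rewards, and noise below are illustrative assumptions, not the lecture's gridworld). It pins terminal values at 0 and pays reward on the transition into a terminal square, which is the convention the replies above describe:

```python
# Minimal value iteration on a toy MDP; terminal states keep V = 0.
# The MDP below is illustrative, not the exact gridworld from the lecture.
GAMMA = 0.9

# Transition model: P[(s, a)] = [(prob, next_state, reward), ...]
P = {
    ("s0", "left"):  [(0.8, "s0", 0.0), (0.2, "s1", 0.0)],
    ("s0", "right"): [(0.8, "gem", 1.0), (0.2, "pit", -1.0)],
    ("s1", "left"):  [(1.0, "s0", 0.0)],
    ("s1", "right"): [(1.0, "pit", -1.0)],
}
TERMINAL = {"gem", "pit"}
states = {"s0", "s1"} | TERMINAL

V = {s: 0.0 for s in states}
for _ in range(100):
    new_V = {}
    for s in states:
        if s in TERMINAL:
            new_V[s] = 0.0  # no future reward after a terminal state
            continue
        actions = [a for (st, a) in P if st == s]
        new_V[s] = max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])
            for a in actions
        )
    V = new_V
print(V)  # the gem's +1 shows up in the value of the state you act from
```

Under this convention the gem square itself has value 0, and the +1 appears in the values of the states leading into it, which reconciles the two readings discussed above.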
@eonr
1 year ago
I believe there's a mistake at 51:01. The last term in the last two equations should be the value function of s' instead of s.
@user-or7ji5hv8y
3 years ago
Awesome!
@offthepathworks9171
8 months ago
Especially liked the 'intuition' part! What would be the best way to go more in-depth on some of the "prerequisites" for RL?
@karthiksuryadevara2546
4 days ago
Good explanation, very clear. I did not understand the entropy part; could someone suggest a good resource for understanding it?
@guoshenli4193
2 years ago
Great lecture, thanks so much!!
@wanliyu4243
1 year ago
Regarding the example at 26:30: when calculating the value of V*(3,3), we need to know the values of V*(4,3), V*(3,3), and V*(3,2). The problem is: apart from V*(4,3), how can we know the values of V*(3,3) and V*(3,2)?
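One way to see it, assuming the slide follows the standard value iteration recursion: the right-hand side uses the previous iterate, so the values of (3,3) and (3,2) are already known from step k (starting from all zeros), and no circularity arises:

```latex
V^*_{k+1}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\Big[ R(s, a, s') + \gamma\, V^*_k(s') \Big],
\qquad V^*_0(s) = 0 \;\;\text{for all } s.
```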
@awesama
3 years ago
You describe reinforcement learning as trial-and-error learning. Can we not use prior knowledge of a task, or prior data, to help learn the task or speed up learning it? For example, could AlphaGo learn from existing Go games between players?
@julianequihuabenitez5344
3 years ago
You can; it's called offline reinforcement learning.
@MrJugodoy
3 years ago
I believe AlphaGo does this, unlike AlphaZero, which starts learning from "scratch".
@JumpDiffusion
1 year ago
Yes, that's what a replay buffer is for.
@datascience6104
3 years ago
Thanks for sharing 👍
@TheAdityagopalan
3 years ago
Nice series, thanks! Question: in the MaxEnt part, starting at around 1:07:00, shouldn't the Lagrangian dual of a max problem be the min-max instead of the max-min?
@MLSCLUB-t2y
3 months ago
Don't forget he is maximizing the function; what you might be used to is minimizing the function, so that's why the logic is flipped.
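A sketch of the duality being discussed, assuming the standard one-step MaxEnt objective (notation is generic and may differ from the slides): the primal is a constrained maximization,

```latex
\max_{\pi} \; \sum_a \pi(a)\, r(a) + \beta\, H(\pi)
\quad \text{s.t.} \quad \sum_a \pi(a) = 1,
```

and its Lagrangian dual is

```latex
\min_{\lambda} \, \max_{\pi} \; \sum_a \pi(a)\, r(a) + \beta\, H(\pi) + \lambda \Big( 1 - \sum_a \pi(a) \Big),
```

i.e., min-max for a constrained max problem, the mirror image of the max-min dual of a constrained min problem.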
@user-kl9km5iz7v
1 year ago
At 32:30 he said we will explicitly avoid the fire pit. How will we do that, since the only actions we have are up, right, and left? It would be optimal to take up, but as the environment is stochastic we'll end up in the fire pit 20% of the time, and the value function must also update by 0.2*0.9*(-1). Am I right?
@ai-from-scratch
1 year ago
The possible actions are up, right, left, and down. The optimal one to take is left, which makes the agent stay in the same place with probability 0.8, go up with probability 0.1, go down with probability 0.1, and go into the fire pit with probability 0.0.
@KrischerBoy
2 years ago
At 53:33 (exercise 2), about the correct Option 2: shouldn't the sums be flipped, so that it reads sigma(a) [ sigma(s') [ ... ] ]? As written, if I iterate over "a", the term expresses only the reward of the s' we wanted to reach through action a, with its respective probability, but not the other possible states reached through noise. With sigma(a) [ sigma(s') [ ... ] ], the term would express the sum of all the possible outcomes action "a" could cause, and would then iterate over the next action.
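For context, the usual convention, which matches the reasoning in this comment (the exercise's exact expression is not reproduced here): for a fixed action a, the inner sum runs over all successor states s' that noise can produce, and a maximization (not a sum) then ranges over actions:

```latex
Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\Big[ R(s, a, s') + \gamma\, V^*(s') \Big],
\qquad V^*(s) = \max_a Q^*(s,a).
```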
@김동규-b6l
3 years ago
Thank you!
@goldfishjy95
3 years ago
omg thank you so much!!!!
@Himanshu-xe7ek
2 years ago
At 1:11:00, how did pi(a) = exp[(r(a) - beta + lambda)/beta] become pi(a) = exp(r(a)/beta)/Z? Where did the lambda and -1 terms go?
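A sketch of where those terms go, assuming the standard derivation (setting the derivative of the Lagrangian to zero gives the first expression): the factor containing lambda and the -1 is constant in a, so normalization absorbs it into Z:

```latex
\pi(a) = \exp\!\Big(\tfrac{r(a) - \beta + \lambda}{\beta}\Big)
       = e^{r(a)/\beta}\, e^{(\lambda - \beta)/\beta},
\qquad
\sum_a \pi(a) = 1 \;\Rightarrow\; e^{(\lambda - \beta)/\beta} = \frac{1}{Z},
\quad Z = \sum_a e^{r(a)/\beta}.
```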
@sakethsaketh750
3 months ago
Is this lecture series recommended for me as a roboticist who doesn't have the basics of deep learning or machine learning? Can I start directly with this series?
@blackdeutrium746
3 years ago
Hi Professor, about the walking robot you just made and showed: if I want to make a similar type of robot, what do I have to learn? I'm quite interested in deep reinforcement learning.
@zainmehdi8886
1 year ago
How do we know/calculate the action success probability? By collecting statistical data?
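One common answer, as a rough tabular sketch (an assumption about the setting, not something from the lecture): when the transition model is unknown, estimate the success probability empirically from counts of logged transitions:

```python
from collections import defaultdict

# Empirical estimate of P(s' | s, a) from observed (s, a, s') transitions.
counts = defaultdict(lambda: defaultdict(int))

def record(s, a, s_next):
    counts[(s, a)][s_next] += 1

def estimated_prob(s, a, s_next):
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0

# Example: "up" from s0 succeeded 8 of 10 times in the log.
for _ in range(8):
    record("s0", "up", "s1")   # intended outcome
for _ in range(2):
    record("s0", "up", "s0")   # slipped / stayed put
print(estimated_prob("s0", "up", "s1"))  # -> 0.8
```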
@pipe_runner_lab
1 year ago
I would recommend kzitem.info/news/bejne/mp1pmKptm31kdqQ before starting this video. The lecturer goes into greater detail on value interaction and how it works.
@wireghost897
1 year ago
*value iteration
@bigjeffystyle7011
1 year ago
Thanks for the suggestion!
@TheThunderSpirit
1 year ago
Is it necessary that the state set be finite?
Comments: 45