Wow, you explained policy gradient very intuitively and visually, alongside the math! I have never seen such a good explanation of it in any textbook, paper, or even video. Loving it; keep up the effort!
@senthilcaesar
4 years ago
Policy gradient is best explained in this lecture. Thank you, Alex.
@frankd1156
3 years ago
He's 25 years old... wow, you are going places.
@burlemanimounika7631
4 years ago
Can you please make a full reinforcement learning course? Your way of teaching is so clear.
@shawnirwin6633
3 years ago
8:04 When he says "Lambda", what he really means is "Gamma". Understandable mistake if you have not had much exposure to Greek.
@mohamedwaleedfakhr9805
4 years ago
Many thanks to you and your team for the great lectures. You both make difficult stuff feel nice and clear :) I wish you could make one on the new advances in RNNs and transformer models.
@vast634
3 years ago
Why not just use a neural net, feed the inputs (the current state) into it, and output the action? Then use the same net for several timesteps in a row (updating the current-state input and applying the actions) for the simulation. The cost function is then simply the end result of the net's performance in this simulation run. Then train it to improve for the next run.
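The catch with training on the end result alone is that the final score is not differentiable with respect to the network's weights (the environment sits in between), which is exactly the problem policy gradient solves; a gradient-free way to follow this suggestion literally is to score whole runs and keep the best weight perturbation. A minimal sketch, assuming a hypothetical env_step(state, action) function returning (next_state, reward, done) and a toy one-layer policy:

```python
import numpy as np

def rollout(weights, env_step, init_state, horizon=200):
    """Run one episode with fixed weights and return the end-of-run score."""
    state, total_reward = init_state, 0.0
    for _ in range(horizon):
        scores = np.tanh(state @ weights)        # tiny one-layer "policy net"
        action = int(np.argmax(scores))          # pick the highest-scoring action
        state, reward, done = env_step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward

def train_by_random_search(env_step, init_state, dim_in, dim_out, iters=500, noise=0.1):
    """Gradient-free training: keep whichever perturbed weights score best."""
    weights = np.zeros((dim_in, dim_out))
    best_score = rollout(weights, env_step, init_state)
    for _ in range(iters):
        candidate = weights + noise * np.random.randn(dim_in, dim_out)
        score = rollout(candidate, env_step, init_state)
        if score > best_score:                   # judged only by the run's end result
            weights, best_score = candidate, score
    return weights
```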
@abhijeetsharma5533
4 years ago
33:58 Multiplying by -1... shouldn't it make the loss positive (instead of negative), because the log term is already negative (since the probability term inside it is less than 1)?
@AAmini
4 years ago
Good question. P(a|s) is not necessarily between 0 and 1, since it is a probability density function (not a probability), so it can be any number greater than 0. This means that log(P(a|s)) can be positive or negative. Multiplying by -1 flips the sign relative to the original log form (positive -> negative and negative -> positive).
@thanhbinhnguyen2323
4 years ago
@@AAmini I am not sure this is correct. P(a|s) should be a probability distribution that sums to 1 over all discrete actions (in this case); a probability density would be 0 at any discrete point. So I think a simpler explanation for putting the minus in front of the term is that we want to maximize the expected reward, which is represented by the term log(P(a_t|s_t))·R_t, i.e., an action that brings maximal reward becomes more likely to be selected in the future (log(P(a_t|s_t)) gets larger). This is the same as minimizing the opposite term, i.e., the version with the minus sign.
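A minimal sketch of the loss term this thread is discussing, assuming discrete actions and TensorFlow (the names and shapes here are illustrative, not the lecture's lab code): minimizing it pushes up log P(a_t|s_t) for actions that led to large returns, which is the "maximize expected reward" point made above.

```python
import tensorflow as tf

def policy_gradient_loss(logits, actions, returns):
    """Mean over timesteps of  -log pi(a_t | s_t) * R_t.

    logits:  [T, num_actions] raw policy-network outputs at each timestep
    actions: [T] integer actions that were actually taken
    returns: [T] (discounted) return credited to each timestep
    """
    neg_logprob = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)   # equals -log pi(a_t | s_t)
    return tf.reduce_mean(neg_logprob * returns)
```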
@GabrielHernandez-to9ct
3 years ago
Great lecture! I have a couple of questions. 1. During Q-learning training, in order to have the target values, do we have to do a complete search so that we have the optimal Q(s,a), or are there other ways to do it? 2. In policy gradient, I think the NN parameters could be the probability of every action; are there ways in deep learning to make them sum to 1?
@ChibatZ
3 years ago
For point 2: yes, look up "softmax"; alternatively, a plain "divide by the sum" normalization could work.
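For reference, the softmax sits on the network's outputs (not its weights) and is what guarantees the action probabilities sum to 1; a minimal NumPy sketch:

```python
import numpy as np

def softmax(scores):
    """Map arbitrary real-valued scores to probabilities that sum to 1."""
    z = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return z / np.sum(z)

probs = softmax(np.array([2.0, 1.0, -0.5]))
print(probs, probs.sum())                 # probabilities over 3 actions, summing to 1.0
```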
@shreshtashetty2985
A year ago
A very good introduction. Bravo!
@aromax504
4 years ago
Can you please do a video on how you trained your self-driving model in simulation and transferred what it learned to real-world cars?
@lifeafteraleekx
4 years ago
great teacher and great content
@DennisZIyanChen
4 years ago
Believe it or not, I've watched this specific lecture 4-5 times now, and I actually watch it while going to bed... I don't know why, but I am completely drawn to the logic presented here. I have a decent background in mathematics/calculus and some in probability/statistics from my work, but I have very little experience with programming. Alex, would you have some advice on how to focus my personal training in programming so that I can put some of what's taught here into actual practice (even if it's just simple practice problems like teaching it to play a simple game)?
@lukeparentguy2223
2 years ago
I don't understand how the Q-function network is trained. Do you play a whole game with the same network and then, after that run is done, backpropagate all the errors at once? So we have to save copies of the outputs for every time step until we know the total reward, and then average all the gradients? Or some other way? I don't think this was explained...
@zemanntill
2 years ago
I have a little bit of trouble understanding why the negative log-probs work. If I have a high probability (close to 1) and take the log, doesn't that just give me a log-prob close to 0? So it wouldn't matter whether the action resulted in high or low reward, because we multiply by 0 anyway?
@jitendrakr9171
4 years ago
Thank you, sir, for the good explanation of reinforcement learning.
@JeminDEV
3 years ago
8:40 Should it be 𝛾^(i-t)?
@asdf_600
3 years ago
Hmm, sorry to point this out, but I think there is an issue in the explanation of policy gradient: your -log(p...) is non-negative, since the probability is between 0 and 1...
@pavankumarpapysettypally8121
3 years ago
Hi, I understood the concept you explained well, that the bar moves by computing the probability of its action, but what I did not understand is how it predicts the path of the ball after it hits the colored blocks. Is there some other hidden algorithm doing that? I'm sorry if my question is silly, but as I am new to AI and DL, I'm trying my best to understand the concept.
@vast634
3 years ago
21:58 So it scored very low on the game Asteroids. That's not a very complex game for a classic game AI to solve, so there might be a structural problem with this type of reinforcement learning for that kind of problem.
@aromax504
4 years ago
Hey man, your lecture is awesome. Can you please add explanatory source code plus materials to follow for deeper exploration? Thanks a lot 👏👏
@sabahshams1582
3 years ago
Amazing series, thank you.
@josephwong2832
4 years ago
Great series!!
@hfkssadfrew
4 years ago
19:25 I don't get it... how could you know the best action to take before the Q network is trained? If you already know the target value, just maximize the target and you are done, right?
@PSouza-wm2if
4 years ago
From what I understood, the best actions are represented by a given training data set, which can be obtained, for example, through a simulator like VISTA (37:10). Now, the whole point of RL (and DL) is to extrapolate from such target values, so that a machine can do things by itself without direct human programming/intervention. Therefore, having access to a limited set of target data does not mean the problem is solved, otherwise how would one maximize the reward in new situations? Does this make sense? :)
@jayt696969
4 years ago
@@PSouza-wm2if So before you do anything, you cannot calculate the loss function = (target Q - predicted Q), because target Q is a function of the best expected future Q, and we have no data to model the best expected future Q. Okay, fine, not a big deal, because we can just let the agent roam in the environment, collect data from it, and use that data to estimate the best expected future Q. But something seems off about it. What algorithm determines the agent's actions during this data-collection phase? And how can we assume that the data collected that way is actually representative once we are actually training the DQN? Yeah, I'm confused.
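For what it's worth, a common (though not the only) answer to the "what picks the actions during data collection" question is an epsilon-greedy behavior policy, with the targets bootstrapped from the current network via the Bellman equation; a minimal sketch of those two pieces, assuming a discrete action space:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Data-collection policy: act greedily w.r.t. current Q estimates, but keep exploring."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # occasional random exploratory action
    return int(np.argmax(q_values))               # otherwise the current best guess

def q_target(reward, next_q_values, gamma=0.99, done=False):
    """Bellman target r + gamma * max_a' Q(s', a'); no bootstrapping at episode end."""
    return reward if done else reward + gamma * float(np.max(next_q_values))
```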
@gren287
4 years ago
Feels like a state-of-the-art lecture, wow!
@louben5573
4 years ago
Thanks for uploading, very valuable. Greetings from Amsterdam Science Park.
@QuentinBrandon
4 years ago
Hardly surprising that Asterix would show some resistance...
@zhaobangxue6306
3 years ago
great lecture
@petermuwanguzi3787
4 years ago
I need help with lab 3: at the Pong game it kept throwing a "too many indices for array" error and I couldn't figure it out. Yes, I also consulted the solution notebook afterwards; still the same error. Please, I need help.
@jonathansum9084
4 years ago
Time to learn some RL. Let's hope it will be helpful in real life.
@retrom
4 years ago
It's generally not. Requires too much processing/compute power.
@ntumbaeliensampi6305
4 years ago
It will be useful in the future. Talent in RL will be needed.
@TheJustinmulli
3 years ago
Why would you sample from the probability distribution at 25:40? This could result in action 2 which has a low probability of being optimal. Shouldn't you just pick the action with the highest probability of being optimal?
@AAmini
3 years ago
During training you sample (explore), but during testing you don't (exploit). The reason is that during training, especially in the beginning, you don't know the optimal probability distribution (you are still learning). So if you sample action 2 (even though it doesn't have the highest probability), you might find that it is actually better than you thought, and thus its probability should be increased. Once done learning, you fix the probabilities and just choose the most likely action.
@TheJustinmulli
3 years ago
@@AAmini Wow, thanks for the quick response! I didn't realize this was for exploration. Thanks!
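A tiny NumPy illustration of the explore-vs-exploit distinction described above, assuming a three-action policy output:

```python
import numpy as np

probs = np.array([0.6, 0.1, 0.3])                      # policy output over 3 actions

train_action = np.random.choice(len(probs), p=probs)   # training: sample, so action 2 still gets tried sometimes
test_action = int(np.argmax(probs))                    # testing: always take the most likely action
```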
@MarkoTintor
4 years ago
Amazing Intro to RL lecture!
@victorsergio
4 years ago
Is it possible for an RL model to output a probability distribution P(a|s) over two or more continuous actions, e.g. steering angle and gas pedal position? What would the RL architecture look like to model multiple continuous output parameters? Would there be separate distributions for each output parameter, and how would the parameters be combined? I'm not clear about that part of policy gradient. Thanks.
@tuannguyentranle7151
3 years ago
Softmax function, bro; you can do some research.
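Softmax covers the discrete-action case; for continuous actions like steering angle and pedal position, the Gaussian parameterization from the lecture is a common choice, with the network predicting a mean and variance per action dimension (often assumed independent). A sketch under those assumptions, with made-up layer sizes, using TensorFlow/Keras:

```python
import tensorflow as tf

# Hypothetical policy head for two continuous actions (steering angle, pedal position):
# the network predicts a mean and a log-variance for each dimension, i.e. two
# independent Gaussians, and actions are sampled from them.
inputs = tf.keras.Input(shape=(64,))                     # some state encoding (size made up)
hidden = tf.keras.layers.Dense(32, activation="relu")(inputs)
mu = tf.keras.layers.Dense(2)(hidden)                    # means for [steering, pedal]
log_var = tf.keras.layers.Dense(2)(hidden)               # log-variances for [steering, pedal]
policy = tf.keras.Model(inputs, [mu, log_var])

def sample_actions(state):
    mean, log_variance = policy(state)
    std = tf.exp(0.5 * log_variance)
    return mean + std * tf.random.normal(tf.shape(mean))  # one sample per action dimension
```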
@RaivoKoot
4 years ago
Someone help me with the loss function at 33:48. He says a high log probability and a large reward result in a large number and, after adding the minus sign, a very negative number. How is this possible? Probabilities are between 0 and 1, so the log (base 2, e, or 10) of a probability is between negative infinity and zero. If we have a high probability, the log probability will be close to zero, and the resulting loss will be close to zero too. So how can the loss be a very negative number when the log probability is, as he says, "high"?
@Hung94
4 years ago
@Raivo Koot: You are right. The loss function on the slides is actually wrong and should be loss = -P(a|s)·R, i.e. without the log function. I think Alexander's confusion is due to the fact that the log function comes into play when you explicitly work out the formula for the gradient, i.e. in the expression of grad(loss) in w' = w - grad(loss). You can find more mathematical details by searching online for 'policy gradient objective function', where they usually work with the objective function J(theta) = -loss.
@RaivoKoot
4 years ago
@@Hung94 Wow, thanks for your great reply. That clears up my confusion, thank you!
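For reference, the connection mentioned above is the score-function (log-derivative) identity. Writing the policy as $\pi_\theta$, it reads

$$\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[R] \;=\; \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\, R \, \nabla_\theta \log \pi_\theta(a \mid s) \,\big],$$

so minimizing -log π_θ(a|s)·R on sampled actions gives, in expectation, the same gradient as maximizing the expected reward directly.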
@lizgichora6472
3 years ago
Thank you.
@ahteshamabbasi9503
3 years ago
Can we get the slides for this lecture? They haven't been uploaded to the mentioned website.
@AAmini
3 years ago
This is last year's lecture (2020); the slides are available on the 2020 version of the website (introtodeeplearning.com/2020/). The new 2021 version will be uploaded later today.
@gautamj7450
4 years ago
29:33 How is the velocity calculated as -0.8 m/s for a probability distribution with mean -1 and variance 0.5?
@Hung94
4 years ago
The velocity is not *calculated* to be -0.8 m/s, but it has been randomly *sampled* from a Gaussian distribution with mean -1 and variance 0.5. This -0.8 is just an example. He could use any other random number there.
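A one-line NumPy version of that sampling step (note NumPy takes the standard deviation, i.e. the square root of the variance):

```python
import numpy as np

velocity = np.random.normal(loc=-1.0, scale=np.sqrt(0.5))  # sample from N(mean=-1, variance=0.5)
print(velocity)  # roughly -0.8 on some draws; a different value each run
```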
Comments: 55