This content is sponsored by my Udemy courses. Level up your skills by learning to turn papers into code. See the links in the description.
@ashmitabhattacharya1457
2 years ago
I would very much appreciate it if you could make a video on the Counterfactual Multi-Agent RL framework (COMA) as well. I have been trying my hand at it, and it would be great to have a reference for that, especially from someone as good as you!
@MachineLearningwithPhil
2 years ago
Great suggestion. It's been on my list for a while. So much good stuff and so little time.
@TriThom50
A year ago
@MachineLearningwithPhil Is this not centralized training and decentralized execution? In the description you wrote "The main innovation of this algorithm is the use of centralized execution and decentralized training."
@_jiwi2674
3 years ago
Don't apologize for making long videos, Phil; we can only appreciate your efforts and time investment :)
@MachineLearningwithPhil
3 years ago
Thanks Jiwi
@JousefM
3 years ago
Damn, I have a lot of videos to watch once I've handed in my thesis :D It has been a while!
@MachineLearningwithPhil
3 years ago
Good luck with that thesis.
@LidoList
3 years ago
I was fascinated when I read this paper for the first time, and I was fascinated all over again watching your amazing video, bro. Keep doing this, and thanks for your effort.
@whisleton101
2 years ago
This is an awesome video to help explain a MADDPG implementation! Thanks!
@mohammadparvini1521
3 years ago
Thank you very much, I have learned a lot from your videos.
@阮雨迪
3 years ago
thanks a lot for making this video! learned a lot
@xz3642
2 years ago
Thank you so much for the great tutorial!
@noorwertheim2515
2 years ago
I really like this in-depth explanation of the paper and the code. I was wondering whether this code can also be used for other multi-agent gym environments, or whether it is only compatible with this one?
@TriThom50
A year ago
I also have this question
@danielw2609
3 years ago
I have a question about the action space at 51:55. Since the action space for each agent is "Discrete(5)", I would expect the action value to be a single integer in the range [1,5]. That's what we get if we use env.action_space[0].sample(). But here you are using a list of 5 elements, [1,0,0,0,0], as the action. Why is this so? And what is the physical meaning of fractional action values like [0.1, 0.2, 0.2, 0.3, 0.4]?
@Majadoon
2 years ago
I was thinking the same. Also, where do we choose the single action value that the agent actually takes?
@Majadoon
2 years ago
Thank you Phil. There's a new MAPPO algorithm that is supposed to work better. I wonder how that works.
@MachineLearningwithPhil
2 years ago
I will check it out, thanks!
@mehranzand2873
3 years ago
Thanks a lot, I like your code, and of course we'd like to know about the other approach you mentioned. And please don't forget about discrete SAC :))
@MachineLearningwithPhil
3 years ago
Thanks Mehran, I'll get cracking on it.
@chinokyou
A year ago
Great vid!
@jasonruff1270
2 years ago
could you do a tutorial on implementing multi-agent reinforcement learning with c++?
@kaierliang
2 years ago
Hello Phil, at 1:38:00 I do not quite understand the difference between mu and old_action. Are they both calculated with the regular actor network from the current state? I guess the difference is that the regular critic is updated, so these two values aren't the same. But what if I calculate the actor loss by directly using old_action, i.e., the action we actually took?
@jcqin8722
A year ago
Hi, I have the same question as you. Did you solve it or come to understand it?
@jcqin8722
A year ago
I think I found the key. **old_action** comes from memory, so it seems **old_action** carries no gradient with respect to the actor network's parameters. Therefore, it's necessary to calculate **mu** online instead of using **old_action**, which comes from memory.
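A toy sketch of this point (with illustrative stand-in networks, not Phil's actual code): actions pulled from the replay buffer are plain tensors with no link back to the actor's weights, so only a freshly recomputed mu lets the actor loss update the actor.

import torch
import torch.nn as nn

actor = nn.Linear(8, 5)              # stand-in for the actor network
critic = nn.Linear(8 + 5, 1)         # stand-in for a critic taking state and action

states = torch.randn(64, 8)
old_actions = torch.rand(64, 5)      # sampled from the replay buffer: plain data, no graph

# A loss built from buffered actions cannot reach the actor's parameters.
loss_from_buffer = -critic(torch.cat([states, old_actions], dim=1)).mean()

# A loss built from freshly recomputed actions can.
mu = actor(states)
actor_loss = -critic(torch.cat([states, mu], dim=1)).mean()
actor_loss.backward()
print(actor.weight.grad is not None)  # True: gradients flow to the actor through mu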
@ross951
3 years ago
Given that there's nothing mystical about our own brains, and that it all comes down to physics, there's no reason to suppose artificial systems won't be able to cooperate. It would be shocking to find that they couldn't. The question is: can WE find algorithms that allow for cooperation?
@ReubsWalsh
3 years ago
Assuming, of course, that we do in fact cooperate...
@TriThom50
A year ago
Could this implementation be used with a custom environment?
@minOddo
A year ago
When you calculate actor_loss, what about the log(prob(action)) part? I don't see you have that part
@MachineLearningwithPhil
A year ago
I'm using the same loss function from the DDPG paper. Actor loss is given by applying the chain rule to the output of the critic network, where the actions are chosen according to the actor network.
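For reference, this is the deterministic policy gradient from the DPG/DDPG papers: there is no log-probability term because the policy is deterministic, and the gradient is just the chain rule through the critic:

\nabla_\theta J \approx \mathbb{E}_{s \sim \mathcal{D}} \left[ \nabla_a Q(s, a) \big|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s) \right]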
@k.o.6509
A year ago
Hello Dr. Phil, thanks for all the valuable knowledge. Is there any pytorch version other than 1.4.0 that works?
@MachineLearningwithPhil
A year ago
If you check my GitHub I have a repo on the latest versions
@k.o.6509
A year ago
Thank you @MachineLearningwithPhil
@aungmyat5497
2 years ago
can i use one critic for same agent?
@fatenlouati4325
3 years ago
I'm trying to run the simple_reference scenario but this error occurs: AttributeError: 'MultiDiscrete' object has no attribute 'n', at line 23: n_actions = env.action_space[0].n. How can I fix it, please?
@MachineLearningwithPhil
3 years ago
It's impossible to diagnose here. Make sure you're running the correct versions of the required packages, to start. Failing that, dig through the source code to see what's going on.
@tamalsarkar6325
3 years ago
do we have a solution to this problem?
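For anyone still stuck on this: scenarios such as simple_reference expose a MultiDiscrete action space rather than Discrete, and MultiDiscrete has no .n attribute. A hedged sketch of one possible workaround, assuming the gym MultiDiscrete class (I recall the particle-envs repo also ships its own MultiDiscrete variant whose attributes differ, so verify against your version):

from gym.spaces import Discrete, MultiDiscrete

space = env.action_space[0]
if isinstance(space, Discrete):
    n_actions = space.n
elif isinstance(space, MultiDiscrete):
    # total one-hot width across the sub-actions (e.g. movement + communication)
    n_actions = int(space.nvec.sum())
else:
    raise NotImplementedError(f"Unhandled action space: {space}")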
@valentino34h
2 years ago
1:53:53 I have the same error. I tried installing the previous version (torch 1.4.0) but I can't. Can we use another torch version for this project?
@林雷-o7x
A year ago
Have you solved the problem yet? I have the same problem!
@k.o.6509
A year ago
I have the same problem, and I think I can't install torch version 1.4.0 due to my current Python version. Have you solved this problem?
@marohs5606
3 years ago
Can't thank you enough for this tutorial; this is the best channel for learning RL coding. There are many theoretical videos out there, but implementations are rare... Please make more videos about multi-agent RL, as it is more complex than single-agent RL.
@constantinebardis9814
2 years ago
Hi Phil, many thanks for taking the time to create this (and all your other) RL videos combining the theory of the papers with sensible code implementations; they really do go a long way towards helping me, among others, understand some pretty challenging topics.

Regarding the "update_network_parameters" function you mentioned is ugly, I have what I think is a sensible alternative, written for your Actor-Critic course for the TD3 algorithm (so it may need some tweaking to work in our case here):

def update_network_parameters(self, tau):
    '''
    No other arguments are necessary because they will be taken care of
    inside the class by referencing self.model.
    '''
    nets = [self.actor, self.actor_target,
            self.critic_1, self.critic_target_1,
            self.critic_2, self.critic_target_2]
    for model_ind in range(0, len(nets), 2):
        state_dict_online = nets[model_ind].state_dict()
        state_dict_target = nets[model_ind + 1].state_dict()
        for name, param in state_dict_online.items():
            transformed_param = (param * tau) + ((1 - tau) * state_dict_target[name])
            state_dict_target[name].copy_(transformed_param)

The reasoning is to place all models in a list, ordered as (online, target) pairs, which makes it easy to update all the parameters in a loop. Check it out and let me know if it helps.

In addition, since you asked, it would be FANTASTIC if you could do tutorials or even a Udemy course dedicated to MARL, which is particularly important to me since I am doing my dissertation on it. I personally would love to see you have a go at the COMA paper which you mentioned, and also Multi-Agent PPO (arxiv.org/abs/2103.01955) and MA-POCA (arxiv.org/abs/2111.05992), which is implemented in the very powerful ML-Agents library of Unity. Again, many thanks and keep up the great work.
@janjaggy
3 years ago
Thank you for this video! I am actually very likely going to use MADDPG for my thesis, and you have been an amazing help so far with everything. Going to watch this first thing in the morning tomorrow!
@ngochungnguyen9824
2 years ago
[help] I tried to apply scenario = 'simple_adversary', but I got this error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [64, 5]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Then I used this to fix it:

critic_loss.backward(retain_graph=True, inputs=list(self.agents[agent_idx].critic.parameters()))
and
actor_loss.backward(retain_graph=True, inputs=list(self.agents[agent_idx].actor.parameters()))

but I don't get good results. Thanks
@joaohenriqueluz6827
2 years ago
Hi, did you solve this error? I'm having the same issue and could use some help. Thanks in advance!
@ngochungnguyen9824
2 years ago
@joaohenriqueluz6827 Yes, you need to pass extra arguments to the backward function.
@williamflinchbaugh6478
2 years ago
@ngochungnguyen9824 Could you elaborate? I'm having the same issue.
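For anyone hitting the same in-place RuntimeError: a likely cause (an assumption on my part, not verified against this exact codebase) is that one agent's optimizer.step() modifies parameters in place while they are still part of the autograd graph a later agent's backward() needs. A hedged sketch of one way to restructure the learning step, with hypothetical helper names: build every loss first, run every backward pass, and only then step the optimizers.

for agent in agents:
    agent.critic.optimizer.zero_grad()
    agent.actor.optimizer.zero_grad()

critic_losses = [agent.critic_loss(batch) for agent in agents]   # hypothetical helper
actor_losses = [agent.actor_loss(batch) for agent in agents]     # hypothetical helper

for loss in critic_losses + actor_losses:
    loss.backward(retain_graph=True)   # graphs share the other agents' actor outputs

for agent in agents:
    agent.critic.optimizer.step()
    agent.actor.optimizer.step()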
@qiaomuzheng5800
3 months ago
‘The rewards attribution problem...and the result is just poor learning, nobody getting anything done. Typical right?’ So funny XD
@SurajBorate-bx6hv
2 months ago
I am getting an OpenGL-related error when trying to visualize by setting the evaluate flag to true.
@alirezamogharabi8733
3 years ago
Thanks a lot, this was my request 🙏🙏💖❤️
@noname24291
2 years ago
At 1:24:00, can anyone explain why we should add noise to the output of the softmax? Doing this means the probabilities no longer sum to 1. Can you point me to some references/books that do this?
@diegobenalcazar4836
2 years ago
Good job. I am working on this algorithm, but I wish you had gone into the details of the environment. If you look at environment.py, it gets a little confusing. Also, I do not understand what you do at 54:08. In a deterministic policy the pdf collapses to one single action, doesn't it? Then, I would expect to see something like [1 0 0 0 0]. I ran and printed the Open AI code and the output is indeed more similar to what you write at 54:08. If you could help me understand this, I would deeply appreciate it.
@hao6247
3 years ago
Nice video again! Looking forward to the implementation of MUZERO.
@connor-shorten
3 years ago
Awesome!
@danielboyer4246
2 years ago
For the fc1 = nn.Linear inputs, if the adversary has a different number of actions, would that line of code (Critic Network, line 91) have to be modified?
@viswanathansankar3789
2 years ago
It even worked with newest version of numpy. Had some errors while installing the specified version.
@whisleton101
2 years ago
Hey Phil. One thing that confuses me: when I choose actions from the agent while recreating the code, I get a numpy array the size of my number of actions, but the values aren't binary. I would think they need to be binary to say which action should be chosen, or is that handled by the environment you are using? I am trying to implement this with Google Football, which needs a single integer action per agent.
@xiaokejie3456
2 years ago
Looking forward to your implementation of the multi-agent RL used in StarCraft, in comparison with MADDPG! Thanks Phil!
@lijiang568
3 years ago
Thank you Phil.
@SteliosStavroulakis
3 years ago
I still didn't get why each agent takes an array of actions, is it a 1-hot encoding of each action? (for example, no_op = [1,0,0,0,0], right = [0,1,0,0,0] etc?). I absolutely LOVED the tutorial, huge thanks!
@MachineLearningwithPhil
3 years ago
I don't have a solid understanding. The code isn't well documented. It's just what I figured out by playing with the environment.
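For what it's worth, my own (unverified) reading of environment.py in OpenAI's multiagent-particle-envs is that, with the default settings, the action vector is not required to be one-hot: its components are combined into the agent's movement/communication, which would explain why fractional values still do something. If your target environment instead expects a single integer per agent (as Google Football does), one simple conversion is to take the argmax of the vector; a minimal sketch:

import numpy as np

action_vectors = [np.array([0.1, 0.6, 0.1, 0.1, 0.1]),   # agent 0's output
                  np.array([0.3, 0.2, 0.2, 0.2, 0.1])]   # agent 1's output

discrete_actions = [int(np.argmax(v)) for v in action_vectors]
print(discrete_actions)   # [1, 0]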
@serix_16
3 years ago
Thank you!
@cuongnguyenuc1776
7 months ago
A very amazing video!! But does anybody wonder how, in the paper, they run the model with its ensemble of sub-policies at test time? And what happens to all of those sub-policies after training: will they converge to the same weights? If not, how do they choose a specific sub-policy to execute, or do they pick one at random?
@suleiman8664
2 years ago
I must admit that I learned a lot from your excellent videos, thank you
@qiaomuzheng5800
3 months ago
Maximum appreciations!
@jahcane3711
3 years ago
I would love it if you covered a starcraft style implementation!
@alexanderjeremijenko3228
3 years ago
Would be fantastic to see a StarCraft DL implementation!
@cjk9988
A year ago
Hi, I'm facing some dependency issues when trying to install numpy 1.14.5; it doesn't seem to be supported by Python 3.8+. Is there any way to get around this without downgrading Python, since PyTorch is not supported below 3.8?
@MachineLearningwithPhil
A year ago
I've redone the code for torch 1.13 and petting zoo. github.com/philtabor/Multi-Agent-Reinforcement-Learning
@cjk9988
A year ago
@MachineLearningwithPhil Okay, got it, thank you! I'll have a look into it.
@cuongnguyenuc1776
6 months ago
@cjk9988 Check out the code I fixed from this video to work with the latest setup: colab.research.google.com/drive/1p_N9zQXljl5lVadBLJWsU5iiZ4UCHBq7?usp=sharing
@mukundsrinivas8426
2 years ago
Hey man, I'm trying to do MAPPO using PettingZoo and Stable-Baselines3. I am not clear on whether PettingZoo is the right environment. Can I reach out to you with the code for clarification?
@kuimisk
A month ago
real nice
@abrahamloha3050
2 years ago
It's a really amazing tutorial, thanks!
@deykaushik98
3 years ago
Hello Phil, your videos are awesome. The way you explain makes complex things simple. One request: could you consider implementing the QMIX or MAVEN algorithm, which came out of Shimon Whiteson's lab? Given that most of the developments in MARL are based on QMIX or related algorithms, it would be great to understand the code through your eyes.
@MachineLearningwithPhil
3 years ago
wow, those are some great suggestions. Looking over the MAVEN paper, I'd be inclined (though not certainly) to put this in a course (it's going to be a huge amount of work)... the QMIX paper might be easier to do for YT. I'll have to put those on the ever growing list of stuff to do. Thanks Kaushik!
@deykaushik98
3 years ago
@MachineLearningwithPhil Thank you Phil!! Please consider PyTorch when you get a chance to implement it :)
@tamalsarkar6325
3 years ago
@MachineLearningwithPhil Coding up the QMIX algorithm would be so awesome to learn! And yes, as the author of the comment said, PyTorch please :)
@natty261
3 years ago
Wow, amazing video!!! I just graduated from university with a degree in AI. I've been researching RL on my own, but all of the different papers and topics make it difficult to organize and learn. This video really put it all together. Great video!! I shared it with all my friends from my degree. I have one question. I've seen several recent papers with extensions of MADDPG that try to improve it. What, in your opinion, would be a good algorithm to use for the environment I describe below? My project is this: I want to make two competing armies on a 2D battlefield. There are two armies, and each army consists of N units. There are two methods I could try: two competing agents (like two opposing generals), which gives a complex action space since each agent has to move multiple units; or making each unit its own agent, which turns it into a mixed competitive-cooperative problem. Sorry for the length, and thank you for your time if you see this!
@joshwil7176
3 years ago
Hi Phil, you mention how it would be easy to bolt different RL algorithms onto MAAC instead of DDPG. Do you not think that the variance introduced by stochastic action selection would cause issues with the policy updates?
@johncarter8855
2 years ago
This tutorial and code have been super useful. I'm using MADDPG to extend the Competitive Experience Replay paper to work with 3+ agents in a different way, so your work is much appreciated. For the action, you use a softmax output (giving a probability for each of the 5 actions), but then you add a uniform random number in [0,1] to each action value. This seems a bit weird to me. I know you mention using a sigmoid in the video; I wonder if a sigmoid with Gaussian noise would be better? Do you know if the environment uses these continuous values and applies its own softmax again, or does it take the argmax of the 5 action values and apply that action? Also, you have a high chance of getting a value above 1, which you stated in the video needs to be in the range [0,1], so perhaps the environment also caps things.
@johncarter8855
2 years ago
Also, do you think it's worth decaying the noise term over time? And also not adding any noise during the evaluation? Thanks again!!
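Along the lines of the question above, a minimal sketch of decaying Gaussian exploration noise that is switched off at evaluation time (an alternative scheme I'm sketching, not what Phil's code does; the clamp range assumes actions live in [0, 1]):

import torch

def choose_action(actor, obs, noise_scale, evaluate=False):
    # Deterministic action plus exploration noise, clipped back into [0, 1].
    with torch.no_grad():
        mu = actor(obs)                               # e.g. softmax/sigmoid output in [0, 1]
    if not evaluate:
        mu = mu + noise_scale * torch.randn_like(mu)  # Gaussian rather than uniform noise
    return torch.clamp(mu, 0.0, 1.0)

# A simple decay schedule, applied once per episode:
# noise_scale = max(0.05, noise_scale * 0.995)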
@cjk9988
A year ago
Hi, I was able to successfully run the code for the 'simple' scenario; is there a way to change the scenario? Because it seems to give some issues when I do so.
@MachineLearningwithPhil
A year ago
Try switching to petting zoo.
@Malik-bx6qg
3 years ago
Thank you Dr. Phil for your great efforts, we all appreciate it, God bless you. I noticed that you didn't trigger the ".train()" and the ".eval()" Pytorch modes during the agents' learning process. I recall that you said in one of your previous videos that "we should trigger these modes, otherwise the agents will not learn". Is it really necessary to use these modes, as they are confusing? Can you please comment on this?
@MachineLearningwithPhil
3 years ago
It's not necessary. It only helps for collecting stats when you're doing batch norm. Sorry for the confusion.
@marcosflaks5214
2 years ago
Hey Phil, because there is no terminal state, you decided to end the game when the agent takes 25 steps. So, for the same (state, action) pair, the agent could either go to a next state (done = False) or reach a terminal state (done = True). That seems strange, because the target value for the same (state, action) pair will then differ when we train the critic network. How will the agent learn? For a (state, action) pair: if done = False ==> target value = reward + gamma * Q(next state); if done = True ==> target value = reward + 0 (because Q(next state) = 0).
@MachineLearningwithPhil
2 years ago
The agent learns that no future rewards follow the terminal state; the future value of that state is zero.
@marcosflaks5214
2 years ago
@MachineLearningwithPhil Yes, I understand. But my point is that for the same (state, action) pair we could end up in either a terminal state or a non-terminal next state, so the target value used to train the critic network can differ: we could have two target values for the same (state, action) pair. That could make training unstable. I am not saying something is wrong, but I thought we should have only one target value for each (state, action) pair.
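For reference, the done flag typically enters the critic target as a mask on the bootstrapped value (a toy sketch with illustrative tensors, not Phil's exact code). One common answer to the concern above is not to zero the bootstrap when an episode is merely cut off at a step limit rather than truly terminal, so the two cases no longer produce different targets.

import torch

# Toy tensors standing in for a sampled batch (shapes are illustrative).
rewards = torch.randn(64, 1)
dones = torch.randint(0, 2, (64, 1)).bool()   # True where the episode ended
q_next = torch.randn(64, 1)                   # target critic's value of the next state-action
gamma = 0.95

q_next = q_next.masked_fill(dones, 0.0)       # no bootstrapping past a terminal step
target = rewards + gamma * q_next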
@tmgmeireles
3 years ago
Amazing! Thank you for your video
@bertobertoberto242
A year ago
Great video; however, from a practical point of view I do not agree with having a separate actor network for each agent. Otherwise you get a very specialized agent for each role, which is not ideal: you end up with many agents, and once you want to use them in the real world you don't know which one to use. Instead, I would investigate the case where the actor is shared and the network learns to cooperate with itself, and possibly even the case where the greens and the reds are the same network, with just a bit telling the NN which of the two roles it has to play.
@resoluation345
6 months ago
You mean you are disagreeing with the paper itself? Then you should do your own implementation and research and publish a paper if you really think it's true; you would get a lot of credit for this.
@bertobertoberto242
6 months ago
@resoluation345 In the MARL community it is well established that agents can share networks; just check the SEAC paper or the MAPPO paper.
Comments: 105