If you think I deserve it, please consider hitting the like button and subscribing for more content like this :)
@sushantmehta7789
A year ago
Next level video *especially* because of the dimensions laid out and giving intuition for things like k.transpose(-1, -2). Likely the best resource out right now!! Thanks for all your work!
@CodeEmporium
A year ago
Super glad you find this all useful!
@AnthonyY-o9q
A year ago
This is the most detailed Transformer video, THANK YOU! I have one question: the values tensor is [30, 8, 200, 64]; before we reshape it, shouldn't we permute it first? Like: values = values.permute(0, 2, 1, 3).reshape(batch_size, max_sequence_length, self.num_heads * self.head_dim)
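A minimal sketch of the permute-then-reshape being asked about, assuming the shapes mentioned in the comment ([batch=30, heads=8, seq_len=200, head_dim=64]); the permute brings the head axis next to head_dim before the heads are flattened back into a 512-dimensional vector per token:

```python
import torch

batch_size, num_heads, max_sequence_length, head_dim = 30, 8, 200, 64
values = torch.randn(batch_size, num_heads, max_sequence_length, head_dim)  # [30, 8, 200, 64]

# Move the head axis next to head_dim so each token's 8 x 64 head outputs stay
# together, then flatten them into a single 512-dimensional vector per token.
values = values.permute(0, 2, 1, 3)  # [30, 200, 8, 64]
values = values.reshape(batch_size, max_sequence_length, num_heads * head_dim)
print(values.shape)  # torch.Size([30, 200, 512])
```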
@surajgorai618
A year ago
This is the best explanation I have gone through
@aamirbadershah887
A year ago
Immense amount of effort put into this video. Really appreciate the explanation, especially keeping the PyTorch aspect in mind for beginners. Showing details like tensor dimensions throughout the code is just next level. Keep these videos coming.
@user-wr4yl7tx3w
A year ago
It's really helpful that you are going through all the sizes of the various vectors and matrices.
@CodeEmporium
A year ago
Glad it is helpful!
@xingfenyizhen
A year ago
Really friendly for the beginners!😁
@CodeEmporium
A year ago
Thanks a lot! Glad you found it useful
@jingcheng2602
7 months ago
Superb, I so love these classes! Will watch all of them one by one.
@danielbrooks6246
A year ago
I watched the entire series and it gave me a deeper understanding of how all of this works. Very well done!!!! Takes a real master to take a complex topic and break it down in such a consumable way. I do have one question: what is the point of the permute? Can we not specify the shape we want in the reshape call?
@seyedmatintavakoliafshari8272
7 months ago
This video was really informative. Thank you for all the detailed explanations!
@DeanLa
A year ago
This is the best content on youtube
@moseslee8761
A year ago
bro... i love how u dive deep into explanations. You're a very good teacher holy shit
@gigabytechanz9646
A year ago
Very clear, useful and helpful explanation! Thank you!
@user-ul2mw6fu2e
9 months ago
You are awesome. The way you teach is incredible.
@CodeEmporium
9 months ago
Thanks so much for this compliment. Super glad you enjoyed this
@salemibrahim2933
A year ago
@CodeEmporium The transformer series is awesome! It is very informative. I have one comment: it is usually recommended to perform dropout before normalization layers. This is because normalization layers may undo dropout effects by re-scaling the input. By performing dropout before normalization, we ensure that the inputs to the normalization layer are still diverse and have different scales.
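For reference, a minimal sketch of the post-norm ordering from the original Transformer, where dropout is applied to the sublayer output before the residual add and layer norm; class and parameter names here are illustrative, not the video's exact code:

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Sublayer output -> dropout -> residual add -> layer norm."""
    def __init__(self, d_model=512, drop_prob=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=drop_prob)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_out):
        # Dropout is applied before normalization, as the comment suggests.
        return self.norm(x + self.dropout(sublayer_out))

x = torch.randn(30, 200, 512)
print(AddAndNorm()(x, x).shape)  # torch.Size([30, 200, 512])
```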
@wilsvenleong96
A year ago
I believe you mean after, right?
@pierrelebreton7634
A year ago
Thank you, I am going through all your videos. Great work!
@FAHMIAYARI
A year ago
bro you're a legend!
@KurtGr
A year ago
Appreciate your work! As someone else mentioned, hope you can do an implementation of training the network for a few iterations.
@CodeEmporium
A year ago
Yea. That’s the plan. I am currently working on setting the full thing up.
@jayktharwani9822
3 months ago
great explanation.
@prashantlawhatre7007
A year ago
Hi Ajay. I think we need to make a small change in the forward() function of the encoder class. We should be doing `x_residual = x.clone() # or x_residual = x[:]` instead of `x_residual = x`. This will ensure that x_residual contains a copy of the original x and is not affected by any changes made to x.
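A sketch of what that residual pattern could look like inside an encoder layer; this uses PyTorch's built-in nn.MultiheadAttention rather than the video's custom module, purely to illustrate the clone() suggestion:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, ffn_hidden=2048):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_hidden), nn.ReLU(),
                                 nn.Linear(ffn_hidden, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x_residual = x.clone()             # copy, per the comment's suggestion
        x, _ = self.attention(x, x, x)     # self-attention sublayer
        x = self.norm1(x + x_residual)     # add & norm
        x_residual = x.clone()
        x = self.ffn(x)                    # position-wise feed forward sublayer
        return self.norm2(x + x_residual)  # add & norm

x = torch.randn(30, 200, 512)
print(EncoderLayer()(x).shape)  # torch.Size([30, 200, 512])
```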
@CodeEmporium
A year ago
Oh interesting. I have been running into issues during training. I'll make this change and check. Thanks a ton for surfacing this!
@TransalpDave
A year ago
Awesome content as always! Are you planning to demonstrate an example of training the encoder in the next video? For example on a Wikipedia data sample or something like that?
@CodeEmporium
A year ago
Hoping to get to that stage. I currently have the code ready, but it's a lil strange during inference. For more context: I am running into a situation where it's predicting only the End of Sentence token. Planning to fix this soon and have a full overview of the transformer. But in the meantime there are so many more videos I can make on the decoder.
@TransalpDave
A year ago
@@CodeEmporium Oh ok, I see. I'm also close to that step, I'll let you know if I find something.
@Zero-ss6pn
7 months ago
Just amazing!!!
@nallarajeshkumar9036
A year ago
Wonderful explanation
@CodeEmporium
A year ago
Thanks a lot :)
@dwarakanathchandra7611
A year ago
Hats off to you for explaining such a complex topic with simplicity and understanding. Thanks a lot. Is there any course you're offering besides these awesome videos on YouTube? Want to learn more concepts from you.
@CodeEmporium
A year ago
Thanks so much for the compliments. At the moment, my best teaching resources are on KZitem. Luckily, there are hundreds of videos on the channel haha
@dwarakanathchandra7611
A year ago
@@CodeEmporium Thanks for the info, sir. I am a student of AI and ML, very much interested in NLP. If you have any suggestions for research projects that I could pursue for my academic research, kindly suggest them. I am reading the papers one by one. If you have any interesting ideas, it would help me a lot.
@chenmargalit7375
A year ago
Thanks for the great series. Would be very helpful if you'd attach the Colab.
@convolutionalnn2582
A year ago
What would be the best book to learn probability and statistics for Machine Learning?
@linkinlinkinlinkin654
A year ago
Before any book, just take a 500-level course on probability and linear algebra, each from any university's free online classes. These two topics are not truly understood through even the best explanations, only by solving problems.
@GIChow
A year ago
I am looking forward to seeing whether you will try to put all the bits of the transformer together, i.e. the positional encoder before this "encoder" and then the decoder after. I wonder whether/how it will respond to the input text "My name is Ajay". Would it respond as though in a conversation ("Hi, how are you" / "My name is Bot"), generate more text in the same vein (e.g. "I am 28 years old"), translate it to another language, or something else? To achieve an end-to-end use case I guess we will also need appropriate data to train the models, and then actually train the models, save the model weights somehow, etc. Am new to all this but your videos are gradually helping me understand more, e.g. the encoder input and output matrices being of the same size to permit stacking. Thanks 👍
@CodeEmporium
A year ago
This is the goal. I am constructing this transformer bit by bit and showing my findings. We will eventually have the full thing
@cmacompilation4649
A year ago
Please blow up the decoder as well hahaa!! Thanks Ajay, these videos were very helpful for me.
@ramanShariati
A year ago
you are awesome bro
@-mwolf
A year ago
Thanks! Please do Cross Attention and maybe Attention visualizations next!
@CodeEmporium
A year ago
Yep! I plan to do some more videos on the decoder part too
@hermannangstl1904
A year ago
I understand how the forward pass works, but not how the learning works. Basically all the videos I have seen so far covering Transformers "only" explain the forward pass, but not the training. For example, I'd like to know what the loss function is. Question 2: afaik an Encoder can work on its own and doesn't (necessarily) need a Decoder (for example for non-translation use cases). How does the training work in this case? What is the loss function here? (-> we don't have a target sentence)
@CodeEmporium
A year ago
If you go further into the playlist (I just uploaded the code for this in my most recent video in the playlist), it is a cross entropy loss. We compare every character generated to the label, take the average loss, and perform backpropagation to update all weights in the network once after seeing all sentences in the batch. For your Question 2, I am not exactly sure what you are alluding to. Yes, you can just use the encoder, but depending on the task you want to solve, you'll need to define an appropriate loss. For example, BERT architectures are encoder-only architectures that may append additional feed forward networks to solve a specific task. These architectures will also learn via backpropagation once we are able to quantify a loss.
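A minimal sketch of that kind of cross entropy loss over generated characters; the vocabulary size, shapes, and padding index here are made-up values for illustration:

```python
import torch
import torch.nn as nn

batch_size, max_sequence_length, vocab_size = 30, 200, 100
criterion = nn.CrossEntropyLoss(ignore_index=0)  # e.g. ignore a padding token at index 0

# Pretend these are the model's output scores and the target character ids.
logits = torch.randn(batch_size, max_sequence_length, vocab_size, requires_grad=True)
labels = torch.randint(1, vocab_size, (batch_size, max_sequence_length))

# Compare every generated character to its label, average the loss over the batch,
# then backpropagate once to get gradients for all weights.
loss = criterion(logits.view(-1, vocab_size), labels.view(-1))
loss.backward()
print(loss.item())
```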
@hermannangstl1904
A year ago
@@CodeEmporium Thank you for your reply. For Q2: my plan is to deal with/code/understand the Encoder and the Decoder parts separately, starting with the Encoder. Especially how these attention vectors develop over time. How they actually look for a small example, trained with a couple of sentences. Visualize them. See how, for example, "dog" is closer to "cat" than to "screwdriver". But I don't know what the loss function would be to train this model. Could I maybe feed the network with parts of a sentence so that it can learn how to predict the next word? E.g. the full sentence could be: "my dog likes to chase the cat of my neighbor". X: "my" Y: "dog"; X: "my dog" Y: "likes"; X: "my dog likes" Y: "to"; X: "my dog likes to" Y: "chase"; ... and so on. Would this kind of training be sufficient for the network to calculate the attention vectors?
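A minimal sketch of the (context, next word) training pairs described above, in plain Python just to make the scheme concrete:

```python
sentence = "my dog likes to chase the cat of my neighbor".split()

# Build (context, next word) pairs exactly as sketched in the comment.
pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]

for context, target in pairs[:4]:
    print(f'X: "{" ".join(context)}"  Y: "{target}"')
# X: "my"  Y: "dog"
# X: "my dog"  Y: "likes"
# X: "my dog likes"  Y: "to"
# X: "my dog likes to"  Y: "chase"
```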
@WaiPanTam
A year ago
thank you!
@elhalmihamza28
2 months ago
thanks, it's a good video
@chrisillas3010
A year ago
Great video!!!! Best content for transformers... Can you suggest ways to implement a transformer encoder for time series data?
@eekinchan6620
11 months ago
Hi. Great video, but I have a question. Referring to 19:31, why is the dimension of k found by using the code q.size()[-1]? Shouldn't it be k.size()[-1] instead? Thnx in advance :)
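For context, a minimal sketch of scaled dot-product attention; since q and k share the same last dimension (head_dim), q.size()[-1] and k.size()[-1] give the same value, so either works for the scaling factor:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product(q, k, v, mask=None):
    d_k = q.size()[-1]  # equal to k.size()[-1] because q and k share head_dim
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores + mask  # e.g. -inf on positions that should be hidden
    attention = F.softmax(scores, dim=-1)
    return torch.matmul(attention, v), attention

q = k = v = torch.randn(30, 8, 200, 64)  # [batch, heads, seq_len, head_dim]
values, attention = scaled_dot_product(q, k, v)
print(values.shape, attention.shape)  # [30, 8, 200, 64] and [30, 8, 200, 200]
```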
@qingjieqi3379
A year ago
Amazing video series! At 39:07, why does the layer normalization just consider 1 dimension, the length of parameter shape, but not consider the batch size? Your previous video about the layer normalization mentioned layer normalization should consider both. Am I missing something?
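For reference, a minimal sketch of layer normalization over just the last (embedding) dimension; the mean and standard deviation are computed independently for every position in every batch element via broadcasting, so the batch size never needs to appear in parameters_shape (this mirrors the idea, not necessarily the video's exact code):

```python
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    def __init__(self, parameters_shape=(512,), eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(parameters_shape))  # learnable scale
        self.beta = nn.Parameter(torch.zeros(parameters_shape))  # learnable shift

    def forward(self, x):
        # Normalize over the last len(parameters_shape) dimensions only;
        # each batch element and sequence position is normalized independently.
        dims = tuple(range(-len(self.gamma.shape), 0))
        mean = x.mean(dim=dims, keepdim=True)
        std = (x.var(dim=dims, unbiased=False, keepdim=True) + self.eps).sqrt()
        return self.gamma * (x - mean) / std + self.beta

x = torch.randn(30, 200, 512)  # [batch, seq_len, d_model]
print(LayerNormalization()(x).shape)  # torch.Size([30, 200, 512])
```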
@li-pingho1441
A year ago
awesome content! thanks a lot!!
@CodeEmporium
A year ago
Thanks so much!
@creativeuser9086
A year ago
I know it's a lazy question, but can someone tell me why multi-head attention is better than single-head attention?
@terjeoseberg990
2 months ago
My guess is because each head can learn different types of relationships in the input text to present to the next layer of processing.
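A rough sketch of that intuition: the 512-dimensional representation gets split into 8 heads of 64 dimensions, so each head computes its own attention pattern over a different subspace instead of one pattern over everything (shapes follow the ones used elsewhere in this thread):

```python
import torch

batch_size, max_sequence_length, d_model, num_heads = 30, 200, 512, 8
head_dim = d_model // num_heads  # 64 dimensions per head

x = torch.randn(batch_size, max_sequence_length, d_model)

# With a single head there is one attention pattern mixing all 512 dimensions;
# after splitting, each of the 8 heads attends independently over its 64 dims.
heads = x.reshape(batch_size, max_sequence_length, num_heads, head_dim)
heads = heads.permute(0, 2, 1, 3)  # [30, 8, 200, 64]
print(heads.shape)
```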
@amiralioghli8622
A year ago
Overall your explanation is great, but I'm a little confused. Actually I could not understand the difference between positional encoding and the position-wise feed forward network. Can anyone explain it to me?
@froozynoobfan
A year ago
Your code is pretty clean, though I prefer "black" code formatting.
@Xuan-z9w
A year ago
thank u a lot
@RanDuan-dp6oz
A year ago
Thanks!
@CodeEmporium
A year ago
Thanks for the donation and for watching!
@user-wr4yl7tx3w
A year ago
Where did you get the 3 for 3 times 512 = 1536? Is it 3 because you have query, key, and value?
@CodeEmporium
A year ago
For every token (word or character), we have 3 vectors: query, key, and value. Each token is represented by a 512-dimensional vector. This is encoded into the query, key, and value vectors, which are also 512 dimensions each. Hence 3 * 512.
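A minimal sketch of where the 1536 shows up: one linear layer maps each 512-dimensional token to 3 × 512 = 1536 values, which are then split back into the query, key, and value vectors:

```python
import torch
import torch.nn as nn

batch_size, max_sequence_length, d_model = 30, 200, 512
x = torch.randn(batch_size, max_sequence_length, d_model)

# A single linear layer produces query, key, and value at once: 3 * 512 = 1536 outputs.
qkv_layer = nn.Linear(d_model, 3 * d_model)
qkv = qkv_layer(x)              # [30, 200, 1536]
q, k, v = qkv.chunk(3, dim=-1)  # three tensors of shape [30, 200, 512]
print(qkv.shape, q.shape, k.shape, v.shape)
```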
@-mwolf
A year ago
I think you forgot to pass the mask value in your MHA code.. I think here you need a ModuleList and can't use nn.Sequential.
@CodeEmporium
A year ago
I definitely need this for the decoder and I get around this by implementing my custom “Sequential” class. I was able to run this code tho just fine as is (sorry if I missed exactly what you are alluding to)
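One way such a custom "Sequential" could look; this is a sketch of the general idea (forwarding the mask to every layer), not necessarily the exact class used in the video:

```python
import torch.nn as nn

class SequentialEncoder(nn.Sequential):
    """nn.Sequential forwards only a single input, so this subclass also
    passes the attention mask along to every encoder layer."""
    def forward(self, x, mask=None):
        for layer in self:
            x = layer(x, mask)
        return x

# Usage sketch, assuming an EncoderLayer whose forward accepts (x, mask):
# layers = SequentialEncoder(*[EncoderLayer() for _ in range(num_layers)])
# out = layers(x, mask)
```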
@-mwolf
A year ago
@@CodeEmporium Ah of course - I missed that we don't need it for the encoder (and that you could implement a custom nn.Sequential as opposed to a ModuleList of the layers, although I'm not sure which of the approaches would be nicer).
@vigneshvicky6720
A year ago
YOLOv8
@LiLi-n1n
A year ago
Did he just mimic what Andrej Karpathy was doing? The explanation is not even 10% as clear as what Andrej did. So bad.
@godly_wisdom777
A year ago
A video about how to code ChatGPT in which the code is generated by ChatGPT 😁
Comments: 71