Very good explanation; the entire series (1, 2, 3) on attention provides a good step-by-step understanding of the attention concepts.
@anassbairouk953
3 years ago
The best explanation ever
@sujitnalawade8661
2 years ago
One of the best explanations available on the internet for transformers.
@mehrzadio
4 years ago
That was the best explanation I've seen. BRAVO!
@hadjdaoudmomo9534
4 years ago
Excellent and clear
@fawzinashashibi4758
3 years ago
This series on the attention mechanism is the best I've seen: clear and intuitive. Thank you!
@mikaslanche
4 years ago
These explanations are so good. Thanks for uploading these :)
@saikatroy3818
2 years ago
The explanation is awesome, superb. Attention mechanisms were a black box for me, but now they are like an open secret. Thanks
@esteveslisboeta
3 years ago
Another exceptional video! It seems like the main idea behind multi-head attention was a continuation of the queries, keys, and values idea, which was essentially to increase the learning power of the model. Before there were Q, K, V, the model was not trainable. Then, by adding Q, K, and V, the model became able to learn. Now, by adding multi-head attention, the model becomes smarter because each head can pick up a different relation. Thanks a lot for sharing this precious knowledge.
@morrislo6042
3 years ago
Best explanation I have ever seen
@farrukhzamir
A year ago
Very good tutorial series on attention, multi-head attention, and transformers. God bless you. Without your explanation videos I wouldn't have understood it.
@pushkarparanjpe
3 years ago
Fabulous! Clear explanations.
@CristiVladZ
3 years ago
Not a little bit, but massive intuition. Thank you!
@richadutt665
3 years ago
That clearly answered all my questions. Thanks
@ericklepptenberger6352
3 years ago
Thank you, best explanation ever. You helped me understand attention intuitively for the first time. Thx!
@danish5326
8 months ago
You explained to me what I have been struggling to learn for a year. Thanks so much. BTW it's parallelize, not paralyze 😜 4:23
@linjie6446
3 years ago
What a fantastic explanation!
@lilialola123
3 years ago
THIS IS AMAZING, THANK YOU FOR THE CLARITY! Please keep the ML videos up
@albertwang5974
4 years ago
For those who cannot understand multi-head attention, here is a tip: every head can be treated as a channel or a feature.
@ytcio
8 months ago
OK, but how do they specialize? Why don't they just end up as copies of each other?
@kartikpodugu
6 months ago
@@ytcio They are just like different CNN channels. Just as each n×n window of a CNN channel focuses on different aspects of the image, each head focuses on a different type of attention.
@krishanudasbaksi9530
4 years ago
Once again, a very nice explanation
@444haluk
3 years ago
OMG you are the best! Just listening to all of your explanation videos in case my stupid teachers missed other things to teach as well!
@kurotenshi7069
A year ago
Thanks a lot! The best explanation of the multi-head attention mechanism concept!
@aayushjariwala6256
2 years ago
Loved this series
@gausseinstein
2 years ago
Wonderful explanation. Many thanks!!
@naserpiltan1539
3 years ago
The best explanation I've ever seen. So clear and helpful. Thx
@TheVinkle
3 years ago
Good and intuitive explanations, thanks.
@techrelieve1716
2 years ago
Really appreciate you making the explanation so easy to understand. Keep up the great work.
@zihanchen4312
4 years ago
Perfect explanation, man! Thank you for your efforts, and can't wait to see your future content! :D
@asmersoy4111
A year ago
Incredible explanation! Thank you so much!
@AnkitBindal97
4 years ago
Thank you!
@evandavid940
A year ago
This series is just incredible!!
@ashokkumarj594
3 years ago
Thank you for your great job 😍😍😍
@engenglish610
3 years ago
The best explanation of all
@aj-tg
4 years ago
Nicely done, mate!
@jeff__w
A year ago
9:23 “Every attention header is giving its attention on something different.” (1) Is that just a function of each attention header calculating the dot product for a particular (and different) token in the sentence? (2) Another post I read said “ *Each Head does not process the whole embedding vector, it processes just a part of the vector.* Assume that our embedding is of size _d,_ and that we have _h_ heads, that means that the first head is going to process the first _d/h_ dimensions of the vector, the second head is going to process the next _d/h_ dimensions, and we continue in the same pattern.” Is _that_ what is giving rise to the difference in attention? (3) What are the “layers”? Are those each multi-head attention layers? Basically, what is giving rise to the attention header giving attention to something different?
@mingyanghe7029
3 years ago
Thank you, the best explanation ever
@ahmed22502145
2 years ago
Great job! You really helped me
@zhangmanren
3 years ago
Deserves more views
@AdrianYang
2 years ago
My understanding and thoughts: a fully connected layer is too flexible, with all its weights free from restriction, making it prone to overfitting. An RNN is too strict, with all the items sharing the same weights (only with different powers), causing it to underfit (gradient explosion/vanishing leads to poor learning ability). LSTM and GRU solve this by adding more weights in the form of gates, so that more memory can be kept. Attention continues to relax the tight weight restriction of the RNN while keeping the weights not as free as in a fully connected layer, since the key weights in attention must come from the inputs to calculate the scores. The weights in fully connected layers and RNNs mostly learn position info, while the weights in attention learn the embeddings.
@sumowll8903
2 years ago
Great explanation!!! Thank you!!!
@masoudparpanchi505
3 years ago
Thanks
@karimabdultankian28
4 years ago
Amazing!
@ansharora3248
3 years ago
Wow!
@donkeyknight1453
2 years ago
You explain things better than my professor, lol
@pranjalchaubey
4 years ago
Super!
@arigato39000
3 years ago
Thank you from Japan!
@sampsuns
A year ago
Are these heads in parallel or sequential? At 7:30 they seem to be in parallel, and at 10:41 they seem sequential. Another question: if they are in parallel, why are the trained Q, K, V not the same for each head?
@karannchew2534
5 months ago
There are multiple layers/stacks/blocks, and each layer/stack/block has multiple heads.
@harryshuman9637
2 years ago
Quality stuff
@heets1971
8 months ago
I don't understand why we need multiple attention heads. Also, how are the weights for keys, queries, and values not trained to the same values across the multiple heads? Is it because they are trained differently, or because they are initialized differently?
@bright1402
3 years ago
Thank you for your video! But one thing I am not clear on: when you conduct the split before feeding into the multi-head attention to get K, Q, and V, how do you split the data V? For instance, if V is an n×d matrix, where n is the size of the vocabulary and d is the dimension of the word vector, does this split work on the n dimension or the d dimension? Thank you!
@RasaHQ
3 years ago
There's not so much a split happening. It's more that there are multiple layers being attached. Each arrow represents a matrix multiplication, not a "cut" or a "split". Does this help?
@bright1402
3 years ago
@@RasaHQ Thank you for your reply! Yeah, it is clear now.
@gordonlim2322
3 years ago
@@RasaHQ I quote from the paper: "In this work we employ h = 8 parallel attention layers, or heads. For each of these we use dk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality." It seems there is actually a split going on. To relate to the variables you've used: there shouldn't be h full-size vectors (v1 ... vn); each of the h vectors should have a reduced dimensionality of n/h.
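The split the paper's quote describes can be sketched in a few lines of numpy: one full-width projection is reshaped into h heads of dmodel/h dimensions each. The sizes are the paper's; the variable names are illustrative, not taken from the video.

```python
import numpy as np

# Sizes from the paper's example: d_model = 512, h = 8 heads, d_k = 512/8 = 64.
d_model, h = 512, 8
d_k = d_model // h

rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)               # one token's embedding
W_q = rng.standard_normal((d_model, d_model))  # full-width query projection

# The projection is computed once at full width, then reshaped so that
# each head sees only its own 64-dimensional slice.
q = x @ W_q
q_heads = q.reshape(h, d_k)
print(q_heads.shape)  # (8, 64)
```

So the split happens along the embedding (d) dimension, not along the sequence or vocabulary (n) dimension.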
@asraajalilsaeed7435
11 months ago
Where can I find the code that displays multi-head attention in the video, where you move the cursor over tokens?
@xflory26x
A year ago
What do you mean by "layers" in the interactive visualization, in reference to your previous diagrams, and how are they different from the different colors (heads)?
@sethjchandler
3 years ago
Brilliant. Thanks!
@alexvass
A year ago
With multi-head attention, does that mean there are multiple M_k matrices (multiple matrices for the key weights)? And likewise multiple M_q and M_v matrices?
@ax5344
3 years ago
@4:07 Suppose we use multiple blocks instead of one. As the transformation inside these blocks is the same (q, k, v), are they just the same three matrices with different initial attention values?
@karannchew2534
7 months ago
Why do they first need to pass through linear layers?
@6511497115104
A year ago
Don't we cut the original V vector into h slices and feed each slice to a different attention head?
@6511497115104
A year ago
ChatGPT: Yes, in the multi-head attention mechanism, the original input embedding vector is sliced into multiple sub-vectors, or "heads", which are then processed in parallel to compute multiple sets of attention weights. The number of heads is a hyperparameter that is typically set to a small value, such as 8 or 16.

Each head in the multi-head attention mechanism has its own set of learned weight matrices, which are used to project the input embedding vector into a query, key, and value vector for that particular head. These projected vectors are then used to compute the attention weights and the weighted sum of the values, which are concatenated across all heads and passed through a final linear layer to produce the final output representation.

By splitting the input embedding vector into multiple heads, the multi-head attention mechanism is able to capture different aspects of the input representation, allowing the model to learn more nuanced and fine-grained relationships between the input tokens. Additionally, the parallel processing of the multiple heads can lead to faster training and inference times.
@TamilSelvanMurugesan-mw2bv
A year ago
Neat...
@emineaysesunarstudent1767
2 years ago
Thank you so much for the video! I am stuck on one point: how do we ensure that these different heads learn different attentions?
@RasaHQ
2 years ago
(Vincent here) While there is no hard guarantee, consider the following thought experiment. Suppose that we allow for 2 heads. Under what circumstances would the weights for both attention heads be exactly the same? If there's an improvement to be made, the gradient signal should cause the weights to differ. This depends a lot on the labels, but that's where the learning will be.
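A toy version of that thought experiment: two heads with identical shapes but independent random starting points already produce different attention patterns before any training, so the gradient has no reason to collapse them into copies. All numbers here are made up for illustration.

```python
import numpy as np

def attention_weights(X, Wq, Wk):
    """Row-softmaxed scaled dot-product scores for one head."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(42)
X = rng.standard_normal((4, 8))  # 4 tokens, 8-dim embeddings

# Two heads with the same shapes but independent random initializations.
head1 = [rng.standard_normal((8, 4)) for _ in range(2)]
head2 = [rng.standard_normal((8, 4)) for _ in range(2)]

A1 = attention_weights(X, *head1)
A2 = attention_weights(X, *head2)
print(np.allclose(A1, A2))  # False: the heads attend differently from the start
```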
@zeinramadan
2 years ago
Random initialization of the weights?
@kartikpodugu
6 months ago
I think everybody should play with the visualization tool to understand MHA better.
@abc-by1kb
2 years ago
10:06 Do you mean "Named Entity Recognition"? Great video btw. Thank you so much!!!
@RasaHQ
2 years ago
(Vincent here) D'oh! Yep, you're right!
@abc-by1kb
2 years ago
@@RasaHQ I really want to say thank you so much for the video! I never thought someone could explain self-attention and transformers in such a logical, incremental, and intuitive way. Great work!
@abc-by1kb
2 years ago
@@RasaHQ As a CS student, I think your videos should definitely come up on top when people search for transformers.
@search_is_mouse
2 years ago
I love you... (originally in Korean)
@usertempeuqwer7576
4 years ago
reupload X)
@RasaHQ
4 years ago
Yeah, there was an issue with the previous version that we only discovered after hitting the "live" button. So we re-rendered. This one should be fine.
@shenhaochong
3 years ago
I see the last step of multi-head attention is to concatenate all the output vectors from MATMUL into one dense vector, but does that mean the input vector grows its dimension by the number of heads each time it passes through a multi-head attention block?
@deepanshusingh4140
2 years ago
On point!! Super
@coolblue2468
3 years ago
Nothing can be better than this explanation of multi-head attention. Thanks a lot.
Comments: 84