Multihead Attention's Impossible Efficiency Explained

If the claims in my last video sound too good to be true, check out this video to see how the Multihead Attention layer can act like a linear layer with far less computation and far fewer parameters.
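As a rough illustration of the parameter gap the video refers to (my own sketch, not taken from the video): a single dense linear layer that mixes an entire sequence of n tokens with d features each needs an (n·d)×(n·d) weight matrix, while multihead attention only needs a handful of d×d projections, independent of sequence length. The example values for n and d below are assumptions for illustration.

```python
# Hedged sketch: compare the parameter count of one dense linear layer
# over a flattened sequence against a standard multihead attention layer.

def linear_layer_params(n: int, d: int) -> int:
    # A dense linear layer mapping the flattened sequence (n*d values)
    # to another flattened sequence needs an (n*d) x (n*d) weight matrix.
    return (n * d) ** 2

def multihead_attention_params(d: int) -> int:
    # The Q, K, V, and output projections are each d x d (biases omitted),
    # regardless of sequence length n or the number of heads.
    return 4 * d * d

n, d = 1024, 512  # assumed example sequence length and feature size
print(f"linear layer:        {linear_layer_params(n, d):,} parameters")
print(f"multihead attention: {multihead_attention_params(d):,} parameters")
```

With these assumed sizes, the linear layer needs hundreds of billions of parameters while attention needs about a million, which is the kind of gap the video's title alludes to.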
Patreon: patreon.com/animated_ai
Animations: animatedai.github.io/