I will cover the Vision Transformer in three parts. The first part, which is this video, focuses on patch embedding in the Vision Transformer.
I will go over all the details and explain everything happening inside patch embedding in ViT.
I will also go over what an implementation of patch embedding for the Vision Transformer in PyTorch would look like.
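Before watching, here is a minimal, dependency-free sketch of the core idea the video covers: cutting an image into non-overlapping patches and flattening each patch into a vector. This uses plain Python lists purely for illustration; the actual implementation shown in the video would use PyTorch tensors (typically a Conv2d with stride equal to the patch size) plus the learned projection, CLS token, and positional embeddings.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened patch vectors,
    ordered left-to-right, top-to-bottom, as in ViT."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patch = []
            for r in range(top, top + patch_size):
                patch.extend(image[r][left:left + patch_size])
            patches.append(patch)
    return patches

# Toy 8x8 single-channel "image" with values 0..63 for illustration.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
patches = image_to_patches(image, 4)
print(len(patches))     # (8/4) * (8/4) = 4 patches
print(len(patches[0]))  # each patch flattened to 4*4 = 16 values
```

In the real ViT each flattened patch (e.g. 16x16x3 = 768 values for a 224x224 RGB image) is then linearly projected to the model's embedding dimension before the CLS token is prepended and positional embeddings are added.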
The second part which goes through attention can be found here -
Attention in Vision Transformer (Part Two) - • ATTENTION | An Image i...
The third part which builds entire transformer and shows how to visualize attention maps and positional embeddings can be found below -
Implementing Vision Transformer (Part Three) - • Image Classification U...
Timestamps:
00:00 Intro
00:56 Need for Patch Embedding in Vision Transformer
01:30 Converting Image into Sequence of Patches
01:59 Patch Embedding Projection
02:45 Positional Information for Patches
03:40 CLS Token
04:10 Patch Embedding Responsibilities
04:40 Patch Embedding Module Implementation
08:02 Outro
Paper Link - tinyurl.com/exai-vit-paper
Implementation will be pushed here after all three videos are out - tinyurl.com/exai-vit-code
Subscribe - tinyurl.com/exai-channel-link
Background Track - Fruits of Life by Jimena Contreras
Email - explainingai.official@gmail.com