Soroush Mehraban

9:09
1 month ago

The Entropy Enigma: Success and Failure of Entropy Minimization

13:16
1 month ago

Tent: Fully Test-time Adaptation by Entropy Minimization

9:44
1 month ago

VPD (ICCV2023): Unleashing Text-to-Image Diffusion Models for Visual Perception

30:13
1 month ago

TokenHMR (CVPR2024): Advancing Human Mesh Recovery with a Tokenized Pose Representation

22:26
1 month ago

SHViT (CVPR2024): Single-Head Vision Transformer with Memory Efficient Macro Design

22:17
2 months ago

InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

14:10
3 months ago

FastV: An Image is Worth 1/2 Tokens After Layer 2

28:39
3 months ago

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

32:22
5 months ago

PoseGPT (ChatPose): Chatting about 3D Human Pose

9:13
6 months ago

MotionAGFormer (WACV2024): Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network

35:08
7 months ago

HD-GCN (ICCV2023): Skeleton-Based Action Recognition

8:25
8 months ago

ST-GCN: Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

13:08
8 months ago

Graph Convolutional Networks (GCN): From CNN point of view

21:12
9 months ago

DINO: Self-Supervised Vision Transformers

31:03
1 year ago

MoCo (+ v2): Unsupervised learning in computer vision

22:30
1 year ago

ViTPose: 2D Human Pose Estimation

28:40
1 year ago

TrackFormer: Multi-Object Tracking with Transformers

10:59
1 year ago

MetaFormer is Actually What You Need for Vision

21:00
1 year ago

ConvNet beats Vision Transformers (ConvNeXt) Paper explained

21:32
1 year ago

Swin Transformer V2 - Paper explained

15:20
1 year ago

Masked Autoencoders (MAE) Paper Explained

23:13
1 year ago

Relative Position Bias (+ PyTorch Implementation)

19:59
1 year ago

Swin Transformer - Paper Explained

6:41
1 year ago

Vision Transformer (ViT) Paper Explained

7:05
1 year ago

Convolutional Block Attention Module (CBAM) Paper Explained

9:11
1 year ago

Squeeze-and-Excitation Networks (SENet) paper explained

12:18
1 year ago

Faster R-CNN: Faster than Fast R-CNN!

8:11
1 year ago

Receptive Fields: Why 3x3 conv layer is the best?

38:37
1 year ago

Fast R-CNN: Everything you need to know from the paper

Comments

@noony31122009
23 hours ago
awesome
@marioparreno24
4 days ago
Thanks for the intuitions, faqs and clearly explained topics!
@soroushmehraban
4 days ago
Glad you liked it Mario🙂
@marioparreno24
3 days ago
@soroushmehraban Just one question. Why is centering only applied to the teacher and sharpening to both the student and the teacher? Could we not apply centering to both? Maybe if we add both operations to both sides we play a zero-sum game and we have the collapse problem again, I don't know 😅 Maybe we then need to artificially create an imbalance.
@soroushmehraban
3 days ago
@marioparreno24 From my understanding, sharpening makes the model more confident that this sample belongs to a certain pseudo-class (the output label of the model, for which we don't have ground truth). We want the student to stay certain about it, so we sharpen it. The less certain the student is, the less able it is to differentiate samples from different images. But for images we do both to prevent the mode collapse. But this is just based on my intuition. Don't quote me on that lol.
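For readers following this thread, here is a minimal NumPy sketch of how centering and sharpening are usually described for DINO: the teacher output is centered and then sharpened with a low temperature, while the student output is only passed through a (higher) temperature. The function names, the temperatures `tau_s` and `tau_t`, and the center `momentum` are illustrative assumptions, not the video author's or the official implementation.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    # Teacher: centering (subtract a running mean) then sharpening (low temperature).
    teacher_probs = np.exp(log_softmax((teacher_logits - center) / tau_t))
    # Student: temperature only, no centering.
    student_log_probs = log_softmax(student_logits / tau_s)
    # Cross-entropy between teacher and student distributions, averaged over the batch.
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

def update_center(center, teacher_logits, momentum=0.9):
    # Exponential moving average of the batch mean of the raw teacher outputs.
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 8))   # (batch, output dimension), made-up sizes
teacher = rng.normal(size=(4, 8))
center = np.zeros(8)
loss_value = dino_loss(student, teacher, center)
center = update_center(center, teacher)
```

Centering alone pushes the teacher toward a uniform output and sharpening alone pushes it toward a single dominant dimension, which is the usual intuition for why the two are combined on the teacher side.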
@MadinideAlwis
8 days ago
Very interesting! need more videos.
@jialiangxu1657
9 days ago
Hi, I'm still a bit confused, so could you please tell me how you solve the 3D pose judder? The 2D poses contain the judder problem, but I cannot find it after lifting to 3D in the demo video of your code. Thank you.
@soroushmehraban
7 days ago
Hi Jialiang, Throughout training the model also sees 2D poses with jitter, but as the ground-truth output it sees motion-capture 3D poses, and we have a velocity loss (we multiply it by 20 to make it 20 times more important than MPJPE) that pushes the model's estimate to have the same velocity as the ground truth and penalizes it if it has jitter. So in addition to lifting the input from 2D to 3D and inferring the underlying 3D structure, the model also has to denoise the input.
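The loss described in this reply can be sketched roughly as follows. This is a simplified illustration, assuming poses are arrays of shape `(frames, joints, 3)` and that velocity is the frame-to-frame difference; the function names and the exact form of the velocity term are assumptions, and only the weight of 20 comes from the reply above.

```python
import numpy as np

def mpjpe(pred, gt):
    # Mean per-joint position error: average Euclidean distance over frames and joints.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def velocity_loss(pred, gt):
    # Compare frame-to-frame displacements, so jitter in the prediction is penalized
    # even when each individual frame is close to the ground truth.
    pred_vel = np.diff(pred, axis=0)
    gt_vel = np.diff(gt, axis=0)
    return float(np.linalg.norm(pred_vel - gt_vel, axis=-1).mean())

def total_loss(pred, gt, lambda_vel=20.0):
    # Velocity term weighted 20x relative to MPJPE, as stated in the reply.
    return mpjpe(pred, gt) + lambda_vel * velocity_loss(pred, gt)

rng = np.random.default_rng(0)
gt_poses = np.cumsum(rng.normal(scale=0.01, size=(30, 17, 3)), axis=0)  # smooth motion
jittery = gt_poses + rng.normal(scale=0.02, size=gt_poses.shape)        # per-frame noise
offset = gt_poses + 0.02                                                # constant shift
```

A constant offset such as `offset` has zero velocity loss, while `jittery` is penalized by both terms, which is why the velocity term acts as a smoothness constraint.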
@pranavgandhiprojects
9 days ago
veryy veryyy well explained..... i also loved your video on fast rcnn:) amazing workk
@pranavgandhiprojects
10 days ago
WOw so well explained....thankyou very much:)
@yakuzi07
18 days ago
Is there a way to use Grad-CAM on a Siamese CNN? I'm getting a graph disconnect error whenever I try, and I have read that it's because Grad-CAM was originally designed to accept a single input instead of multiple inputs.
@VedantJoshi-mr2us
20 days ago
By far one of the best + complete, SWIN transformer explanations on the entire Internet.
@soroushmehraban
20 days ago
Thanks!
@FinalProject-rw1yf
19 days ago
@soroushmehraban Hi sir, could you also explain the FasterViT and GCViT papers...
@hamidrezahemati8837
24 days ago
Great video. keep up the good work
@SaraTaro
24 days ago
This made it so much clearer!! Great job :)
@user-gl5ys8nr2u
25 days ago
Excellent video! Would you recommend any resources that explain the theorems they propose for low-rank gradients and their convergence in depth? Also, what tools do you use to create such cool animations?
@victormanuel8767
25 days ago
I may not be fully caught up but this gives some context around why cross entropy loss is minimized as a criterion during training. Thanks for this overview.
@mjavadrajabi7401
1 month ago
Perfect!!
@soroushmehraban
1 month ago
Thanks for watching! 😃
@rohollahhosseyni8564
1 month ago
Great video Soroush. Thanks.
@soroushmehraban
1 month ago
Thanks for the feedback 😃
@NarkeEmpire
1 month ago
You are a great teacher 🙏
@soroushmehraban
1 month ago
Thanks😃
@user-zb9ub5nd1z
1 month ago
Hello Soroush, how can I contact you please? I am working on my thesis and wanted to get your input on something. Thanks
@soroushmehraban
1 month ago
Hello, just search my name on Google and you'll find me on Twitter or LinkedIn. My email is also shared here on KZitem.
@alinaderiparizi7193
1 month ago
<3
@alinaderiparizi7193
1 month ago
Liked (❤)
@alinaderiparizi7193
1 month ago
Perfect, Thank you.
@soroushmehraban
1 month ago
😃❤️
@ericsy78
1 month ago
Fantastic👌
@soroushmehraban
1 month ago
Thanks!
@ericsy78
1 month ago
You're amazing, create more!
@soroushmehraban
1 month ago
Thanks for the kind words 🙂
@senpanwu5163
1 month ago
Great Work! You explained 1000 times better than my uni lecturer :D
@subramanyabhat446
1 month ago
The loss functions were definitely a bit tricky to get around. But that was a really cool video tho! One thing you could've also touched upon is the usage of Deformable DETR in place of DETR. I can see the TrackFormer code does incorporate it, but I wanted to know what changes in TrackFormer when you switch from DETR to the deformable one?
@hasanghavidel2701
1 month ago
you explain complicated stuff very clearly.. thx
@user-ui5dg3nr3r
1 month ago
useful
@amacodes7347
1 month ago
GOAT 🐐!!!!!!!!!! Best GCN explanation and nailed it with the original paper formula decomposition, this is the reason why KZitem ML is the best.
@ArchanaVijayan-bc5tr
1 month ago
What is meant by channel here?
@iliyasindikar4695
2 months ago
well explained.
@Hansly_rz
2 months ago
oh my it explains everything at once! Thank you for making this video!
@mrraptorious8090
2 months ago
Hey, I am asking myself how to train ViTPose myself. Did you by any chance train it yourself? If so, could you share your experience?
@AlexXPandian
2 months ago
Is truncated SVD the same as PCA?
@alihadimoghadam8931
2 months ago
Thanks chief
@punithandharani
2 months ago
Easy to understand. Expecting GAT...
@rohollahhosseyni8564
2 months ago
Great as always
@soroushmehraban
2 months ago
Thanks!
@nadhembenhadjali9063
3 months ago
Nice explanation ! thank you so much !
@Ju124664
3 months ago
Thanks for this video !
@raajushawarma9367
3 months ago
why do we have 2 diagonal matrices in the last equation?
@soroushmehraban
3 months ago
That's a good question. I think that's because we normalize once row-wise and once column-wise.
@raajushawarma9367
3 months ago
@soroushmehraban Thanks for responding quickly. So are we normalising matrix A only, or the H matrix too?
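As a small illustration of the two diagonal matrices discussed in this thread, the sketch below builds the symmetrically normalized adjacency used in the standard GCN formulation, D^{-1/2} (A + I) D^{-1/2}, and applies one layer. In that formulation only the adjacency (with self-loops) is normalized; the feature matrix H is left as it is. The function names and the toy graph are assumptions for illustration.

```python
import numpy as np

def normalize_adjacency(A):
    # Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j).
    A_hat = A + np.eye(A.shape[0])
    deg = A_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Left multiplication by D^{-1/2} scales rows, right multiplication scales columns.
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

def gcn_layer(A, H, W):
    # One GCN layer: normalized adjacency times features times weights, then ReLU.
    return np.maximum(normalize_adjacency(A) @ H @ W, 0.0)

# Toy path graph 0 - 1 - 2 with 2 input features and 4 output features.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
H = np.arange(6, dtype=float).reshape(3, 2)
W = np.ones((2, 4))
out = gcn_layer(A, H, W)
```

Scaling by the degree on both sides keeps the normalized matrix symmetric, which plain row-wise normalization (D^{-1} A) would not.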
@savanthtadepalli3968
3 months ago
Your explanation is truly awesome! Keep making more, please!
@soroushmehraban
3 months ago
Thanks 🙂
@AhmedEssam_eramax
3 months ago
Do you think this approach can be used with LLMs as well? If we could apply it to the LLM context or any part of the conversation, it would speed up inference dramatically.
@soroushmehraban
3 months ago
I don't think so. At 11:19 I explained a table where they tried pruning instruction tokens and performance became worse. They also used the StreamingLLM technique on image tokens and performance degraded significantly. So the behavior of image tokens and text tokens seems to be different, and future research can look into why.
@kinger1080
3 months ago
good
@buh357
3 months ago
Does it work on a small dataset? Let's say 1000 images?
@soroushmehraban
3 months ago
I don't think so. Transformers are data hungry and need a lot of data to generalize. The smallest pretraining dataset I have seen was in ViTPose, where they pretrained with this technique on 150k images, and when they doubled the data it got only 1.3% better.
@alihadimoghadam8931
3 months ago
Chief
@wolpumba4099
3 months ago
*Abstract*
This paper investigates the redundancy of visual representations within large vision-language models (LVLMs). The authors propose a method, "FastV", which prunes image tokens to improve efficiency while preserving performance.
*Key Findings:*
* *Image tokens are redundant:* Analysis reveals that image tokens consistently receive the lowest attention scores within LVLMs.
* *Pruning improves efficiency:* FastV selectively filters image tokens at inference time based on attention analysis. This drastically reduces computational cost (FLOPs) without compromising output quality.
* *Shallow layer pruning is detrimental:* Pruning image tokens in early layers of the LVLM negatively impacts performance more than pruning in deeper layers.
* *System prompts and instructions are essential:* Attempts to modify system prompt or instruction tokens resulted in significant performance drops.
*Method (FastV):*
1. *Attention analysis:* Calculate attention scores for each image token within a chosen transformer layer of the LVLM.
2. *Filtering:* Rank image tokens based on attention scores and filter out a specified percentage (e.g., 50%) of the least important tokens.
3. *Inference:* The pruned model operates on fewer image tokens, reducing computational complexity.
*Significance:*
* *Optimization potential:* LVLMs can be made significantly more computationally efficient via image token pruning.
* *Implications for model design:* The findings highlight the potential to re-design LVLM architectures focusing on efficient visual information processing.
I used Gemini.
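The three method steps in the summary above can be sketched as follows. This is a toy illustration, assuming the attention weights of the chosen layer are available as a `(heads, seq_len, seq_len)` array and that a token's score is the average attention it receives; the function name, the scoring rule, and the example sequence layout are assumptions, not the paper's code.

```python
import numpy as np

def prune_image_tokens(attn, image_idx, keep_ratio=0.5):
    """Return the sorted sequence positions to keep after pruning image tokens.

    attn:      (num_heads, seq_len, seq_len) attention weights of one layer,
               where attn[h, q, k] is the weight query q puts on key k.
    image_idx: 1-D integer array with the positions of the image tokens.
    """
    # Step 1: attention each token receives, averaged over heads and queries.
    received = attn.mean(axis=0).mean(axis=0)            # shape (seq_len,)
    # Step 2: rank image tokens by that score and keep the top fraction.
    scores = received[image_idx]
    n_keep = int(np.ceil(len(image_idx) * keep_ratio))
    kept_image = image_idx[np.argsort(-scores)[:n_keep]]
    # Step 3: non-image tokens (system prompt, instruction) are always kept.
    non_image = np.setdiff1d(np.arange(attn.shape[-1]), image_idx)
    return np.sort(np.concatenate([non_image, kept_image]))

# Toy sequence: position 0 = system token, 1-4 = image tokens, 5 = instruction token.
attn = np.zeros((1, 6, 6))
attn[0, :, 0] = 0.0
attn[0, :, 1] = 0.1
attn[0, :, 2] = 0.5
attn[0, :, 3] = 0.1
attn[0, :, 4] = 0.3
kept = prune_image_tokens(attn, np.array([1, 2, 3, 4]), keep_ratio=0.5)
```

The later layers would then run only on the positions in `kept`, which is where the compute saving comes from.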
@alihadimoghadam8931
3 months ago
Great
@alinaderiparizi7193
3 months ago
Nice job, It would be great if you could create the implementation videos as well ❤
@gamerfawaz1234
3 months ago
Love❤, keep sharing, and shining
@yashmandilwar8904
3 months ago
"mr" is the size of the projector P_t, I think. In the algorithm they calculate R_t = P_t.T G_t. Great video by the way! Thanks.
@soroushmehraban
3 months ago
Yes, you’re right. Not sure why I missed that lol. Thanks!
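The shapes mentioned in this exchange can be checked with a short NumPy sketch. It assumes a gradient `G` of shape `(m, n)` and takes the projector from the top `r` left singular vectors of `G`, so `P` holds `m * r` numbers and `R = P.T @ G` has shape `(r, n)`. The helper name and the sizes are made up for illustration.

```python
import numpy as np

def compute_projector(G, r):
    # Top-r left singular vectors of the gradient; shape (m, r).
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :r]

rng = np.random.default_rng(0)
m, n, r = 6, 10, 2
G = rng.normal(size=(m, n))      # stand-in for a weight gradient

P = compute_projector(G, r)      # (m, r): m * r entries to store
R = P.T @ G                      # (r, n): low-rank projected gradient
G_back = P @ R                   # (m, n): projected back to the weight shape
```

With `r` equal to `m` the projector is a full orthogonal matrix and `G_back` equals `G`; smaller `r` trades accuracy of the gradient for memory in the optimizer states, which live in the `(r, n)` space.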
@alinaderiparizi7193
3 months ago
Interesting!
@christianondo9637
3 months ago
This is easily the best video I've ever seen on GCNs
@soroushmehraban
3 months ago
Thanks 😃
@wolpumba4099
3 months ago
*ELI5 Abstract*
Imagine a super smart computer that learns to talk like a person. But it needs a HUGE closet to store all its knowledge! That's a problem. Some scientists make the closet smaller by only training some of its 'smarts' at a time (that's kind of like LoRA). But this can make the computer a bit less good at talking. GaLore is a new idea! It's like folding the computer's knowledge to fit in the closet. It still learns everything, but sometimes squishes the information to save space. GaLore also likes to change how the knowledge is folded, so it doesn't get stuck learning only one way. Tests show that GaLore talks almost as well as the really big computer, but it fits into a much smaller closet! It even does better than LoRA on some tricky language puzzles.
*Abstract*
Large language models (LLMs) offer powerful capabilities, but their massive memory requirements pose challenges for training. LoRA (Low-Rank Adaptation) is one technique to address this, but it can have limitations in performance and flexibility. This work introduces GaLore (Gradient Low-Rank Projection), a novel method for memory-efficient LLM training. GaLore projects gradients into a low-rank space during optimization, reducing memory usage while maintaining the potential to replicate full-rank behavior. It addresses limitations of LoRA by supporting pre-training from scratch and periodically exploring different subspaces to avoid plateaus. Experiments demonstrate that GaLore achieves performance very close to full-rank models, outperforming LoRA especially at smaller model sizes. On the GLUE benchmark, GaLore shows superior average performance compared to LoRA. This indicates GaLore's potential as a powerful technique for enabling LLM training on resource-constrained hardware.
@wolpumba4099
3 months ago
*Summary*
*Intro*
* *0:00* This video explores GaLore, a technique for making language model training memory-efficient so that even a single GPU can train a full language model.
* *0:30* Large language models (LLMs) are powerful but need a lot of memory. For instance, LLaMA (with 7 billion parameters) requires 58 GB just for storing the model itself, along with information for training.
* *2:01* To make LLMs trainable on smaller setups, one approach is training only some sections of the model at a time.
* *2:18* LoRA (Low-Rank Adaptation) is a popular way to do this, but it has some limitations.
*Limitations of LoRA*
* *3:20* LoRA may not always match the full potential of a model trained the normal way.
* *3:53* LoRA's success can sometimes depend on an initial training phase that may not always be possible.
*GaLore (Gradient Low-Rank Projection)*
* *5:56* GaLore offers a different approach: instead of updating the model directly, it projects the instructions for updates (gradients) into a smaller space to save memory.
* *7:07* This relies on some special math (singular value decomposition) to create two matrices, P and Q, that help with this projection.
*GaLore vs. LoRA*
* *9:21* Key difference: GaLore projects the updates, LoRA changes the model itself.
* *9:32* With the right settings, GaLore can work just like training the model normally, which LoRA finds harder to achieve.
*Important Note about GaLore*
* *12:47* Working in a smaller space has a downside: it can get stuck. GaLore tackles this by periodically changing the way it projects updates to keep exploring better solutions.
*Gradient Updates, Memory Saving, & More*
* *14:26* GaLore has settings to control how often it changes the update process, as the best timing depends on other factors.
* *17:34* It can even use just one of the P or Q matrices to save even more memory.
* *18:18* To use GaLore with popular training methods like Adam, there's specific code outlined in the research paper.
* *18:44* GaLore has three key settings: scale factor (alpha), rank (r), and change frequency (T).
* *20:57* GaLore can be combined with other memory-saving techniques like 8-bit quantization and LOMO for even better results.
*GaLore vs. LoRA: Details*
* *24:48* Here's a breakdown:
* *Weights:* GaLore uses the original model weights, LoRA modifies them.
* *Optimizer States:* GaLore tends to save more memory here.
* *Multi-Subspace:* GaLore supports the 'update switching' for better training, LoRA doesn't.
* *Pre-training:* GaLore works even when starting from scratch, LoRA might have trouble.
*Results*
* *26:21* Generally, in GaLore, using a higher rank leads to better performance but takes more time.
* *27:10* GaLore consistently beats LoRA in performance across different model sizes.
* *27:45* GaLore gets very close to the performance of normally trained models, while LoRA has a larger performance gap, especially with smaller models.
* *28:01* On the GLUE benchmark (testing various language tasks), GaLore outperforms LoRA across different settings.
Disclaimer: I used Gemini Advanced 1.0 (2024.03.04) to summarize the video transcript. This method may make mistakes in recognizing words and it can't distinguish between speakers.
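To make the three settings in the summary (scale factor, rank, change frequency) concrete, here is a rough NumPy sketch of GaLore-style training with Adam for a single weight matrix. It keeps the Adam moments in the projected `(r, n)` space and recomputes the projector every `T` steps. The function name, `grad_fn`, and all default values are illustrative assumptions, and this is a simplified reading of the idea rather than the paper's reference code.

```python
import numpy as np

def galore_adam(W, grad_fn, steps, r=2, T=10, alpha=0.25, lr=0.01,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Low-rank projected Adam for one weight matrix W of shape (m, n), r <= min(m, n).

    grad_fn is a hypothetical callable returning the full gradient for a given W.
    """
    m, n = W.shape
    M = np.zeros((r, n))     # first moment, stored in the low-rank space
    V = np.zeros((r, n))     # second moment, stored in the low-rank space
    P = None
    for t in range(1, steps + 1):
        G = grad_fn(W)
        if (t - 1) % T == 0:
            # Switch subspace: top-r left singular vectors of the current gradient.
            U, _, _ = np.linalg.svd(G, full_matrices=False)
            P = U[:, :r]
        R = P.T @ G                                   # project gradient: (r, n)
        M = beta1 * M + (1.0 - beta1) * R
        V = beta2 * V + (1.0 - beta2) * R ** 2
        M_hat = M / (1.0 - beta1 ** t)                # Adam bias correction
        V_hat = V / (1.0 - beta2 ** t)
        N = M_hat / (np.sqrt(V_hat) + eps)            # normalized low-rank update
        W = W - lr * alpha * (P @ N)                  # project back and apply
    return W

# Toy problem: pull W toward a fixed target under a squared-error loss.
target = np.arange(24, dtype=float).reshape(4, 6) / 10.0
W_final = galore_adam(np.zeros((4, 6)), lambda W: W - target, steps=50)
```

The optimizer state here has `2 * r * n` entries instead of `2 * m * n`, plus `m * r` for the projector, which is the source of the memory saving the summary describes.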