Umar Jamil

I'm a Machine Learning Engineer from Milan, Italy, teaching complex deep learning and machine learning concepts to my cat, 奥利奥.
I also know a little Chinese.

1:15:39
21 days ago

Kolmogorov-Arnold Networks: MLP vs KAN, Math, B-Splines, Universal Approximation Theorem

48:46
1 month ago

Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math

2:15:13
3 months ago

Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.

1:14:29
4 months ago

Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math

1:26:21
5 months ago

Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer

1:12:53
5 months ago

Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

50:55
5 months ago

Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training

49:24
6 months ago

Retrieval Augmented Generation (RAG) Explained: Embedding, Sentence BERT, Vector Database (HNSW)

54:52
7 months ago

BERT explained: Training, Inference, BERT vs GPT/LLamA, Fine tuning, [CLS] token

5:03:32
8 months ago

Coding Stable Diffusion from scratch in PyTorch

3:04:11
9 months ago

Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm

1:10:55
9 months ago

LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU

42:53
9 months ago

Segment Anything - Model explanation with code

26:55
10 months ago

LoRA: Low-Rank Adaptation of Large Language Models - Explained visually + PyTorch code from scratch

29:58
10 months ago

LongNet: Scaling Transformers to 1,000,000,000 tokens: Python Code + Explanation

21:12
10 months ago

How diffusion models work - explanation and code!

27:12
11 months ago

Variational Autoencoder - Model, ELBO, loss function and maths explained easily!

58:04
1 year ago

Attention is all you need (Transformer) - Model explanation (including math), Inference and Training

2:59:24
1 year ago

Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.

14:01
1 year ago

CLIP - Paper explanation (training and inference)

6:58
1 year ago

Wav2Lip (generate talking avatar videos) - Paper reading and explanation

Comments

@niysniys1490
1 hour ago
Hello, thanks for your great video!! One thing is confusing me a lot. I am wondering how to train a diffusion model with CFG. I think that, depending on the input, there should also be two target images. So what is the target image for the conditional input? And what is the target image for the unconditional input? 😀😀😀
@tenzinlhakpa1672
2 hours ago
amazing work, thank you so much !
@jerrylin2790
3 hours ago
was immersing myself in the video. all of a sudden, Umar spoke to his cat in Chinese...haha... now I understand why some comments are left in Chinese..
@sharyakbar2086
14 hours ago
Can someone please help me run this to produce images from text? I have placed all the files as in the GitHub repository, but when I run the demo.ipynb file it gives me this error: TypeError: 'weights_only' is an invalid keyword argument for Unpickler()
@suriyars4487
20 hours ago
Can you please share your slides for this, as well as for the Attention Is All You Need paper, in .pptx (PowerPoint) format?
@snehotoshbanerjee1938
1 day ago
Umar, you are a great teacher. I have not seen such a great explanation of the transformer. Your transformer from scratch coding is also awesome. So, basically, you understand which parts need more explanation. Thanks for your effort.
@ChukwuemekaAmblessedchinenye
1 day ago
Can you make a tutorial video on a model like Perplexity that uses live web search?
@ChingyuenLiu
1 day ago
Hello Umar, you always produce the most concise and clear content ever! I was wondering if you are planning to do a video on Stable Diffusion 3 since the paper is out? It would be really great if you could help explain how flow matching helps or changes regular diffusion models! Thank you again for your content and work. Thank you very much!
@ziyadmuhammad3734
1 day ago
Thanks!
@bensimonjoules4402
1 day ago
Amazing content, thanks! I'm very excited about the continual learning properties of these networks.
@agenticmark
2 days ago
Please do a video where you show the process from scratch so we can do this with voice models ✊🏼
@wolfie6175
3 days ago
Good video, quality content.
@terryliu3635
3 days ago
I learnt a lot from following the steps in this video and creating a transformer myself, step by step!! Thank you!!
@ariouathanane
3 days ago
Awesome explanation. Is the [CLS] token important just because it has no zero values with the other tokens?
@AyushRaj-nt3ot
3 days ago
Sir, your explanation is just beyond awesome!!! Thank you so much for creating such content. Sir, I didn't get the residual connections part. As I am from India, I was working on Indic languages, so I had to write more code, but that's okay. Could you please help me understand the beam search code, the one you also gave in the GitHub file? Also, could you give the code for evaluating the BLEU score? I'll be really grateful to you. And again, thank you so much for such comprehensive content. We'd love to see more of your videos, especially on Generative AI! P.S.: I didn't understand how you wrote it. What I've understood is that we have to take the input of the previous layer, add it to the output of the same layer, and then apply layer norm on that. Basically Add and then LayerNorm. Please help me correct myself!
@freeweed4all
4 days ago
You're amazing, you explain well and it's easy to follow you. Keep it up!
@harshitkumar5147
4 days ago
This is just awesome!
@codevacaphe3763
4 days ago
Hi, I just happened to see your video. It's really amazing, and your channel is so good, with valuable information. Hope you keep this up, because I really love your content.
@andreanegreanu8750
4 days ago
Very clear, well explained, top notch!
@raviparihar3298
4 days ago
Best video I have ever seen on the whole of YouTube on the transformer model. Thank you so much, sir!
@cristiwally
4 days ago
The constant you scale x by comes from averaging over a bunch of examples generated by the VAE, in order to ensure they have unit variance, with the variance taken over all dimensions simultaneously: scale_factor = 1 / std(z)
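A minimal sketch of the rule stated in the comment above, scale_factor = 1 / std(z), may help. It assumes the latents can be pooled into one flat list of numbers; the random `latents` below are a hypothetical stand-in, not real VAE outputs, and the standard library is used instead of PyTorch.

```python
import random
import statistics

random.seed(0)

# Hypothetical stand-in for VAE latents: 200 "examples" of 16 values each,
# drawn with a standard deviation far from 1.
latents = [[random.gauss(0.0, 5.5) for _ in range(16)] for _ in range(200)]

# Pool every value from every example and every dimension, so the standard
# deviation is taken over all dimensions simultaneously.
flat = [value for z in latents for value in z]
scale_factor = 1.0 / statistics.pstdev(flat)

# After scaling, the pooled latents have unit standard deviation.
scaled = [value * scale_factor for value in flat]
scaled_std = statistics.pstdev(scaled)
print(scale_factor, scaled_std)
```

Because a single scalar is used, the relative scale between latent dimensions is left unchanged; only the overall spread is normalized.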
@shajidmughal3386
4 days ago
I came here from your VAE video. After that, should I watch the 5-hour-long Stable Diffusion video or this one? What do you suggest?
@rafa_br34
4 days ago
Great video! I'm wondering, is there any reason to save the positional encoding vector? I don't see why you would need to save it since it seems to always be the same value considering the init parameters don't change.
@shajidmughal3386
4 days ago
Great explanation. Clean!!! Reminds me of school where our physics teacher taught everything practical and it felt so simple. subs+1👍
@elieelezra2734
4 days ago
Good vid boss
@expectopatronum2784
4 days ago
23:39 -> loved that intuitive explanation!
@beincheekym8
4 days ago
Brilliant video! Really clear and with just the right amount of details!
@oiooio7879
5 days ago
Thank you for this video!
@elieelezra2734
5 days ago
Hello Umar, great as usual. However, why do you say at 46:11 that you need to sum the log probabilities up? The objective function is the expectation of the logarithm of the difference of two weighted log-probability ratios. I don't get what exactly you want to sum up. Thank you
@xugefu
5 days ago
Thanks!
@jueying1443
5 days ago
Thanks, could you talk about flash attention?
@aleksandarcvetkovic7045
5 days ago
I looked at many blogs and explanations but none of them got to the practical usage of LoRA and showed exactly how it is used in practice. This is exactly what I was looking for.
@hubertkanyamahanga2782
5 days ago
Hi Umar, thanks for this amazing code explanation. Just one question, how is the prediction_iou computed in the Automatic Mask generation of SAM? I am asking because we only have the model's prediction and to compute iou you need ground truth labels. Thanks!
@mahsakhalili5042
6 days ago
Appreciate it! it helped me a lot
@elieelezra2734
6 days ago
Can't thank you enough : your vids + chatGPT = Best Teacher Ever. I have one question though : it might be silly but I want to be sure of it : does it mean that to get the rewards for all time steps, we need to run the reward model on all truncated responses on the right, so that each response token would be at some point the last token? Am I clear?
@umarjamilai
6 days ago
No, because of how transformer models work, you only need one forward pass over the whole sequence to get the rewards for all positions. This is also how you train a transformer: with only one pass, you can calculate the hidden state for all the positions and calculate the loss for all positions.
@andreanegreanu8750
5 days ago
@umarjamilai thanks a lot for all your time. I won't bother you till the next time, I promise, ahahaha
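Umar's reply in this thread, that one forward pass yields a value for every position, can be checked on a toy example. The sketch below assumes a single causal self-attention layer as a stand-in for a full transformer, with random weights and a hypothetical linear head `w_reward`; it uses NumPy rather than PyTorch and is not the code from the video.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))  # hypothetical token embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
w_reward = rng.normal(size=d)      # hypothetical linear head on the hidden states


def causal_layer(tokens):
    """One causal self-attention layer: position t only sees positions <= t."""
    n = tokens.shape[0]
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)  # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# One pass over the full sequence gives one value per position.
rewards_one_pass = causal_layer(x) @ w_reward

# The slow alternative from the question: re-run on every truncated prefix
# and keep only the value at the last position of each prefix.
rewards_truncated = np.array(
    [(causal_layer(x[: t + 1]) @ w_reward)[-1] for t in range(seq_len)]
)

print(np.allclose(rewards_one_pass, rewards_truncated))
```

The two results agree because the causal mask prevents any position from depending on later tokens, so truncating the sequence does not change the hidden states of the positions that remain.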
@AdmMusicc
6 days ago
You're on a mission to make the best and friendliest content to consume deep learning algorithms and I am all in for it.
@adscript4713
6 days ago
Can someone please clarify this from the video? Both single-head attention (20:12) and multi-head attention (28:30) are shown.
1. In the Attention paper, only the multi-head attention is calculated, right?
2. How is the multi-head attention calculated given the Attention(Q, K, V) formula? Would it be:
A) Take the three copies of the original embeddings of size (sequence, d_model) that have been encoded with the positional information.
B) Multiply each by the weights (of size (d_model, d_model)) obtained during training.
C) Arbitrarily divide each resulting Q, K and V of size (sequence, d_model) by the number of heads (8 in the paper) to get 8 heads, each of size (sequence, d_v) with d_v = 64, meaning we now have 64 dimensions instead of 512 for each word vector, across 8 heads each capturing different syntactic information.
D) Apply the Attention(Q, K, V) formula using one of the eight heads each for Q, K, V of size (sequence, d_v) to get a total of 8 attention heads, each of size (sequence, 64).
E) Concatenate (i.e. literally just put each 64 dimensions of each vector next to each other) the 8 heads to bring us back to one matrix of size (sequence, 8 * 64 = 512).
F) Finally, multiply the concatenated result by W_o of dimension (h * d_v, d_model), resulting in a matrix of size (sequence, d_model), i.e. 512?
Is this correct?
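Steps A to F in the comment above can be traced as shapes in a short sketch. It assumes random matrices as stand-ins for the learned weights, no masking, no dropout and no batch dimension, and it uses NumPy rather than PyTorch; the variable names are illustrative and not taken from the video's code.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, h = 6, 512, 8
d_k = d_model // h  # 64 dimensions per head

# A) Embeddings with positional information already added (random stand-in).
x = rng.normal(size=(seq_len, d_model))

# B) Projection weights; random stand-ins for the trained parameters.
W_q, W_k, W_v, W_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def split_heads(m):
    # C) (seq_len, d_model) -> (h, seq_len, d_k): split along the embedding axis.
    return m.reshape(seq_len, h, d_k).transpose(1, 0, 2)


Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

# D) Scaled dot-product attention, computed independently for each head.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, seq_len, seq_len)
weights = softmax(scores)
heads = weights @ V                               # (h, seq_len, d_k)

# E) Concatenate the heads back into (seq_len, h * d_k) = (seq_len, 512).
concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)

# F) Final output projection.
out = concat @ W_o                                # (seq_len, d_model)
print(heads.shape, concat.shape, out.shape)
```

Splitting one (d_model, d_model) projection into 8 slices of 64 columns is equivalent to using 8 separate per-head projection matrices of size (d_model, 64), which is how the paper writes it.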
@aryansakhala3930
6 days ago
Some things were not clear. What is d_model for getting the embeddings? How were the Q, K, and V matrices calculated? (I only got a sense of the dimensions of these matrices.) Great explanation of dimension compatibility, but it is still not clear how they were computed in the first place.
@nwanted
7 days ago
Thanks so much Umar, always learn a lot from your video!
@snorkovenko
7 days ago
Great video - nicely structured and very clearly explained. I want to point out one mistake though: during the "Layer Normalization" step, the normalization formula you've shown would yield values with mean = 0 and variance = 1, not values in the range [0,1].
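The point made in the comment above can be checked numerically. This is a minimal sketch of layer normalization over one feature vector, assuming the standard formula (x - mean) / sqrt(var + eps) and leaving out the learnable gain and bias; the input values are arbitrary.

```python
import statistics


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (almost) unit variance."""
    mean = statistics.fmean(x)
    var = statistics.pvariance(x)  # population variance over the features
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


features = [2.0, 4.0, 6.0, 20.0]  # arbitrary example values
normed = layer_norm(features)

print(normed)
print(statistics.fmean(normed), statistics.pvariance(normed))
```

The normalized vector contains negative values and values above 1, so the output is centered and rescaled rather than squeezed into [0, 1].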
@fatemeshams9758
7 days ago
awesome👍
@lokeshreddypolu250
7 days ago
Thanks for the video. Do you know of any way to create a dataset for DPO training? I currently have only question-answer pairs. Is it fine if I take y_w as the answer and y_l as some random text (which would obviously have lower preference than the answer) and then train on it?
@lokeshreddypolu250
7 days ago
The potential problem that I think could happen is that having random text may decrease the loss and the policy may not even change much
@orevjoker8332
7 days ago
I hardly ever comment on youtube videos, but wow this was a very well done video!
@swaraj-42
7 days ago
Thank you so much for this !! you made it so effortless hats off !!😄
@sanskargupta7085
7 days ago
I feel lucky enough to have come across this channel, amazing stuff!
@xugefu
7 days ago
Thanks!
@s8x.
8 days ago
50:27 why is it the hidden state for the answer tokens but earlier it was just for the last hidden state?
@xugefu
8 days ago
Thanks!
@andreanegreanu8750
8 days ago
There is something that I found very confusing. It seems that the value function shares the same theta parameters as the LLM. That is very unexpected. Can you confirm this, please? Thanks in advance