To be honest, this is the only good deep learning YouTuber
@anmolagrawal386
A day ago
This is huge! I've been watching your channel for the past few days, and needed this, since I've been working on a DiT based model recently
@Explaining-AI
A day ago
That's perfect timing :) Hope the video ends up being of help to you. Please feel free to let me know if you end up having any questions on it.
@computer_vision
A day ago
Love from Bangladesh. Nice content to watch.
@Explaining-AI
A day ago
Thank You :)
@signitureDGK
Күн бұрын
Can this be applied to a ViViT model? What changes would need to be made? OpenAI's Sora is based on this architecture (ViViT). I think they use 3D positional encodings and some form of multi-head cross attention for the conditioning, in addition to the scale and shift conditioning?
@Explaining-AI
14 hours ago
Yes, I would assume just modulating the layernorm in both the spatial and temporal encoder blocks of ViViT should give us the desired ViViT block variant. But for video generation (and applying DiT ideas to video), I would suggest also taking a look at the Latte paper & code. They use the same adaptive norm block variant as in DiT, together with the same timestep + video class conditioning (and also experiment with ViViT's tubelet embedding). They also have a t2v variant. Obviously not as good as Sora, but a good starting point for video generation. Btw, for Sora specifically, do check this out - arxiv.org/pdf/2402.17177v2
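@Explaining-AI
14 hours ago
For anyone curious what "modulating the layernorm" means concretely, here is a rough PyTorch sketch of the adaLN-Zero-style conditioning used in DiT blocks. The module name and exact shapes are just my illustration, not code from any particular repo: a conditioning embedding (timestep + class) regresses per-channel shift, scale, and gate values, which modulate a parameter-free layernorm.

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Hypothetical sketch of DiT-style adaptive layernorm (adaLN-Zero).

    A conditioning vector (e.g. timestep + class embedding) regresses
    shift/scale/gate values that modulate the tokens inside a
    transformer block, instead of learned layernorm affine params.
    """
    def __init__(self, dim):
        super().__init__()
        # layernorm without its own learnable scale/shift
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # regress shift, scale, gate from the conditioning embedding
        self.to_mod = nn.Linear(dim, 3 * dim)
        # adaLN-Zero: zero-init so the block starts as an identity map
        nn.init.zeros_(self.to_mod.weight)
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x, cond):
        # x: (batch, tokens, dim), cond: (batch, dim)
        shift, scale, gate = self.to_mod(cond).chunk(3, dim=-1)
        out = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        # gate is applied to the attention/MLP output in the full block
        return out, gate.unsqueeze(1)
```

For a ViViT variant, you would apply one such modulation before the spatial attention and one before the temporal attention in each block.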
Comments: 8