The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. BLIP-2 is a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer (Q-Former), which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, BLIP-2 outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. BLIP-2 also exhibits emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
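To make the bridging idea concrete, below is a minimal PyTorch sketch of a single Q-Former-style block: a small set of learnable queries cross-attends to frozen image features, and the query outputs are projected into the LLM's embedding space as soft visual prompts. This is an illustrative simplification, not the released implementation (the real Q-Former is a multi-layer BERT-style transformer with both self-attention and cross-attention). The dimensions (32 queries of width 768, a ViT-g-scale vision width of 1408, an OPT-2.7B-scale LLM width of 2560) follow the paper; everything else, including the class name, is an assumption for illustration.

```python
import torch
import torch.nn as nn

class MiniQFormerBlock(nn.Module):
    """Hypothetical one-layer sketch of the BLIP-2 bridging idea:
    learnable queries -> cross-attention over frozen image features
    -> projection into the frozen LLM's input embedding space."""

    def __init__(self, num_queries=32, q_dim=768, vision_dim=1408, llm_dim=2560):
        super().__init__()
        # Learnable query embeddings (32 queries of width 768 in the paper).
        self.queries = nn.Parameter(torch.randn(num_queries, q_dim) * 0.02)
        # Cross-attention: queries attend to frozen image-encoder features.
        self.cross_attn = nn.MultiheadAttention(
            q_dim, num_heads=12, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(q_dim, 4 * q_dim), nn.GELU(), nn.Linear(4 * q_dim, q_dim)
        )
        self.norm1 = nn.LayerNorm(q_dim)
        self.norm2 = nn.LayerNorm(q_dim)
        # Linear projection into the frozen LLM's embedding space (stage 2).
        self.to_llm = nn.Linear(q_dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim) from a frozen ViT.
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, image_feats, image_feats)
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        # Soft visual prompts to prepend to the LLM's text embeddings.
        return self.to_llm(q)

# Toy usage: one image, 257 ViT tokens of width 1408.
feats = torch.randn(1, 257, 1408)
prompts = MiniQFormerBlock()(feats)
print(prompts.shape)  # torch.Size([1, 32, 2560])
```

Because only a module like this (queries, transformer weights, projection) is trained while both the image encoder and the LLM stay frozen, the trainable-parameter count stays small, which is where the 54x gap versus Flamingo80B comes from.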
In this video, I will talk about the following: What can the BLIP-2 model do? How is the BLIP-2 model pre-trained? How does the BLIP-2 model perform?
For more details, please look at arxiv.org/pdf/2301.12597.pdf and github.com/salesforce/LAVIS/t...
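For a quick look at the instructed zero-shot image-to-text generation mentioned above, here is a short usage sketch against the LAVIS repo linked above. The model name, checkpoint type, and image path are assumptions based on the LAVIS model zoo at the time of the paper's release and may have changed since:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# "blip2_opt" / "pretrain_opt2.7b" are taken from the LAVIS model zoo;
# check the repo's README if these identifiers have changed.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Instructed zero-shot image-to-text generation: prepend a natural-language prompt.
print(model.generate({"image": image,
                      "prompt": "Question: what is shown in the image? Answer:"}))
# Plain captioning: omit the prompt.
print(model.generate({"image": image}))
```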
Li, Junnan, Dongxu Li, Silvio Savarese, and Steven Hoi. "Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." arXiv preprint arXiv:2301.12597 (2023).