The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. BLIP-2 is a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer (Q-Former), which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, BLIP-2 outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. BLIP-2 also exhibits emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
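To make the bridging idea concrete, below is a minimal PyTorch sketch of a single Q-Former-style block: a small set of learnable queries cross-attends to frozen image features, and the query outputs are projected into the LLM's embedding space as soft visual prompts. This is an illustrative simplification, not the released implementation (the real Q-Former is a multi-layer BERT-style transformer with both self-attention and cross-attention). The dimensions (32 queries of width 768, a ViT-g-scale vision width of 1408, an OPT-2.7B-scale LLM width of 2560) follow the paper; everything else, including the class name, is an assumption for illustration.

```python
import torch
import torch.nn as nn

class MiniQFormerBlock(nn.Module):
    """Hypothetical one-layer sketch of the BLIP-2 bridging idea:
    learnable queries -> cross-attention over frozen image features
    -> projection into the frozen LLM's input embedding space."""

    def __init__(self, num_queries=32, q_dim=768, vision_dim=1408, llm_dim=2560):
        super().__init__()
        # Learnable query embeddings (32 queries of width 768 in the paper).
        self.queries = nn.Parameter(torch.randn(num_queries, q_dim) * 0.02)
        # Cross-attention: queries attend to frozen image-encoder features.
        self.cross_attn = nn.MultiheadAttention(
            q_dim, num_heads=12, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(q_dim, 4 * q_dim), nn.GELU(), nn.Linear(4 * q_dim, q_dim)
        )
        self.norm1 = nn.LayerNorm(q_dim)
        self.norm2 = nn.LayerNorm(q_dim)
        # Linear projection into the frozen LLM's embedding space (stage 2).
        self.to_llm = nn.Linear(q_dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim) from a frozen ViT.
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, image_feats, image_feats)
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        # Soft visual prompts to prepend to the LLM's text embeddings.
        return self.to_llm(q)

# Toy usage: one image, 257 ViT tokens of width 1408.
feats = torch.randn(1, 257, 1408)
prompts = MiniQFormerBlock()(feats)
print(prompts.shape)  # torch.Size([1, 32, 2560])
```

Because only a module like this (queries, transformer weights, projection) is trained while both the image encoder and the LLM stay frozen, the trainable-parameter count stays small, which is where the 54x gap versus Flamingo80B comes from.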
In this video, I will talk about the following: What can the BLIP-2 model do? How is the BLIP-2 model pre-trained? How does the BLIP-2 model perform?
For more details, please look at arxiv.org/pdf/2301.12597.pdf and github.com/salesforce/LAVIS/t...
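For a quick look at the instructed zero-shot image-to-text generation mentioned above, here is a short usage sketch against the LAVIS repo linked above. The model name, checkpoint type, and image path are assumptions based on the LAVIS model zoo at the time of the paper's release and may have changed since:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# "blip2_opt" / "pretrain_opt2.7b" are taken from the LAVIS model zoo;
# check the repo's README if these identifiers have changed.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Instructed zero-shot image-to-text generation: prepend a natural-language prompt.
print(model.generate({"image": image,
                      "prompt": "Question: what is shown in the image? Answer:"}))
# Plain captioning: omit the prompt.
print(model.generate({"image": image}))
```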
Li, Junnan, Dongxu Li, Silvio Savarese, and Steven Hoi. "Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." arXiv preprint arXiv:2301.12597 (2023).