BLIP is a new vision-language pre-training (VLP) framework that transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes noisy web data by bootstrapping the captions: a captioner generates synthetic captions and a filter removes the noisy ones. BLIP achieves state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when transferred directly to video-language tasks in a zero-shot manner.
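As a quick taste of the image-captioning use case, here is a minimal sketch using the Hugging Face transformers port of BLIP (an assumption on my part; the notebook linked below may use the salesforce/BLIP repo directly, and the model id and image URL are illustrative):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the pretrained BLIP captioning checkpoint (example model id).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Any RGB image works; a COCO validation image is used here as an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Generate and decode a caption for the image.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```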
Github: github.com/salesforce/BLIP
Notebook Link: github.com/karndeepsingh/self...
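For the VQA use case mentioned above, a similar hedged sketch (again assuming the Hugging Face port with an example VQA checkpoint; the linked notebook may differ):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load the pretrained BLIP VQA checkpoint (example model id).
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Ask a free-form question about the image; BLIP generates the answer text.
inputs = processor(images=image, text="How many cats are in the picture?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=10)
print(processor.decode(out[0], skip_special_tokens=True))  # e.g. "2"
```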
Connect with me on :
1. LinkedIn: /karndeepsingh
2. Telegram Group: telegram.me/datascienceclubac...
3. Github: www.github.com/karndeepsingh