INTRODUCTION:

This repo contains a visual question answering (VQA) model for GIFs, trained on TumblrGIFs. The model embeds the GIF frames and the text question as vectors, then combines them in the latent space using cross-attention and concatenated embeddings. This lets it gather context from the frames and give open-vocabulary answers. The approach is derived from Language Grounded QFormer for Efficient Vision Language Understanding.
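As a rough illustration of the fusion idea described above, the following dependency-free sketch lets question-token embeddings attend over frame embeddings and then concatenates the pooled question vector with the pooled visual context. It is not the code in this repo: the function names (`cross_attention`, `fuse`), the choice of question tokens as queries, the mean pooling, the absence of learned projections, and the random inputs are all assumptions made for the example. A real model would use a tensor library and trained weights.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    attended, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        weights.append(w)
        # Weighted sum of the value vectors, one output vector per query.
        attended.append([sum(wi * v[d] for wi, v in zip(w, values))
                         for d in range(len(values[0]))])
    return attended, weights


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fuse(question_tokens, frame_embeddings):
    """Hypothetical fusion: question tokens query the frames, then the
    pooled question and pooled visual context are concatenated."""
    attended, weights = cross_attention(
        question_tokens, frame_embeddings, frame_embeddings)
    fused = mean_pool(question_tokens) + mean_pool(attended)
    return fused, weights


# Toy inputs (assumed shapes): 4 frames and 5 question tokens, 8 dims each.
rng = random.Random(0)
DIM = 8
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]
question = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

fused, weights = fuse(question, frames)
print(len(fused))  # 16: question summary (8) + visual context (8)
```

In a setup like this, the fused vector would be passed to an answer decoder. Each row of `weights` shows how strongly one question token attends to each frame.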

Please refer to the following doc, which explains the specifics of our approach: Doc

Checkpoints:

Checkpoint for our simplified approach: .pth file

Cleaned subset of the data used for training our models:

You can find the files here: Drive