This repo contains a visual question answering (VQA) model for GIFs, trained on the Tumblr GIF (TGIF) dataset. The model embeds the GIF's frames and the text question into a shared latent space, then fuses the two modalities with cross-attention and concatenated embeddings. This gives the answer decoder context from the image and enables open-vocabulary answering. The approach is derived from *Language Grounded QFormer for Efficient Vision Language Understanding*.
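To make the fusion step concrete, here is a minimal PyTorch sketch of how cross-attention plus concatenation can combine question and frame embeddings. All names (`GifQuestionFusion`), dimensions, and layer choices are illustrative assumptions, not this repo's actual API; see the linked doc below for the real specifics.

```python
import torch
import torch.nn as nn

class GifQuestionFusion(nn.Module):
    """Hypothetical sketch: question tokens attend over GIF frame
    embeddings via cross-attention, and the attended visual context is
    concatenated with the question embedding before decoding."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cross-attention: question tokens are queries, frame tokens are keys/values
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Project the concatenated [question ; visual context] back to model dim
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, question_emb: torch.Tensor, frame_emb: torch.Tensor) -> torch.Tensor:
        # question_emb: (batch, num_tokens, dim) text embeddings of the question
        # frame_emb:    (batch, num_frames, dim) per-frame embeddings of the GIF
        attended, _ = self.cross_attn(query=question_emb, key=frame_emb, value=frame_emb)
        attended = self.norm(attended + question_emb)  # residual over question tokens
        # Concatenate text and visual context per token, then project
        fused = self.fuse(torch.cat([question_emb, attended], dim=-1))
        return fused  # (batch, num_tokens, dim), fed to an open-vocabulary answer decoder

if __name__ == "__main__":
    fusion = GifQuestionFusion()
    q = torch.randn(2, 16, 768)   # 16 question tokens
    f = torch.randn(2, 32, 768)   # 32 GIF frames
    print(fusion(q, f).shape)     # torch.Size([2, 16, 768])
```

Feeding the fused token sequence to a language-model head (rather than classifying over a fixed answer set) is what allows open-vocabulary answers.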
Please refer to the following doc explaining the specifics of our approach: Doc
Checkpoint for our simplified approach: .pth file
You can find the files here: Drive