This repo contains a visual question answering (VQA) model for GIFs, trained on the Tumblr GIF (TGIF) dataset. The model embeds the GIF's frames and the text question into a shared latent space, then fuses the two modalities with cross-attention and concatenated embeddings. This gives the answer decoder context from the image and enables open-vocabulary answering. The approach is derived from *Language Grounded QFormer for Efficient Vision Language Understanding*.
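To make the fusion step concrete, here is a minimal PyTorch sketch of how cross-attention plus concatenation can combine question and frame embeddings. All names (`GifQuestionFusion`), dimensions, and layer choices are illustrative assumptions, not this repo's actual API; see the linked doc below for the real specifics.

```python
import torch
import torch.nn as nn

class GifQuestionFusion(nn.Module):
    """Hypothetical sketch: question tokens attend over GIF frame
    embeddings via cross-attention, and the attended visual context is
    concatenated with the question embedding before decoding."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cross-attention: question tokens are queries, frame tokens are keys/values
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Project the concatenated [question ; visual context] back to model dim
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, question_emb: torch.Tensor, frame_emb: torch.Tensor) -> torch.Tensor:
        # question_emb: (batch, num_tokens, dim) text embeddings of the question
        # frame_emb:    (batch, num_frames, dim) per-frame embeddings of the GIF
        attended, _ = self.cross_attn(query=question_emb, key=frame_emb, value=frame_emb)
        attended = self.norm(attended + question_emb)  # residual over question tokens
        # Concatenate text and visual context per token, then project
        fused = self.fuse(torch.cat([question_emb, attended], dim=-1))
        return fused  # (batch, num_tokens, dim), fed to an open-vocabulary answer decoder

if __name__ == "__main__":
    fusion = GifQuestionFusion()
    q = torch.randn(2, 16, 768)   # 16 question tokens
    f = torch.randn(2, 32, 768)   # 32 GIF frames
    print(fusion(q, f).shape)     # torch.Size([2, 16, 768])
```

Feeding the fused token sequence to a language-model head (rather than classifying over a fixed answer set) is what allows open-vocabulary answers.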
Please refer to the following doc explaining the specifics of our approach: Doc
Checkpoint for our simplified approach: .pth file
You can find the files here: Drive