INTRODUCTION:

This repo contains a visual question answering (VQA) model for GIFs, trained on TumblrGIFs. The model combines the vector embeddings of the GIF frames and of the text questions in the latent space. It uses cross-attention and concatenated embeddings to gather context from the image and to enable open-vocabulary answering. The approach is derived from Language Grounded QFormer for Efficient Vision Language Understanding.
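The fusion step described above can be illustrated with a minimal NumPy sketch. It is not the repo's implementation: the function names (`cross_attention`, `fuse`), the dimensions, the random weights, and the choice to let question tokens act as queries over frame embeddings are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(question_tokens, frame_embeddings, w_q, w_k, w_v):
    """Question tokens (queries) attend over GIF frame embeddings (keys/values).

    Assumed direction of attention; the source does not specify it.
    """
    q = question_tokens @ w_q            # (num_tokens, dim)
    k = frame_embeddings @ w_k           # (num_frames, dim)
    v = frame_embeddings @ w_v           # (num_frames, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_tokens, num_frames)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


def fuse(question_tokens, frame_embeddings, w_q, w_k, w_v):
    """Concatenate each question token with its attended visual context."""
    attended, weights = cross_attention(
        question_tokens, frame_embeddings, w_q, w_k, w_v
    )
    fused = np.concatenate([question_tokens, attended], axis=-1)
    return fused, weights


# Hypothetical sizes and random inputs standing in for real encoder outputs.
rng = np.random.default_rng(0)
num_frames, num_tokens, dim = 8, 5, 16
frames = rng.normal(size=(num_frames, dim))
question = rng.normal(size=(num_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

fused, weights = fuse(question, frames, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch the fused representation has twice the embedding width, since each question token is concatenated with its attended visual context. A downstream text decoder, not shown here, would consume it to produce open-vocabulary answers.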

Please refer to the following doc, which explains the specifics of our approach: Doc

Checkpoints:

Checkpoint for our simplified approach - .pth file

Cleaned subset of the data used for training our models:

You can find the files here - Drive

manas1245agrawal/video_question_answering

Folders and files

Latest commit

History

Repository files navigation

INTRODUCTION:

Please refer to the following doc, explaining the specifics of our approach : Doc

Checkpoints :

Cleaned Subset of the data used for training our models :

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages