Add support for WebDataset #283

Open
jbilcke opened this issue Mar 3, 2025 · 1 comment


jbilcke commented Mar 3, 2025

Feature request

I propose that Finetrainers support WebDataset as a dataset format.

Motivation

While working on VMS (a UI wrapper around Finetrainers), I realized that I had ended up using a format similar to WebDataset, except that I upload multiple .zip files containing .mp4/.txt pairs instead of .tar shards.
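To make the difference concrete, here is a minimal sketch of repacking one such .zip into a single WebDataset-style .tar shard, using only the Python standard library. The function name `zip_to_wds_shard` is hypothetical and is not part of VMS or Finetrainers. It assumes each file name contains a single dot, so that `clip.mp4` and `clip.txt` share the key `clip`, and it skips any file without a matching partner.

```python
import io
import tarfile
import zipfile
from pathlib import PurePosixPath


def zip_to_wds_shard(zip_bytes: bytes) -> bytes:
    """Repack .mp4/.txt pairs from a zip archive into one tar shard (hypothetical helper)."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf, tarfile.open(fileobj=out, mode="w") as tf:
        # Group archive members by path without extension, e.g. "clips/a".
        samples = {}
        for name in zf.namelist():
            path = PurePosixPath(name)
            ext = path.suffix.lower()
            if ext in (".mp4", ".txt"):
                samples.setdefault(str(path.with_suffix("")), {})[ext] = name
        for key in sorted(samples):
            files = samples[key]
            if set(files) != {".mp4", ".txt"}:
                continue  # skip videos without captions and captions without videos
            # Write both files of a sample next to each other, since
            # WebDataset-style readers group adjacent members by key.
            for ext in (".mp4", ".txt"):
                data = zf.read(files[ext])
                info = tarfile.TarInfo(name=f"{key}{ext}")
                info.size = len(data)
                tf.addfile(info, io.BytesIO(data))
    return out.getvalue()
```

The returned bytes can be written to a file such as `shard-000000.tar`; that naming pattern is only an example, not something Finetrainers requires.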

Unrelated to my project, I have also noticed some interest in using WebDataset in Finetrainers.

Your contribution

I've started refactoring my project to support WebDataset.

@a-r-r-o-w
Owner

@jbilcke Thanks for the recommendation! I recently rewrote most of the codebase to allow for this. This file lists all the supported dataset formats:

class VideoWebDataset(torch.utils.data.IterableDataset, torch.distributed.checkpoint.stateful.Stateful):

I haven't yet tested it in a large-scale run with a big WebDataset, but I did verify that it works in a smaller setting, so please let me know if it works when you give it a try.

I have a simple test dataset which I use for verifying loading in the fast-tests. You could try with it for a quick look:

class VideoWebDatasetFastTests(unittest.TestCase):
    def setUp(self):
        self.num_data_files = 15
        self.dataset = VideoWebDataset("finetrainers/dummy-squish-wds", infinite=False)
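Before pointing `VideoWebDataset` at your own shards, it can help to check that a shard is laid out as expected. The following is a rough, standard-library-only sketch that is independent of Finetrainers. The helper name `iter_wds_samples` is hypothetical. It relies on the usual WebDataset convention that consecutive tar members sharing the same name up to the first dot of the base name form one sample.

```python
import io
import tarfile


def iter_wds_samples(shard_bytes: bytes):
    """Yield (key, {extension: bytes}) per sample in a tar shard (hypothetical helper)."""
    key, sample = None, {}
    with tarfile.open(fileobj=io.BytesIO(shard_bytes)) as tf:
        for member in tf:
            if not member.isfile():
                continue
            # Split at the first dot of the base name: "dir/clip.mp4" -> ("dir/clip", "mp4").
            directory, _, base = member.name.rpartition("/")
            stem, _, ext = base.partition(".")
            member_key = f"{directory}/{stem}" if directory else stem
            # A change of key means the previous sample is complete.
            if key is not None and member_key != key:
                yield key, sample
                sample = {}
            key = member_key
            sample[ext] = tf.extractfile(member).read()
    if key is not None:
        yield key, sample
```

Each yielded sample for a video dataset should then contain both an `mp4` and a `txt` entry; a sample missing one of them usually means the pair was not stored adjacently in the shard.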
