Add support for WebDataset #4

jbilcke-hf · 2025-03-03T14:33:17Z

Context

When working with hundreds of videos in VMS, we often have to resort to uploading multiple .zip files (e.g. 1 GB each, to avoid mega-files).

This practice of having multiple archives containing .mp4 videos + .txt captions is nearly identical to the WebDataset file format, which is designed for large AI/ML training datasets.
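To make the similarity concrete, here is a minimal sketch of packing video/caption pairs into a WebDataset-style shard. It relies only on the fact that a WebDataset shard is a plain tar archive in which files sharing a basename form one sample. The function names `write_shard` and `list_members` and the sample keys are hypothetical, chosen for illustration; this is not the code used in VMS.

```python
import io
import tarfile


def write_shard(shard_path, samples):
    """Write (key, video_bytes, caption) triples into one tar shard.

    Files that share a basename (the key) are treated as one sample,
    e.g. clip_0001.mp4 and clip_0001.txt belong together.
    """
    with tarfile.open(shard_path, "w") as tar:
        for key, video_bytes, caption in samples:
            fields = (("mp4", video_bytes), ("txt", caption.encode("utf-8")))
            for ext, payload in fields:
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def list_members(shard_path):
    """Return the member names of a shard, in archive order."""
    with tarfile.open(shard_path, "r") as tar:
        return tar.getnames()
```

Keeping each video next to its caption in the archive is what lets a reader stream a shard sequentially and yield complete samples without an index.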

Proposal

1. Add basic support for uploading/importing WebDataset
2. Implement end-to-end support for WebDataset (see branch webdataset)
3. Propose WebDataset support to Finetrainers

For point 2, end-to-end support means performing all our processing and transformations (black band removal, captioning, etc.) directly inside the WebDataset archives, instead of on the OS file system.
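As a rough illustration of what processing "inside" the archives could look like, the sketch below streams one shard into another and rewrites each caption in memory, without extracting anything to disk. It uses only the standard `tarfile` module; `transform_captions` and its `transform` callback are hypothetical names, and this is not the implementation on the webdataset branch. A video step such as black band removal would slot in the same way, as a function applied to the .mp4 payload.

```python
import io
import tarfile


def transform_captions(src_path, dst_path, transform):
    """Copy a shard, applying `transform` to every .txt caption.

    Each member is read fully into memory, so this assumes individual
    files fit in RAM; nothing is written to the file system except
    the destination shard itself.
    """
    with tarfile.open(src_path, "r") as src, tarfile.open(dst_path, "w") as dst:
        for member in src:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            payload = src.extractfile(member).read()
            if member.name.endswith(".txt"):
                text = transform(payload.decode("utf-8"))
                payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=member.name)
            info.size = len(payload)
            dst.addfile(info, io.BytesIO(payload))
```

For example, `transform_captions("in.tar", "out.tar", str.strip)` would produce a new shard with whitespace-trimmed captions and the videos copied through unchanged.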

While using WebDataset internally doesn't automatically allow training on datasets larger than what Finetrainers can support, the idea is more about a long-term vision: keeping VMS architecturally independent and adopting a future-proof design.

The vision for VMS is to be a standalone app that can be used for annotation only, and to potentially support alternative training backends (Job API, Replicate, Fal, diffusion-pipe, etc.).

jbilcke-hf added the feature request New feature or request label Mar 3, 2025

jbilcke-hf self-assigned this Mar 3, 2025

jbilcke mentioned this issue Mar 3, 2025

Add support for WebDataset a-r-r-o-w/finetrainers#283

Open

jbilcke-hf added the update available Feature or fix is pushed but needs testing label Mar 5, 2025
