Distribute parquet population over all users even if num_files < num_users #101

Closed
daverigby opened this issue Jun 11, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@daverigby
Collaborator

Currently the populate logic (Dataset.get_batch_iterator()) distributes the dataset at file granularity: for a parquet dataset of N files and U users, it splits the files into U roughly equal subsets.

This is fine if N >= U, but if there are many fewer files than users, some users will have no work to do. In the extreme case where there is only one file (e.g. mnist, yfcc), there is no concurrency at all in the populate phase.

Improve this situation by distributing the data over all users.
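
One possible way to do this is to partition by row rather than by file. The sketch below is illustrative only and is not the change that was eventually merged: `partition_rows` is a hypothetical helper, and it assumes the row count of each parquet file is known up front (for example, from the file metadata). It gives each user a contiguous range of global row indexes and maps that range back to per-file slices.

```python
def partition_rows(file_row_counts, num_users):
    """Hypothetical helper: split rows (not files) roughly evenly over users.

    file_row_counts: number of rows in each file, in dataset order.
    Returns one list per user of (file_index, start_row, stop_row) slices,
    where stop_row is exclusive and rows are numbered within the file.
    """
    total = sum(file_row_counts)
    # User u owns the global row range [bounds[u], bounds[u + 1]).
    bounds = [total * u // num_users for u in range(num_users + 1)]
    assignments = [[] for _ in range(num_users)]

    file_start = 0
    for file_index, count in enumerate(file_row_counts):
        file_stop = file_start + count
        for user in range(num_users):
            # Overlap between this user's global range and this file's range.
            lo = max(bounds[user], file_start)
            hi = min(bounds[user + 1], file_stop)
            if lo < hi:
                assignments[user].append(
                    (file_index, lo - file_start, hi - file_start)
                )
        file_start = file_stop
    return assignments


# A single 10-row file shared between 4 users: every user gets some rows.
single_file = partition_rows([10], 4)
# Two 3-row files shared between 3 users: the middle user spans both files.
two_files = partition_rows([3, 3], 3)
```

With this scheme, the single-file case (N = 1) still gives every user a slice as long as the file has at least U rows, at the cost of each user having to read only part of a file.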

See also #46.

@daverigby daverigby added the enhancement New feature or request label Jun 11, 2024
@jonathanzxu
Contributor

closed in #191
