Currently the populate logic (Dataset.get_batch_iterator()) distributes the dataset at file granularity - for a parquet dataset of N files and U users, it splits the files into U roughly equal subsets.
This is fine if N >= U, but if there are many fewer files than users, then some users will have no work to do. In the extreme case where there is only one file (e.g. mnist, yfcc), we get no concurrency at all during the populate phase.
Improve this situation by distributing the data over all users.
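One possible approach, sketched below under the assumption that row counts are known up front: partition the total row range of the dataset (rather than its files) into U contiguous, roughly equal slices, so every user gets a non-empty slice even when there is only one file. The `assign_ranges` helper is hypothetical, not part of the current `Dataset` API.

```python
def assign_ranges(total_rows: int, num_users: int) -> list[tuple[int, int]]:
    """Split [0, total_rows) into num_users contiguous, roughly equal
    half-open ranges. The first (total_rows % num_users) users each get
    one extra row, so sizes differ by at most one."""
    base, extra = divmod(total_rows, num_users)
    ranges = []
    start = 0
    for u in range(num_users):
        size = base + (1 if u < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# Example: a single 10-row file shared by 4 users - every user gets work.
print(assign_ranges(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each user could then read only its assigned row range (e.g. by skipping rows, or mapping the range onto parquet row groups) instead of claiming whole files.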
See also #46.