
make build_batch_data_loader work better when dataset size is not multiple of batch size and num_workers #5035

Closed
wants to merge 1 commit

Conversation

wat3rBro
Contributor

Summary:
Previously, ToIterableDataset sharded the dataset for each worker in a round-robin fashion without considering the batch size. Combined with drop_last=True, this can cause more than one iteration to be dropped, i.e. the number of iterations is less than len(data_loader). Say the dataset size is 46 and the batch size is 8; with 3 DataLoader workers, the dataset would be sharded into:

  • worker 0: [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45]
  • worker 1: [1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43]
  • worker 2: [2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44]

Since batching happens per worker, the loaded batches would be: [0, 3, 6, 9, 12, 15, 18, 21], [1, 4, 7, 10, 13, 16, 19, 22], [2, 5, 8, 11, 14, 17, 20, 23], [24, 27, 30, 33, 36, 39, 42, 45]. This has a few issues (see the sketch after this list):

  • the data is not loaded in sequence
  • it potentially wastes data, e.g. there are 46 images here, enough for 5 full batches, yet only 4 are produced
  • len(dl) returns 5, but only 4 iterations actually run; len can be inaccurate for an iterable dataset, but this still causes confusion
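As a rough sketch of the old behavior under the same assumptions (46 indices, batch size 8, 3 workers), the snippet below mimics per-index round-robin sharding followed by per-worker batching with `drop_last=True`. The helpers `shard_round_robin` and `to_batches` are hypothetical and only illustrate the effect; they are not the actual detectron2 code.

```python
# Sketch of the OLD behavior: per-index round-robin sharding, then per-worker
# batching with drop_last=True. Helper names are hypothetical, for illustration only.
def shard_round_robin(indices, worker_id, num_workers):
    # worker k receives indices k, k + num_workers, k + 2 * num_workers, ...
    return indices[worker_id::num_workers]

def to_batches(indices, batch_size, drop_last=True):
    out = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
    if drop_last and out and len(out[-1]) < batch_size:
        out.pop()  # each worker drops its own incomplete tail batch
    return out

dataset = list(range(46))
batch_size, num_workers = 8, 3

per_worker = [to_batches(shard_round_robin(dataset, w, num_workers), batch_size)
              for w in range(num_workers)]

# The DataLoader drains workers round-robin, one batch at a time.
loaded = []
for i in range(max(len(batches) for batches in per_worker)):
    for batches in per_worker:
        if i < len(batches):
            loaded.append(batches[i])

print(len(loaded))  # 4 batches, even though len(data_loader) reports 46 // 8 == 5
```

In this setup workers 1 and 2 each drop an incomplete tail of 7 samples, which is where the "missing" fifth batch goes.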

This diff changes the sharding pattern so that, in the same case, the workers get:

  • worker 0: [0, 1, 2, 3, 4, 5, 6, 7, 24, 25, 26, 27, 28, 29, 30, 31]
  • worker 1: [8, 9, 10, 11, 12, 13, 14, 15, 32, 33, 34, 35, 36, 37, 38, 39]
  • worker 2: [16, 17, 18, 19, 20, 21, 22, 23, 40, 41, 42, 43, 44, 45]

This solves the issues above; the loaded data now becomes: [0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15], [16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31], [32, 33, 34, 35, 36, 37, 38, 39]
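In other words, indices are grouped into contiguous chunks of `batch_size`, and the chunks, rather than the individual indices, are distributed round-robin across workers. A minimal sketch of that idea follows, assuming the same sizes as above; `shard_chunked` is a hypothetical helper, not the actual detectron2 implementation.

```python
# Sketch of the NEW behavior: contiguous chunks of batch_size are assigned to
# workers round-robin. `shard_chunked` is hypothetical, for illustration only.
def shard_chunked(indices, worker_id, num_workers, chunk_size):
    # position p belongs to chunk p // chunk_size; chunk c goes to worker c % num_workers
    return [x for p, x in enumerate(indices)
            if (p // chunk_size) % num_workers == worker_id]

dataset = list(range(46))
batch_size, num_workers = 8, 3

for w in range(num_workers):
    print(w, shard_chunked(dataset, w, num_workers, chunk_size=batch_size))
# 0 -> [0..7, 24..31], 1 -> [8..15, 32..39], 2 -> [16..23, 40..45]
```

With the chunk size equal to the batch size, every chunk either forms a full batch or is the single incomplete tail of the dataset, so at most one batch is lost to drop_last=True and the batches come out in dataset order.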

Differential Revision: D47529917

@facebook-github-bot added the CLA Signed and fb-exported labels on Jul 17, 2023
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D47529917

@wat3rBro wat3rBro requested a review from ppwwyyxx July 17, 2023 23:48
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D47529917

wat3rBro pushed a commit to wat3rBro/detectron2-1 that referenced this pull request Jul 24, 2023
…tiple of batch size and num_workers

Summary:
Pull Request resolved: facebookresearch#5035

Previously `ToIterableDataset` sharded the dataset for each worker in a round-robin fashion without considering the batch size. Combined with `drop_last=True`, this can cause more than one iteration to be dropped, i.e. the number of iterations is less than `len(data_loader)`. Say the dataset size is 46 and the batch size is 8; with 3 DataLoader workers, the dataset would be sharded into:
- worker 0: [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45]
- worker 1: [1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43]
- worker 2: [2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44]

Since batching happens per worker, the loaded batches would be: [0, 3, 6, 9, 12, 15, 18, 21], [1, 4, 7, 10, 13, 16, 19, 22], [2, 5, 8, 11, 14, 17, 20, 23], [24, 27, 30, 33, 36, 39, 42, 45]. This has a few issues:
- the data is not loaded in sequence
- it potentially wastes data, e.g. there are 46 images here, enough for 5 full batches, yet only 4 are produced
- `len(dl)` returns 5, but only 4 iterations actually run; `len` can be inaccurate for an iterable dataset, but this still causes confusion

This diff changes the sharding pattern so that, in the same case, the workers get:
- worker 0: [0, 1, 2, 3, 4, 5, 6, 7, 24, 25, 26, 27, 28, 29, 30, 31]
- worker 1: [8, 9, 10, 11, 12, 13, 14, 15, 32, 33, 34, 35, 36, 37, 38, 39]
- worker 2: [16, 17, 18, 19, 20, 21, 22, 23, 40, 41, 42, 43, 44, 45]

This solves the issues above; the loaded data now becomes: [0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15], [16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31], [32, 33, 34, 35, 36, 37, 38, 39]

Reviewed By: zechenghe

Differential Revision: D47529917

fbshipit-source-id: 83b1843549b68f904b79435ea32a4e21a0cd1ae0
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D47529917

@facebook-github-bot
Contributor

This pull request has been merged in 57bdb21.
