Better kernel launch utilities #3914
base: main
Conversation
This pull request was exported from Phabricator. Differential Revision: D72095960
✅ Deploy Preview for pytorch-fbgemm-docs ready!
Summary:
- Add utilities for doing multiple checks prior to launching kernels

Differential Revision: D72095960
Summary:

X-link: facebookresearch/FBGEMM#1009

We have a few things to do as part of the boilerplate ritual of launching GPU kernels:

- Perform host-side precondition checks on arguments, namely tensors and tensor types
- Perform host-side precondition checks on kernel launch parameters, e.g. whether the grid size or shared memory request is too large
- Build `pta::PackedTensorAccessor` from tensors, and pass context into it
- Perform host-side post-run checks for CUDA run errors
- Pass source context information around for consistent log and error messages
- Reduce the pain of dealing with incorrect macro invocations

The `TensorAccessorBuilder`, `SourceContext`, and `KernelLauncher` classes, together with their construction macros, cover these aspects and leave room for future expansion of the boilerplate ritual as low- to zero-cost abstractions, all while improving the ergonomics of the code around their usage.

Before:

```
#ifdef FBGEMM_GPU_MEMCHECK
const auto func_name = "pruned_array_lookup_from_row_idx_kernel";
#endif

pruned_array_lookup_from_row_idx_kernel<<<
    nbit::div_round_up(num_indices, kForwardMaxThreads),
    kForwardMaxThreads,
    0,
    at::cuda::getCurrentCUDAStream()>>>(
    MAKE_PTA_WITH_NAME(func_name, update_row_indices, index_t, 1, 32),
    MAKE_PTA_WITH_NAME(func_name, update_table_indices, int32_t, 1, 32),
    MAKE_PTA_WITH_NAME(func_name, index_remappings, remap_t, 1, 32),
    MAKE_PTA_WITH_NAME(func_name, index_remappings_offsets, int64_t, 1, 32),
    MAKE_PTA_WITH_NAME(func_name, dense_indices, index_t, 1, 32));

C10_CUDA_KERNEL_LAUNCH_CHECK();
```

After:

```
FBGEMM_LAUNCH_KERNEL(
    (pruned_array_lookup_from_row_idx_kernel<index_t, remap_t>),
    nbit::div_round_up(num_indices, kForwardMaxThreads),
    kForwardMaxThreads,
    PTA_B(update_row_indices, index_t, 1, 32),
    PTA_B(update_table_indices, int32_t, 1, 32),
    PTA_B(index_remappings, remap_t, 1, 32),
    PTA_B(index_remappings_offsets, int64_t, 1, 32),
    PTA_B(dense_indices, index_t, 1, 32));
```

Differential Revision: D72095960
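For readers unfamiliar with the pattern, here is a minimal sketch of the kind of wrapper the summary describes: one entry point that runs host-side launch-parameter checks, performs the launch, and then checks for CUDA launch errors, with source context threaded through the error messages. This is not the actual FBGEMM implementation; `SourceContext`, `launch_checked`, and `LAUNCH_CHECKED` are hypothetical names, simplified stand-ins for the `KernelLauncher` / `FBGEMM_LAUNCH_KERNEL` design above.

```
// Minimal sketch of a checked kernel launcher (hypothetical names;
// not the actual FBGEMM KernelLauncher). Compile with nvcc.
#include <cuda_runtime.h>

#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Source context captured at the call site, used in log/error messages.
struct SourceContext {
  const char* file;
  int line;
  const char* kernel_name;

  std::string describe() const {
    std::ostringstream oss;
    oss << file << ":" << line << " [" << kernel_name << "]";
    return oss.str();
  }
};

template <typename Kernel, typename... Args>
void launch_checked(
    const SourceContext& ctx,
    dim3 grid,
    dim3 block,
    size_t smem,
    cudaStream_t stream,
    Kernel kernel,
    Args&&... args) {
  cudaDeviceProp prop;
  int device = 0;
  cudaGetDevice(&device);
  cudaGetDeviceProperties(&prop, device);

  // Pre-launch check: the block must not exceed the device thread limit.
  if (block.x * block.y * block.z >
      static_cast<unsigned>(prop.maxThreadsPerBlock)) {
    throw std::runtime_error(
        ctx.describe() + " block size exceeds maxThreadsPerBlock");
  }
  // Pre-launch check: requested dynamic shared memory must fit.
  if (smem > prop.sharedMemPerBlock) {
    throw std::runtime_error(
        ctx.describe() + " dynamic shared memory exceeds device limit");
  }

  kernel<<<grid, block, smem, stream>>>(std::forward<Args>(args)...);

  // Post-launch check: surface launch errors with the source context,
  // replacing a separate C10_CUDA_KERNEL_LAUNCH_CHECK() at the call site.
  const cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    throw std::runtime_error(
        ctx.describe() + " launch failed: " + cudaGetErrorString(err));
  }
}

// A macro captures file/line/kernel-name automatically, mirroring the
// single-call-site ergonomics of FBGEMM_LAUNCH_KERNEL shown above.
#define LAUNCH_CHECKED(kernel, grid, block, smem, stream, ...) \
  launch_checked(                                              \
      SourceContext{__FILE__, __LINE__, #kernel},              \
      (grid), (block), (smem), (stream), (kernel), __VA_ARGS__)
```

With a wrapper like this, a usage site collapses to a single call, e.g. `LAUNCH_CHECKED(my_kernel, grid, block, 0, stream, arg0, arg1);`, with the precondition checks, the post-run error check, and the context capture all handled inside.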