Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better kernel launch utilities #3914

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

q10
Copy link
Contributor

@q10 q10 commented Apr 2, 2025

Summary: - Add utilities for doing multiple checks prior to launching kernels

Differential Revision: D72095960

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D72095960

Copy link

netlify bot commented Apr 2, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 44691cf
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67edd087887cc80008c9c252
😎 Deploy Preview https://deploy-preview-3914--pytorch-fbgemm-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@q10 q10 force-pushed the export-D72095960 branch from 2b40dad to 6bb9395 Compare April 2, 2025 08:15
q10 added a commit to q10/FBGEMM that referenced this pull request Apr 2, 2025
Summary:

- Add utilities for doing multiple checks prior to launching kernels

Differential Revision: D72095960
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D72095960

@q10 q10 force-pushed the export-D72095960 branch from 6bb9395 to 615cbe7 Compare April 2, 2025 23:29
q10 added a commit to q10/FBGEMM that referenced this pull request Apr 2, 2025
Summary:

- Add utilities for doing multiple checks prior to launching kernels

Differential Revision: D72095960
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D72095960

Summary:
X-link: facebookresearch/FBGEMM#1009


We have a few things to do as part of the boilerplate ritual of launching GPU kernels:

- Perform host-side precondition checks on arguments, namly tensors and tensor types
- Perform host-side precondition checks on kernel launch parameters, i.e. whether grid size is too large, shared memory is too large, etc
- Build `pta::PackedTensorAccessor` from tensors, and pass context into it
- Perform host-side post-run checks for CUDA run errors.
- Pass source context information around for consistent log and error messages
- Reduce the pain with dealing with incorrect macro invocations

The `TensorAccessorBuilder`, `SourceContext`, and `KernelLauncher` classes, together with their construction macros, cover these aspects and provide the space for future exansion of boilerplate rituals in a low to zero-cost abstraction, all while improving code ergonomics around their usage.

Before:

```
#ifdef FBGEMM_GPU_MEMCHECK
              const auto func_name = "pruned_array_lookup_from_row_idx_kernel";
#endif
              pruned_array_lookup_from_row_idx_kernel<<<
                  nbit::div_round_up(num_indices, kForwardMaxThreads),
                  kForwardMaxThreads,
                  0,
                  at::cuda::getCurrentCUDAStream()>>>(
                  MAKE_PTA_WITH_NAME(
                      func_name, update_row_indices, index_t, 1, 32),
                  MAKE_PTA_WITH_NAME(
                      func_name, update_table_indices, int32_t, 1, 32),
                  MAKE_PTA_WITH_NAME(
                      func_name, index_remappings, remap_t, 1, 32),
                  MAKE_PTA_WITH_NAME(
                      func_name, index_remappings_offsets, int64_t, 1, 32),
                  MAKE_PTA_WITH_NAME(func_name, dense_indices, index_t, 1, 32));
              C10_CUDA_KERNEL_LAUNCH_CHECK();
```

After:
```
              FBGEMM_LAUNCH_KERNEL(
                  (pruned_array_lookup_from_row_idx_kernel<index_t, remap_t>),
                  nbit::div_round_up(num_indices, kForwardMaxThreads),
                  kForwardMaxThreads,
                  PTA_B(update_row_indices, index_t, 1, 32),
                  PTA_B(update_table_indices, int32_t, 1, 32),
                  PTA_B(index_remappings, remap_t, 1, 32),
                  PTA_B(index_remappings_offsets, int64_t, 1, 32),
                  PTA_B(dense_indices, index_t, 1, 32));            
```

Differential Revision: D72095960
@q10 q10 force-pushed the export-D72095960 branch from 615cbe7 to 44691cf Compare April 3, 2025 00:04
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D72095960

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants