[FEAT]: thread coarsening for larger msm sizes #79

Open
btilmon opened this issue May 26, 2023 · 1 comment
btilmon commented May 26, 2023

Description

The current msm.cu implementation launches a monolithic kernel in several places like this:

unsigned NUM_THREADS = 1 << 10;
unsigned NUM_BLOCKS = (total_nof_buckets + NUM_THREADS - 1) / NUM_THREADS;
initialize_buckets_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(buckets, total_nof_buckets);

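This launches one thread per element, so the grid size grows with the input and the kernel assumes the launch can cover every element. A grid-stride for loop removes that assumption: the grid size is fixed independently of the input, and each thread processes several elements, advancing by the total number of threads in the grid on each iteration. Because neighboring threads still touch neighboring elements on every iteration, this can keep memory accesses coalesced on larger inputs. I think this should be one of the easier ways to efficiently enable larger msm sizes.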
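
To make the idea concrete, here is a minimal, self-contained sketch of a grid-stride version of the bucket initialization. It is not the actual msm.cu code: `bucket_t` is a hypothetical stand-in for the real bucket type (curve points in msm.cu), `initialize_buckets_strided` is an illustrative name, and the cap of 32 blocks per SM is just a common heuristic that would need tuning.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the bucket type; in msm.cu buckets are curve points.
struct bucket_t { unsigned value; };

// Grid-stride loop: thread t handles indices t, t + stride, t + 2 * stride, ...
// Threads in a warp touch consecutive elements on every iteration.
__global__ void initialize_buckets_strided(bucket_t* buckets, size_t n) {
  size_t stride = (size_t)blockDim.x * gridDim.x;
  for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    buckets[i].value = 0;
  }
}

#define CHECK(call)                                                     \
  do {                                                                  \
    cudaError_t err_ = (call);                                          \
    if (err_ != cudaSuccess) {                                          \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_));    \
      exit(1);                                                          \
    }                                                                   \
  } while (0)

int main() {
  const size_t n = (size_t)1 << 22;  // number of buckets (arbitrary test size)
  bucket_t* d_buckets = nullptr;
  CHECK(cudaMalloc(&d_buckets, n * sizeof(bucket_t)));
  CHECK(cudaMemset(d_buckets, 0xFF, n * sizeof(bucket_t)));  // fill with non-zero bytes

  int num_sms = 0;
  CHECK(cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0));

  // Grid size is capped by the hardware, not derived from n alone.
  unsigned num_threads = 256;
  size_t needed = (n + num_threads - 1) / num_threads;
  size_t cap = (size_t)32 * num_sms;  // heuristic, an assumption of this sketch
  unsigned num_blocks = (unsigned)(needed < cap ? needed : cap);

  initialize_buckets_strided<<<num_blocks, num_threads>>>(d_buckets, n);
  CHECK(cudaGetLastError());
  CHECK(cudaDeviceSynchronize());

  // Copy back and verify that every bucket was initialized.
  bucket_t* h_buckets = (bucket_t*)malloc(n * sizeof(bucket_t));
  CHECK(cudaMemcpy(h_buckets, d_buckets, n * sizeof(bucket_t), cudaMemcpyDeviceToHost));
  size_t bad = 0;
  for (size_t i = 0; i < n; ++i) bad += (h_buckets[i].value != 0);
  printf("uninitialized buckets: %zu\n", bad);

  free(h_buckets);
  CHECK(cudaFree(d_buckets));
  return bad == 0 ? 0 : 1;
}
```

The same kernel stays correct for any grid size, including a single block, which also makes it easy to debug with a small launch before scaling up.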

I am working on a pull request, but I am getting comfortable with the msm Rust binding for testing first.

Motivation

From the icicle Discord I see "msm for large sizes" is a sprint priority.

btilmon added the type:feature (New feature or request) label on May 26, 2023
@HadarIngonyama
Contributor

Looks cool! I've fixed some bugs in the MSM and will merge the fixes this week. This should enable experimenting with larger sizes and adding optimizations such as this one.
