[FEAT]: thread coarsening for larger msm sizes #79

btilmon · 2023-05-26T06:26:53Z

Description

The current msm.cu implementation launches a monolithic kernel in several places like this:

unsigned NUM_THREADS = 1 << 10;
unsigned NUM_BLOCKS = (total_nof_buckets + NUM_THREADS - 1) / NUM_THREADS;
initialize_buckets_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(buckets, total_nof_buckets);

This launches one thread per element, so the grid size grows with the input. A grid-stride for loop instead lets a fixed-size grid process more elements than it has threads by making each thread do more work. It can also keep memory accesses coalesced on larger inputs, since on every iteration consecutive threads access consecutive memory. I think this should be one of the easier fixes to efficiently enable larger msm sizes.
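To make the proposal concrete, here is a minimal sketch of what a grid-stride version of the bucket initialization could look like. It is not the code from msm.cu: `bucket_t`, `initialize_buckets_grid_stride`, `run_demo`, the zero-initialization, and the choice of 4 blocks per SM are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real bucket type used in msm.cu.
using bucket_t = unsigned;

// Grid-stride loop: thread t handles elements t, t + stride, t + 2*stride, ...
// On each iteration, consecutive threads touch consecutive addresses.
__global__ void initialize_buckets_grid_stride(bucket_t* buckets, size_t n)
{
  size_t stride = (size_t)blockDim.x * gridDim.x;
  for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    buckets[i] = 0; // assumption: "initialize" means zeroing in this sketch
  }
}

// Launches a fixed-size grid over n elements and checks that every element was written.
bool run_demo(size_t n)
{
  bucket_t* d_buckets = nullptr;
  if (cudaMalloc(&d_buckets, n * sizeof(bucket_t)) != cudaSuccess) return false;
  cudaMemset(d_buckets, 0xFF, n * sizeof(bucket_t)); // poison so missed elements show up

  int device = 0, num_sms = 0;
  if (cudaGetDevice(&device) != cudaSuccess ||
      cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device) != cudaSuccess ||
      num_sms <= 0) {
    cudaFree(d_buckets);
    return false;
  }

  // The grid size depends on the device, not on n (4 blocks per SM is an arbitrary choice).
  const unsigned NUM_THREADS = 256;
  const unsigned NUM_BLOCKS = 4 * (unsigned)num_sms;
  initialize_buckets_grid_stride<<<NUM_BLOCKS, NUM_THREADS>>>(d_buckets, n);
  bool ok = (cudaGetLastError() == cudaSuccess);

  std::vector<bucket_t> h_buckets(n);
  // cudaMemcpy waits for the kernel to finish before copying.
  ok = ok && (cudaMemcpy(h_buckets.data(), d_buckets, n * sizeof(bucket_t),
                         cudaMemcpyDeviceToHost) == cudaSuccess);
  cudaFree(d_buckets);
  if (!ok) return false;

  for (bucket_t b : h_buckets) {
    if (b != 0) return false;
  }
  return true;
}

int main()
{
  // Far more elements than launched threads, so each thread loops several times.
  bool ok = run_demo((size_t)1 << 22);
  printf("%s\n", ok ? "all buckets initialized" : "initialization failed");
  return ok ? 0 : 1;
}
```

With this pattern the launch configuration can be tuned per device once and reused for any `total_nof_buckets`, rather than being recomputed from the input size at every call site.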

I am working on a pull request, but I am getting comfortable with the msm Rust binding for testing first.

Motivation

From the icicle Discord I see "msm for large sizes" is a sprint priority.

HadarIngonyama · 2023-05-28T08:06:41Z

Looks cool! I've fixed some bugs in the MSM and will merge the fixes this week. This should enable experimenting with larger sizes and adding optimizations such as this one.

btilmon added the type:feature New feature or request label May 26, 2023

LeonHibnik assigned HadarIngonyama May 28, 2023

LeonHibnik assigned jeremyfelder Sep 4, 2023

Otsar-Raikou added the Next Sprint label Jan 7, 2024

Otsar-Raikou added backlog and removed Next Sprint labels Jan 31, 2024

DmytroTym added area:msm lang:cuda/cpp labels Mar 6, 2024
