
Slow BoomerAMG performance using GPU-aware MPI with Nalu-Wind on Kestrel #1241

wjhorne opened this issue Feb 27, 2025 · 7 comments

wjhorne commented Feb 27, 2025

Hi!

I am opening this issue to track the ongoing effort to figure out why Nalu-Wind struggles when GPU-aware MPI is enabled in HYPRE on the Kestrel cluster. The slowdown is concentrated in a set of slow MPI calls inside BoomerAMG that appear only when GPU-aware MPI is enabled. If I run with pure GMRES in HYPRE and GPU-aware MPI, performance looks good.

It is not clear that HYPRE is doing anything wrong. This may be a CUDA or HPE problem, and people from both parties have been contacted about it. I have attached screenshots of an nsys profile with and without GPU-aware MPI enabled; note the differences in the MPI row.

To narrow this down, I am currently trying CUDA 12.6.2, which might contain a related bug fix.

[nsys profile screenshots: with and without GPU-aware MPI enabled]


wjhorne commented Feb 27, 2025

@neil-lindquist

@victorapm self-assigned this Feb 27, 2025

wjhorne commented Feb 27, 2025

I am sad to report that CUDA 12.6, with all the proper linkages, did not change the behavior. I am pulling fresh profiles and will add the data.

@victorapm

@dreachem tagging for reference


wjhorne commented Feb 27, 2025

[nsys profile screenshot: CUDA 12.6.1 run]

This is a screenshot of the CUDA 12.6.1 data. It is quite similar to what I had from 12.2. I have the full profiles available on Kestrel; each processor's dataset is about 100 MB and can't be uploaded here. Let me know if you would like it and don't have access to Kestrel's /scratch space.


wjhorne commented Feb 27, 2025

Setting MPICH_GPU_IPC_ENABLED=0 allows us to bypass the issue by disabling the slow IPC calls.
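For reference, a minimal sketch of how this could be applied in a job script (the launch line and executable name are placeholders, not our actual Kestrel configuration):

```shell
# Keep GPU-aware MPI enabled in Cray MPICH, but disable the GPU IPC path,
# which is where the slow transfers show up in the nsys profiles.
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_GPU_IPC_ENABLED=0
echo "MPICH_GPU_IPC_ENABLED=$MPICH_GPU_IPC_ENABLED"
# srun -n 8 --gpus-per-node=4 naluX -i input.yaml   # placeholder launch line
```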


wjhorne commented Feb 27, 2025

We would still like to run with IPC enabled, but even as is we are seeing some performance benefits from GPU-aware MPI for our initial test cases.


wjhorne commented Feb 27, 2025

As a further test, I will also try out an Umpire memory pool, based on discussions of other workarounds to the IPC issue that we found.
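For context, HYPRE can route its device allocations through an Umpire pool when it is built with Umpire support. A hedged sketch of the relevant configure flags is below; the paths are placeholders and the exact flag names should be checked against the HYPRE version in use:

```shell
# Build fragment only (assumes a CUDA-enabled HYPRE source tree and an
# existing Umpire install); not runnable as-is.
./configure --with-cuda --with-umpire \
    --with-umpire-include=/path/to/umpire/include \
    --with-umpire-lib-dirs=/path/to/umpire/lib \
    --with-umpire-libs=umpire
```

With such a build, the pool sizes can then be set at runtime through HYPRE's Umpire-related API before solver setup.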
