
Slow BoomerAMG performance using GPU-aware MPI with Nalu-Wind on Kestrel #1241

wjhorne opened this issue Feb 27, 2025 · 7 comments

wjhorne commented Feb 27, 2025

Hi!

I am opening this issue to track the ongoing effort to figure out why Nalu-Wind struggles when GPU-aware MPI is enabled in HYPRE on the Kestrel cluster. The slowdown is concentrated in a set of slow MPI calls inside BoomerAMG that appear only when GPU-aware MPI is enabled. If I run with pure GMRES in HYPRE and GPU-aware MPI, performance looks good.

It is not clear that HYPRE is doing anything wrong. This may be a CUDA or HPE problem, and people from both parties have been contacted about it. I have attached screenshots of an nsys profile with and without GPU-aware MPI enabled; note the differences in the MPI row.

To narrow this down, I am currently trying CUDA 12.6.2, which might contain a related bug fix.

[nsys profile screenshots: with and without GPU-aware MPI enabled]


wjhorne commented Feb 27, 2025

@neil-lindquist

@victorapm self-assigned this Feb 27, 2025

wjhorne commented Feb 27, 2025

I am sad to report that CUDA 12.6, with all the proper linkages, did not change the behavior. I am pulling fresh profiles and will add the data.

@victorapm

@dreachem tagging for reference


wjhorne commented Feb 27, 2025

[nsys profile screenshot: CUDA 12.6.1 run]

This is a screenshot of the CUDA 12.6.1 data. It is quite similar to what I had from 12.2. I have the full profiles available on Kestrel; each processor's dataset is about 100 MB and can't be uploaded here. Let me know if you would like it and don't have access to Kestrel's /scratch space.


wjhorne commented Feb 27, 2025

Setting MPICH_GPU_IPC_ENABLED=0 allows us to bypass the issue by disabling the slow IPC calls.
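For reference, a minimal sketch of how this could be applied in a job script (the launch line and executable name are placeholders, not our actual Kestrel configuration):

```shell
# Keep GPU-aware MPI enabled in Cray MPICH, but disable the GPU IPC path,
# which is where the slow transfers show up in the nsys profiles.
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_GPU_IPC_ENABLED=0
echo "MPICH_GPU_IPC_ENABLED=$MPICH_GPU_IPC_ENABLED"
# srun -n 8 --gpus-per-node=4 naluX -i input.yaml   # placeholder launch line
```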


wjhorne commented Feb 27, 2025

We would still like to run with IPC enabled, but even as is we are seeing some performance benefits from GPU-aware MPI for our initial test cases.


wjhorne commented Feb 27, 2025

As a further test, I will also try out an Umpire memory pool, based on discussions of other workarounds to the IPC issue that we found.
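For context, HYPRE can route its device allocations through an Umpire pool when it is built with Umpire support. A hedged sketch of the relevant configure flags is below; the paths are placeholders and the exact flag names should be checked against the HYPRE version in use:

```shell
# Build fragment only (assumes a CUDA-enabled HYPRE source tree and an
# existing Umpire install); not runnable as-is.
./configure --with-cuda --with-umpire \
    --with-umpire-include=/path/to/umpire/include \
    --with-umpire-lib-dirs=/path/to/umpire/lib \
    --with-umpire-libs=umpire
```

With such a build, the pool sizes can then be set at runtime through HYPRE's Umpire-related API before solver setup.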
