Incorrect results in receive buffer in GPU memory on Aurora #7302
My immediate workaround is to ensure I have a configure option for performing all MPI transfers with a bounce to host memory first.
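For illustration, a minimal sketch of what such a host-bounce path could look like (a hypothetical helper, not the actual Grid code; it assumes SYCL device pointers and a `sycl::queue` for the staging copies):

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <vector>

// Stage a GPU-resident exchange through host memory: device->host copy,
// MPI_Sendrecv on host pointers only, then host->device copy of the result.
void sendrecv_via_host(sycl::queue &q,
                       const void *d_send, void *d_recv, size_t bytes,
                       int dest, int src, MPI_Comm comm)
{
  std::vector<char> h_send(bytes), h_recv(bytes);

  // Bounce the GPU send buffer to host memory first.
  q.memcpy(h_send.data(), d_send, bytes).wait();

  // MPI only ever sees host pointers, so the GPU IPC path is never taken.
  MPI_Sendrecv(h_send.data(), static_cast<int>(bytes), MPI_CHAR, dest, 0,
               h_recv.data(), static_cast<int>(bytes), MPI_CHAR, src,  0,
               comm, MPI_STATUS_IGNORE);

  // Copy the received packet back into the GPU-resident receive buffer.
  q.memcpy(d_recv, h_recv.data(), bytes).wait();
}
```

The cost is two extra copies per exchange, but it takes the GPU IPC path out of the picture entirely, which is what makes it a useful workaround while the underlying issue is tracked down.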
We need to nail down the MPICH configuration for dealing with GPU buffers. Can you also provide the output of an MPI program execution with
Sure! Working on it.
Environment in the wrapper script prior to mpiexec:
Earlier sections, likely irrelevant here:
[last line repeats]
@paboyle Can you share the source code for
The source is on Aurora, under:
Rebuild instructions are at the bottom of the initial post. https://github.com/paboyle/Grid/ The modifications have lots of debug/logging information (I kinda had to tear the code apart tracking this back), such as the CRC computations that got me to the smoking gun. I'm reluctant to commit them, but could make a feature branch and put them onto GitHub if really necessary.
@paboyle A few things to try to help us narrow down the suspects -
I've just done a recursive chmod on the subdirectories to be sure -- you should be able to read them if you are on Aurora.
Will try these.
I was too hasty. There was a failure in one case. Double-checking.
@paboyle Another thing to try -
@paboyle Thanks for the tests. They confirm that the path in question is the GPU IPC path, and it doesn't seem to be related to a stale IPC cache. But to confirm, could you also try -
The CRC for the send buffer does not match the CRC for the receive buffer. Does that mean your datatype has gaps? Are there gaps in the receive buffer?
The data are contiguous buffers. I run GPU kernels to gather/scatter the data into a buffer, as this is typically fastest and simplest. I'll get onto the MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE=generic case shortly.
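As a rough sketch of the pattern being described here (a GPU kernel packs strided sites into a contiguous device buffer, which is then handed straight to the GPU-aware MPI path), with hypothetical names such as `d_field` and `d_indices` that are not Grid's actual identifiers:

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>

// Gather into a contiguous device send buffer, then exchange device pointers
// directly with MPI_Sendrecv (GPU-aware path).
void gather_and_sendrecv(sycl::queue &q,
                         const double *d_field, const int *d_indices,
                         double *d_send, double *d_recv, int n,
                         int dest, int src, MPI_Comm comm)
{
  // Pack: gather the (possibly strided) halo sites into a contiguous
  // device-resident send buffer with a GPU kernel.
  q.parallel_for(sycl::range<1>(static_cast<size_t>(n)),
                 [=](sycl::id<1> i) { d_send[i] = d_field[d_indices[i]]; })
   .wait();

  // Contiguous device buffers go straight to MPI_Sendrecv, so MPICH's
  // GPU-aware path (GPU IPC between intra-node peers) handles the transfer.
  MPI_Sendrecv(d_send, n, MPI_DOUBLE, dest, 0,
               d_recv, n, MPI_DOUBLE, src,  0,
               comm, MPI_STATUS_IGNORE);

  // A matching scatter kernel would unpack d_recv on the receiving side.
}
```

For the MPIR_CVAR_CH4_IPC_GPU_HANDLE_CACHE=generic test, the CVAR would presumably just be exported in the wrapper script's environment ahead of mpiexec, alongside the settings listed earlier.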
I have forensically tracked a correctness issue in a test on Aurora, eventually identifying
a GPU-memory-resident packet received from MPI_Sendrecv as containing incorrect data.
I talked to James Osborn and Xiaoyong Jin, who said to file this here.
The failure is deterministic for a specific problem size, but depends very critically and bizarrely on the precise
history of the code (GPU memory allocation sequences and/or kernel calls) that was used to initialize all arrays.
Presumably something is being remembered and cached somewhere. Deleting certain sections of kernel
calls fills the arrays differently, and then correct results are obtained.
At the point of failure, I
Up to this point, I could also run the same sequence on Frontier with the same problem size and rank geometry.
The key sequence demonstrating the failure is on Aurora in:
/home/paboyle/Pipeline/Grid.save/Grid/cshift/Cshift_mpi.h
Stderr is stored in the files Grid.stderr.$mpirank
From rank zero, the preceding sequence gives device buffers, byte count and the CRCs.
In the first instance below, the CRC of the received GPU data (recv 3398638398) matches that of the received host data (hrecv 3398638398).
However, in the last SendRecv of the sequence shown there is a miscompare:
FAILED MPI crcs xmit 3004459347 recv 1101278579 hrecv 4248251369, and the buffer contents differ
in all entries.
Grid.stderr.0
If I look at ALL 8 ranks, they all failed at the same packet.
If the incorrectly received data in the GPU is corrected, the test runs to completion and then gets the correct answers.
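A rough reconstruction of the kind of CRC cross-check described above (a hypothetical helper, not the actual Grid diagnostic; it uses zlib's crc32 and assumes a SYCL queue for the device-to-host copies):

```cpp
#include <sycl/sycl.hpp>
#include <zlib.h>
#include <cstdio>
#include <vector>

// CRC32 of a device-resident buffer, computed by copying it to host first.
static unsigned long device_crc(sycl::queue &q, const void *d_buf, size_t bytes)
{
  std::vector<unsigned char> host(bytes);
  q.memcpy(host.data(), d_buf, bytes).wait();
  return crc32(0L, host.data(), static_cast<uInt>(bytes));
}

// Cross-check one packet: the GPU receive buffer (recv) should agree with the
// copy of the same packet received via the host path (hrecv). The CRC of the
// local send buffer (xmit) is logged so it can be matched against the peer
// rank's recv CRC in the per-rank Grid.stderr.$mpirank files.
void check_packet(sycl::queue &q,
                  const void *d_xmit, const void *d_recv,
                  const void *h_recv, size_t bytes)
{
  unsigned long cx = device_crc(q, d_xmit, bytes);
  unsigned long cr = device_crc(q, d_recv, bytes);
  unsigned long ch = crc32(0L, static_cast<const Bytef *>(h_recv),
                           static_cast<uInt>(bytes));

  if (cr != ch)
    std::fprintf(stderr, "FAILED MPI crcs xmit %lu recv %lu hrecv %lu\n",
                 cx, cr, ch);
}
```

A mismatch between cr and ch means the two copies of the same incoming packet disagree, which is exactly the signature reported above.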
This is running the standard log modules, and has not enabled pipeline mode - it's the standard vanilla MPICH installed on Aurora.
The software is in:
Binary:
Submission script:
To reconfigure and rebuild, assuming "/home/paboyle/Pipeline/Grid.save" is copied to some new location "GRID":